Improving connection scalability: GetSnapshotData()

Started by Andres Freund almost 6 years ago, 124 messages
#1 Andres Freund
andres@anarazel.de
12 attachment(s)

Hi,

I think postgres' issues with scaling to larger numbers of connections
are a serious problem in the field. While poolers can address some of
that, given the issues around prepared statements, transaction state,
etc, I don't think that's sufficient in many cases. It also adds
latency.

Nor do I think the argument that one shouldn't have more than a few
dozen connections holds much water. As clients have think
time, and database results have to be sent/received (most clients don't
use pipelining), and as many applications have many application servers
with individual connection pools, it's very common to need more
connections than postgres can easily deal with.

The largest reason for that is GetSnapshotData(). It scales poorly to
larger connection counts. Part of that is obviously its O(connections)
nature, but I always thought it had to be more. I've seen production
workloads spending > 98% of the cpu time in GetSnapshotData().

After a lot of analysis and experimentation I figured out that the
primary reason for this is PGXACT->xmin. Even the simplest transaction
modifies MyPgXact->xmin several times during its lifetime (IIRC twice
(snapshot & release) for exec_bind_message(), same for
exec_exec_message(), then again as part of EOXact processing). Which
means that a backend doing GetSnapshotData() on a system with a number
of other connections active, is very likely to hit PGXACT cachelines
that are owned by another cpu / set of cpus / socket. The larger the
system is, the worse the consequences of this are.

This problem is most prominent (and harder to fix) for xmin, but also
exists for the other fields in PGXACT. We rarely have xid, nxids,
overflow, or vacuumFlags set, yet constantly write to them, leading to
cross-node traffic.

The second biggest problem is that the indirection through pgprocnos
that GetSnapshotData() has to do to go through to get each backend's
xmin is very unfriendly for a pipelined CPU (i.e. all that postgres runs
on). There's basically a stall at the end of every loop iteration -
which is exacerbated by there being so many cache misses.

It's fairly easy to avoid unnecessarily dirtying cachelines for all the
PGXACT fields except xmin, because that one actually needs to be visible
to other backends.

While it sounds almost trivial in hindsight, it took me a long while to
grasp a solution to a big part of this problem: We don't actually need
to look at PGXACT->xmin to compute a snapshot. The only reason that
GetSnapshotData() does so, is because it also computes
RecentGlobal[Data]Xmin.

But we don't actually need them all that frequently. They're primarily
used as horizons for heap_page_prune_opt() etc. But for one, while
pruning is really important, it doesn't happen *all* the time. But more
importantly a RecentGlobalXmin from an earlier transaction is actually
sufficient for most pruning requests, especially when there is a larger
percentage of reading than updating transactions (very common).

By having GetSnapshotData() compute an accurate upper bound after which
we are certain not to be able to prune (basically the transaction's
xmin, slots horizons, etc), and a conservative lower bound below which
we are definitely able to prune, we can allow some pruning actions to
happen. If a pruning request (or something similar) encounters an xid
between those, an accurate lower bound can be computed.

That allows us to avoid looking at PGXACT->xmin.
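
To make the two-bound idea concrete, here is a minimal standalone sketch. It is not the patch's code: the names (HorizonState, definitely_removable, maybe_needed, xid_removable, compute_accurate_horizon) are made up for illustration, xids are plain 64-bit integers so wraparound is ignored, and the expensive scan over all backends is replaced by a stored value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 64-bit xid; wraparound is deliberately ignored in this model. */
typedef uint64_t Xid64;

typedef struct HorizonState
{
	Xid64		definitely_removable;	/* xids below this can always be pruned */
	Xid64		maybe_needed;			/* xids at or above this can never be pruned */
	Xid64		simulated_oldest_xmin;	/* what a full scan of all backends would find */
	int			slow_path_calls;		/* how often the accurate horizon was computed */
} HorizonState;

/* Stand-in for the expensive pass over every backend's xmin. */
static Xid64
compute_accurate_horizon(HorizonState *state)
{
	state->slow_path_calls++;
	return state->simulated_oldest_xmin;
}

/* Can a row version whose deleting xid is 'xid' be pruned? */
static bool
xid_removable(HorizonState *state, Xid64 xid)
{
	/* Cheap answers first: both bounds came for free with the snapshot. */
	if (xid < state->definitely_removable)
		return true;
	if (xid >= state->maybe_needed)
		return false;

	/* In between: pay for an accurate horizon and remember it. */
	state->definitely_removable = compute_accurate_horizon(state);
	return xid < state->definitely_removable;
}
```

With bounds of 100 and 200 and a true horizon of 150, tests for xids 50 and 250 never touch the slow path; a test for 120 triggers it once, after which 140 is answered from the refined lower bound.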

To address the second big problem (the indirection), we can instead pack
the contents of PGXACT tightly, just like we do for pgprocnos. In the
attached series, I introduced separate arrays for xids, vacuumFlags,
nsubxids.

The reason for splitting them is that they change at different rates,
and different sizes. In a read-mostly workload, most backends are not
going to have an xid, therefore making the xids array almost
constant. As long as all xids are unassigned, GetSnapshotData() doesn't
need to look at anything else, therefore making it sensible to check the
xid first.
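
A rough sketch of that layout, as a toy model and not the ProcGlobal code from the series. The struct and function names (DenseArrays, dense_add, dense_remove, collect_running_xids) and the fixed MAX_BACKENDS are assumptions for illustration only, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BACKENDS 8
#define INVALID_XID 0u

/* One array per field, so a scan over xids touches only xid cache lines. */
typedef struct DenseArrays
{
	size_t		count;						/* number of in-use entries */
	uint32_t	xids[MAX_BACKENDS];			/* top-level xid or INVALID_XID */
	uint8_t		nsubxids[MAX_BACKENDS];
	uint8_t		vacuum_flags[MAX_BACKENDS];
	int			backend_id[MAX_BACKENDS];	/* owner of each dense slot */
} DenseArrays;

/* Append a new backend at the end; returns its dense offset. */
static size_t
dense_add(DenseArrays *a, int backend_id)
{
	size_t		off = a->count;

	assert(a->count < MAX_BACKENDS);
	a->xids[off] = INVALID_XID;
	a->nsubxids[off] = 0;
	a->vacuum_flags[off] = 0;
	a->backend_id[off] = backend_id;
	a->count++;
	return off;
}

/* Close the gap left by a disconnecting backend, keeping the arrays dense. */
static void
dense_remove(DenseArrays *a, size_t off)
{
	size_t		tail = a->count - off - 1;

	memmove(&a->xids[off], &a->xids[off + 1], tail * sizeof(a->xids[0]));
	memmove(&a->nsubxids[off], &a->nsubxids[off + 1], tail * sizeof(a->nsubxids[0]));
	memmove(&a->vacuum_flags[off], &a->vacuum_flags[off + 1], tail * sizeof(a->vacuum_flags[0]));
	memmove(&a->backend_id[off], &a->backend_id[off + 1], tail * sizeof(a->backend_id[0]));
	a->count--;
}

/* Collect assigned xids; the other arrays are only needed for non-idle slots. */
static size_t
collect_running_xids(const DenseArrays *a, uint32_t *out)
{
	size_t		n = 0;

	for (size_t i = 0; i < a->count; i++)
	{
		uint32_t	xid = a->xids[i];

		if (xid == INVALID_XID)
			continue;			/* common case in a read-mostly workload */
		out[n++] = xid;
	}
	return n;
}
```

In a real system every backend after the removed slot would also need its stored offset adjusted; the model sidesteps that by keeping backend_id next to the data.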

Here are some numbers for the submitted patch series. I had to cull some
further improvements to make it more manageable, but I think the numbers
still are quite convincing.

The workload is a pgbench readonly, with pgbench -M prepared -c $conns
-j $conns -S -n for each client count. This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized.

conns   tps master        tps pgxact-split

1       26842.492845      26524.194821
10      246923.158682     249224.782661
50      695956.539704     709833.746374
100     1054727.043139    1903616.306028
200     964795.282957     1949200.338012
300     906029.377539     1927881.231478
400     845696.690912     1911065.369776
500     812295.222497     1926237.255856
600     888030.104213     1903047.236273
700     866896.532490     1886537.202142
800     863407.341506     1883768.592610
900     871386.608563     1874638.012128
1000    887668.277133     1876402.391502
1500    860051.361395     1815103.564241
2000    890900.098657     1775435.271018
3000    874184.980039     1653953.817997
4000    845023.080703     1582582.316043
5000    817100.195728     1512260.802371

I think these are pretty nice results.

Note that the patchset currently does not implement snapshot_too_old;
the rest of the regression tests do pass.

One further cool recognition of the fact that GetSnapshotData()'s
results can be made to only depend on the set of xids in progress, is
that caching the results of GetSnapshotData() is almost trivial at that
point: We only need to recompute snapshots when a toplevel transaction
commits/aborts.

So we can avoid rebuilding snapshots when no commit has happened since it
was last built. Which amounts to assigning a current 'commit sequence
number' to the snapshot, and checking that against the current number
at the time of the next GetSnapshotData() call. Well, turns out there's
this "LSN" thing we assign to commits (there are some small issues with
that though). I've experimented with that, and it considerably further
improves the numbers above. Both with a higher peak throughput, but more
importantly it almost entirely removes the throughput regression from
2000 connections onwards.

I'm still working on cleaning that part of the patch up, I'll post it in
a bit.
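
As a single-threaded illustration of why a completion counter is enough, here is a small model. It is an assumption-laden sketch, not the csn patch: all names (SharedState, Snapshot, begin_xact, end_xact, get_snapshot, xid_in_progress_per_snapshot) are hypothetical, there is no locking, and it ignores the extra condition that the calling backend has no xid of its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RUNNING 16

typedef struct SharedState
{
	uint32_t	next_xid;
	uint32_t	latest_completed;
	uint64_t	completed_count;	/* bumped on every commit or abort; starts at 1 */
	uint32_t	running[MAX_RUNNING];
	size_t		nrunning;
} SharedState;

typedef struct Snapshot
{
	uint64_t	built_at;			/* completed_count when built; 0 = never built */
	uint32_t	xmax;				/* xids >= xmax count as in progress */
	uint32_t	xip[MAX_RUNNING];	/* in-progress xids below xmax */
	size_t		xcnt;
} Snapshot;

/* Starting a transaction does not bump the counter: its xid is >= every xmax. */
static uint32_t
begin_xact(SharedState *s)
{
	uint32_t	xid = s->next_xid++;

	assert(s->nrunning < MAX_RUNNING);
	s->running[s->nrunning++] = xid;
	return xid;
}

/* Commit or abort: this is the only event that invalidates cached snapshots. */
static void
end_xact(SharedState *s, uint32_t xid)
{
	for (size_t i = 0; i < s->nrunning; i++)
	{
		if (s->running[i] == xid)
		{
			s->running[i] = s->running[--s->nrunning];
			break;
		}
	}
	if (xid > s->latest_completed)
		s->latest_completed = xid;
	s->completed_count++;
}

/* Returns true if the snapshot was rebuilt, false if the cached one was reused. */
static bool
get_snapshot(const SharedState *s, Snapshot *snap)
{
	if (snap->built_at != 0 && snap->built_at == s->completed_count)
		return false;

	snap->xmax = s->latest_completed + 1;
	snap->xcnt = 0;
	for (size_t i = 0; i < s->nrunning; i++)
	{
		if (s->running[i] < snap->xmax)
			snap->xip[snap->xcnt++] = s->running[i];
	}
	snap->built_at = s->completed_count;
	return true;
}

static bool
xid_in_progress_per_snapshot(const Snapshot *snap, uint32_t xid)
{
	if (xid >= snap->xmax)
		return true;
	for (size_t i = 0; i < snap->xcnt; i++)
	{
		if (snap->xip[i] == xid)
			return true;
	}
	return false;
}
```

The point of the model: a cached snapshot taken before begin_xact() still classifies the new xid correctly, because that xid is at or above the cached xmax. Only end_xact() changes what a snapshot would contain.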

The series currently consists of:

0001-0005: Fixes and assert improvements that are independent of the patch, but
are hit by the new code (see also separate thread).

0006: Move delayChkpt from PGXACT to PGPROC, it's rarely checked & frequently modified

0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.

This is the most crucial piece. Instead of code directly using
RecentOldestXmin, there's a new set of functions for testing
whether an xid is visible (InvisibleToEveryoneTestXid() et al).

Those functions use new horizon boundaries computed as part of
GetSnapshotData(), and recompute an accurate boundary when the
tested xid falls in between.

There's a bit more infrastructure needed - we need to limit how
often an accurate snapshot is computed. Probably to once per
snapshot? Or once per transaction?

To avoid issues with the lower boundary getting too old and
presenting a wraparound danger, I made all the xids be
FullTransactionIds. That imo is a good thing?

This patch currently breaks old_snapshot_threshold, as I've not
yet integrated it with the new functions. I think we can make the
old snapshot stuff a lot more precise with this - instead of
always triggering conflicts when a RecentGlobalXmin is too old, we
can do so only in the cases we actually remove a row. I ran out of
energy threading that through the heap_page_prune and
HeapTupleSatisfiesVacuum.

0008: Move PGXACT->xmin back to PGPROC.

Now that GetSnapshotData() doesn't access xmin anymore, we can
make it a normal field in PGPROC again.

0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access

These successively move the remaining PGXACT fields into separate
arrays in ProcGlobal, and adjust GetSnapshotData() to take
advantage. Those arrays are dense in the sense that they only
contain data for PGPROCs that are in use (i.e. when disconnecting,
the array is moved around).

I think xid, and vacuumFlags are pretty reasonable. But need
cleanup, obviously:
- The biggest cleanup would be to add a few helper functions for
accessing the values, rather than open coding that.
- Perhaps we should call the ProcGlobal ones 'cached', and name
the PGPROC ones as the one true source of truth?

For subxid I thought it'd be nice to have nxids and overflow be
only one number. But that probably was the wrong call? Now
TransactionIdInProgress() cannot look at the subxids that did
fit in PGPROC.subxid. I'm not sure that's important, given the
likelihood of misses? But I'd probably still have the subxid
array be one of {uint8 nsubxids; bool overflowed} instead.

To keep the arrays dense they copy the logic for pgprocnos. Which
means that ProcArrayAdd/Remove move things around. Unfortunately
that requires holding both ProcArrayLock and XidGenLock currently
(to avoid GetNewTransactionId() having to hold ProcArrayLock). But
that doesn't seem too bad?

0012: Remove now unused PGXACT.

There's no reason to have it anymore.

The patch series is also available at
https://github.com/anarazel/postgres/tree/pgxact-split

Greetings,

Andres Freund

Attachments:

v1-0001-WIP-Ensure-snapshot-is-registered-within-ScanPgRe.patch.gz (application/x-patch-gzip)
v1-0002-TMP-work-around-missing-snapshot-registrations.patch.gz (application/x-patch-gzip)
v1-0003-TMP-don-t-build-snapshot_too_old-module-it-s-curr.patch.gz (application/x-patch-gzip)
v1-0004-Improve-and-extend-asserts-for-a-snapshot-being-s.patch.gz (application/x-patch-gzip)
v1-0005-Fix-unlikely-xid-wraparound-issue-in-heap_abort_s.patch.gz (application/x-patch-gzip)
v1-0006-Move-delayChkpt-from-PGXACT-to-PGPROC-it-s-rarely.patch.gz (application/x-patch-gzip)
v1-0007-WIP-Introduce-abstraction-layer-for-is-tuple-invi.patch.gz (application/x-patch-gzip)
v1-0008-Move-PGXACT-xmin-back-to-PGPROC.patch.gz (application/x-patch-gzip)
v1-0009-Improve-GetSnapshotData-performance-by-avoiding-i.patch.gz (application/x-patch-gzip)
v1-0010-Improve-GetSnapshotData-performance-by-avoiding-i.patch.gz (application/x-patch-gzip)
v1-0011-Improve-GetSnapshotData-performance-by-avoiding-i.patch.gz (application/x-patch-gzip)
v1-0012-Remove-now-unused-PGXACT.patch.gz (application/x-patch-gzip)
#2 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
1 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-03-01 00:36:01 -0800, Andres Freund wrote:

I think these are pretty nice results.

Attached as a graph as well.

Attachments:

connection-scalability-improvements.png (image/png)
#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#2)
Re: Improving connection scalability: GetSnapshotData()

On Sun, Mar 1, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-03-01 00:36:01 -0800, Andres Freund wrote:

I think these are pretty nice results.

Nice improvement. +1 for improving the scalability for higher connection count.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#4 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
2 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-03-01 00:36:01 -0800, Andres Freund wrote:

Here are some numbers for the submitted patch series. I had to cull some
further improvements to make it more manageable, but I think the numbers
still are quite convincing.

The workload is a pgbench readonly, with pgbench -M prepared -c $conns
-j $conns -S -n for each client count. This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized.

I think these are pretty nice results.

One further cool recognition of the fact that GetSnapshotData()'s
results can be made to only depend on the set of xids in progress, is
that caching the results of GetSnapshotData() is almost trivial at that
point: We only need to recompute snapshots when a toplevel transaction
commits/aborts.

So we can avoid rebuilding snapshots when no commit has happened since it
was last built. Which amounts to assigning a current 'commit sequence
number' to the snapshot, and checking that against the current number
at the time of the next GetSnapshotData() call. Well, turns out there's
this "LSN" thing we assign to commits (there are some small issues with
that though). I've experimented with that, and it considerably further
improves the numbers above. Both with a higher peak throughput, but more
importantly it almost entirely removes the throughput regression from
2000 connections onwards.

I'm still working on cleaning that part of the patch up, I'll post it in
a bit.

I triggered a longer run on the same hardware, that also includes
numbers for the caching patch.

nclients master pgxact-split pgxact-split-cache
1 29742.805074 29086.874404 28120.709885
2 58653.005921 56610.432919 57343.937924
3 116580.383993 115102.94057 117512.656103
4 150821.023662 154130.354635 152053.714824
5 186679.754357 189585.156519 191095.841847
6 219013.756252 223053.409306 224480.026711
7 256861.673892 256709.57311 262427.179555
8 291495.547691 294311.524297 296245.219028
9 332835.641015 333223.666809 335460.280487
10 367883.74842 373562.206447 375682.894433
15 561008.204553 578601.577916 587542.061911
20 748000.911053 794048.140682 810964.700467
25 904581.660543 1037279.089703 1043615.577083
30 999231.007768 1251113.123461 1288276.726489
35 1001274.289847 1438640.653822 1438508.432425
40 991672.445199 1518100.079695 1573310.171868
45 994427.395069 1575758.31948 1649264.339117
50 1017561.371878 1654776.716703 1715762.303282
60 993943.210188 1720318.989894 1789698.632656
70 971379.995255 1729836.303817 1819477.25356
80 966276.137538 1744019.347399 1842248.57152
90 901175.211649 1768907.069263 1847823.970726
100 803175.74326 1784636.397822 1865795.782943
125 664438.039582 1806275.514545 1870983.64688
150 623562.201749 1796229.009658 1876529.428419
175 680683.150597 1809321.487338 1910694.40987
200 668413.988251 1833457.942035 1878391.674828
225 682786.299485 1816577.462613 1884587.77743
250 727308.562076 1825796.324814 1864692.025853
275 676295.999761 1843098.107926 1908698.584573
300 698831.398432 1832068.168744 1892735.290045
400 661534.639489 1859641.983234 1898606.247281
500 645149.788352 1851124.475202 1888589.134422
600 740636.323211 1875152.669115 1880653.747185
700 858645.363292 1833527.505826 1874627.969414
800 858287.957814 1841914.668668 1892106.319085
900 882204.933544 1850998.221969 1868260.041595
1000 910988.551206 1836336.091652 1862945.18557
1500 917727.92827 1808822.338465 1864150.00307
2000 982137.053108 1813070.209217 1877104.342864
3000 1013514.639108 1753026.733843 1870416.924248
4000 1025476.80688 1600598.543635 1859908.314496
5000 1019889.160511 1534501.389169 1870132.571895
7500 968558.864242 1352137.828569 1853825.376742
10000 887558.112017 1198321.352461 1867384.381886
15000 687766.593628 950788.434914 1710509.977169

The odd dip for master between 90 and 700 connections looks like it's
not directly related to GetSnapshotData(). It looks like it's related to
the linux scheduler and virtualization. When a pgbench thread and
postgres backend need to swap who gets executed, and both are on
different CPUs, the wakeup is more expensive when the target CPU is idle
or isn't going to reschedule soon. In the expensive path an
inter-processor interrupt (IPI) gets triggered, which requires exiting
the VM (which is really expensive on azure, apparently). I can
trigger similar behaviour for the other runs by renicing, albeit on a
slightly smaller scale.

I'll try to find a larger system that's not virtualized :/.

Greetings,

Andres Freund

Attachments:

connection-scalability-improvements-2.png (image/png)
csn.diff (text/x-diff; charset=us-ascii)
diff --git i/src/include/access/transam.h w/src/include/access/transam.h
index e37ecb52fa1..2b46d2b76a0 100644
--- i/src/include/access/transam.h
+++ w/src/include/access/transam.h
@@ -183,6 +183,8 @@ typedef struct VariableCacheData
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
 
+	uint64 csn;
+
 	/*
 	 * These fields are protected by CLogTruncationLock
 	 */
diff --git i/src/include/utils/snapshot.h w/src/include/utils/snapshot.h
index 2bc415376ac..389f18cf8a5 100644
--- i/src/include/utils/snapshot.h
+++ w/src/include/utils/snapshot.h
@@ -207,6 +207,8 @@ typedef struct SnapshotData
 
 	TimestampTz whenTaken;		/* timestamp when snapshot was taken */
 	XLogRecPtr	lsn;			/* position in the WAL stream when taken */
+
+	uint64		csn;
 } SnapshotData;
 
 #endif							/* SNAPSHOT_H */
diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 617853f56fc..520a79c9f73 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -65,6 +65,7 @@
 #include "utils/snapmgr.h"
 
 #define UINT32_ACCESS_ONCE(var)		 ((uint32)(*((volatile uint32 *)&(var))))
+#define UINT64_ACCESS_ONCE(var)		 ((uint64)(*((volatile uint64 *)&(var))))
 
 /* Our shared memory area */
 typedef struct ProcArrayStruct
@@ -266,6 +267,7 @@ CreateSharedProcArray(void)
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+		ShmemVariableCache->csn = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
@@ -404,6 +406,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
 								  latestXid))
 			ShmemVariableCache->latestCompletedXid = latestXid;
+		/* Same with CSN */
+		ShmemVariableCache->csn++;
 
 		ProcGlobal->xids[proc->pgxactoff] = 0;
 		ProcGlobal->nsubxids[proc->pgxactoff] = 0;
@@ -531,6 +535,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
+	Assert(LWLockHeldByMe(ProcArrayLock));
 	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
 	Assert(ProcGlobal->xids[pgxactoff] == proc->xidCopy);
 
@@ -561,6 +566,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
 							  latestXid))
 		ShmemVariableCache->latestCompletedXid = latestXid;
+
+	/* Same with CSN */
+	ShmemVariableCache->csn++;
 }
 
 /*
@@ -1637,7 +1645,7 @@ GetSnapshotData(Snapshot snapshot)
 	TransactionId oldestxid;
 	int			mypgxactoff;
 	TransactionId myxid;
-
+	uint64		csn;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1675,6 +1683,26 @@ GetSnapshotData(Snapshot snapshot)
 					 errmsg("out of memory")));
 	}
 
+#if 1
+	if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+		UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)
+	{
+		if (!TransactionIdIsValid(MyProc->xmin))
+			MyProc->xmin = TransactionXmin = snapshot->xmin;
+		RecentXmin = snapshot->xmin;
+		Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
+
+		snapshot->curcid = GetCurrentCommandId(false);
+		snapshot->active_count = 0;
+		snapshot->regd_count = 0;
+		snapshot->copied = false;
+		snapshot->lsn = InvalidXLogRecPtr;
+		snapshot->whenTaken = 0;
+
+		return snapshot;
+	}
+#endif
+
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
 	 * going to set MyProc->xmin.
@@ -1687,6 +1715,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	nextfxid = ShmemVariableCache->nextFullXid;
 	oldestxid = ShmemVariableCache->oldestXid;
+	csn = ShmemVariableCache->csn;
 
 	/* xmax is always latestCompletedXid + 1 */
 	xmax = ShmemVariableCache->latestCompletedXid;
@@ -1941,6 +1970,8 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
 
+	snapshot->csn = csn;
+
 	if (old_snapshot_threshold < 0)
 	{
 		/*
diff --git i/src/backend/utils/time/snapmgr.c w/src/backend/utils/time/snapmgr.c
index 41914f1a6c7..9d8ebe51307 100644
--- i/src/backend/utils/time/snapmgr.c
+++ w/src/backend/utils/time/snapmgr.c
@@ -604,6 +604,8 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
 	/* NB: curcid should NOT be copied, it's a local matter */
 
+	CurrentSnapshot->csn = 0;
+
 	/*
 	 * Now we have to fix what GetSnapshotData did with MyPgXact->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
@@ -679,6 +681,7 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->csn = 0;
 
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
@@ -2180,6 +2183,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->csn = 0;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
#5 David Steele
david@pgmasters.net
In reply to: Andres Freund (#1)
Re: Improving connection scalability: GetSnapshotData()

On 3/1/20 3:36 AM, Andres Freund wrote:

I think these are pretty nice results.

Indeed they are.

Is the target version PG13 or PG14? It seems like a pretty big patch to
go in the last commitfest for PG13.

Regards,
--
-David
david@pgmasters.net

#6 David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#1)
Re: Improving connection scalability: GetSnapshotData()

Hi,

Nice performance gains.

On Sun, 1 Mar 2020 at 21:36, Andres Freund <andres@anarazel.de> wrote:

The series currently consists of:

0001-0005: Fixes and assert improvements that are independent of the patch, but
are hit by the new code (see also separate thread).

0006: Move delayChkpt from PGXACT to PGPROC, it's rarely checked & frequently modified

0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.

This is the most crucial piece. Instead of code directly using
RecentOldestXmin, there's a new set of functions for testing
whether an xid is visible (InvisibleToEveryoneTestXid() et al).

Those functions use new horizon boundaries computed as part of
GetSnapshotData(), and recompute an accurate boundary when the
tested xid falls in between.

There's a bit more infrastructure needed - we need to limit how
often an accurate snapshot is computed. Probably to once per
snapshot? Or once per transaction?

To avoid issues with the lower boundary getting too old and
presenting a wraparound danger, I made all the xids be
FullTransactionIds. That imo is a good thing?

This patch currently breaks old_snapshot_threshold, as I've not
yet integrated it with the new functions. I think we can make the
old snapshot stuff a lot more precise with this - instead of
always triggering conflicts when a RecentGlobalXmin is too old, we
can do so only in the cases we actually remove a row. I ran out of
energy threading that through the heap_page_prune and
HeapTupleSatisfiesVacuum.

0008: Move PGXACT->xmin back to PGPROC.

Now that GetSnapshotData() doesn't access xmin anymore, we can
make it a normal field in PGPROC again.

0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.

I've only looked at 0001-0009 so far. I'm not quite the expert in this
area, so the review feels a bit superficial. Here's what I noted down
during my pass.

0001

1. cant't -> can't

* snapshot cant't change in the midst of a relcache build, so there's no

0002

2. I don't quite understand your change in
UpdateSubscriptionRelState(). snap seems unused. Drilling down into
SearchSysCacheCopy2, in SearchCatCacheMiss() the systable_beginscan()
passes a NULL snapshot.

The whole patch does this. I guess I don't understand why 0002 does this.

0004

3. This comment seems to have the line order swapped in bt_check_every_level

/*
* RecentGlobalXmin/B-Tree page deletion.
* This assertion matches the one in index_getnext_tid(). See note on
*/
Assert(SnapshotSet());

0006

4. Did you consider the location of 'delayChkpt' in PGPROC? Couldn't
you slot it in somewhere it would fit in existing padding?

0007

5. GinPageIsRecyclable() has no comments at all. I know that
ginvacuum.c is not exactly the model citizen for function header
comments, but likely this patch is no good reason to continue the
trend.

6. The comment rearrangement in bt_check_every_level should be in the
0004 patch.

7. struct InvisibleToEveryoneState could do with some comments
explaining the fields.

8. The header comment in GetOldestXminInt needs to be updated. It
talks about "if rel = NULL and there are no transactions", but there's
no parameter by that name now. Maybe the whole comment should be moved
down to the external implementation of the function

9. I get the idea you don't intend to keep the debug message in
InvisibleToEveryoneTestFullXid(), but if you do, then shouldn't it be
using UINT64_FORMAT?

10. teh -> the

* which is based on teh value computed when getting the current snapshot.

11. InvisibleToEveryoneCheckXid and InvisibleToEveryoneCheckFullXid
seem to have their extern modifiers in the .c file.

0009

12. iare -> are

* These iare separate from the main PGPROC array so that the most heavily

13. is -> are

* accessed data is stored contiguously in memory in as few cache lines as

14. It doesn't seem to quite make sense to talk about "this proc" in:

/*
* TransactionId of top-level transaction currently being executed by this
* proc, if running and XID is assigned; else InvalidTransactionId.
*
* Each PGPROC has a copy of its value in PGPROC.xidCopy.
*/
TransactionId *xids;

maybe "this" can be replaced with "each"

I will try to continue with the remaining patches soon. However, it
would be good to get a more complete patchset. I feel there are quite
a few XXX comments remaining for things you need to think about later,
and ... it's getting late.

#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#1)
Re: Improving connection scalability: GetSnapshotData()

On Sun, Mar 1, 2020 at 9:36 PM Andres Freund <andres@anarazel.de> wrote:

conns tps master tps pgxact-split

1 26842.492845 26524.194821
10 246923.158682 249224.782661
50 695956.539704 709833.746374
100 1054727.043139 1903616.306028
200 964795.282957 1949200.338012
300 906029.377539 1927881.231478
400 845696.690912 1911065.369776
500 812295.222497 1926237.255856
600 888030.104213 1903047.236273
700 866896.532490 1886537.202142
800 863407.341506 1883768.592610
900 871386.608563 1874638.012128
1000 887668.277133 1876402.391502
1500 860051.361395 1815103.564241
2000 890900.098657 1775435.271018
3000 874184.980039 1653953.817997
4000 845023.080703 1582582.316043
5000 817100.195728 1512260.802371

This will clearly be really big news for lots of PostgreSQL users.

One further cool recognition of the fact that GetSnapshotData()'s
results can be made to only depend on the set of xids in progress, is
that caching the results of GetSnapshotData() is almost trivial at that
point: We only need to recompute snapshots when a toplevel transaction
commits/aborts.

So we can avoid rebuilding snapshots when no commit has happened since it
was last built. Which amounts to assigning a current 'commit sequence
number' to the snapshot, and checking that against the current number
at the time of the next GetSnapshotData() call. Well, turns out there's
this "LSN" thing we assign to commits (there are some small issues with
that though). I've experimented with that, and it considerably further
improves the numbers above. Both with a higher peak throughput, but more
importantly it almost entirely removes the throughput regression from
2000 connections onwards.

I'm still working on cleaning that part of the patch up, I'll post it in
a bit.

I looked at that part on your public pgxact-split branch. In that
version you used "CSN" rather than something based on LSNs, which I
assume avoids complications relating to WAL locking or something like
that. We should probably be careful to avoid confusion with the
pre-existing use of the term "commit sequence number" (CommitSeqNo,
CSN) that appears in predicate.c. This also calls to mind the
2013-2016 work by Ants Aasma and others[1] on CSN-based snapshots,
which is obviously a much more radical change, but really means what
it says (commits). The CSN in your patch set is used purely as a
level-change for snapshot cache invalidation IIUC, and it advances
also for aborts -- so maybe it should be called something like
completed_xact_count, using existing terminology from procarray.c.

+       if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+               UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)

Why is it OK to read ShmemVariableCache->csn without at least a read
barrier? I suppose this allows a cached snapshot to be used very soon
after a transaction commits and should be visible to you, but ...
hmmmrkwjherkjhg... I guess it must be really hard to observe any
anomaly. Let's see... maybe it's possible on a relaxed memory system
like POWER or ARM, if you use a shm flag to say "hey I just committed
a transaction", and the other guy sees the flag but can't yet see the
new CSN, so an SPI query can't see the transaction?

Another theoretical problem is the non-atomic read of a uint64 on some
32 bit platforms.
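
For illustration only, this is roughly how both concerns look when written with standard C11 atomics. It is a sketch under assumptions: PostgreSQL has its own atomics layer in port/atomics.h which a real patch would use instead, and the names here (completed_count, note_transaction_completed, cached_snapshot_still_valid) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Shared completion counter; an _Atomic 64-bit load cannot be observed torn. */
static _Atomic uint64_t completed_count = 1;

/* Called when a top-level transaction commits or aborts. */
static void
note_transaction_completed(void)
{
	/* Release: earlier writes by this thread are visible to an acquire reader. */
	atomic_fetch_add_explicit(&completed_count, 1, memory_order_release);
}

/* Is a snapshot built at counter value 'built_at' still reusable? */
static bool
cached_snapshot_still_valid(uint64_t built_at)
{
	/* Acquire: pairs with the release increment above. */
	uint64_t	now = atomic_load_explicit(&completed_count,
										   memory_order_acquire);

	return built_at != 0 && built_at == now;
}
```

This only shows the shape of an explicitly atomic, ordered read. Whether it is enough for the flag scenario above depends on how the flag itself is published, which the sketch does not model.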

0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.

This is the most crucial piece. Instead of code directly using
RecentOldestXmin, there's a new set of functions for testing
whether an xid is visible (InvisibleToEveryoneTestXid() et al).

Those functions use new horizon boundaries computed as part of
GetSnapshotData(), and recompute an accurate boundary when the
tested xid falls in between.

There's a bit more infrastructure needed - we need to limit how
often an accurate snapshot is computed. Probably to once per
snapshot? Or once per transaction?
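For readers following along, the two-boundary test described above can be sketched in standalone C. This is only an illustration under stated assumptions, not the code from the patch: the struct, its field names, and the actual_horizon field (a stand-in for whatever an accurate, expensive computation would return) are all made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for FullTransactionId. */
typedef uint64_t FullXid;

/* Hypothetical state; none of these names come from the patch. */
typedef struct HorizonState
{
    FullXid removable_below; /* xids below this are invisible to everyone */
    FullXid needed_from;     /* xids at or above this may still be needed */
    FullXid actual_horizon;  /* stand-in for the result of an accurate scan */
    int     recomputes;      /* number of accurate computations performed */
} HorizonState;

bool
invisible_to_everyone(HorizonState *st, FullXid xid)
{
    assert(st->removable_below <= st->needed_from);

    /* Cheap answers: the xid is clearly on one side of the fuzzy range. */
    if (xid < st->removable_below)
        return true;
    if (xid >= st->needed_from)
        return false;

    /*
     * The xid falls between the boundaries, so pay for an accurate horizon
     * once and collapse both boundaries onto it. Later tests are then
     * answered without recomputing.
     */
    st->recomputes++;
    st->removable_below = st->actual_horizon;
    st->needed_from = st->actual_horizon;
    return xid < st->actual_horizon;
}
```

Collapsing both boundaries after one recomputation is one possible answer to the "how often" question; it trades a possibly stale upper boundary (which only makes pruning less aggressive) for fewer accurate computations.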

To avoid issues with the lower boundary getting too old and
presenting a wraparound danger, I made all the xids be
FullTransactionIds. That imo is a good thing?

+1, as long as we don't just move the wraparound danger to the places
where we convert xids to fxids!

+/*
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that, at the time of the call, xid is within 2
+ * billion xids of rel. That e.g. can be guaranteed if the caller assures
+ * a snapshot is held by the backend, and xid is from a table (where
+ * vacuum/freezing ensures the xid has to be within that range).
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+       uint32 rel_epoch = EpochFromFullTransactionId(rel);
+       TransactionId rel_xid = XidFromFullTransactionId(rel);
+       uint32 epoch;
+
+       /*
+        * TODO: A function to easily write an assertion ensuring that xid is
+        * between [oldestXid, nextFullXid) would be useful here, and in plenty
+        * other places.
+        */
+
+       if (xid > rel_xid)
+               epoch = rel_epoch - 1;
+       else
+               epoch = rel_epoch;
+
+       return FullTransactionIdFromEpochAndXid(epoch, xid);
+}

I hate it, but I don't have a better concrete suggestion right now.
Whatever I come up with amounts to the same thing on some level,
though I feel like it might be better to use an infrequently updated
oldestFxid as the lower bound in a conversion. An upper bound would
also seem better, though requires much trickier interlocking. What
you have means "it's near here!"... isn't that too prone to bugs that
are hidden because of the ambient fuzziness? A lower bound seems like
it could move extremely infrequently and therefore it'd be OK for it
to be protected by both proc array and xid gen locks (ie it'd be
recomputed when nextFxid needs to move too far ahead of it, so every
~2 billion xacts). I haven't looked at this long enough to have a
strong opinion, though.
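To make the "it's near here" conversion concrete, here is a standalone sketch of the same epoch arithmetic as the quoted FullXidViaRelative(), with plain integers in place of the PostgreSQL types. The type names and the added assertion are assumptions for illustration; the assertion marks the one spot where the wraparound danger visibly moves into the conversion (an xid that looks "ahead" of rel while rel is still in epoch 0).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for FullTransactionId and TransactionId. */
typedef uint64_t FullXid;
typedef uint32_t Xid;

FullXid
full_xid_via_relative(FullXid rel, Xid xid)
{
    uint32_t rel_epoch = (uint32_t) (rel >> 32);
    Xid      rel_xid = (Xid) rel;   /* low 32 bits of rel */
    uint32_t epoch;

    /* An xid "ahead" of rel must belong to the previous epoch. */
    assert(rel_epoch > 0 || xid <= rel_xid);

    if (xid > rel_xid)
        epoch = rel_epoch - 1;
    else
        epoch = rel_epoch;

    return ((uint64_t) epoch << 32) | xid;
}
```

With rel at epoch 5, xid 100, an input of 50 stays in epoch 5, while an input of 4000000000 is taken to be from epoch 4. Nothing in the function can tell whether that guess is right, which is the fuzziness being discussed.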

On a more constructive note:

GetOldestXminInt() does:

LWLockAcquire(ProcArrayLock, LW_SHARED);

+       nextfxid = ShmemVariableCache->nextFullXid;
+
...
        LWLockRelease(ProcArrayLock);
...
+       return FullXidViaRelative(nextfxid, result);

But nextFullXid is protected by XidGenLock; maybe that's OK from a
data freshness point of view (I'm not sure), but from an atomicity
point of view, you can't do that can you?

This patch currently breaks old_snapshot_threshold, as I've not
yet integrated it with the new functions. I think we can make the
old snapshot stuff a lot more precise with this - instead of
always triggering conflicts when a RecentGlobalXmin is too old, we
can do so only in the cases we actually remove a row. I ran out of
energy threading that through the heap_page_prune and
HeapTupleSatisfiesVacuum.

CCing Kevin as an FYI.

0008: Move PGXACT->xmin back to PGPROC.

Now that GetSnapshotData() doesn't access xmin anymore, we can
make it a normal field in PGPROC again.

0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access

These successively move the remaining PGXACT fields into separate
arrays in ProcGlobal, and adjust GetSnapshotData() to take
advantage. Those arrays are dense in the sense that they only
contain data for PGPROCs that are in use (i.e. when disconnecting,
the array is moved around).

I think xid, and vacuumFlags are pretty reasonable. But need
cleanup, obviously:
- The biggest cleanup would be to add a few helper functions for
accessing the values, rather than open coding that.
- Perhaps we should call the ProcGlobal ones 'cached', and name
the PGPROC ones as the one true source of truth?

For subxid I thought it'd be nice to have nxids and overflow be
only one number. But that probably was the wrong call? Now
TransactionIdInProgress() cannot look at the subxids that did
fit in PGPROC.subxid. I'm not sure that's important, given the
likelihood of misses? But I'd probably still have the subxid
array be one of {uint8 nsubxids; bool overflowed} instead.

To keep the arrays dense they copy the logic for pgprocnos. Which
means that ProcArrayAdd/Remove move things around. Unfortunately
that requires holding both ProcArrayLock and XidGenLock currently
(to avoid GetNewTransactionId() having to hold ProcArrayLock). But
that doesn't seem too bad?
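The "dense arrays that get moved around" idea can be sketched as two parallel arrays that are compacted on removal. This is a toy under stated assumptions: the struct and function names are hypothetical, new entries are simply appended (the ordering the real pgprocnos logic maintains is not modeled), and all locking is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROCS 8             /* toy capacity, hypothetical */

/* Hypothetical dense, parallel arrays: slot i describes one live backend. */
typedef struct DenseProcs
{
    int      num_procs;
    int      pgprocnos[MAX_PROCS];  /* slot -> proc number */
    uint32_t xids[MAX_PROCS];       /* slot -> xid, same slot order */
} DenseProcs;

/* Append an entry; returns the slot used, or -1 if there is no room. */
int
dense_add(DenseProcs *d, int procno, uint32_t xid)
{
    assert(d->num_procs >= 0 && d->num_procs <= MAX_PROCS);
    if (d->num_procs >= MAX_PROCS)
        return -1;
    d->pgprocnos[d->num_procs] = procno;
    d->xids[d->num_procs] = xid;
    return d->num_procs++;
}

/* Remove an entry and shift the tail down so no hole remains. */
int
dense_remove(DenseProcs *d, int procno)
{
    for (int i = 0; i < d->num_procs; i++)
    {
        size_t tail;

        if (d->pgprocnos[i] != procno)
            continue;
        tail = (size_t) (d->num_procs - i - 1);
        memmove(&d->pgprocnos[i], &d->pgprocnos[i + 1],
                tail * sizeof(d->pgprocnos[0]));
        memmove(&d->xids[i], &d->xids[i + 1], tail * sizeof(d->xids[0]));
        d->num_procs--;
        return 0;
    }
    return -1;                  /* not found */
}
```

A scan over xids[0 .. num_procs) then touches only contiguous, in-use entries. The cost is that every add or remove rewrites slots that concurrent readers may be looking at, which is why the text above needs both locks held.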

In the places where you now acquire both, I guess you also need to
release both in the error path?

[1]: /messages/by-id/CA+CSw_tEpJ=md1zgxPkjH6CWDnTDft4gBi=+P9SnoC+Wy3pKdA@mail.gmail.com

#8Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#7)
Re: Improving connection scalability: GetSnapshotData()

Hi,

Thanks for looking!

On 2020-03-20 18:23:03 +1300, Thomas Munro wrote:

On Sun, Mar 1, 2020 at 9:36 PM Andres Freund <andres@anarazel.de> wrote:

I'm still working on cleaning that part of the patch up, I'll post it in
a bit.

I looked at that part on your public pgxact-split branch. In that
version you used "CSN" rather than something based on LSNs, which I
assume avoids complications relating to WAL locking or something like
that.

Right, I first tried to use LSNs, but after further tinkering found that
it's too hard to address the difference between visibility order and LSN
order. I don't think there's an easy way to address the difference.

We should probably be careful to avoid confusion with the
pre-existing use of the term "commit sequence number" (CommitSeqNo,
CSN) that appears in predicate.c.

I looked at that after you mentioned it on IM. But I find it hard to
grok what it's precisely defined as. There are basically no comments
explaining what it's really supposed to do, and I find the relevant code
far from easy to grok :(.

This also calls to mind the 2013-2016 work by Ants Aasma and others[1]
on CSN-based snapshots, which is obviously a much more radical change,
but really means what it says (commits).

Well, I think you could actually build some form of more dense snapshots
on top of "my" CSN, with a bit of effort (and a lot of handwaving). I don't
think they're that different concepts.

The CSN in your patch set is used purely as a level-change for
snapshot cache invalidation IIUC, and it advances also for aborts --
so maybe it should be called something like completed_xact_count,
using existing terminology from procarray.c.

I expect it to be used outside of snapshots too, in the future, FWIW.

completed_xact_count sounds good to me.

+       if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+               UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)

Why is it OK to read ShmemVariableCache->csn without at least a read
barrier? I suppose this allows a cached snapshot to be used very soon
after a transaction commits and should be visible to you, but ...
hmmmrkwjherkjhg... I guess it must be really hard to observe any
anomaly. Let's see... maybe it's possible on a relaxed memory system
like POWER or ARM, if you use a shm flag to say "hey I just committed
a transaction", and the other guy sees the flag but can't yet see the
new CSN, so an SPI query can't see the transaction?

Yea, it does need more thought / comments. I can't really see an actual
correctness violation though. As far as I can tell you'd never be able
to get an "older" ShmemVariableCache->csn than one from *after* the
last lock acquired/released by the current backend - which then also
means a different "ordering" would have been possible allowing the
current backend to take the snapshot earlier.

Another theoretical problem is the non-atomic read of a uint64 on some
32 bit platforms.

Yea, it probably should be a pg_atomic_uint64 to address that. I don't
think it really would cause problems, because I think it'd always end up
causing an unnecessary snapshot build. But there's no need to go there.
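As a rough illustration of the atomic-counter idea, here is a standalone sketch that uses C11 <stdatomic.h> in place of PostgreSQL's pg_atomic_uint64 (a deliberate substitution so the example compiles on its own). The variable and function names are hypothetical, and the default sequentially consistent loads used here are stronger than anything discussed above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared counter, bumped on every commit or abort. */
static _Atomic uint64_t completed_xact_count = 1;

/* Hypothetical cached snapshot; 0 means "never built". */
typedef struct CachedSnapshot
{
    uint64_t built_at_count;
} CachedSnapshot;

void
note_xact_completed(void)
{
    atomic_fetch_add(&completed_xact_count, 1);
}

void
build_snapshot(CachedSnapshot *snap)
{
    /* A real implementation would also copy the running xids here. */
    snap->built_at_count = atomic_load(&completed_xact_count);
    assert(snap->built_at_count != 0);
}

bool
snapshot_reusable(const CachedSnapshot *snap)
{
    /* The 64-bit load is atomic, so a torn value is never observed. */
    return snap->built_at_count != 0 &&
        atomic_load(&completed_xact_count) == snap->built_at_count;
}
```

A stale or mismatched read can only make the comparison fail, which costs an extra snapshot build rather than returning a wrong snapshot.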

0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.

This is the most crucial piece. Instead of code directly using
RecentOldestXmin, there's a new set of functions for testing
whether an xid is visible (InvisibleToEveryoneTestXid() et al).

Those functions use new horizon boundaries computed as part of
GetSnapshotData(), and recompute an accurate boundary when the
tested xid falls in between.

There's a bit more infrastructure needed - we need to limit how
often an accurate snapshot is computed. Probably to once per
snapshot? Or once per transaction?

To avoid issues with the lower boundary getting too old and
presenting a wraparound danger, I made all the xids be
FullTransactionIds. That imo is a good thing?

+1, as long as we don't just move the wraparound danger to the places
where we convert xids to fxids!

+/*
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that, at the time of the call, xid is within 2
+ * billion xids of rel. That e.g. can be guaranteed if the caller assures
+ * a snapshot is held by the backend, and xid is from a table (where
+ * vacuum/freezing ensures the xid has to be within that range).
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+       uint32 rel_epoch = EpochFromFullTransactionId(rel);
+       TransactionId rel_xid = XidFromFullTransactionId(rel);
+       uint32 epoch;
+
+       /*
+        * TODO: A function to easily write an assertion ensuring that xid is
+        * between [oldestXid, nextFullXid) would be useful here, and in plenty
+        * other places.
+        */
+
+       if (xid > rel_xid)
+               epoch = rel_epoch - 1;
+       else
+               epoch = rel_epoch;
+
+       return FullTransactionIdFromEpochAndXid(epoch, xid);
+}

I hate it, but I don't have a better concrete suggestion right now.
Whatever I come up with amounts to the same thing on some level,
though I feel like it might be better to use an infrequently updated
oldestFxid as the lower bound in a conversion.

I am not sure it's as clearly correct to use oldestFxid in as many
cases. Normally PGPROC->xmin (PGXACT->xmin currently) should prevent the
"system wide" xid horizon from moving too far relative to that, but I think
there are more plausible problems with the "oldest" xid horizon moving
concurrently with a backend inspecting values.

It shouldn't be a problem here since the values are taken under a lock
preventing both from being moved I think, and since we're only comparing
those two values without taking anything else into account, the "global"
horizon changing concurrently wouldn't matter.

But it seems easier to understand the correctness when comparing to
nextXid?

What's the benefit of looking at an "infrequently updated" value
instead? I guess you can argue that it'd be more likely to be in cache,
but since all of this lives in a single cacheline...

An upper bound would also seem better, though requires much trickier
interlocking. What you have means "it's near here!"... isn't that too
prone to bugs that are hidden because of the ambient fuzziness?

I can't follow the last sentence. Could you expand?

On a more constructive note:

GetOldestXminInt() does:

LWLockAcquire(ProcArrayLock, LW_SHARED);

+       nextfxid = ShmemVariableCache->nextFullXid;
+
...
LWLockRelease(ProcArrayLock);
...
+       return FullXidViaRelative(nextfxid, result);

But nextFullXid is protected by XidGenLock; maybe that's OK from a
data freshness point of view (I'm not sure), but from an atomicity
point of view, you can't do that can you?

Hm. Yea, I think it's not safe against torn 64bit reads, you're right.

This patch currently breaks old_snapshot_threshold, as I've not
yet integrated it with the new functions. I think we can make the
old snapshot stuff a lot more precise with this - instead of
always triggering conflicts when a RecentGlobalXmin is too old, we
can do so only in the cases we actually remove a row. I ran out of
energy threading that through the heap_page_prune and
HeapTupleSatisfiesVacuum.

CCing Kevin as an FYI.

If anybody has an opinion on this sketch I'd be interested. I've started
to implement it, so ...

0008: Move PGXACT->xmin back to PGPROC.

Now that GetSnapshotData() doesn't access xmin anymore, we can
make it a normal field in PGPROC again.

0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access

These successively move the remaining PGXACT fields into separate
arrays in ProcGlobal, and adjust GetSnapshotData() to take
advantage. Those arrays are dense in the sense that they only
contain data for PGPROCs that are in use (i.e. when disconnecting,
the array is moved around).

I think xid, and vacuumFlags are pretty reasonable. But need
cleanup, obviously:
- The biggest cleanup would be to add a few helper functions for
accessing the values, rather than open coding that.
- Perhaps we should call the ProcGlobal ones 'cached', and name
the PGPROC ones as the one true source of truth?

For subxid I thought it'd be nice to have nxids and overflow be
only one number. But that probably was the wrong call? Now
TransactionIdInProgress() cannot look at the subxids that did
fit in PGPROC.subxid. I'm not sure that's important, given the
likelihood of misses? But I'd probably still have the subxid
array be one of {uint8 nsubxids; bool overflowed} instead.

To keep the arrays dense they copy the logic for pgprocnos. Which
means that ProcArrayAdd/Remove move things around. Unfortunately
that requires holding both ProcArrayLock and XidGenLock currently
(to avoid GetNewTransactionId() having to hold ProcArrayLock). But
that doesn't seem too bad?

In the places where you now acquire both, I guess you also need to
release both in the error path?

Hm. I guess you mean:

    if (arrayP->numProcs >= arrayP->maxProcs)
    {
        /*
         * Oops, no room. (This really shouldn't happen, since there is a
         * fixed supply of PGPROC structs too, and so we should have failed
         * earlier.)
         */
        LWLockRelease(ProcArrayLock);
        ereport(FATAL,
                (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
                 errmsg("sorry, too many clients already")));
    }

I think we should just remove the LWLockRelease? At this point we
already have set up ProcKill(), which would release all lwlocks after
the error was thrown?

Greetings,

Andres Freund

#9Andres Freund
andres@anarazel.de
In reply to: David Rowley (#6)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-03-17 23:59:14 +1300, David Rowley wrote:

Nice performance gains.

Thanks.

On Sun, 1 Mar 2020 at 21:36, Andres Freund <andres@anarazel.de> wrote:
2. I don't quite understand your change in
UpdateSubscriptionRelState(). snap seems unused. Drilling down into
SearchSysCacheCopy2, in SearchCatCacheMiss() the systable_beginscan()
passes a NULL snapshot.

the whole patch does this. I guess I don't understand why 0002 does this.

See the thread at /messages/by-id/20200229052459.wzhqnbhrriezg4v2@alap3.anarazel.de

Basically, the way catalog snapshots are handled right now, it's not
correct to do much without a snapshot held. Any concurrent invalidation can
cause the catalog snapshot to be released, which can reset the backend's
xmin. Which in turn can allow for pruning etc to remove required data.

This is part of this series only because I felt I needed to add stronger
asserts to be confident in what's happening. And they started to trigger
all over :( - and weren't related to the patchset :(.

4. Did you consider the location of 'delayChkpt' in PGPROC. Couldn't
you slot it in somewhere it would fit in existing padding?

0007

Hm, maybe. I'm not sure what the best thing to do here is - there's some
arguments to be made that we should keep the fields moved from PGXACT
together on their own cacheline. Compared to some of the other stuff in
PGPROC they're still accessed from other backends fairly frequently.

5. GinPageIsRecyclable() has no comments at all. I know that
ginvacuum.c is not exactly the model citizen for function header
comments, but likely this patch is no good reason to continue the
trend.

Well, I basically just moved the code from the macro of the same
name... I'll add something.

9. I get the idea you don't intend to keep the debug message in
InvisibleToEveryoneTestFullXid(), but if you do, then shouldn't it be
using UINT64_FORMAT?

Yea, I don't intend to keep them - they're way too verbose, even for
DEBUG*. Note that there's some advantage in the long long cast approach
- it's easier to deal with for translations IIRC.

13. is -> are

* accessed data is stored contiguously in memory in as few cache lines as

Oh? 'data are stored' sounds wrong to me, somehow.

Greetings,

Andres Freund

#10Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#9)
Re: Improving connection scalability: GetSnapshotData()

On Sun, Mar 29, 2020 at 1:49 PM Andres Freund <andres@anarazel.de> wrote:

13. is -> are

* accessed data is stored contiguously in memory in as few cache lines as

Oh? 'data are stored' sounds wrong to me, somehow.

In computer contexts it seems pretty well established that we treat
"data" as an uncountable noun (like "air"), so I think "is" is right
here. In maths or science contexts it's usually treated as a plural
following Latin, which admittedly sounds cleverer, but it also has a
slightly different meaning, not bits and bytes but something more like
samples or (wince) datums.

#11Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#1)
Re: Improving connection scalability: GetSnapshotData()

On Sun, Mar 1, 2020 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:

The workload is a pgbench readonly, with pgbench -M prepared -c $conns
-j $conns -S -n for each client count. This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized.

conns   tps master        tps pgxact-split

1       26842.492845      26524.194821
10      246923.158682     249224.782661
50      695956.539704     709833.746374
100     1054727.043139    1903616.306028
200     964795.282957     1949200.338012
300     906029.377539     1927881.231478
400     845696.690912     1911065.369776
500     812295.222497     1926237.255856
600     888030.104213     1903047.236273
700     866896.532490     1886537.202142
800     863407.341506     1883768.592610
900     871386.608563     1874638.012128
1000    887668.277133     1876402.391502
1500    860051.361395     1815103.564241
2000    890900.098657     1775435.271018
3000    874184.980039     1653953.817997
4000    845023.080703     1582582.316043
5000    817100.195728     1512260.802371

I think these are pretty nice results.

This scalability improvement is clearly very significant. There is
little question that this is a strategically important enhancement for
the Postgres project in general. I hope that you will ultimately be
able to commit the patchset before feature freeze.

I have heard quite a few complaints about the scalability of snapshot
acquisition in Postgres. Generally from very large users that are not
well represented on the mailing lists, for a variety of reasons. The
GetSnapshotData() bottleneck is a *huge* problem for us. (As problems
for Postgres users go, I would probably rank it second behind issues
with VACUUM.)

--
Peter Geoghegan

#12Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#11)
Re: Improving connection scalability: GetSnapshotData()

On Sat, Mar 28, 2020 at 06:39:32PM -0700, Peter Geoghegan wrote:

On Sun, Mar 1, 2020 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:

The workload is a pgbench readonly, with pgbench -M prepared -c $conns
-j $conns -S -n for each client count. This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized.

conns   tps master        tps pgxact-split

1       26842.492845      26524.194821
10      246923.158682     249224.782661
50      695956.539704     709833.746374
100     1054727.043139    1903616.306028
200     964795.282957     1949200.338012
300     906029.377539     1927881.231478
400     845696.690912     1911065.369776
500     812295.222497     1926237.255856
600     888030.104213     1903047.236273
700     866896.532490     1886537.202142
800     863407.341506     1883768.592610
900     871386.608563     1874638.012128
1000    887668.277133     1876402.391502
1500    860051.361395     1815103.564241
2000    890900.098657     1775435.271018
3000    874184.980039     1653953.817997
4000    845023.080703     1582582.316043
5000    817100.195728     1512260.802371

I think these are pretty nice results.

This scalability improvement is clearly very significant. There is
little question that this is a strategically important enhancement for
the Postgres project in general. I hope that you will ultimately be
able to commit the patchset before feature freeze.

+1

I have heard quite a few complaints about the scalability of snapshot
acquisition in Postgres. Generally from very large users that are not
well represented on the mailing lists, for a variety of reasons. The
GetSnapshotData() bottleneck is a *huge* problem for us. (As problems
for Postgres users go, I would probably rank it second behind issues
with VACUUM.)

+1

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#13Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#11)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-03-28 18:39:32 -0700, Peter Geoghegan wrote:

I have heard quite a few complaints about the scalability of snapshot
acquisition in Postgres. Generally from very large users that are not
well represented on the mailing lists, for a variety of reasons. The
GetSnapshotData() bottleneck is a *huge* problem for us. (As problems
for Postgres users go, I would probably rank it second behind issues
with VACUUM.)

Yea, I see it similarly. For busy databases, my experience is that
vacuum is the big problem for write heavy workloads (or the write
portion), and snapshot scalability the big problem for read heavy oltp
workloads.

This scalability improvement is clearly very significant. There is
little question that this is a strategically important enhancement for
the Postgres project in general. I hope that you will ultimately be
able to commit the patchset before feature freeze.

I've done a fair bit of cleanup, but I'm still fighting with how to
implement old_snapshot_threshold in a good way. It's not hard to get it
back to kind of working, but it requires some changes that go into the
wrong direction.

The problem basically is that the current old_snapshot_threshold
implementation just reduces OldestXmin to whatever is indicated by
old_snapshot_threshold, even if not necessary for pruning to do the
specific cleanup that's about to be done. If OldestXmin < threshold,
it'll set shared state that fails all older accesses. But that doesn't
really work well with the approach in the patch of using a lower/upper
boundary for potentially valid xmin horizons.

I think the right approach would be to split
TransactionIdLimitedForOldSnapshots() into separate parts. One that
determines the most aggressive horizon that old_snapshot_threshold
allows, and a separate part that increases the threshold after which
accesses need to error out
(i.e. SetOldSnapshotThresholdTimestamp()). Then we can only call
SetOldSnapshotThresholdTimestamp() for exactly the xids that are
removed, not for the most aggressive interpretation.
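The proposed split can be sketched as one side-effect-free function that yields the most aggressive horizon, plus a step that raises the error threshold only when a removal actually depended on that extra aggressiveness. This is a toy under loud assumptions: all names are hypothetical, 64-bit xids stand in for the real types, and the threshold is tracked as an xid only, whereas SetOldSnapshotThresholdTimestamp() also deals with a timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for FullTransactionId. */
typedef uint64_t FullXid;

/* Hypothetical shared state: snapshots older than this must error out. */
typedef struct OldSnapshotState
{
    FullXid error_threshold_xid;
} OldSnapshotState;

/* Part 1: pure computation of the most aggressive horizon allowed. */
FullXid
most_aggressive_horizon(FullXid oldest_xmin, FullXid time_limit_xid)
{
    return time_limit_xid > oldest_xmin ? time_limit_xid : oldest_xmin;
}

/*
 * Part 2: decide whether a tuple deleted by xmax can be removed, and raise
 * the threshold only if the removal went beyond the accurate oldest_xmin.
 */
bool
try_remove(OldSnapshotState *st, FullXid xmax,
           FullXid oldest_xmin, FullXid time_limit_xid)
{
    FullXid horizon = most_aggressive_horizon(oldest_xmin, time_limit_xid);

    assert(horizon >= oldest_xmin);
    if (xmax >= horizon)
        return false;           /* still needed, nothing is removed */

    if (xmax >= oldest_xmin && xmax > st->error_threshold_xid)
        st->error_threshold_xid = xmax; /* removal relied on the threshold */
    return true;
}
```

With oldest_xmin 100 and a time-based limit of 150, removing a tuple with xmax 90 leaves the shared state untouched, while removing one with xmax 120 is what forces older snapshots to start failing.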

Unfortunately I think that basically requires changing
HeapTupleSatisfiesVacuum's signature, to take a more complex parameter
than OldestXmin (to take InvisibleToEveryoneState *), which quickly
increases the size of the patch.

I'm currently doing that and seeing how the result makes me feel about
the patch.

Alternatively we also can just be less efficient and call
GetOldestXmin() more aggressively when old_snapshot_threshold is
set. That'd be easier to implement - but seems like an ugly gotcha.

Greetings,

Andres Freund

#14Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Peter Geoghegan (#11)
Re: Improving connection scalability: GetSnapshotData()

On Sun, Mar 29, 2020 at 4:40 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Mar 1, 2020 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:

The workload is a pgbench readonly, with pgbench -M prepared -c $conns
-j $conns -S -n for each client count. This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized.

conns   tps master        tps pgxact-split

1       26842.492845      26524.194821
10      246923.158682     249224.782661
50      695956.539704     709833.746374
100     1054727.043139    1903616.306028
200     964795.282957     1949200.338012
300     906029.377539     1927881.231478
400     845696.690912     1911065.369776
500     812295.222497     1926237.255856
600     888030.104213     1903047.236273
700     866896.532490     1886537.202142
800     863407.341506     1883768.592610
900     871386.608563     1874638.012128
1000    887668.277133     1876402.391502
1500    860051.361395     1815103.564241
2000    890900.098657     1775435.271018
3000    874184.980039     1653953.817997
4000    845023.080703     1582582.316043
5000    817100.195728     1512260.802371

I think these are pretty nice results.

This scalability improvement is clearly very significant. There is
little question that this is a strategically important enhancement for
the Postgres project in general. I hope that you will ultimately be
able to commit the patchset before feature freeze.

+1, these are really very cool results.

Although this patchset is expected to be clearly a big win on the majority
of workloads, I think we still need to investigate different workloads
on different hardware to ensure there is no regression.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#15Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#14)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On March 29, 2020 11:24:32 AM PDT, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

clearly a big win on majority
of workloads, I think we still need to investigate different workloads
on different hardware to ensure there is no regression.

Definitely. Which workloads are you thinking of? I can think of those affected facets: snapshot speed, commit speed with writes, connection establishment, prepared transaction speed. All in the small and large connection count cases.

I did measurements on all of those but prepared xacts, fwiw. That definitely needs to be measured, due to the locking changes around procarrayadd/remove.

I don't think regressions besides perhaps 2pc are likely - there's nothing really getting more expensive but procarray add/remove.

Andres

Regards,

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#16Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#15)
Re: Improving connection scalability: GetSnapshotData()

On Sun, Mar 29, 2020 at 11:50:10AM -0700, Andres Freund wrote:

Hi,

On March 29, 2020 11:24:32 AM PDT, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

clearly a big win on majority
of workloads, I think we still need to investigate different workloads
on different hardware to ensure there is no regression.

Definitely. Which workloads are you thinking of? I can think of those
affected facets: snapshot speed, commit speed with writes, connection
establishment, prepared transaction speed. All in the small and large
connection count cases.

I did measurements on all of those but prepared xacts, fwiw. That
definitely needs to be measured, due to the locking changes around
procarrayadd/remove.

I don't think regressions besides perhaps 2pc are likely - there's
nothing really getting more expensive but procarray add/remove.

If I get some instructions what tests to do, I can run a bunch of tests
on my machines (not the largest boxes, but at least something). I don't
have the bandwidth to come up with tests on my own, at the moment.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Andres Freund (#15)
Re: Improving connection scalability: GetSnapshotData()

On Sun, Mar 29, 2020 at 9:50 PM Andres Freund <andres@anarazel.de> wrote:

On March 29, 2020 11:24:32 AM PDT, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

clearly a big win on majority
of workloads, I think we still need to investigate different workloads
on different hardware to ensure there is no regression.

Definitely. Which workloads are you thinking of? I can think of those affected facets: snapshot speed, commit speed with writes, connection establishment, prepared transaction speed. All in the small and large connection count cases.

The following pgbench scripts come to mind first:
1) SELECT txid_current(); (artificial but good for checking corner case)
2) Single insert statement (as example of very short transaction)
3) Plain pgbench read-write (you already did it for sure)
4) pgbench read-write script with increased amount of SELECTs. Repeat
select from pgbench_accounts say 10 times with different aids.
5) 10% pgbench read-write, 90% of pgbench read-only

I did measurements on all of those but prepared xacts, fwiw

Great, it would be nice to see the results in the thread.

That definitely needs to be measured, due to the locking changes around procarrayadd/remove.

I don't think regressions besides perhaps 2pc are likely - there's nothing really getting more expensive but procarray add/remove.

I agree that ProcArrayAdd()/Remove() should be the first subject of
investigation, but other cases should be checked as well IMHO.
Regarding 2pc, the following scenarios come to mind:
1) pgbench read-write modified so that every transaction is prepared
first, then commit prepared.
2) 10% of 2pc pgbench read-write, 90% normal pgbench read-write
3) 10% of 2pc pgbench read-write, 90% normal pgbench read-only

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#18Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#13)
Re: Improving connection scalability: GetSnapshotData()

Hi,

I'm still fighting with snapshot_too_old. The feature is just badly
undertested, underdocumented, and there's lots of other oddities. I've
now spent about as much time on that feature as on the whole rest of
the patchset.

As an example for under-documented, here's a definitely non-trivial
block of code without a single comment explaining what it's doing.

    if (oldSnapshotControl->count_used > 0 &&
        ts >= oldSnapshotControl->head_timestamp)
    {
        int offset;

        offset = ((ts - oldSnapshotControl->head_timestamp)
                  / USECS_PER_MINUTE);
        if (offset > oldSnapshotControl->count_used - 1)
            offset = oldSnapshotControl->count_used - 1;
        offset = (oldSnapshotControl->head_offset + offset)
            % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
        xlimit = oldSnapshotControl->xid_by_minute[offset];

        if (NormalTransactionIdFollows(xlimit, recentXmin))
            SetOldSnapshotThresholdTimestamp(ts, xlimit);
    }

    LWLockRelease(OldSnapshotTimeMapLock);

Also, SetOldSnapshotThresholdTimestamp() acquires a separate spinlock -
not great to call that with the lwlock held.
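For what it's worth, my reading of the uncommented block is that it is a lookup in a per-minute ring buffer of xids. Below is a commented standalone sketch of that reading; the struct layout, the toy map size, and the function name are assumptions for illustration, and the locking plus the final comparison against recentXmin are left out.

```c
#include <assert.h>
#include <stdint.h>

#define USECS_PER_MINUTE INT64_C(60000000)
#define MAP_ENTRIES 8           /* toy size, hypothetical */

/* Hypothetical mirror of the fields the quoted block reads. */
typedef struct TimeMap
{
    int64_t  head_timestamp;    /* timestamp of the oldest tracked minute */
    int      head_offset;       /* ring index of that oldest entry */
    int      count_used;        /* number of valid entries in the ring */
    uint32_t xid_by_minute[MAP_ENTRIES];
} TimeMap;

/* Returns 1 and sets *xlimit if ts is covered by the map, else returns 0. */
int
lookup_xid_for_ts(const TimeMap *map, int64_t ts, uint32_t *xlimit)
{
    int64_t offset;

    assert(map->count_used <= MAP_ENTRIES);
    if (map->count_used <= 0 || ts < map->head_timestamp)
        return 0;

    /* Whole minutes elapsed since the oldest tracked entry. */
    offset = (ts - map->head_timestamp) / USECS_PER_MINUTE;

    /* Timestamps past the newest entry are clamped to the newest entry. */
    if (offset > map->count_used - 1)
        offset = map->count_used - 1;

    /* Translate the logical position into a physical ring index. */
    offset = (map->head_offset + offset) % MAP_ENTRIES;

    *xlimit = map->xid_by_minute[offset];
    return 1;
}
```

With head_offset 6 and four entries in use, a timestamp 2.5 minutes after head_timestamp maps to logical position 2, which wraps around to ring index 0.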

Then there's this comment:

    /*
     * Failsafe protection against vacuuming work of active transaction.
     *
     * This is not an assertion because we avoid the spinlock for
     * performance, leaving open the possibility that xlimit could advance
     * and be more current; but it seems prudent to apply this limit. It
     * might make pruning a tiny bit less aggressive than it could be, but
     * protects against data loss bugs.
     */
    if (TransactionIdIsNormal(latest_xmin)
        && TransactionIdPrecedes(latest_xmin, xlimit))
        xlimit = latest_xmin;

    if (NormalTransactionIdFollows(xlimit, recentXmin))
        return xlimit;

So this is not using a lock, so the values aren't accurate, but it avoids
data loss bugs? I also don't know which spinlock is avoided on the path
here as mentioned - the acquisition is unconditional.

But more importantly - if this is about avoiding data loss bugs, how on
earth is it ok that we don't go through these checks in the
old_snapshot_threshold == 0 path?

    /*
     * Zero threshold always overrides to latest xmin, if valid. Without
     * some heuristic it will find its own snapshot too old on, for
     * example, a simple UPDATE -- which would make it useless for most
     * testing, but there is no principled way to ensure that it doesn't
     * fail in this way. Use a five-second delay to try to get useful
     * testing behavior, but this may need adjustment.
     */
    if (old_snapshot_threshold == 0)
    {
        if (TransactionIdPrecedes(latest_xmin, MyProc->xmin)
            && TransactionIdFollows(latest_xmin, xlimit))
            xlimit = latest_xmin;

        ts -= 5 * USECS_PER_SEC;
        SetOldSnapshotThresholdTimestamp(ts, xlimit);

        return xlimit;
    }
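
To make the contrast between the two quoted paths concrete, here is a
minimal standalone sketch of both adjustments to xlimit. It is an
illustration under simplifying assumptions, not the PostgreSQL code: Xid,
xid_precedes, xid_follows, clamp_failsafe and clamp_zero_threshold are
hypothetical names, the comparison helpers only handle normal xids (using
modulo-2^32 arithmetic), and the constant 3 for the first normal xid is an
assumption mirroring PostgreSQL's convention.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;               /* stand-in for TransactionId */

#define FIRST_NORMAL_XID 3u         /* assumed: smaller values are special */

static bool
xid_is_normal(Xid x)
{
    return x >= FIRST_NORMAL_XID;
}

/* Wraparound-aware "a is older than b" for normal xids. */
static bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/* Wraparound-aware "a is newer than b" for normal xids. */
static bool
xid_follows(Xid a, Xid b)
{
    return (int32_t) (a - b) > 0;
}

/* Failsafe path: only ever moves xlimit backwards, to latest_xmin. */
static Xid
clamp_failsafe(Xid xlimit, Xid latest_xmin)
{
    if (xid_is_normal(latest_xmin) && xid_precedes(latest_xmin, xlimit))
        xlimit = latest_xmin;
    return xlimit;
}

/*
 * Zero-threshold path: only ever moves xlimit forwards, to latest_xmin,
 * and only while latest_xmin is still older than the caller's own xmin.
 */
static Xid
clamp_zero_threshold(Xid xlimit, Xid latest_xmin, Xid my_xmin)
{
    if (xid_precedes(latest_xmin, my_xmin) && xid_follows(latest_xmin, xlimit))
        xlimit = latest_xmin;
    return xlimit;
}
```

In this sketch the failsafe can only lower xlimit, while the
zero-threshold path can only raise it and returns without ever applying
the lowering step, which is the asymmetry being questioned above.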

This feature looks like it was put together by applying force until
something gave, and then stopping just there.

Greetings,

Andres Freund

#19Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#18)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-03-31 13:04:38 -0700, Andres Freund wrote:

I'm still fighting with snapshot_too_old. The feature is just badly
undertested, underdocumented, and there are lots of other oddities. I've
now spent about as much time on that feature as on the whole rest of
the patchset.

To expand on this being under-tested: The whole time mapping
infrastructure is not tested, because all of that is bypassed when
old_snapshot_threshold = 0. And old_snapshot_threshold = 0 basically
only exists for testing. Most of the complexity of this feature is in
TransactionIdLimitedForOldSnapshots() and
MaintainOldSnapshotTimeMapping() - and none of that complexity is tested,
because the tests run with old_snapshot_threshold = 0.

So we have test-only infrastructure that doesn't actually allow testing
the feature.

And the tests that we do have don't have a single comment explaining
what the expected results are. Except for the newer
sto_using_hash_index.spec, they just run all permutations. I don't know
how those tests actually help, since it's not clear why any of the
results are the way they are, which of them are just the results of
bugs, or which are not affected by s_t_o at all.

Greetings,

Andres Freund

#20David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#2)
1 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

On Sun, 1 Mar 2020 at 21:47, Andres Freund <andres@anarazel.de> wrote:

On 2020-03-01 00:36:01 -0800, Andres Freund wrote:

conns   tps master        tps pgxact-split

1       26842.492845      26524.194821
10      246923.158682     249224.782661
50      695956.539704     709833.746374
100     1054727.043139    1903616.306028
200     964795.282957     1949200.338012
300     906029.377539     1927881.231478
400     845696.690912     1911065.369776
500     812295.222497     1926237.255856
600     888030.104213     1903047.236273
700     866896.532490     1886537.202142
800     863407.341506     1883768.592610
900     871386.608563     1874638.012128
1000    887668.277133     1876402.391502
1500    860051.361395     1815103.564241
2000    890900.098657     1775435.271018
3000    874184.980039     1653953.817997
4000    845023.080703     1582582.316043
5000    817100.195728     1512260.802371

I think these are pretty nice results.

FWIW, I took this for a spin on an AMD 3990x:

# setup
pgbench -i postgres

# benchmark
#!/bin/bash

for i in 1 10 50 100 200 300 400 500 600 700 800 900 1000 1500 2000 3000 4000 5000;
do
    echo "Testing with $i connections" >> bench.log
    pgbench2 -M prepared -c "$i" -j "$i" -S -n -T 60 postgres >> bench.log
done

pgbench2 is your patched version of pgbench. I got some pretty strange
results with the unpatched version: up to about 50 million tps
excluding connection establishing, which seems pretty far-fetched.

connections   Unpatched      Patched
1             49062.24413    49834.64983
10            428673.1027    453290.5985
50            1552413.084    1849233.821
100           2039675.027    2261437.1
200           3139648.845    3632008.991
300           3091248.316    3597748.942
400           3056453.5      3567888.293
500           3019571.47     3574009.053
600           2991052.393    3537518.903
700           2952484.763    3553252.603
800           2910976.875    3539404.865
900           2873929.989    3514353.776
1000          2846859.499    3490006.026
1500          2540003.038    3370093.934
2000          2361799.107    3197556.738
3000          2056973.778    2949740.692
4000          1751418.117    2627174.81
5000          1464786.461    2334586.042
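
To read the table as relative gains rather than raw tps, one can divide
the Patched column by the Unpatched column for each row. A small sketch
of that arithmetic follows; speedup is a hypothetical helper name, not
part of pgbench. Applied to the rows above it gives roughly 1.02x at
1 connection, 1.23x at 1000 connections and 1.59x at 5000 connections.

```c
#include <assert.h>

/*
 * Ratio of patched to unpatched throughput for one table row.
 * A value above 1.0 means the patched build was faster.
 */
static double
speedup(double unpatched_tps, double patched_tps)
{
    assert(unpatched_tps > 0.0);
    return patched_tps / unpatched_tps;
}
```

For example, speedup(1464786.461, 2334586.042) is the 5000-connection
row, where the gap between the two builds is widest in this run.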

Attached as a graph as well.

Likewise.

David

Attachments:

excluding_connections_est_3990x.png (image/png)
m���"�$%�����;���UI�>��AHEXq�����(�hc�4�1b�<��P~��}�!%��^|B6�W=��EHEXq���2�L�1b�1
EI�o��{C5;~"�7��Jw�Fz�$�a�ELC�l�����������_�r5/�q�b� �b!�b^!�b�aU������i(�HB�'��D^��1�y�=��k���&�+.b�XvR��m���6FL��"������[7+,_�\��H~~��3���s��2}�t+~c��T����Y"��o��>��g�37�Y���1
m,���?�������F����}UXkb�@���5��;w�&m���1MF�<��^zIx�+����L�<�����������_�U�x�/� ��*�:n�8���G+~c��>R��|>�����k����.��]VEYN��$���ihc��H�y��U�IH"X��d�������/�G?���k�N�t���o�[%|����)�p�B%��~��
�~������c��.��4?���V<�v�G���A��E����f��NW��G��&I+.b�I�G�F���~��-"����;-��U.���U���WD��\��Y;���pT���N�G��1b��y�O�V���������X�r�~��
S�C~��J\]�rEm:�;������={�����Wx��m��8��K�UK-�W:���9�*�/���&I+.b��UO���G�Gj�1b��yN�;w� ��H6h�@����z��<��#�����_:t���Q]�+))��M�J�������b�������8��Kea�E�g�2��t7z=k����"����t��_~T�m���+����kJ������S�����eD�}�l�o����1O���D�Y�F��
6��<��L.[�L�o�{���8��x�� @o�!N���Ea��,��vE:��y�<��C�}��"��������:t��C���X���	,��iL�X���c�����?.-[����<�m����A	�:��B�z��*��>�����M;��}��j�T(((��7o&�/���������������NWv���3gT�oC:m��tHfcg/����o��=�e��k2f�U�=��t�vM>�pKZ��#�
-����=�*>{p���K�j����>Y��-����SEr�|�\�z'nZ27�c0�����D�=b��+��"���o�����=y��t��u�x���EV/�����Cg�J��>�r�����10 ��-
�x��hc�Cul���s���EY�����rC&��#�V�K�����"�db��5�D^T�t���0�X��/�����nH�We��K2v�E����,�vN��;+������X�1���LP+"N�M�&?�����7���7��D�������H(��4ib
���xb��_�2�x�>=\3�x���&��X"�x�X�t���|���{����B%�>�t����N��W&������.Y�-�ze��Y�����nz��>#���A�����lr�|3�P�/)�qk�e��Ys�P����\,7��OC�	,�Lmc&0.���9j��������/���C��3f��D�8l���������!@����K�Hj���L5^:(��K����Y��V%��$�$���ihc�4�ac���\����A�r* ���e��j�?������}������d�������a)g^o`9FL��s� n�N���T�>}�9xN�8�\�����m8v������rdRW� � ���i�&�
�7�i1�j�t����xy�~+��R���kU&�v��${a�ELC#��K���
X��d���b�7LXJ�����U�����?~�:	��}.�c�����}����*X�|���?��<��*~c�iR�WByw�^����R����*����j!�`�ELC#���6f��+6���'�#��[D������SMc��SD�����1�m����iX��d���0H86I0|C#���w�k��Jp��Z������UW��v��#�X�����O�j������ye���'��o���c=���<�IH|Xq����i2��(�2�c�4+��^�N����'��p����A�M�0������[���K������nZ+go��
��H�OHe��"����d��������[��/�"�o����[�UA�}���2���������h@vS�U�c�4+������D^��-Yu$`�(�	�V\�4�1b�l�1����1b�<b�z$������o���6b��*��m��m�$�1
m��&mL���'��o�����6"�>��[�7�-_F�_�����'&�����1b�6F��a���z�J��Rm�=]po<qo�B�����6FLC���hc�4y���|�%��fI0$������+�B���`�ELC#���U���"�oaD�M����+c�oZ5�����������6"������,�1b�<bQ��;K��O�;%a�n��\|z�$������6FLC�>�A#���#��5S/��G%�v���A��m)d	IV\�4�1b�X�Q�/�&��F�_D�
����������<8�7���%��{e<�_�|k���_Q��1b�<%���Q"����j���~�P��"�gG�*�����6FLC�=rU����i(��"\Zj
�,����W����DkBR�1
m���6V�d�����P�E��-K�������9�m�XO��T`�ELC#����_��;t9��~/��5��+����j�������S�[{/��Ms��6FLC�G���-�W:���E
����5QH��t��+.b�1
m,��4�G#���#������+=XN�Y^���X,B*�1
m���6�=8��i�K'|
|S��?"�&l��a���!�hc�4yD����y�S�������t�W���I������6FLC�~�Z����i(����e�%�\�f������L�B�+$uXq����ihc���e���}��o�W�L���1ny)���o�[����'B������Y���,�p\����5	I7yD�Y��y�����{NW�����:l����ihc�I����cJ������?������F��kA��XX��`��P��{��{=y���s�NW��I6��ihc�4�1�*Z����
�7/"��B�E���@�����{�g���8�t����K�2r�Of����C�z:����!�S�/�B@�G�i�-�wk�6��yv�K�X$BR��#b�1
m�����_������ye�K��s����g����V�]xd�*�L������������n�������yDQ>f�%��-�o,��xb1I
6��ihc�4�1b����[����_v��R+��mD��������p�4������z��Y$�}�!��[�����^AYX��X"HN@�Ge�zZ"o���V2x�����1
m���6FL��1�/�z��_��3AYz�/S��d���X��N�=��d���{a�K�y��������n%��?b�O�o���C�r* /�����.6��;�{(���������7��UH,=��>�j�qDLC#�����jc��H^iX��
���AY}$����Z��>���y�G>��Q���P�4�d�g��.y#"?�����1�����/���0��zAH�]t���P�E���["����Vap�:'�����1
m���6FL�+qW�����6�d�*�t[���3=�r�[��pI�����=g���}��^C�1|2k�_	�g�r�jP.�
)�I�/��<�(���%����@��OE^x��B�
G�4�1b�1�I��r�($'��d���,?��i��2|�Oy��r�G>����F���A.y"��O����{��E^��'S��#���������[Ea:~�G��1�����kJ��y��X�2�Rc�7!U��#b�1
m����lX0o���r�2{�_F��^?=��4��?�-M������E���s��#��z�[;p�O9�A�"|������r���_��<�(x�y%�n4z�zy�-�L\Ru�8"�������i�����a����W���dP�����'�y��}����Z^�.����^��JD$���N�<�{it��;bC?��H�:�F��+�\���<���	%�.>��zI��OHUa����6FLC#����K��7B��lP�����O�7=������{�/�����.y{�[�N�H��^��'��������r 6��F���O�/��"�(�>�D���X/"^fB�
G�4�1b�1M6��\��}���X@�����
>���m�C?����0�<��~�g
������'E�~b�(���+:�s�����]���_R�"���J�!�z�m���S��sRu�8"�������i�����=qG��c�E��2q�O������C?'��|>��=f{��X�ryy�K>��/fy�g�x8���-:�s{l�������,6�e'E�.)�D��WZ��
/]��T6��ihc�4�1b�\�18`9s3$;�e���L��W�:�s�5�ZO��;c�J�5�d�'��5�������_0�p���l>����r:r�k�!)(�7�f!Q�	��n��e���]��T6��ihc�4�1b�XE\��������sw��:}X���B����Q�y�wI�J�~>����}4�-_��H��^�A�������e���<"�g-�7��.�AW8!��1
m���6FLCK
��������x�_&m�)�-���e�'"�<��(��0�%���/�txi�K�l?k�y���[}��pt�'���/yD�GY"o�[��K�.tB�+.b�1
m���6Vs��:�`=>�[gl����+����gS�j��+C\��O|��C�~.�>���c���J��9�m�����~^��!����!�_(���vo�D���+c?s��VI�`�ELC#����������_����y{�2C?W�����5�ZN��l�?��O�*W�A�(F9~Y��k}2="(���sA9z5�����������I�����X"oH��jK��&�:��"�������ihcu�����3Yr�/��������+_������{q�K�����	?^��yw�[���v�W�����>Yv0 �N��v�RRgj��E�������j�4�_�CH�a�ELC#�������n��%"��4��~���
�_0���P�<�7�����~.�<������\���H8~�:��Nd�����K6�/&�V(����N�D���J����B�+.b�1
m���6�`91��s������z��������e�[^�VK:�}:`�D����]��)[}��X@v�
��+A9{;������_L�E^���<�y_|�ZZ�w��RuXq����ihc�4�����u��yy�"b�/�VE��t���61�//
v�������@��5�-m����|�r3~������Q�/�n��J^H�Rt�B�G�|� K�}�~����5�H�a�ELC#�����������`}��0���z����f�]�L%k�5�]���H|��ed�\3v�e��������1��������k��>�vE�>���ihc�4�1b�XnRhs��ht����^����n�dr��z�K�����P;~���B�5W���[�����{0v��B��!�t��yo|uBu3R]Xq����ihc�4�1b�r�r1(��d�6D�y����e�K��l���Q�/�-�-����^(�2��N�Z"��o.��
��#����"�������ihc$�u���e�n��9 "�z,��!��b�_����c�����E^�P�����k�����A�G�+.b�1
m���6Fj����|P��7q�_��~����+���n��m
�U	���?$O|_"K��j�$�`�ELC#�������)A��Ea�p����w �5�P�e��U"����V]��Nb{�:���ihc�4�1b�1�I�����L���
(���b
V^$9+.b�1
m���6FLC��������J�i��y�oR��������6FLC#����P��:^�x{�M�����U	����ihc�4�1b�1
E^�*,�D����W"��
���TV\�4�1b�1
m���"/�	^�j���M?��{�K�y���"�������ihc�4y9N��)K�-j�I�/��!�z��"�������ihc�4y9���~K��z��4���!�z��"�������ihc�4y9�o�VK�M|������!�z��"�������ihc�4y9�w�*K��x������!�z��"�������ihc�4y9�g�|K�
h9I�]@�Gj+.b�1
m���6FLC����gO�D�����+��=�TV\�4�1b�1
m���"/�)�8�y�?]&�7�b{����ihc�4�1b�1
E^�S6��%���� 3vP��������6FLC#����P��8�}�Y"���d�l!��1
m���6FLC#����qJ�v�D�������B�+.b�1
m���6FLC���w��y�v>+�.c{����ihc�4�1b�1
E^�S��;��{��u9u�"��V\�4�1b�1
m���"/�)|��%����P��b{����ihc�4�1b�1
E^�S��%�n=�y�G�����=�TV\�4�1b�1
m���"/��k�;%�.6xL�Y.��$5�1
m���6FLC#����eB!����J��|��4�_�AH�a�ELC#�������i(�r���e��;�li>��CH�a�ELC#�������i(�r�P�]K��x�mi1��CH�a�ELC#�������i(�r���K��[��G�n�'���������6FLC#����P��0���-���Y;���"��V\�4�1b�1
m���"/���k��y�w��+��=�TV\�4�1b�1
m���"/��m�d���o���|�=�TV\�4�1b�1
m��&kD^0�}������c[*RZZ*�V��Y�f���8�I]��z��y�f+~c��T���g�rK��yw�L�A�Gj+.b�1
m���6FL�5"������������c[��5q�D��(<��
��c��
����A�-[�L|�A+~c�]��/U���X"oH�����?���������6FLC#����d���XB�X�&M���'���[���c�=&M�6U�'?��\�M]�;~��k?����
�4��K��I����j�l<��!���,T�1b�1
m�����y�)[�t�������_�J�����>������O?U�������"3g�����������*D��A��G?��z��ol�>�I5^U(?�y][/���v<!�`�ELC#�������i2Z�A<5o�\:t� 6lP=ZN����o+�m����k���_~��Q]�+++��_]>��3%5����_�}��j��P6��%���l���I�Gj��B�@#�������iL��q���c�&���'��9��L.Z�(����w��Q]��=����}��j��P��K�}�n�\+��R}`����Ihc�4�1b�1�I�U�+�D�����\=�u���A�����x��
�
�'�xU���v��k��>)v�c{�>���ihc�4�1b�1
E^�<-,,��6X"���O���B����&����P����!�6�`:��L����m��B�a8�~���b��^����:�w��%y����7�C�T��:�q^{��%�^�~Qmc`�i8|��*T��c`HG��1��1��6�`:h3A�yZio�8,i��������A�A�
<��wL��6-S��
Z��j��%��x���g`�n��I��6�`:��L����m��B�iq��Y3�r���a�e����7��I]�����{��'����7�i1�j��p��������C����y�$$�g@LC#�������i�~N�O�����I����?_�Ua�����i��:�������T�������j�T������������tOl+!5�1
m���6FLC#��z�p�-Z�P��P �F�Q������J����X�����O�j�����?W"�l��uEI&Bm���6FLC#�1ic�*��_�.m��U�!�������*�=�����U<�48d���;�pK5^e��J��xG� �Vxc{����ihc�4�1b�1M��<R5BwnY"o�����
�OB�+.b�1
m���6FLC���/]�D������;(�Hz`�ELC#�������i(�r������[��SYr��CH�`�ELC#�������i(�r��]��[���l<��!�f��"�������ihc�4y9�w�K��|����po�uBj+.b�1
m���6FLC���xV-�D��w���y$=��"�������ihc�4y9�{�,K��x�\+EwRCXq����ihc�4�1b���5m�%��8I�]��Bj+.b�1
m���6FLC�����j�����Gk�4�����6FLC#����P��(�zY"�{�e����V\�4�1b�1
m���"/G����%����1���������6FLC#����P��(7�n��=w��RsXq����ihc�4�1b���Z����=�Xl+!5�1
m���6FLC#����Q����%�L��JH�a�ELC#�������i(�r�kM��D��y�c[	�9���ihc�4�1b�1
E^�r��',�7vuYl+!5�1
m���6FLC#����E�a���%�n=�����AH�a�ELC#�������i(�r�������c�d�?���������6FLC#����P�� ��bK��x��l8��!����"�������ihc�4y9H��
K�x������!����"�������ihc�4y9H��YK�m�m9y�"��V\�4�1b�1
m���"/�=d����|$�
B�=��V\�4�1b�1
m���"/���n��e���bW8���������6FLC#����P�� �
k-�7��.�hM�FXq����ihc�4�1b���x�"K��|�Wl+!��1
m���6FLC#����A�L�f�������Xq����ihc�4�1b����������v|l+!��1
m���6FLC#����A.�pO�}=3���������6FLC#����P�� g�t�D����c[	I���ihc�4�1b�1
E^r�����=d]l+!��1
m���6FLC#����AN�jm���w���Xq����ihc�4�1b������Z"o�������V\�4�1b�1
m���"/9���%���<�JHz`�ELC#�������i(�r��7�D���7c[	I���ihc�4�1b�1
E^r��G-���XIl+!��1
m���6FLC#����An=�%��<������m%$=��"�������ihc�4y9G������o�jA(���������6FLC#����P����%��6|J�\��B�+.b�1
m���6FLC��k����;���c�	I���ihc�4�1b�1
E^�,�����b[	I���ihc�4�1b�1
E^���-��������V\�4�1b�1
m���"/�z}����z��VB�+.b�1
m���6FLC��c���D��w����>Xq����ihc�4�1b�����Z"ok��c[	I���ihc�4�1b�1
E^��-�X"os������V\�4�1b�1
m���"/����-���m��VB�+.b�1
m���6FLC��cxJm"��!����V\�4�1b�1
m���"/�p�����oG���>Xq����ihc�4�1b���UTf����&���>Xq����ihc�4�1b����&����JH�`�ELC#�������i(�r���RK�m�0���������6FLC#����P��e�%���9qel+!��1
m���6FLC#����1J��-��g���VB�+.b�1
m���6FLC��c���'��/��JH�`�ELC#�������i(�r���E��;��@l+!��1
m���6FLC#����1�oZ"��������V\�4�1b�1
m���"/�(�Y`���{����>Xq����ihc�4�1b����F�%�.��JH�`�ELC#�������i(�r����D���y����V\�4�1b�1
m���"/�(��g���;�����V\�4�1b�1
m�����Q��C
��U��c?��r_l+!��1
m���6FLC#����1��DE��'%�`l#!i�1
m���6FLC#����1�������B�+.b�1
m���6FLC��ch�w�Q��B�+.b�1
m���6FLC��ch�w��b[I/���ihc�4�1b�1
E^��E�������^Xq����ihc�4�1b��C��M��m!$���"�������ihc�4y9�y{_k�BHza�ELC#�������i(�r-�v��:���������6FLC#����P��Z����}l!��1
m���6FLC#����0���?^<�~c[�h�����c[I/���ihc�4�1b�1
E^1����_�Bx���RzZ�m����B�+.b�1
m���6FLC��A\�xQ���������T��6�KK�u��BHza�ELC#�������i(�2�P($c���?��?�e�������m��8�Qp=_.5n �;�m!$���"�������ihc�4yByy����;��'�����m��m��j�T���-���Cc����^Xq����ihc�4�1b��������c�I���c[�1}�t�q�kXq����ihc�4�1b���������*A���q�kXq����ihc�4�1b��!�"����r��Y#a���d�����hc�m��t��1���]�v�Z���"/�`=�����H(��qR�M�6��O00000000000diX�bE���^(��Haa�4n��>/���&�!N*��7�����������!���'b���B��F�E�5_y��z�jl�������I!�B!��"/�>\����H	�c�������}�B!�b��4STT$����u<��
��m�G!�B!&��3�\�-�_��_U�o
<B!�BHm@�G!�B!YE!�B!�dy�B!��EP�B!�BHA�GH�����[R\\�R��'/^��'O���8�I������v[�k}����{`��v<��8m,Qy�n�I5�.�s����m�G�m'�x$�q�c:�+�`{�886��G�GH�r��5i���L�>=��d6l������@�K�.U(x��d>�(f��a=g�_��Wr��Ae4Z�=����a��A��x$���)���<9{��Q�I5�>��;��u���c[���vR�G��M�6Y��Z�lYAtA��n��B��Na��xy�dU�J�����;����o���������?T����\5�
c���d6��+V����?��?X�������_�����+�����x��}�t�#�
�#F��?��?�����Y��y3�6F�CII�U_:E^�m'�x$�A}9v�X���Z��]����Q��q�C��<���gQ�����(�� �r�:uJ�|�M�����R;EQ_~��*x8�*2l�0�o������k����G24�`[������\��%KT���7��
DWn��}�L�#���}���+4�����Uy�/� ���j<�]��:t��/Q_�E^�m'�x$;(//�w�}Wz��������>��Q��C��z�-�]��x�"��������c��s�='+W�T/�S���sG�q<h\=���VC=��H�����t����(��'���Tt������g�}��!N������(� �@�m'�x${���[��^�5k�H�
*��t�N��Hv��F3g�T�g�>C��,0i�$����5d��:�y'N�P��;��#$�@#|����7
�x"/8������rTp��Iw<����{W
������?���G�C�t�#�>��}[�e~��aCU��t�N��H�p��yi���>\����y���T���@��0����~��){(��!v�����p���x�<B2�D"o������_'��P��Jw<�������������l`�!N������r��D(_�XB���@�m'�x$;@���cG�����I�'��m;��#��+�]�W���!��G?R��a{(��(�#`��t��P���$�P�-�R�G�<�Y�f�~�5S����a��;�.0�"O7�0	��0�.���j<���q�p�B���~g�S��t�i
(���P�5o�\}����t��T�i(��PU(�v�R�2t��=�0$����i��Rv����������l`�!N�������_~Y�qe{���T����k4�1�r ���W~��_�2
^.!�m;��#��n3������d�Sa�OC�GH��H������R{������dx�.\�f�����_����8
D=���q�� ��7�i���x$�A���x�e�4�m���T���v��'�m��<S=�������h���[RZZ���h/�� ��E!J"��+|���
�3g��/�zRx�����)X��d��5B������~c����G28�x���Tyb����y�>�I����d>�0����B�:b����7n�X
�L����d>x����|�z�0l
��b�o��m�#*����w7��E!J"�����/�P��^�c������*����wY�x��x$����j�8*���Y�*3]�2DUd���~��6���;�lP��o�^����=�z���m��8 ���j<�}����m;��#��s�g�z�'��c:���������(��bO�whG��(��P�<��Thfx� ����k���d6G�Q�~���NvP�`�UTrz?~c���Iw<�����w�w�<�����W�4���T���#��K����d>x�o���j�g�o����0s���w ��(��P ��y���=iz?��!t��9� Kw<�������P�94��h�&��H��,Oj�vR�G����^�5���T���m ����Fp
-
l���Q�B!�BHA�G!�B!YE!�B!�dy�B!��EP�B!�BHA�G!�B!YE!�B!�dy�B!��EP�B!�BHA�G!�B!YE!�B!�dy�B!��EP�B!�BHA�G!�B!YE!�B!�dy�B!��EP�B!�BHA�G!�B!YE!�B!�dy�B!��EP�B!�BHA�G!�B!YE!�B!�dy�B!��EP�B!�BHA�G!�B!YE!�B!�dy�B!��EP�B!�BHA�G!�B!YE!�B!�dy�BH�r��������������B�m(�!��#K�.]d����-"��o�x@�%���P�B!F~~�4h������g�������K!$���#��u��n%�t�k�`[�8�@@m�����r+^II��I��OS�u5���;{��<���2z�h�::�k']�	t\?�=B�[(�!�d
�:uRCu���F��6g+��8�|y����xWW�^��M��R�ne����z� ���5�}�x+V���# >��8�B�y�B����Ry�������O��O�O��O�_�?a�%F�����/��0-~���/����B���{O	����w]���_��t��Y��`�x�|��]�v��������?��?��_~Y


�y���M�6�k�s!���8�B�y�B���6m����Q���m���J�c;�-~z�!9w����x��<���r��i�-�xZ$}��g�uA�>}��?��?~\���=���9'�)��}���wW�{�.]����s�7���u!���E!���B�?��?���D$������?3g�T�k��O~�9u���?�x�A���o�H:��Z�c�����;F�,������xk����#X��	<��������,m��FA,����nd���fe`��}f���IR�xo2?~�����%��{�ggg%f����c��� �C��	.����9�U�~������j<�RW�;�Y�������7�}�n���Ny���9���zx/5,!�n$�|��Y5�g�_�OK]<::������677�j����s-{h
y��g�u�5�����{_X B�8==-����M��������h�������\����Z��=88(��S���z7�e5��G<c�|��g>���6���N����~���*6�`1yt#'O�nY��������R���I��i5yM��DK������u����+���fXZZ.//'�>�C
y����Y��c�	p���������������X
�,!�n$�������PjK��j9�fuuu�����3�_��g������_N�����=�voo��'d===5���}&���M|W��#�50��<���%����h���k���~_�%�%��3o�����/
������z-s����}F�m{{��S@�?�<��y�:"�tD�����!�#B@G�<��y�:"�tD�����!�#B@G�<�n�Oy�o����IEND�B`�
#21 Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#17)
Re: Improving connection scalability: GetSnapshotData()

Hi,

These benchmarks are on my workstation. The larger VM I used in the last
round wasn't currently available.

HW:
2 x Intel(R) Xeon(R) Gold 5215 (each 10 cores / 20 threads)
192GB Ram.
data directory is on a Samsung SSD 970 PRO 1TB

A bunch of terminals, emacs, mutt are open while the benchmark is
running. No browser.

Unless mentioned otherwise, relevant configuration options are:
max_connections=1200
shared_buffers=8GB
max_prepared_transactions=1000
synchronous_commit=local
huge_pages=on
fsync=off # to make it more likely to see scalability bottlenecks

Independent of the effects of this patch (i.e. including master) I had a
fairly hard time getting reproducible numbers for *low* client cases. I
found the numbers to be more reproducible if I pinned server/pgbench
onto the same core :(. I chose to do that for the -c1 cases, to
benchmark the optimal behaviour, as that seemed to have the biggest
potential for regressions.

All numbers are best of three. Each test starts in a freshly created
cluster.

On 2020-03-30 17:04:00 +0300, Alexander Korotkov wrote:

The following pgbench scripts come to mind first:
1) SELECT txid_current(); (artificial but good for checking corner case)

-M prepared -T 180
(did a few longer runs, but doesn't seem to matter much)

clients  tps master  tps pgxact
1             46118       46027
16           377357      440233
40           373304      410142
198          103912      105579

btw, there's some pretty horrible cacheline bouncing in txid_current()
because backends first call ReadNextFullTransactionId() (acquires XidGenLock
in shared mode, reads ShmemVariableCache->nextFullXid), then separately
call GetNewTransactionId() (acquires XidGenLock exclusively, reads &
writes nextFullXid).

With fsync=off (and also with synchronous_commit=off) the numbers
are, at lower client counts, severely depressed and variable due to
walwriter going completely nuts (using more CPU than the backend doing
the queries). Because WAL writes are so fast on my storage, individual
XLogBackgroundFlush() calls are very quick. This leads to a *lot* of
kill()s from the backend, from within XLogSetAsyncXactLSN(). There's got
to be a bug here. But unrelated.

2) Single insert statement (as example of very short transaction)

CREATE TABLE testinsert(c1 int not null, c2 int not null, c3 int not null, c4 int not null);
INSERT INTO testinsert VALUES(1, 2, 3, 4);

-M prepared -T 360

fsync on:
clients  tps master  tps pgxact
1               653         658
16             5687        5668
40            14212       14229
198           60483       62420

fsync off:
clients  tps master  tps pgxact
1             59356       59891
16           290626      299991
40           348210      355669
198          289182      291529

clients  tps master  tps pgxact
1024          47586       52135

-M simple
fsync off:
clients  tps master  tps pgxact
40           289077      326699
198          286011      299928

3) Plain pgbench read-write (you already did it for sure)

-s 100 -M prepared -T 700

autovacuum=off, fsync on:
clients  tps master  tps pgxact
1               474         479
16             4356        4476
40             8591        9309
198           20045       20261
1024          17986       18545

autovacuum=off, fsync off:
clients  tps master  tps pgxact
1              7828        7719
16            49069       50482
40            68241       73081
198           73464       77801
1024          25621       28410

I chose autovacuum off because otherwise the results vary much more
widely, and autovacuum isn't really needed for the workload.

4) pgbench read-write script with increased amount of SELECTs. Repeat
select from pgbench_accounts say 10 times with different aids.

I did intersperse all server-side statements in the script with two
selects of other pgbench_account rows each.

-s 100 -M prepared -T 700
autovacuum=off, fsync on:
clients  tps master  tps pgxact
1               365         367
198           20065       21391

-s 1000 -M prepared -T 700
autovacuum=off, fsync on:
clients  tps master  tps pgxact
16             2757        2880
40             4734        4996
198           16950       19998
1024          22423       24935

5) 10% pgbench read-write, 90% of pgbench read-only

-s 100 -M prepared -T 100 -bselect-only@9 -btpcb-like@1

autovacuum=off, fsync on:
clients  tps master  tps pgxact
16            37289       38656
40            81284       81260
198          189002      189357
1024         143986      164762

That definitely needs to be measured, due to the locking changes around procarrayadd/remove.

I don't think regressions besides perhaps 2pc are likely - there's nothing really getting more expensive but procarray add/remove.

I agree that ProcArrayAdd()/Remove() should be first subject of
investigation, but other cases should be checked as well IMHO.

I'm not sure I really see the point. If simple prepared tx doesn't show
up as a negative difference, a more complex one won't either, since the
ProcArrayAdd()/Remove() related bottlenecks will play a smaller and
smaller role.

Regarding 2pc, the following scenarios come to my mind:
1) pgbench read-write modified so that every transaction is prepared
first, then commit prepared.

The numbers here are -M simple, because I wanted to use
PREPARE TRANSACTION 'ptx_:client_id';
COMMIT PREPARED 'ptx_:client_id';

-s 100 -M prepared -T 700 -f ~/tmp/pgbench-write-2pc.sql
autovacuum=off, fsync on:
clients  tps master  tps pgxact
1               251         249
16             2134        2174
40             3984        4089
198            6677        7522
1024           3641        3617

2) 10% of 2pc pgbench read-write, 90% normal pgbench read-write

-s 100 -M prepared -T 100 -f ~/tmp/pgbench-write-2pc.sql@1 -btpcb-like@9

clients  tps master  tps pgxact
198           18625       18906

3) 10% of 2pc pgbench read-write, 90% normal pgbench read-only

-s 100 -M prepared -T 100 -f ~/tmp/pgbench-write-2pc.sql@1 -bselect-only@9

clients  tps master  tps pgxact
198           84817       84350

I also benchmarked connection overhead, by using pgbench with -C
executing SELECT 1.

-T 10
clients  tps master  tps pgxact
1               572         587
16             2109        2140
40             2127        2136
198            2097        2129
1024           2101        2118

These numbers seem pretty decent to me. The regressions seem mostly
within noise. The one possible exception to that is plain pgbench
read/write with fsync=off and only a single session. I'll run more
benchmarks around that tomorrow (but now it's 6am :().
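
For reference, the size of that possible exception can be read off the
tables above as a relative change. A minimal sketch of the arithmetic
(pct_change is a name made up for this illustration; it is not part of
pgbench or of the patch):

```c
#include <assert.h>

/*
 * Relative change, in percent, going from the master tps to the patched
 * tps. Hypothetical helper, for illustration only.
 */
static double
pct_change(double tps_master, double tps_patched)
{
    /* a zero or negative baseline would make the ratio meaningless */
    assert(tps_master > 0.0);
    return (tps_patched - tps_master) / tps_master * 100.0;
}
```

With the numbers above, pct_change(7828, 7719) comes out at roughly
-1.4% (single client, fsync off), whereas pct_change(143986, 164762) is
roughly +14.4% (1024 clients, 90% read-only mix).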

Greetings,

Andres Freund

#22 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#21)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-06 06:39:59 -0700, Andres Freund wrote:

These benchmarks are on my workstation. The larger VM I used in the last
round wasn't currently available.

One way to reproduce the problem at smaller connection counts / smaller
machines is to take more snapshots. Doesn't fully reproduce the problem,
because resetting ->xmin without xact overhead is part of the problem,
but it's helpful.

I use a volatile function that loops over a trivial statement. There's
probably an easier / more extreme way to reproduce the problem. But it's
good enough.

-- setup
CREATE OR REPLACE FUNCTION snapme(p_ret int, p_loop int) RETURNS int VOLATILE LANGUAGE plpgsql AS $$
BEGIN
    FOR x IN 1..p_loop LOOP
        EXECUTE 'SELECT 1';
    END LOOP;
    RETURN p_ret;
END;
$$;
-- statement executed in parallel
SELECT snapme(17, 10000);

before (all above 1.5%):
+   37.82%  postgres  postgres          [.] GetSnapshotData
+    6.26%  postgres  postgres          [.] AllocSetAlloc
+    3.77%  postgres  postgres          [.] base_yyparse
+    3.04%  postgres  postgres          [.] core_yylex
+    1.94%  postgres  postgres          [.] grouping_planner
+    1.83%  postgres  libc-2.30.so      [.] __strncpy_avx2
+    1.80%  postgres  postgres          [.] palloc
+    1.73%  postgres  libc-2.30.so      [.] __memset_avx2_unaligned_erms
after:
+    5.75%  postgres  postgres          [.] base_yyparse
+    4.37%  postgres  postgres          [.] palloc
+    4.29%  postgres  postgres          [.] AllocSetAlloc
+    3.75%  postgres  postgres          [.] expression_tree_walker.part.0
+    3.14%  postgres  postgres          [.] core_yylex
+    2.51%  postgres  postgres          [.] subquery_planner
+    2.48%  postgres  postgres          [.] CheckExprStillValid
+    2.45%  postgres  postgres          [.] check_stack_depth
+    2.42%  postgres  plpgsql.so        [.] exec_stmt
+    1.92%  postgres  libc-2.30.so      [.] __memset_avx2_unaligned_erms
+    1.91%  postgres  postgres          [.] query_tree_walker
+    1.88%  postgres  libc-2.30.so      [.] __GI_____strtoll_l_internal
+    1.86%  postgres  postgres          [.] _SPI_execute_plan
+    1.85%  postgres  postgres          [.] assign_query_collations_walker
+    1.84%  postgres  postgres          [.] remove_useless_results_recurse
+    1.83%  postgres  postgres          [.] grouping_planner
+    1.50%  postgres  postgres          [.] set_plan_refs

If I change the workload to be
BEGIN;
SELECT txid_current();
SELECT snapme(17, 1000);
COMMIT;

the difference reduces (because GetSnapshotData() only needs to look at
procs with xids, and xids are assigned for much longer), but still is
significant:

before (all above 1.5%):
+   35.89%  postgres  postgres            [.] GetSnapshotData
+    7.94%  postgres  postgres            [.] AllocSetAlloc
+    4.42%  postgres  postgres            [.] base_yyparse
+    3.62%  postgres  libc-2.30.so        [.] __memset_avx2_unaligned_erms
+    2.87%  postgres  postgres            [.] LWLockAcquire
+    2.76%  postgres  postgres            [.] core_yylex
+    2.30%  postgres  postgres            [.] expression_tree_walker.part.0
+    1.81%  postgres  postgres            [.] MemoryContextAllocZeroAligned
+    1.80%  postgres  postgres            [.] transformStmt
+    1.66%  postgres  postgres            [.] grouping_planner
+    1.64%  postgres  postgres            [.] subquery_planner
after:
+   24.59%  postgres  postgres          [.] GetSnapshotData
+    4.89%  postgres  postgres          [.] base_yyparse
+    4.59%  postgres  postgres          [.] AllocSetAlloc
+    3.00%  postgres  postgres          [.] LWLockAcquire
+    2.76%  postgres  postgres          [.] palloc
+    2.27%  postgres  postgres          [.] MemoryContextAllocZeroAligned
+    2.26%  postgres  postgres          [.] check_stack_depth
+    1.77%  postgres  postgres          [.] core_yylex
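
Why the gap narrows in the second workload can be illustrated with a toy
model: if the snapshot code walks a dense array of per-backend xids,
backends without an assigned xid are skipped almost for free, whereas
every backend holding an xid has to be recorded. The sketch below is a
simplification under stated assumptions (plain uint32_t xids, no
wraparound handling, made-up names); it is not the code in the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_XID 0u          /* assumption: 0 means "no xid assigned" */

/*
 * Copy the xids of all backends that currently have one, and that are
 * below xmax, into out[]. Returns the number of xids copied.
 * out[] must have room for nprocs entries.
 */
static size_t
collect_running_xids(const uint32_t *xids, size_t nprocs,
                     uint32_t xmax, uint32_t *out)
{
    size_t n = 0;

    for (size_t i = 0; i < nprocs; i++)
    {
        uint32_t xid = xids[i];

        if (xid == INVALID_XID)
            continue;           /* backend has no xid: nothing to record */
        if (xid >= xmax)
            continue;           /* started after the snapshot's upper bound
                                 * (plain comparison, ignores wraparound) */
        out[n++] = xid;
    }
    return n;
}
```

In the first workload hardly any backend holds an xid while snapme()
runs, so the loop mostly skips; with BEGIN; SELECT txid_current(); ...
most entries stay set for the whole transaction and have to be copied.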

Greetings,

Andres Freund

#23 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#21)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-06 06:39:59 -0700, Andres Freund wrote:

3) Plain pgbench read-write (you already did it for sure)

-s 100 -M prepared -T 700

autovacuum=off, fsync on:
clients  tps master  tps pgxact
1               474         479
16             4356        4476
40             8591        9309
198           20045       20261
1024          17986       18545

autovacuum=off, fsync off:
clients  tps master  tps pgxact
1              7828        7719
16            49069       50482
40            68241       73081
198           73464       77801
1024          25621       28410

I chose autovacuum off because otherwise the results vary much more
widely, and autovacuum isn't really needed for the workload.

These numbers seem pretty decent to me. The regressions seem mostly
within noise. The one possible exception to that is plain pgbench
read/write with fsync=off and only a single session. I'll run more
benchmarks around that tomorrow (but now it's 6am :().

The "one possible exception" turned out to be a "real" regression, but
one that was dead easy to fix: It was a DEBUG1 elog I had left in. The
overhead seems to have been solely the increased code size + overhead of
errstart(). After that there's no difference in the single client case
anymore (I'd not expect a benefit).

Greetings,

Andres Freund

#24 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#23)
13 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

Hi,

SEE BELOW: What, and what not, to do for v13.

Attached is a substantially polished version of my patches. Note that
the first three patches, as well as the last, are not intended to be
committed at this time / in this form - they're there to make testing
easier.

There is a lot of polish, but also a few substantial changes:

- To be compatible with old_snapshot_threshold I've revised the way
  heap_page_prune_opt() deals with old_snapshot_threshold. Now
  old_snapshot_threshold is only applied when we otherwise would have
  been unable to prune (both at the time of the pd_prune_xid check, and
  on individual tuples). This makes old_snapshot_threshold considerably
  cheaper and causes fewer conflicts.

  This required adding a version of HeapTupleSatisfiesVacuum that
  returns the horizon, rather than doing the horizon test itself; that
  way we can first test a tuple's horizon against the normal approximate
  threshold (making it an accurate threshold if needed) and only if that
  fails fall back to old_snapshot_threshold.

  The main reason here was not to improve old_snapshot_threshold, but to
  avoid a regression when it's being used. Because we need a horizon to
  pass to old_snapshot_threshold, we'd have to fall back to computing an
  accurate horizon too often.

- Previous versions of the patch had a TODO about computing horizons not
  just for one of shared / catalog / data tables, but all of them at
  once, to avoid potentially needing to determine xmin horizons multiple
  times within one transaction. For that I've renamed GetOldestXmin()
  to ComputeTransactionHorizons() and added wrapper functions instead of
  the different flag combinations we previously had for GetOldestXmin().

  This allows us to get rid of the PROCARRAY_* flags, and PROC_RESERVED.

- To address Thomas' review comment about not accessing nextFullXid
  without XidGenLock, I made latestCompletedXid a FullTransactionId (a
  fxid is needed to be able to infer 64bit xids for the horizons -
  otherwise there is some danger they could wrap).

- Improving the comment around the snapshot caching, I decided that the
  justification for correctness around not taking ProcArrayLock is too
  complicated (in particular around setting MyProc->xmin). While
  avoiding ProcArrayLock altogether is a substantial gain, the caching
  itself helps a lot already. Seems best to leave that for a later step.

  This means that the numbers for the very high connection counts aren't
  quite as good.

- Plenty of small changes to address issues I found while
  benchmarking. The only one of real note is that I had released
  XidGenLock after ProcArrayLock in ProcArrayAdd/Remove. For 2pc that
  causes noticeable unnecessary contention, because we'll wait for
  XidGenLock while holding ProcArrayLock...
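
As an aside on the 64bit inference mentioned in the third item: given a
64bit reference value, a 32bit xid can be widened by assuming it lies
within 2^31 transactions of that reference. A minimal sketch of the
idea, with made-up names and ignoring the special (invalid / bootstrap /
frozen) xids; this is an illustration, not the function from the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widen a 32-bit xid to 64 bits relative to a 64-bit reference value.
 * Assumes xid is less than 2^31 transactions away from rel, in either
 * direction. Hypothetical helper, for illustration only.
 */
static uint64_t
widen_xid(uint64_t rel, uint32_t xid)
{
    uint32_t rel_low = (uint32_t) rel;  /* low 32 bits of the reference */

    /*
     * Unsigned subtraction wraps modulo 2^32; reinterpreting the result
     * as signed gives the distance from the reference, negative if xid
     * is older. (The unsigned-to-signed conversion is implementation
     * defined in C, but behaves as two's complement on common compilers.)
     */
    int32_t delta = (int32_t) (xid - rel_low);

    return rel + (uint64_t) (int64_t) delta;
}
```

For example, with rel = 0x100000005, an xid of 0xFFFFFFF0 widens to
0xFFFFFFF0 (just before the 32bit wraparound), while with rel =
0xFFFFFFF0 an xid of 5 widens to 0x100000005 (just after it).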

I think this is pretty close to being committable.

But: This patch came in very late for v13, and it took me much longer to
polish it up than I had hoped (partially distraction due to various bugs
I found (in particular snapshot_too_old), partially covid19, partially
"hell if I know"). The patchset touches core parts of the system. While
both Thomas and David have done some review, they haven't for the latest
version (mea culpa).

In many other instances I would say that the above suggests slipping to
v14, given the timing.

The main reason I am considering pushing is that I think this patchset
addresses one of the most common critiques of postgres, as well as very
common, hard to fix, real-world production issues. GetSnapshotData() has
been a major bottleneck for about as long as I have been using postgres,
and this addresses that to a significant degree.

A second reason I am considering it is that, in my opinion, the changes
are not all that complicated and not even that large. At least not for a
change to a problem that we've long tried to improve.

Obviously we all have a tendency to think our own work is important, and
that we deserve a bit more leeway than others. So take the above with a
grain of salt.

Comments?

Greetings,

Andres Freund

Attachments:

v1-0001-TMP-work-around-missing-snapshot-registrations.patch (text/x-diff; charset=us-ascii)
From a22780e3544d7c83b5ad8851de240c3d5ef8f221 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 29 Feb 2020 18:07:25 -0800
Subject: [PATCH v1 1/2] TMP: work around missing snapshot registrations.

This is just what's hit by the tests. It's not an actual fix.
---
 src/backend/catalog/namespace.c             |  7 +++++++
 src/backend/catalog/pg_subscription.c       |  4 ++++
 src/backend/commands/indexcmds.c            |  9 +++++++++
 src/backend/commands/tablecmds.c            |  8 ++++++++
 src/backend/replication/logical/tablesync.c | 12 ++++++++++++
 src/backend/replication/logical/worker.c    |  4 ++++
 src/backend/utils/time/snapmgr.c            |  4 ++++
 7 files changed, 48 insertions(+)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 2ec23016fe5..e4696d8d417 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -55,6 +55,7 @@
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/varlena.h"
 
@@ -4244,12 +4245,18 @@ RemoveTempRelationsCallback(int code, Datum arg)
 {
 	if (OidIsValid(myTempNamespace))	/* should always be true */
 	{
+		Snapshot snap;
+
 		/* Need to ensure we have a usable transaction. */
 		AbortOutOfAnyTransaction();
 		StartTransactionCommand();
 
+		/* ensure xmin stays set */
+		snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
+
 		RemoveTempRelations(myTempNamespace);
 
+		UnregisterSnapshot(snap);
 		CommitTransactionCommand();
 	}
 }
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb157311154..4a324dfb4f1 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -31,6 +31,7 @@
 #include "utils/fmgroids.h"
 #include "utils/pg_lsn.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
 static List *textarray_to_stringlist(ArrayType *textarray);
@@ -286,6 +287,7 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 	bool		nulls[Natts_pg_subscription_rel];
 	Datum		values[Natts_pg_subscription_rel];
 	bool		replaces[Natts_pg_subscription_rel];
+	Snapshot snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
 
 	LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
 
@@ -321,6 +323,8 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 
 	/* Cleanup. */
 	table_close(rel, NoLock);
+
+	UnregisterSnapshot(snap);
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4e8263af4be..b5fe3649a22 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2720,6 +2720,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	char	   *relationName = NULL;
 	char	   *relationNamespace = NULL;
 	PGRUsage	ru0;
+	Snapshot	snap;
 
 	/*
 	 * Create a memory context that will survive forced transaction commits we
@@ -3189,6 +3190,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	 */
 
 	StartTransactionCommand();
+	snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
 
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
@@ -3237,8 +3239,11 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	}
 
 	/* Commit this transaction and make index swaps visible */
+	UnregisterSnapshot(snap);
 	CommitTransactionCommand();
+
 	StartTransactionCommand();
+	snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
 
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
@@ -3269,7 +3274,9 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	}
 
 	/* Commit this transaction to make the updates visible. */
+	UnregisterSnapshot(snap);
 	CommitTransactionCommand();
+
 	StartTransactionCommand();
 
 	/*
@@ -3283,6 +3290,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
+	snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
 
 	{
 		ObjectAddresses *objects = new_object_addresses();
@@ -3308,6 +3316,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	}
 
 	PopActiveSnapshot();
+	UnregisterSnapshot(snap);
 	CommitTransactionCommand();
 
 	/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8e35c5bd1a2..311a950297a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15148,6 +15148,7 @@ PreCommit_on_commit_actions(void)
 	ListCell   *l;
 	List	   *oids_to_truncate = NIL;
 	List	   *oids_to_drop = NIL;
+	Snapshot	snap;
 
 	foreach(l, on_commits)
 	{
@@ -15179,6 +15180,11 @@ PreCommit_on_commit_actions(void)
 		}
 	}
 
+	if (oids_to_truncate == NIL && oids_to_drop == NIL)
+		return;
+
+	snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
+
 	/*
 	 * Truncate relations before dropping so that all dependencies between
 	 * relations are removed after they are worked on.  Doing it like this
@@ -15232,6 +15238,8 @@ PreCommit_on_commit_actions(void)
 		}
 #endif
 	}
+
+	UnregisterSnapshot(snap);
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index a60c6661538..5bdb15b1d50 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -864,6 +864,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 			{
 				Relation	rel;
 				WalRcvExecResult *res;
+				Snapshot	snap;
 
 				SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 				MyLogicalRepWorker->relstate = SUBREL_STATE_DATASYNC;
@@ -872,10 +873,14 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 
 				/* Update the state and make it visible to others. */
 				StartTransactionCommand();
+				snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
+
 				UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
 										   MyLogicalRepWorker->relid,
 										   MyLogicalRepWorker->relstate,
 										   MyLogicalRepWorker->relstate_lsn);
+
+				UnregisterSnapshot(snap);
 				CommitTransactionCommand();
 				pgstat_report_stat(false);
 
@@ -919,6 +924,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 								   CRS_USE_SNAPSHOT, origin_startpos);
 
 				PushActiveSnapshot(GetTransactionSnapshot());
+				snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
 				copy_table(rel);
 				PopActiveSnapshot();
 
@@ -934,6 +940,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 				/* Make the copy visible. */
 				CommandCounterIncrement();
 
+				UnregisterSnapshot(snap);
+
 				/*
 				 * We are done with the initial data synchronization, update
 				 * the state.
@@ -958,6 +966,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 				 */
 				if (*origin_startpos >= MyLogicalRepWorker->relstate_lsn)
 				{
+					snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
+
 					/*
 					 * Update the new state in catalog.  No need to bother
 					 * with the shmem state as we are exiting for good.
@@ -966,6 +976,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 											   MyLogicalRepWorker->relid,
 											   SUBREL_STATE_SYNCDONE,
 											   *origin_startpos);
+					UnregisterSnapshot(snap);
+
 					finish_sync_worker();
 				}
 				break;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fa3811783f6..f60b1581abf 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -989,6 +989,9 @@ apply_handle_truncate(StringInfo s)
 
 	ensure_transaction();
 
+	/* catalog modifications need a set snapshot */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
 
 	foreach(lc, remote_relids)
@@ -1029,6 +1032,7 @@ apply_handle_truncate(StringInfo s)
 	}
 
 	CommandCounterIncrement();
+	PopActiveSnapshot();
 }
 
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592ce..b5cff157bf6 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -441,6 +441,8 @@ GetOldestSnapshot(void)
 Snapshot
 GetCatalogSnapshot(Oid relid)
 {
+	Assert(IsTransactionState());
+
 	/*
 	 * Return historic snapshot while we're doing logical decoding, so we can
 	 * see the appropriate state of the catalog.
@@ -1017,6 +1019,8 @@ SnapshotResetXmin(void)
 	if (pairingheap_is_empty(&RegisteredSnapshots))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
+		TransactionXmin = InvalidTransactionId;
+		RecentXmin = InvalidTransactionId;
 		return;
 	}
 
-- 
2.25.0.114.g5b0ca878e0

v1-0002-Improve-and-extend-asserts-for-a-snapshot-being-s.patch (text/x-diff; charset=us-ascii)
From 4e213f29851ceee73f9b20846df84f6f9662714c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 29 Feb 2020 19:33:21 -0800
Subject: [PATCH v1 2/2] Improve and extend asserts for a snapshot being set.

---
 src/include/utils/snapmgr.h        |  2 ++
 src/backend/access/heap/heapam.c   |  6 ++++--
 src/backend/access/index/indexam.c |  8 +++++++-
 src/backend/catalog/indexing.c     | 11 +++++++++++
 src/backend/utils/time/snapmgr.c   | 19 +++++++++++++++++++
 contrib/amcheck/verify_nbtree.c    |  4 ++--
 6 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce841..7738d6a8e01 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -116,6 +116,8 @@ extern void PopActiveSnapshot(void);
 extern Snapshot GetActiveSnapshot(void);
 extern bool ActiveSnapshotSet(void);
 
+extern bool SnapshotSet(void);
+
 extern Snapshot RegisterSnapshot(Snapshot snapshot);
 extern void UnregisterSnapshot(Snapshot snapshot);
 extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 29694b8aa4a..912cadeb03a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1137,6 +1137,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 {
 	HeapScanDesc scan;
 
+	Assert(SnapshotSet());
+
 	/*
 	 * increment relation ref count while scanning relation
 	 *
@@ -1546,7 +1548,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	at_chain_start = first_call;
 	skip = !first_call;
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(SnapshotSet());
 	Assert(BufferGetBlockNumber(buffer) == blkno);
 
 	/* Scan through possible multiple members of HOT-chain */
@@ -5629,7 +5631,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 	 * RecentGlobalXmin.  That's not pretty, but it doesn't seem worth
 	 * inventing a nicer API for this.
 	 */
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(SnapshotSet());
 	PageSetPrunable(page, RecentGlobalXmin);
 
 	/* store transaction information of xact deleting the tuple */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index a5210d0b342..558b490d079 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -178,6 +178,8 @@ index_insert(Relation indexRelation,
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(aminsert);
 
+	Assert(SnapshotSet());
+
 	if (!(indexRelation->rd_indam->ampredlocks))
 		CheckForSerializableConflictIn(indexRelation,
 									   (ItemPointer) NULL,
@@ -250,6 +252,8 @@ index_beginscan_internal(Relation indexRelation,
 {
 	IndexScanDesc scan;
 
+	Assert(SnapshotSet());
+
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(ambeginscan);
 
@@ -513,7 +517,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amgettuple);
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(SnapshotSet());
 
 	/*
 	 * The AM's amgettuple proc finds the next index entry matching the scan
@@ -568,6 +572,8 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
 	bool		all_dead = false;
 	bool		found;
 
+	Assert(SnapshotSet());
+
 	found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
 									scan->xs_snapshot, slot,
 									&scan->xs_heap_continue, &all_dead);
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index d63fcf58cf1..8ba6b3dfa5e 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -22,6 +22,7 @@
 #include "catalog/indexing.h"
 #include "executor/executor.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -184,6 +185,8 @@ CatalogTupleInsert(Relation heapRel, HeapTuple tup)
 {
 	CatalogIndexState indstate;
 
+	Assert(SnapshotSet());
+
 	indstate = CatalogOpenIndexes(heapRel);
 
 	simple_heap_insert(heapRel, tup);
@@ -204,6 +207,8 @@ void
 CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
 						   CatalogIndexState indstate)
 {
+	Assert(SnapshotSet());
+
 	simple_heap_insert(heapRel, tup);
 
 	CatalogIndexInsert(indstate, tup);
@@ -225,6 +230,8 @@ CatalogTupleUpdate(Relation heapRel, ItemPointer otid, HeapTuple tup)
 {
 	CatalogIndexState indstate;
 
+	Assert(SnapshotSet());
+
 	indstate = CatalogOpenIndexes(heapRel);
 
 	simple_heap_update(heapRel, otid, tup);
@@ -245,6 +252,8 @@ void
 CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid, HeapTuple tup,
 						   CatalogIndexState indstate)
 {
+	Assert(SnapshotSet());
+
 	simple_heap_update(heapRel, otid, tup);
 
 	CatalogIndexInsert(indstate, tup);
@@ -268,5 +277,7 @@ CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid, HeapTuple tup,
 void
 CatalogTupleDelete(Relation heapRel, ItemPointer tid)
 {
+	Assert(SnapshotSet());
+
 	simple_heap_delete(heapRel, tid);
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index b5cff157bf6..f7e1665aae6 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -857,6 +857,25 @@ ActiveSnapshotSet(void)
 	return ActiveSnapshot != NULL;
 }
 
+/*
+ * Does this transaction have a snapshot.
+ */
+bool
+SnapshotSet(void)
+{
+	/* can't be safe, because somehow xmin is not set */
+	if (!TransactionIdIsValid(MyPgXact->xmin) && HistoricSnapshot == NULL)
+		return false;
+
+	/*
+	 * Can't be safe because no snapshot being registered means invalidation
+	 * processing can change xmin horizon.
+	 */
+	return ActiveSnapshot != NULL ||
+		!pairingheap_is_empty(&RegisteredSnapshots) ||
+		HistoricSnapshot != NULL;
+}
+
 /*
  * RegisterSnapshot
  *		Register a snapshot as being in use by the current resource owner
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ceaaa271680..50a46dca933 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -412,10 +412,10 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	Snapshot	snapshot = SnapshotAny;
 
 	/*
-	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
 	 * RecentGlobalXmin/B-Tree page deletion.
+	 * This assertion matches the one in index_getnext_tid().  See note on
 	 */
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(SnapshotSet());
 
 	/*
 	 * Initialize state for entire verification operation
-- 
2.25.0.114.g5b0ca878e0

v7-0001-TMP-work-around-missing-snapshot-registrations.patch (text/x-diff; charset=us-ascii)
From 3b58990c088936122f38d855a5a3900602deacf7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 01/11] TMP: work around missing snapshot registrations.

This is just what's hit by the tests. It's not an actual fix.
---
 src/backend/catalog/namespace.c             |  7 +++++++
 src/backend/catalog/pg_subscription.c       |  4 ++++
 src/backend/commands/indexcmds.c            |  9 +++++++++
 src/backend/commands/tablecmds.c            |  8 ++++++++
 src/backend/replication/logical/tablesync.c | 12 ++++++++++++
 src/backend/replication/logical/worker.c    |  4 ++++
 src/backend/utils/time/snapmgr.c            |  4 ++++
 7 files changed, 48 insertions(+)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 2ec23016fe5..e4696d8d417 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -55,6 +55,7 @@
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/varlena.h"
 
@@ -4244,12 +4245,18 @@ RemoveTempRelationsCallback(int code, Datum arg)
 {
 	if (OidIsValid(myTempNamespace))	/* should always be true */
 	{
+		Snapshot snap;
+
 		/* Need to ensure we have a usable transaction. */
 		AbortOutOfAnyTransaction();
 		StartTransactionCommand();
 
+		/* ensure xmin stays set */
+		snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
+
 		RemoveTempRelations(myTempNamespace);
 
+		UnregisterSnapshot(snap);
 		CommitTransactionCommand();
 	}
 }
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index cb157311154..4a324dfb4f1 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -31,6 +31,7 @@
 #include "utils/fmgroids.h"
 #include "utils/pg_lsn.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
 static List *textarray_to_stringlist(ArrayType *textarray);
@@ -286,6 +287,7 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 	bool		nulls[Natts_pg_subscription_rel];
 	Datum		values[Natts_pg_subscription_rel];
 	bool		replaces[Natts_pg_subscription_rel];
+	Snapshot snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
 
 	LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
 
@@ -321,6 +323,8 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 
 	/* Cleanup. */
 	table_close(rel, NoLock);
+
+	UnregisterSnapshot(snap);
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2baca12c5f4..094bf6139f0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2837,6 +2837,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	char	   *relationName = NULL;
 	char	   *relationNamespace = NULL;
 	PGRUsage	ru0;
+	Snapshot	snap;
 
 	/*
 	 * Create a memory context that will survive forced transaction commits we
@@ -3306,6 +3307,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	 */
 
 	StartTransactionCommand();
+	snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
 
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
@@ -3354,8 +3356,11 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	}
 
 	/* Commit this transaction and make index swaps visible */
+	UnregisterSnapshot(snap);
 	CommitTransactionCommand();
+
 	StartTransactionCommand();
+	snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
 
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
@@ -3386,7 +3391,9 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	}
 
 	/* Commit this transaction to make the updates visible. */
+	UnregisterSnapshot(snap);
 	CommitTransactionCommand();
+
 	StartTransactionCommand();
 
 	/*
@@ -3400,6 +3407,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
+	snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
 
 	{
 		ObjectAddresses *objects = new_object_addresses();
@@ -3425,6 +3433,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	}
 
 	PopActiveSnapshot();
+	UnregisterSnapshot(snap);
 	CommitTransactionCommand();
 
 	/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6162fb018c7..e1eacc6a4a6 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15200,6 +15200,7 @@ PreCommit_on_commit_actions(void)
 	ListCell   *l;
 	List	   *oids_to_truncate = NIL;
 	List	   *oids_to_drop = NIL;
+	Snapshot	snap;
 
 	foreach(l, on_commits)
 	{
@@ -15231,6 +15232,11 @@ PreCommit_on_commit_actions(void)
 		}
 	}
 
+	if (oids_to_truncate == NIL && oids_to_drop == NIL)
+		return;
+
+	snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
+
 	/*
 	 * Truncate relations before dropping so that all dependencies between
 	 * relations are removed after they are worked on.  Doing it like this
@@ -15284,6 +15290,8 @@ PreCommit_on_commit_actions(void)
 		}
 #endif
 	}
+
+	UnregisterSnapshot(snap);
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index c27d9705895..aec5a044790 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -863,6 +863,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 			{
 				Relation	rel;
 				WalRcvExecResult *res;
+				Snapshot	snap;
 
 				SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 				MyLogicalRepWorker->relstate = SUBREL_STATE_DATASYNC;
@@ -871,10 +872,14 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 
 				/* Update the state and make it visible to others. */
 				StartTransactionCommand();
+				snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
+
 				UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
 										   MyLogicalRepWorker->relid,
 										   MyLogicalRepWorker->relstate,
 										   MyLogicalRepWorker->relstate_lsn);
+
+				UnregisterSnapshot(snap);
 				CommitTransactionCommand();
 				pgstat_report_stat(false);
 
@@ -918,6 +923,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 								   CRS_USE_SNAPSHOT, origin_startpos);
 
 				PushActiveSnapshot(GetTransactionSnapshot());
+				snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
 				copy_table(rel);
 				PopActiveSnapshot();
 
@@ -933,6 +939,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 				/* Make the copy visible. */
 				CommandCounterIncrement();
 
+				UnregisterSnapshot(snap);
+
 				/*
 				 * We are done with the initial data synchronization, update
 				 * the state.
@@ -957,6 +965,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 				 */
 				if (*origin_startpos >= MyLogicalRepWorker->relstate_lsn)
 				{
+					snap = RegisterSnapshot(GetCatalogSnapshot(InvalidOid));
+
 					/*
 					 * Update the new state in catalog.  No need to bother
 					 * with the shmem state as we are exiting for good.
@@ -965,6 +975,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 											   MyLogicalRepWorker->relid,
 											   SUBREL_STATE_SYNCDONE,
 											   *origin_startpos);
+					UnregisterSnapshot(snap);
+
 					finish_sync_worker();
 				}
 				break;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a752a1224d6..f10f3f843d1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1245,6 +1245,9 @@ apply_handle_truncate(StringInfo s)
 
 	ensure_transaction();
 
+	/* catalog modifications need a set snapshot */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
 
 	foreach(lc, remote_relids)
@@ -1332,6 +1335,7 @@ apply_handle_truncate(StringInfo s)
 	}
 
 	CommandCounterIncrement();
+	PopActiveSnapshot();
 }
 
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592ce..b5cff157bf6 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -441,6 +441,8 @@ GetOldestSnapshot(void)
 Snapshot
 GetCatalogSnapshot(Oid relid)
 {
+	Assert(IsTransactionState());
+
 	/*
 	 * Return historic snapshot while we're doing logical decoding, so we can
 	 * see the appropriate state of the catalog.
@@ -1017,6 +1019,8 @@ SnapshotResetXmin(void)
 	if (pairingheap_is_empty(&RegisteredSnapshots))
 	{
 		MyPgXact->xmin = InvalidTransactionId;
+		TransactionXmin = InvalidTransactionId;
+		RecentXmin = InvalidTransactionId;
 		return;
 	}
 
-- 
2.25.0.114.g5b0ca878e0

v7-0002-Improve-and-extend-asserts-for-a-snapshot-being-s.patch (text/x-diff; charset=us-ascii)
From 076c589dff7e08f0a6b562b185f179da4fbfc13a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 02/11] Improve and extend asserts for a snapshot being set.

---
 src/include/utils/snapmgr.h        |  2 ++
 src/backend/access/heap/heapam.c   |  6 ++++--
 src/backend/access/index/indexam.c |  8 +++++++-
 src/backend/catalog/indexing.c     | 11 +++++++++++
 src/backend/utils/time/snapmgr.c   | 19 +++++++++++++++++++
 contrib/amcheck/verify_nbtree.c    |  6 +++---
 6 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce841..7738d6a8e01 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -116,6 +116,8 @@ extern void PopActiveSnapshot(void);
 extern Snapshot GetActiveSnapshot(void);
 extern bool ActiveSnapshotSet(void);
 
+extern bool SnapshotSet(void);
+
 extern Snapshot RegisterSnapshot(Snapshot snapshot);
 extern void UnregisterSnapshot(Snapshot snapshot);
 extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c4a5aa616a3..0af51880ccc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1136,6 +1136,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 {
 	HeapScanDesc scan;
 
+	Assert(SnapshotSet());
+
 	/*
 	 * increment relation ref count while scanning relation
 	 *
@@ -1545,7 +1547,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	at_chain_start = first_call;
 	skip = !first_call;
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(SnapshotSet());
 	Assert(BufferGetBlockNumber(buffer) == blkno);
 
 	/* Scan through possible multiple members of HOT-chain */
@@ -5633,7 +5635,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 	 * if so (vacuum can't subsequently move relfrozenxid to beyond
 	 * TransactionXmin, so there's no race here).
 	 */
-	Assert(TransactionIdIsValid(TransactionXmin));
+	Assert(SnapshotSet() && TransactionIdIsValid(TransactionXmin));
 	if (TransactionIdPrecedes(TransactionXmin, relation->rd_rel->relfrozenxid))
 		prune_xid = relation->rd_rel->relfrozenxid;
 	else
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index a3f77169a79..5d6354dedf5 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -184,6 +184,8 @@ index_insert(Relation indexRelation,
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(aminsert);
 
+	Assert(SnapshotSet());
+
 	if (!(indexRelation->rd_indam->ampredlocks))
 		CheckForSerializableConflictIn(indexRelation,
 									   (ItemPointer) NULL,
@@ -256,6 +258,8 @@ index_beginscan_internal(Relation indexRelation,
 {
 	IndexScanDesc scan;
 
+	Assert(SnapshotSet());
+
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(ambeginscan);
 
@@ -519,7 +523,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amgettuple);
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(SnapshotSet());
 
 	/*
 	 * The AM's amgettuple proc finds the next index entry matching the scan
@@ -574,6 +578,8 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
 	bool		all_dead = false;
 	bool		found;
 
+	Assert(SnapshotSet());
+
 	found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
 									scan->xs_snapshot, slot,
 									&scan->xs_heap_continue, &all_dead);
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index d63fcf58cf1..8ba6b3dfa5e 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -22,6 +22,7 @@
 #include "catalog/indexing.h"
 #include "executor/executor.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -184,6 +185,8 @@ CatalogTupleInsert(Relation heapRel, HeapTuple tup)
 {
 	CatalogIndexState indstate;
 
+	Assert(SnapshotSet());
+
 	indstate = CatalogOpenIndexes(heapRel);
 
 	simple_heap_insert(heapRel, tup);
@@ -204,6 +207,8 @@ void
 CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup,
 						   CatalogIndexState indstate)
 {
+	Assert(SnapshotSet());
+
 	simple_heap_insert(heapRel, tup);
 
 	CatalogIndexInsert(indstate, tup);
@@ -225,6 +230,8 @@ CatalogTupleUpdate(Relation heapRel, ItemPointer otid, HeapTuple tup)
 {
 	CatalogIndexState indstate;
 
+	Assert(SnapshotSet());
+
 	indstate = CatalogOpenIndexes(heapRel);
 
 	simple_heap_update(heapRel, otid, tup);
@@ -245,6 +252,8 @@ void
 CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid, HeapTuple tup,
 						   CatalogIndexState indstate)
 {
+	Assert(SnapshotSet());
+
 	simple_heap_update(heapRel, otid, tup);
 
 	CatalogIndexInsert(indstate, tup);
@@ -268,5 +277,7 @@ CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid, HeapTuple tup,
 void
 CatalogTupleDelete(Relation heapRel, ItemPointer tid)
 {
+	Assert(SnapshotSet());
+
 	simple_heap_delete(heapRel, tid);
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index b5cff157bf6..3b148ae30a6 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -857,6 +857,25 @@ ActiveSnapshotSet(void)
 	return ActiveSnapshot != NULL;
 }
 
+/*
+ * Does this transaction have a snapshot.
+ */
+bool
+SnapshotSet(void)
+{
+	/* can't be safe, because somehow xmin is not set */
+	if (!TransactionIdIsValid(MyPgXact->xmin) && HistoricSnapshot == NULL)
+		return false;
+
+	/*
+	 * Can't be safe because no snapshot being active/registered means that
+	 * e.g. invalidation processing could change xmin horizon.
+	 */
+	return ActiveSnapshot != NULL ||
+		!pairingheap_is_empty(&RegisteredSnapshots) ||
+		HistoricSnapshot != NULL;
+}
+
 /*
  * RegisterSnapshot
  *		Register a snapshot as being in use by the current resource owner
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ceaaa271680..8f43f3e9dfb 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -412,10 +412,10 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	Snapshot	snapshot = SnapshotAny;
 
 	/*
-	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
-	 * RecentGlobalXmin/B-Tree page deletion.
+	 * This assertion matches the one in index_getnext_tid().  See page
+	 * recycling/RecentGlobalXmin notes in nbtree README.
 	 */
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(SnapshotSet());
 
 	/*
 	 * Initialize state for entire verification operation
-- 
2.25.0.114.g5b0ca878e0

v7-0003-Fix-xlogreader-fd-leak-encountered-with-twophase-.patch (text/x-diff; charset=us-ascii)
From 89fd977a8a7cb90b9d85f6e9386507a2f7997604 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 03/11] Fix xlogreader fd leak encountered with twophase
 commit.

This perhaps is not the best fix, but it's better than the current
situation of failing after a few commits.

This issue appeared after 0dc8ead46, but only because before that
change fd leakage was limited to a single file descriptor.

Discussion: https://postgr.es/m/20200406025651.fpzdb5yyb7qyhqko@alap3.anarazel.de
---
 src/backend/access/transam/xlogreader.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f3fea5132fe..79ff976474c 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -136,6 +136,9 @@ XLogReaderFree(XLogReaderState *state)
 {
 	int			block_id;
 
+	if (state->seg.ws_file != -1)
+		close(state->seg.ws_file);
+
 	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
 	{
 		if (state->blocks[block_id].data)
-- 
2.25.0.114.g5b0ca878e0

v7-0004-Move-delayChkpt-from-PGXACT-to-PGPROC-it-s-rarely.patch (text/x-diff; charset=us-ascii)
From 255a20ef1df5230c692331b504f2886fd5064491 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 04/11] Move delayChkpt from PGXACT to PGPROC it's rarely
 checked & frequently modified.

The goal of PGXACT is to make foreign accesses faster. Having rarely
accessed & frequently modified fields in it reduces the cache hit ratio
for other CPU cores.
---
 src/include/storage/proc.h              |  4 ++--
 src/backend/access/transam/multixact.c  |  6 +++---
 src/backend/access/transam/twophase.c   | 10 +++++-----
 src/backend/access/transam/xact.c       |  4 ++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/storage/buffer/bufmgr.c     |  4 ++--
 src/backend/storage/ipc/procarray.c     | 14 ++++++--------
 src/backend/storage/lmgr/proc.c         |  4 ++--
 8 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index d21780108bb..ae4f573ab46 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -142,6 +142,8 @@ struct PGPROC
 	LOCKMASK	heldLocks;		/* bitmask for lock types already held on this
 								 * lock object by this backend */
 
+	bool		delayChkpt;		/* true if this proc delays checkpoint start */
+
 	/*
 	 * Info to allow us to wait for synchronous replication, if needed.
 	 * waitLSN is InvalidXLogRecPtr if not waiting; set only by user backend.
@@ -232,8 +234,6 @@ typedef struct PGXACT
 
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
-	bool		delayChkpt;		/* true if this proc delays checkpoint start;
-								 * previously called InCommit */
 
 	uint8		nxids;
 } PGXACT;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index fdd0394ffae..70d0e1c215f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3058,8 +3058,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	 * crash/basebackup, even though the state of the data directory would
 	 * require it.
 	 */
-	Assert(!MyPgXact->delayChkpt);
-	MyPgXact->delayChkpt = true;
+	Assert(!MyProc->delayChkpt);
+	MyProc->delayChkpt = true;
 
 	/* WAL log truncation */
 	WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3085,7 +3085,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-	MyPgXact->delayChkpt = false;
+	MyProc->delayChkpt = false;
 
 	END_CRIT_SECTION();
 	LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5adf956f413..2f7d4ed59a8 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -465,7 +465,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->lxid = (LocalTransactionId) xid;
 	pgxact->xid = xid;
 	pgxact->xmin = InvalidTransactionId;
-	pgxact->delayChkpt = false;
+	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
 	proc->pid = 0;
 	proc->backendId = InvalidBackendId;
@@ -1114,7 +1114,7 @@ EndPrepare(GlobalTransaction gxact)
 
 	START_CRIT_SECTION();
 
-	MyPgXact->delayChkpt = true;
+	MyProc->delayChkpt = true;
 
 	XLogBeginInsert();
 	for (record = records.head; record != NULL; record = record->next)
@@ -1157,7 +1157,7 @@ EndPrepare(GlobalTransaction gxact)
 	 * checkpoint starting after this will certainly see the gxact as a
 	 * candidate for fsyncing.
 	 */
-	MyPgXact->delayChkpt = false;
+	MyProc->delayChkpt = false;
 
 	/*
 	 * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2204,7 +2204,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
-	MyPgXact->delayChkpt = true;
+	MyProc->delayChkpt = true;
 
 	/*
 	 * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2252,7 +2252,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	TransactionIdCommitTree(xid, nchildren, children);
 
 	/* Checkpoint can proceed now */
-	MyPgXact->delayChkpt = false;
+	MyProc->delayChkpt = false;
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 110ec228eba..6b1ae1f981d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1307,7 +1307,7 @@ RecordTransactionCommit(void)
 		 * a bit fuzzy, but it doesn't matter.
 		 */
 		START_CRIT_SECTION();
-		MyPgXact->delayChkpt = true;
+		MyProc->delayChkpt = true;
 
 		SetCurrentTransactionStopTimestamp();
 
@@ -1408,7 +1408,7 @@ RecordTransactionCommit(void)
 	 */
 	if (markXidCommitted)
 	{
-		MyPgXact->delayChkpt = false;
+		MyProc->delayChkpt = false;
 		END_CRIT_SECTION();
 	}
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 5e032e7042d..4259309dbae 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -904,7 +904,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
 	/*
 	 * Ensure no checkpoint can change our view of RedoRecPtr.
 	 */
-	Assert(MyPgXact->delayChkpt);
+	Assert(MyProc->delayChkpt);
 
 	/*
 	 * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7317ac8a2c4..a7a39dd2a1e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3587,7 +3587,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 			 * essential that CreateCheckpoint waits for virtual transactions
 			 * rather than full transactionids.
 			 */
-			MyPgXact->delayChkpt = delayChkpt = true;
+			MyProc->delayChkpt = delayChkpt = true;
 			lsn = XLogSaveBufferForHint(buffer, buffer_std);
 		}
 
@@ -3620,7 +3620,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		UnlockBufHdr(bufHdr, buf_state);
 
 		if (delayChkpt)
-			MyPgXact->delayChkpt = false;
+			MyProc->delayChkpt = false;
 
 		if (dirtied)
 		{
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 281fe671bdf..363000670b2 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -436,7 +436,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		pgxact->xmin = InvalidTransactionId;
 		/* must be cleared with xid/xmin: */
 		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-		pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+		proc->delayChkpt = false; /* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
 		Assert(pgxact->nxids == 0);
@@ -458,7 +458,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	pgxact->xmin = InvalidTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-	pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+	proc->delayChkpt = false; /* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
 	/* Clear the subtransaction-XID cache too while holding the lock */
@@ -616,7 +616,7 @@ ProcArrayClearTransaction(PGPROC *proc)
 
 	/* redundant, but just in case */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-	pgxact->delayChkpt = false;
+	proc->delayChkpt = false;
 
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
@@ -2257,7 +2257,7 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGXACT.
+ * critical sections, as shown by having delayChkpt set in their PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -2288,9 +2288,8 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
-		if (pgxact->delayChkpt)
+		if (proc->delayChkpt)
 		{
 			VirtualTransactionId vxid;
 
@@ -2328,12 +2327,11 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		VirtualTransactionId vxid;
 
 		GET_VXID_FROM_PGPROC(vxid, *proc);
 
-		if (pgxact->delayChkpt && VirtualTransactionIdIsValid(vxid))
+		if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
 		{
 			int			i;
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9938cddb570..5aa19d3f781 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -396,7 +396,7 @@ InitProcess(void)
 	MyProc->roleId = InvalidOid;
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
-	MyPgXact->delayChkpt = false;
+	MyProc->delayChkpt = false;
 	MyPgXact->vacuumFlags = 0;
 	/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
 	if (IsAutoVacuumWorkerProcess())
@@ -578,7 +578,7 @@ InitAuxiliaryProcess(void)
 	MyProc->roleId = InvalidOid;
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
-	MyPgXact->delayChkpt = false;
+	MyProc->delayChkpt = false;
 	MyPgXact->vacuumFlags = 0;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
-- 
2.25.0.114.g5b0ca878e0
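
The layout idea behind the patch above can be illustrated with a small standalone sketch. None of this is PostgreSQL code: DemoXact, DemoProc and the demo_* functions are hypothetical, and the XID comparison deliberately ignores wraparound. The assumption is that one dense array is scanned by every backend (as PGXACT is), while a flag that is written often but read rarely lives in a separate, larger per-process struct so that writes to it do not dirty the cachelines the scanners need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PROCS 8

/* Dense, frequently scanned data: keep it small and free of noisy fields. */
typedef struct DemoXact
{
    uint32_t    xid;
    uint32_t    xmin;           /* 0 means "not set" in this sketch */
} DemoXact;

/* Larger per-process state that other processes look at only rarely. */
typedef struct DemoProc
{
    int         pid;
    bool        delay_chkpt;    /* written at every commit, read rarely */
} DemoProc;

static DemoXact demo_xacts[DEMO_MAX_PROCS];
static DemoProc demo_procs[DEMO_MAX_PROCS];

/* Frequent writer: touches only the per-process struct. */
static void
demo_set_delay_chkpt(int procno, bool value)
{
    demo_procs[procno].delay_chkpt = value;
}

/* Rare reader (think: checkpointer): scans the per-process structs. */
static int
demo_count_delaying_chkpt(void)
{
    int         count = 0;

    for (int i = 0; i < DEMO_MAX_PROCS; i++)
    {
        if (demo_procs[i].delay_chkpt)
            count++;
    }
    return count;
}

/* Hot path: scans only the dense array (plain comparison, no wraparound). */
static uint32_t
demo_oldest_xmin(void)
{
    uint32_t    oldest = 0;

    for (int i = 0; i < DEMO_MAX_PROCS; i++)
    {
        uint32_t    xmin = demo_xacts[i].xmin;

        if (xmin != 0 && (oldest == 0 || xmin < oldest))
            oldest = xmin;
    }
    return oldest;
}
```

In this sketch demo_set_delay_chkpt() never writes to demo_xacts, so the hot scan in demo_oldest_xmin() is not disturbed by it; only the rare reader pays for touching the larger structs.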

v7-0005-Change-the-way-backends-perform-tuple-is-invisibl.patch (text/x-diff; charset=us-ascii)
From 6089fa0107af4fcc8a2e94798d27a324de748607 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 05/11] Change the way backends perform
 tuple-is-invisible-to-everyone tests.

Instead of using RecentGlobal[Data]Xmin the tests are now done via
InvisibleToEveryone* APIs.

A following commit will take advantage of that to make GetSnapshotData()
more scalable.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the
tests currently are not meaningful, and it seems best to address them
separately.
---
 src/include/access/ginblock.h               |   4 +-
 src/include/access/heapam.h                 |  11 +-
 src/include/access/transam.h                |  78 +-
 src/include/storage/bufpage.h               |   6 -
 src/include/storage/proc.h                  |   8 -
 src/include/storage/procarray.h             |  39 +-
 src/include/utils/snapmgr.h                 |  37 +-
 src/include/utils/snapshot.h                |   6 +
 src/backend/access/gin/ginvacuum.c          |  19 +
 src/backend/access/gist/gistutil.c          |   8 +-
 src/backend/access/gist/gistxlog.c          |  10 +-
 src/backend/access/heap/heapam.c            |  12 +-
 src/backend/access/heap/heapam_handler.c    |  24 +-
 src/backend/access/heap/heapam_visibility.c |  78 +-
 src/backend/access/heap/pruneheap.c         | 199 +++-
 src/backend/access/heap/vacuumlazy.c        |  24 +-
 src/backend/access/nbtree/README            |  10 +-
 src/backend/access/nbtree/nbtpage.c         |   4 +-
 src/backend/access/nbtree/nbtree.c          |  28 +-
 src/backend/access/nbtree/nbtxlog.c         |  10 +-
 src/backend/access/spgist/spgvacuum.c       |   6 +-
 src/backend/access/transam/README           |  96 +-
 src/backend/access/transam/varsup.c         |  48 +
 src/backend/access/transam/xlog.c           |  11 +-
 src/backend/commands/analyze.c              |   2 +-
 src/backend/commands/vacuum.c               |  37 +-
 src/backend/postmaster/autovacuum.c         |   4 +
 src/backend/replication/logical/launcher.c  |   6 +
 src/backend/replication/walreceiver.c       |  17 +-
 src/backend/replication/walsender.c         |  15 +-
 src/backend/storage/ipc/procarray.c         | 949 ++++++++++++++++----
 src/backend/utils/adt/selfuncs.c            |  20 +-
 src/backend/utils/init/postinit.c           |   4 +
 src/backend/utils/time/snapmgr.c            | 252 +++---
 contrib/amcheck/verify_nbtree.c             |   4 +-
 contrib/pg_visibility/pg_visibility.c       |  18 +-
 contrib/pgstattuple/pgstatapprox.c          |   2 +-
 37 files changed, 1498 insertions(+), 608 deletions(-)

diff --git a/src/include/access/ginblock.h b/src/include/access/ginblock.h
index 3f64fd572e3..fe66a95226b 100644
--- a/src/include/access/ginblock.h
+++ b/src/include/access/ginblock.h
@@ -12,6 +12,7 @@
 
 #include "access/transam.h"
 #include "storage/block.h"
+#include "storage/bufpage.h"
 #include "storage/itemptr.h"
 #include "storage/off.h"
 
@@ -134,8 +135,7 @@ typedef struct GinMetaPageData
  */
 #define GinPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
 #define GinPageSetDeleteXid(page, xid) ( ((PageHeader) (page))->pd_prune_xid = xid)
-#define GinPageIsRecyclable(page) ( PageIsNew(page) || (GinPageIsDeleted(page) \
-	&& TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalXmin)))
+extern bool GinPageIsRecyclable(Page page);
 
 /*
  * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f279edc4734..db9e0b48a08 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -172,9 +172,12 @@ extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
 														 int nitems);
 
 /* in heap/pruneheap.c */
+struct InvisibleToEveryoneState;
 extern void heap_page_prune_opt(Relation relation, Buffer buffer);
 extern int	heap_page_prune(Relation relation, Buffer buffer,
-							TransactionId OldestXmin,
+							struct InvisibleToEveryoneState *horizon,
+							TransactionId limited_oldest_xmin,
+							TimestampTz limited_oldest_ts,
 							bool report_stats, TransactionId *latestRemovedXid);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
@@ -201,11 +204,15 @@ extern TM_Result HeapTupleSatisfiesUpdate(HeapTuple stup, CommandId curcid,
 										  Buffer buffer);
 extern HTSV_Result HeapTupleSatisfiesVacuum(HeapTuple stup, TransactionId OldestXmin,
 											Buffer buffer);
+extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple stup, Buffer buffer,
+												   TransactionId *dead_after);
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 								 uint16 infomask, TransactionId xid);
 extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
-extern bool HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin);
+struct InvisibleToEveryoneState;
+extern bool HeapTupleIsSurelyDead(struct InvisibleToEveryoneState *invstate,
+								  HeapTuple htup);
 
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 9a808f64ebe..924e5fa724e 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -54,6 +54,8 @@
 #define FullTransactionIdFollowsOrEquals(a, b) ((a).value >= (b).value)
 #define FullTransactionIdIsValid(x)		TransactionIdIsValid(XidFromFullTransactionId(x))
 #define InvalidFullTransactionId		FullTransactionIdFromEpochAndXid(0, InvalidTransactionId)
+#define FirstNormalFullTransactionId	FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId)
+#define FullTransactionIdIsNormal(x)	FullTransactionIdFollowsOrEquals(x, FirstNormalFullTransactionId)
 
 /*
  * A 64 bit value that contains an epoch and a TransactionId.  This is
@@ -102,6 +104,19 @@ FullTransactionIdAdvance(FullTransactionId *dest)
 		dest->value++;
 }
 
+/* retreat a FullTransactionId variable, stepping over special XIDs */
+static inline void
+FullTransactionIdRetreat(FullTransactionId *dest)
+{
+	dest->value--;
+
+	if (FullTransactionIdPrecedes(*dest, FirstNormalFullTransactionId))
+		return;
+
+	while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
+		dest->value--;
+}
+
 /* back up a transaction ID variable, handling wraparound correctly */
 #define TransactionIdRetreat(dest)	\
 	do { \
@@ -193,8 +208,8 @@ typedef struct VariableCacheData
 	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
-	TransactionId latestCompletedXid;	/* newest XID that has committed or
-										 * aborted */
+	FullTransactionId latestCompletedFullXid;	/* newest full XID that has
+												 * committed or aborted */
 
 	/*
 	 * These fields are protected by CLogTruncationLock
@@ -244,6 +259,12 @@ extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertTransactionIdMayBeOnDisk(TransactionId xid);
+#else
+#define AssertTransactionIdMayBeOnDisk(xid) ((void)true)
+#endif
+
 /*
  * Some frontend programs include this header.  For compilers that emit static
  * inline functions even when they're unused, that leads to unsatisfied
@@ -260,6 +281,59 @@ ReadNewTransactionId(void)
 	return XidFromFullTransactionId(ReadNextFullTransactionId());
 }
 
+/* return transaction ID backed up by amount, handling wraparound correctly */
+static inline TransactionId
+TransactionIdRetreatedBy(TransactionId xid, uint32 amount)
+{
+	xid -= amount;
+
+	while (xid < FirstNormalTransactionId)
+		xid--;
+
+	return xid;
+}
+
+/* return the older of the two IDs */
+static inline TransactionId
+TransactionIdOlder(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdPrecedes(a, b))
+		return a;
+	return b;
+}
+
+/* return the older of the two IDs, assuming they're both normal */
+static inline TransactionId
+NormalTransactionIdOlder(TransactionId a, TransactionId b)
+{
+	Assert(TransactionIdIsNormal(a));
+	Assert(TransactionIdIsNormal(b));
+	if (NormalTransactionIdPrecedes(a, b))
+		return a;
+	return b;
+}
+
+/* return the newer of the two IDs */
+static inline FullTransactionId
+FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
+{
+	if (!FullTransactionIdIsValid(a))
+		return b;
+
+	if (!FullTransactionIdIsValid(b))
+		return a;
+
+	if (FullTransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 #endif							/* FRONTEND */
 
 #endif							/* TRANSAM_H */
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 3f88683a059..51b8f994ac0 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -389,12 +389,6 @@ PageValidateSpecialPointer(Page page)
 #define PageClearAllVisible(page) \
 	(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
 
-#define PageIsPrunable(page, oldestxmin) \
-( \
-	AssertMacro(TransactionIdIsNormal(oldestxmin)), \
-	TransactionIdIsValid(((PageHeader) (page))->pd_prune_xid) && \
-	TransactionIdPrecedes(((PageHeader) (page))->pd_prune_xid, oldestxmin) \
-)
 #define PageSetPrunable(page, xid) \
 do { \
 	Assert(TransactionIdIsNormal(xid)); \
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ae4f573ab46..23d12c1f72f 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -42,13 +42,6 @@ struct XidCache
 
 /*
  * Flags for PGXACT->vacuumFlags
- *
- * Note: If you modify these flags, you need to modify PROCARRAY_XXX flags
- * in src/include/storage/procarray.h.
- *
- * PROC_RESERVED may later be assigned for use in vacuumFlags, but its value is
- * used for PROCARRAY_SLOTS_XMIN in procarray.h, so GetOldestXmin won't be able
- * to match and ignore processes with this flag set.
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -56,7 +49,6 @@ struct XidCache
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
-#define		PROC_RESERVED				0x20	/* reserved for procarray */
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a5c7d0c0644..0f3c151fdb2 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -20,41 +20,6 @@
 #include "utils/snapshot.h"
 
 
-/*
- * These are to implement PROCARRAY_FLAGS_XXX
- *
- * Note: These flags are cloned from PROC_XXX flags in src/include/storage/proc.h
- * to avoid forcing to include proc.h when including procarray.h. So if you modify
- * PROC_XXX flags, you need to modify these flags.
- */
-#define		PROCARRAY_VACUUM_FLAG			0x02	/* currently running lazy
-													 * vacuum */
-#define		PROCARRAY_ANALYZE_FLAG			0x04	/* currently running
-													 * analyze */
-#define		PROCARRAY_LOGICAL_DECODING_FLAG 0x10	/* currently doing logical
-													 * decoding outside xact */
-
-#define		PROCARRAY_SLOTS_XMIN			0x20	/* replication slot xmin,
-													 * catalog_xmin */
-/*
- * Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
- * PGXACT->vacuumFlags. Other flags are used for different purposes and
- * have no corresponding PROC flag equivalent.
- */
-#define		PROCARRAY_PROC_FLAGS_MASK	(PROCARRAY_VACUUM_FLAG | \
-										 PROCARRAY_ANALYZE_FLAG | \
-										 PROCARRAY_LOGICAL_DECODING_FLAG)
-
-/* Use the following flags as an input "flags" to GetOldestXmin function */
-/* Consider all backends except for logical decoding ones which manage xmin separately */
-#define		PROCARRAY_FLAGS_DEFAULT			PROCARRAY_LOGICAL_DECODING_FLAG
-/* Ignore vacuum backends */
-#define		PROCARRAY_FLAGS_VACUUM			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_VACUUM_FLAG
-/* Ignore analyze backends */
-#define		PROCARRAY_FLAGS_ANALYZE			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_ANALYZE_FLAG
-/* Ignore both vacuum and analyze backends */
-#define		PROCARRAY_FLAGS_VACUUM_ANALYZE	PROCARRAY_FLAGS_DEFAULT | PROCARRAY_VACUUM_FLAG | PROCARRAY_ANALYZE_FLAG
-
 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
 extern void ProcArrayAdd(PGPROC *proc);
@@ -88,7 +53,9 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, int flags);
+extern TransactionId GetOldestVisibleTransactionId(Relation rel);
+extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);
+extern TransactionId GetOldestTransactionIdConsideredRunning(void);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 7738d6a8e01..a47eb7406cf 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -52,13 +52,12 @@ extern Size SnapMgrShmemSize(void);
 extern void SnapMgrInit(void);
 extern TimestampTz GetSnapshotCurrentTimestamp(void);
 extern TimestampTz GetOldSnapshotThresholdTimestamp(void);
+extern void SnapshotTooOldMagicForTest(void);
 
 extern bool FirstSnapshotSet;
 
 extern PGDLLIMPORT TransactionId TransactionXmin;
 extern PGDLLIMPORT TransactionId RecentXmin;
-extern PGDLLIMPORT TransactionId RecentGlobalXmin;
-extern PGDLLIMPORT TransactionId RecentGlobalDataXmin;
 
 /* Variables representing various special snapshot semantics */
 extern PGDLLIMPORT SnapshotData SnapshotSelfData;
@@ -78,11 +77,12 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 
 /*
  * Similarly, some initialization is required for a NonVacuumable snapshot.
- * The caller must supply the xmin horizon to use (e.g., RecentGlobalXmin).
+ * The caller must supply the visibility cutoff state to use (c.f.
+ * InvisibleToEveryoneTestInit()).
  */
-#define InitNonVacuumableSnapshot(snapshotdata, xmin_horizon)  \
+#define InitNonVacuumableSnapshot(snapshotdata, statep)  \
 	((snapshotdata).snapshot_type = SNAPSHOT_NON_VACUUMABLE, \
-	 (snapshotdata).xmin = (xmin_horizon))
+	 (snapshotdata).invstate = (statep))
 
 /*
  * Similarly, some initialization is required for SnapshotToast.  We need
@@ -98,6 +98,10 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 	((snapshot)->snapshot_type == SNAPSHOT_MVCC || \
 	 (snapshot)->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 
+static inline bool OldSnapshotThresholdActive(void)
+{
+	return old_snapshot_threshold >= 0;
+}
 
 extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
@@ -123,8 +127,6 @@ extern void UnregisterSnapshot(Snapshot snapshot);
 extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
 extern void UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner);
 
-extern FullTransactionId GetFullRecentGlobalXmin(void);
-
 extern void AtSubCommit_Snapshot(int level);
 extern void AtSubAbort_Snapshot(int level);
 extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
@@ -133,13 +135,30 @@ extern void ImportSnapshot(const char *idstr);
 extern bool XactHasExportedSnapshots(void);
 extern void DeleteAllExportedSnapshotFiles(void);
 extern bool ThereAreNoPriorRegisteredSnapshots(void);
-extern TransactionId TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-														 Relation relation);
+extern bool TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
+												Relation relation,
+												TransactionId *limit_xid,
+												TimestampTz *limit_ts);
+extern void SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit);
 extern void MaintainOldSnapshotTimeMapping(TimestampTz whenTaken,
 										   TransactionId xmin);
 
 extern char *ExportSnapshot(Snapshot snapshot);
 
+/*
+ * These live in procarray.c because they're intimately linked to the
+ * procarray contents, but thematically they better fit into snapmgr.h
+ */
+typedef struct InvisibleToEveryoneState InvisibleToEveryoneState;
+extern InvisibleToEveryoneState *InvisibleToEveryoneTestInit(Relation rel);
+extern bool InvisibleToEveryoneTestXid(InvisibleToEveryoneState *state, TransactionId xid);
+extern bool InvisibleToEveryoneTestFullXid(InvisibleToEveryoneState *state, FullTransactionId fxid);
+extern FullTransactionId InvisibleToEveryoneTestFullCutoff(InvisibleToEveryoneState *state);
+extern TransactionId InvisibleToEveryoneTestCutoff(InvisibleToEveryoneState *state);
+extern bool InvisibleToEveryoneCheckXid(Relation rel, TransactionId xid);
+extern bool InvisibleToEveryoneCheckFullXid(Relation rel, FullTransactionId fxid);
+
+
 /*
  * Utility functions for implementing visibility routines in table AMs.
  */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 4796edb63aa..2bc415376ac 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -192,6 +192,12 @@ typedef struct SnapshotData
 	 */
 	uint32		speculativeToken;
 
+	/*
+	 * For SNAPSHOT_NON_VACUUMABLE (and hopefully more in the future) this
+	 * contains the visibility cutoff state.
+	 */
+	struct InvisibleToEveryoneState *invstate;
+
 	/*
 	 * Book-keeping information, used by the snapshot manager
 	 */
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 8ae4fd95a7b..1b0e04ee0fa 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -793,3 +793,22 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 	return stats;
 }
+
+bool
+GinPageIsRecyclable(Page page)
+{
+	TransactionId delete_xid;
+
+	if (PageIsNew(page))
+		return true;
+
+	if (!GinPageIsDeleted(page))
+		return false;
+
+	delete_xid = GinPageGetDeleteXid(page);
+
+	if (!TransactionIdIsValid(delete_xid))
+		return true;
+
+	return InvisibleToEveryoneCheckXid(NULL, delete_xid);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 765329bbcd4..195491e2766 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -891,15 +891,13 @@ gistPageRecyclable(Page page)
 		 * As long as that can happen, we must keep the deleted page around as
 		 * a tombstone.
 		 *
-		 * Compare the deletion XID with RecentGlobalXmin. If deleteXid <
-		 * RecentGlobalXmin, then no scan that's still in progress could have
+		 * For that check if the deletion XID could still be visible to
+		 * anyone. If not, then no scan that's still in progress could have
 		 * seen its downlink, and we can recycle it.
 		 */
 		FullTransactionId deletexid_full = GistPageGetDeleteXid(page);
-		FullTransactionId recentxmin_full = GetFullRecentGlobalXmin();
 
-		if (FullTransactionIdPrecedes(deletexid_full, recentxmin_full))
-			return true;
+		return InvisibleToEveryoneCheckFullXid(NULL, deletexid_full);
 	}
 	return false;
 }
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index b60dba052fa..66ddbaa5c4a 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -387,11 +387,11 @@ gistRedoPageReuse(XLogReaderState *record)
 	 * PAGE_REUSE records exist to provide a conflict point when we reuse
 	 * pages in the index via the FSM.  That's all they do though.
 	 *
-	 * latestRemovedXid was the page's deleteXid.  The deleteXid <
-	 * RecentGlobalXmin test in gistPageRecyclable() conceptually mirrors the
-	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
-	 * Consequently, one XID value achieves the same exclusion effect on
-	 * master and standby.
+	 * latestRemovedXid was the page's deleteXid.  The
+	 * InvisibleToEveryoneCheckFullXid(deleteXid) test in gistPageRecyclable()
+	 * conceptually mirrors the pgxact->xmin > limitXmin test in
+	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
+	 * same exclusion effect on master and standby.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0af51880ccc..f7caae2c081 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1537,6 +1537,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		at_chain_start;
 	bool		valid;
 	bool		skip;
+	InvisibleToEveryoneState *invstate = NULL;
 
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
@@ -1636,9 +1637,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		 * Note: if you change the criterion here for what is "dead", fix the
 		 * planner's get_actual_variable_range() function to match.
 		 */
-		if (all_dead && *all_dead &&
-			!HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
-			*all_dead = false;
+		if (all_dead && *all_dead)
+		{
+			if (!invstate)
+				invstate = InvisibleToEveryoneTestInit(relation);
+
+			if (!HeapTupleIsSurelyDead(invstate, heapTuple))
+				*all_dead = false;
+		}
 
 		/*
 		 * Check to see if HOT chain continues past this tuple; if so fetch
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 56b35622f1a..854176a0e2f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1201,7 +1201,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
-		OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
+		OldestXmin = GetOldestVisibleTransactionId(heapRelation);
 
 	if (!scan)
 	{
@@ -1242,6 +1242,17 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	hscan = (HeapScanDesc) scan;
 
+	/*
+	 * Must have called GetOldestVisibleTransactionId() if using SnapshotAny.
+	 * Shouldn't have for an MVCC snapshot. (It's especially worth checking
+	 * this for parallel builds, since ambuild routines that support parallel
+	 * builds must work these details out for themselves.)
+	 */
+	Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+		   !TransactionIdIsValid(OldestXmin));
+	Assert(snapshot == SnapshotAny || !anyvisible);
+
 	/* Publish number of blocks to scan */
 	if (progress)
 	{
@@ -1261,17 +1272,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 									 nblocks);
 	}
 
-	/*
-	 * Must call GetOldestXmin() with SnapshotAny.  Should never call
-	 * GetOldestXmin() with MVCC snapshot. (It's especially worth checking
-	 * this for parallel builds, since ambuild routines that support parallel
-	 * builds must work these details out for themselves.)
-	 */
-	Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
-	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
-		   !TransactionIdIsValid(OldestXmin));
-	Assert(snapshot == SnapshotAny || !anyvisible);
-
 	/* set our scan endpoints */
 	if (!allow_sync)
 		heap_setscanlimits(scan, start_blockno, numblocks);
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aab..793a8036331 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1154,19 +1154,55 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
  *	we mainly want to know is if a tuple is potentially visible to *any*
  *	running transaction.  If so, it can't be removed yet by VACUUM.
  *
- * OldestXmin is a cutoff XID (obtained from GetOldestXmin()).  Tuples
- * deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
- * still be visible to some open transaction, so we can't remove them,
- * even if we see that the deleting transaction has committed.
+ * OldestXmin is a cutoff XID (obtained from GetOldestVisibleTransactionId()).
+ * Tuples deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
+ * still be visible to some open transaction, so we can't remove them, even if
+ * we see that the deleting transaction has committed.
  */
 HTSV_Result
 HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 						 Buffer buffer)
+{
+	TransactionId dead_after = InvalidTransactionId;
+	HTSV_Result res;
+
+	res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);
+
+	if (res == HEAPTUPLE_RECENTLY_DEAD)
+	{
+		Assert(TransactionIdIsValid(dead_after));
+
+		if (TransactionIdPrecedes(dead_after, OldestXmin))
+			res = HEAPTUPLE_DEAD;
+	}
+	else
+		Assert(!TransactionIdIsValid(dead_after));
+
+	return res;
+}
+
+/*
+ * Work horse for HeapTupleSatisfiesVacuum and similar routines.
+ *
+ * In contrast to HeapTupleSatisfiesVacuum this routine, when encountering a
+ * tuple that could still be visible to some backend, stores the xid that
+ * needs to be compared with the horizon in *dead_after, and returns
+ * HEAPTUPLE_RECENTLY_DEAD. The caller then can perform the comparison with
+ * the horizon.  This is e.g. useful when comparing with different horizons.
+ *
+ * Note: HEAPTUPLE_DEAD can still be returned here, e.g. if the inserting
+ * transaction aborted.
+ */
+HTSV_Result
+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *dead_after)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
 	Assert(ItemPointerIsValid(&htup->t_self));
 	Assert(htup->t_tableOid != InvalidOid);
+	Assert(dead_after != NULL);
+
+	*dead_after = InvalidTransactionId;
 
 	/*
 	 * Has inserting transaction committed?
@@ -1323,17 +1359,15 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 		else if (TransactionIdDidCommit(xmax))
 		{
 			/*
-			 * The multixact might still be running due to lockers.  If the
-			 * updater is below the xid horizon, we have to return DEAD
-			 * regardless -- otherwise we could end up with a tuple where the
-			 * updater has to be removed due to the horizon, but is not pruned
-			 * away.  It's not a problem to prune that tuple, because any
-			 * remaining lockers will also be present in newer tuple versions.
+			 * The multixact might still be running due to lockers.  Need to
+			 * allow for pruning if below the xid horizon regardless --
+			 * otherwise we could end up with a tuple where the updater has to
+			 * be removed due to the horizon, but is not pruned away.  It's
+			 * not a problem to prune that tuple, because any remaining
+			 * lockers will also be present in newer tuple versions.
 			 */
-			if (!TransactionIdPrecedes(xmax, OldestXmin))
-				return HEAPTUPLE_RECENTLY_DEAD;
-
-			return HEAPTUPLE_DEAD;
+			*dead_after = xmax;
+			return HEAPTUPLE_RECENTLY_DEAD;
 		}
 		else if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple), false))
 		{
@@ -1372,14 +1406,11 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 	}
 
 	/*
-	 * Deleter committed, but perhaps it was recent enough that some open
-	 * transactions could still see the tuple.
+	 * Deleter committed, allow caller to check if it was recent enough that
+	 * some open transactions could still see the tuple.
 	 */
-	if (!TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin))
-		return HEAPTUPLE_RECENTLY_DEAD;
-
-	/* Otherwise, it's dead and removable */
-	return HEAPTUPLE_DEAD;
+	*dead_after = HeapTupleHeaderGetRawXmax(tuple);
+	return HEAPTUPLE_RECENTLY_DEAD;
 }
 
 
@@ -1418,7 +1449,8 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
  *	if the tuple is removable.
  */
 bool
-HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
+HeapTupleIsSurelyDead(InvisibleToEveryoneState *invstate,
+					  HeapTuple htup)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
@@ -1459,7 +1491,7 @@ HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
 		return false;
 
 	/* Deleter committed, so tuple is dead if the XID is old enough. */
-	return TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin);
+	return InvisibleToEveryoneTestXid(invstate, HeapTupleHeaderGetRawXmax(tuple));
 }
 
 /*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 1794cfd8d9a..e36ca648cef 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -23,12 +23,24 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
+#include "utils/snapmgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
 /* Working data for heap_page_prune and subroutines */
 typedef struct
 {
+	Relation rel;
+
+	/*
+	 * State related to determining whether a dead tuple is still needed.
+	 */
+	InvisibleToEveryoneState *vistest;
+	TimestampTz limited_oldest_ts;
+	TransactionId limited_oldest_xmin;
+	/* have we made removal decision based on old_snapshot_threshold */
+	bool limited_oldest_committed;
+
 	TransactionId new_prune_xid;	/* new prune hint value for page */
 	TransactionId latestRemovedXid; /* latest xid to be removed by this prune */
 	int			nredirected;	/* numbers of entries in arrays below */
@@ -43,9 +55,8 @@ typedef struct
 } PruneState;
 
 /* Local functions */
-static int	heap_prune_chain(Relation relation, Buffer buffer,
+static int	heap_prune_chain(Buffer buffer,
 							 OffsetNumber rootoffnum,
-							 TransactionId OldestXmin,
 							 PruneState *prstate);
 static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid);
 static void heap_prune_record_redirect(PruneState *prstate,
@@ -65,16 +76,16 @@ static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
  * if there's not any use in pruning.
  *
  * Caller must have pin on the buffer, and must *not* have a lock on it.
- *
- * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
- * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
  */
 void
 heap_page_prune_opt(Relation relation, Buffer buffer)
 {
 	Page		page = BufferGetPage(buffer);
+	TransactionId prune_xid;
+	InvisibleToEveryoneState *vistest;
+	TransactionId limited_xmin = InvalidTransactionId;
+	TimestampTz limited_ts = 0;
 	Size		minfree;
-	TransactionId OldestXmin;
 
 	/*
 	 * We can't write WAL in recovery mode, so there's no point trying to
@@ -85,37 +96,53 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		return;
 
 	/*
-	 * Use the appropriate xmin horizon for this relation. If it's a proper
-	 * catalog relation or a user defined, additional, catalog relation, we
-	 * need to use the horizon that includes slots, otherwise the data-only
-	 * horizon can be used. Note that the toast relation of user defined
-	 * relations are *not* considered catalog relations.
+	 * XXX: pointless call to make old_snapshot_threshold tests work. They're
+	 * broken, and discussion of what to do about them is ongoing.
+	 */
+	if (old_snapshot_threshold == 0)
+		SnapshotTooOldMagicForTest();
+
+	/*
+	 * First check whether there's any chance there's something to prune;
+	 * determining the appropriate horizon is a waste if there's no prune_xid
+	 * (i.e. no updates/deletes left potentially dead tuples around).
+	 */
+	prune_xid = ((PageHeader) page)->pd_prune_xid;
+	if (!TransactionIdIsValid(prune_xid))
+		return;
+
+	/*
+	 * Check whether prune_xid indicates that there may be dead rows that can
+	 * be cleaned up.
 	 *
-	 * It is OK to apply the old snapshot limit before acquiring the cleanup
+	 * It is OK to check the old snapshot limit before acquiring the cleanup
 	 * lock because the worst that can happen is that we are not quite as
 	 * aggressive about the cleanup (by however many transaction IDs are
 	 * consumed between this point and acquiring the lock).  This allows us to
 	 * save significant overhead in the case where the page is found not to be
 	 * prunable.
-	 */
-	if (IsCatalogRelation(relation) ||
-		RelationIsAccessibleInLogicalDecoding(relation))
-		OldestXmin = RecentGlobalXmin;
-	else
-		OldestXmin =
-			TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
-												relation);
-
-	Assert(TransactionIdIsValid(OldestXmin));
-
-	/*
-	 * Let's see if we really need pruning.
 	 *
-	 * Forget it if page is not hinted to contain something prunable that's
-	 * older than OldestXmin.
+	 * Even if old_snapshot_threshold is set, we first check whether the page
+	 * can be pruned without it, both because
+	 * TransactionIdLimitedForOldSnapshots() is not cheap, and because not
+	 * unnecessarily relying on old_snapshot_threshold avoids causing
+	 * conflicts.
 	 */
-	if (!PageIsPrunable(page, OldestXmin))
-		return;
+	vistest = InvisibleToEveryoneTestInit(relation);
+
+	if (!InvisibleToEveryoneTestXid(vistest, prune_xid))
+	{
+		if (!OldSnapshotThresholdActive())
+			return;
+
+		if (!TransactionIdLimitedForOldSnapshots(InvisibleToEveryoneTestCutoff(vistest),
+												 relation,
+												 &limited_xmin, &limited_ts))
+			return;
+
+		if (!TransactionIdPrecedes(prune_xid, limited_xmin))
+			return;
+	}
 
 	/*
 	 * We prune when a previous UPDATE failed to find enough space on the page
@@ -151,7 +178,9 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 															 * needed */
 
 			/* OK to prune */
-			(void) heap_page_prune(relation, buffer, OldestXmin, true, &ignore);
+			(void) heap_page_prune(relation, buffer, vistest,
+								   limited_xmin, limited_ts,
+								   true, &ignore);
 		}
 
 		/* And release buffer lock */
@@ -165,8 +194,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * Caller must have pin and buffer cleanup lock on the page.
  *
- * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
- * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ * vistest is used to distinguish whether tuples are DEAD or RECENTLY_DEAD
+ * (see heap_prune_satisfies_vacuum and
+ * HeapTupleSatisfiesVacuum). limited_oldest_xmin / limited_oldest_ts need to
+ * either have been set by TransactionIdLimitedForOldSnapshots, or be
+ * InvalidTransactionId/0 respectively.
  *
  * If report_stats is true then we send the number of reclaimed heap-only
  * tuples to pgstats.  (This must be false during vacuum, since vacuum will
@@ -177,7 +209,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * latestRemovedXid.
  */
 int
-heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
+heap_page_prune(Relation relation, Buffer buffer,
+				InvisibleToEveryoneState *vistest,
+				TransactionId limited_oldest_xmin,
+				TimestampTz limited_oldest_ts,
 				bool report_stats, TransactionId *latestRemovedXid)
 {
 	int			ndeleted = 0;
@@ -198,6 +233,11 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 	 * initialize the rest of our working state.
 	 */
 	prstate.new_prune_xid = InvalidTransactionId;
+	prstate.rel = relation;
+	prstate.vistest = vistest;
+	prstate.limited_oldest_xmin = limited_oldest_xmin;
+	prstate.limited_oldest_ts = limited_oldest_ts;
+	prstate.limited_oldest_committed = false;
 	prstate.latestRemovedXid = *latestRemovedXid;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
@@ -220,9 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 			continue;
 
 		/* Process this item or chain of items */
-		ndeleted += heap_prune_chain(relation, buffer, offnum,
-									 OldestXmin,
-									 &prstate);
+		ndeleted += heap_prune_chain(buffer, offnum, &prstate);
 	}
 
 	/* Any error while applying the changes is critical */
@@ -323,6 +361,85 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 }
 
 
+/*
+ * Perform visibility checks for heap pruning.
+ *
+ * This is more complicated than just calling InvisibleToEveryoneTestXid()
+ * because of old_snapshot_threshold. We only want to increase the threshold
+ * that triggers errors for old snapshots when we actually decide to remove a
+ * row based on the limited horizon.
+ *
+ * Due to its cost we also only want to call
+ * TransactionIdLimitedForOldSnapshots() if necessary, i.e. we might not have
+ * done so in heap_page_prune_opt() if pd_prune_xid was old enough. But we
+ * still want to be able to remove rows that are too new to be removed
+ * according to prstate->vistest, but that can be removed based on
+ * old_snapshot_threshold. So we call TransactionIdLimitedForOldSnapshots() on
+ * demand in here, if appropriate.
+ */
+static HTSV_Result
+heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
+{
+	HTSV_Result res;
+	TransactionId dead_after;
+
+	res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);
+
+	if (res != HEAPTUPLE_RECENTLY_DEAD)
+		return res;
+
+	/*
+	 * If we are already relying on the limited xmin, there is no need to
+	 * delay doing so anymore.
+	 */
+	if (prstate->limited_oldest_committed)
+	{
+		Assert(TransactionIdIsValid(prstate->limited_oldest_xmin));
+
+		if (TransactionIdPrecedes(dead_after, prstate->limited_oldest_xmin))
+			res = HEAPTUPLE_DEAD;
+		return res;
+	}
+
+	/*
+	 * First check if InvisibleToEveryoneTestXid() is sufficient to find the
+	 * row dead. If not, and old_snapshot_threshold is enabled, try to use the
+	 * lowered horizon.
+	 */
+	if (InvisibleToEveryoneTestXid(prstate->vistest, dead_after))
+		res = HEAPTUPLE_DEAD;
+	else if (OldSnapshotThresholdActive())
+	{
+		/* haven't determined limited horizon yet, request it */
+		if (!TransactionIdIsValid(prstate->limited_oldest_xmin))
+		{
+			TransactionId horizon =
+				InvisibleToEveryoneTestCutoff(prstate->vistest);
+
+			TransactionIdLimitedForOldSnapshots(horizon, prstate->rel,
+												&prstate->limited_oldest_xmin,
+												&prstate->limited_oldest_ts);
+		}
+
+		if (TransactionIdIsValid(prstate->limited_oldest_xmin) &&
+			TransactionIdPrecedes(dead_after, prstate->limited_oldest_xmin))
+		{
+			/*
+			 * About to remove row based on snapshot_too_old. Need to raise
+			 * the threshold so problematic accesses would error.
+			 */
+			Assert(!prstate->limited_oldest_committed);
+			SetOldSnapshotThresholdTimestamp(prstate->limited_oldest_ts,
+											 prstate->limited_oldest_xmin);
+			prstate->limited_oldest_committed = true;
+			res = HEAPTUPLE_DEAD;
+		}
+	}
+
+	return res;
+}
+
+
 /*
  * Prune specified line pointer or a HOT chain originating at line pointer.
  *
@@ -349,9 +466,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  * Returns the number of tuples (to be) deleted from the page.
  */
 static int
-heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
-				 TransactionId OldestXmin,
-				 PruneState *prstate)
+heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 {
 	int			ndeleted = 0;
 	Page		dp = (Page) BufferGetPage(buffer);
@@ -366,7 +481,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 				i;
 	HeapTupleData tup;
 
-	tup.t_tableOid = RelationGetRelid(relation);
+	tup.t_tableOid = RelationGetRelid(prstate->rel);
 
 	rootlp = PageGetItemId(dp, rootoffnum);
 
@@ -401,7 +516,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 			 * either here or while following a chain below.  Whichever path
 			 * gets there first will mark the tuple unused.
 			 */
-			if (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer)
+			if (heap_prune_satisfies_vacuum(prstate, &tup, buffer)
 				== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
 			{
 				heap_prune_record_unused(prstate, rootoffnum);
@@ -485,7 +600,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 		 */
 		tupdead = recent_dead = false;
 
-		switch (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer))
+		switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer))
 		{
 			case HEAPTUPLE_DEAD:
 				tupdead = true;
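A minimal standalone sketch (not part of the patch) of the decision order in heap_prune_satisfies_vacuum() above: try the regular horizon first, compute the more aggressive limited horizon only on demand, and record the first time a removal actually relies on it. ToyPruneState, toy_limited_xmin() and threshold_advance are hypothetical stand-ins; in particular the fixed advance merely imitates TransactionIdLimitedForOldSnapshots(), whose real result depends on timestamps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

typedef struct
{
	TransactionId horizon;		/* regular "invisible to everyone" cutoff */
	bool		threshold_active;	/* is old_snapshot_threshold enabled? */
	uint32_t	threshold_advance;	/* hypothetical: how far the limit is ahead */
	TransactionId limited_xmin; /* InvalidTransactionId until computed */
	bool		limited_committed;	/* already removed something based on it */
	int			limit_computations; /* counts calls to the expensive step */
} ToyPruneState;

/* Wraparound-aware "a is older than b" for normal xids. */
static bool
toy_xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/* Stand-in for the expensive limited-horizon computation. */
static TransactionId
toy_limited_xmin(ToyPruneState *st)
{
	st->limit_computations++;
	return st->horizon + st->threshold_advance;
}

/* Returns true if a row whose deleter is dead_after may be removed. */
static bool
toy_prune_is_dead(ToyPruneState *st, TransactionId dead_after)
{
	/* Already relying on the limited horizon: no reason to hold back. */
	if (st->limited_committed)
	{
		assert(st->limited_xmin != InvalidTransactionId);
		return toy_xid_precedes(dead_after, st->limited_xmin);
	}

	/* Cheap test first; it never causes "snapshot too old" errors. */
	if (toy_xid_precedes(dead_after, st->horizon))
		return true;

	if (!st->threshold_active)
		return false;

	/* Compute the limited horizon lazily, at most once. */
	if (st->limited_xmin == InvalidTransactionId)
		st->limited_xmin = toy_limited_xmin(st);

	if (toy_xid_precedes(dead_after, st->limited_xmin))
	{
		/* The real code raises the shared threshold at this point. */
		st->limited_committed = true;
		return true;
	}

	return false;
}
```

In this sketch a page whose removable rows all precede the regular horizon never pays for the limited-horizon computation, and limited_committed stays false, which corresponds to not raising the error threshold unnecessarily.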
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3382d37a40..5799795b877 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -780,6 +780,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	InvisibleToEveryoneState *vistest;
 
 	pg_rusage_init(&ru0);
 
@@ -808,6 +809,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->nonempty_pages = 0;
 	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
+	vistest = InvisibleToEveryoneTestInit(onerel);
+
 	/*
 	 * Initialize the state for a parallel vacuum.  As of now, only one worker
 	 * can be used for an index, so we invoke parallelism only if there are at
@@ -1231,7 +1234,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 *
 		 * We count tuples removed by the pruning step as removed by VACUUM.
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
+		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
+										 InvalidTransactionId, 0, false,
 										 &vacrelstats->latestRemovedXid);
 
 		/*
@@ -1588,14 +1592,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		}
 
 		/*
-		 * It's possible for the value returned by GetOldestXmin() to move
-		 * backwards, so it's not wrong for us to see tuples that appear to
-		 * not be visible to everyone yet, while PD_ALL_VISIBLE is already
-		 * set. The real safe xmin value never moves backwards, but
-		 * GetOldestXmin() is conservative and sometimes returns a value
-		 * that's unnecessarily small, so if we see that contradiction it just
-		 * means that the tuples that we think are not visible to everyone yet
-		 * actually are, and the PD_ALL_VISIBLE flag is correct.
+		 * It's possible for the value returned by
+		 * GetOldestVisibleTransactionId() to move backwards, so it's not
+		 * wrong for us to see tuples that appear to not be visible to
+		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
+		 * xmin value never moves backwards, but
+		 * GetOldestVisibleTransactionId() is conservative and sometimes
+		 * returns a value that's unnecessarily small, so if we see that
+		 * contradiction it just means that the tuples that we think are not
+		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
+		 * is correct.
 		 *
 		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
 		 * set, however.
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 2d0f8f4b79a..46adc5ee9a2 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -336,9 +336,9 @@ snapshots and registered snapshots as of the deletion are gone; which is
 overly strong, but is simple to implement within Postgres.  When marked
 dead, a deleted page is labeled with the next-transaction counter value.
 VACUUM can reclaim the page for re-use when this transaction number is
-older than RecentGlobalXmin.  As collateral damage, this implementation
-also waits for running XIDs with no snapshots and for snapshots taken
-until the next transaction to allocate an XID commits.
+guaranteed to be "invisible to everyone".  As collateral damage, this
+implementation also waits for running XIDs with no snapshots and for
+snapshots taken until the next transaction to allocate an XID commits.
 
 Reclaiming a page doesn't actually change its state on disk --- we simply
 record it in the shared-memory free space map, from which it will be
@@ -405,8 +405,8 @@ page and also the correct place to hold the current value. We can avoid
 the cost of walking down the tree in such common cases.
 
 The optimization works on the assumption that there can only be one
-non-ignorable leaf rightmost page, and so even a RecentGlobalXmin style
-interlock isn't required.  We cannot fail to detect that our hint was
+non-ignorable leaf rightmost page, and so not even an invisible-to-everyone
+style interlock is required.  We cannot fail to detect that our hint was
 invalidated, because there can only be one such page in the B-Tree at
 any time. It's possible that the page will be deleted and recycled
 without a backend's cached page also being detected as invalidated, but
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 39b8f17f4b5..6e5ee3b443e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,7 +983,7 @@ _bt_page_recyclable(Page page)
 	 */
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	if (P_ISDELETED(opaque) &&
-		TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
+		InvisibleToEveryoneCheckXid(NULL, opaque->btpo.xact))
 		return true;
 	return false;
 }
@@ -2186,7 +2186,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	 * updated links to the target, ReadNewTransactionId() suffices as an
 	 * upper bound.  Any scan having retained a now-stale link is advertising
 	 * in its PGXACT an xmin less than or equal to the value we read here.  It
-	 * will continue to do so, holding back RecentGlobalXmin, for the duration
+	 * will continue to do so, holding back the xmin horizon, for the duration
 	 * of that scan.
 	 */
 	page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 36294789f3f..fc81d719093 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -802,6 +802,12 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
+	/*
+	 * XXX: If IndexVacuumInfo contained the heap relation, we could be more
+	 * aggressive about vacuuming non catalog relations by passing the table
+	 * to InvisibleToEveryoneCheckXid().
+	 */
+
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
@@ -811,12 +817,11 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		result = true;
 	}
 	else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
-			 TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
-								   RecentGlobalXmin))
+			 InvisibleToEveryoneCheckXid(NULL, metad->btm_oldest_btpo_xact))
 	{
 		/*
-		 * If oldest btpo.xact in the deleted pages is older than
-		 * RecentGlobalXmin, then at least one deleted page can be recycled.
+		 * If oldest btpo.xact in the deleted pages is invisible, then at
+		 * least one deleted page can be recycled.
 		 */
 		result = true;
 	}
@@ -1227,14 +1232,13 @@ restart:
 				 * own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
-				 * before it finishes could have a RecentGlobalXmin with a
-				 * later xid than the VACUUM's OldestXmin cutoff.  These
-				 * backends might happen to opportunistically mark some index
-				 * tuples LP_DEAD before we reach them, even though they may
-				 * be after our cutoff.  We don't try to kill these "extra"
-				 * index tuples in _bt_delitems_vacuum().  This keep things
-				 * simple, and allows us to always avoid generating our own
-				 * conflicts.
+				 * before it finishes could have a visibility cutoff with a
+				 * later xid than VACUUM's OldestXmin cutoff.  These backends
+				 * might happen to opportunistically mark some index tuples
+				 * LP_DEAD before we reach them, even though they may be after
+				 * our cutoff.  We don't try to kill these "extra" index
+				 * tuples in _bt_delitems_vacuum().  This keeps things simple,
+				 * and allows us to always avoid generating our own conflicts.
 				 */
 				Assert(!BTreeTupleIsPivot(itup));
 				if (!BTreeTupleIsPosting(itup))
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 99d0914e724..431d7c3d709 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -926,11 +926,11 @@ btree_xlog_reuse_page(XLogReaderState *record)
 	 * Btree reuse_page records exist to provide a conflict point when we
 	 * reuse pages in the index via the FSM.  That's all they do though.
 	 *
-	 * latestRemovedXid was the page's btpo.xact.  The btpo.xact <
-	 * RecentGlobalXmin test in _bt_page_recyclable() conceptually mirrors the
-	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
-	 * Consequently, one XID value achieves the same exclusion effect on
-	 * master and standby.
+	 * latestRemovedXid was the page's btpo.xact.  The
+	 * InvisibleToEveryoneCheckXid test in _bt_page_recyclable() conceptually
+	 * mirrors the pgxact->xmin > limitXmin test in
+	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
+	 * same exclusion effect on master and standby.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index bd98707f3c0..0414382f34e 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -501,10 +501,14 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemToPlaceholder[MaxIndexTuplesPerPage];
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
+	InvisibleToEveryoneState *invstate;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
 
+	/* XXX: providing heap relation would allow more pruning */
+	invstate = InvisibleToEveryoneTestInit(NULL);
+
 	START_CRIT_SECTION();
 
 	/*
@@ -521,7 +525,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 		dt = (SpGistDeadTuple) PageGetItem(page, PageGetItemId(page, i));
 
 		if (dt->tupstate == SPGIST_REDIRECT &&
-			TransactionIdPrecedes(dt->xid, RecentGlobalXmin))
+			InvisibleToEveryoneTestXid(invstate, dt->xid))
 		{
 			dt->tupstate = SPGIST_PLACEHOLDER;
 			Assert(opaque->nRedirection > 0);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eb9aac5fd39..be805a5660b 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -257,31 +257,31 @@ simultaneously, we have one backend take ProcArrayLock and clear the XIDs
 of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
-latestCompletedXid variable.  This allows GetSnapshotData to use
-latestCompletedXid + 1 as xmax for its snapshot: there can be no
+latestCompletedFullXid variable.  This allows GetSnapshotData to use
+latestCompletedFullXid + 1 as xmax for its snapshot: there can be no
 transaction >= this xid value that the snapshot needs to consider as
 completed.
 
 In short, then, the rule is that no transaction may exit the set of
-currently-running transactions between the time we fetch latestCompletedXid
+currently-running transactions between the time we fetch latestCompletedFullXid
 and the time we finish building our snapshot.  However, this restriction
 only applies to transactions that have an XID --- read-only transactions
 can end without acquiring ProcArrayLock, since they don't affect anyone
-else's snapshot nor latestCompletedXid.
+else's snapshot nor latestCompletedFullXid.
 
 Transaction start, per se, doesn't have any interlocking with these
 considerations, since we no longer assign an XID immediately at transaction
 start.  But when we do decide to allocate an XID, GetNewTransactionId must
 store the new XID into the shared ProcArray before releasing XidGenLock.
-This ensures that all top-level XIDs <= latestCompletedXid are either
+This ensures that all top-level XIDs <= latestCompletedFullXid are either
 present in the ProcArray, or not running anymore.  (This guarantee doesn't
 apply to subtransaction XIDs, because of the possibility that there's not
 room for them in the subxid array; instead we guarantee that they are
 present or the overflow flag is set.)  If a backend released XidGenLock
 before storing its XID into MyPgXact, then it would be possible for another
-backend to allocate and commit a later XID, causing latestCompletedXid to
+backend to allocate and commit a later XID, causing latestCompletedFullXid to
 pass the first backend's XID, before that value became visible in the
-ProcArray.  That would break GetOldestXmin, as discussed below.
+ProcArray.  That would break ComputeTransactionHorizons, as discussed below.
 
 We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
 subxid array) without taking ProcArrayLock.  This was once necessary to
@@ -293,42 +293,54 @@ once, rather than assume they can read it multiple times and get the same
 answer each time.  (Use volatile-qualified pointers when doing this, to
 ensure that the C compiler does exactly what you tell it to.)
 
-Another important activity that uses the shared ProcArray is GetOldestXmin,
-which must determine a lower bound for the oldest xmin of any active MVCC
-snapshot, system-wide.  Each individual backend advertises the smallest
-xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
-live snapshots (eg, if it's between transactions or hasn't yet set a
-snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
-valid xmin fields.  It does this with only shared lock on ProcArrayLock,
-which means there is a potential race condition against other backends
-doing GetSnapshotData concurrently: we must be certain that a concurrent
-backend that is about to set its xmin does not compute an xmin less than
-what GetOldestXmin returns.  We ensure that by including all the active
-XIDs into the MIN() calculation, along with the valid xmins.  The rule that
-transactions can't exit without taking exclusive ProcArrayLock ensures that
-concurrent holders of shared ProcArrayLock will compute the same minimum of
-currently-active XIDs: no xact, in particular not the oldest, can exit
-while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
-active XID will be the same as that of any concurrent GetSnapshotData, and
-so it can't produce an overestimate.  If there is no active transaction at
-all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
-for the xmin that might be computed by concurrent or later GetSnapshotData
-calls.  (We know that no XID less than this could be about to appear in
-the ProcArray, because of the XidGenLock interlock discussed above.)
+Another important activity that uses the shared ProcArray is
+ComputeTransactionHorizons, which must determine a lower bound for the oldest
+xmin of any active MVCC snapshot, system-wide.  Each individual backend
+advertises the smallest xmin of its own snapshots in MyPgXact->xmin, or zero
+if it currently has no live snapshots (eg, if it's between transactions or
+hasn't yet set a snapshot for a new transaction).
+ComputeTransactionHorizons takes the MIN() of the valid xmin fields.  It
+does this with only shared lock on ProcArrayLock, which means there is a
+potential race condition against other backends doing GetSnapshotData
+concurrently: we must be certain that a concurrent backend that is about to
+set its xmin does not compute an xmin less than what
+ComputeTransactionHorizons determines.  We ensure that by including all the
+active XIDs into the MIN() calculation, along with the valid xmins.  The
+rule that transactions can't exit without taking exclusive ProcArrayLock
+ensures that concurrent holders of shared ProcArrayLock will compute the
+same minimum of currently-active XIDs: no xact, in particular not the
+oldest, can exit while we hold shared ProcArrayLock.  So
+ComputeTransactionHorizons's view of the minimum active XID will be the same
+as that of any concurrent GetSnapshotData, and so it can't produce an
+overestimate.  If there is no active transaction at all,
+ComputeTransactionHorizons uses latestCompletedFullXid + 1, which is a lower
+bound for the xmin that might be computed by concurrent or later
+GetSnapshotData calls.  (We know that no XID less than this could be about
+to appear in the ProcArray, because of the XidGenLock interlock discussed
+above.)
 
-GetSnapshotData also performs an oldest-xmin calculation (which had better
-match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
-for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
-too expensive.  Note that while it is certain that two concurrent
-executions of GetSnapshotData will compute the same xmin for their own
-snapshots, as argued above, it is not certain that they will arrive at the
-same estimate of RecentGlobalXmin.  This is because we allow XID-less
-transactions to clear their MyPgXact->xmin asynchronously (without taking
-ProcArrayLock), so one execution might see what had been the oldest xmin,
-and another not.  This is OK since RecentGlobalXmin need only be a valid
-lower bound.  As noted above, we are already assuming that fetch/store
-of the xid fields is atomic, so assuming it for xmin as well is no extra
-risk.
+As GetSnapshotData is performance critical, it does not perform an
+accurate oldest-xmin calculation (it used to, until v13). The contents
+of a snapshot only depend on the xids of other backends, not their
+xmin. As a backend's xmin changes much more often than its xid, having
+GetSnapshotData look at xmins can lead to a lot of unnecessary
+cacheline ping-pong.  Instead GetSnapshotData updates approximate
+thresholds (one that guarantees that all deleted rows older than it
+can be removed, another determining that deleted rows newer than it
+cannot be removed). InvisibleToEveryoneTest* uses those thresholds to
+make invisibility decisions, falling back to ComputeTransactionHorizons
+if necessary.
+
+Note that while it is certain that two concurrent executions of
+GetSnapshotData will compute the same xmin for their own snapshots,
+there is no such guarantee for the horizons computed by
+ComputeTransactionHorizons.  This is because we allow XID-less
+transactions to clear their MyPgXact->xmin asynchronously (without
+taking ProcArrayLock), so one execution might see what had been the
+oldest xmin, and another not.  This is OK since the thresholds need
+only be valid lower bounds.  As noted above, we are already assuming
+that fetch/store of the xid fields is atomic, so assuming it for xmin
+as well is no extra risk.
 
 
 pg_xact and pg_subtrans
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2570e7086a7..43973130b7c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -566,3 +566,51 @@ GetNewObjectId(void)
 
 	return result;
 }
+
+
+#ifdef USE_ASSERT_CHECKING
+
+/*
+ * Assert that xid is one that we could actually see on disk.
+ *
+ * As ShmemVariableCache->oldestXid could change just after this call
+ * without further precautions, and as xid could just fall between the bounds
+ * due to xid wraparound, this can only detect if something is definitely
+ * wrong, but not establish correctness.
+ *
+ * This intentionally does not expose a return value, to avoid code being
+ * introduced that depends on the return value.
+ */
+void AssertTransactionIdMayBeOnDisk(TransactionId xid)
+{
+	TransactionId oldest_xid;
+	TransactionId next_xid;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* we may see bootstrap / frozen */
+	if (!TransactionIdIsNormal(xid))
+		return;
+
+	/*
+	 * We can't acquire XidGenLock, as this may be called with XidGenLock
+	 * already held (or with other locks that don't allow XidGenLock to be
+	 * nested). That's ok for our purposes though, since we already rely on
+	 * 32bit reads to be atomic. While nextFullXid is 64 bit, we only look at
+	 * the lower 32bit, so a skewed read doesn't hurt.
+	 *
+	 * There's no increased danger of falling outside [oldest, next] by
+	 * accessing them without a lock. xid needs to have been created with
+	 * GetNewTransactionId() in the originating session, and the locks there
+	 * pair with the memory barrier below.  We do however accept xid to be <=
+	 * to next_xid, instead of just <, as xid could be from the procarray,
+	 * before we see the updated nextFullXid value.
+	 */
+	pg_memory_barrier();
+	oldest_xid = ShmemVariableCache->oldestXid;
+	next_xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+
+	Assert(TransactionIdFollowsOrEquals(xid, oldest_xid) ||
+		   TransactionIdPrecedesOrEquals(xid, next_xid));
+}
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index abf954ba392..8ce853c81d4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7810,10 +7810,11 @@ StartupXLOG(void)
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
 	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
-	/* also initialize latestCompletedXid, to nextXid - 1 */
+	/* also initialize latestCompletedFullXid, to nextFullXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
-	TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);
+	ShmemVariableCache->latestCompletedFullXid =
+		ShmemVariableCache->nextFullXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedFullXid);
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -9023,7 +9024,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9382,7 +9383,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c816..7b75945c4a9 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1056,7 +1056,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestVisibleTransactionId(onerel);
 
 	/* Prepare for sampling block numbers */
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3a89f8fe1e2..7055b237337 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -957,8 +957,25 @@ vacuum_set_xid_limits(Relation rel,
 	 * working on a particular table at any time, and that each vacuum is
 	 * always an independent transaction.
 	 */
-	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+	*oldestXmin = GetOldestVisibleTransactionId(rel);
+
+	if (OldSnapshotThresholdActive())
+	{
+		TransactionId limit_xmin;
+		TimestampTz limit_ts;
+
+		if (TransactionIdLimitedForOldSnapshots(*oldestXmin, rel, &limit_xmin, &limit_ts))
+		{
+			/*
+			 * TODO: We should only set the threshold if we are pruning on the
+			 * basis of the increased limits. Not as crucial here as it is for
+			 * opportunistic pruning (which often happens at a much higher
+			 * frequency), but would still be a significant improvement.
+			 */
+			SetOldSnapshotThresholdTimestamp(limit_ts, limit_xmin);
+			*oldestXmin = limit_xmin;
+		}
+	}
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -1347,12 +1364,13 @@ vac_update_datfrozenxid(void)
 	bool		dirty = false;
 
 	/*
-	 * Initialize the "min" calculation with GetOldestXmin, which is a
-	 * reasonable approximation to the minimum relfrozenxid for not-yet-
-	 * committed pg_class entries for new tables; see AddNewRelationTuple().
-	 * So we cannot produce a wrong minimum by starting with this.
+	 * Initialize the "min" calculation with GetOldestVisibleTransactionId(),
+	 * which is a reasonable approximation to the minimum relfrozenxid for
+	 * not-yet-committed pg_class entries for new tables; see
+	 * AddNewRelationTuple().  So we cannot produce a wrong minimum by
+	 * starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+	newFrozenXid = GetOldestVisibleTransactionId(NULL);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
@@ -1683,8 +1701,9 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	StartTransactionCommand();
 
 	/*
-	 * Functions in indexes may want a snapshot set.  Also, setting a snapshot
-	 * ensures that RecentGlobalXmin is kept truly recent.
+	 * Need to acquire a snapshot to prevent pg_subtrans from being truncated,
+	 * cutoff xids in local memory wrapping around, and to have updated xmin
+	 * horizons.
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e97ffab27d..df1af9354ce 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1878,6 +1878,10 @@ get_database_list(void)
 	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 	 * for anything that reads heap pages, because HOT may decide to prune
 	 * them even if the process doesn't attempt to modify any tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	StartTransactionCommand();
 	(void) GetTransactionSnapshot();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e9871..eb9b1c87caf 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -122,6 +122,12 @@ get_subscription_list(void)
 	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 	 * for anything that reads heap pages, because HOT may decide to prune
 	 * them even if the process doesn't attempt to modify any tuples.)
+	 *
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed). Also, this is
+	 * not reading pg_database.
 	 */
 	StartTransactionCommand();
 	(void) GetTransactionSnapshot();
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index aee67c61aa6..2975242b5b3 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1176,22 +1176,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 */
 	if (hot_standby_feedback)
 	{
-		TransactionId slot_xmin;
-
-		/*
-		 * Usually GetOldestXmin() would include both global replication slot
-		 * xmin and catalog_xmin in its calculations, but we want to derive
-		 * separate values for each of those. So we ask for an xmin that
-		 * excludes the catalog_xmin.
-		 */
-		xmin = GetOldestXmin(NULL,
-							 PROCARRAY_FLAGS_DEFAULT | PROCARRAY_SLOTS_XMIN);
-
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
-
-		if (TransactionIdIsValid(slot_xmin) &&
-			TransactionIdPrecedes(slot_xmin, xmin))
-			xmin = slot_xmin;
+		GetReplicationHorizons(&xmin, &catalog_xmin);
 	}
 	else
 	{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9e5611574cc..d7088d19fd6 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2097,9 +2097,10 @@ ProcessStandbyHSFeedbackMessage(void)
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
-	 * the xmin will be taken into account by GetOldestXmin.  This will hold
-	 * back the removal of dead rows and thereby prevent the generation of
-	 * cleanup conflicts on the standby server.
+	 * the xmin will be taken into account by GetSnapshotData() /
+	 * ComputeTransactionHorizons().  This will hold back the removal of dead
+	 * rows and thereby prevent the generation of cleanup conflicts on the
+	 * standby server.
 	 *
 	 * There is a small window for a race condition here: although we just
 	 * checked that feedbackXmin precedes nextXid, the nextXid could have
@@ -2112,10 +2113,10 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * own xmin would prevent nextXid from advancing so far.
 	 *
 	 * We don't bother taking the ProcArrayLock here.  Setting the xmin field
-	 * is assumed atomic, and there's no real need to prevent a concurrent
-	 * GetOldestXmin.  (If we're moving our xmin forward, this is obviously
-	 * safe, and if we're moving it backwards, well, the data is at risk
-	 * already since a VACUUM could have just finished calling GetOldestXmin.)
+	 * is assumed atomic, and there's no real need to prevent concurrent
+	 * horizon determinations.  (If we're moving our xmin forward, this is
+	 * obviously safe, and if we're moving it backwards, well, the data is at
+	 * risk already since a VACUUM could already have determined the horizon.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
 	 * otherwise via the walsender's PGXACT entry. We can only track the
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 363000670b2..a1823caf632 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -99,6 +99,98 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+/*
+ * State for testing whether tuple versions may be removed. To improve
+ * GetSnapshotData() performance we don't compute an accurate value whenever
+ * acquiring a snapshot. Instead we compute boundaries above/below which we
+ * know that row versions are [not] needed anymore.  If at test time a value
+ * falls in between the two, the boundaries can be recomputed (unless that
+ * just happened).
+ *
+ * The thresholds are FullTransactionIds instead of TransactionIds as it
+ * otherwise would be possible that, since the time the values were last
+ * computed, other activity in the system would lead to them being considered
+ * in the future. There is no procarray state preventing that from happening.
+ *
+ * The typedef is in the header.
+ */
+struct InvisibleToEveryoneState
+{
+	/*
+	 * Xids above definitely_needed_bound are considered as definitely not
+	 * removable. Xids below may be old enough to be removed, but unless
+	 * they're older than maybe_needed_bound, the procarray needs to be
+	 * consulted to be sure.
+	 */
+	FullTransactionId definitely_needed_bound;
+
+	/*
+	 * Xids below maybe_needed_bound are definitely removable.
+	 */
+	FullTransactionId maybe_needed_bound;
+};
+
+/* state for ComputeTransactionHorizons() */
+typedef struct ComputedHorizons
+{
+	/*
+	 * The value of ShmemVariableCache->latestCompletedFullXid when
+	 * ComputeTransactionHorizons() held ProcArrayLock.
+	 */
+	FullTransactionId latest_completed;
+
+	/*
+	 * The same for procArray->replication_slot_xmin and
+	 * procArray->replication_slot_catalog_xmin.
+	 */
+	TransactionId slot_xmin;
+	TransactionId slot_catalog_xmin;
+
+	/*
+	 * Oldest xid that any backend might think is still running. This needs to
+	 * include processes running VACUUM, in contrast to the normal visibility
+	 * cutoffs, as vacuum needs to be able to perform pg_subtrans lookups when
+	 * determining visibility, but doesn't care about rows above its xmin
+	 * being removed.
+	 *
+	 * This likely should only be needed to determine whether pg_subtrans can
+	 * be truncated. It currently includes the effects of replication slots,
+	 * for historical reasons. But that could likely be changed.
+	 */
+	TransactionId oldest_considered_running;
+
+	/*
+	 * Oldest xid that may be necessary to retain in shared tables.
+	 *
+	 * This includes the effects of replication slots. If that's not desired,
+	 * look at shared_oldest_visible_raw.
+	 */
+	TransactionId shared_oldest_visible;
+
+	/*
+	 * Oldest xid that may be necessary to retain in shared tables,
+	 * but is not affected by replication slot's catalog_xmin.
+	 *
+	 * This is mainly useful to be able to send the catalog_xmin to upstream
+	 * streaming replication servers via hot_standby_feedback, so they can
+	 * apply the limit only when accessing catalog tables.
+	 */
+	TransactionId shared_oldest_visible_raw;
+
+	/*
+	 * Oldest xid that may be necessary to retain in non-shared catalog
+	 * tables.
+	 */
+	TransactionId catalog_oldest_visible;
+
+	/*
+	 * Oldest xid that may be necessary to retain in normal user defined
+	 * tables.
+	 */
+	TransactionId data_oldest_visible;
+} ComputedHorizons;
+
+
 static ProcArrayStruct *procArray;
 
 static PGPROC *allProcs;
@@ -118,6 +210,23 @@ static TransactionId latestObservedXid = InvalidTransactionId;
  */
 static TransactionId standbySnapshotPendingXmin;
 
+/*
+ * State for visibility checks on different types of relations. See struct
+ * InvisibleToEveryoneState for details. As shared, catalog, and user defined
+ * relations can have different horizons, one such state exists for each.
+ */
+static InvisibleToEveryoneState InvisibleShared;
+static InvisibleToEveryoneState InvisibleCatalog;
+static InvisibleToEveryoneState InvisibleData;
+
+/*
+ * This backend's RecentXmin at the last time the accurate xmin horizon was
+ * recomputed, or InvalidTransactionId if it has not. Used to limit how many
+ * times accurate horizons are recomputed. See
+ * InvisibleToEveryoneShouldUpdateHorizons().
+ */
+static TransactionId ComputedHorizonsLastXmin;
+
 #ifdef XIDCACHE_DEBUG
 
 /* counters for XidCache measurement */
@@ -175,6 +284,9 @@ static void KnownAssignedXidsReset(void);
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc,
 												   PGXACT *pgxact, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
+static void MaintainLatestCompletedXid(TransactionId latestXid);
+
+static inline FullTransactionId FullXidViaRelative(FullTransactionId rel, TransactionId xid);
 
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
@@ -351,9 +463,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
 		/* Advance global latestCompletedXid while holding the lock */
-		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-								  latestXid))
-			ShmemVariableCache->latestCompletedXid = latestXid;
+		MaintainLatestCompletedXid(latestXid);
 	}
 	else
 	{
@@ -466,9 +576,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	pgxact->overflowed = false;
 
 	/* Also advance global latestCompletedXid while holding the lock */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  latestXid))
-		ShmemVariableCache->latestCompletedXid = latestXid;
+	MaintainLatestCompletedXid(latestXid);
 }
 
 /*
@@ -623,6 +731,29 @@ ProcArrayClearTransaction(PGPROC *proc)
 	pgxact->overflowed = false;
 }
 
+/*
+ * Update ShmemVariableCache->latestCompletedFullXid to point to latestXid if
+ * currently older.
+ */
+static void
+MaintainLatestCompletedXid(TransactionId latestXid)
+{
+	FullTransactionId cur_latest = ShmemVariableCache->latestCompletedFullXid;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+	Assert(FullTransactionIdIsValid(cur_latest));
+
+	if (TransactionIdPrecedes(XidFromFullTransactionId(cur_latest), latestXid))
+	{
+		FullTransactionId fxid = FullXidViaRelative(cur_latest, latestXid);
+
+		ShmemVariableCache->latestCompletedFullXid = fxid;
+	}
+
+	Assert(IsBootstrapProcessingMode() ||
+		   FullTransactionIdIsNormal(ShmemVariableCache->latestCompletedFullXid));
+}
+
 /*
  * ProcArrayInitRecovery -- initialize recovery xid mgmt environment
  *
@@ -667,6 +798,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	TransactionId *xids;
 	int			nxids;
 	int			i;
+	FullTransactionId fxid;
 
 	Assert(standbyState >= STANDBY_INITIALIZED);
 	Assert(TransactionIdIsValid(running->nextXid));
@@ -843,7 +975,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	 * Now we've got the running xids we need to set the global values that
 	 * are used to track snapshots as they evolve further.
 	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
+	 * - latestCompletedFullXid which will be the xmax for snapshots
 	 * - lastOverflowedXid which shows whether snapshots overflow
 	 * - nextXid
 	 *
@@ -867,24 +999,26 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 		standbySnapshotPendingXmin = InvalidTransactionId;
 	}
 
-	/*
-	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
-	 */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  running->latestCompletedXid))
-		ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
-
-	Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
-
-	LWLockRelease(ProcArrayLock);
 
 	/* ShmemVariableCache->nextFullXid must be beyond any observed xid. */
 	AdvanceNextFullTransactionIdPastXid(latestObservedXid);
 
 	Assert(FullTransactionIdIsValid(ShmemVariableCache->nextFullXid));
 
+	/*
+	 * If a transaction wrote a commit record in the gap between taking and
+	 * logging the snapshot then latestCompletedFullXid may already be higher
+	 * than the value from the snapshot, so check before we use the incoming
+	 * value. It also might not yet be set at all.
+	 */
+	fxid = FullXidViaRelative(ShmemVariableCache->nextFullXid,
+							  running->latestCompletedXid);
+	if (!FullTransactionIdIsValid(ShmemVariableCache->latestCompletedFullXid) ||
+		FullTransactionIdFollows(fxid, ShmemVariableCache->latestCompletedFullXid))
+		ShmemVariableCache->latestCompletedFullXid = fxid;
+
+	LWLockRelease(ProcArrayLock);
+
 	KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
 	if (standbyState == STANDBY_SNAPSHOT_READY)
 		elog(trace_recovery(DEBUG1), "recovery snapshots are now enabled");
@@ -1050,10 +1184,11 @@ TransactionIdIsInProgress(TransactionId xid)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
-	 * Now that we have the lock, we can check latestCompletedXid; if the
+	 * Now that we have the lock, we can check latestCompletedFullXid; if the
 	 * target Xid is after that, it's surely still running.
 	 */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid, xid))
+	if (TransactionIdPrecedes(XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid),
+							  xid))
 	{
 		LWLockRelease(ProcArrayLock);
 		xc_by_latest_xid_inc();
@@ -1250,159 +1385,166 @@ TransactionIdIsActive(TransactionId xid)
 
 
 /*
- * GetOldestXmin -- returns oldest transaction that was running
- *					when any current transaction was started.
+ * Determine horizons due to concurrently running transactions.
  *
- * If rel is NULL or a shared relation, all backends are considered, otherwise
- * only backends running in this database are considered.
- *
- * The flags are used to ignore the backends in calculation when any of the
- * corresponding flags is set. Typically, if you want to ignore ones with
- * PROC_IN_VACUUM flag, you can use PROCARRAY_FLAGS_VACUUM.
- *
- * PROCARRAY_SLOTS_XMIN causes GetOldestXmin to ignore the xmin and
- * catalog_xmin of any replication slots that exist in the system when
- * calculating the oldest xmin.
- *
- * This is used by VACUUM to decide which deleted tuples must be preserved in
- * the passed in table. For shared relations backends in all databases must be
- * considered, but for non-shared relations that's not required, since only
- * backends in my own database could ever see the tuples in them. Also, we can
- * ignore concurrently running lazy VACUUMs because (a) they must be working
- * on other tables, and (b) they don't need to do snapshot-based lookups.
- *
- * This is also used to determine where to truncate pg_subtrans.  For that
- * backends in all databases have to be considered, so rel = NULL has to be
- * passed in.
+ * This is used by wrapper functions for more specific use cases like HOT
+ * pruning, vacuuming and pg_subtrans truncations.
  *
  * Note: we include all currently running xids in the set of considered xids.
  * This ensures that if a just-started xact has not yet set its snapshot,
  * when it does set the snapshot it cannot set xmin less than what we compute.
  * See notes in src/backend/access/transam/README.
  *
- * Note: despite the above, it's possible for the calculated value to move
- * backwards on repeated calls. The calculated value is conservative, so that
- * anything older is definitely not considered as running by anyone anymore,
- * but the exact value calculated depends on a number of things. For example,
- * if rel = NULL and there are no transactions running in the current
- * database, GetOldestXmin() returns latestCompletedXid. If a transaction
+ * Note: despite the above, it's possible for the calculated values to move
+ * backwards on repeated calls. The calculated values are conservative, so
+ * that anything older is definitely not considered as running by anyone
+ * anymore, but the exact values calculated depend on a number of things. For
+ * example, if there are no transactions running in the current database, the
+ * horizon for normal tables will be latestCompletedFullXid. If a transaction
  * begins after that, its xmin will include in-progress transactions in other
  * databases that started earlier, so another call will return a lower value.
  * Nonetheless it is safe to vacuum a table in the current database with the
  * first result.  There are also replication-related effects: a walsender
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
- * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * unless the standby uses a replication slot to make its xmin persistent
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
- * The return value is also adjusted with vacuum_defer_cleanup_age, so
- * increasing that setting on the fly is another easy way to make
- * GetOldestXmin() move backwards, with no consequences for data integrity.
+ * making the values go backwards.  In this case there is a possibility that
+ * we lose data that the standby would like to have, but unless the standby
+ * uses a replication slot to make its xmin persistent there is little we can
+ * do about that --- data is only protected if the walsender runs continuously
+ * while queries are executed on the standby.  (The Hot Standby code deals
+ * with such cases by failing standby queries that needed to access
+ * already-removed data, so there's no integrity bug.)  The computed values
+ * are also adjusted with vacuum_defer_cleanup_age, so increasing that setting
+ * on the fly is another easy way to make horizons move backwards, with no
+ * consequences for data integrity.
  */
-TransactionId
-GetOldestXmin(Relation rel, int flags)
+static void
+ComputeTransactionHorizons(ComputedHorizons *h)
 {
 	ProcArrayStruct *arrayP = procArray;
-	TransactionId result;
-	int			index;
-	bool		allDbs;
+	TransactionId kaxmin;
+	bool		in_recovery = RecoveryInProgress();
 
-	TransactionId replication_slot_xmin = InvalidTransactionId;
-	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
-
-	/*
-	 * If we're not computing a relation specific limit, or if a shared
-	 * relation has been passed in, backends in all databases have to be
-	 * considered.
-	 */
-	allDbs = rel == NULL || rel->rd_rel->relisshared;
-
-	/* Cannot look for individual databases during recovery */
-	Assert(allDbs || !RecoveryInProgress());
+	/* inferred after ProcArrayLock is released */
+	h->catalog_oldest_visible = InvalidTransactionId;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	/*
-	 * We initialize the MIN() calculation with latestCompletedXid + 1. This
-	 * is a lower bound for the XIDs that might appear in the ProcArray later,
-	 * and so protects us against overestimating the result due to future
-	 * additions.
-	 */
-	result = ShmemVariableCache->latestCompletedXid;
-	Assert(TransactionIdIsNormal(result));
-	TransactionIdAdvance(result);
+	h->latest_completed = ShmemVariableCache->latestCompletedFullXid;
 
-	for (index = 0; index < arrayP->numProcs; index++)
+	/*
+	 * We initialize the MIN() calculation with latestCompletedFullXid + 1.
+	 * This is a lower bound for the XIDs that might appear in the ProcArray
+	 * later, and so protects us against overestimating the result due to
+	 * future additions.
+	 */
+	{
+		TransactionId initial;
+
+		initial = XidFromFullTransactionId(h->latest_completed);
+		Assert(TransactionIdIsValid(initial));
+		TransactionIdAdvance(initial);
+
+		h->oldest_considered_running = initial;
+		h->shared_oldest_visible = initial;
+		h->data_oldest_visible = initial;
+	}
+
+	/*
+	 * Fetch slot horizons while ProcArrayLock is held - the
+	 * LWLockAcquire/LWLockRelease are a barrier, ensuring this happens inside
+	 * the lock.
+	 */
+	h->slot_xmin = procArray->replication_slot_xmin;
+	h->slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+
+	for (int index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
 		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		TransactionId xid;
+		TransactionId xmin;
 
-		if (pgxact->vacuumFlags & (flags & PROCARRAY_PROC_FLAGS_MASK))
+		/* Fetch xid just once - see GetNewTransactionId */
+		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+
+		/*
+		 * Consider both the transaction's Xmin, and its Xid.
+		 *
+		 * We must check both because a transaction might have an Xmin but not
+		 * (yet) an Xid; conversely, if it has an Xid, that could determine
+		 * some not-yet-set Xmin.
+		 */
+		xmin = TransactionIdOlder(xmin, xid);
+
+		/* if neither is set, this proc doesn't influence the horizon */
+		if (!TransactionIdIsValid(xmin))
 			continue;
 
-		if (allDbs ||
+		/*
+		 * Don't ignore any procs when determining which transactions might be
+		 * considered running.  While slots should ensure logical decoding
+		 * backends are protected even without this check, it can't hurt to
+		 * include them here as well.
+		 */
+		h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running, xmin);
+
+		/*
+		 * Skip over backends either vacuuming (which is ok with rows being
+		 * removed, as long as pg_subtrans is not truncated) or doing logical
+		 * decoding (which manages xmin separately, check below).
+		 */
+		if (pgxact->vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
+			continue;
+
+		/* shared tables need to take backends in all databases into account */
+		h->shared_oldest_visible =
+			TransactionIdOlder(h->shared_oldest_visible, xmin);
+
+		/*
+		 * Normally queries in other databases are ignored for anything but
+		 * the shared horizon. But in recovery we cannot compute an accurate
+		 * per-database horizon as all xids are managed via the
+		 * KnownAssignedXids machinery.
+		 */
+		if (in_recovery ||
 			proc->databaseId == MyDatabaseId ||
 			proc->databaseId == 0)	/* always include WalSender */
 		{
-			/* Fetch xid just once - see GetNewTransactionId */
-			TransactionId xid = UINT32_ACCESS_ONCE(pgxact->xid);
-
-			/* First consider the transaction's own Xid, if any */
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, result))
-				result = xid;
-
-			/*
-			 * Also consider the transaction's Xmin, if set.
-			 *
-			 * We must check both Xid and Xmin because a transaction might
-			 * have an Xmin but not (yet) an Xid; conversely, if it has an
-			 * Xid, that could determine some not-yet-set Xmin.
-			 */
-			xid = UINT32_ACCESS_ONCE(pgxact->xmin);
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, result))
-				result = xid;
+			h->data_oldest_visible =
+				TransactionIdOlder(h->data_oldest_visible, xmin);
 		}
 	}
 
 	/*
-	 * Fetch into local variable while ProcArrayLock is held - the
-	 * LWLockRelease below is a barrier, ensuring this happens inside the
-	 * lock.
+	 * If in recovery, fetch the oldest xid in KnownAssignedXids; it will be
+	 * applied after the lock is released.
 	 */
-	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (in_recovery)
+		kaxmin = KnownAssignedXidsGetOldestXmin();
 
-	if (RecoveryInProgress())
+	/*
+	 * No other information needed, so release the lock immediately. The rest
+	 * of the computations can be done without a lock.
+	 */
+	LWLockRelease(ProcArrayLock);
+
+	if (in_recovery)
 	{
-		/*
-		 * Check to see whether KnownAssignedXids contains an xid value older
-		 * than the main procarray.
-		 */
-		TransactionId kaxmin = KnownAssignedXidsGetOldestXmin();
-
-		LWLockRelease(ProcArrayLock);
-
-		if (TransactionIdIsNormal(kaxmin) &&
-			TransactionIdPrecedes(kaxmin, result))
-			result = kaxmin;
+		h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running, kaxmin);
+		h->shared_oldest_visible =
+			TransactionIdOlder(h->shared_oldest_visible, kaxmin);
+		h->data_oldest_visible =
+			TransactionIdOlder(h->data_oldest_visible, kaxmin);
 	}
 	else
 	{
 		/*
-		 * No other information needed, so release the lock immediately.
-		 */
-		LWLockRelease(ProcArrayLock);
-
-		/*
-		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age,
-		 * being careful not to generate a "permanent" XID.
+		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age.
 		 *
 		 * vacuum_defer_cleanup_age provides some additional "slop" for the
 		 * benefit of hot standby queries on standby servers.  This is quick
@@ -1414,34 +1556,141 @@ GetOldestXmin(Relation rel, int flags)
 		 * in varsup.c.  Also note that we intentionally don't apply
 		 * vacuum_defer_cleanup_age on standby servers.
 		 */
-		result -= vacuum_defer_cleanup_age;
-		if (!TransactionIdIsNormal(result))
-			result = FirstNormalTransactionId;
+		h->oldest_considered_running =
+			TransactionIdRetreatedBy(h->oldest_considered_running,
+									 vacuum_defer_cleanup_age);
+		h->shared_oldest_visible =
+			TransactionIdRetreatedBy(h->shared_oldest_visible,
+									 vacuum_defer_cleanup_age);
+		h->data_oldest_visible =
+			TransactionIdRetreatedBy(h->data_oldest_visible,
+									 vacuum_defer_cleanup_age);
 	}
 
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
 	 */
-	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
-		TransactionIdIsValid(replication_slot_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_xmin, result))
-		result = replication_slot_xmin;
+	h->shared_oldest_visible =
+		TransactionIdOlder(h->shared_oldest_visible, h->slot_xmin);
+	h->data_oldest_visible =
+		TransactionIdOlder(h->data_oldest_visible, h->slot_xmin);
 
 	/*
-	 * After locks have been released and vacuum_defer_cleanup_age has been
-	 * applied, check whether we need to back up further to make logical
-	 * decoding possible. We need to do so if we're computing the global limit
-	 * (rel = NULL) or if the passed relation is a catalog relation of some
-	 * kind.
+	 * The only difference between catalog / data horizons is that the slot's
+	 * catalog xmin is applied to the catalog one (so catalogs can be accessed
+	 * for logical decoding). Initialize with data horizon, and then back up
+	 * further if necessary. Have to back up the shared horizon as well, since
+	 * that also can contain catalogs.
 	 */
-	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
-		(rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	h->shared_oldest_visible_raw = h->shared_oldest_visible;
+	h->shared_oldest_visible =
+		TransactionIdOlder(h->shared_oldest_visible,
+						   h->slot_catalog_xmin);
+	h->catalog_oldest_visible = h->data_oldest_visible;
+	h->catalog_oldest_visible =
+		TransactionIdOlder(h->catalog_oldest_visible,
+						   h->slot_catalog_xmin);
 
-	return result;
+	/*
+	 * It's possible that slots / vacuum_defer_cleanup_age backed up the
+	 * horizons further than oldest_considered_running. Fix.
+	 */
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->shared_oldest_visible);
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->catalog_oldest_visible);
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->data_oldest_visible);
+
+	/* shared horizons have to be at least as old as the oldest visible in current db */
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_visible, h->data_oldest_visible));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_visible, h->catalog_oldest_visible));
+
+	/*
+	 * Horizons need to ensure that pg_subtrans access is still possible for
+	 * the relevant backends.
+	 */
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->shared_oldest_visible));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->catalog_oldest_visible));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->data_oldest_visible));
+	Assert(!TransactionIdIsValid(h->slot_xmin) ||
+		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->slot_xmin));
+	Assert(!TransactionIdIsValid(h->slot_catalog_xmin) ||
+		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->slot_catalog_xmin));
+}
+
+/*
+ * Return the oldest transaction id that might still be considered as visible
+ * by any backend. Rows that are only visible to transactions before the
+ * returned xid can safely be removed.
+ *
+ * If rel is not NULL the horizon may be considerably more recent than if NULL
+ * were passed. In the NULL case a horizon that is correct (but not optimal)
+ * for all relations will be returned.
+ */
+TransactionId
+GetOldestVisibleTransactionId(Relation rel)
+{
+	ComputedHorizons horizons;
+
+	ComputeTransactionHorizons(&horizons);
+
+	/*
+	 * If we're not computing a relation specific limit, or if a shared
+	 * relation has been passed in, backends in all databases have to be
+	 * considered.
+	 */
+	if (rel == NULL || rel->rd_rel->relisshared)
+		return horizons.shared_oldest_visible;
+
+	if (RelationIsAccessibleInLogicalDecoding(rel))
+		return horizons.catalog_oldest_visible;
+
+	return horizons.data_oldest_visible;
+}
+
+/*
+ * Return the oldest transaction id any currently running backend might still
+ * think is running. This should not be used for visibility / pruning
+ * determinations (see GetOldestVisibleTransactionId()), but for decisions
+ * like up to where pg_subtrans can be truncated.
+ */
+TransactionId
+GetOldestTransactionIdConsideredRunning(void)
+{
+	ComputedHorizons horizons;
+
+	ComputeTransactionHorizons(&horizons);
+
+	return horizons.oldest_considered_running;
+}
+
+/*
+ * Return the visibility horizons for a hot standby feedback message.
+ */
+void
+GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin)
+{
+	ComputedHorizons horizons;
+
+	ComputeTransactionHorizons(&horizons);
+
+	/*
+	 * Don't want to use shared_oldest_visible here, as that contains the
+	 * effect of replication slot's catalog_xmin. We want to send a separate
+	 * feedback for the catalog horizon, so the primary can remove data table
+	 * contents more aggressively.
+	 */
+	*xmin = horizons.shared_oldest_visible_raw;
+	*catalog_xmin = horizons.slot_catalog_xmin;
 }
 
 /*
@@ -1492,12 +1741,9 @@ GetMaxSnapshotSubxidCount(void)
  *			current transaction (this is the same as MyPgXact->xmin).
  *		RecentXmin: the xmin computed for the most recent snapshot.  XIDs
  *			older than this are known not running any more.
- *		RecentGlobalXmin: the global xmin (oldest TransactionXmin across all
- *			running transactions, except those running LAZY VACUUM).  This is
- *			the same computation done by
- *			GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM).
- *		RecentGlobalDataXmin: the global xmin for non-catalog tables
- *			>= RecentGlobalXmin
+ *
+ * And update the state in InvisibleShared, InvisibleCatalog, InvisibleData
+ * for the benefit of InvisibleToEveryone*.
  *
  * Note: this function should probably not be called with an argument that's
  * not statically allocated (see xip allocation below).
@@ -1508,11 +1754,12 @@ GetSnapshotData(Snapshot snapshot)
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId xmin;
 	TransactionId xmax;
-	TransactionId globalxmin;
 	int			index;
 	int			count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
+	FullTransactionId latest_completed;
+	TransactionId oldestxid;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1556,13 +1803,16 @@ GetSnapshotData(Snapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
+	latest_completed = ShmemVariableCache->latestCompletedFullXid;
+	oldestxid = ShmemVariableCache->oldestXid;
+
 	/* xmax is always latestCompletedXid + 1 */
-	xmax = ShmemVariableCache->latestCompletedXid;
-	Assert(TransactionIdIsNormal(xmax));
+	xmax = XidFromFullTransactionId(latest_completed);
 	TransactionIdAdvance(xmax);
+	Assert(TransactionIdIsNormal(xmax));
 
 	/* initialize xmin calculation with xmax */
-	globalxmin = xmin = xmax;
+	xmin = xmax;
 
 	snapshot->takenDuringRecovery = RecoveryInProgress();
 
@@ -1591,12 +1841,6 @@ GetSnapshotData(Snapshot snapshot)
 				(PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
-			/* Update globalxmin to be the smallest valid xmin */
-			xid = UINT32_ACCESS_ONCE(pgxact->xmin);
-			if (TransactionIdIsNormal(xid) &&
-				NormalTransactionIdPrecedes(xid, globalxmin))
-				globalxmin = xid;
-
 			/* Fetch xid just once - see GetNewTransactionId */
 			xid = UINT32_ACCESS_ONCE(pgxact->xid);
 
@@ -1712,34 +1956,78 @@ GetSnapshotData(Snapshot snapshot)
 
 	LWLockRelease(ProcArrayLock);
 
-	/*
-	 * Update globalxmin to include actual process xids.  This is a slightly
-	 * different way of computing it than GetOldestXmin uses, but should give
-	 * the same result.
-	 */
-	if (TransactionIdPrecedes(xmin, globalxmin))
-		globalxmin = xmin;
+	/* maintain state for invisible-to-everyone tests */
+	{
+		TransactionId def_vis_xid;
+		TransactionId def_vis_xid_data;
+		FullTransactionId def_vis_fxid;
+		FullTransactionId def_vis_fxid_data;
+		FullTransactionId oldestfxid;
 
-	/* Update global variables too */
-	RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
-	if (!TransactionIdIsNormal(RecentGlobalXmin))
-		RecentGlobalXmin = FirstNormalTransactionId;
+		/*
+		 * Converting oldestXid is only safe when xid horizon cannot advance,
+		 * i.e. holding locks. While we don't hold the lock anymore, all the
+		 * necessary data has been gathered with lock held.
+		 */
+		oldestfxid = FullXidViaRelative(latest_completed, oldestxid);
 
-	/* Check whether there's a replication slot requiring an older xmin. */
-	if (TransactionIdIsValid(replication_slot_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
-		RecentGlobalXmin = replication_slot_xmin;
+		/* apply vacuum_defer_cleanup_age */
+		def_vis_xid_data =
+			TransactionIdRetreatedBy(xmin, vacuum_defer_cleanup_age);
 
-	/* Non-catalog tables can be vacuumed if older than this xid */
-	RecentGlobalDataXmin = RecentGlobalXmin;
+		/* Check whether there's a replication slot requiring an older xmin. */
+		def_vis_xid_data =
+			TransactionIdOlder(def_vis_xid_data, replication_slot_xmin);
 
-	/*
-	 * Check whether there's a replication slot requiring an older catalog
-	 * xmin.
-	 */
-	if (TransactionIdIsNormal(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, RecentGlobalXmin))
-		RecentGlobalXmin = replication_slot_catalog_xmin;
+		/*
+		 * Rows in non-shared, non-catalog tables possibly could be vacuumed
+		 * if older than this xid.
+		 */
+		def_vis_xid = def_vis_xid_data;
+
+		/*
+		 * Check whether there's a replication slot requiring an older catalog
+		 * xmin.
+		 */
+		def_vis_xid =
+			TransactionIdOlder(replication_slot_catalog_xmin, def_vis_xid);
+
+		def_vis_fxid = FullXidViaRelative(latest_completed, def_vis_xid);
+		def_vis_fxid_data = FullXidViaRelative(latest_completed, def_vis_xid_data);
+
+		/*
+		 * Check if we can increase upper bound. As a previous
+		 * InvisibleToEveryoneUpdateHorizons() might have computed more
+		 * aggressive values, don't overwrite them if so.
+		 */
+		InvisibleShared.definitely_needed_bound =
+			FullTransactionIdNewer(def_vis_fxid,
+								   InvisibleShared.definitely_needed_bound);
+		InvisibleCatalog.definitely_needed_bound =
+			FullTransactionIdNewer(def_vis_fxid,
+								   InvisibleCatalog.definitely_needed_bound);
+		InvisibleData.definitely_needed_bound =
+			FullTransactionIdNewer(def_vis_fxid_data,
+								   InvisibleData.definitely_needed_bound);
+
+		/*
+		 * Check if we know that we can initialize or increase the lower
+		 * bound. Currently the only cheap way to do so is to use
+		 * ShmemVariableCache->oldestXid as input.
+		 *
+		 * We should definitely be able to do better. We could e.g. put a
+		 * global lower bound value into ShmemVariableCache.
+		 */
+		InvisibleShared.maybe_needed_bound =
+			FullTransactionIdNewer(InvisibleShared.maybe_needed_bound,
+								   oldestfxid);
+		InvisibleCatalog.maybe_needed_bound =
+			FullTransactionIdNewer(InvisibleCatalog.maybe_needed_bound,
+								   oldestfxid);
+		InvisibleData.maybe_needed_bound =
+			FullTransactionIdNewer(InvisibleData.maybe_needed_bound,
+								   oldestfxid);
+	}
 
 	RecentXmin = xmin;
 
@@ -1986,7 +2274,7 @@ GetRunningTransactionData(void)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	LWLockAcquire(XidGenLock, LW_SHARED);
 
-	latestCompletedXid = ShmemVariableCache->latestCompletedXid;
+	latestCompletedXid = XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid);
 
 	oldestRunningXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
 
@@ -3209,9 +3497,11 @@ XidCacheRemoveRunningXids(TransactionId xid,
 		elog(WARNING, "did not find subXID %u in MyProc", xid);
 
 	/* Also advance global latestCompletedXid while holding the lock */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
+	if (TransactionIdPrecedes(XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid),
 							  latestXid))
-		ShmemVariableCache->latestCompletedXid = latestXid;
+		ShmemVariableCache->latestCompletedFullXid =
+			FullXidViaRelative(ShmemVariableCache->latestCompletedFullXid,
+							   latestXid);
 
 	LWLockRelease(ProcArrayLock);
 }
@@ -3238,6 +3528,273 @@ DisplayXidCache(void)
 }
 #endif							/* XIDCACHE_DEBUG */
 
+/*
+ * Initialize a test that determines whether rows with given xids are still
+ * needed by backends that can access rel. If rel is NULL, the
+ * test state will be appropriate to test if there's any table in the system
+ * that may still need a row with such an xid.
+ *
+ * This needs to be called while holding a snapshot, otherwise there are
+ * wraparound and other dangers.
+ */
+InvisibleToEveryoneState *
+InvisibleToEveryoneTestInit(Relation rel)
+{
+	bool need_shared;
+	bool need_catalog;
+	InvisibleToEveryoneState *state;
+
+	/* cannot safely be used without holding a snapshot */
+	Assert(SnapshotSet());
+
+	if (!rel)
+		need_shared = need_catalog = true;
+	else
+	{
+		/*
+		 * Other kinds currently don't contain xids, nor always the necessary
+		 * logical decoding markers.
+		 */
+		Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+			   rel->rd_rel->relkind == RELKIND_MATVIEW ||
+			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+
+		need_shared = rel->rd_rel->relisshared || RecoveryInProgress();
+		need_catalog = IsCatalogRelation(rel) || RelationIsAccessibleInLogicalDecoding(rel);
+	}
+
+	if (need_shared)
+		state = &InvisibleShared;
+	else if (need_catalog)
+		state = &InvisibleCatalog;
+	else
+		state = &InvisibleData;
+
+	Assert(FullTransactionIdIsValid(state->definitely_needed_bound) &&
+		   FullTransactionIdIsValid(state->maybe_needed_bound));
+
+	return state;
+}
+
+/*
+ * Return true if it's worth updating the accurate maybe_needed_bound visibility boundary.
+ *
+ * As it is somewhat expensive to determine xmin horizons, we don't want to
+ * repeatedly do so when there is a low likelihood of it being
+ * beneficial.
+ *
+ * The current heuristic is that we at most do so once per snapshot computed,
+ * and for further computations of the snapshot, we only recompute if the xmin
+ * horizon has changed since. The latter indicates that transactions have
+ * completed since.
+ */
+static bool
+InvisibleToEveryoneShouldUpdateHorizons(InvisibleToEveryoneState *state)
+{
+	/* hasn't been computed yet in this transaction */
+	if (!TransactionIdIsValid(ComputedHorizonsLastXmin))
+		return true;
+
+	/*
+	 * If the maybe_needed_bound/definitely_needed_bound boundaries are the
+	 * same, it's unlikely to be beneficial to recompute boundaries.
+	 */
+	if (FullTransactionIdFollowsOrEquals(state->maybe_needed_bound,
+										 state->definitely_needed_bound))
+		return false;
+
+	/* snapshot computation has yielded different xmin since last update */
+	return RecentXmin != ComputedHorizonsLastXmin;
+}
+
+/*
+ * Update the boundaries in Invisible{Shared,Catalog,Data} with accurate
+ * values.
+ */
+static void
+InvisibleToEveryoneUpdateHorizons(void)
+{
+	ComputedHorizons horizons;
+
+	ComputeTransactionHorizons(&horizons);
+
+	InvisibleShared.maybe_needed_bound =
+		FullXidViaRelative(horizons.latest_completed,
+						   horizons.shared_oldest_visible);
+	InvisibleCatalog.maybe_needed_bound =
+		FullXidViaRelative(horizons.latest_completed,
+						   horizons.catalog_oldest_visible);
+	InvisibleData.maybe_needed_bound =
+		FullXidViaRelative(horizons.latest_completed,
+						   horizons.data_oldest_visible);
+
+	/*
+	 * In longer running transactions it's possible that transactions we
+	 * previously needed to treat as running aren't around anymore. So update
+	 * definitely_needed_bound to not be earlier than maybe_needed_bound.
+	 */
+	InvisibleShared.definitely_needed_bound =
+		FullTransactionIdNewer(InvisibleShared.maybe_needed_bound,
+							   InvisibleShared.definitely_needed_bound);
+	InvisibleCatalog.definitely_needed_bound =
+		FullTransactionIdNewer(InvisibleCatalog.maybe_needed_bound,
+							   InvisibleCatalog.definitely_needed_bound);
+	InvisibleData.definitely_needed_bound =
+		FullTransactionIdNewer(InvisibleData.maybe_needed_bound,
+							   InvisibleData.definitely_needed_bound);
+
+	ComputedHorizonsLastXmin = RecentXmin;
+}
+
+/*
+ * Return true if rows that have become invisible at fxid are not visible to
+ * any backend anymore, false otherwise.
+ *
+ * The state passed needs to have been initialized for the relation fxid is
+ * from (NULL is also OK), otherwise the result may not be correct.
+ */
+bool
+InvisibleToEveryoneTestFullXid(InvisibleToEveryoneState *state, FullTransactionId fxid)
+{
+	/*
+	 * If the xid is older than maybe_needed_bound, it definitely can be
+	 * removed (even though maybe_needed_bound is approximate, it can only be
+	 * older than the accurate bound).
+	 */
+	if (FullTransactionIdPrecedes(fxid, state->maybe_needed_bound))
+		return true;
+
+	/*
+	 * If the xid is >= definitely_needed_bound, it can't be removed,
+	 * and updating our horizons would not help (or at least be fairly
+	 * unlikely to).
+	 */
+	if (FullTransactionIdFollowsOrEquals(fxid, state->definitely_needed_bound))
+		return false;
+
+	/*
+	 * The value is between maybe_needed_bound and definitely_needed_bound,
+	 * i.e. it may or may not still be visible. If we haven't already done so,
+	 * recompute bounds, and recheck.
+	 */
+	if (InvisibleToEveryoneShouldUpdateHorizons(state))
+	{
+		InvisibleToEveryoneUpdateHorizons();
+
+		Assert(FullTransactionIdPrecedes(fxid, state->definitely_needed_bound));
+
+		return FullTransactionIdPrecedes(fxid, state->maybe_needed_bound);
+	}
+	else
+		return false;
+}
+
+/*
+ * Wrapper around InvisibleToEveryoneTestFullXid() that accepts 32bit xids.
+ *
+ * It is crucial that this only gets called for xids from a source that
+ * protects against xid wraparounds (e.g. from a table and thus protected by
+ * relfrozenxid).
+ */
+bool
+InvisibleToEveryoneTestXid(InvisibleToEveryoneState *state, TransactionId xid)
+{
+	FullTransactionId fxid;
+
+	/*
+	 * Convert 32 bit argument to FullTransactionId. We can do so safely
+	 * because we know the xid has to, at the very least, be between
+	 * [oldestXid, nextFullXid), i.e. within 2 billion of xid. To avoid taking
+	 * a lock to determine either, we can just compare with
+	 * state->definitely_needed_bound, which was based on those values at the
+	 * time the current snapshot was built.
+	 */
+	fxid = FullXidViaRelative(state->definitely_needed_bound, xid);
+
+	return InvisibleToEveryoneTestFullXid(state, fxid);
+}
+
+/*
+ * Return FullTransactionId below which rows that have become invisible are
+ * not visible to any backend anymore.
+ *
+ * Note: This is less efficient than testing with
+ * InvisibleToEveryoneTestFullXid because it will require computing an
+ * accurate value, even if all the values compared with the return value
+ * would be determined invisible due to being < state->maybe_needed_bound.
+ *
+ */
+FullTransactionId
+InvisibleToEveryoneTestFullCutoff(InvisibleToEveryoneState *state)
+{
+	/* acquire accurate horizon if not already done */
+	if (InvisibleToEveryoneShouldUpdateHorizons(state))
+		InvisibleToEveryoneUpdateHorizons();
+
+	return state->maybe_needed_bound;
+}
+
+/* wrapper around InvisibleToEveryoneTestFullCutoff */
+TransactionId
+InvisibleToEveryoneTestCutoff(InvisibleToEveryoneState *state)
+{
+	return XidFromFullTransactionId(InvisibleToEveryoneTestFullCutoff(state));
+}
+
+/*
+ * Convenience wrapper around InvisibleToEveryoneTestInit() and
+ * InvisibleToEveryoneTestFullXid(), see their comments.
+ */
+bool
+InvisibleToEveryoneCheckFullXid(Relation rel, FullTransactionId fxid)
+{
+	InvisibleToEveryoneState *state;
+
+	state = InvisibleToEveryoneTestInit(rel);
+
+	return InvisibleToEveryoneTestFullXid(state, fxid);
+}
+
+/*
+ * Convenience wrapper around InvisibleToEveryoneTestInit() and
+ * InvisibleToEveryoneTestXid(), see their comments.
+ */
+bool
+InvisibleToEveryoneCheckXid(Relation rel, TransactionId xid)
+{
+	InvisibleToEveryoneState *state;
+
+	state = InvisibleToEveryoneTestInit(rel);
+
+	return InvisibleToEveryoneTestXid(state, xid);
+}
+
+/*
+ * Convert a 32 bit transaction id into 64 bit transaction id, by assuming it
+ * is within MaxTransactionId / 2 of XidFromFullTransactionId(rel).
+ *
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that xid is within MaxTransactionId / 2 xids of
+ * rel. That e.g. can be guaranteed if the caller ensures a snapshot is
+ * held by the backend and xid is from a table (where vacuum/freezing ensures
+ * the xid has to be within that range), or if xid is from the procarray and
+ * prevents xid wraparound that way.
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+	TransactionId rel_xid = XidFromFullTransactionId(rel);
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(TransactionIdIsValid(rel_xid));
+
+	/* not guaranteed to find issues, but likely to catch mistakes */
+	AssertTransactionIdMayBeOnDisk(xid);
+
+	return FullTransactionIdFromU64(
+		U64FromFullTransactionId(rel) + (int32)(xid - rel_xid));
+}
+
 
 /* ----------------------------------------------
  *		KnownAssignedTransactionIds sub-module
@@ -3390,9 +3947,7 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
 
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  max_xid))
-		ShmemVariableCache->latestCompletedXid = max_xid;
+	MaintainLatestCompletedXid(max_xid);
 
 	LWLockRelease(ProcArrayLock);
 }
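
A note on the FullXidViaRelative() helper in the procarray.c changes: the
conversion relies on a modulo-2^32 subtraction whose result is then
reinterpreted as a signed 32-bit offset. Below is a minimal standalone sketch
of that arithmetic, not the patch's code. The names xid32, xid64 and
widen_xid_relative are made up for illustration, and the sketch assumes the
usual two's-complement behaviour when converting uint32_t to int32_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for TransactionId and FullTransactionId. */
typedef uint32_t xid32;
typedef uint64_t xid64;

/*
 * Widen a 32-bit xid to 64 bits, assuming it lies within 2^31 of the low
 * 32 bits of the reference value rel. The caller has to guarantee that
 * distance; nothing here can detect a violation.
 */
static xid64
widen_xid_relative(xid64 rel, xid32 xid)
{
	xid32		rel_xid = (xid32) rel;	/* low 32 bits of the reference */

	/*
	 * Unsigned subtraction wraps modulo 2^32. Reinterpreting the result as
	 * signed (assumed two's complement) yields a negative offset for xids
	 * older than rel_xid and a positive one for newer xids.
	 */
	int32_t		delta = (int32_t) (xid - rel_xid);

	/* Adding the sign-extended offset carries into the epoch if needed. */
	return rel + (uint64_t) (int64_t) delta;
}
```

With a reference in epoch 5 at xid 100, the 32-bit value 90 stays in epoch 5,
while 0xFFFFFFF0 is taken to be 116 xids older and so lands in epoch 4.
Conversely, with a reference of 0xFFFFFFF0 in epoch 4, the value 5 is taken
to be 21 xids newer and lands in epoch 5. The result is only meaningful when
the within-2^31 precondition described in the helper's comment holds.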
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 4fdcb07d97b..fb94c114a50 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5591,14 +5591,15 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * recent); that case motivates not using SnapshotAny here.
 	 *
 	 * A crucial point here is that SnapshotNonVacuumable, with
-	 * RecentGlobalXmin as horizon, yields the inverse of the condition that
-	 * the indexscan will use to decide that index entries are killable (see
-	 * heap_hot_search_buffer()).  Therefore, if the snapshot rejects a tuple
-	 * (or more precisely, all tuples of a HOT chain) and we have to continue
-	 * scanning past it, we know that the indexscan will mark that index entry
-	 * killed.  That means that the next get_actual_variable_endpoint() call
-	 * will not have to re-consider that index entry.  In this way we avoid
-	 * repetitive work when this function is used a lot during planning.
+	 * InvisibleToEveryoneTestInit(heapRel) as horizon, yields the inverse of
+	 * the condition that the indexscan will use to decide that index entries
+	 * are killable (see heap_hot_search_buffer()).  Therefore, if the
+	 * snapshot rejects a tuple (or more precisely, all tuples of a HOT chain)
+	 * and we have to continue scanning past it, we know that the indexscan
+	 * will mark that index entry killed.  That means that the next
+	 * get_actual_variable_endpoint() call will not have to re-consider that
+	 * index entry.  In this way we avoid repetitive work when this function
+	 * is used a lot during planning.
 	 *
 	 * But using SnapshotNonVacuumable creates a hazard of its own.  In a
 	 * recently-created index, some index entries may point at "broken" HOT
@@ -5610,7 +5611,8 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * or could even be NULL.  We avoid this hazard because we take the data
 	 * from the index entry not the heap.
 	 */
-	InitNonVacuumableSnapshot(SnapshotNonVacuumable, RecentGlobalXmin);
+	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
+							  InvisibleToEveryoneTestInit(heapRel));
 
 	index_scan = index_beginscan(heapRel, indexRel,
 								 &SnapshotNonVacuumable,
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d5..893be2f3ddb 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -722,6 +722,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	 * is critical for anything that reads heap pages, because HOT may decide
 	 * to prune them even if the process doesn't attempt to modify any
 	 * tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	if (!bootstrap)
 	{
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3b148ae30a6..1182233bf43 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -157,16 +157,9 @@ static Snapshot HistoricSnapshot = NULL;
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
- *
- * RecentGlobalXmin and RecentGlobalDataXmin are initialized to
- * InvalidTransactionId, to ensure that no one tries to use a stale
- * value. Readers should ensure that it has been set to something else
- * before using it.
  */
 TransactionId TransactionXmin = FirstNormalTransactionId;
 TransactionId RecentXmin = FirstNormalTransactionId;
-TransactionId RecentGlobalXmin = InvalidTransactionId;
-TransactionId RecentGlobalDataXmin = InvalidTransactionId;
 
 /* (table, ctid) => (cmin, cmax) mapping during timetravel */
 static HTAB *tuplecid_data = NULL;
@@ -583,9 +576,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	 * Even though we are not going to use the snapshot it computes, we must
 	 * call GetSnapshotData, for two reasons: (1) to be sure that
 	 * CurrentSnapshotData's XID arrays have been allocated, and (2) to update
-	 * RecentXmin and RecentGlobalXmin.  (We could alternatively include those
-	 * two variables in exported snapshot files, but it seems better to have
-	 * snapshot importers compute reasonably up-to-date values for them.)
+	 * the state for InvisibleToEveryone*.
 	 */
 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
@@ -977,36 +968,6 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 		return 0;
 }
 
-/*
- * Get current RecentGlobalXmin value, as a FullTransactionId.
- */
-FullTransactionId
-GetFullRecentGlobalXmin(void)
-{
-	FullTransactionId nextxid_full;
-	uint32		nextxid_epoch;
-	TransactionId nextxid_xid;
-	uint32		epoch;
-
-	Assert(TransactionIdIsNormal(RecentGlobalXmin));
-
-	/*
-	 * Compute the epoch from the next XID's epoch. This relies on the fact
-	 * that RecentGlobalXmin must be within the 2 billion XID horizon from the
-	 * next XID.
-	 */
-	nextxid_full = ReadNextFullTransactionId();
-	nextxid_epoch = EpochFromFullTransactionId(nextxid_full);
-	nextxid_xid = XidFromFullTransactionId(nextxid_full);
-
-	if (RecentGlobalXmin > nextxid_xid)
-		epoch = nextxid_epoch - 1;
-	else
-		epoch = nextxid_epoch;
-
-	return FullTransactionIdFromEpochAndXid(epoch, RecentGlobalXmin);
-}
-
 /*
  * SnapshotResetXmin
  *
@@ -1776,106 +1737,151 @@ GetOldSnapshotThresholdTimestamp(void)
 	return threshold_timestamp;
 }
 
-static void
+void
 SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 {
 	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
+	Assert(oldSnapshotControl->threshold_timestamp <= ts);
+	Assert(TransactionIdPrecedesOrEquals(oldSnapshotControl->threshold_xid, xlimit));
 	oldSnapshotControl->threshold_timestamp = ts;
 	oldSnapshotControl->threshold_xid = xlimit;
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+void
+SnapshotTooOldMagicForTest(void)
+{
+	TimestampTz ts = GetSnapshotCurrentTimestamp();
+
+	Assert(old_snapshot_threshold == 0);
+
+	ts -= 5 * USECS_PER_SEC;
+
+	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
+	oldSnapshotControl->threshold_timestamp = ts;
+	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
+}
+
+/*
+ * If there is a valid mapping for the timestamp, set *xlimitp to
+ * that. Returns whether there is such a mapping.
+ */
+static bool
+GetOldSnapshotFromTimeMapping(TimestampTz ts, TransactionId *xlimitp)
+{
+	bool in_mapping = false;
+
+	Assert(ts == AlignTimestampToMinuteBoundary(ts));
+
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+	if (oldSnapshotControl->count_used > 0
+		&& ts >= oldSnapshotControl->head_timestamp)
+	{
+		int			offset;
+
+		offset = ((ts - oldSnapshotControl->head_timestamp)
+				  / USECS_PER_MINUTE);
+		if (offset > oldSnapshotControl->count_used - 1)
+			offset = oldSnapshotControl->count_used - 1;
+		offset = (oldSnapshotControl->head_offset + offset)
+			% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+		*xlimitp = oldSnapshotControl->xid_by_minute[offset];
+
+		in_mapping = true;
+	}
+
+	LWLockRelease(OldSnapshotTimeMapLock);
+
+	return in_mapping;
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
- * Apply old snapshot limit, if any.  This is intended to be called for page
- * pruning and table vacuuming, to allow old_snapshot_threshold to override
- * the normal global xmin value.  Actual testing for snapshot too old will be
- * based on whether a snapshot timestamp is prior to the threshold timestamp
- * set in this function.
+ * Apply old snapshot limit.  This is intended to be called for page pruning
+ * and table vacuuming, to allow old_snapshot_threshold to override the normal
+ * global xmin value.  Actual testing for snapshot too old will be based on
+ * whether a snapshot timestamp is prior to the threshold timestamp set in
+ * this function.
+ *
+ * If the limited horizon allows a cleanup action that otherwise would not be
+ * possible, SetOldSnapshotThresholdTimestamp(*limit_ts, *limit_xid) needs to
+ * be called before that cleanup action.
  */
-TransactionId
+bool
 TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-									Relation relation)
+									Relation relation,
+									TransactionId *limit_xid,
+									TimestampTz *limit_ts)
 {
-	if (TransactionIdIsNormal(recentXmin)
-		&& old_snapshot_threshold >= 0
-		&& RelationAllowsEarlyPruning(relation))
+	TimestampTz ts;
+	TransactionId xlimit = recentXmin;
+	TransactionId latest_xmin;
+	TimestampTz next_map_update_ts;
+	TimestampTz threshold_timestamp;
+	TransactionId threshold_xid;
+
+	Assert(TransactionIdIsNormal(recentXmin));
+	Assert(OldSnapshotThresholdActive());
+	Assert(limit_ts != NULL && limit_xid != NULL);
+
+	if (!RelationAllowsEarlyPruning(relation))
+		return false;
+
+	ts = GetSnapshotCurrentTimestamp();
+
+	SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
+	latest_xmin = oldSnapshotControl->latest_xmin;
+	next_map_update_ts = oldSnapshotControl->next_map_update;
+	SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
+
+	/*
+	 * Zero threshold always overrides to latest xmin, if valid.  Without
+	 * some heuristic it will find its own snapshot too old on, for
+	 * example, a simple UPDATE -- which would make it useless for most
+	 * testing, but there is no principled way to ensure that it doesn't
+	 * fail in this way.  Use a five-second delay to try to get useful
+	 * testing behavior, but this may need adjustment.
+	 */
+	if (old_snapshot_threshold == 0)
 	{
-		TimestampTz ts = GetSnapshotCurrentTimestamp();
-		TransactionId xlimit = recentXmin;
-		TransactionId latest_xmin;
-		TimestampTz update_ts;
-		bool		same_ts_as_threshold = false;
-
-		SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
-		latest_xmin = oldSnapshotControl->latest_xmin;
-		update_ts = oldSnapshotControl->next_map_update;
-		SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
-
-		/*
-		 * Zero threshold always overrides to latest xmin, if valid.  Without
-		 * some heuristic it will find its own snapshot too old on, for
-		 * example, a simple UPDATE -- which would make it useless for most
-		 * testing, but there is no principled way to ensure that it doesn't
-		 * fail in this way.  Use a five-second delay to try to get useful
-		 * testing behavior, but this may need adjustment.
-		 */
-		if (old_snapshot_threshold == 0)
-		{
-			if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
-				&& TransactionIdFollows(latest_xmin, xlimit))
-				xlimit = latest_xmin;
-
-			ts -= 5 * USECS_PER_SEC;
-			SetOldSnapshotThresholdTimestamp(ts, xlimit);
-
-			return xlimit;
-		}
+		if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
+			&& TransactionIdFollows(latest_xmin, xlimit))
+			xlimit = latest_xmin;
 
+		ts -= 5 * USECS_PER_SEC;
+	}
+	else
+	{
 		ts = AlignTimestampToMinuteBoundary(ts)
 			- (old_snapshot_threshold * USECS_PER_MINUTE);
 
 		/* Check for fast exit without LW locking. */
 		SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
-		if (ts == oldSnapshotControl->threshold_timestamp)
-		{
-			xlimit = oldSnapshotControl->threshold_xid;
-			same_ts_as_threshold = true;
-		}
+		threshold_timestamp = oldSnapshotControl->threshold_timestamp;
+		threshold_xid = oldSnapshotControl->threshold_xid;
 		SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 
-		if (!same_ts_as_threshold)
+		if (ts == threshold_timestamp)
+		{
+			/*
+			 * Current timestamp is in same bucket as the last limit that
+			 * was applied. Reuse.
+			 */
+			xlimit = threshold_xid;
+		}
+		else if (ts == next_map_update_ts)
+		{
+			/*
+			 * FIXME: This branch is super iffy - but that should probably be
+			 * fixed separately.
+			 */
+			xlimit = latest_xmin;
+		}
+		else if (GetOldSnapshotFromTimeMapping(ts, &xlimit))
 		{
-			if (ts == update_ts)
-			{
-				xlimit = latest_xmin;
-				if (NormalTransactionIdFollows(xlimit, recentXmin))
-					SetOldSnapshotThresholdTimestamp(ts, xlimit);
-			}
-			else
-			{
-				LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
-
-				if (oldSnapshotControl->count_used > 0
-					&& ts >= oldSnapshotControl->head_timestamp)
-				{
-					int			offset;
-
-					offset = ((ts - oldSnapshotControl->head_timestamp)
-							  / USECS_PER_MINUTE);
-					if (offset > oldSnapshotControl->count_used - 1)
-						offset = oldSnapshotControl->count_used - 1;
-					offset = (oldSnapshotControl->head_offset + offset)
-						% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
-					xlimit = oldSnapshotControl->xid_by_minute[offset];
-
-					if (NormalTransactionIdFollows(xlimit, recentXmin))
-						SetOldSnapshotThresholdTimestamp(ts, xlimit);
-				}
-
-				LWLockRelease(OldSnapshotTimeMapLock);
-			}
 		}
 
 		/*
@@ -1890,12 +1896,18 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 		if (TransactionIdIsNormal(latest_xmin)
 			&& TransactionIdPrecedes(latest_xmin, xlimit))
 			xlimit = latest_xmin;
-
-		if (NormalTransactionIdFollows(xlimit, recentXmin))
-			return xlimit;
 	}
 
-	return recentXmin;
+	if (TransactionIdIsValid(xlimit) &&
+		TransactionIdFollowsOrEquals(xlimit, recentXmin))
+	{
+		*limit_ts = ts;
+		*limit_xid = xlimit;
+
+		return true;
+	}
+
+	return false;
 }
 
 /*
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 8f43f3e9dfb..b16facad70c 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -413,7 +413,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 	/*
 	 * This assertion matches the one in index_getnext_tid().  See page
-	 * recycling/RecentGlobalXmin notes in nbtree README.
+	 * recycling/"invisible to everyone" notes in nbtree README.
 	 */
 	Assert(SnapshotSet());
 
@@ -1437,7 +1437,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * does not occur until no possible index scan could land on the page.
 	 * Index scans can follow links with nothing more than their snapshot as
 	 * an interlock and be sure of at least that much.  (See page
-	 * recycling/RecentGlobalXmin notes in nbtree README.)
+	 * recycling/"invisible to everyone" notes in nbtree README.)
 	 *
 	 * Furthermore, it's okay if we follow a rightlink and find a half-dead or
 	 * dead (ignorable) page one or more times.  There will either be a
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 0cd1160ceb2..ee1fb208e07 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
 
-	if (all_visible)
-	{
-		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
-	}
-
 	rel = relation_open(relid, AccessShareLock);
 
 	/* Only some relkinds have a visibility map */
 	check_relation_relkind(rel);
 
+	if (all_visible)
+		OldestXmin = GetOldestVisibleTransactionId(rel);
+
 	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/*
@@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * From a concurrency point of view, it sort of sucks to
 				 * retake ProcArrayLock here while we're holding the buffer
 				 * exclusively locked, but it should be safe against
-				 * deadlocks, because surely GetOldestXmin() should never take
-				 * a buffer lock. And this shouldn't happen often, so it's
-				 * worth being careful so as to avoid false positives.
+				 * deadlocks, because surely GetOldestVisibleTransactionId()
+				 * should never take a buffer lock. And this shouldn't happen
+				 * often, so it's worth being careful so as to avoid false
+				 * positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+				RecomputedOldestXmin = GetOldestVisibleTransactionId(rel);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 96d837485fa..b664f95e865 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
 
-	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestVisibleTransactionId(rel);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
-- 
2.25.0.114.g5b0ca878e0

v7-0006-snapshot-scalability-Move-PGXACT-xmin-back-to-PGP.patch (text/x-diff; charset=us-ascii)
From 3c0c74d5ed455d2c74646205bacb7cbab25f2596 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 06/11] snapshot scalability: Move PGXACT->xmin back to
 PGPROC.

Now that xmin isn't needed for GetSnapshotData() anymore, keeping it in
PGXACT just leads to unnecessary cacheline conflicts: each backend's
entry shares a cacheline with other backends' PGXACT data (whose xmins
of course also change frequently).
---
 src/include/storage/proc.h                  | 10 +++---
 src/backend/access/gist/gistxlog.c          |  2 +-
 src/backend/access/nbtree/nbtpage.c         |  2 +-
 src/backend/access/nbtree/nbtxlog.c         |  2 +-
 src/backend/access/transam/README           |  4 +--
 src/backend/access/transam/twophase.c       |  2 +-
 src/backend/commands/indexcmds.c            |  2 +-
 src/backend/replication/logical/snapbuild.c |  6 ++--
 src/backend/replication/walsender.c         | 10 +++---
 src/backend/storage/ipc/procarray.c         | 36 +++++++++------------
 src/backend/storage/ipc/sinvaladt.c         |  2 +-
 src/backend/storage/lmgr/proc.c             |  4 +--
 src/backend/utils/time/snapmgr.c            | 30 ++++++++---------
 13 files changed, 54 insertions(+), 58 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 23d12c1f72f..3b3936249ab 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -95,6 +95,11 @@ struct PGPROC
 
 	Latch		procLatch;		/* generic latch for process */
 
+	TransactionId xmin;			/* minimal running XID as it was when we were
+								 * starting our xact, excluding LAZY VACUUM:
+								 * vacuum must not remove tuples deleted by
+								 * xid >= xmin ! */
+
 	LocalTransactionId lxid;	/* local id of top-level transaction currently
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
@@ -219,11 +224,6 @@ typedef struct PGXACT
 								 * executed by this proc, if running and XID
 								 * is assigned; else InvalidTransactionId */
 
-	TransactionId xmin;			/* minimal running XID as it was when we were
-								 * starting our xact, excluding LAZY VACUUM:
-								 * vacuum must not remove tuples deleted by
-								 * xid >= xmin ! */
-
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 66ddbaa5c4a..2a5b5308644 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -389,7 +389,7 @@ gistRedoPageReuse(XLogReaderState *record)
 	 *
 	 * latestRemovedXid was the page's deleteXid.  The
 	 * InvisibleToEveryoneCheckFullXid(deleteXid) test in gistPageRecyclable()
-	 * conceptually mirrors the pgxact->xmin > limitXmin test in
+	 * conceptually mirrors the PGPROC->xmin > limitXmin test in
 	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
 	 * same exclusion effect on master and standby.
 	 */
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 6e5ee3b443e..a21ee727ed4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2185,7 +2185,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	 * we're in VACUUM and would not otherwise have an XID.  Having already
 	 * updated links to the target, ReadNewTransactionId() suffices as an
 	 * upper bound.  Any scan having retained a now-stale link is advertising
-	 * in its PGXACT an xmin less than or equal to the value we read here.  It
+	 * in its PGPROC an xmin less than or equal to the value we read here.  It
 	 * will continue to do so, holding back xmin horizon, for the duration
 	 * of that scan.
 	 */
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 431d7c3d709..d43eb21a3cf 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -928,7 +928,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
 	 *
 	 * latestRemovedXid was the page's btpo.xact.  The
 	 * InvisibleToEveryoneCheckXid test in _bt_page_recyclable() conceptually
-	 * mirrors the pgxact->xmin > limitXmin test in
+	 * mirrors the PGPROC->xmin > limitXmin test in
 	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
 	 * same exclusion effect on master and standby.
 	 */
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index be805a5660b..85c2625ec42 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -296,7 +296,7 @@ ensure that the C compiler does exactly what you tell it to.)
 Another important activity that uses the shared ProcArray is
 ComputeTransactionHorizons, which must determine lower bound for the oldest
 xmin of any active MVCC snapshot, system-wide.  Each individual backend
-advertises the smallest xmin of its own snapshots in MyPgXact->xmin, or zero
+advertises the smallest xmin of its own snapshots in MyProc->xmin, or zero
 if it currently has no live snapshots (eg, if it's between transactions or
 hasn't yet set a snapshot for a new transaction).
 ComputeTransactionHorizons takes the MIN() of the valid xmin fields.  It
@@ -335,7 +335,7 @@ Note that while it is certain that two concurrent executions of
 GetSnapshotData will compute the same xmin for their own snapshots,
 there is no such guarantee for the horizons computed by
 ComputeTransactionHorizons.  This is because we allow XID-less
-transactions to clear their MyPgXact->xmin asynchronously (without
+transactions to clear their MyProc->xmin asynchronously (without
 taking ProcArrayLock), so one execution might see what had been the
 oldest xmin, and another not.  This is OK since the thresholds need
 only be a valid lower bound.  As noted above, we are already assuming
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2f7d4ed59a8..5867cc60f3e 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -464,7 +464,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
 	proc->lxid = (LocalTransactionId) xid;
 	pgxact->xid = xid;
-	pgxact->xmin = InvalidTransactionId;
+	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
 	proc->pid = 0;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 094bf6139f0..b63697da456 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1535,7 +1535,7 @@ DefineIndex(Oid relationId,
 	StartTransactionCommand();
 
 	/* We should now definitely not be advertising any xmin. */
-	Assert(MyPgXact->xmin == InvalidTransactionId);
+	Assert(MyProc->xmin == InvalidTransactionId);
 
 	/*
 	 * The index is now valid in the sense that it contains all currently
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3089f0d5ddc..e9701ea7221 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -553,8 +553,8 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 		elog(ERROR, "cannot build an initial slot snapshot, not all transactions are monitored anymore");
 
 	/* so we don't overwrite the existing value */
-	if (TransactionIdIsValid(MyPgXact->xmin))
-		elog(ERROR, "cannot build an initial slot snapshot when MyPgXact->xmin already is valid");
+	if (TransactionIdIsValid(MyProc->xmin))
+		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
 
 	snap = SnapBuildBuildSnapshot(builder);
 
@@ -575,7 +575,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	}
 #endif
 
-	MyPgXact->xmin = snap->xmin;
+	MyProc->xmin = snap->xmin;
 
 	/* allocate in transaction context */
 	newxip = (TransactionId *)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d7088d19fd6..667ebca4e23 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1948,7 +1948,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 	ReplicationSlot *slot = MyReplicationSlot;
 
 	SpinLockAcquire(&slot->mutex);
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 
 	/*
 	 * For physical replication we don't need the interlock provided by xmin
@@ -2077,7 +2077,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	if (!TransactionIdIsNormal(feedbackXmin)
 		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
-		MyPgXact->xmin = InvalidTransactionId;
+		MyProc->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
 			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
@@ -2119,7 +2119,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * risk already since a VACUUM could already have determined the horizon.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * otherwise via the walsender's PGPROC entry. We can only track the
 	 * catalog xmin separately when using a slot, so we store the least of the
 	 * two provided when not using a slot.
 	 *
@@ -2132,9 +2132,9 @@ ProcessStandbyHSFeedbackMessage(void)
 	{
 		if (TransactionIdIsNormal(feedbackCatalogXmin)
 			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
-			MyPgXact->xmin = feedbackCatalogXmin;
+			MyProc->xmin = feedbackCatalogXmin;
 		else
-			MyPgXact->xmin = feedbackXmin;
+			MyProc->xmin = feedbackXmin;
 	}
 }
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index a1823caf632..52822c74cff 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -543,9 +543,9 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
 		proc->lxid = InvalidLocalTransactionId;
-		pgxact->xmin = InvalidTransactionId;
 		/* must be cleared with xid/xmin: */
 		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false; /* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
@@ -565,9 +565,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 {
 	pgxact->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	pgxact->xmin = InvalidTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+	proc->xmin = InvalidTransactionId;
 	proc->delayChkpt = false; /* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
@@ -719,7 +719,7 @@ ProcArrayClearTransaction(PGPROC *proc)
 	 */
 	pgxact->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	pgxact->xmin = InvalidTransactionId;
+	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
 	/* redundant, but just in case */
@@ -1468,7 +1468,7 @@ ComputeTransactionHorizons(ComputedHorizons *h)
 
 		/* Fetch xid just once - see GetNewTransactionId */
 		xid = UINT32_ACCESS_ONCE(pgxact->xid);
-		xmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*
 		 * Consider both the transaction's Xmin, and its Xid.
@@ -1738,7 +1738,7 @@ GetMaxSnapshotSubxidCount(void)
  *
  * We also update the following backend-global variables:
  *		TransactionXmin: the oldest xmin of any snapshot in use in the
- *			current transaction (this is the same as MyPgXact->xmin).
+ *			current transaction (this is the same as MyProc->xmin).
  *		RecentXmin: the xmin computed for the most recent snapshot.  XIDs
  *			older than this are known not running any more.
  *
@@ -1799,7 +1799,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
-	 * going to set MyPgXact->xmin.
+	 * going to set MyProc->xmin.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -1951,8 +1951,8 @@ GetSnapshotData(Snapshot snapshot)
 	replication_slot_xmin = procArray->replication_slot_xmin;
 	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
-	if (!TransactionIdIsValid(MyPgXact->xmin))
-		MyPgXact->xmin = TransactionXmin = xmin;
+	if (!TransactionIdIsValid(MyProc->xmin))
+		MyProc->xmin = TransactionXmin = xmin;
 
 	LWLockRelease(ProcArrayLock);
 
@@ -2072,7 +2072,7 @@ GetSnapshotData(Snapshot snapshot)
 }
 
 /*
- * ProcArrayInstallImportedXmin -- install imported xmin into MyPgXact->xmin
+ * ProcArrayInstallImportedXmin -- install imported xmin into MyProc->xmin
  *
  * This is called when installing a snapshot imported from another
  * transaction.  To ensure that OldestXmin doesn't go backwards, we must
@@ -2125,7 +2125,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 		/*
 		 * Likewise, let's just make real sure its xmin does cover us.
 		 */
-		xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+		xid = UINT32_ACCESS_ONCE(proc->xmin);
 		if (!TransactionIdIsNormal(xid) ||
 			!TransactionIdPrecedesOrEquals(xid, xmin))
 			continue;
@@ -2136,7 +2136,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 		 * GetSnapshotData first, we'll be overwriting a valid xmin here, so
 		 * we don't check that.)
 		 */
-		MyPgXact->xmin = TransactionXmin = xmin;
+		MyProc->xmin = TransactionXmin = xmin;
 
 		result = true;
 		break;
@@ -2148,7 +2148,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 }
 
 /*
- * ProcArrayInstallRestoredXmin -- install restored xmin into MyPgXact->xmin
+ * ProcArrayInstallRestoredXmin -- install restored xmin into MyProc->xmin
  *
  * This is like ProcArrayInstallImportedXmin, but we have a pointer to the
  * PGPROC of the transaction from which we imported the snapshot, rather than
@@ -2161,7 +2161,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 {
 	bool		result = false;
 	TransactionId xid;
-	PGXACT	   *pgxact;
 
 	Assert(TransactionIdIsNormal(xmin));
 	Assert(proc != NULL);
@@ -2169,20 +2168,18 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 	/* Get lock so source xact can't end while we're doing this */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	pgxact = &allPgXact[proc->pgprocno];
-
 	/*
 	 * Be certain that the referenced PGPROC has an advertised xmin which is
 	 * no later than the one we're installing, so that the system-wide xmin
 	 * can't go backwards.  Also, make sure it's running in the same database,
 	 * so that the per-database xmin cannot go backwards.
 	 */
-	xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+	xid = UINT32_ACCESS_ONCE(proc->xmin);
 	if (proc->databaseId == MyDatabaseId &&
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		MyPgXact->xmin = TransactionXmin = xmin;
+		MyProc->xmin = TransactionXmin = xmin;
 		result = true;
 	}
 
@@ -2807,7 +2804,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 		if (allDbs || proc->databaseId == MyDatabaseId)
 		{
 			/* Fetch xmin just once - might change on us */
-			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 			if (excludeXmin0 && !TransactionIdIsValid(pxmin))
 				continue;
@@ -2893,7 +2890,6 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
 		/* Exclude prepared transactions */
 		if (proc->pid == 0)
@@ -2903,7 +2899,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 			proc->databaseId == dbOid)
 		{
 			/* Fetch xmin just once - can't change on us, but good coding */
-			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 			/*
 			 * We ignore an invalid pxmin because this means that backend has
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index e5c115b92f2..ad048bc85fa 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -420,7 +420,7 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 			PGXACT	   *xact = &ProcGlobal->allPgXact[proc->pgprocno];
 
 			*xid = xact->xid;
-			*xmin = xact->xmin;
+			*xmin = proc->xmin;
 		}
 	}
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5aa19d3f781..66d25dba7f8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -388,7 +388,7 @@ InitProcess(void)
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyPgXact->xid = InvalidTransactionId;
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
 	MyProc->backendId = InvalidBackendId;
@@ -572,7 +572,7 @@ InitAuxiliaryProcess(void)
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyPgXact->xid = InvalidTransactionId;
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
 	MyProc->roleId = InvalidOid;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1182233bf43..01f1c133014 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -27,11 +27,11 @@
  * their lifetime is managed separately (as they live longer than one xact.c
  * transaction).
  *
- * These arrangements let us reset MyPgXact->xmin when there are no snapshots
+ * These arrangements let us reset MyProc->xmin when there are no snapshots
  * referenced by this transaction, and advance it when the one with oldest
  * Xmin is no longer referenced.  For simplicity however, only registered
  * snapshots not active snapshots participate in tracking which one is oldest;
- * we don't try to change MyPgXact->xmin except when the active-snapshot
+ * we don't try to change MyProc->xmin except when the active-snapshot
  * stack is empty.
  *
  *
@@ -187,7 +187,7 @@ static ActiveSnapshotElt *OldestActiveSnapshot = NULL;
 
 /*
  * Currently registered Snapshots.  Ordered in a heap by xmin, so that we can
- * quickly find the one with lowest xmin, to advance our MyPgXact->xmin.
+ * quickly find the one with lowest xmin, to advance our MyProc->xmin.
  */
 static int	xmin_cmp(const pairingheap_node *a, const pairingheap_node *b,
 					 void *arg);
@@ -477,7 +477,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 
 		/*
 		 * Make sure the catalog snapshot will be accounted for in decisions
-		 * about advancing PGXACT->xmin.  We could apply RegisterSnapshot, but
+		 * about advancing PGPROC->xmin.  We could apply RegisterSnapshot, but
 		 * that would result in making a physical copy, which is overkill; and
 		 * it would also create a dependency on some resource owner, which we
 		 * do not want for reasons explained at the head of this file. Instead
@@ -598,7 +598,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	/* NB: curcid should NOT be copied, it's a local matter */
 
 	/*
-	 * Now we have to fix what GetSnapshotData did with MyPgXact->xmin and
+	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
 	 * causing the global xmin to go backwards, we have to test that the
 	 * source transaction is still running, and that has to be done
@@ -855,7 +855,7 @@ bool
 SnapshotSet(void)
 {
 	/* can't be safe, because somehow xmin is not set */
-	if (!TransactionIdIsValid(MyPgXact->xmin) && HistoricSnapshot == NULL)
+	if (!TransactionIdIsValid(MyProc->xmin) && HistoricSnapshot == NULL)
 		return false;
 
 	/*
@@ -971,13 +971,13 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 /*
  * SnapshotResetXmin
  *
- * If there are no more snapshots, we can reset our PGXACT->xmin to InvalidXid.
+ * If there are no more snapshots, we can reset our PGPROC->xmin to InvalidXid.
  * Note we can do this without locking because we assume that storing an Xid
  * is atomic.
  *
  * Even if there are some remaining snapshots, we may be able to advance our
- * PGXACT->xmin to some degree.  This typically happens when a portal is
- * dropped.  For efficiency, we only consider recomputing PGXACT->xmin when
+ * PGPROC->xmin to some degree.  This typically happens when a portal is
+ * dropped.  For efficiency, we only consider recomputing PGPROC->xmin when
  * the active snapshot stack is empty; this allows us not to need to track
  * which active snapshot is oldest.
  *
@@ -998,7 +998,7 @@ SnapshotResetXmin(void)
 
 	if (pairingheap_is_empty(&RegisteredSnapshots))
 	{
-		MyPgXact->xmin = InvalidTransactionId;
+		MyProc->xmin = InvalidTransactionId;
 		TransactionXmin = InvalidTransactionId;
 		RecentXmin = InvalidTransactionId;
 		return;
@@ -1007,8 +1007,8 @@ SnapshotResetXmin(void)
 	minSnapshot = pairingheap_container(SnapshotData, ph_node,
 										pairingheap_first(&RegisteredSnapshots));
 
-	if (TransactionIdPrecedes(MyPgXact->xmin, minSnapshot->xmin))
-		MyPgXact->xmin = minSnapshot->xmin;
+	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
+		MyProc->xmin = minSnapshot->xmin;
 }
 
 /*
@@ -1155,13 +1155,13 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 
 	/*
 	 * During normal commit processing, we call ProcArrayEndTransaction() to
-	 * reset the MyPgXact->xmin. That call happens prior to the call to
+	 * reset the MyProc->xmin. That call happens prior to the call to
 	 * AtEOXact_Snapshot(), so we need not touch xmin here at all.
 	 */
 	if (resetXmin)
 		SnapshotResetXmin();
 
-	Assert(resetXmin || MyPgXact->xmin == 0);
+	Assert(resetXmin || MyProc->xmin == 0);
 }
 
 
@@ -1847,7 +1847,7 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 	 */
 	if (old_snapshot_threshold == 0)
 	{
-		if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
+		if (TransactionIdPrecedes(latest_xmin, MyProc->xmin)
 			&& TransactionIdFollows(latest_xmin, xlimit))
 			xlimit = latest_xmin;
 
-- 
2.25.0.114.g5b0ca878e0

v7-0007-snapshot-scalability-Move-in-progress-xids-to-Pro.patch (text/x-diff; charset=us-ascii)
From d3ae09a783df085789fa4b899163a189eed889de Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 07/11] snapshot scalability: Move in-progress xids to
 ProcGlobal->xids.

This improves performance because every GetSnapshotData() call needs to
access the xids of all procarray entries. As the set of running xids
changes fairly rarely compared to the number of snapshots taken, keeping
the xids in a dense array substantially increases the likelihood of most
data required for a snapshot already being in L2 cache.
---
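
Illustrative note (not part of the patch): the following is a minimal,
standalone sketch of the dense-array bookkeeping that the commit message
describes, under simplifying assumptions. DemoProc, demo_add(), demo_remove(),
demo_set_xid() and demo_count_running() are hypothetical names invented for
this sketch; only the xidCopy / pgxactoff / pgprocnos naming mirrors the
patch. All locking (ProcArrayLock, XidGenLock) and memory barriers are
deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROCS 8

typedef uint32_t TransactionId;

/* Hypothetical, heavily simplified stand-in for PGPROC. */
typedef struct DemoProc
{
	TransactionId xidCopy;		/* copy of dense_xids[pgxactoff] */
	int			pgxactoff;		/* offset into the dense arrays */
} DemoProc;

static DemoProc all_procs[MAX_PROCS];		/* indexed by pgprocno */
static int	pgprocnos[MAX_PROCS];			/* dense, sorted by pgprocno */
static TransactionId dense_xids[MAX_PROCS]; /* dense, parallel to pgprocnos */
static int	num_procs = 0;

/* Insert a proc, keeping both dense arrays sorted and parallel. */
static void
demo_add(int pgprocno)
{
	int			index;

	assert(num_procs < MAX_PROCS);

	for (index = 0; index < num_procs; index++)
	{
		if (pgprocnos[index] > pgprocno)
			break;
	}

	memmove(&pgprocnos[index + 1], &pgprocnos[index],
			(num_procs - index) * sizeof(*pgprocnos));
	memmove(&dense_xids[index + 1], &dense_xids[index],
			(num_procs - index) * sizeof(*dense_xids));

	pgprocnos[index] = pgprocno;
	dense_xids[index] = all_procs[pgprocno].xidCopy;
	num_procs++;

	/* Every entry at or after the insertion point now has a new offset. */
	for (; index < num_procs; index++)
		all_procs[pgprocnos[index]].pgxactoff = index;
}

/* Remove a proc and close the gap in both dense arrays. */
static void
demo_remove(int pgprocno)
{
	int			index = all_procs[pgprocno].pgxactoff;

	assert(pgprocnos[index] == pgprocno);

	memmove(&pgprocnos[index], &pgprocnos[index + 1],
			(num_procs - index - 1) * sizeof(*pgprocnos));
	memmove(&dense_xids[index], &dense_xids[index + 1],
			(num_procs - index - 1) * sizeof(*dense_xids));
	num_procs--;

	/* Entries that moved down by one slot get their offsets refreshed. */
	for (; index < num_procs; index++)
		all_procs[pgprocnos[index]].pgxactoff = index;
}

/* Set (or clear, with 0) a proc's xid in the dense array and in its copy. */
static void
demo_set_xid(int pgprocno, TransactionId xid)
{
	dense_xids[all_procs[pgprocno].pgxactoff] = xid;
	all_procs[pgprocno].xidCopy = xid;
}

/* The snapshot-style scan: one tight loop over contiguous memory. */
static int
demo_count_running(void)
{
	int			running = 0;

	for (int off = 0; off < num_procs; off++)
	{
		if (dense_xids[off] != 0)
			running++;
	}
	return running;
}
```

The point of the layout is that demo_count_running() touches only the
contiguous dense_xids array, while adds and removes (which are rare) pay for
shifting entries and refreshing each affected pgxactoff. In the real patch
those offsets can change whenever the relevant locks are not held, which is
why readers must hold ProcArrayLock at least in shared mode before relying on
a pgxactoff.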
 src/include/storage/proc.h                  |  45 ++-
 src/backend/access/heap/heapam_visibility.c |   8 +-
 src/backend/access/transam/README           |  33 +-
 src/backend/access/transam/clog.c           |   8 +-
 src/backend/access/transam/twophase.c       |  31 +-
 src/backend/access/transam/varsup.c         |  20 +-
 src/backend/commands/vacuum.c               |   4 +-
 src/backend/storage/ipc/procarray.c         | 319 +++++++++++++-------
 src/backend/storage/ipc/sinvaladt.c         |   4 +-
 src/backend/storage/lmgr/lock.c             |   3 +-
 src/backend/storage/lmgr/proc.c             |  33 +-
 11 files changed, 333 insertions(+), 175 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3b3936249ab..60586e8be34 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -83,6 +83,10 @@ struct XidCache
  * distinguished from a real one at need by the fact that it has pid == 0.
  * The semaphore and lock-activity fields in a prepared-xact PGPROC are unused,
  * but its myProcLocks[] lists are valid.
+ *
+ * The various *Copy fields are copies of the data in ProcGlobal arrays that
+ * can be accessed without holding ProcArrayLock / XidGenLock (see PROC_HDR
+ * comments).
  */
 struct PGPROC
 {
@@ -95,6 +99,9 @@ struct PGPROC
 
 	Latch		procLatch;		/* generic latch for process */
 
+	TransactionId xidCopy;		/* this backend's xid, a copy of this proc's
+								   ProcGlobal->xids[] entry. */
+
 	TransactionId xmin;			/* minimal running XID as it was when we were
 								 * starting our xact, excluding LAZY VACUUM:
 								 * vacuum must not remove tuples deleted by
@@ -104,6 +111,10 @@ struct PGPROC
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
 	int			pid;			/* Backend's process ID; 0 if prepared xact */
+
+	int			pgxactoff;		/* offset into various ProcGlobal-> arrays
+								 * NB: can change any time unless locks held!
+								 */
 	int			pgprocno;
 
 	/* These fields are zero while a backend is still starting up: */
@@ -220,10 +231,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	TransactionId xid;			/* id of top-level transaction currently being
-								 * executed by this proc, if running and XID
-								 * is assigned; else InvalidTransactionId */
-
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
@@ -232,6 +239,13 @@ typedef struct PGXACT
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
+ *
+ * Adding an entry to, or removing one from, the procarray requires holding
+ * *both* ProcArrayLock and XidGenLock in exclusive mode (in that order). Both
+ * are needed because the dense arrays (see below) are accessed from
+ * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
+ * further contention by having both use the same lock. Adding/removing a
+ * procarray entry is much less frequent.
  */
 typedef struct PROC_HDR
 {
@@ -239,6 +253,29 @@ typedef struct PROC_HDR
 	PGPROC	   *allProcs;
 	/* Array of PGXACT structures (not including dummies for prepared txns) */
 	PGXACT	   *allPgXact;
+
+	/*
+	 * Arrays with per-backend information that is hotly accessed, indexed by
+	 * PGPROC->pgxactoff. These are in separate arrays for three reasons:
+	 * First, to allow for as tight loops accessing the data as
+	 * possible. Second, to prevent updates of frequently changing data from
+	 * invalidating cachelines shared with less frequently changing
+	 * data. Third, to condense frequently accessed data into as few cachelines
+	 * as possible.
+	 *
+	 * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), the
+	 * PGPROC's *Copy fields provide the contents of the dense data, and are
+	 * transferred by ProcArrayAdd() while it already holds ProcArrayLock.
+	 */
+
+	/*
+	 * TransactionId of top-level transaction currently being executed by each
+	 * proc, if running and XID is assigned; else InvalidTransactionId.
+	 *
+	 * Each PGPROC has a copy of its value in PGPROC.xidCopy.
+	 */
+	TransactionId *xids;
+
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 793a8036331..ddd8f19bd10 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -11,12 +11,12 @@
  * shared buffer content lock on the buffer containing the tuple.
  *
  * NOTE: When using a non-MVCC snapshot, we must check
- * TransactionIdIsInProgress (which looks in the PGXACT array)
+ * TransactionIdIsInProgress (which looks in the PGPROC array)
  * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
  * pg_xact).  Otherwise we have a race condition: we might decide that a
  * just-committed transaction crashed, because none of the tests succeed.
  * xact.c is careful to record commit/abort in pg_xact before it unsets
- * MyPgXact->xid in the PGXACT array.  That fixes that problem, but it
+ * its ProcGlobal->xids[] entry.  That fixes that problem, but it
  * also means there is a window where TransactionIdIsInProgress and
  * TransactionIdDidCommit will both return true.  If we check only
  * TransactionIdDidCommit, we could consider a tuple committed when a
@@ -956,7 +956,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
  * coding where we tried to set the hint bits as soon as possible, we instead
  * did TransactionIdIsInProgress in each call --- to no avail, as long as the
  * inserting/deleting transaction was still running --- which was more cycles
- * and more contention on the PGXACT array.
+ * and more contention on ProcArrayLock.
  */
 static bool
 HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
@@ -1444,7 +1444,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
  *	HeapTupleSatisfiesMVCC) and, therefore, any hint bits that can be set
  *	should already be set.  We assume that if no hint bits are set, the xmin
  *	or xmax transaction is still running.  This is therefore faster than
- *	HeapTupleSatisfiesVacuum, because we don't consult PGXACT nor CLOG.
+ *	HeapTupleSatisfiesVacuum, because we consult neither procarray nor CLOG.
  *	It's okay to return false when in doubt, but we must return true only
  *	if the tuple is removable.
  */
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 85c2625ec42..818f84d32aa 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -251,10 +251,10 @@ enforce, and it assists with some other issues as explained below.)  The
 implementation of this is that GetSnapshotData takes the ProcArrayLock in
 shared mode (so that multiple backends can take snapshots in parallel),
 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
-while clearing MyPgXact->xid at transaction end (either commit or abort).
-(To reduce context switching, when multiple transactions commit nearly
-simultaneously, we have one backend take ProcArrayLock and clear the XIDs
-of multiple processes at once.)
+while clearing the ProcGlobal->xids[] entry at transaction end (either
+commit or abort). (To reduce context switching, when multiple transactions
+commit nearly simultaneously, we have one backend take ProcArrayLock and
+clear the XIDs of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
 latestCompletedFullXid variable.  This allows GetSnapshotData to use
@@ -278,12 +278,13 @@ present in the ProcArray, or not running anymore.  (This guarantee doesn't
 apply to subtransaction XIDs, because of the possibility that there's not
 room for them in the subxid array; instead we guarantee that they are
 present or the overflow flag is set.)  If a backend released XidGenLock
-before storing its XID into MyPgXact, then it would be possible for another
-backend to allocate and commit a later XID, causing latestCompletedFullXid to
-pass the first backend's XID, before that value became visible in the
-ProcArray.  That would break ComputeTransactionHorizons, as discussed below.
+before storing its XID into ProcGlobal->xids[], then it would be possible for
+another backend to allocate and commit a later XID, causing
+latestCompletedFullXid to pass the first backend's XID, before that value
+became visible in the ProcArray.  That would break ComputeTransactionHorizons,
+as discussed below.
 
-We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
+We allow GetNewTransactionId to store the XID into ProcGlobal->xids[] (or the
 subxid array) without taking ProcArrayLock.  This was once necessary to
 avoid deadlock; while that is no longer the case, it's still beneficial for
 performance.  We are thereby relying on fetch/store of an XID to be atomic,
@@ -386,13 +387,13 @@ Top-level transactions do not have a parent, so they leave their pg_subtrans
 entries set to the default value of zero (InvalidTransactionId).
 
 pg_subtrans is used to check whether the transaction in question is still
-running --- the main Xid of a transaction is recorded in the PGXACT struct,
-but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
-in shared memory, so we have to store them on disk.  Note, however, that for
-each transaction we keep a "cache" of Xids that are known to be part of the
-transaction tree, so we can skip looking at pg_subtrans unless we know the
-cache has been overflowed.  See storage/ipc/procarray.c for the gory details.
-
+running --- the main Xid of a transaction is recorded in ProcGlobal->xids[],
+with a copy in PGPROC->xidCopy, but since we allow arbitrary nesting of
+subtransactions, we can't fit all Xids in shared memory, so we have to store
+them on disk.  Note, however, that for each transaction we keep a "cache" of
+Xids that are known to be part of the transaction tree, so we can skip looking
+at pg_subtrans unless we know the cache has been overflowed.  See
+storage/ipc/procarray.c for the gory details.
 slru.c is the supporting mechanism for both pg_xact and pg_subtrans.  It
 implements the LRU policy for in-memory buffer pages.  The high-level routines
 for pg_xact are implemented in transam.c, while the low-level functions are in
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f8e7670f8da..8e9c211b02a 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -285,15 +285,15 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 	 * updates for multiple backends so that the number of times
 	 * CLogControlLock needs to be acquired is reduced.
 	 *
-	 * For this optimization to be safe, the XID in MyPgXact and the subxids
-	 * in MyProc must be the same as the ones for which we're setting the
-	 * status.  Check that this is the case.
+	 * For this optimization to be safe, the XID in MyProc->xidCopy and the
+	 * subxids in MyProc must be the same as the ones for which we're setting
+	 * the status.  Check that this is the case.
 	 *
 	 * For this optimization to be efficient, we shouldn't have too many
 	 * sub-XIDs and all of the XIDs for which we're adjusting clog should be
 	 * on the same page.  Check those conditions, too.
 	 */
-	if (all_xact_same_page && xid == MyPgXact->xid &&
+	if (all_xact_same_page && xid == MyProc->xidCopy &&
 		nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT &&
 		nsubxids == MyPgXact->nxids &&
 		memcmp(subxids, MyProc->subxids.xids,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5867cc60f3e..8103c5cb71f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -351,7 +351,7 @@ AtAbort_Twophase(void)
 
 /*
  * This is called after we have finished transferring state to the prepared
- * PGXACT entry.
+ * PGPROC entry.
  */
 void
 PostPrepare_Twophase(void)
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->waitStatus = STATUS_OK;
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
 	proc->lxid = (LocalTransactionId) xid;
-	pgxact->xid = xid;
+	proc->xidCopy = xid;
 	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
@@ -768,7 +768,6 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 	{
 		GlobalTransaction gxact = &status->array[status->currIdx++];
 		PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 		Datum		values[5];
 		bool		nulls[5];
 		HeapTuple	tuple;
@@ -783,7 +782,7 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 		MemSet(values, 0, sizeof(values));
 		MemSet(nulls, 0, sizeof(nulls));
 
-		values[0] = TransactionIdGetDatum(pgxact->xid);
+		values[0] = TransactionIdGetDatum(proc->xidCopy);
 		values[1] = CStringGetTextDatum(gxact->gid);
 		values[2] = TimestampTzGetDatum(gxact->prepared_at);
 		values[3] = ObjectIdGetDatum(gxact->owner);
@@ -829,9 +828,8 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
-		if (pgxact->xid == xid)
+		if (gxact->xid == xid)
 		{
 			result = gxact;
 			break;
@@ -987,8 +985,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
-	TransactionId xid = pgxact->xid;
+	TransactionId xid = gxact->xid;
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileNode *commitrels;
@@ -1140,15 +1137,15 @@ EndPrepare(GlobalTransaction gxact)
 
 	/*
 	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyPgXact as not running our XID (which it will do immediately after
+	 * MyProc as not running our XID (which it will do immediately after
 	 * this function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
-	 * prepared XID.  This must happen before we clear the XID from MyPgXact,
-	 * else there is a window where the XID is not running according to
-	 * TransactionIdIsInProgress, and onlookers would be entitled to assume
-	 * the xact crashed.  Instead we have a window where the same XID appears
-	 * twice in ProcArray, which is OK.
+	 * prepared XID.  This must happen before we clear the XID from
+	 * ProcGlobal->xids[], else there is a window where the XID is not running
+	 * according to TransactionIdIsInProgress, and onlookers would be entitled
+	 * to assume the xact crashed.  Instead we have a window where the same
+	 * XID appears twice in ProcArray, which is OK.
 	 */
 	MarkAsPrepared(gxact, false);
 
@@ -1401,7 +1398,6 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
-	PGXACT	   *pgxact;
 	TransactionId xid;
 	char	   *buf;
 	char	   *bufptr;
@@ -1420,8 +1416,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
-	xid = pgxact->xid;
+	xid = gxact->xid;
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1723,7 +1718,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		/*
-		 * Note that we are using gxact not pgxact so this works in recovery
+		 * Note that we are using gxact not pgproc so this works in recovery
 		 * also
 		 */
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 43973130b7c..f703c229450 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -38,7 +38,8 @@ VariableCache ShmemVariableCache = NULL;
  * Allocate the next FullTransactionId for a new transaction or
  * subtransaction.
  *
- * The new XID is also stored into MyPgXact before returning.
+ * The new XID is also stored into ProcGlobal->xids[]/MyProc->xidCopy before
+ * returning.
  *
  * Note: when this is called, we are actually already inside a valid
  * transaction, since XIDs are now not allocated until the transaction
@@ -65,7 +66,8 @@ GetNewTransactionId(bool isSubXact)
 	if (IsBootstrapProcessingMode())
 	{
 		Assert(!isSubXact);
-		MyPgXact->xid = BootstrapTransactionId;
+		ProcGlobal->xids[MyProc->pgxactoff] = BootstrapTransactionId;
+		MyProc->xidCopy = BootstrapTransactionId;
 		return FullTransactionIdFromEpochAndXid(0, BootstrapTransactionId);
 	}
 
@@ -190,10 +192,10 @@ GetNewTransactionId(bool isSubXact)
 	 * latestCompletedXid is present in the ProcArray, which is essential for
 	 * correct OldestXmin tracking; see src/backend/access/transam/README.
 	 *
-	 * Note that readers of PGXACT xid fields should be careful to fetch the
-	 * value only once, rather than assume they can read a value multiple
-	 * times and get the same answer each time.  Note we are assuming that
-	 * TransactionId and int fetch/store are atomic.
+	 * Note that readers of ProcGlobal->xids/PGPROC->xidCopy should be careful
+	 * to fetch the value for each proc only once, rather than assume they can
+	 * read a value multiple times and get the same answer each time.  Note we
+	 * are assuming that TransactionId and int fetch/store are atomic.
 	 *
 	 * The same comments apply to the subxact xid count and overflow fields.
 	 *
@@ -219,7 +221,11 @@ GetNewTransactionId(bool isSubXact)
 	 * answer later on when someone does have a reason to inquire.)
 	 */
 	if (!isSubXact)
-		MyPgXact->xid = xid;	/* LWLockRelease acts as barrier */
+	{
+		/* LWLockRelease acts as barrier */
+		ProcGlobal->xids[MyProc->pgxactoff] = xid;
+		MyProc->xidCopy = xid;
+	}
 	else
 	{
 		int			nxids = MyPgXact->nxids;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7055b237337..1cc220c2d56 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1726,8 +1726,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 *
 		 * Note: these flags remain set until CommitTransaction or
 		 * AbortTransaction.  We don't want to clear them until we reset
-		 * MyPgXact->xid/xmin, else OldestXmin might appear to go backwards,
-		 * which is probably Not Good.
+		 * MyProc->xidCopy/xmin, otherwise GetOldestVisibleTransactionId()
+		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 52822c74cff..7a6efaafe26 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -9,8 +9,9 @@
  * one is as a means of determining the set of currently running transactions.
  *
  * Because of various subtle race conditions it is critical that a backend
- * hold the correct locks while setting or clearing its MyPgXact->xid field.
- * See notes in src/backend/access/transam/README.
+ * hold the correct locks while setting or clearing its xid (in
+ * ProcGlobal->xids[]/MyProc->xidCopy).  See notes in
+ * src/backend/access/transam/README.
  *
  * The process arrays now also include structures representing prepared
  * transactions.  The xid and subxids fields of these are valid, as are the
@@ -61,6 +62,7 @@
 #include "storage/spin.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
@@ -288,6 +290,9 @@ static void MaintainLatestCompletedXid(TransactionId latestXid);
 
 static inline FullTransactionId FullXidViaRelative(FullTransactionId rel, TransactionId xid);
 
+static TransactionId *snapshot_workspace_xid;
+static ssize_t *snapshot_workspace_off;
+
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
  */
@@ -381,6 +386,11 @@ CreateSharedProcArray(void)
 	}
 
 	LWLockRegisterTranche(LWTRANCHE_PROC, "proc");
+
+	snapshot_workspace_xid = MemoryContextAllocZero(TopMemoryContext,
+													sizeof(*snapshot_workspace_xid) * PROCARRAY_MAXPROCS);
+	snapshot_workspace_off = MemoryContextAllocZero(TopMemoryContext,
+													sizeof(*snapshot_workspace_off) * PROCARRAY_MAXPROCS);
 }
 
 /*
@@ -392,7 +402,9 @@ ProcArrayAdd(PGPROC *proc)
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
 
+	/* See ProcGlobal comment explaining why both locks are held */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	if (arrayP->numProcs >= arrayP->maxProcs)
 	{
@@ -401,7 +413,6 @@ ProcArrayAdd(PGPROC *proc)
 		 * fixed supply of PGPROC structs too, and so we should have failed
 		 * earlier.)
 		 */
-		LWLockRelease(ProcArrayLock);
 		ereport(FATAL,
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
@@ -427,10 +438,25 @@ ProcArrayAdd(PGPROC *proc)
 	}
 
 	memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index],
-			(arrayP->numProcs - index) * sizeof(int));
+			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
+	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+
 	arrayP->pgprocnos[index] = proc->pgprocno;
+	ProcGlobal->xids[index] = proc->xidCopy;
+
 	arrayP->numProcs++;
 
+	for (; index < arrayP->numProcs; index++)
+	{
+		allProcs[arrayP->pgprocnos[index]].pgxactoff = index;
+	}
+
+	/*
+	 * Release in reversed acquisition order, to reduce frequency of having to
+	 * wait for XidGenLock while holding ProcArrayLock.
+	 */
+	LWLockRelease(XidGenLock);
 	LWLockRelease(ProcArrayLock);
 }
 
@@ -456,36 +482,59 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		DisplayXidCache();
 #endif
 
+	/* See ProcGlobal comment explaining why both locks are held */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
+
+	Assert(ProcGlobal->allProcs[arrayP->pgprocnos[proc->pgxactoff]].pgxactoff == proc->pgxactoff);
 
 	if (TransactionIdIsValid(latestXid))
 	{
-		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
 
 		/* Advance global latestCompletedXid while holding the lock */
 		MaintainLatestCompletedXid(latestXid);
+
+		ProcGlobal->xids[proc->pgxactoff] = 0;
 	}
 	else
 	{
 		/* Shouldn't be trying to remove a live transaction here */
-		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
 	}
 
+	Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
+
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		if (arrayP->pgprocnos[index] == proc->pgprocno)
 		{
 			/* Keep the PGPROC array sorted. See notes above */
 			memmove(&arrayP->pgprocnos[index], &arrayP->pgprocnos[index + 1],
-					(arrayP->numProcs - index - 1) * sizeof(int));
+					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
+			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+
 			arrayP->pgprocnos[arrayP->numProcs - 1] = -1;	/* for debugging */
 			arrayP->numProcs--;
+
+			for (; index < arrayP->numProcs; index++)
+			{
+				allProcs[arrayP->pgprocnos[index]].pgxactoff--;
+			}
+
+			/*
+			 * Release in reversed acquisition order, to reduce frequency of
+			 * having to wait for XidGenLock while holding ProcArrayLock.
+			 */
+			LWLockRelease(XidGenLock);
 			LWLockRelease(ProcArrayLock);
 			return;
 		}
 	}
 
 	/* Oops */
+	LWLockRelease(XidGenLock);
 	LWLockRelease(ProcArrayLock);
 
 	elog(LOG, "failed to find proc %p in ProcArray", proc);
@@ -518,7 +567,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * else is taking a snapshot.  See discussion in
 		 * src/backend/access/transam/README.
 		 */
-		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(TransactionIdIsValid(proc->xidCopy));
 
 		/*
 		 * If we can immediately acquire ProcArrayLock, we clear our own XID
@@ -540,7 +589,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * anyone else's calculation of a snapshot.  We might change their
 		 * estimate of global xmin, but that's OK.
 		 */
-		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(!TransactionIdIsValid(proc->xidCopy));
 
 		proc->lxid = InvalidLocalTransactionId;
 		/* must be cleared with xid/xmin: */
@@ -563,7 +612,13 @@ static inline void
 ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 								TransactionId latestXid)
 {
-	pgxact->xid = InvalidTransactionId;
+	size_t		pgxactoff = proc->pgxactoff;
+
+	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
+	Assert(ProcGlobal->xids[pgxactoff] == proc->xidCopy);
+
+	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
+	proc->xidCopy = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
@@ -599,7 +654,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	uint32		wakeidx;
 
 	/* We should definitely have an XID to clear. */
-	Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+	Assert(TransactionIdIsValid(proc->xidCopy));
 
 	/* Add ourselves to the list of processes needing a group XID clear. */
 	proc->procArrayGroupMember = true;
@@ -704,20 +759,28 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
  * This is used after successfully preparing a 2-phase transaction.  We are
  * not actually reporting the transaction's XID as no longer running --- it
  * will still appear as running because the 2PC's gxact is in the ProcArray
- * too.  We just have to clear out our own PGXACT.
+ * too.  We just have to clear out our own PGPROC.
  */
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
 	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
+	size_t		pgxactoff;
 
 	/*
-	 * We can skip locking ProcArrayLock here, because this action does not
-	 * actually change anyone's view of the set of running XIDs: our entry is
-	 * duplicate with the gxact that has already been inserted into the
-	 * ProcArray.
+	 * We can skip locking ProcArrayLock exclusively here, because this action
+	 * does not actually change anyone's view of the set of running XIDs: our
+	 * entry duplicates the gxact that has already been inserted into the
+	 * ProcArray. But we need the lock in shared mode so that proc->pgxactoff
+	 * stays the same.
 	 */
-	pgxact->xid = InvalidTransactionId;
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	pgxactoff = proc->pgxactoff;
+
+	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
+	proc->xidCopy = InvalidTransactionId;
+
 	proc->lxid = InvalidLocalTransactionId;
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
@@ -729,6 +792,8 @@ ProcArrayClearTransaction(PGPROC *proc)
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
 	pgxact->overflowed = false;
+
+	LWLockRelease(ProcArrayLock);
 }
 
 /*
@@ -1099,7 +1164,7 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * there are four possibilities for finding a running transaction:
  *
  * 1. The given Xid is a main transaction Id.  We will find this out cheaply
- * by looking at the PGXACT struct for each backend.
+ * by looking at ProcGlobal->xids.
  *
  * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
@@ -1108,25 +1173,27 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * if the Xid is running on the master.
  *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to PGXACT or KnownAssignedXids.  This is the
- * slowest way, but sadly it has to be done always if the others failed,
- * unless we see that the cached subxact sets are complete (none have
+ * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * This is the slowest way, but sadly it has to be done always if the others
+ * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
  * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
  * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
- * PGXACT again anyway; see GetNewTransactionId).
+ * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
 bool
 TransactionIdIsInProgress(TransactionId xid)
 {
 	static TransactionId *xids = NULL;
+	static TransactionId *other_xids;
 	int			nxids = 0;
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId topxid;
-	int			i,
-				j;
+	int			mypgxactoff;
+	size_t		numProcs;
+	int			j;
 
 	/*
 	 * Don't bother checking a transaction older than RecentXmin; it could not
@@ -1181,6 +1248,8 @@ TransactionIdIsInProgress(TransactionId xid)
 					 errmsg("out of memory")));
 	}
 
+	other_xids = ProcGlobal->xids;
+
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
@@ -1196,20 +1265,22 @@ TransactionIdIsInProgress(TransactionId xid)
 	}
 
 	/* No shortcuts, gotta grovel through the array */
-	for (i = 0; i < arrayP->numProcs; i++)
+	mypgxactoff = MyProc->pgxactoff;
+	numProcs = arrayP->numProcs;
+	for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 	{
-		int			pgprocno = arrayP->pgprocnos[i];
-		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		int			pgprocno;
+		PGXACT	   *pgxact;
+		PGPROC	   *proc;
 		TransactionId pxid;
 		int			pxids;
 
-		/* Ignore my own proc --- dealt with it above */
-		if (proc == MyProc)
+		/* Ignore ourselves --- dealt with it above */
+		if (pgxactoff == mypgxactoff)
 			continue;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = UINT32_ACCESS_ONCE(pgxact->xid);
+		pxid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -1234,8 +1305,12 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
+		pgprocno = arrayP->pgprocnos[pgxactoff];
+		pgxact = &allPgXact[pgprocno];
 		pxids = pgxact->nxids;
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
+		pgprocno = arrayP->pgprocnos[pgxactoff];
+		proc = &allProcs[pgprocno];
 		for (j = pxids - 1; j >= 0; j--)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
@@ -1266,7 +1341,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (RecoveryInProgress())
 	{
-		/* none of the PGXACT entries should have XIDs in hot standby mode */
+		/* none of the PGPROC entries should have XIDs in hot standby mode */
 		Assert(nxids == 0);
 
 		if (KnownAssignedXidExists(xid))
@@ -1321,7 +1396,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	Assert(TransactionIdIsValid(topxid));
 	if (!TransactionIdEquals(topxid, xid))
 	{
-		for (i = 0; i < nxids; i++)
+		for (int i = 0; i < nxids; i++)
 		{
 			if (TransactionIdEquals(xids[i], topxid))
 				return true;
@@ -1344,6 +1419,7 @@ TransactionIdIsActive(TransactionId xid)
 {
 	bool		result = false;
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	int			i;
 
 	/*
@@ -1359,11 +1435,10 @@ TransactionIdIsActive(TransactionId xid)
 	{
 		int			pgprocno = arrayP->pgprocnos[i];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId pxid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = UINT32_ACCESS_ONCE(pgxact->xid);
+		pxid = UINT32_ACCESS_ONCE(other_xids[i]);
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -1424,6 +1499,7 @@ ComputeTransactionHorizons(ComputedHorizons *h)
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId kaxmin;
 	bool		in_recovery = RecoveryInProgress();
+	TransactionId *other_xids = ProcGlobal->xids;
 
 	/* inferred after ProcArrayLock is released */
 	h->catalog_oldest_visible = InvalidTransactionId;
@@ -1467,7 +1543,7 @@ ComputeTransactionHorizons(ComputedHorizons *h)
 		TransactionId xmin;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[pgprocno]);
 		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*
@@ -1752,14 +1828,17 @@ Snapshot
 GetSnapshotData(Snapshot snapshot)
 {
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	TransactionId xmin;
 	TransactionId xmax;
-	int			index;
-	int			count = 0;
+	size_t		count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
 	FullTransactionId latest_completed;
 	TransactionId oldestxid;
+	int			mypgxactoff;
+	TransactionId myxid;
+
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1804,6 +1883,10 @@ GetSnapshotData(Snapshot snapshot)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	latest_completed = ShmemVariableCache->latestCompletedFullXid;
+	mypgxactoff = MyProc->pgxactoff;
+	myxid = other_xids[mypgxactoff];
+	Assert(myxid == MyProc->xidCopy);
+
 	oldestxid = ShmemVariableCache->oldestXid;
 
 	/* xmax is always latestCompletedXid + 1 */
@@ -1814,57 +1897,82 @@ GetSnapshotData(Snapshot snapshot)
 	/* initialize xmin calculation with xmax */
 	xmin = xmax;
 
+	/* take own xid into account, saves a check inside the loop */
+	if (TransactionIdIsNormal(myxid) && NormalTransactionIdPrecedes(myxid, xmin))
+		xmin = myxid;
+
 	snapshot->takenDuringRecovery = RecoveryInProgress();
 
 	if (!snapshot->takenDuringRecovery)
 	{
-		int		   *pgprocnos = arrayP->pgprocnos;
-		int			numProcs;
+		size_t		numProcs = arrayP->numProcs;
+		TransactionId *xip = snapshot->xip;
+		int			workspace_count = 0;
 
 		/*
-		 * Spin over procArray checking xid, xmin, and subxids.  The goal is
-		 * to gather all active xids, find the lowest xmin, and try to record
-		 * subxids.
+		 * First collect the set of pgxactoff/xids that need to be included in
+		 * the snapshot.
 		 */
-		numProcs = arrayP->numProcs;
-		for (index = 0; index < numProcs; index++)
+		for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 		{
-			int			pgprocno = pgprocnos[index];
+			/* Fetch xid just once - see GetNewTransactionId */
+			TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
+
+			if (unlikely(xid != 0))
+			{
+				/*
+				 * We don't include our own XIDs (if any) in the snapshot.
+				 * They need to be included in the xmin computation, but we
+				 * did so outside the loop.
+				 */
+				if (pgxactoff == mypgxactoff)
+					continue;
+
+				/*
+				 * The only way we are able to get here with a non-normal xid
+				 * is during bootstrap - with this backend using
+				 * BootstrapTransactionId. But the above test should filter
+				 * that out.
+				 */
+				Assert(TransactionIdIsNormal(xid));
+
+				/*
+				 * If the XID is >= xmax, we can skip it; such transactions will
+				 * be treated as running anyway (and any sub-XIDs will also be >=
+				 * xmax).
+				 */
+				if (!NormalTransactionIdPrecedes(xid, xmax))
+					continue;
+
+				snapshot_workspace_xid[workspace_count] = xid;
+				snapshot_workspace_off[workspace_count] = pgxactoff;
+				workspace_count++;
+			}
+		}
+
+		for (ssize_t i = 0; i < workspace_count; i++)
+		{
+			int pgxactoff = snapshot_workspace_off[i];
+			TransactionId xid = snapshot_workspace_xid[i];
+			int			pgprocno = arrayP->pgprocnos[pgxactoff];
 			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			TransactionId xid;
+			uint8		vacuumFlags = pgxact->vacuumFlags;
+			int			nsubxids;
+
+			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
 
 			/*
 			 * Skip over backends doing logical decoding which manages xmin
 			 * separately (check below) and ones running LAZY VACUUM.
 			 */
-			if (pgxact->vacuumFlags &
-				(PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
+			if (vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
-			/* Fetch xid just once - see GetNewTransactionId */
-			xid = UINT32_ACCESS_ONCE(pgxact->xid);
-
-			/*
-			 * If the transaction has no XID assigned, we can skip it; it
-			 * won't have sub-XIDs either.  If the XID is >= xmax, we can also
-			 * skip it; such transactions will be treated as running anyway
-			 * (and any sub-XIDs will also be >= xmax).
-			 */
-			if (!TransactionIdIsNormal(xid)
-				|| !NormalTransactionIdPrecedes(xid, xmax))
-				continue;
-
-			/*
-			 * We don't include our own XIDs (if any) in the snapshot, but we
-			 * must include them in xmin.
-			 */
 			if (NormalTransactionIdPrecedes(xid, xmin))
 				xmin = xid;
-			if (pgxact == MyPgXact)
-				continue;
 
 			/* Add XID to snapshot. */
-			snapshot->xip[count++] = xid;
+			xip[count++] = xid;
 
 			/*
 			 * Save subtransaction XIDs if possible (if we've already
@@ -1881,26 +1989,25 @@ GetSnapshotData(Snapshot snapshot)
 			 *
 			 * Again, our own XIDs are not included in the snapshot.
 			 */
-			if (!suboverflowed)
+			if (suboverflowed)
+				continue;
+
+			suboverflowed = pgxact->overflowed;
+			nsubxids = pgxact->nxids;
+
+			if (suboverflowed || nsubxids == 0)
+				continue;
+			else
 			{
-				if (pgxact->overflowed)
-					suboverflowed = true;
-				else
-				{
-					int			nxids = pgxact->nxids;
+				int			pgprocno = arrayP->pgprocnos[pgxactoff];
+				PGPROC	   *proc = &allProcs[pgprocno];
 
-					if (nxids > 0)
-					{
-						PGPROC	   *proc = &allProcs[pgprocno];
+				pg_read_barrier();	/* pairs with GetNewTransactionId */
 
-						pg_read_barrier();	/* pairs with GetNewTransactionId */
-
-						memcpy(snapshot->subxip + subcount,
-							   (void *) proc->subxids.xids,
-							   nxids * sizeof(TransactionId));
-						subcount += nxids;
-					}
-				}
+				memcpy(snapshot->subxip + subcount,
+					   (void *) proc->subxids.xids,
+					   nsubxids * sizeof(TransactionId));
+				subcount += nsubxids;
 			}
 		}
 	}
@@ -2030,6 +2137,7 @@ GetSnapshotData(Snapshot snapshot)
 	}
 
 	RecentXmin = xmin;
+	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
 
 	snapshot->xmin = xmin;
 	snapshot->xmax = xmax;
@@ -2192,7 +2300,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * GetRunningTransactionData -- returns information about running transactions.
  *
  * Similar to GetSnapshotData but returns more information. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes and
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes and
  * prepared transactions.
  *
  * We acquire XidGenLock and ProcArrayLock, but the caller is responsible for
@@ -2207,7 +2315,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * This is never executed during recovery so there is no need to look at
  * KnownAssignedXids.
  *
- * Dummy PGXACTs from prepared transaction are included, meaning that this
+ * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
  * entries here to not hold on ProcArrayLock more than necessary.
@@ -2226,6 +2334,7 @@ GetRunningTransactionData(void)
 	static RunningTransactionsData CurrentRunningXactsData;
 
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	RunningTransactions CurrentRunningXacts = &CurrentRunningXactsData;
 	TransactionId latestCompletedXid;
 	TransactionId oldestRunningXid;
@@ -2285,7 +2394,7 @@ GetRunningTransactionData(void)
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 		/*
 		 * We don't need to store transactions that don't have a TransactionId
@@ -2382,7 +2491,7 @@ GetRunningTransactionData(void)
  * GetOldestActiveTransactionId()
  *
  * Similar to GetSnapshotData but returns just oldestActiveXid. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes.
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes.
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
@@ -2397,6 +2506,7 @@ TransactionId
 GetOldestActiveTransactionId(void)
 {
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	TransactionId oldestRunningXid;
 	int			index;
 
@@ -2419,12 +2529,10 @@ GetOldestActiveTransactionId(void)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		int			pgprocno = arrayP->pgprocnos[index];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 		if (!TransactionIdIsNormal(xid))
 			continue;
@@ -2502,8 +2610,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 * If we're not in recovery, we walk over the procarray and collect the
 	 * lowest xid. Since we're called with ProcArrayLock held and have
 	 * acquired XidGenLock, no entries can vanish concurrently, since
-	 * PGXACT->xid is only set with XidGenLock held and only cleared with
-	 * ProcArrayLock held.
+	 * ProcGlobal->xids[i] is only set with XidGenLock held and only cleared
+	 * with ProcArrayLock held.
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
@@ -2512,17 +2620,17 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 */
 	if (!recovery_in_progress)
 	{
+		TransactionId *other_xids = ProcGlobal->xids;
+
 		/*
-		 * Spin over procArray collecting all min(PGXACT->xid)
+		 * Spin over procArray collecting min(ProcGlobal->xids[i])
 		 */
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
-			int			pgprocno = arrayP->pgprocnos[index];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
 			TransactionId xid;
 
 			/* Fetch xid just once - see GetNewTransactionId */
-			xid = UINT32_ACCESS_ONCE(pgxact->xid);
+			xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 			if (!TransactionIdIsNormal(xid))
 				continue;
@@ -2710,6 +2818,7 @@ BackendXidGetPid(TransactionId xid)
 {
 	int			result = 0;
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	int			index;
 
 	if (xid == InvalidTransactionId)	/* never match invalid xid */
@@ -2721,9 +2830,8 @@ BackendXidGetPid(TransactionId xid)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
-		if (pgxact->xid == xid)
+		if (other_xids[index] == xid)
 		{
 			result = proc->pid;
 			break;
@@ -3003,7 +3111,6 @@ MinimumActiveBackends(int min)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
 		/*
 		 * Since we're not holding a lock, need to be prepared to deal with
@@ -3020,7 +3127,7 @@ MinimumActiveBackends(int min)
 			continue;			/* do not count deleted entries */
 		if (proc == MyProc)
 			continue;			/* do not count myself */
-		if (pgxact->xid == InvalidTransactionId)
+		if (proc->xidCopy == InvalidTransactionId)
 			continue;			/* do not count if no XID assigned */
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
@@ -3446,8 +3553,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 *
 	 * Note that we do not have to be careful about memory ordering of our own
 	 * reads wrt. GetNewTransactionId() here - only this process can modify
-	 * relevant fields of MyProc/MyPgXact.  But we do have to be careful about
-	 * our own writes being well ordered.
+	 * relevant fields of MyProc/ProcGlobal->xids[].  But we do have to be
+	 * careful about our own writes being well ordered.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
@@ -3801,7 +3908,7 @@ FullXidViaRelative(FullTransactionId rel, TransactionId xid)
  * In Hot Standby mode, we maintain a list of transactions that are (or were)
  * running in the master at the current point in WAL.  These XIDs must be
  * treated as running by standby transactions, even though they are not in
- * the standby server's PGXACT array.
+ * the standby server's ProcGlobal->xids[] array.
  *
  * We record all XIDs that we know have been assigned.  That includes all the
  * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index ad048bc85fa..b353c2f8005 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -417,9 +417,7 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 
 		if (proc != NULL)
 		{
-			PGXACT	   *xact = &ProcGlobal->allPgXact[proc->pgprocno];
-
-			*xid = xact->xid;
+			*xid = proc->xidCopy;
 			*xmin = proc->xmin;
 		}
 	}
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index efb44a25c42..5cd9a81bde8 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3974,9 +3974,8 @@ GetRunningTransactionLocks(int *nlocks)
 			proclock->tag.myLock->tag.locktag_type == LOCKTAG_RELATION)
 		{
 			PGPROC	   *proc = proclock->tag.myProc;
-			PGXACT	   *pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
 			LOCK	   *lock = proclock->tag.myLock;
-			TransactionId xid = pgxact->xid;
+			TransactionId xid = proc->xidCopy;
 
 			/*
 			 * Don't record locks for transactions if we know they have
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 66d25dba7f8..4dc588223c1 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -102,21 +102,17 @@ Size
 ProcGlobalShmemSize(void)
 {
 	Size		size = 0;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
 	/* ProcGlobal */
 	size = add_size(size, sizeof(PROC_HDR));
-	/* MyProcs, including autovacuum workers and launcher */
-	size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
-	/* AuxiliaryProcs */
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
-	/* Prepared xacts */
-	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
-	/* ProcStructLock */
+	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, sizeof(slock_t));
 
 	size = add_size(size, mul_size(MaxBackends, sizeof(PGXACT)));
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
 
 	return size;
 }
@@ -216,6 +212,25 @@ InitProcGlobal(void)
 	MemSet(pgxacts, 0, TotalProcs * sizeof(PGXACT));
 	ProcGlobal->allPgXact = pgxacts;
 
+	/*
+	 * Also allocate separate arrays for data that is frequently (e.g. by
+	 * GetSnapshotData()) accessed from outside a backend.  There is one entry
+	 * in each for every *live* PGPROC entry, and they are densely packed so
+	 * that the first procArray->numProcs entries are all valid.  The entries
+	 * for a PGPROC in those arrays are at PGPROC->pgxactoff.
+	 *
+	 * Note that they may not be accessed without ProcArrayLock held! Upon
+	 * ProcArrayRemove() later entries will be moved.
+	 *
+	 * These are separate from the main PGPROC array so that the most heavily
+	 * accessed data is stored contiguously in memory in as few cache lines as
+	 * possible. This provides significant performance benefits, especially on
+	 * a multiprocessor system.
+	 */
+	/* XXX: Pad to cacheline (or even page?)! */
+	ProcGlobal->xids = (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+
 	for (i = 0; i < TotalProcs; i++)
 	{
 		/* Common initialization for all PGPROCs, regardless of type. */
@@ -387,7 +402,7 @@ InitProcess(void)
 	MyProc->lxid = InvalidLocalTransactionId;
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
-	MyPgXact->xid = InvalidTransactionId;
+	MyProc->xidCopy = InvalidTransactionId;
 	MyProc->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
@@ -571,7 +586,7 @@ InitAuxiliaryProcess(void)
 	MyProc->lxid = InvalidLocalTransactionId;
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
-	MyPgXact->xid = InvalidTransactionId;
+	MyProc->xidCopy = InvalidTransactionId;
 	MyProc->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
-- 
2.25.0.114.g5b0ca878e0
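
For readers skimming the diff above, here is a minimal standalone sketch of the
layout idea: a dense xid array that scans walk directly, a pgxactoff stored per
proc, and a memmove on removal. It is not code from the patch. Every Demo*/demo_*
name is hypothetical, locking is left out entirely, new entries are simply
appended rather than kept in any particular order, and xids are compared with a
plain "<" that ignores wraparound.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_PROCS 8

typedef uint32_t DemoXid;		/* hypothetical stand-in for TransactionId */

typedef struct DemoProc
{
	int			pgxactoff;		/* offset into the dense arrays, -1 if not live */
	DemoXid		xidCopy;		/* backend-local mirror of demo_xids[pgxactoff] */
} DemoProc;

static DemoProc demo_procs[DEMO_MAX_PROCS];		/* sparse, indexed by pgprocno */
static int	demo_pgprocnos[DEMO_MAX_PROCS];		/* dense: offset -> pgprocno */
static DemoXid demo_xids[DEMO_MAX_PROCS];		/* dense: offset -> xid */
static int	demo_num_procs = 0;

/* Append a live proc at the end of the dense arrays (assumption: no ordering). */
static void
demo_add(int pgprocno, DemoXid xid)
{
	assert(demo_num_procs < DEMO_MAX_PROCS);
	demo_procs[pgprocno].pgxactoff = demo_num_procs;
	demo_procs[pgprocno].xidCopy = xid;
	demo_pgprocnos[demo_num_procs] = pgprocno;
	demo_xids[demo_num_procs] = xid;
	demo_num_procs++;
}

/* Remove a proc: close the gap, then fix up the offsets of the moved entries. */
static void
demo_remove(int pgprocno)
{
	int			off = demo_procs[pgprocno].pgxactoff;
	int			ntail = demo_num_procs - off - 1;

	assert(off >= 0 && off < demo_num_procs);
	memmove(&demo_pgprocnos[off], &demo_pgprocnos[off + 1],
			ntail * sizeof(demo_pgprocnos[0]));
	memmove(&demo_xids[off], &demo_xids[off + 1],
			ntail * sizeof(demo_xids[0]));
	demo_num_procs--;
	demo_procs[pgprocno].pgxactoff = -1;
	for (int i = off; i < demo_num_procs; i++)
		demo_procs[demo_pgprocnos[i]].pgxactoff = i;
}

/* Scan only the dense xid array; 0 means "no xid assigned". */
static DemoXid
demo_oldest_xid(void)
{
	DemoXid		oldest = 0;

	for (int off = 0; off < demo_num_procs; off++)
	{
		DemoXid		xid = demo_xids[off];

		/* plain "<" for illustration only, real xids need wraparound-aware compares */
		if (xid != 0 && (oldest == 0 || xid < oldest))
			oldest = xid;
	}
	return oldest;
}
```

The scan in demo_oldest_xid() touches only the contiguous xid array and never
dereferences a per-proc struct, which is the property the patch is after.
Removal pays for that with the memmove and the offset fix-up loop.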

v7-0008-snapshot-scalability-Move-PGXACT-vacuumFlags-to-P.patch (text/x-diff; charset=us-ascii)
From ccf17928fb28853d9ac3af9c1c8155ff4bc9ff68 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 08/11] snapshot scalability: Move PGXACT->vacuumFlags to
 ProcGlobal->vacuumFlags.

Similar to the previous commit, this increases the chance that data
frequently needed by GetSnapshotData() stays in L2 cache. As we now
take care to not unnecessarily write to ProcGlobal->vacuumFlags, there
should be very few modifications to the ProcGlobal->vacuumFlags array.
---
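
A minimal standalone sketch of the "copy plus mirror" pattern described in the
commit message: each backend keeps a private copy of its flags and only stores
to the shared array when a bit actually changes. It is not code from the patch.
All DEMO_*/demo_* names and the flag values are hypothetical, and no lock is
taken here, whereas the hunks below hold ProcArrayLock around each store to the
shared array.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits, loosely modelled on the PROC_* flags. */
#define DEMO_IN_VACUUM			0x02
#define DEMO_IN_ANALYZE			0x04
#define DEMO_VACUUM_STATE_MASK	(DEMO_IN_VACUUM | DEMO_IN_ANALYZE)

static uint8_t demo_shared_flags[4];	/* stand-in for the shared flags array */
static int	demo_shared_writes = 0;		/* counts stores to the shared array */

typedef struct DemoBackend
{
	int			pgxactoff;		/* this backend's slot in demo_shared_flags */
	uint8_t		flagsCopy;		/* private copy, cheap to test */
} DemoBackend;

/* Set a flag: update the private copy, then mirror it to the shared array. */
static void
demo_set_flag(DemoBackend *b, uint8_t flag)
{
	b->flagsCopy |= flag;
	demo_shared_flags[b->pgxactoff] = b->flagsCopy;
	demo_shared_writes++;
}

/*
 * End of transaction: consult the private copy first and skip the shared
 * store when there is nothing to clear, so the shared cache line stays clean.
 */
static void
demo_end_transaction(DemoBackend *b)
{
	if (b->flagsCopy & DEMO_VACUUM_STATE_MASK)
	{
		b->flagsCopy &= (uint8_t) ~DEMO_VACUUM_STATE_MASK;
		demo_shared_flags[b->pgxactoff] = b->flagsCopy;
		demo_shared_writes++;
	}
}
```

In this sketch an ordinary transaction that never set a vacuum flag goes through
demo_end_transaction() without a single store to the shared array.
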
 src/include/storage/proc.h                | 12 ++++-
 src/backend/access/transam/twophase.c     |  2 +-
 src/backend/commands/analyze.c            | 10 ++--
 src/backend/commands/vacuum.c             |  5 +-
 src/backend/postmaster/autovacuum.c       |  6 +--
 src/backend/replication/logical/logical.c |  3 +-
 src/backend/replication/slot.c            |  3 +-
 src/backend/storage/ipc/procarray.c       | 57 +++++++++++++++--------
 src/backend/storage/lmgr/deadlock.c       |  6 +--
 src/backend/storage/lmgr/proc.c           | 16 ++++---
 10 files changed, 77 insertions(+), 43 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 60586e8be34..37208ddf342 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -41,7 +41,7 @@ struct XidCache
 };
 
 /*
- * Flags for PGXACT->vacuumFlags
+ * Flags for ProcGlobal->vacuumFlags[]
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -171,6 +171,10 @@ struct PGPROC
 
 	struct XidCache subxids;	/* cache for subtransaction XIDs */
 
+	uint8		vacuumFlagsCopy; /* this backend's vacuum flags, a copy of its
+								  * ProcGlobal->vacuumFlags[] entry, see
+								  * PROC_* above */
+
 	/* Support for group XID clearing. */
 	/* true, if member of ProcArray group waiting for XID clear */
 	bool		procArrayGroupMember;
@@ -231,7 +235,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
 	uint8		nxids;
@@ -276,6 +279,11 @@ typedef struct PROC_HDR
 	 */
 	TransactionId *xids;
 
+	/*
+	 * Vacuum flags. See PROC_* above.
+	 */
+	uint8	   *vacuumFlags;
+
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8103c5cb71f..06b61605649 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -466,7 +466,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->xidCopy = xid;
 	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
-	pgxact->vacuumFlags = 0;
+	proc->vacuumFlagsCopy = 0;
 	proc->pid = 0;
 	proc->backendId = InvalidBackendId;
 	proc->databaseId = databaseid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7b75945c4a9..d55477464e9 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -250,7 +250,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 	 * OK, let's do it.  First let other backends know I'm in ANALYZE.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags |= PROC_IN_ANALYZE;
+	MyProc->vacuumFlagsCopy |= PROC_IN_ANALYZE;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlagsCopy;
 	LWLockRelease(ProcArrayLock);
 	pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE,
 								  RelationGetRelid(onerel));
@@ -281,11 +282,12 @@ analyze_rel(Oid relid, RangeVar *relation,
 	pgstat_progress_end_command();
 
 	/*
-	 * Reset my PGXACT flag.  Note: we need this here, and not in vacuum_rel,
-	 * because the vacuum flag is cleared by the end-of-xact code.
+	 * Reset vacuumFlags we set earlier.  Note: we need this here, and not in
+	 * vacuum_rel, because the vacuum flag is cleared by the end-of-xact code.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags &= ~PROC_IN_ANALYZE;
+	MyProc->vacuumFlagsCopy &= ~PROC_IN_ANALYZE;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlagsCopy;
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1cc220c2d56..1da7b4d3e06 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1730,9 +1730,10 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
+		MyProc->vacuumFlagsCopy |= PROC_IN_VACUUM;
 		if (params->is_wraparound)
-			MyPgXact->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+			MyProc->vacuumFlagsCopy |= PROC_VACUUM_FOR_WRAPAROUND;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlagsCopy;
 		LWLockRelease(ProcArrayLock);
 	}
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index df1af9354ce..465f8893cd5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2494,7 +2494,7 @@ do_autovacuum(void)
 						   tab->at_datname, tab->at_nspname, tab->at_relname);
 			EmitErrorReport();
 
-			/* this resets the PGXACT flags too */
+			/* this resets ProcGlobal->vacuumFlags[i] too */
 			AbortOutOfAnyTransaction();
 			FlushErrorState();
 			MemoryContextResetAndDeleteChildren(PortalContext);
@@ -2510,7 +2510,7 @@ do_autovacuum(void)
 
 		did_vacuum = true;
 
-		/* the PGXACT flags are reset at the next end of transaction */
+		/* ProcGlobal->vacuumFlags[i] is reset at the next end of xact */
 
 		/* be tidy */
 deleted:
@@ -2687,7 +2687,7 @@ perform_work_item(AutoVacuumWorkItem *workitem)
 				   cur_datname, cur_nspname, cur_relname);
 		EmitErrorReport();
 
-		/* this resets the PGXACT flags too */
+		/* this resets ProcGlobal->vacuumFlags[i] too */
 		AbortOutOfAnyTransaction();
 		FlushErrorState();
 		MemoryContextResetAndDeleteChildren(PortalContext);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253583b..d5a28821e14 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -163,7 +163,8 @@ StartupDecodingContext(List *output_plugin_options,
 	if (!IsTransactionOrTransactionBlock())
 	{
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyPgXact->vacuumFlags |= PROC_IN_LOGICAL_DECODING;
+		MyProc->vacuumFlagsCopy |= PROC_IN_LOGICAL_DECODING;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlagsCopy;
 		LWLockRelease(ProcArrayLock);
 	}
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 47851ec4c1a..0dcc6a90e79 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -468,7 +468,8 @@ ReplicationSlotRelease(void)
 
 	/* might not have been set when we've been a plain slot */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags &= ~PROC_IN_LOGICAL_DECODING;
+	MyProc->vacuumFlagsCopy &= ~PROC_IN_LOGICAL_DECODING;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlagsCopy;
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 7a6efaafe26..384b5a8efb5 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -441,9 +441,12 @@ ProcArrayAdd(PGPROC *proc)
 			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
 	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+	memmove(&ProcGlobal->vacuumFlags[index + 1], &ProcGlobal->vacuumFlags[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->vacuumFlags));
 
 	arrayP->pgprocnos[index] = proc->pgprocno;
 	ProcGlobal->xids[index] = proc->xidCopy;
+	ProcGlobal->vacuumFlags[index] = proc->vacuumFlagsCopy;
 
 	arrayP->numProcs++;
 
@@ -504,6 +507,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	}
 
 	Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff] == 0));
+	ProcGlobal->vacuumFlags[proc->pgxactoff] = 0;
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
@@ -514,6 +518,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
 			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+			memmove(&ProcGlobal->vacuumFlags[index], &ProcGlobal->vacuumFlags[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->vacuumFlags));
 
 			arrayP->pgprocnos[arrayP->numProcs - 1] = -1;	/* for debugging */
 			arrayP->numProcs--;
@@ -592,14 +598,23 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		Assert(!TransactionIdIsValid(proc->xidCopy));
 
 		proc->lxid = InvalidLocalTransactionId;
-		/* must be cleared with xid/xmin: */
-		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
 		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false; /* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
 		Assert(pgxact->nxids == 0);
 		Assert(pgxact->overflowed == false);
+
+		/* must be cleared with xid/xmin: */
+		/* avoid unnecessarily dirtying shared cachelines */
+		if (proc->vacuumFlagsCopy & PROC_VACUUM_STATE_MASK)
+		{
+			Assert(!LWLockHeldByMe(ProcArrayLock));
+			LWLockAcquire(ProcArrayLock, LW_SHARED);
+			proc->vacuumFlagsCopy &= ~PROC_VACUUM_STATE_MASK;
+			ProcGlobal->vacuumFlags[proc->pgxactoff] = proc->vacuumFlagsCopy;
+			LWLockRelease(ProcArrayLock);
+		}
 	}
 }
 
@@ -620,12 +635,18 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
 	proc->xidCopy = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	/* must be cleared with xid/xmin: */
-	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
 	proc->xmin = InvalidTransactionId;
 	proc->delayChkpt = false; /* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
+	/* must be cleared with xid/xmin: */
+	/* avoid unnecessarily dirtying shared cachelines */
+	if (proc->vacuumFlagsCopy & PROC_VACUUM_STATE_MASK)
+	{
+		proc->vacuumFlagsCopy &= ~PROC_VACUUM_STATE_MASK;
+		ProcGlobal->vacuumFlags[proc->pgxactoff] = proc->vacuumFlagsCopy;
+	}
+
 	/* Clear the subtransaction-XID cache too while holding the lock */
 	pgxact->nxids = 0;
 	pgxact->overflowed = false;
@@ -785,9 +806,8 @@ ProcArrayClearTransaction(PGPROC *proc)
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
-	/* redundant, but just in case */
-	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-	proc->delayChkpt = false;
+	Assert(!(proc->vacuumFlagsCopy & PROC_VACUUM_STATE_MASK));
+	Assert(!proc->delayChkpt);
 
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
@@ -1538,7 +1558,7 @@ ComputeTransactionHorizons(ComputedHorizons *h)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 		TransactionId xid;
 		TransactionId xmin;
 
@@ -1555,10 +1575,6 @@ ComputeTransactionHorizons(ComputedHorizons *h)
 		 */
 		xmin = TransactionIdOlder(xmin, xid);
 
-		/* if neither is set, this proc doesn't influence the horizon */
-		if (!TransactionIdIsValid(xmin))
-			continue;
-
 		/*
 		 * Don't ignore any procs when determining which transactions might be
 		 * considered running.  While slots should ensure logical decoding
@@ -1573,7 +1589,7 @@ ComputeTransactionHorizons(ComputedHorizons *h)
 		 * removed, as long as pg_subtrans is not truncated) or doing logical
 		 * decoding (which manages xmin separately, check below).
 		 */
-		if (pgxact->vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
+		if (vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
 			continue;
 
 		/* shared tables need to take backends in all database into account */
@@ -1908,6 +1924,7 @@ GetSnapshotData(Snapshot snapshot)
 		size_t		numProcs = arrayP->numProcs;
 		TransactionId *xip = snapshot->xip;
 		int			workspace_count = 0;
+		uint8	   *allVacuumFlags = ProcGlobal->vacuumFlags;
 
 		/*
 		 * First collect set of pgxactoff/xids that need to be included in the
@@ -1956,7 +1973,7 @@ GetSnapshotData(Snapshot snapshot)
 			TransactionId xid = snapshot_workspace_xid[i];
 			int			pgprocno = arrayP->pgprocnos[pgxactoff];
 			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			uint8		vacuumFlags = pgxact->vacuumFlags;
+			uint8       vacuumFlags = allVacuumFlags[pgxactoff];
 			int			nsubxids;
 
 			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
@@ -2208,11 +2225,11 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 		TransactionId xid;
 
 		/* Ignore procs running LAZY VACUUM */
-		if (pgxact->vacuumFlags & PROC_IN_VACUUM)
+		if (vacuumFlags & PROC_IN_VACUUM)
 			continue;
 
 		/* We are only interested in the specific virtual transaction. */
@@ -2901,12 +2918,12 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 
 		if (proc == MyProc)
 			continue;
 
-		if (excludeVacuum & pgxact->vacuumFlags)
+		if (excludeVacuum & vacuumFlags)
 			continue;
 
 		if (allDbs || proc->databaseId == MyDatabaseId)
@@ -3321,7 +3338,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
 			PGPROC	   *proc = &allProcs[pgprocno];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
+			uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 
 			if (proc->databaseId != databaseId)
 				continue;
@@ -3335,7 +3352,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 			else
 			{
 				(*nbackends)++;
-				if ((pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				if ((vacuumFlags & PROC_IS_AUTOVACUUM) &&
 					nautovacs < MAXAUTOVACPIDS)
 					autovac_pids[nautovacs++] = proc->pid;
 			}
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index beedc7947db..70e653bd3c9 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -544,7 +544,6 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 {
 	PGPROC	   *proc;
 	LOCK	   *lock = checkProc->waitLock;
-	PGXACT	   *pgxact;
 	PROCLOCK   *proclock;
 	SHM_QUEUE  *procLocks;
 	LockMethod	lockMethodTable;
@@ -582,7 +581,6 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 		PGPROC	   *leader;
 
 		proc = proclock->tag.myProc;
-		pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
 		leader = proc->lockGroupLeader == NULL ? proc : proc->lockGroupLeader;
 
 		/* A proc never blocks itself or any other lock group member */
@@ -628,9 +626,11 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 					 * problems (which needs to read a different vacuumFlag
 					 * bit), but we don't do that here to avoid grabbing
 					 * ProcArrayLock.
+					 *
+					 * XXX: That's why this is using vacuumFlagsCopy.
 					 */
 					if (checkProc == MyProc &&
-						pgxact->vacuumFlags & PROC_IS_AUTOVACUUM)
+						proc->vacuumFlagsCopy & PROC_IS_AUTOVACUUM)
 						blocking_autovacuum_proc = proc;
 
 					/* We're done looking at this proclock */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 4dc588223c1..00f26d93c1a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -113,6 +113,7 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
 
 	return size;
 }
@@ -230,6 +231,8 @@ InitProcGlobal(void)
 	// XXX: Pad to cacheline (or even page?)!
 	ProcGlobal->xids = (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->vacuumFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
+	MemSet(ProcGlobal->vacuumFlags, 0, TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 
 	for (i = 0; i < TotalProcs; i++)
 	{
@@ -412,10 +415,10 @@ InitProcess(void)
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
 	MyProc->delayChkpt = false;
-	MyPgXact->vacuumFlags = 0;
+	MyProc->vacuumFlagsCopy = 0;
 	/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
 	if (IsAutoVacuumWorkerProcess())
-		MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
+		MyProc->vacuumFlagsCopy |= PROC_IS_AUTOVACUUM;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
 	MyProc->waitLock = NULL;
@@ -594,7 +597,7 @@ InitAuxiliaryProcess(void)
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
 	MyProc->delayChkpt = false;
-	MyPgXact->vacuumFlags = 0;
+	MyProc->vacuumFlagsCopy = 0;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
 	MyProc->waitLock = NULL;
@@ -1330,7 +1333,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		if (deadlock_state == DS_BLOCKED_BY_AUTOVACUUM && allow_autovacuum_cancel)
 		{
 			PGPROC	   *autovac = GetBlockingAutoVacuumPgproc();
-			PGXACT	   *autovac_pgxact = &ProcGlobal->allPgXact[autovac->pgprocno];
+			uint8		vacuumFlags;
 
 			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
@@ -1338,8 +1341,9 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 * Only do it if the worker is not working to protect against Xid
 			 * wraparound.
 			 */
-			if ((autovac_pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
-				!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
+			vacuumFlags = ProcGlobal->vacuumFlags[autovac->pgxactoff];
+			if ((vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				!(vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
 			{
 				int			pid = autovac->pid;
 				StringInfoData locktagbuf;
-- 
2.25.0.114.g5b0ca878e0

v7-0009-snapshot-scalability-Move-subxact-info-from-PGXAC.patch (text/x-diff; charset=us-ascii)
From 29c93a56d0db15050c33c0a130b3716632b8ae44 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 7 Apr 2020 03:33:16 -0700
Subject: [PATCH v7 09/11] snapshot scalability: Move subxact info from PGXACT
 to ProcGlobal.

Similar to the previous changes, this increases the chance that data
frequently needed by GetSnapshotData() stays in the L2 cache. In many
workloads subtransactions are very rare, and this makes checking for
them cheaper.
---
 src/include/storage/proc.h            |  19 ++++-
 src/backend/access/transam/clog.c     |   7 +-
 src/backend/access/transam/twophase.c |  11 +--
 src/backend/access/transam/varsup.c   |   9 ++-
 src/backend/storage/ipc/procarray.c   | 110 ++++++++++++++++----------
 src/backend/storage/lmgr/proc.c       |   3 +
 6 files changed, 99 insertions(+), 60 deletions(-)
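
The layout this patch introduces can be illustrated with a small standalone
sketch. It is not code from the patch: all Sketch*/sketch_* names and
SKETCH_MAX_PROCS are hypothetical, and locking and memory barriers are left
out entirely. It only shows the idea of a dense shared array that scanners
read, a per-proc copy that the owner reads, and a reset that writes to the
shared array only when something is actually set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PROCS 8		/* hypothetical size, for illustration only */

/* Mirrors the shape of XidCacheStatus: a count plus an overflow flag. */
typedef struct SketchXidCacheStatus
{
	uint8_t		count;
	bool		overflowed;
} SketchXidCacheStatus;

/* Stand-in for PGPROC: an offset into the dense array and a private copy. */
typedef struct SketchProc
{
	size_t		off;			/* plays the role of pgxactoff */
	SketchXidCacheStatus statusCopy;
} SketchProc;

/* Stand-in for ProcGlobal->subxidStates: one small entry per proc. */
static SketchXidCacheStatus sketch_states[SKETCH_MAX_PROCS];

/* Record one more cached subxid, or mark the cache as overflowed. */
static void
sketch_add_subxid(SketchProc *proc, uint8_t max_cached)
{
	SketchXidCacheStatus *shared = &sketch_states[proc->off];

	if (proc->statusCopy.count < max_cached)
		proc->statusCopy.count = shared->count =
			(uint8_t) (proc->statusCopy.count + 1);
	else
		proc->statusCopy.overflowed = shared->overflowed = true;
}

/*
 * Reset at transaction end.  The private copy is consulted first so that
 * the shared entry is left untouched in the common no-subxact case.
 * Returns true if the shared entry was written.
 */
static bool
sketch_clear(SketchProc *proc)
{
	if (proc->statusCopy.count == 0 && !proc->statusCopy.overflowed)
		return false;

	sketch_states[proc->off].count = 0;
	sketch_states[proc->off].overflowed = false;
	proc->statusCopy.count = 0;
	proc->statusCopy.overflowed = false;
	return true;
}

/* A scanner only walks the dense array, never the per-proc structs. */
static size_t
sketch_count_with_subxids(size_t nprocs)
{
	size_t		n = 0;

	for (size_t i = 0; i < nprocs; i++)
		if (sketch_states[i].count > 0 || sketch_states[i].overflowed)
			n++;
	return n;
}
```

In the sketch a scan touches two bytes per proc, which is the property the
commit message relies on for keeping the data cache-resident.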

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 37208ddf342..3cd48382260 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -35,6 +35,14 @@
  */
 #define PGPROC_MAX_CACHED_SUBXIDS 64	/* XXX guessed-at value */
 
+typedef struct XidCacheStatus
+{
+	/* number of cached subxids, never more than PGPROC_MAX_CACHED_SUBXIDS */
+	uint8	count;
+	/* has PGPROC->subxids overflowed */
+	bool	overflowed;
+} XidCacheStatus;
+
 struct XidCache
 {
 	TransactionId xids[PGPROC_MAX_CACHED_SUBXIDS];
@@ -169,6 +177,7 @@ struct PGPROC
 	 */
 	SHM_QUEUE	myProcLocks[NUM_LOCK_PARTITIONS];
 
+	XidCacheStatus subxidStatusCopy; /* copy of ProcGlobal->subxidStates[i] */
 	struct XidCache subxids;	/* cache for subtransaction XIDs */
 
 	uint8		vacuumFlagsCopy; /* this backend's vacuum flags, a copy of its
@@ -235,9 +244,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	bool		overflowed;
-
-	uint8		nxids;
 } PGXACT;
 
 /*
@@ -279,6 +285,13 @@ typedef struct PROC_HDR
 	 */
 	TransactionId *xids;
 
+	/*
+	 * Subtransaction caching status for each proc's PGPROC.subxids.
+	 *
+	 * Each PGPROC has a copy of its value in PGPROC.subxidStatusCopy.
+	 */
+	XidCacheStatus *subxidStates;
+
 	/*
 	 * Vacuum flags. See PROC_* above.
 	 */
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 8e9c211b02a..ee90ec5af29 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -295,7 +295,7 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 	 */
 	if (all_xact_same_page && xid == MyProc->xidCopy &&
 		nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT &&
-		nsubxids == MyPgXact->nxids &&
+		nsubxids == MyProc->subxidStatusCopy.count &&
 		memcmp(subxids, MyProc->subxids.xids,
 			   nsubxids * sizeof(TransactionId)) == 0)
 	{
@@ -510,16 +510,15 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 	while (nextidx != INVALID_PGPROCNO)
 	{
 		PGPROC	   *proc = &ProcGlobal->allProcs[nextidx];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[nextidx];
 
 		/*
 		 * Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
 		 * should not use group XID status update mechanism.
 		 */
-		Assert(pgxact->nxids <= THRESHOLD_SUBTRANS_CLOG_OPT);
+		Assert(proc->subxidStatusCopy.count <= THRESHOLD_SUBTRANS_CLOG_OPT);
 
 		TransactionIdSetPageStatusInternal(proc->clogGroupMemberXid,
-										   pgxact->nxids,
+										   proc->subxidStatusCopy.count,
 										   proc->subxids.xids,
 										   proc->clogGroupMemberXidStatus,
 										   proc->clogGroupMemberLsn,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 06b61605649..f802544d378 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -447,14 +447,12 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 					TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	PGPROC	   *proc;
-	PGXACT	   *pgxact;
 	int			i;
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
 	Assert(gxact != NULL);
 	proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
 	/* Initialize the PGPROC entry */
 	MemSet(proc, 0, sizeof(PGPROC));
@@ -480,8 +478,8 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
 		SHMQueueInit(&(proc->myProcLocks[i]));
 	/* subxid data must be filled later by GXactLoadSubxactData */
-	pgxact->overflowed = false;
-	pgxact->nxids = 0;
+	proc->subxidStatusCopy.count = 0;
+	proc->subxidStatusCopy.overflowed = false;
 
 	gxact->prepared_at = prepared_at;
 	gxact->xid = xid;
@@ -510,19 +508,18 @@ GXactLoadSubxactData(GlobalTransaction gxact, int nsubxacts,
 					 TransactionId *children)
 {
 	PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
 	/* We need no extra lock since the GXACT isn't valid yet */
 	if (nsubxacts > PGPROC_MAX_CACHED_SUBXIDS)
 	{
-		pgxact->overflowed = true;
+		proc->subxidStatusCopy.overflowed = true;
 		nsubxacts = PGPROC_MAX_CACHED_SUBXIDS;
 	}
 	if (nsubxacts > 0)
 	{
 		memcpy(proc->subxids.xids, children,
 			   nsubxacts * sizeof(TransactionId));
-		pgxact->nxids = nsubxacts;
+		proc->subxidStatusCopy.count = nsubxacts;
 	}
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index f703c229450..091a183c0e9 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -225,19 +225,22 @@ GetNewTransactionId(bool isSubXact)
 		/* LWLockRelease acts as barrier */
 		ProcGlobal->xids[MyProc->pgxactoff] = xid;
 		MyProc->xidCopy = xid;
+
+		Assert(ProcGlobal->subxidStates[MyProc->pgxactoff].count == 0);
 	}
 	else
 	{
-		int			nxids = MyPgXact->nxids;
+		XidCacheStatus *substat = &ProcGlobal->subxidStates[MyProc->pgxactoff];
+		int			nxids = MyProc->subxidStatusCopy.count;
 
 		if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
 		{
 			MyProc->subxids.xids[nxids] = xid;
 			pg_write_barrier();
-			MyPgXact->nxids = nxids + 1;
+			MyProc->subxidStatusCopy.count = substat->count = nxids + 1;
 		}
 		else
-			MyPgXact->overflowed = true;
+			MyProc->subxidStatusCopy.overflowed = substat->overflowed = true;
 	}
 
 	LWLockRelease(XidGenLock);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 384b5a8efb5..6de0590dbec 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -86,7 +86,7 @@ typedef struct ProcArrayStruct
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
 	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGXACT
+	 * similar reasons to tracking overflowing cached subxids in PGPROC
 	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
 	 * lock to read it.
 	 */
@@ -441,11 +441,14 @@ ProcArrayAdd(PGPROC *proc)
 			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
 	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+	memmove(&ProcGlobal->subxidStates[index + 1], &ProcGlobal->subxidStates[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->subxidStates));
 	memmove(&ProcGlobal->vacuumFlags[index + 1], &ProcGlobal->vacuumFlags[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->vacuumFlags));
 
 	arrayP->pgprocnos[index] = proc->pgprocno;
 	ProcGlobal->xids[index] = proc->xidCopy;
+	ProcGlobal->subxidStates[index] = proc->subxidStatusCopy;
 	ProcGlobal->vacuumFlags[index] = proc->vacuumFlagsCopy;
 
 	arrayP->numProcs++;
@@ -499,6 +502,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		MaintainLatestCompletedXid(latestXid);
 
 		ProcGlobal->xids[proc->pgxactoff] = 0;
+		ProcGlobal->subxidStates[proc->pgxactoff].overflowed = false;
+		ProcGlobal->subxidStates[proc->pgxactoff].count = 0;
 	}
 	else
 	{
@@ -507,6 +512,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	}
 
 	Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff] == 0));
+	Assert(ProcGlobal->subxidStates[proc->pgxactoff].count == 0);
+	Assert(!ProcGlobal->subxidStates[proc->pgxactoff].overflowed);
 	ProcGlobal->vacuumFlags[proc->pgxactoff] = 0;
 
 	for (index = 0; index < arrayP->numProcs; index++)
@@ -518,6 +525,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
 			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+			memmove(&ProcGlobal->subxidStates[index], &ProcGlobal->subxidStates[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->subxidStates));
 			memmove(&ProcGlobal->vacuumFlags[index], &ProcGlobal->vacuumFlags[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->vacuumFlags));
 
@@ -596,15 +605,13 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * estimate of global xmin, but that's OK.
 		 */
 		Assert(!TransactionIdIsValid(proc->xidCopy));
+		Assert(proc->subxidStatusCopy.count == 0);
 
 		proc->lxid = InvalidLocalTransactionId;
 		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false; /* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
-		Assert(pgxact->nxids == 0);
-		Assert(pgxact->overflowed == false);
-
 		/* must be cleared with xid/xmin: */
 		/* avoid unnecessarily dirtying shared cachelines */
 		if (proc->vacuumFlagsCopy & PROC_VACUUM_STATE_MASK)
@@ -648,8 +655,15 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	}
 
 	/* Clear the subtransaction-XID cache too while holding the lock */
-	pgxact->nxids = 0;
-	pgxact->overflowed = false;
+	Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatusCopy.count &&
+		   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatusCopy.overflowed);
+	if (proc->subxidStatusCopy.count > 0 || proc->subxidStatusCopy.overflowed)
+	{
+		ProcGlobal->subxidStates[pgxactoff].count = 0;
+		ProcGlobal->subxidStates[pgxactoff].overflowed = false;
+		proc->subxidStatusCopy.count = 0;
+		proc->subxidStatusCopy.overflowed = false;
+	}
 
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
@@ -785,7 +799,6 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
-	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
 	size_t		pgxactoff;
 
 	/*
@@ -810,8 +823,15 @@ ProcArrayClearTransaction(PGPROC *proc)
 	Assert(!proc->delayChkpt);
 
 	/* Clear the subtransaction-XID cache too */
-	pgxact->nxids = 0;
-	pgxact->overflowed = false;
+	Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatusCopy.count &&
+		   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatusCopy.overflowed);
+	if (proc->subxidStatusCopy.count > 0 || proc->subxidStatusCopy.overflowed)
+	{
+		ProcGlobal->subxidStates[pgxactoff].count = 0;
+		ProcGlobal->subxidStates[pgxactoff].overflowed = false;
+		proc->subxidStatusCopy.count = 0;
+		proc->subxidStatusCopy.overflowed = false;
+	}
 
 	LWLockRelease(ProcArrayLock);
 }
@@ -1208,6 +1228,7 @@ TransactionIdIsInProgress(TransactionId xid)
 {
 	static TransactionId *xids = NULL;
 	static TransactionId *other_xids;
+	XidCacheStatus *other_nsubxids;
 	int			nxids = 0;
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId topxid;
@@ -1269,6 +1290,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	}
 
 	other_xids = ProcGlobal->xids;
+	other_nsubxids = ProcGlobal->subxidStates;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -1290,7 +1312,6 @@ TransactionIdIsInProgress(TransactionId xid)
 	for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 	{
 		int			pgprocno;
-		PGXACT	   *pgxact;
 		PGPROC	   *proc;
 		TransactionId pxid;
 		int			pxids;
@@ -1325,12 +1346,16 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
-		pgprocno = arrayP->pgprocnos[pgxactoff];
-		pgxact = &allPgXact[pgprocno];
-		pxids = pgxact->nxids;
+		pxids = other_nsubxids[pgxactoff].count;
+
+		if (pxids == 0 && !other_nsubxids[pgxactoff].overflowed)
+			continue;
+
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
+
 		pgprocno = arrayP->pgprocnos[pgxactoff];
 		proc = &allProcs[pgprocno];
+
 		for (j = pxids - 1; j >= 0; j--)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
@@ -1351,7 +1376,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		 * we hold ProcArrayLock.  So we can't miss an Xid that we need to
 		 * worry about.)
 		 */
-		if (pgxact->overflowed)
+		if (other_nsubxids[pgxactoff].overflowed)
 			xids[nxids++] = pxid;
 	}
 
@@ -1923,6 +1948,7 @@ GetSnapshotData(Snapshot snapshot)
 	{
 		size_t		numProcs = arrayP->numProcs;
 		TransactionId *xip = snapshot->xip;
+		XidCacheStatus *allnsubxids = ProcGlobal->subxidStates;
 		int			workspace_count = 0;
 		uint8	   *allVacuumFlags = ProcGlobal->vacuumFlags;
 
@@ -1971,9 +1997,7 @@ GetSnapshotData(Snapshot snapshot)
 		{
 			int pgxactoff = snapshot_workspace_off[i];
 			TransactionId xid = snapshot_workspace_xid[i];
-			int			pgprocno = arrayP->pgprocnos[pgxactoff];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			uint8       vacuumFlags = allVacuumFlags[pgxactoff];
+			uint8		vacuumFlags = allVacuumFlags[pgxactoff];
 			int			nsubxids;
 
 			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
@@ -2009,8 +2033,8 @@ GetSnapshotData(Snapshot snapshot)
 			if (suboverflowed)
 				continue;
 
-			suboverflowed = pgxact->overflowed;
-			nsubxids = pgxact->nxids;
+			suboverflowed = allnsubxids[pgxactoff].overflowed;
+			nsubxids = allnsubxids[pgxactoff].count;
 
 			if (suboverflowed || nsubxids == 0)
 				continue;
@@ -2406,8 +2430,6 @@ GetRunningTransactionData(void)
 	 */
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		int			pgprocno = arrayP->pgprocnos[index];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2428,7 +2450,7 @@ GetRunningTransactionData(void)
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
-		if (pgxact->overflowed)
+		if (ProcGlobal->subxidStates[index].overflowed)
 			suboverflowed = true;
 
 		/*
@@ -2448,27 +2470,28 @@ GetRunningTransactionData(void)
 	 */
 	if (!suboverflowed)
 	{
+		XidCacheStatus *other_nsubxids = ProcGlobal->subxidStates;
+
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
 			PGPROC	   *proc = &allProcs[pgprocno];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			int			nxids;
+			int			nsubxids;
 
 			/*
 			 * Save subtransaction XIDs. Other backends can't add or remove
 			 * entries while we're holding XidGenLock.
 			 */
-			nxids = pgxact->nxids;
-			if (nxids > 0)
+			nsubxids = other_nsubxids[index].count;
+			if (nsubxids > 0)
 			{
 				/* barrier not really required, as XidGenLock is held, but ... */
 				pg_read_barrier();	/* pairs with GetNewTransactionId */
 
 				memcpy(&xids[count], (void *) proc->subxids.xids,
-					   nxids * sizeof(TransactionId));
-				count += nxids;
-				subcount += nxids;
+					   nsubxids * sizeof(TransactionId));
+				count += nsubxids;
+				subcount += nsubxids;
 
 				/*
 				 * Top-level XID of a transaction is always less than any of
@@ -3535,14 +3558,6 @@ ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
 	LWLockRelease(ProcArrayLock);
 }
 
-
-#define XidCacheRemove(i) \
-	do { \
-		MyProc->subxids.xids[i] = MyProc->subxids.xids[MyPgXact->nxids - 1]; \
-		pg_write_barrier(); \
-		MyPgXact->nxids--; \
-	} while (0)
-
 /*
  * XidCacheRemoveRunningXids
  *
@@ -3558,6 +3573,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 {
 	int			i,
 				j;
+	XidCacheStatus *mysubxidstat;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -3575,6 +3591,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
+	mysubxidstat = &ProcGlobal->subxidStates[MyProc->pgxactoff];
+
 	/*
 	 * Under normal circumstances xid and xids[] will be in increasing order,
 	 * as will be the entries in subxids.  Scan backwards to avoid O(N^2)
@@ -3584,11 +3602,14 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	{
 		TransactionId anxid = xids[i];
 
-		for (j = MyPgXact->nxids - 1; j >= 0; j--)
+		for (j = MyProc->subxidStatusCopy.count - 1; j >= 0; j--)
 		{
 			if (TransactionIdEquals(MyProc->subxids.xids[j], anxid))
 			{
-				XidCacheRemove(j);
+				MyProc->subxids.xids[j] = MyProc->subxids.xids[MyProc->subxidStatusCopy.count - 1];
+				pg_write_barrier();
+				mysubxidstat->count--;
+				MyProc->subxidStatusCopy.count--;
 				break;
 			}
 		}
@@ -3600,20 +3621,23 @@ XidCacheRemoveRunningXids(TransactionId xid,
 		 * error during AbortSubTransaction.  So instead of Assert, emit a
 		 * debug warning.
 		 */
-		if (j < 0 && !MyPgXact->overflowed)
+		if (j < 0 && !MyProc->subxidStatusCopy.overflowed)
 			elog(WARNING, "did not find subXID %u in MyProc", anxid);
 	}
 
-	for (j = MyPgXact->nxids - 1; j >= 0; j--)
+	for (j = MyProc->subxidStatusCopy.count - 1; j >= 0; j--)
 	{
 		if (TransactionIdEquals(MyProc->subxids.xids[j], xid))
 		{
-			XidCacheRemove(j);
+			MyProc->subxids.xids[j] = MyProc->subxids.xids[MyProc->subxidStatusCopy.count - 1];
+			pg_write_barrier();
+			mysubxidstat->count--;
+			MyProc->subxidStatusCopy.count--;
 			break;
 		}
 	}
 	/* Ordinarily we should have found it, unless the cache has overflowed */
-	if (j < 0 && !MyPgXact->overflowed)
+	if (j < 0 && !MyProc->subxidStatusCopy.overflowed)
 		elog(WARNING, "did not find subXID %u in MyProc", xid);
 
 	/* Also advance global latestCompletedXid while holding the lock */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 00f26d93c1a..c619250318b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -113,6 +113,7 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
 
 	return size;
@@ -231,6 +232,8 @@ InitProcGlobal(void)
 	// XXX: Pad to cacheline (or even page?)!
 	ProcGlobal->xids = (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
 	ProcGlobal->vacuumFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 	MemSet(ProcGlobal->vacuumFlags, 0, TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 
-- 
2.25.0.114.g5b0ca878e0

v7-0010-Remove-now-unused-PGXACT.patch (text/x-diff; charset=us-ascii)
From a7a19f52063fd8baa38a25948d3d9c6a6d614196 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 10/11] Remove now unused PGXACT.

---
 src/include/storage/proc.h            | 15 ---------------
 src/backend/access/transam/twophase.c |  6 +++---
 src/backend/storage/ipc/procarray.c   | 24 +++++++++---------------
 src/backend/storage/lmgr/proc.c       | 21 +--------------------
 src/tools/pgindent/typedefs.list      |  1 -
 5 files changed, 13 insertions(+), 54 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3cd48382260..e225efa8ece 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -232,19 +232,6 @@ struct PGPROC
 
 
 extern PGDLLIMPORT PGPROC *MyProc;
-extern PGDLLIMPORT struct PGXACT *MyPgXact;
-
-/*
- * Prior to PostgreSQL 9.2, the fields below were stored as part of the
- * PGPROC.  However, benchmarking revealed that packing these particular
- * members into a separate array as tightly as possible sped up GetSnapshotData
- * considerably on systems with many CPU cores, by reducing the number of
- * cache lines needing to be fetched.  Thus, think very carefully before adding
- * anything else here.
- */
-typedef struct PGXACT
-{
-} PGXACT;
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
@@ -260,8 +247,6 @@ typedef struct PROC_HDR
 {
 	/* Array of PGPROC structures (not including dummies for prepared txns) */
 	PGPROC	   *allProcs;
-	/* Array of PGXACT structures (not including dummies for prepared txns) */
-	PGXACT	   *allPgXact;
 
 	/*
 	 * Arrays with per-backend information that is hotly accessed, indexed by
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f802544d378..d03fbefd5d2 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -21,9 +21,9 @@
  *		GIDs and aborts the transaction if there already is a global
  *		transaction in prepared state with the same GID.
  *
- *		A global transaction (gxact) also has dummy PGXACT and PGPROC; this is
- *		what keeps the XID considered running by TransactionIdIsInProgress.
- *		It is also convenient as a PGPROC to hook the gxact's locks to.
+ *		A global transaction (gxact) also has a dummy PGPROC; this is what keeps
+ *		the XID considered running by TransactionIdIsInProgress.  It is also
+ *		convenient as a PGPROC to hook the gxact's locks to.
  *
  *		Information to recover prepared transactions in case of crash is
  *		now stored in WAL for the common case. In some cases there will be
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 6de0590dbec..92ed8d20519 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4,9 +4,10 @@
  *	  POSTGRES process array code.
  *
  *
- * This module maintains arrays of the PGPROC and PGXACT structures for all
- * active backends.  Although there are several uses for this, the principal
- * one is as a means of determining the set of currently running transactions.
+ * This module maintains arrays of PGPROC substructures, as well as associated
+ * arrays in ProcGlobal, for all active backends.  Although there are several
+ * uses for this, the principal one is as a means of determining the set of
+ * currently running transactions.
  *
  * Because of various subtle race conditions it is critical that a backend
  * hold the correct locks while setting or clearing its xid (in
@@ -97,7 +98,7 @@ typedef struct ProcArrayStruct
 	/* oldest catalog xmin of any replication slot */
 	TransactionId replication_slot_catalog_xmin;
 
-	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
+	/* indexes into allProcs[], has PROCARRAY_MAXPROCS entries */
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
@@ -196,7 +197,6 @@ typedef struct ComputedHorizons
 static ProcArrayStruct *procArray;
 
 static PGPROC *allProcs;
-static PGXACT *allPgXact;
 
 /*
  * Bookkeeping for tracking emulated transactions in recovery
@@ -283,8 +283,7 @@ static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
 static TransactionId KnownAssignedXidsGetOldestXmin(void);
 static void KnownAssignedXidsDisplay(int trace_level);
 static void KnownAssignedXidsReset(void);
-static inline void ProcArrayEndTransactionInternal(PGPROC *proc,
-												   PGXACT *pgxact, TransactionId latestXid);
+static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
 
@@ -369,7 +368,6 @@ CreateSharedProcArray(void)
 	}
 
 	allProcs = ProcGlobal->allProcs;
-	allPgXact = ProcGlobal->allPgXact;
 
 	/* Create or attach to the KnownAssignedXids arrays too, if needed */
 	if (EnableHotStandby)
@@ -572,8 +570,6 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 {
-	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
-
 	if (TransactionIdIsValid(latestXid))
 	{
 		/*
@@ -591,7 +587,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 */
 		if (LWLockConditionalAcquire(ProcArrayLock, LW_EXCLUSIVE))
 		{
-			ProcArrayEndTransactionInternal(proc, pgxact, latestXid);
+			ProcArrayEndTransactionInternal(proc, latestXid);
 			LWLockRelease(ProcArrayLock);
 		}
 		else
@@ -631,8 +627,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
  * We don't do any locking here; caller must handle that.
  */
 static inline void
-ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
-								TransactionId latestXid)
+ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
@@ -753,9 +748,8 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	while (nextidx != INVALID_PGPROCNO)
 	{
 		PGPROC	   *proc = &allProcs[nextidx];
-		PGXACT	   *pgxact = &allPgXact[nextidx];
 
-		ProcArrayEndTransactionInternal(proc, pgxact, proc->procArrayGroupMemberXid);
+		ProcArrayEndTransactionInternal(proc, proc->procArrayGroupMemberXid);
 
 		/* Move to next proc in list. */
 		nextidx = pg_atomic_read_u32(&proc->procArrayGroupNext);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index c619250318b..34a6c6b1536 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -63,9 +63,8 @@ int			LockTimeout = 0;
 int			IdleInTransactionSessionTimeout = 0;
 bool		log_lock_waits = false;
 
-/* Pointer to this process's PGPROC and PGXACT structs, if any */
+/* Pointer to this process's PGPROC struct, if any */
 PGPROC	   *MyProc = NULL;
-PGXACT	   *MyPgXact = NULL;
 
 /*
  * This spinlock protects the freelist of recycled PGPROC structures.
@@ -109,9 +108,6 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, sizeof(slock_t));
 
-	size = add_size(size, mul_size(MaxBackends, sizeof(PGXACT)));
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
-	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
@@ -161,7 +157,6 @@ void
 InitProcGlobal(void)
 {
 	PGPROC	   *procs;
-	PGXACT	   *pgxacts;
 	int			i,
 				j;
 	bool		found;
@@ -202,18 +197,6 @@ InitProcGlobal(void)
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
 	ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
 
-	/*
-	 * Also allocate a separate array of PGXACT structures.  This is separate
-	 * from the main PGPROC array so that the most heavily accessed data is
-	 * stored contiguously in memory in as few cache lines as possible. This
-	 * provides significant performance benefits, especially on a
-	 * multiprocessor system.  There is one PGXACT structure for every PGPROC
-	 * structure.
-	 */
-	pgxacts = (PGXACT *) ShmemAlloc(TotalProcs * sizeof(PGXACT));
-	MemSet(pgxacts, 0, TotalProcs * sizeof(PGXACT));
-	ProcGlobal->allPgXact = pgxacts;
-
 	/*
 	 * Also allocate a separate arrays for data that is frequently (e.g. by
 	 * GetSnapshotData()) accessed from outside a backend.  There is one entry
@@ -382,7 +365,6 @@ InitProcess(void)
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
 	}
-	MyPgXact = &ProcGlobal->allPgXact[MyProc->pgprocno];
 
 	/*
 	 * Cross-check that the PGPROC is of the type we expect; if this were not
@@ -579,7 +561,6 @@ InitAuxiliaryProcess(void)
 	((volatile PGPROC *) auxproc)->pid = MyProcPid;
 
 	MyProc = auxproc;
-	MyPgXact = &ProcGlobal->allPgXact[auxproc->pgprocno];
 
 	SpinLockRelease(ProcStructLock);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 525d58e7f01..3dd5fedcf7b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1508,7 +1508,6 @@ PGSetenvStatusType
 PGShmemHeader
 PGTransactionStatusType
 PGVerbosity
-PGXACT
 PG_Locale_Strategy
 PG_Lock_Status
 PG_init_t
-- 
2.25.0.114.g5b0ca878e0

v7-0011-snapshot-scalability-cache-snapshots-using-a-xact.patch (text/x-diff; charset=us-ascii)
From bd4f9ab115cf3f881f4d780b03c102e550238e6d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 11/11] snapshot scalability: cache snapshots using a xact
 completion counter.

---
 src/include/access/transam.h                |   8 ++
 src/include/utils/snapshot.h                |   7 ++
 src/backend/replication/logical/snapbuild.c |   1 +
 src/backend/storage/ipc/procarray.c         | 111 ++++++++++++++++----
 src/backend/utils/time/snapmgr.c            |   4 +
 5 files changed, 109 insertions(+), 22 deletions(-)

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 924e5fa724e..73ed8c25dff 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -211,6 +211,14 @@ typedef struct VariableCacheData
 	FullTransactionId latestCompletedFullXid;	/* newest full XID that has
 												 * committed or aborted */
 
+	/*
+	 * Number of top-level transactions that completed in some form since the
+	 * start of the server. This currently is solely used to check whether
+	 * GetSnapshotData() needs to recompute the contents of the snapshot, or
+	 * not. There are likely other users of this.  Always above 1.
+	 */
+	uint64 xactCompletionCount;
+
 	/*
 	 * These fields are protected by CLogTruncationLock
 	 */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 2bc415376ac..dc37798fe9e 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -207,6 +207,13 @@ typedef struct SnapshotData
 
 	TimestampTz whenTaken;		/* timestamp when snapshot was taken */
 	XLogRecPtr	lsn;			/* position in the WAL stream when taken */
+
+	/*
+	 * The transaction completion count at the time GetSnapshotData() built
+	 * this snapshot. Allows to avoid re-computing static snapshots when no
+	 * transactions completed since the last GetSnapshotData().
+	 */
+	uint64		snapXactCompletionCount;
 } SnapshotData;
 
 #endif							/* SNAPSHOT_H */
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e9701ea7221..9d5d68f3fa7 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -524,6 +524,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->curcid = FirstCommandId;
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
+	snapshot->snapXactCompletionCount = 0;
 
 	return snapshot;
 }
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 92ed8d20519..7bd847d70d9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -365,6 +365,7 @@ CreateSharedProcArray(void)
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+		ShmemVariableCache->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
@@ -499,6 +500,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		/* Advance global latestCompletedXid while holding the lock */
 		MaintainLatestCompletedXid(latestXid);
 
+		/* Same with CSN */
+		ShmemVariableCache->xactCompletionCount++;
+
 		ProcGlobal->xids[proc->pgxactoff] = 0;
 		ProcGlobal->subxidStates[proc->pgxactoff].overflowed = false;
 		ProcGlobal->subxidStates[proc->pgxactoff].count = 0;
@@ -631,6 +635,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
+	Assert(LWLockHeldByMe(ProcArrayLock));
 	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
 	Assert(ProcGlobal->xids[pgxactoff] == proc->xidCopy);
 
@@ -662,6 +667,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
+
+	/* Same with CSN */
+	ShmemVariableCache->xactCompletionCount++;
 }
 
 /*
@@ -1826,6 +1834,77 @@ GetMaxSnapshotSubxidCount(void)
 	return TOTAL_MAX_CACHED_SUBXIDS;
 }
 
+static void
+GetSnapshotDataFillTooOld(Snapshot snapshot)
+{
+	if (old_snapshot_threshold < 0)
+	{
+		/*
+		 * If not using "snapshot too old" feature, fill related fields with
+		 * dummy values that don't require any locking.
+		 */
+		snapshot->lsn = InvalidXLogRecPtr;
+		snapshot->whenTaken = 0;
+	}
+	else
+	{
+		/*
+		 * Capture the current time and WAL stream location in case this
+		 * snapshot becomes old enough to need to fall back on the special
+		 * "old snapshot" logic.
+		 */
+		snapshot->lsn = GetXLogInsertRecPtr();
+		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
+		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, snapshot->xmin);
+	}
+}
+
+/*
+ * Helper function for GetSnapshotData() that checks if the bulk of the
+ * visibility information in the snapshot is still valid. If so, it updates
+ * the fields that need to change and returns true. false is returned
+ * otherwise.
+ *
+ * This very likely can be evolved to not need ProcArrayLock held (at very
+ * least in the case we already hold a snapshot), but that's for another day.
+ */
+static bool
+GetSnapshotDataReuse(Snapshot snapshot)
+{
+	uint64 curXactCompletionCount;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+
+	if (unlikely(snapshot->snapXactCompletionCount == 0))
+		return false;
+
+	curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
+	if (curXactCompletionCount != snapshot->snapXactCompletionCount)
+		return false;
+
+	/*
+	 * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go
+	 * backwards, as ProcArrayLock prevents concurrent commits of transactions
+	 * with xids, and the completion count check ensures we'd have gotten the
+	 * same result computing the snapshot the hard way (as only running xids
+	 * matter).
+	 */
+	if (!TransactionIdIsValid(MyProc->xmin))
+		MyProc->xmin = TransactionXmin = snapshot->xmin;
+
+	RecentXmin = snapshot->xmin;
+	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
+
+	snapshot->curcid = GetCurrentCommandId(false);
+	snapshot->active_count = 0;
+	snapshot->regd_count = 0;
+	snapshot->copied = false;
+
+	GetSnapshotDataFillTooOld(snapshot);
+
+	return true;
+}
+
 /*
  * GetSnapshotData -- returns information about running transactions.
  *
@@ -1873,7 +1952,7 @@ GetSnapshotData(Snapshot snapshot)
 	TransactionId oldestxid;
 	int			mypgxactoff;
 	TransactionId myxid;
-
+	uint64		curXactCompletionCount;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1917,12 +1996,19 @@ GetSnapshotData(Snapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
+	if (GetSnapshotDataReuse(snapshot))
+	{
+		LWLockRelease(ProcArrayLock);
+		return snapshot;
+	}
+
 	latest_completed = ShmemVariableCache->latestCompletedFullXid;
 	mypgxactoff = MyProc->pgxactoff;
 	myxid = other_xids[mypgxactoff];
 	Assert(myxid == MyProc->xidCopy);
 
 	oldestxid = ShmemVariableCache->oldestXid;
+	curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
 
 	/* xmax is always latestCompletedXid + 1 */
 	xmax = XidFromFullTransactionId(latest_completed);
@@ -2179,7 +2265,7 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->xcnt = count;
 	snapshot->subxcnt = subcount;
 	snapshot->suboverflowed = suboverflowed;
-
+	snapshot->snapXactCompletionCount = curXactCompletionCount;
 	snapshot->curcid = GetCurrentCommandId(false);
 
 	/*
@@ -2190,26 +2276,7 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
 
-	if (old_snapshot_threshold < 0)
-	{
-		/*
-		 * If not using "snapshot too old" feature, fill related fields with
-		 * dummy values that don't require any locking.
-		 */
-		snapshot->lsn = InvalidXLogRecPtr;
-		snapshot->whenTaken = 0;
-	}
-	else
-	{
-		/*
-		 * Capture the current time and WAL stream location in case this
-		 * snapshot becomes old enough to need to fall back on the special
-		 * "old snapshot" logic.
-		 */
-		snapshot->lsn = GetXLogInsertRecPtr();
-		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
-		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
-	}
+	GetSnapshotDataFillTooOld(snapshot);
 
 	return snapshot;
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 01f1c133014..62e5d747d64 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -597,6 +597,8 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
 	/* NB: curcid should NOT be copied, it's a local matter */
 
+	CurrentSnapshot->snapXactCompletionCount = 0;
+
 	/*
 	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
@@ -672,6 +674,7 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->snapXactCompletionCount = 0;
 
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
@@ -2224,6 +2227,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
-- 
2.25.0.114.g5b0ca878e0
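
As an illustration of the idea in the 0011 patch above (not part of the patch itself), here is a minimal, single-threaded sketch of reusing a snapshot as long as a completion counter has not moved. SharedState, DemoSnapshot and all helper functions are hypothetical stand-ins. The sketch assumes plain integer XIDs without wraparound and does no locking, both of which the real code has to deal with.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_RUNNING 8

/* Hypothetical stand-in for the shared state guarded by ProcArrayLock. */
typedef struct SharedState
{
	uint64_t	xact_completion_count;	/* starts at 1; 0 is reserved */
	uint32_t	latest_completed;		/* newest XID that has finished */
	uint32_t	running[MAX_RUNNING];	/* XIDs currently in progress */
	int			nrunning;
} SharedState;

/* Hypothetical, much reduced snapshot. */
typedef struct DemoSnapshot
{
	uint32_t	xmax;					/* XIDs >= xmax count as running */
	uint32_t	xip[MAX_RUNNING];		/* running XIDs below xmax */
	int			xcnt;
	uint64_t	snap_completion_count;	/* 0 = must not be reused */
	int			rebuilds;				/* demo only: number of full scans */
} DemoSnapshot;

static void
shared_init(SharedState *s, uint32_t latest_completed)
{
	s->xact_completion_count = 1;
	s->latest_completed = latest_completed;
	s->nrunning = 0;
}

static void
xact_begin(SharedState *s, uint32_t xid)
{
	assert(s->nrunning < MAX_RUNNING);
	s->running[s->nrunning++] = xid;	/* note: counter is NOT bumped */
}

static void
xact_end(SharedState *s, uint32_t xid)
{
	for (int i = 0; i < s->nrunning; i++)
	{
		if (s->running[i] == xid)
		{
			s->running[i] = s->running[--s->nrunning];
			break;
		}
	}
	if (xid > s->latest_completed)
		s->latest_completed = xid;
	s->xact_completion_count++;			/* invalidates cached snapshots */
}

static void
get_snapshot(const SharedState *s, DemoSnapshot *snap)
{
	/* Fast path: nothing completed since this snapshot was built. */
	if (snap->snap_completion_count != 0 &&
		snap->snap_completion_count == s->xact_completion_count)
		return;

	/* Slow path: scan all running transactions. */
	snap->xmax = s->latest_completed + 1;
	snap->xcnt = 0;
	for (int i = 0; i < s->nrunning; i++)
	{
		if (s->running[i] < snap->xmax)
			snap->xip[snap->xcnt++] = s->running[i];
	}
	snap->snap_completion_count = s->xact_completion_count;
	snap->rebuilds++;
}

static bool
xid_in_progress(const DemoSnapshot *snap, uint32_t xid)
{
	if (xid >= snap->xmax)
		return true;
	for (int i = 0; i < snap->xcnt; i++)
	{
		if (snap->xip[i] == xid)
			return true;
	}
	return false;
}
```

In this sketch, starting a transaction does not bump the counter: an XID assigned after the snapshot was built is at or above its xmax and is treated as in progress anyway. Only a completion can change what a freshly computed snapshot would contain, which is why a counter of completions is enough to decide on reuse.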

#25Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#24)
1 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

On 2020-04-07 05:15:03 -0700, Andres Freund wrote:

Attached is a substantially polished version of my patches. Note that
the first three patches, as well as the last, are not intended to be
committed at this time / in this form - they're there to make testing
easier.

I didn't actually attach that last not-to-be-committed patch... It's
just the pgbench patch that I had attached before (and started a thread
about). Here it is again.

Attachments:

v7-0012-WIP-pgbench.patch (text/x-diff; charset=us-ascii)
From 59a9a03da728d53364f9c3d6fe8b48e21697b93e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 12/12] WIP: pgbench

---
 src/bin/pgbench/pgbench.c | 107 +++++++++++++++++++++++++++++---------
 1 file changed, 83 insertions(+), 24 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e99af801675..21d1ab2aac1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -310,6 +310,10 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+#ifdef ENABLE_THREAD_SAFETY
+pthread_barrier_t conn_barrier;
+#endif
+
 /*
  * Connection state machine states.
  */
@@ -5206,6 +5210,10 @@ printResults(StatsData *total, instr_time total_time,
 	tps_exclude = ntx /
 		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
 
+	//fprintf(stderr, "time: include: %f, exclude: %f, conn total: %f\n",
+	//		time_include, time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients),
+	//		INSTR_TIME_GET_DOUBLE(conn_total_time));
+
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -6126,26 +6134,14 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
-
-	/* set alarm if duration is specified. */
-	if (duration > 0)
-		setalarm(duration);
-
 	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
+	pthread_barrier_init(&conn_barrier, NULL, nthreads);
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
-
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
 		/* the first thread (i = 0) is executed by main thread */
 		if (i > 0)
 		{
@@ -6162,13 +6158,38 @@ main(int argc, char **argv)
 			thread->thread = INVALID_THREAD;
 		}
 	}
-#else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
-	/* compute when to stop */
+#endif							/* ENABLE_THREAD_SAFETY */
+
+#ifdef ENABLE_THREAD_SAFETY
+	/* wait till all threads started (threads wait in threadRun()) */
+	//fprintf(stderr, "andres: waiting for thread start: %u\n", threads[0].tid);
+	pthread_barrier_wait(&conn_barrier);
+#endif							/* ENABLE_THREAD_SAFETY */
+
+	/* get start up time */
+	INSTR_TIME_SET_CURRENT(start_time);
+
+	/* */
+	for (i = 0; i < nthreads; i++)
+	{
+		TState	   *thread = &threads[i];
+
+		thread->start_time = start_time;
+
+		/* compute when to stop */
+		if (duration > 0)
+			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
+				(int64) 1000000 * duration;
+	}
+
+	/* set alarm if duration is specified. */
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
-	threads[0].thread = INVALID_THREAD;
+		setalarm(duration);
+
+#ifdef ENABLE_THREAD_SAFETY
+	/* updated start time (threads wait in threadRun()) */
+	//fprintf(stderr, "andres: %u: waiting for start time\n", threads[0].tid);
+	pthread_barrier_wait(&conn_barrier);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 	/* wait for threads and accumulate results */
@@ -6236,12 +6257,30 @@ threadRun(void *arg)
 	int			i;
 
 	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start;
+	int64		last_report;
+	int64		next_report;
 	StatsData	last,
 				aggs;
 
+	/* wait till all threads started (main waits outside) */
+	if (thread->tid != 0)
+	{
+		//fprintf(stderr, "andres: %u: waiting for thread start\n", thread->tid);
+		pthread_barrier_wait(&conn_barrier);
+	}
+
+	/* wait for start time to be initialized (main waits outside) */
+	if (thread->tid != 0)
+	{
+		//fprintf(stderr, "andres: %u: waiting for start time\n", thread->tid);
+		pthread_barrier_wait(&conn_barrier);
+	}
+
+	thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
+	last_report = thread_start;
+	next_report = last_report + (int64) progress * 1000000;
+
 	/*
 	 * Initialize throttling rate target for all of the thread's clients.  It
 	 * might be a little more accurate to reset thread->start_time here too.
@@ -6288,7 +6327,27 @@ threadRun(void *arg)
 
 	/* time after thread and connections set up */
 	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
+	INSTR_TIME_SUBTRACT(thread->conn_time, start);
+
+	//	e = thread->conn_time;
+	//fprintf(stderr, "andres: %u: connection established in %f (s %f, e %f)\n",
+	//		thread->tid, INSTR_TIME_GET_DOUBLE(thread->conn_time),
+	//		INSTR_TIME_GET_DOUBLE(e),
+	//		INSTR_TIME_GET_DOUBLE(start));
+
+	/* add once for each other connection */
+	if (!is_connect)
+	{
+		instr_time e = thread->conn_time;
+		for (i = 0; i < (nstate - 1); i++)
+		{
+			INSTR_TIME_ADD(thread->conn_time, e);
+		}
+	}
+
+	/* wait for all connections to be established */
+	//fprintf(stderr, "andres: %u: waiting for connection establishment\n", thread->tid);
+	pthread_barrier_wait(&conn_barrier);
 
 	/* explicitly initialize the state machines */
 	for (i = 0; i < nstate; i++)
-- 
2.25.0.114.g5b0ca878e0
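
The synchronization the WIP pgbench patch above appears to be aiming for can be sketched with a POSIX barrier used in two phases: first everyone waits until all threads have finished their setup, then the coordinator records a common start time and a second wait releases the workers. This is a standalone sketch, not pgbench code. Worker, worker_main and run_workers are made-up names, the coordinator is an extra barrier participant (pgbench instead runs thread 0 on the main thread), and it assumes a platform that provides pthread_barrier_t.

```c
#define _POSIX_C_SOURCE 200809L

#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#define NWORKERS 4

static pthread_barrier_t start_barrier;

/* Written by the coordinator between the two barrier phases. */
static struct timespec shared_start;

typedef struct Worker
{
	pthread_t	thread;
	int			id;
	struct timespec seen_start;		/* start time as observed by the worker */
} Worker;

static void *
worker_main(void *arg)
{
	Worker	   *w = arg;

	/* ... per-thread setup (e.g. opening connections) would go here ... */

	/* Phase 1: report that this worker's setup is done. */
	pthread_barrier_wait(&start_barrier);

	/* Phase 2: wait until the coordinator has published the start time. */
	pthread_barrier_wait(&start_barrier);

	w->seen_start = shared_start;

	/* ... the timed part of the run would go here ... */
	return NULL;
}

/* Starts NWORKERS threads, synchronizes them, and returns how many ran. */
static int
run_workers(Worker *workers)
{
	/* NWORKERS workers plus the coordinating thread take part. */
	if (pthread_barrier_init(&start_barrier, NULL, NWORKERS + 1) != 0)
		abort();

	for (int i = 0; i < NWORKERS; i++)
	{
		workers[i].id = i;
		if (pthread_create(&workers[i].thread, NULL,
						   worker_main, &workers[i]) != 0)
			abort();
	}

	/* Phase 1: returns once every worker has finished its setup. */
	pthread_barrier_wait(&start_barrier);

	/* Only now take the start time, so setup cost is excluded. */
	clock_gettime(CLOCK_MONOTONIC, &shared_start);

	/* Phase 2: release the workers; shared_start is now visible to them. */
	pthread_barrier_wait(&start_barrier);

	for (int i = 0; i < NWORKERS; i++)
		pthread_join(workers[i].thread, NULL);

	pthread_barrier_destroy(&start_barrier);
	return NWORKERS;
}
```

Reusing one barrier for both phases works because a POSIX barrier resets itself once all participants have arrived. The point of the second phase is that no worker reads the start time before the coordinator has written it.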

#26Jonathan S. Katz
jkatz@postgresql.org
In reply to: Andres Freund (#24)
Re: Improving connection scalability: GetSnapshotData()

On 4/7/20 8:15 AM, Andres Freund wrote:

I think this is pretty close to being committable.

But: This patch came in very late for v13, and it took me much longer to
polish it up than I had hoped (partially distraction due to various bugs
I found (in particular snapshot_too_old), partially covid19, partially
"hell if I know"). The patchset touches core parts of the system. While
both Thomas and David have done some review, they haven't for the latest
version (mea culpa).

In many other instances I would say that the above suggests slipping to
v14, given the timing.

The main reason I am considering pushing is that I think this patchset
addresses one of the most common critiques of postgres, as well as very
common, hard to fix, real-world production issues. GetSnapshotData() has
been a major bottleneck for about as long as I have been using postgres,
and this addresses that to a significant degree.

A second reason I am considering it is that, in my opinion, the changes
are not all that complicated and not even that large. At least not for a
change to a problem that we've long tried to improve.

Even as recently as earlier this week there was a blog post making the
rounds about the pain points running PostgreSQL with many simultaneous
connections. Anything to help with that would go a long way, and looking
at the benchmarks you ran (at least with a quick, nonthorough glance)
this could and should be very positively impactful to a *lot* of
PostgreSQL users.

I can't comment on the "close to committable" aspect (at least not with
an informed, confident opinion) but if it is indeed close to committable
and you can put the work to finish polishing (read: "bug fixing" :-) and
we have a plan both of testing and, if need be, to revert, I would be
okay with including it, for whatever my vote is worth. Is the timing /
situation ideal? No, but the way you describe it, it sounds like there
is enough that can be done to ensure it's ready for Beta 1.

From a RMT standpoint, perhaps this is one of the "Recheck at Mid-Beta"
items, as well.

Thanks,

Jonathan

#27Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#24)
Re: Improving connection scalability: GetSnapshotData()

Comments:

In 0002, the comments in SnapshotSet() are virtually incomprehensible.
There's no commit message so the reasons for the changes are unclear.
But mostly looks unproblematic.

0003 looks like a fairly unrelated bug fix that deserves to be
discussed on the thread related to the original patch. Probably should
be an open item.

0004 looks fine.

Regarding 0005:

There's sort of a mix of terminology here: are we pruning tuples or
removing tuples or testing whether things are invisible? It would be
better to be more consistent.

+ * State for testing whether tuple versions may be removed. To improve
+ * GetSnapshotData() performance we don't compute an accurate value whenever
+ * acquiring a snapshot. Instead we compute boundaries above/below which we
+ * know that row versions are [not] needed anymore.  If at test time values
+ * falls in between the two, the boundaries can be recomputed (unless that
+ * just happened).

I don't like the wording here much. Maybe: State for testing whether
an XID is invisible to all current snapshots. If an XID precedes
maybe_needed_bound, it's definitely not visible to any current
snapshot. If it equals or follows definitely_needed_bound, that XID
isn't necessarily invisible to all snapshots. If it falls in between,
we're not sure. If, when testing a particular tuple, we see an XID
somewhere in the middle, we can try recomputing the boundaries to get
a more accurate answer (unless we've just done that). This is cheaper
than maintaining an accurate value all the time.
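
To make the three cases in that wording concrete, here is a small sketch of such a test. Only the two bound names come from the discussion; DemoXid, DemoVisState, xid_removable and the recompute callback are hypothetical, and XIDs are 64-bit here so that plain comparisons are safe. It is an assumption of the sketch that recomputing can only move the bounds forward.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DemoXid;		/* 64-bit, so no wraparound handling needed */

typedef struct DemoVisState
{
	DemoXid		maybe_needed;		/* XIDs before this are needed by nobody */
	DemoXid		definitely_needed;	/* XIDs at/after this may still be needed */
	bool		just_recomputed;	/* caller resets this, e.g. per snapshot */
} DemoVisState;

/* Returns an accurate horizon; stands in for a scan of all backends. */
typedef DemoXid (*RecomputeFn) (void *arg);

/* Demo callback: the "accurate" horizon is simply passed in via arg. */
static DemoXid
fixed_horizon(void *arg)
{
	return *(DemoXid *) arg;
}

/* Can a row version whose deleter has this XID be removed? */
static bool
xid_removable(DemoVisState *st, DemoXid xid, RecomputeFn recompute, void *arg)
{
	/* Cheap case 1: definitely invisible to every snapshot. */
	if (xid < st->maybe_needed)
		return true;

	/* Cheap case 2: some snapshot may still need it. */
	if (xid >= st->definitely_needed)
		return false;

	/* In between: pay for an accurate answer, but not repeatedly. */
	if (!st->just_recomputed)
	{
		DemoXid		accurate = recompute(arg);

		if (accurate > st->maybe_needed)
			st->maybe_needed = accurate;
		if (st->maybe_needed > st->definitely_needed)
			st->definitely_needed = st->maybe_needed;
		st->just_recomputed = true;
	}

	/* Still uncertain values are conservatively treated as needed. */
	return xid < st->maybe_needed;
}
```

The expensive recomputation is only reached for XIDs that fall between the two bounds, which is what makes this cheaper than keeping one exact value up to date at every snapshot.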

There's also the problem that this sorta contradicts the comment for
definitely_needed_bound. There it says intermediate values needed to
be tested against the ProcArray, whereas here it says we need to
recompute the bounds. That's kinda confusing.

ComputedHorizons seems like a fairly generic name. I think there's
some relationship between InvisibleToEveryoneState and
ComputedHorizons that should be brought out more clearly by the naming
and the comments.

+ /*
+ * The value of ShmemVariableCache->latestCompletedFullXid when
+ * ComputeTransactionHorizons() held ProcArrayLock.
+ */
+ FullTransactionId latest_completed;
+
+ /*
+ * The same for procArray->replication_slot_xmin and.
+ * procArray->replication_slot_catalog_xmin.
+ */
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;

Department of randomly inconsistent names. In general I think it's
quite hard to grasp the relationship between the different fields in
ComputedHorizons.

+static inline bool OldSnapshotThresholdActive(void)
+{
+ return old_snapshot_threshold >= 0;
+}

Formatting.
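
Presumably this refers to PostgreSQL's convention of putting the return type on its own line, so that the function name starts in the first column. A sketch of the same function in that layout, with a local stand-in for the old_snapshot_threshold variable (in PostgreSQL it is declared elsewhere):

```c
#include <stdbool.h>

/* Stand-in for the real configuration variable; -1 means disabled. */
static int	old_snapshot_threshold = -1;

static inline bool
OldSnapshotThresholdActive(void)
{
	return old_snapshot_threshold >= 0;
}
```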

+
+bool
+GinPageIsRecyclable(Page page)

Needs a comment. Or more than one.

- /*
- * If a transaction wrote a commit record in the gap between taking and
- * logging the snapshot then latestCompletedXid may already be higher than
- * the value from the snapshot, so check before we use the incoming value.
- */
- if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
- running->latestCompletedXid))
- ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
-
- Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
-
- LWLockRelease(ProcArrayLock);

This code got relocated so that the lock is released later, but you
didn't add any comments explaining why. Somebody will move it back and
then you'll yell at them for doing it wrong. :-)

+ /*
+ * Must have called GetOldestVisibleTransactionId() if using SnapshotAny.
+ * Shouldn't have for an MVCC snapshot. (It's especially worth checking
+ * this for parallel builds, since ambuild routines that support parallel
+ * builds must work these details out for themselves.)
+ */
+ Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+    !TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == SnapshotAny || !anyvisible);

This looks like a gratuitous code relocation.

+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer,
TransactionId *dead_after)

I don't much like the name dead_after, but I don't have a better
suggestion, either.

- * Deleter committed, but perhaps it was recent enough that some open
- * transactions could still see the tuple.
+ * Deleter committed, allow caller to check if it was recent enough that
+ * some open transactions could still see the tuple.

I think you could drop this change.

+ /*
+ * State related to determining whether a dead tuple is still needed.
+ */
+ InvisibleToEveryoneState *vistest;
+ TimestampTz limited_oldest_ts;
+ TransactionId limited_oldest_xmin;
+ /* have we made removal decision based on old_snapshot_threshold */
+ bool limited_oldest_committed;

Would benefit from more comments.

+ * accuring to prstate->vistest, but that can be removed based on

Typo.

Generally, heap_prune_satisfies_vacuum looks pretty good. The
limited_oldest_committed naming is confusing, but the comments make it
a lot clearer.

+ * If oldest btpo.xact in the deleted pages is invisible, then at

I'd say "invisible to everyone" here for clarity.

-latestCompletedXid variable.  This allows GetSnapshotData to use
-latestCompletedXid + 1 as xmax for its snapshot: there can be no
+latestCompletedFullXid variable.  This allows GetSnapshotData to use
+latestCompletedFullXid + 1 as xmax for its snapshot: there can be no

Is this fixing a preexisting README defect?

It might be useful if this README expanded on the new machinery a bit
instead of just updating the wording to account for it, but I'm not
sure exactly what that would look like or whether it would be too
duplicative of other things.

+void AssertTransactionIdMayBeOnDisk(TransactionId xid)

Formatting.

+ * Assert that xid is one that we could actually see on disk.

I don't know what this means. The whole purpose of this routine is
very unclear to me.

  * the secondary effect that it sets RecentGlobalXmin.  (This is critical
  * for anything that reads heap pages, because HOT may decide to prune
  * them even if the process doesn't attempt to modify any tuples.)
+ *
+ * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+ * not pushed/active does not reliably prevent HOT pruning (->xmin could
+ * e.g. be cleared when cache invalidations are processed).

Something needs to be done here... and in the other similar case.

Is this kind of review helpful?

...Robert

#28Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#27)
Re: Improving connection scalability: GetSnapshotData()

Hi,

Thanks for the review!

On 2020-04-07 12:41:07 -0400, Robert Haas wrote:

In 0002, the comments in SnapshotSet() are virtually incomprehensible.
There's no commit message so the reasons for the changes are unclear.
But mostly looks unproblematic.

I was planning to drop that patch pre-commit, at least for now. I think
there's a few live bugs here, but they're all older. I did send a few emails
about the class of problem; unfortunately it has been a fairly one-sided
conversation so far ;)

/messages/by-id/20200407072418.ccvnyjbrktyi3rzc@alap3.anarazel.de

0003 looks like a fairly unrelated bug fix that deserves to be
discussed on the thread related to the original patch. Probably should
be an open item.

There was some discussion in a separate thread:
/messages/by-id/20200406025651.fpzdb5yyb7qyhqko@alap3.anarazel.de
The only reason for including it in this patch stack is that I can't
really exercise the patchset without the fix (it's a bit sad that this
issue has gone unnoticed for months before I found it as part of the
development of this patch).

Think I'll push a minimal version now, and add an open item.

Regarding 0005:

There's sort of a mix of terminology here: are we pruning tuples or
removing tuples or testing whether things are invisible? It would be
better to be more consistent.

+ * State for testing whether tuple versions may be removed. To improve
+ * GetSnapshotData() performance we don't compute an accurate value whenever
+ * acquiring a snapshot. Instead we compute boundaries above/below which we
+ * know that row versions are [not] needed anymore.  If at test time values
+ * falls in between the two, the boundaries can be recomputed (unless that
+ * just happened).

I don't like the wording here much. Maybe: State for testing whether
an XID is invisible to all current snapshots. If an XID precedes
maybe_needed_bound, it's definitely not visible to any current
snapshot. If it equals or follows definitely_needed_bound, that XID
isn't necessarily invisible to all snapshots. If it falls in between,
we're not sure. If, when testing a particular tuple, we see an XID
somewhere in the middle, we can try recomputing the boundaries to get
a more accurate answer (unless we've just done that). This is cheaper
than maintaining an accurate value all the time.

I'll incorporate that, thanks.

There's also the problem that this sorta contradicts the comment for
definitely_needed_bound. There it says intermediate values needed to
be tested against the ProcArray, whereas here it says we need to
recompute the bounds. That's kinda confusing.

For me those are the same. Computing an accurate bound is visiting the
procarray. But I'll rephrase.

ComputedHorizons seems like a fairly generic name. I think there's
some relationship between InvisibleToEveryoneState and
ComputedHorizons that should be brought out more clearly by the naming
and the comments.

I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
much... But I find it hard to come up with something that's meaningfully
better.

I'll add a comment.

+ /*
+ * The value of ShmemVariableCache->latestCompletedFullXid when
+ * ComputeTransactionHorizons() held ProcArrayLock.
+ */
+ FullTransactionId latest_completed;
+
+ /*
+ * The same for procArray->replication_slot_xmin and.
+ * procArray->replication_slot_catalog_xmin.
+ */
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;

Department of randomly inconsistent names. In general I think it's
quite hard to grasp the relationship between the different fields in
ComputedHorizons.

What's the inconsistency? The dropped replication_ vs dropped FullXid
postfix?

+
+bool
+GinPageIsRecyclable(Page page)

Needs a comment. Or more than one.

Well, I started to write one a couple times. But it's really just moving
the pre-existing code from the macro into a function and there weren't
any comments around *any* of it before. All my comment attempts
basically just were restating the code in so many words, or would have
required more work than I saw justified in the context of just moving
code.

- /*
- * If a transaction wrote a commit record in the gap between taking and
- * logging the snapshot then latestCompletedXid may already be higher than
- * the value from the snapshot, so check before we use the incoming value.
- */
- if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
- running->latestCompletedXid))
- ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
-
- Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
-
- LWLockRelease(ProcArrayLock);

This code got relocated so that the lock is released later, but you
didn't add any comments explaining why. Somebody will move it back and
then you'll yell at them for doing it wrong. :-)

I just moved it because the code now references ->nextFullXid, which was
previously maintained after latestCompletedXid.

+ /*
+ * Must have called GetOldestVisibleTransactionId() if using SnapshotAny.
+ * Shouldn't have for an MVCC snapshot. (It's especially worth checking
+ * this for parallel builds, since ambuild routines that support parallel
+ * builds must work these details out for themselves.)
+ */
+ Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+    !TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == SnapshotAny || !anyvisible);

This looks like a gratuitous code relocation.

I found it hard to understand the comments because the Asserts were done
further away from where the relevant decisions were made. And I
think I have history to back me up: It looks to me like that is
because ab0dfc961b6a821f23d9c40c723d11380ce195a6 just put the progress
related code between the if (!scan) and the Asserts.

+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer,
TransactionId *dead_after)

I don't much like the name dead_after, but I don't have a better
suggestion, either.

- * Deleter committed, but perhaps it was recent enough that some open
- * transactions could still see the tuple.
+ * Deleter committed, allow caller to check if it was recent enough that
+ * some open transactions could still see the tuple.

I think you could drop this change.

Ok. Wasn't quite sure what to do with that comment.

Generally, heap_prune_satisfies_vacuum looks pretty good. The
limited_oldest_committed naming is confusing, but the comments make it
a lot clearer.

I didn't like _committed much either. But couldn't come up with
something short. _relied_upon?

+ * If oldest btpo.xact in the deleted pages is invisible, then at

I'd say "invisible to everyone" here for clarity.

-latestCompletedXid variable.  This allows GetSnapshotData to use
-latestCompletedXid + 1 as xmax for its snapshot: there can be no
+latestCompletedFullXid variable.  This allows GetSnapshotData to use
+latestCompletedFullXid + 1 as xmax for its snapshot: there can be no

Is this fixing a preexisting README defect?

It's just adjusting for the changed name of latestCompletedXid to
latestCompletedFullXid, as part of widening it to 64bits. I'm not
really a fan of adding that to the variable name, but surrounding code
already did it (cf VariableCache->nextFullXid), so I thought I'd follow
suit.

It might be useful if this README expanded on the new machinery a bit
instead of just updating the wording to account for it, but I'm not
sure exactly what that would look like or whether it would be too
duplicative of other things.

+void AssertTransactionIdMayBeOnDisk(TransactionId xid)

Formatting.

+ * Assert that xid is one that we could actually see on disk.

I don't know what this means. The whole purpose of this routine is
very unclear to me.

It's intended to be a double check against

* the secondary effect that it sets RecentGlobalXmin.  (This is critical
* for anything that reads heap pages, because HOT may decide to prune
* them even if the process doesn't attempt to modify any tuples.)
+ *
+ * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+ * not pushed/active does not reliably prevent HOT pruning (->xmin could
+ * e.g. be cleared when cache invalidations are processed).

Something needs to be done here... and in the other similar case.

Indeed. I wrote a separate email about it yesterday:
/messages/by-id/20200407072418.ccvnyjbrktyi3rzc@alap3.anarazel.de

Is this kind of review helpful?

Yes!

Greetings,

Andres Freund

#29Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#28)
Re: Improving connection scalability: GetSnapshotData()

More review, since it sounds like you like it:

0006 - Boring. But I'd probably make this move both xmin and xid back,
with related comment changes; see also next comment.

0007 -

+ TransactionId xidCopy; /* this backend's xid, a copy of this proc's
+    ProcGlobal->xids[] entry. */

Can we please NOT put Copy into the name like that? Pretty please?

+ int pgxactoff; /* offset into various ProcGlobal-> arrays
+ * NB: can change any time unless locks held!
+ */

I'm going to add the helpful comment "NB: can change any time unless
locks held!" to every data structure in the backend that is in shared
memory and not immutable. No need, of course, to mention WHICH
locks...

On a related note, PROC_HDR really, really, really needs comments
explaining the locking regimen for the new xids field.

+ ProcGlobal->xids[pgxactoff] = InvalidTransactionId;

Apparently this array is not dense in the sense that it excludes
unused slots, but comments elsewhere don't seem to entirely agree.
Maybe the comments discussing how it is "dense" need to be a little
more precise about this.

+ for (int i = 0; i < nxids; i++)

I miss my C89. Yeah, it's just me.

- if (!suboverflowed)
+ if (suboverflowed)
+ continue;
+

Do we really need to do this kind of diddling in this patch? I mean
yes to the idea, but no to things that are going to make it harder to
understand what happened if this blows up.

+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;

  /* ProcGlobal */
  size = add_size(size, sizeof(PROC_HDR));
- /* MyProcs, including autovacuum workers and launcher */
- size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
- /* AuxiliaryProcs */
- size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
- /* Prepared xacts */
- size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
- /* ProcStructLock */
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));

This seems like a bad idea. If we establish a precedent that it's OK
to have sizing routines that don't use add_size() and mul_size(),
people are going to cargo cult that into places where there is more
risk of overflow than there is here.

You've got a bunch of different places that talk about the new PGXACT
array and they are somewhat redundant yet without saying exactly the
same thing every time either. I think that needs cleanup.

One thing I didn't see is any clear discussion of what happens if the
two copies of the XID information don't agree with each other. That
should be added someplace, either in an appropriate code comment or in
a README or something. I *think* both are protected by the same locks,
but there's also some unlocked access to those structure members, so
it's not entirely a slam dunk.

...Robert

#30Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#28)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-07 10:51:12 -0700, Andres Freund wrote:

+void AssertTransactionIdMayBeOnDisk(TransactionId xid)

Formatting.

+ * Assert that xid is one that we could actually see on disk.

I don't know what this means. The whole purpose of this routine is
very unclear to me.

It's intended to be a double check against

forgetting things...? Err:

It is intended to make it easier to detect cases where the passed
TransactionId is not safe against wraparound. If there is protection
against wraparound, then the xid

a) may never be older than ShmemVariableCache->oldestXid (since
otherwise the rel/datfrozenxid could not have advanced past the xid,
and because oldestXid is what prevents ->nextFullXid from
advancing far enough to cause a wraparound)

b) cannot be >= ShmemVariableCache->nextFullXid. If it is, it cannot
recently have come from GetNewTransactionId(), and thus there is no
anti-wraparound protection either.

As full wraparounds are painful to exercise in testing,
AssertTransactionIdMayBeOnDisk() is intended to make it easier to detect
potential hazards.
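
To make the range check in a) and b) concrete, here is a minimal sketch. It is not the patch's code: Xid, xid_precedes() and xid_in_protected_range() are hypothetical stand-ins, the comparison is the usual modulo-2^32 one, and the special non-normal xids are ignored for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a 32-bit transaction id (assumption). */
typedef uint32_t Xid;

/* Modulo-2^32 "a is older than b"; special xids are ignored here. */
static bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical helper: true if xid lies in [oldest_xid, next_xid),
 * i.e. it is not older than oldest_xid and is older than next_xid.
 */
static bool
xid_in_protected_range(Xid xid, Xid oldest_xid, Xid next_xid)
{
    return !xid_precedes(xid, oldest_xid) && xid_precedes(xid, next_xid);
}
```

An assertion helper along these lines would just wrap the predicate in Assert(). As noted above, it can only catch xids that fall outside the window; an xid that wrapped around far enough to land inside it again goes undetected.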

The reason for the *OnDisk naming is that [oldestXid, nextFullXid) is
the appropriate check for values actually stored in tables. There could,
and probably should, be a narrower assertion ensuring that a xid is
protected against being pruned away (i.e. a PGPROC's xmin covering it).

The reason for being concerned enough in the new code to add the new
assertion helper (as well as a major motivating reason for making the
horizons 64 bit xids) is that it's much harder to ensure that "global
xmin" style horizons don't wrap around. By definition they include other
backends' ->xmin, and those can be released without a lock at any
time. As a lot of wraparound issues are triggered by very long-running
transactions, it is not even unlikely to hit such problems: At some
point somebody is going to kill that old backend and ->oldestXid will
advance very quickly.
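
A rough illustration of why 64 bit horizons help, under the assumption that a stale horizon can end up more than 2^31 xids behind the current one. The function names are made up for this sketch; only the modulo-2^32 comparison mirrors how 32 bit xids are normally compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 32 bit comparison, modulo 2^32: only valid if the xids are < 2^31 apart. */
static bool
xid32_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/* 64 bit xids (epoch + xid) do not wrap in practice, so plain < works. */
static bool
full_xid_precedes(uint64_t a, uint64_t b)
{
    return a < b;
}
```

With an old horizon of 1000 and a current value 3,000,000,000 xids later, full_xid_precedes() still orders the two correctly, while xid32_precedes() on the truncated values claims the old horizon is not older than the current one.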

There is a lot of code that is pretty unsafe around wraparounds... They
are getting easier and easier to hit on a regular schedule in production
(plenty of databases that hit wraparounds multiple times a week). And I
don't think we as PG developers often quite take that into
account.

Does that make some sense? Do you have a better suggestion for a name?

Greetings,

Andres Freund

In reply to: Andres Freund (#30)
Re: Improving connection scalability: GetSnapshotData()

On Tue, Apr 7, 2020 at 11:28 AM Andres Freund <andres@anarazel.de> wrote:

There is a lot of code that is pretty unsafe around wraparounds... They
are getting easier and easier to hit on a regular schedule in production
(plenty of databases that hit wraparounds multiple times a week). And I
don't think we as PG developers often quite take that into
account.

It would be nice if there was high level documentation on wraparound
hazards. Maybe even a dedicated README file.

--
Peter Geoghegan

#32Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#30)
Re: Improving connection scalability: GetSnapshotData()

On Tue, Apr 7, 2020 at 2:28 PM Andres Freund <andres@anarazel.de> wrote:

Does that make some sense? Do you have a better suggestion for a name?

I think it makes sense. I have two basic problems with the name. The
first is that "on disk" doesn't seem to be a very clear way of
describing what you're actually checking here, and it definitely
doesn't refer to an existing concept which sophisticated hackers can
be expected to understand. The second is that "may" is ambiguous in
English: it can either mean that something is permissible ("Johnny,
you may go to the bathroom") or that we do not have certain knowledge
of it ("Johnny may be in the bathroom"). When it is followed by "be",
it usually has the latter sense, although there are exceptions (e.g.
"She may be discharged from the hospital today if she wishes, but we
recommend that she stay for another day"). Consequently, I found that
use of "may be" in this context wicked confusing. What came to mind
was:

bool
RobertMayBeAGiraffe(void)
{
return true; // i mean, i haven't seen him since last week, so who knows?
}

So I suggest a name with "Is" or no verb, rather than one with
"MayBe." And I suggest something else instead of "OnDisk," e.g.
AssertTransactionIdIsInUsableRange() or
TransactionIdIsInAllowableRange() or
AssertTransactionIdWraparoundProtected(). I kind of like that last
one, but YMMV.

I wish to clarify that in sending these review emails I am taking no
position on whether or not it is prudent to commit any or all of them.
I do not think we can rule out the possibility that they will Break
Things, but neither do I wish to be seen as That Guy Who Stands In The
Way of Important Improvements. Time does not permit me a detailed
review anyway. So, these comments are provided in the hope that they
may be useful but without endorsement or acrimony. If other people
want to endorse or, uh, acrimoniate, based on my comments, that is up
to them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#28)
Re: Improving connection scalability: GetSnapshotData()

On Tue, Apr 7, 2020 at 1:51 PM Andres Freund <andres@anarazel.de> wrote:

ComputedHorizons seems like a fairly generic name. I think there's
some relationship between InvisibleToEveryoneState and
ComputedHorizons that should be brought out more clearly by the naming
and the comments.

I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
much... But I find it hard to come up with something that's meaningfully
better.

It would help to stick XID in there, like ComputedXIDHorizons. What I
find really baffling is that you seem to have two structures in the
same file that have essentially the same purpose, but the second one
(ComputedHorizons) has a lot more stuff in it. I can't understand why.

What's the inconsistency? The dropped replication_ vs dropped FullXid
postfix?

Yeah, just having the member names be randomly different between the
structs. Really harms greppability.

Generally, heap_prune_satisfies_vacuum looks pretty good. The
limited_oldest_committed naming is confusing, but the comments make it
a lot clearer.

I didn't like _committed much either. But couldn't come up with
something short. _relied_upon?

oldSnapshotLimitUsed or old_snapshot_limit_used, like currentCommandIdUsed?

It's just adjusting for the changed name of latestCompletedXid to
latestCompletedFullXid, as part of widening it to 64bits. I'm not
really a fan of adding that to the variable name, but surrounding code
already did it (cf VariableCache->nextFullXid), so I thought I'd follow
suit.

Oops, that was me misreading the diff. Sorry for the noise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#34Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#29)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-07 14:28:09 -0400, Robert Haas wrote:

More review, since it sounds like you like it:

0006 - Boring. But I'd probably make this move both xmin and xid back,
with related comment changes; see also next comment.

0007 -

+ TransactionId xidCopy; /* this backend's xid, a copy of this proc's
+    ProcGlobal->xids[] entry. */

Can we please NOT put Copy into the name like that? Pretty please?

Do you have a suggested naming scheme? Something indicating that it's
not the only place that needs to be updated?

+ int pgxactoff; /* offset into various ProcGlobal-> arrays
+ * NB: can change any time unless locks held!
+ */

I'm going to add the helpful comment "NB: can change any time unless
locks held!" to every data structure in the backend that is in shared
memory and not immutable. No need, of course, to mention WHICH
locks...

I think it's more on-point here, because we need to hold either of the
locks* even, for changes to a backend's own status that one reasonably
could expect would be safe to at least inspect. E.g. looking at
ProcGlobal->xids[MyProc->pgxactoff]
doesn't look suspicious, but could very well return another backend's
xid, if neither ProcArrayLock nor XidGenLock is held (because a
ProcArrayRemove() could have changed pgxactoff if a previous entry was
removed).

*see comment at PROC_HDR:

*
* Adding/Removing an entry into the procarray requires holding *both*
* ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
* needed because the dense arrays (see below) are accessed from
* GetNewTransactionId() and GetSnapshotData(), and we don't want to add
* further contention by both using one lock. Adding/Removing a procarray
* entry is much less frequent.
*/
typedef struct PROC_HDR
{
/* Array of PGPROC structures (not including dummies for prepared txns) */
PGPROC *allProcs;

/*
* Arrays with per-backend information that is hotly accessed, indexed by
* PGPROC->pgxactoff. These are in separate arrays for three reasons:
* First, to allow for as tight loops accessing the data as
* possible. Second, to prevent updates of frequently changing data from
* invalidating cachelines shared with less frequently changing
* data. Third to condense frequently accessed data into as few cachelines
* as possible.
*
* When entering a PGPROC for 2PC transactions with ProcArrayAdd(), those
* copies are used to provide the contents of the dense data, and will be
* transferred by ProcArrayAdd() while it already holds ProcArrayLock.
*/

there's also

* The various *Copy fields are copies of the data in ProcGlobal arrays that
* can be accessed without holding ProcArrayLock / XidGenLock (see PROC_HDR
* comments).

I had a more explicit warning/explanation about the dangers of accessing
the arrays without locks, but apparently went too far when reducing
duplicated comments.

On a related note, PROC_HDR really, really, really needs comments
explaining the locking regimen for the new xids field.

I'll expand the above, in particular highlighting the danger of
pgxactoff changing.
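
To spell out the hazard, here is a small single-threaded model of it. Everything in it (Proc, DenseArrays, dense_add(), dense_remove(), demo_stale_read()) is hypothetical and there is no locking at all; it only shows that removing an earlier entry shifts later entries down, so an offset read before the removal points at somebody else's slot afterwards.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PROCS 8

typedef struct Proc
{
    int      pgxactoff;     /* current index into the dense arrays */
    uint32_t xid;
} Proc;

typedef struct DenseArrays
{
    int      num_procs;
    uint32_t xids[MAX_PROCS];
    Proc    *owners[MAX_PROCS];
} DenseArrays;

/* Append a proc at the end of the dense arrays. */
static void
dense_add(DenseArrays *d, Proc *p)
{
    assert(d->num_procs < MAX_PROCS);
    p->pgxactoff = d->num_procs;
    d->xids[d->num_procs] = p->xid;
    d->owners[d->num_procs] = p;
    d->num_procs++;
}

/* Remove a proc, keeping the arrays dense by shifting later entries down. */
static void
dense_remove(DenseArrays *d, Proc *p)
{
    for (int i = p->pgxactoff; i < d->num_procs - 1; i++)
    {
        d->xids[i] = d->xids[i + 1];
        d->owners[i] = d->owners[i + 1];
        d->owners[i]->pgxactoff = i;    /* the offset changes under the owner */
    }
    d->num_procs--;
    p->pgxactoff = -1;
}

/* Returns what backend b reads through an offset it cached before a removal. */
static uint32_t
demo_stale_read(void)
{
    DenseArrays d = {0};
    Proc        a = {0, 10}, b = {0, 20}, c = {0, 30};
    int         stale_off;

    dense_add(&d, &a);
    dense_add(&d, &b);
    dense_add(&d, &c);
    stale_off = b.pgxactoff;    /* 1, read "without holding a lock" */
    dense_remove(&d, &a);       /* b moves to 0, c moves to 1 */
    return d.xids[stale_off];   /* now c's xid, not b's */
}
```

In this model demo_stale_read() returns c's xid (30) rather than b's (20), which is the "could very well return another backend's xid" case described above.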

+ ProcGlobal->xids[pgxactoff] = InvalidTransactionId;

Apparently this array is not dense in the sense that it excludes
unused slots, but comments elsewhere don't seem to entirely agree.

What do you mean with "unused slots"? Backends that committed?

Dense is intended to describe that the arrays only contain currently
"live" entries. I.e. that the first procArray->numProcs entries in each
array have the data for all procs (including prepared xacts) that are
"connected". This is extending the concept that already existed for
procArray->pgprocnos.

Whereas the PGPROC/PGXACT arrays have "unused" entries interspersed.

This is what previously led to the slow loop in GetSnapshotData(),
where we had to iterate over PGXACTs over an indirection in
procArray->pgprocnos. I.e. to only look at in-use PGXACTs we had to go
through allProcs[arrayP->pgprocnos[i]], which is, uh, suboptimal for
a performance critical routine holding a central lock.
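
Roughly, the difference between the two access patterns looks like the following sketch. It is a simplified model, not the real loop: Proc, INVALID_XID and both functions are made up for illustration, and xids are compared with plain < (no wraparound handling).

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_XID 0u

typedef struct Proc
{
    uint32_t xid;
    /* ... many other, rarely needed fields in the real struct ... */
} Proc;

/* Old pattern: reach each in-use proc through an index array. */
static uint32_t
oldest_xid_indirect(const Proc *all_procs, const int *pgprocnos, int num_procs)
{
    uint32_t oldest = INVALID_XID;

    for (int i = 0; i < num_procs; i++)
    {
        uint32_t xid = all_procs[pgprocnos[i]].xid;

        if (xid != INVALID_XID && (oldest == INVALID_XID || xid < oldest))
            oldest = xid;
    }
    return oldest;
}

/* Dense pattern: the first num_procs entries are exactly the live ones. */
static uint32_t
oldest_xid_dense(const uint32_t *xids, int num_procs)
{
    uint32_t oldest = INVALID_XID;

    for (int i = 0; i < num_procs; i++)
    {
        if (xids[i] != INVALID_XID && (oldest == INVALID_XID || xids[i] < oldest))
            oldest = xids[i];
    }
    return oldest;
}
```

Both return the same answer; the dense variant just walks one contiguous array instead of chasing an index into a much larger struct for every entry.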

I'll try to expand the comments around dense, but if you have a better
descriptor, I'm happy to use it.

Maybe the comments discussing how it is "dense" need to be a little
more precise about this.

+ for (int i = 0; i < nxids; i++)

I miss my C89. Yeah, it's just me.

Oh, dear god. I hate declaring variables like 'i' on function scope. The
bug that haunted me the longest in the development of this patch was in
XidCacheRemoveRunningXids, where there are both i and j, and a macro
XidCacheRemove(i), but the macro gets passed j as i.

- if (!suboverflowed)
+ if (suboverflowed)
+ continue;
+

Do we really need to do this kind of diddling in this patch? I mean
yes to the idea, but no to things that are going to make it harder to
understand what happened if this blows up.

I can try to reduce those differences. Given the rest of the changes it
didn't seem likely to matter. I found it hard to keep the branches
nesting in my head when seeing:
}
}
}
}
}

+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;

/* ProcGlobal */
size = add_size(size, sizeof(PROC_HDR));
- /* MyProcs, including autovacuum workers and launcher */
- size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
- /* AuxiliaryProcs */
- size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
- /* Prepared xacts */
- size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
- /* ProcStructLock */
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));

This seems like a bad idea. If we establish a precedent that it's OK
to have sizing routines that don't use add_size() and mul_size(),
people are going to cargo cult that into places where there is more
risk of overflow than there is here.

Hm. I'm not sure I see the problem. Are you concerned that TotalProcs
would overflow due to too big MaxBackends or max_prepared_xacts? The
multiplication itself is still protected by mul_size(). It didn't seem
correct to use add_size for the TotalProcs addition, since that's not
really a size. And since the limit for procs is much lower than
UINT32_MAX...

It seems worse to add a separate add_size calculation for each type of
proc entry, for each of the individual arrays. We'd end up with

size = add_size(size, sizeof(PROC_HDR));
size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
size = add_size(size, sizeof(slock_t));

size = add_size(size, mul_size(MaxBackends, sizeof(*ProcGlobal->xids)));
size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(*ProcGlobal->xids)));
size = add_size(size, mul_size(max_prepared_xacts, sizeof(*ProcGlobal->xids)));
size = add_size(size, mul_size(MaxBackends, sizeof(*ProcGlobal->subxidStates)));
size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(*ProcGlobal->subxidStates)));
size = add_size(size, mul_size(max_prepared_xacts, sizeof(*ProcGlobal->subxidStates)));
size = add_size(size, mul_size(MaxBackends, sizeof(*ProcGlobal->vacuumFlags)));
size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(*ProcGlobal->vacuumFlags)));
size = add_size(size, mul_size(max_prepared_xacts, sizeof(*ProcGlobal->vacuumFlags)));

instead of

size = add_size(size, sizeof(PROC_HDR));
size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
size = add_size(size, sizeof(slock_t));

size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));

which seems clearly worse.
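
For readers not familiar with the sizing helpers, here is a self-contained sketch of the pattern being argued about. checked_add(), checked_mul() and proc_array_size() are hypothetical stand-ins that report overflow through a return value (the real add_size() / mul_size() raise an error instead); the entry counts are summed once in a wider type and only the size arithmetic goes through the checked helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Overflow-checked size_t addition (stand-in for add_size()). */
static bool
checked_add(size_t a, size_t b, size_t *out)
{
    if (a > SIZE_MAX - b)
        return false;
    *out = a + b;
    return true;
}

/* Overflow-checked size_t multiplication (stand-in for mul_size()). */
static bool
checked_mul(size_t a, size_t b, size_t *out)
{
    if (a != 0 && b > SIZE_MAX / a)
        return false;
    *out = a * b;
    return true;
}

/* header_size + (sum of all proc counts) * entry_size, or false on overflow. */
static bool
proc_array_size(size_t header_size, uint32_t max_backends, uint32_t aux_procs,
                uint32_t prepared_xacts, size_t entry_size, size_t *out)
{
    /* Sum the counts in 64 bits so the count itself cannot wrap. */
    uint64_t total_procs = (uint64_t) max_backends + aux_procs + prepared_xacts;
    size_t   array_size;

    if (total_procs > SIZE_MAX)
        return false;
    if (!checked_mul((size_t) total_procs, entry_size, &array_size))
        return false;
    return checked_add(header_size, array_size, out);
}
```

Whether summing the counts outside the checked helpers sets a bad precedent is exactly the point under discussion; the sketch only shows that the count and the size can be guarded separately.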

You've got a bunch of different places that talk about the new PGXACT
array and they are somewhat redundant yet without saying exactly the
same thing every time either. I think that needs cleanup.

Could you point out a few of those comments, I'm not entirely sure which
you're talking about?

One thing I didn't see is any clear discussion of what happens if the
two copies of the XID information don't agree with each other.

It should never happen. There's asserts that try to ensure that. For the
xid-less case:

ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
...
Assert(!TransactionIdIsValid(proc->xidCopy));
Assert(proc->subxidStatusCopy.count == 0);
and for the case of having an xid:

ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
...
Assert(ProcGlobal->xids[pgxactoff] == proc->xidCopy);
...
Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatusCopy.count &&
ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatusCopy.overflowed);
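
As a toy model of the invariant those asserts check, assuming a single writer and no locking: Proc, dense_xids and the three functions below are invented for this sketch and are not the patch's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct Proc
{
    int      pgxactoff;     /* index of this proc's slot in dense_xids[] */
    uint32_t xid;           /* per-proc copy of the xid */
} Proc;

static uint32_t dense_xids[4];  /* stand-in for the dense xids array */

/* Every update goes through one place that writes both copies. */
static void
proc_set_xid(Proc *p, uint32_t xid)
{
    p->xid = xid;
    dense_xids[p->pgxactoff] = xid;
}

static bool
proc_copies_agree(const Proc *p)
{
    return dense_xids[p->pgxactoff] == p->xid;
}

/* At transaction end, verify the copies agree before clearing both. */
static void
proc_end_transaction(Proc *p)
{
    assert(proc_copies_agree(p));
    proc_set_xid(p, 0);
}
```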

That should be added someplace, either in an appropriate code comment
or in a README or something. I *think* both are protected by the same
locks, but there's also some unlocked access to those structure
members, so it's not entirely a slam dunk.

Hm. I considered who is allowed to modify those, and when, to really be
covered by the existing comments in transam/README. In particular in the
"Interlocking Transaction Begin, Transaction End, and Snapshots"
section.

Do you think that a comment explaining that the *Copy version has to be
kept up to date at all times (except when not yet added with ProcArrayAdd)
would ameliorate that concern?

Greetings,

Andres Freund

#35Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#29)
Re: Improving connection scalability: GetSnapshotData()

0008 -

Here again, I greatly dislike putting Copy in the name. It makes
little sense to pretend that either is the original and the other is
the copy. You just have the same data in two places. If one of them is
more authoritative, the place to explain that is in the comments, not
by elongating the structure member name and supposing anyone will be
able to make something of that.

+ *
+ * XXX: That's why this is using vacuumFlagsCopy.

I am not sure there's any problem with the code that needs fixing
here, so I might think about getting rid of this XXX. But this gets
back to my complaint about the locking regime being unclear. What I
think you need to do here is rephrase the previous paragraph so that
it explains the reason for using this copy a bit better. Like "We read
the copy of vacuumFlags from PGPROC rather than visiting the copy
attached to ProcGlobal because we can do that without taking a lock.
See fuller explanation in <place>." Or whatever.

0009, 0010 -

I think you've got this whole series of things divided up too finely.
Like, 0005 feels like the meat of it, and that has a bunch of things
in it that could plausibly be separated out as separate commits. 0007
also seems to do more than one kind of thing (see my comment regarding
moving some of that into 0006). But whacking everything around like a
crazy man in 0005 and a little more in 0007 and then doing the
following cleanup in these little tiny steps seems pretty lame.
Separating 0009 from 0010 is maybe the clearest example of that, but
IMHO it's pretty unclear why both of these shouldn't be merged with
0008.

To be clear, I exaggerate for effect. 0005 is not whacking everything
around like a crazy man. But it is a non-minimal patch, whereas I
consider 0009 and 0010 to be sub-minimal.

My comments on the Copy naming apply here as well. I am also starting
to wonder why exactly we need two copies of all this stuff. Perhaps
I've just failed to absorb the idea for having read the patch too
briefly, but I think that we need to make sure that it's super-clear
why we're doing that. If we just needed it for one field because
$REASONS, that would be one thing, but if we need it for all of them
then there must be some underlying principle here that needs a good
explanation in an easy-to-find and centrally located place.

0011 -

+ * Number of top-level transactions that completed in some form since the
+ * start of the server. This currently is solely used to check whether
+ * GetSnapshotData() needs to recompute the contents of the snapshot, or
+ * not. There are likely other users of this.  Always above 1.

Does it only count XID-bearing transactions? If so, best mention that.
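
For context, the idea behind the counter can be sketched like this. It is a guess at the general shape, not the patch's code: CachedSnapshot, get_snapshot() and transaction_completed() are hypothetical, the caller passes in the xmin that a full rebuild would compute, and starting the counter at 1 so that 0 can mean "no snapshot built yet" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

typedef struct CachedSnapshot
{
    uint64_t completion_count;  /* counter value when this was built; 0 = never */
    uint32_t xmin;
} CachedSnapshot;

static uint64_t xact_completion_count = 1;  /* bumped when a transaction completes */
static int      snapshot_rebuilds = 0;      /* how often the expensive path ran */

static void
transaction_completed(void)
{
    xact_completion_count++;
}

/* Rebuild the snapshot only if something completed since it was built. */
static void
get_snapshot(CachedSnapshot *snap, uint32_t current_xmin)
{
    if (snap->completion_count == xact_completion_count)
        return;                 /* nothing completed: contents still valid */

    snap->xmin = current_xmin;  /* stand-in for the full procarray scan */
    snap->completion_count = xact_completion_count;
    snapshot_rebuilds++;
}
```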

+ * transactions completed since the last GetSnapshotData()..

Too many periods.

+ /* Same with CSN */
+ ShmemVariableCache->xactCompletionCount++;

If I didn't know that CSN stood for commit sequence number from
reading years of mailing list traffic, I'd be lost here. So I think
this comment shouldn't use that term.

+GetSnapshotDataFillTooOld(Snapshot snapshot)

Uh... no clue what's going on here. Granted the code had no comments
in the old place either, so I guess it's not worse, but even the name
of the new function is pretty incomprehensible.

+ * Helper function for GetSnapshotData() that check if the bulk of the

checks

+ * the fields that need to change and returns true. false is returned
+ * otherwise.

Otherwise, it returns false.

+ * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go

I know what it means to re-enter a building, but I don't know what it
means to re-enter the snapshot's xmin.

This whole comment seems a bit murky.

...Robert

#36Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#32)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-07 14:51:52 -0400, Robert Haas wrote:

On Tue, Apr 7, 2020 at 2:28 PM Andres Freund <andres@anarazel.de> wrote:

Does that make some sense? Do you have a better suggestion for a name?

I think it makes sense. I have two basic problems with the name. The
first is that "on disk" doesn't seem to be a very clear way of
describing what you're actually checking here, and it definitely
doesn't refer to an existing concept which sophisticated hackers can
be expected to understand. The second is that "may" is ambiguous in
English: it can either mean that something is permissible ("Johnny,
you may go to the bathroom") or that we do not have certain knowledge
of it ("Johnny may be in the bathroom"). When it is followed by "be",
it usually has the latter sense, although there are exceptions (e.g.
"She may be discharged from the hospital today if she wishes, but we
recommend that she stay for another day"). Consequently, I found that
use of "may be" in this context wicked confusing.

Well, it *is* only a vague test :). It shouldn't ever have a false
positive, but there's plenty of chance for false negatives (if wrapped
around far enough).

So I suggest a name with "Is" or no verb, rather than one with
"MayBe." And I suggest something else instead of "OnDisk," e.g.
AssertTransactionIdIsInUsableRange() or
TransactionIdIsInAllowableRange() or
AssertTransactionIdWraparoundProtected(). I kind of like that last
one, but YMMV.

Makes sense - but they all seem to express a bit more certainty than I
think the test actually provides.

I explicitly did not want (and added a comment to that effect) to have
something like TransactionIdIsInAllowableRange(), because there never
can be a safe use of its return value, as far as I can tell.

The "OnDisk" was intended to clarify that the range it verifies is
whether it'd be ok for the xid to have been found in a relation.

Greetings,

Andres Freund

#37Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#33)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-07 15:03:46 -0400, Robert Haas wrote:

On Tue, Apr 7, 2020 at 1:51 PM Andres Freund <andres@anarazel.de> wrote:

ComputedHorizons seems like a fairly generic name. I think there's
some relationship between InvisibleToEveryoneState and
ComputedHorizons that should be brought out more clearly by the naming
and the comments.

I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
much... But I find it hard to come up with something that's meaningfully
better.

It would help to stick XID in there, like ComputedXIDHorizons. What I
find really baffling is that you seem to have two structures in the
same file that have essentially the same purpose, but the second one
(ComputedHorizons) has a lot more stuff in it. I can't understand why.

ComputedHorizons are the various "accurate" horizons computed by
ComputeTransactionHorizons(). That's used to determine a horizon for
vacuuming (via GetOldestVisibleTransactionId()) and other similar use
cases.

The various InvisibleToEveryoneState variables contain the boundary
based horizons, and are updated / initially filled by
GetSnapshotData(). When a tested value falls between the boundaries,
we update the approximate boundaries using
ComputeTransactionHorizons(). That briefly makes the boundaries in
the InvisibleToEveryoneState accurate - but future GetSnapshotData()
calls will increase the definitely_needed_bound (if transactions
committed since).
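
In sketch form, the boundary test works something like the following. The struct and function names are invented for illustration (only definitely_needed as a bound name comes from the description above), xids are 64 bit so plain comparisons suffice, and the accurate computation is reduced to a minimum over a list of xmins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct HorizonBounds
{
    uint64_t maybe_needed;      /* xids below this are invisible to everyone */
    uint64_t definitely_needed; /* xids at or above this are still needed */
} HorizonBounds;

static int accurate_computations = 0;   /* counts the expensive path */

/* Stand-in for the accurate computation: the oldest xmin, else next_xid. */
static uint64_t
compute_accurate_horizon(const uint64_t *xmins, int n, uint64_t next_xid)
{
    uint64_t horizon = next_xid;

    accurate_computations++;
    for (int i = 0; i < n; i++)
    {
        if (xmins[i] < horizon)
            horizon = xmins[i];
    }
    return horizon;
}

/* Cheap answer outside the boundaries, accurate recomputation in between. */
static bool
xid_is_removable(HorizonBounds *b, uint64_t xid,
                 const uint64_t *xmins, int n, uint64_t next_xid)
{
    if (xid >= b->definitely_needed)
        return false;
    if (xid < b->maybe_needed)
        return true;

    /* In the uncertain range: make the boundaries accurate, then retest. */
    b->maybe_needed = compute_accurate_horizon(xmins, n, next_xid);
    if (b->definitely_needed < b->maybe_needed)
        b->definitely_needed = b->maybe_needed;
    return xid < b->maybe_needed;
}
```

The point of the two boundaries is that most tests fall clearly on one side and never pay for the accurate computation.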

The ComputedHorizons fields could instead just be pointer based
arguments to ComputeTransactionHorizons(), but that seems clearly
worse.

I'll change ComputedHorizons's comment to say that it's the result of
ComputeTransactionHorizons(), not the "state".

What's the inconsistency? The dropped replication_ vs dropped FullXid
postfix?

Yeah, just having the member names be randomly different between the
structs. Really harms greppability.

The long names make it hard to keep line lengths in control, in
particular when also involving the long macro names for TransactionId /
FullTransactionId comparators...

Generally, heap_prune_satisfies_vacuum looks pretty good. The
limited_oldest_committed naming is confusing, but the comments make it
a lot clearer.

I didn't like _committed much either. But couldn't come up with
something short. _relied_upon?

oldSnapshotLimitUsed or old_snapshot_limit_used, like currentCommandIdUsed?

Will go for old_snapshot_limit_used, and rename the other variables to
old_snapshot_limit_ts, old_snapshot_limit_xmin.

Greetings,

Andres Freund

#38Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#36)
Re: Improving connection scalability: GetSnapshotData()

On Tue, Apr 7, 2020 at 3:31 PM Andres Freund <andres@anarazel.de> wrote:

Well, it *is* only a vague test :). It shouldn't ever have a false
positive, but there's plenty of chance for false negatives (if wrapped
around far enough).

Sure, but I think you get my point. Asserting that something "might
be" true isn't much of an assertion. Saying that it's in the correct
range is not to say there can't be a problem - but we're saying that
it IS in the expected range, not that it may or may not be.
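
To illustrate why a range check on 32-bit xids can miss problems once values have wrapped around far enough, here is a small sketch. It is not the assertion from the patch; xid_in_window and xid_from_full are hypothetical helpers, and the window bounds in the usage note are made up. A 64-bit xid that is stale by exactly one epoch truncates to the same 32-bit value as the current one, so a check on the truncated value cannot tell them apart.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical range check: is xid inside [oldest, next) when distances are
 * measured modulo 2^32?
 */
static bool
xid_in_window(uint32_t xid, uint32_t oldest, uint32_t next)
{
    return (uint32_t) (xid - oldest) < (uint32_t) (next - oldest);
}

/* Truncate a 64-bit xid to its low 32 bits, discarding the epoch. */
static uint32_t
xid_from_full(uint64_t full)
{
    return (uint32_t) full;
}
```

For example, the 64-bit values 5000000000 and 5000000000 - 2^32 both truncate to 705032704, so both pass a check against the window [705032000, 705033000), even though one of them is a full epoch old. A value such as 10 is correctly rejected.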

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#39Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#34)
Re: Improving connection scalability: GetSnapshotData()

On Tue, Apr 7, 2020 at 3:24 PM Andres Freund <andres@anarazel.de> wrote:

0007 -

+ TransactionId xidCopy; /* this backend's xid, a copy of this proc's
+    ProcGlobal->xids[] entry. */

Can we please NOT put Copy into the name like that? Pretty please?

Do you have a suggested naming scheme? Something indicating that it's
not the only place that needs to be updated?

I don't think trying to indicate that in the structure member names is
a useful idea. I think you should give them the same names, maybe with
an "s" to pluralize the copy hanging off of ProcGlobal, and put a
comment that says something like:

We keep two copies of each of the following three fields. One copy is
here in the PGPROC, and the other is in a more densely-packed array
hanging off of PGXACT. Both copies of the value must always be updated
at the same time and under the same locks, so that it is always the
case that MyProc->xid == ProcGlobal->xids[MyProc->pgprocno] and
similarly for vacuumFlags and WHATEVER. Note, however, that the arrays
attached to ProcGlobal only contain entries for PGPROC structures that
are currently part of the ProcArray (i.e. there is currently a backend
for that PGPROC). We use those arrays when STUFF and the copies in the
individual PGPROC when THINGS.

I think it's more on-point here, because we need to hold either of the
locks* even, for changes to a backend's own status that one reasonably
could expect would be safe to at least inspect.

It's just too brief and obscure to be useful.

+ ProcGlobal->xids[pgxactoff] = InvalidTransactionId;

Apparently this array is not dense in the sense that it excludes
unused slots, but comments elsewhere don't seem to entirely agree.

What do you mean with "unused slots"? Backends that committed?

Backends that have no XID. You mean, I guess, that it is "dense" in
the sense that only live backends are in there, not "dense" in the
sense that only active write transactions are in there. It would be
nice to nail that down better; the wording I suggested above might
help.

+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;

/* ProcGlobal */
size = add_size(size, sizeof(PROC_HDR));
- /* MyProcs, including autovacuum workers and launcher */
- size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
- /* AuxiliaryProcs */
- size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
- /* Prepared xacts */
- size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
- /* ProcStructLock */
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));

This seems like a bad idea. If we establish a precedent that it's OK
to have sizing routines that don't use add_size() and mul_size(),
people are going to cargo cult that into places where there is more
risk of overflow than there is here.

Hm. I'm not sure I see the problem. Are you concerned that TotalProcs
would overflow due to too big MaxBackends or max_prepared_xacts? The
multiplication itself is still protected by add_size(). It didn't seem
correct to use add_size for the TotalProcs addition, since that's not
really a size. And since the limit for procs is much lower than
UINT32_MAX...

I'm concerned that there are 0 uses of add_size in any shared-memory
sizing function, and I think it's best to keep it that way. If you
initialize TotalProcs = add_size(MaxBackends,
add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts)) then I'm happy. I
think it's a desperately bad idea to imagine that we can dispense with
overflow checks here and be safe. It's just too easy for that to
become false due to future code changes, or get copied to other places
where it's unsafe now.
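
The argument above can be sketched in standalone C. This is not PostgreSQL's add_size() / mul_size() (those raise an error on overflow); the helpers below are hypothetical equivalents that report overflow through a boolean return value so the example stays self-contained. proc_array_bytes is likewise an invented name for "sum three counts, then multiply by the element size".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Add two sizes; return false instead of wrapping around on overflow. */
static bool
checked_add_size(size_t a, size_t b, size_t *result)
{
    if (a > SIZE_MAX - b)
        return false;
    *result = a + b;
    return true;
}

/* Multiply two sizes; return false instead of wrapping around on overflow. */
static bool
checked_mul_size(size_t a, size_t b, size_t *result)
{
    if (a != 0 && b > SIZE_MAX / a)
        return false;
    *result = a * b;
    return true;
}

/*
 * Compute (backends + aux + prepared) * elem_size with every step checked,
 * including the additions that form the total count.
 */
static bool
proc_array_bytes(size_t backends, size_t aux, size_t prepared,
                 size_t elem_size, size_t *bytes)
{
    size_t total;

    return checked_add_size(backends, aux, &total) &&
        checked_add_size(total, prepared, &total) &&
        checked_mul_size(total, elem_size, bytes);
}
```

The design point is that the count itself goes through the checked helpers, so the routine stays safe even if the limits on its inputs change later or the pattern is copied elsewhere.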

You've got a bunch of different places that talk about the new PGXACT
array and they are somewhat redundant yet without saying exactly the
same thing every time either. I think that needs cleanup.

Could you point out a few of those comments, I'm not entirely sure which
you're talking about?

+ /*
+ * Also allocate a separate arrays for data that is frequently (e.g. by
+ * GetSnapshotData()) accessed from outside a backend.  There is one entry
+ * in each for every *live* PGPROC entry, and they are densely packed so
+ * that the first procArray->numProc entries are all valid.  The entries
+ * for a PGPROC in those arrays are at PGPROC->pgxactoff.
+ *
+ * Note that they may not be accessed without ProcArrayLock held! Upon
+ * ProcArrayRemove() later entries will be moved.
+ *
+ * These are separate from the main PGPROC array so that the most heavily
+ * accessed data is stored contiguously in memory in as few cache lines as
+ * possible. This provides significant performance benefits, especially on
+ * a multiprocessor system.
+ */
+ * Arrays with per-backend information that is hotly accessed, indexed by
+ * PGPROC->pgxactoff. These are in separate arrays for three reasons:
+ * First, to allow for as tight loops accessing the data as
+ * possible. Second, to prevent updates of frequently changing data from
+ * invalidating cachelines shared with less frequently changing
+ * data. Third to condense frequently accessed data into as few cachelines
+ * as possible.
+ *
+ * The various *Copy fields are copies of the data in ProcGlobal arrays that
+ * can be accessed without holding ProcArrayLock / XidGenLock (see PROC_HDR
+ * comments).
+ * Adding/Removing an entry into the procarray requires holding *both*
+ * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
+ * needed because the dense arrays (see below) are accessed from
+ * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
+ * further contention by both using one lock. Adding/Removing a procarray
+ * entry is much less frequent.

I'm not saying these are all entirely redundant with each other;
that's not so. But I don't think it gives a terribly clear grasp of
the overall picture either, even taking all of them together.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#40Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#35)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-07 15:26:36 -0400, Robert Haas wrote:

0008 -

Here again, I greatly dislike putting Copy in the name. It makes
little sense to pretend that either is the original and the other is
the copy. You just have the same data in two places. If one of them is
more authoritative, the place to explain that is in the comments, not
by elongating the structure member name and supposing anyone will be
able to make something of that.

Ok.

0009, 0010 -

I think you've got this whole series of things divided up too finely.
Like, 0005 feels like the meat of it, and that has a bunch of things
in it that could plausibly be separated out as separate commits. 0007
also seems to do more than one kind of thing (see my comment regarding
moving some of that into 0006). But whacking everything around like a
crazy man in 0005 and a little more in 0007 and then doing the
following cleanup in these little tiny steps seems pretty lame.
Separating 0009 from 0010 is maybe the clearest example of that, but
IMHO it's pretty unclear why both of these shouldn't be merged with
0008.

I found it a *lot* easier to review / evolve them this way. I e.g. had
an earlier version in which the subxid part of the change worked
substantially differently (it tried to elide the overflowed bool, by
defining -1 as the indicator for overflows), and it'd been way harder
to change that if I didn't have a patch with *just* the subxid changes.

I'd not push them separated by time, but I do think it'd make sense to
push them as separate commits. I think it's easier to review them in
case of a bug in a separate area.

My comments on the Copy naming apply here as well. I am also starting
to wonder why exactly we need two copies of all this stuff. Perhaps
I've just failed to absorb the idea for having read the patch too
briefly, but I think that we need to make sure that it's super-clear
why we're doing that. If we just needed it for one field because
$REASONS, that would be one thing, but if we need it for all of them
then there must be some underlying principle here that needs a good
explanation in an easy-to-find and centrally located place.

The main reason is that we want to be able to cheaply check the current
state of the variables (mostly when checking a backend's own state). We
can't access the "dense" ones without holding a lock, but we e.g. don't
want to make ProcArrayEndTransactionInternal() take a lock just to check
if vacuumFlags is set.

It turns out to also be good for performance to have the copy for
another reason: The "dense" arrays share cachelines with other
backends. That's worth it because it allows to make GetSnapshotData(),
by far the most frequent operation, touch fewer cache lines. But it also
means that it's more likely that a backend's "dense" array entry isn't
in a local cpu cache (it'll be pulled out of there when modified in
another backend). In many cases we don't need the shared entry at commit
etc time though, we just need to check if it is set - and most of the
time it won't be. The local entry allows to do that cheaply.

Basically it makes sense to access the PGPROC variable when checking a
single backend's data, especially when we have to look at the PGPROC for
other reasons already. It makes sense to look at the "dense" arrays if
we need to look at many / most entries, because we then benefit from the
reduced indirection and better cross-process cacheability.
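
A minimal sketch of that split, under loud assumptions: the types and names below (Proc, ProcShared, dense_off, proc_set_xid, ...) are invented for illustration, locking is omitted entirely, and xid comparison ignores wraparound. It only shows the access pattern: the owner checks its own per-backend field, while a scan over all backends walks the contiguous array.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_PROCS   8
#define INVALID_XID 0u

/* Per-backend structure (hypothetical): holds a mirror of the xid. */
typedef struct Proc
{
    uint32_t xid;       /* cheap for the owning backend to inspect */
    int      dense_off; /* index of this backend's entry in the dense array */
} Proc;

/* Shared structure (hypothetical): one contiguous array entry per backend. */
typedef struct ProcShared
{
    int      num_procs;
    uint32_t xids[MAX_PROCS];
} ProcShared;

/* Both places must always be updated together (locking omitted here). */
static void
proc_set_xid(ProcShared *g, Proc *p, uint32_t xid)
{
    p->xid = xid;
    g->xids[p->dense_off] = xid;
}

/* Single-backend check: touches only the backend's own structure. */
static bool
proc_has_xid(const Proc *p)
{
    return p->xid != INVALID_XID;
}

/* Scan over all backends: a tight loop over the dense array only. */
static uint32_t
oldest_running_xid(const ProcShared *g)
{
    uint32_t oldest = INVALID_XID;

    for (int i = 0; i < g->num_procs; i++)
    {
        uint32_t x = g->xids[i];

        if (x != INVALID_XID && (oldest == INVALID_XID || x < oldest))
            oldest = x;
    }
    return oldest;
}
```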

0011 -

+ * Number of top-level transactions that completed in some form since the
+ * start of the server. This currently is solely used to check whether
+ * GetSnapshotData() needs to recompute the contents of the snapshot, or
+ * not. There are likely other users of this.  Always above 1.

Does it only count XID-bearing transactions? If so, best mention that.

Oh, good point.

+GetSnapshotDataFillTooOld(Snapshot snapshot)

Uh... no clue what's going on here. Granted the code had no comments
in the old place either, so I guess it's not worse, but even the name
of the new function is pretty incomprehensible.

It fills the old_snapshot_threshold fields of a Snapshot.

+ * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go

I know what it means to re-enter a building, but I don't know what it
means to re-enter the snapshot's xmin.

Re-entering it into the procarray, thereby preventing rows that the
snapshot could see from being removed.

This whole comment seems a bit murky.

How about:
/*
* If the current xactCompletionCount is still the same as it was at the
* time the snapshot was built, we can be sure that rebuilding the
* contents of the snapshot the hard way would result in the same snapshot
* contents:
*
* As explained in transam/README, the set of xids considered running by
* GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
* contents only depend on transactions with xids and xactCompletionCount
* is incremented whenever a transaction with an xid finishes (while
* holding ProcArrayLock exclusively). Thus the xactCompletionCount check
* ensures we would detect if the snapshot would have changed.
*
* As the snapshot contents are the same as they were before, it is safe
* to re-enter the snapshot's xmin into the PGPROC array. None of the rows
* visible under the snapshot could already have been removed (that'd
* require the set of running transactions to change) and it fulfills the
* requirement that concurrent GetSnapshotData() calls yield the same
* xmin.
*/
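
The check described in that comment could look roughly like the sketch below. All names (SharedState, CachedSnapshot, snapshot_can_reuse, ...) are hypothetical, the counter is a plain integer with no locking, and the snapshot carries only an xmin; it is meant to show the shape of the test, not the patch's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared counter (hypothetical): bumped when an xid-bearing xact finishes. */
typedef struct SharedState
{
    uint64_t xact_completion_count;
} SharedState;

/* A cached snapshot remembers the counter value it was built under. */
typedef struct CachedSnapshot
{
    bool     valid;
    uint64_t built_at_count;
    uint32_t xmin;
} CachedSnapshot;

/* Called when a transaction that had an xid commits or aborts. */
static void
xid_transaction_finished(SharedState *s)
{
    s->xact_completion_count++;
}

/* Record that the snapshot was just built the hard way. */
static void
snapshot_record_build(CachedSnapshot *snap, const SharedState *s, uint32_t xmin)
{
    snap->valid = true;
    snap->built_at_count = s->xact_completion_count;
    snap->xmin = xmin;
}

/*
 * If no xid-bearing transaction finished since the snapshot was built,
 * rebuilding it would yield the same contents, so it can be reused.
 */
static bool
snapshot_can_reuse(const CachedSnapshot *snap, const SharedState *s)
{
    return snap->valid && snap->built_at_count == s->xact_completion_count;
}
```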

Greetings,

Andres Freund

#41Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#39)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-07 16:13:07 -0400, Robert Haas wrote:

On Tue, Apr 7, 2020 at 3:24 PM Andres Freund <andres@anarazel.de> wrote:

+ ProcGlobal->xids[pgxactoff] = InvalidTransactionId;

Apparently this array is not dense in the sense that it excludes
unused slots, but comments elsewhere don't seem to entirely agree.

What do you mean with "unused slots"? Backends that committed?

Backends that have no XID. You mean, I guess, that it is "dense" in
the sense that only live backends are in there, not "dense" in the
sense that only active write transactions are in there.

Correct.

I tried the "only active write transaction" approach, btw, and had a
hard time making it scale well (due to the much more frequent moving of
entries at commit/abort time). If we were to go to a 'only active
transactions' array at some point we'd imo still need pretty much all
the other changes made here - so I'm not worried about it for now.
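
For readers unfamiliar with what "moving of entries" means for a densely packed array, here is a small sketch. It is an assumption-laden illustration, not the procarray code: DenseArray and its functions are invented names, new entries are simply appended, and no locks are taken. Removing one entry shifts all later entries down and requires fixing up their offsets, which is why doing this at every commit/abort would be much more expensive than doing it only when a backend exits.

```c
#include <stdint.h>
#include <string.h>

#define MAX_PROCS 8

/* Hypothetical dense array: the first num_procs entries are always valid. */
typedef struct DenseArray
{
    int      num_procs;
    int      procnos[MAX_PROCS];    /* dense index -> proc number */
    uint32_t xids[MAX_PROCS];       /* dense index -> xid (0 if none) */
    int      offsets[MAX_PROCS];    /* proc number -> dense index, -1 if absent */
} DenseArray;

static void
dense_init(DenseArray *d)
{
    memset(d, 0, sizeof(*d));
    for (int i = 0; i < MAX_PROCS; i++)
        d->offsets[i] = -1;
}

/* Append an entry for procno; returns its dense index, or -1 if full. */
static int
dense_add(DenseArray *d, int procno, uint32_t xid)
{
    int idx = d->num_procs;

    if (idx >= MAX_PROCS)
        return -1;
    d->procnos[idx] = procno;
    d->xids[idx] = xid;
    d->offsets[procno] = idx;
    d->num_procs++;
    return idx;
}

/* Remove procno's entry, shifting later entries down to keep the array dense. */
static void
dense_remove(DenseArray *d, int procno)
{
    int idx = d->offsets[procno];
    int tail;

    if (idx < 0)
        return;                 /* not present */

    tail = d->num_procs - idx - 1;
    memmove(&d->procnos[idx], &d->procnos[idx + 1], tail * sizeof(d->procnos[0]));
    memmove(&d->xids[idx], &d->xids[idx + 1], tail * sizeof(d->xids[0]));
    d->num_procs--;
    d->offsets[procno] = -1;

    /* Every moved entry now lives one slot earlier. */
    for (int i = idx; i < d->num_procs; i++)
        d->offsets[d->procnos[i]] = i;
}
```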

+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;

/* ProcGlobal */
size = add_size(size, sizeof(PROC_HDR));
- /* MyProcs, including autovacuum workers and launcher */
- size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
- /* AuxiliaryProcs */
- size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
- /* Prepared xacts */
- size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
- /* ProcStructLock */
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));

This seems like a bad idea. If we establish a precedent that it's OK
to have sizing routines that don't use add_size() and mul_size(),
people are going to cargo cult that into places where there is more
risk of overflow than there is here.

Hm. I'm not sure I see the problem. Are you concerned that TotalProcs
would overflow due to too big MaxBackends or max_prepared_xacts? The
multiplication itself is still protected by add_size(). It didn't seem
correct to use add_size for the TotalProcs addition, since that's not
really a size. And since the limit for procs is much lower than
UINT32_MAX...

I'm concerned that there are 0 uses of add_size in any shared-memory
sizing function, and I think it's best to keep it that way.

I can't make sense of that sentence?

We already have code like this, and have for a long time:
/* Size of the ProcArray structure itself */
#define PROCARRAY_MAXPROCS (MaxBackends + max_prepared_xacts)

adding NUM_AUXILIARY_PROCS doesn't really change that, does it?

If you initialize TotalProcs = add_size(MaxBackends,
add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts)) then I'm happy.

Will do.

Greetings,

Andres Freund

#42Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#24)
6 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

Hi

On 2020-04-07 05:15:03 -0700, Andres Freund wrote:

SEE BELOW: What, and what not, to do for v13.

[ description of changes ]

I think this is pretty close to being committable.

But: This patch came in very late for v13, and it took me much longer to
polish it up than I had hoped (partially distraction due to various bugs
I found (in particular snapshot_too_old), partially covid19, partially
"hell if I know"). The patchset touches core parts of the system. While
both Thomas and David have done some review, they haven't for the latest
version (mea culpa).

In many other instances I would say that the above suggests slipping to
v14, given the timing.

The main reason I am considering pushing is that I think this patchset
addresses one of the most common critiques of postgres, as well as very
common, hard to fix, real-world production issues. GetSnapshotData() has
been a major bottleneck for about as long as I have been using postgres,
and this addresses that to a significant degree.

A second reason I am considering it is that, in my opinion, the changes
are not all that complicated and not even that large. At least not for a
change to a problem that we've long tried to improve.

Obviously we all have a tendency to think our own work is important, and
that we deserve a bit more leeway than others. So take the above with a
grain of salt.

I tried hard, but came up short. It's 5 AM, and I am still finding
comments that aren't quite right. For a while I thought I'd be pushing a
few hours ... And even if it were ready now: This is too large a patch
to push this tired (but damn, I'd love to).

Unfortunately addressing Robert's comments made me realize I didn't like
some of my own naming. In particular I started to dislike
InvisibleToEveryone, and some of the procarray.c variables around
"visible". After trying about half a dozen schemes I think I found
something that makes some sense, although I am still not perfectly
happy.

I think the attached set of patches addresses most of Robert's review
comments, minus a few minor quibbles where I thought he was wrong
(fundamentally wrong of course). There are no *Copy fields in PGPROC
anymore, there's a lot more comments above PROC_HDR (not duplicated
elsewhere). I've reduced the interspersed changes to GetSnapshotData()
so those can be done separately.

There's also somewhat meaningful commit messages now. But
snapshot scalability: Move in-progress xids to ProcGlobal->xids[].
needs to be expanded to mention the changed locking requirements.

Realistically it still needs 2-3 hours of proof-reading.

This makes me sad :(

Attachments:

v9-0001-snapshot-scalability-Don-t-compute-global-horizon.patch (text/x-diff; charset=us-ascii)
From 528cc04aa3ee3a29208e9972b5ce6970d6651b3d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 8 Apr 2020 04:33:19 -0700
Subject: [PATCH v9 1/6] snapshot scalability: Don't compute global horizons
 when building snapshots.

To make GetSnapshotData() more scalable, it cannot look at each proc's
xmin (see Discussion link below). Due to the frequency at which xmins are
updated, that just does not scale.

Without accessing xmins GetSnapshotData() cannot calculate accurate thresholds
as it has so far. But we don't really have to: The horizons don't actually
change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaced with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >=
   definitely_needed are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated
by GetSnapshotData()), it is likely that further tests can benefit from an
earlier computation of accurate horizons.

To avoid regressing performance when old_snapshot_threshold is set (as
that requires an accurate horizon to be computed),
heap_page_prune_opt() doesn't unconditionally call
TransactionIdLimitedForOldSnapshots() anymore. Both the computation of
the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) are now only done when necessary to
remove tuples.

Subsequent commits will take further advantage of the fact that
GetSnapshotData() will not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the
tests currently are not meaningful, and it seems best to address them
separately.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/access/ginblock.h               |    4 +-
 src/include/access/heapam.h                 |   11 +-
 src/include/access/transam.h                |  100 +-
 src/include/storage/bufpage.h               |    6 -
 src/include/storage/proc.h                  |    8 -
 src/include/storage/procarray.h             |   39 +-
 src/include/utils/snapmgr.h                 |   37 +-
 src/include/utils/snapshot.h                |    6 +
 src/backend/access/gin/ginvacuum.c          |   26 +
 src/backend/access/gist/gistutil.c          |    8 +-
 src/backend/access/gist/gistxlog.c          |   10 +-
 src/backend/access/heap/heapam.c            |   19 +-
 src/backend/access/heap/heapam_handler.c    |   24 +-
 src/backend/access/heap/heapam_visibility.c |   79 +-
 src/backend/access/heap/pruneheap.c         |  207 +++-
 src/backend/access/heap/vacuumlazy.c        |   24 +-
 src/backend/access/index/indexam.c          |    3 +-
 src/backend/access/nbtree/README            |   10 +-
 src/backend/access/nbtree/nbtpage.c         |    6 +-
 src/backend/access/nbtree/nbtree.c          |   28 +-
 src/backend/access/nbtree/nbtxlog.c         |   10 +-
 src/backend/access/spgist/spgvacuum.c       |    6 +-
 src/backend/access/transam/README           |   92 +-
 src/backend/access/transam/varsup.c         |   50 +
 src/backend/access/transam/xlog.c           |   11 +-
 src/backend/commands/analyze.c              |    2 +-
 src/backend/commands/vacuum.c               |   41 +-
 src/backend/postmaster/autovacuum.c         |    4 +
 src/backend/replication/logical/launcher.c  |    4 +
 src/backend/replication/walreceiver.c       |   17 +-
 src/backend/replication/walsender.c         |   15 +-
 src/backend/storage/ipc/procarray.c         | 1022 +++++++++++++++----
 src/backend/utils/adt/selfuncs.c            |   20 +-
 src/backend/utils/init/postinit.c           |    4 +
 src/backend/utils/time/snapmgr.c            |  258 ++---
 contrib/amcheck/verify_nbtree.c             |    8 +-
 contrib/pg_visibility/pg_visibility.c       |   18 +-
 contrib/pgstattuple/pgstatapprox.c          |    2 +-
 src/tools/pgindent/typedefs.list            |    2 +
 39 files changed, 1630 insertions(+), 611 deletions(-)

diff --git a/src/include/access/ginblock.h b/src/include/access/ginblock.h
index 3f64fd572e3..fe66a95226b 100644
--- a/src/include/access/ginblock.h
+++ b/src/include/access/ginblock.h
@@ -12,6 +12,7 @@
 
 #include "access/transam.h"
 #include "storage/block.h"
+#include "storage/bufpage.h"
 #include "storage/itemptr.h"
 #include "storage/off.h"
 
@@ -134,8 +135,7 @@ typedef struct GinMetaPageData
  */
 #define GinPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
 #define GinPageSetDeleteXid(page, xid) ( ((PageHeader) (page))->pd_prune_xid = xid)
-#define GinPageIsRecyclable(page) ( PageIsNew(page) || (GinPageIsDeleted(page) \
-	&& TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalXmin)))
+extern bool GinPageIsRecyclable(Page page);
 
 /*
  * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f279edc4734..ef2fcb55a71 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -172,9 +172,12 @@ extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
 														 int nitems);
 
 /* in heap/pruneheap.c */
+struct GlobalVisState;
 extern void heap_page_prune_opt(Relation relation, Buffer buffer);
 extern int	heap_page_prune(Relation relation, Buffer buffer,
-							TransactionId OldestXmin,
+							struct GlobalVisState *vistest,
+							TransactionId limited_oldest_xmin,
+							TimestampTz limited_oldest_ts,
 							bool report_stats, TransactionId *latestRemovedXid);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
@@ -201,11 +204,15 @@ extern TM_Result HeapTupleSatisfiesUpdate(HeapTuple stup, CommandId curcid,
 										  Buffer buffer);
 extern HTSV_Result HeapTupleSatisfiesVacuum(HeapTuple stup, TransactionId OldestXmin,
 											Buffer buffer);
+extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple stup, Buffer buffer,
+												   TransactionId *dead_after);
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 								 uint16 infomask, TransactionId xid);
 extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
-extern bool HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin);
+struct GlobalVisState;
+extern bool HeapTupleIsSurelyDead(struct GlobalVisState *vistest,
+								  HeapTuple htup);
 
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 9a808f64ebe..94ba797f026 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -54,6 +54,8 @@
 #define FullTransactionIdFollowsOrEquals(a, b) ((a).value >= (b).value)
 #define FullTransactionIdIsValid(x)		TransactionIdIsValid(XidFromFullTransactionId(x))
 #define InvalidFullTransactionId		FullTransactionIdFromEpochAndXid(0, InvalidTransactionId)
+#define FirstNormalFullTransactionId	FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId)
+#define FullTransactionIdIsNormal(x)	FullTransactionIdFollowsOrEquals(x, FirstNormalFullTransactionId)
 
 /*
  * A 64 bit value that contains an epoch and a TransactionId.  This is
@@ -93,15 +95,48 @@ FullTransactionIdFromU64(uint64 value)
 			(dest) = FirstNormalTransactionId; \
 	} while(0)
 
-/* advance a FullTransactionId variable, stepping over special XIDs */
+/*
+ * Advance a FullTransactionId variable, stepping over xids that would appear
+ * to be special only when viewed as 32bit XIDs.
+ */
 static inline void
 FullTransactionIdAdvance(FullTransactionId *dest)
 {
 	dest->value++;
+
+	/*
+	 * In contrast to 32bit XIDs don't step over the "actual" special xids.
+	 * For 64bit xids these can't be reached as part of a wraparound as they
+	 * can in the 32bit case.
+	 */
+	if (FullTransactionIdPrecedes(*dest, FirstNormalFullTransactionId))
+		return;
+
+	/*
+	 * But we do need to step over XIDs that'd appear special only for 32bit
+	 * XIDs.
+	 */
 	while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
 		dest->value++;
 }
 
+/*
+ * Retreat a FullTransactionId variable, stepping over xids that would appear
+ * to be special only when viewed as 32bit XIDs.
+ */
+static inline void
+FullTransactionIdRetreat(FullTransactionId *dest)
+{
+	dest->value--;
+
+	/* see FullTransactionIdAdvance() */
+	if (FullTransactionIdPrecedes(*dest, FirstNormalFullTransactionId))
+		return;
+
+	while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
+		dest->value--;
+}
+
 /* back up a transaction ID variable, handling wraparound correctly */
 #define TransactionIdRetreat(dest)	\
 	do { \
@@ -193,8 +228,8 @@ typedef struct VariableCacheData
 	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
-	TransactionId latestCompletedXid;	/* newest XID that has committed or
-										 * aborted */
+	FullTransactionId latestCompletedFullXid;	/* newest full XID that has
+												 * committed or aborted */
 
 	/*
 	 * These fields are protected by CLogTruncationLock
@@ -244,6 +279,12 @@ extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertTransactionInAllowableRange(TransactionId xid);
+#else
+#define AssertTransactionInAllowableRange(xid) ((void)true)
+#endif
+
 /*
  * Some frontend programs include this header.  For compilers that emit static
  * inline functions even when they're unused, that leads to unsatisfied
@@ -260,6 +301,59 @@ ReadNewTransactionId(void)
 	return XidFromFullTransactionId(ReadNextFullTransactionId());
 }
 
+/* return transaction ID backed up by amount, handling wraparound correctly */
+static inline TransactionId
+TransactionIdRetreatedBy(TransactionId xid, uint32 amount)
+{
+	xid -= amount;
+
+	while (xid < FirstNormalTransactionId)
+		xid--;
+
+	return xid;
+}
+
+/* return the older of the two IDs */
+static inline TransactionId
+TransactionIdOlder(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdPrecedes(a, b))
+		return a;
+	return b;
+}
+
+/* return the older of the two IDs, assuming they're both normal */
+static inline TransactionId
+NormalTransactionIdOlder(TransactionId a, TransactionId b)
+{
+	Assert(TransactionIdIsNormal(a));
+	Assert(TransactionIdIsNormal(b));
+	if (NormalTransactionIdPrecedes(a, b))
+		return a;
+	return b;
+}
+
+/* return the newer of the two IDs */
+static inline FullTransactionId
+FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
+{
+	if (!FullTransactionIdIsValid(a))
+		return b;
+
+	if (!FullTransactionIdIsValid(b))
+		return a;
+
+	if (FullTransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 #endif							/* FRONTEND */
 
 #endif							/* TRANSAM_H */
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 3f88683a059..51b8f994ac0 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -389,12 +389,6 @@ PageValidateSpecialPointer(Page page)
 #define PageClearAllVisible(page) \
 	(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
 
-#define PageIsPrunable(page, oldestxmin) \
-( \
-	AssertMacro(TransactionIdIsNormal(oldestxmin)), \
-	TransactionIdIsValid(((PageHeader) (page))->pd_prune_xid) && \
-	TransactionIdPrecedes(((PageHeader) (page))->pd_prune_xid, oldestxmin) \
-)
 #define PageSetPrunable(page, xid) \
 do { \
 	Assert(TransactionIdIsNormal(xid)); \
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ae4f573ab46..23d12c1f72f 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -42,13 +42,6 @@ struct XidCache
 
 /*
  * Flags for PGXACT->vacuumFlags
- *
- * Note: If you modify these flags, you need to modify PROCARRAY_XXX flags
- * in src/include/storage/procarray.h.
- *
- * PROC_RESERVED may later be assigned for use in vacuumFlags, but its value is
- * used for PROCARRAY_SLOTS_XMIN in procarray.h, so GetOldestXmin won't be able
- * to match and ignore processes with this flag set.
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -56,7 +49,6 @@ struct XidCache
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
-#define		PROC_RESERVED				0x20	/* reserved for procarray */
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a5c7d0c0644..ea8a876ca45 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -20,41 +20,6 @@
 #include "utils/snapshot.h"
 
 
-/*
- * These are to implement PROCARRAY_FLAGS_XXX
- *
- * Note: These flags are cloned from PROC_XXX flags in src/include/storage/proc.h
- * to avoid forcing to include proc.h when including procarray.h. So if you modify
- * PROC_XXX flags, you need to modify these flags.
- */
-#define		PROCARRAY_VACUUM_FLAG			0x02	/* currently running lazy
-													 * vacuum */
-#define		PROCARRAY_ANALYZE_FLAG			0x04	/* currently running
-													 * analyze */
-#define		PROCARRAY_LOGICAL_DECODING_FLAG 0x10	/* currently doing logical
-													 * decoding outside xact */
-
-#define		PROCARRAY_SLOTS_XMIN			0x20	/* replication slot xmin,
-													 * catalog_xmin */
-/*
- * Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
- * PGXACT->vacuumFlags. Other flags are used for different purposes and
- * have no corresponding PROC flag equivalent.
- */
-#define		PROCARRAY_PROC_FLAGS_MASK	(PROCARRAY_VACUUM_FLAG | \
-										 PROCARRAY_ANALYZE_FLAG | \
-										 PROCARRAY_LOGICAL_DECODING_FLAG)
-
-/* Use the following flags as an input "flags" to GetOldestXmin function */
-/* Consider all backends except for logical decoding ones which manage xmin separately */
-#define		PROCARRAY_FLAGS_DEFAULT			PROCARRAY_LOGICAL_DECODING_FLAG
-/* Ignore vacuum backends */
-#define		PROCARRAY_FLAGS_VACUUM			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_VACUUM_FLAG
-/* Ignore analyze backends */
-#define		PROCARRAY_FLAGS_ANALYZE			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_ANALYZE_FLAG
-/* Ignore both vacuum and analyze backends */
-#define		PROCARRAY_FLAGS_VACUUM_ANALYZE	PROCARRAY_FLAGS_DEFAULT | PROCARRAY_VACUUM_FLAG | PROCARRAY_ANALYZE_FLAG
-
 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
 extern void ProcArrayAdd(PGPROC *proc);
@@ -88,9 +53,11 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, int flags);
+extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
+extern TransactionId GetOldestTransactionIdConsideredRunning(void);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
+extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);
 
 extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
 extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce841..3ddc526febc 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -52,13 +52,12 @@ extern Size SnapMgrShmemSize(void);
 extern void SnapMgrInit(void);
 extern TimestampTz GetSnapshotCurrentTimestamp(void);
 extern TimestampTz GetOldSnapshotThresholdTimestamp(void);
+extern void SnapshotTooOldMagicForTest(void);
 
 extern bool FirstSnapshotSet;
 
 extern PGDLLIMPORT TransactionId TransactionXmin;
 extern PGDLLIMPORT TransactionId RecentXmin;
-extern PGDLLIMPORT TransactionId RecentGlobalXmin;
-extern PGDLLIMPORT TransactionId RecentGlobalDataXmin;
 
 /* Variables representing various special snapshot semantics */
 extern PGDLLIMPORT SnapshotData SnapshotSelfData;
@@ -78,11 +77,12 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 
 /*
  * Similarly, some initialization is required for a NonVacuumable snapshot.
- * The caller must supply the xmin horizon to use (e.g., RecentGlobalXmin).
+ * The caller must supply the visibility cutoff state to use (c.f.
+ * GlobalVisTestFor()).
  */
-#define InitNonVacuumableSnapshot(snapshotdata, xmin_horizon)  \
+#define InitNonVacuumableSnapshot(snapshotdata, vistestp)  \
 	((snapshotdata).snapshot_type = SNAPSHOT_NON_VACUUMABLE, \
-	 (snapshotdata).xmin = (xmin_horizon))
+	 (snapshotdata).vistest = (vistestp))
 
 /*
  * Similarly, some initialization is required for SnapshotToast.  We need
@@ -98,6 +98,11 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 	((snapshot)->snapshot_type == SNAPSHOT_MVCC || \
 	 (snapshot)->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 
+static inline bool
+OldSnapshotThresholdActive(void)
+{
+	return old_snapshot_threshold >= 0;
+}
 
 extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
@@ -121,8 +126,6 @@ extern void UnregisterSnapshot(Snapshot snapshot);
 extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
 extern void UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner);
 
-extern FullTransactionId GetFullRecentGlobalXmin(void);
-
 extern void AtSubCommit_Snapshot(int level);
 extern void AtSubAbort_Snapshot(int level);
 extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
@@ -131,13 +134,29 @@ extern void ImportSnapshot(const char *idstr);
 extern bool XactHasExportedSnapshots(void);
 extern void DeleteAllExportedSnapshotFiles(void);
 extern bool ThereAreNoPriorRegisteredSnapshots(void);
-extern TransactionId TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-														 Relation relation);
+extern bool TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
+												Relation relation,
+												TransactionId *limit_xid,
+												TimestampTz *limit_ts);
+extern void SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit);
 extern void MaintainOldSnapshotTimeMapping(TimestampTz whenTaken,
 										   TransactionId xmin);
 
 extern char *ExportSnapshot(Snapshot snapshot);
 
+/*
+ * These live in procarray.c because they're intimately linked to the
+ * procarray contents, but thematically they better fit into snapmgr.h.
+ */
+typedef struct GlobalVisState GlobalVisState;
+extern GlobalVisState *GlobalVisTestFor(Relation rel);
+extern bool GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid);
+extern bool GlobalVisTestIsRemovableFullXid(GlobalVisState *state, FullTransactionId fxid);
+extern FullTransactionId GlobalVisTestNonRemovableFullHorizon(GlobalVisState *state);
+extern TransactionId GlobalVisTestNonRemovableHorizon(GlobalVisState *state);
+extern bool GlobalVisCheckRemovableXid(Relation rel, TransactionId xid);
+extern bool GlobalVisIsRemovableFullXid(Relation rel, FullTransactionId fxid);
+
 /*
  * Utility functions for implementing visibility routines in table AMs.
  */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 4796edb63aa..35b1f05bea6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -192,6 +192,12 @@ typedef struct SnapshotData
 	 */
 	uint32		speculativeToken;
 
+	/*
+	 * For SNAPSHOT_NON_VACUUMABLE (and hopefully more in the future) this is
+	 * used to determine whether a row could be vacuumed.
+	 */
+	struct GlobalVisState *vistest;
+
 	/*
 	 * Book-keeping information, used by the snapshot manager
 	 */
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 8ae4fd95a7b..9cd6638df62 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -793,3 +793,29 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 	return stats;
 }
+
+/*
+ * Return whether Page can safely be recycled.
+ */
+bool
+GinPageIsRecyclable(Page page)
+{
+	TransactionId delete_xid;
+
+	if (PageIsNew(page))
+		return true;
+
+	if (!GinPageIsDeleted(page))
+		return false;
+
+	delete_xid = GinPageGetDeleteXid(page);
+
+	if (!TransactionIdIsValid(delete_xid))
+		return true;
+
+	/*
+	 * If no backend could still view delete_xid as running, all scans
+	 * concurrent with ginDeletePage() must have finished.
+	 */
+	return GlobalVisCheckRemovableXid(NULL, delete_xid);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 765329bbcd4..bfda7fbe3d5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -891,15 +891,13 @@ gistPageRecyclable(Page page)
 		 * As long as that can happen, we must keep the deleted page around as
 		 * a tombstone.
 		 *
-		 * Compare the deletion XID with RecentGlobalXmin. If deleteXid <
-		 * RecentGlobalXmin, then no scan that's still in progress could have
+		 * For that check if the deletion XID could still be visible to
+		 * anyone. If not, then no scan that's still in progress could have
 		 * seen its downlink, and we can recycle it.
 		 */
 		FullTransactionId deletexid_full = GistPageGetDeleteXid(page);
-		FullTransactionId recentxmin_full = GetFullRecentGlobalXmin();
 
-		if (FullTransactionIdPrecedes(deletexid_full, recentxmin_full))
-			return true;
+		return GlobalVisIsRemovableFullXid(NULL, deletexid_full);
 	}
 	return false;
 }
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index b60dba052fa..af4731cff18 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -387,11 +387,11 @@ gistRedoPageReuse(XLogReaderState *record)
 	 * PAGE_REUSE records exist to provide a conflict point when we reuse
 	 * pages in the index via the FSM.  That's all they do though.
 	 *
-	 * latestRemovedXid was the page's deleteXid.  The deleteXid <
-	 * RecentGlobalXmin test in gistPageRecyclable() conceptually mirrors the
-	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
-	 * Consequently, one XID value achieves the same exclusion effect on
-	 * master and standby.
+	 * latestRemovedXid was the page's deleteXid.  The
+	 * GlobalVisIsRemovableFullXid(deleteXid) test in gistPageRecyclable()
+	 * conceptually mirrors the pgxact->xmin > limitXmin test in
+	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
+	 * same exclusion effect on master and standby.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c4a5aa616a3..d572a0f01d3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1535,6 +1535,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		at_chain_start;
 	bool		valid;
 	bool		skip;
+	GlobalVisState *vistest = NULL;
 
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
@@ -1545,7 +1546,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	at_chain_start = first_call;
 	skip = !first_call;
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
 	Assert(BufferGetBlockNumber(buffer) == blkno);
 
 	/* Scan through possible multiple members of HOT-chain */
@@ -1634,9 +1636,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		 * Note: if you change the criterion here for what is "dead", fix the
 		 * planner's get_actual_variable_range() function to match.
 		 */
-		if (all_dead && *all_dead &&
-			!HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
-			*all_dead = false;
+		if (all_dead && *all_dead)
+		{
+			if (!vistest)
+				vistest = GlobalVisTestFor(relation);
+
+			if (!HeapTupleIsSurelyDead(vistest, heapTuple))
+				*all_dead = false;
+		}
 
 		/*
 		 * Check to see if HOT chain continues past this tuple; if so fetch
@@ -2192,8 +2199,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		RelationPutHeapTuple(relation, buffer, heaptuples[ndone], false);
 
 		/*
-		 * Note that heap_multi_insert is not used for catalog tuples yet,
-		 * but this will cover the gap once that is the case.
+		 * Note that heap_multi_insert is not used for catalog tuples yet, but
+		 * this will cover the gap once that is the case.
 		 */
 		if (needwal && need_cids)
 			log_heap_new_cid(relation, heaptuples[ndone]);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 56b35622f1a..659fc4d8697 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1201,7 +1201,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
-		OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
+		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
 
 	if (!scan)
 	{
@@ -1242,6 +1242,17 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	hscan = (HeapScanDesc) scan;
 
+	/*
+	 * Must have called GetOldestNonRemovableTransactionId() if using
+	 * SnapshotAny.  Shouldn't have for an MVCC snapshot. (It's especially
+	 * worth checking this for parallel builds, since ambuild routines that
+	 * support parallel builds must work these details out for themselves.)
+	 */
+	Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+		   !TransactionIdIsValid(OldestXmin));
+	Assert(snapshot == SnapshotAny || !anyvisible);
+
 	/* Publish number of blocks to scan */
 	if (progress)
 	{
@@ -1261,17 +1272,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 									 nblocks);
 	}
 
-	/*
-	 * Must call GetOldestXmin() with SnapshotAny.  Should never call
-	 * GetOldestXmin() with MVCC snapshot. (It's especially worth checking
-	 * this for parallel builds, since ambuild routines that support parallel
-	 * builds must work these details out for themselves.)
-	 */
-	Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
-	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
-		   !TransactionIdIsValid(OldestXmin));
-	Assert(snapshot == SnapshotAny || !anyvisible);
-
 	/* set our scan endpoints */
 	if (!allow_sync)
 		heap_setscanlimits(scan, start_blockno, numblocks);
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aab..b25b3e429ed 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1154,19 +1154,56 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
  *	we mainly want to know is if a tuple is potentially visible to *any*
  *	running transaction.  If so, it can't be removed yet by VACUUM.
  *
- * OldestXmin is a cutoff XID (obtained from GetOldestXmin()).  Tuples
- * deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
- * still be visible to some open transaction, so we can't remove them,
- * even if we see that the deleting transaction has committed.
+ * OldestXmin is a cutoff XID (obtained from
+ * GetOldestNonRemovableTransactionId()).  Tuples deleted by XIDs >=
+ * OldestXmin are deemed "recently dead"; they might still be visible to some
+ * open transaction, so we can't remove them, even if we see that the deleting
+ * transaction has committed.
  */
 HTSV_Result
 HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 						 Buffer buffer)
+{
+	TransactionId dead_after = InvalidTransactionId;
+	HTSV_Result res;
+
+	res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);
+
+	if (res == HEAPTUPLE_RECENTLY_DEAD)
+	{
+		Assert(TransactionIdIsValid(dead_after));
+
+		if (TransactionIdPrecedes(dead_after, OldestXmin))
+			res = HEAPTUPLE_DEAD;
+	}
+	else
+		Assert(!TransactionIdIsValid(dead_after));
+
+	return res;
+}
+
+/*
+ * Work horse for HeapTupleSatisfiesVacuum and similar routines.
+ *
+ * In contrast to HeapTupleSatisfiesVacuum this routine, when encountering a
+ * tuple that could still be visible to some backend, stores the xid that
+ * needs to be compared with the horizon in *dead_after, and returns
+ * HEAPTUPLE_RECENTLY_DEAD. The caller then can perform the comparison with
+ * the horizon.  This is e.g. useful when comparing with different horizons.
+ *
+ * Note: HEAPTUPLE_DEAD can still be returned here, e.g. if the inserting
+ * transaction aborted.
+ */
+HTSV_Result
+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *dead_after)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
 	Assert(ItemPointerIsValid(&htup->t_self));
 	Assert(htup->t_tableOid != InvalidOid);
+	Assert(dead_after != NULL);
+
+	*dead_after = InvalidTransactionId;
 
 	/*
 	 * Has inserting transaction committed?
@@ -1323,17 +1360,15 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 		else if (TransactionIdDidCommit(xmax))
 		{
 			/*
-			 * The multixact might still be running due to lockers.  If the
-			 * updater is below the xid horizon, we have to return DEAD
-			 * regardless -- otherwise we could end up with a tuple where the
-			 * updater has to be removed due to the horizon, but is not pruned
-			 * away.  It's not a problem to prune that tuple, because any
-			 * remaining lockers will also be present in newer tuple versions.
+			 * The multixact might still be running due to lockers.  Need to
+			 * allow for pruning if below the xid horizon regardless --
+			 * otherwise we could end up with a tuple where the updater has to
+			 * be removed due to the horizon, but is not pruned away.  It's
+			 * not a problem to prune that tuple, because any remaining
+			 * lockers will also be present in newer tuple versions.
 			 */
-			if (!TransactionIdPrecedes(xmax, OldestXmin))
-				return HEAPTUPLE_RECENTLY_DEAD;
-
-			return HEAPTUPLE_DEAD;
+			*dead_after = xmax;
+			return HEAPTUPLE_RECENTLY_DEAD;
 		}
 		else if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple), false))
 		{
@@ -1372,14 +1407,11 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 	}
 
 	/*
-	 * Deleter committed, but perhaps it was recent enough that some open
-	 * transactions could still see the tuple.
+	 * Deleter committed, allow caller to check if it was recent enough that
+	 * some open transactions could still see the tuple.
 	 */
-	if (!TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin))
-		return HEAPTUPLE_RECENTLY_DEAD;
-
-	/* Otherwise, it's dead and removable */
-	return HEAPTUPLE_DEAD;
+	*dead_after = HeapTupleHeaderGetRawXmax(tuple);
+	return HEAPTUPLE_RECENTLY_DEAD;
 }
 
 
@@ -1418,7 +1450,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
  *	if the tuple is removable.
  */
 bool
-HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
+HeapTupleIsSurelyDead(GlobalVisState *vistest, HeapTuple htup)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
@@ -1459,7 +1491,8 @@ HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
 		return false;
 
 	/* Deleter committed, so tuple is dead if the XID is old enough. */
-	return TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin);
+	return GlobalVisTestIsRemovableXid(vistest,
+									   HeapTupleHeaderGetRawXmax(tuple));
 }
 
 /*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 1794cfd8d9a..453465c54a9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -23,12 +23,30 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
+#include "utils/snapmgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
 /* Working data for heap_page_prune and subroutines */
 typedef struct
 {
+	Relation	rel;
+
+	/* tuple visibility test, initialized for the relation */
+	GlobalVisState *vistest;
+
+	/*
+	 * Thresholds set by TransactionIdLimitedForOldSnapshots() if they have
+	 * been computed (done on demand, and only if
+	 * OldSnapshotThresholdActive()). The first time a tuple is about to be
+	 * removed based on the limited horizon, old_snap_used is set to true, and
+	 * SetOldSnapshotThresholdTimestamp() is called. See
+	 * heap_prune_satisfies_vacuum().
+	 */
+	TimestampTz old_snap_ts;
+	TransactionId old_snap_xmin;
+	bool		old_snap_used;
+
 	TransactionId new_prune_xid;	/* new prune hint value for page */
 	TransactionId latestRemovedXid; /* latest xid to be removed by this prune */
 	int			nredirected;	/* numbers of entries in arrays below */
@@ -43,9 +61,8 @@ typedef struct
 } PruneState;
 
 /* Local functions */
-static int	heap_prune_chain(Relation relation, Buffer buffer,
+static int	heap_prune_chain(Buffer buffer,
 							 OffsetNumber rootoffnum,
-							 TransactionId OldestXmin,
 							 PruneState *prstate);
 static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid);
 static void heap_prune_record_redirect(PruneState *prstate,
@@ -65,16 +82,16 @@ static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
  * if there's not any use in pruning.
  *
  * Caller must have pin on the buffer, and must *not* have a lock on it.
- *
- * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
- * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
  */
 void
 heap_page_prune_opt(Relation relation, Buffer buffer)
 {
 	Page		page = BufferGetPage(buffer);
+	TransactionId prune_xid;
+	GlobalVisState *vistest;
+	TransactionId limited_xmin = InvalidTransactionId;
+	TimestampTz limited_ts = 0;
 	Size		minfree;
-	TransactionId OldestXmin;
 
 	/*
 	 * We can't write WAL in recovery mode, so there's no point trying to
@@ -85,37 +102,55 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		return;
 
 	/*
-	 * Use the appropriate xmin horizon for this relation. If it's a proper
-	 * catalog relation or a user defined, additional, catalog relation, we
-	 * need to use the horizon that includes slots, otherwise the data-only
-	 * horizon can be used. Note that the toast relation of user defined
-	 * relations are *not* considered catalog relations.
+	 * XXX: Magic to make old_snapshot_threshold tests appear "working". They
+	 * currently are broken, and discussion of what to do about them is
+	 * ongoing. See
+	 * https://www.postgresql.org/message-id/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de
+	 */
+	if (old_snapshot_threshold == 0)
+		SnapshotTooOldMagicForTest();
+
+	/*
+	 * First check whether there's any chance there's something to prune;
+	 * determining the appropriate horizon is a waste if there's no prune_xid
+	 * (i.e. no updates/deletes left potentially dead tuples around).
+	 */
+	prune_xid = ((PageHeader) page)->pd_prune_xid;
+	if (!TransactionIdIsValid(prune_xid))
+		return;
+
+	/*
+	 * Check whether prune_xid indicates that there may be dead rows that can
+	 * be cleaned up.
 	 *
-	 * It is OK to apply the old snapshot limit before acquiring the cleanup
+	 * It is OK to check the old snapshot limit before acquiring the cleanup
 	 * lock because the worst that can happen is that we are not quite as
 	 * aggressive about the cleanup (by however many transaction IDs are
 	 * consumed between this point and acquiring the lock).  This allows us to
 	 * save significant overhead in the case where the page is found not to be
 	 * prunable.
-	 */
-	if (IsCatalogRelation(relation) ||
-		RelationIsAccessibleInLogicalDecoding(relation))
-		OldestXmin = RecentGlobalXmin;
-	else
-		OldestXmin =
-			TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
-												relation);
-
-	Assert(TransactionIdIsValid(OldestXmin));
-
-	/*
-	 * Let's see if we really need pruning.
 	 *
-	 * Forget it if page is not hinted to contain something prunable that's
-	 * older than OldestXmin.
+	 * Even if old_snapshot_threshold is set, we first check whether the page
+	 * can be pruned without it, both because
+	 * TransactionIdLimitedForOldSnapshots() is not cheap, and because not
+	 * unnecessarily relying on old_snapshot_threshold avoids causing
+	 * conflicts.
 	 */
-	if (!PageIsPrunable(page, OldestXmin))
-		return;
+	vistest = GlobalVisTestFor(relation);
+
+	if (!GlobalVisTestIsRemovableXid(vistest, prune_xid))
+	{
+		if (!OldSnapshotThresholdActive())
+			return;
+
+		if (!TransactionIdLimitedForOldSnapshots(GlobalVisTestNonRemovableHorizon(vistest),
+												 relation,
+												 &limited_xmin, &limited_ts))
+			return;
+
+		if (!TransactionIdPrecedes(prune_xid, limited_xmin))
+			return;
+	}
 
 	/*
 	 * We prune when a previous UPDATE failed to find enough space on the page
@@ -151,7 +186,9 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 															 * needed */
 
 			/* OK to prune */
-			(void) heap_page_prune(relation, buffer, OldestXmin, true, &ignore);
+			(void) heap_page_prune(relation, buffer, vistest,
+								   limited_xmin, limited_ts,
+								   true, &ignore);
 		}
 
 		/* And release buffer lock */
@@ -165,8 +202,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * Caller must have pin and buffer cleanup lock on the page.
  *
- * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
- * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ * vistest is used to distinguish whether tuples are DEAD or RECENTLY_DEAD
+ * (see heap_prune_satisfies_vacuum and
+ * HeapTupleSatisfiesVacuum). old_snap_xmin / old_snap_ts need to
+ * either have been set by TransactionIdLimitedForOldSnapshots, or
+ * InvalidTransactionId/0 respectively.
  *
  * If report_stats is true then we send the number of reclaimed heap-only
  * tuples to pgstats.  (This must be false during vacuum, since vacuum will
@@ -177,7 +217,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * latestRemovedXid.
  */
 int
-heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
+heap_page_prune(Relation relation, Buffer buffer,
+				GlobalVisState *vistest,
+				TransactionId old_snap_xmin,
+				TimestampTz old_snap_ts,
 				bool report_stats, TransactionId *latestRemovedXid)
 {
 	int			ndeleted = 0;
@@ -198,6 +241,11 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 	 * initialize the rest of our working state.
 	 */
 	prstate.new_prune_xid = InvalidTransactionId;
+	prstate.rel = relation;
+	prstate.vistest = vistest;
+	prstate.old_snap_xmin = old_snap_xmin;
+	prstate.old_snap_ts = old_snap_ts;
+	prstate.old_snap_used = false;
 	prstate.latestRemovedXid = *latestRemovedXid;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
@@ -220,9 +268,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 			continue;
 
 		/* Process this item or chain of items */
-		ndeleted += heap_prune_chain(relation, buffer, offnum,
-									 OldestXmin,
-									 &prstate);
+		ndeleted += heap_prune_chain(buffer, offnum, &prstate);
 	}
 
 	/* Any error while applying the changes is critical */
@@ -323,6 +369,85 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 }
 
 
+/*
+ * Perform visibility checks for heap pruning.
+ *
+ * This is more complicated than just using GlobalVisTestIsRemovableXid()
+ * because of old_snapshot_threshold. We only want to increase the threshold
+ * that triggers errors for old snapshots when we actually decide to remove a
+ * row based on the limited horizon.
+ *
+ * Due to its cost we also only want to call
+ * TransactionIdLimitedForOldSnapshots() if necessary, i.e. we might not have
+ * done so in heap_page_prune_opt() if pd_prune_xid was old enough. But we
+ * still want to be able to remove rows that are too new to be removed
+ * according to prstate->vistest, but that can be removed based on
+ * old_snapshot_threshold. So we call TransactionIdLimitedForOldSnapshots() on
+ * demand in here, if appropriate.
+ */
+static HTSV_Result
+heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
+{
+	HTSV_Result res;
+	TransactionId dead_after;
+
+	res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);
+
+	if (res != HEAPTUPLE_RECENTLY_DEAD)
+		return res;
+
+	/*
+	 * If we are already relying on the limited xmin, there is no need to
+	 * delay doing so anymore.
+	 */
+	if (prstate->old_snap_used)
+	{
+		Assert(TransactionIdIsValid(prstate->old_snap_xmin));
+
+		if (TransactionIdPrecedes(dead_after, prstate->old_snap_xmin))
+			res = HEAPTUPLE_DEAD;
+		return res;
+	}
+
+	/*
+	 * First check if GlobalVisTestIsRemovableXid() is sufficient to find the
+	 * row dead. If not, and old_snapshot_threshold is enabled, try to use the
+	 * lowered horizon.
+	 */
+	if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
+		res = HEAPTUPLE_DEAD;
+	else if (OldSnapshotThresholdActive())
+	{
+		/* haven't determined the limited horizon yet, request it */
+		if (!TransactionIdIsValid(prstate->old_snap_xmin))
+		{
+			TransactionId horizon =
+			GlobalVisTestNonRemovableHorizon(prstate->vistest);
+
+			TransactionIdLimitedForOldSnapshots(horizon, prstate->rel,
+												&prstate->old_snap_xmin,
+												&prstate->old_snap_ts);
+		}
+
+		if (TransactionIdIsValid(prstate->old_snap_xmin) &&
+			TransactionIdPrecedes(dead_after, prstate->old_snap_xmin))
+		{
+			/*
+			 * About to remove row based on snapshot_too_old. Need to raise
+			 * the threshold so problematic accesses would error.
+			 */
+			Assert(!prstate->old_snap_used);
+			SetOldSnapshotThresholdTimestamp(prstate->old_snap_ts,
+											 prstate->old_snap_xmin);
+			prstate->old_snap_used = true;
+			res = HEAPTUPLE_DEAD;
+		}
+	}
+
+	return res;
+}
+
+
 /*
  * Prune specified line pointer or a HOT chain originating at line pointer.
  *
@@ -349,9 +474,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  * Returns the number of tuples (to be) deleted from the page.
  */
 static int
-heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
-				 TransactionId OldestXmin,
-				 PruneState *prstate)
+heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 {
 	int			ndeleted = 0;
 	Page		dp = (Page) BufferGetPage(buffer);
@@ -366,7 +489,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 				i;
 	HeapTupleData tup;
 
-	tup.t_tableOid = RelationGetRelid(relation);
+	tup.t_tableOid = RelationGetRelid(prstate->rel);
 
 	rootlp = PageGetItemId(dp, rootoffnum);
 
@@ -401,7 +524,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 			 * either here or while following a chain below.  Whichever path
 			 * gets there first will mark the tuple unused.
 			 */
-			if (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer)
+			if (heap_prune_satisfies_vacuum(prstate, &tup, buffer)
 				== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
 			{
 				heap_prune_record_unused(prstate, rootoffnum);
@@ -485,7 +608,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 		 */
 		tupdead = recent_dead = false;
 
-		switch (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer))
+		switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer))
 		{
 			case HEAPTUPLE_DEAD:
 				tupdead = true;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3382d37a40..22e86a391b4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -780,6 +780,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	GlobalVisState *vistest;
 
 	pg_rusage_init(&ru0);
 
@@ -808,6 +809,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->nonempty_pages = 0;
 	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
+	vistest = GlobalVisTestFor(onerel);
+
 	/*
 	 * Initialize the state for a parallel vacuum.  As of now, only one worker
 	 * can be used for an index, so we invoke parallelism only if there are at
@@ -1231,7 +1234,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 *
 		 * We count tuples removed by the pruning step as removed by VACUUM.
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
+		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
+										 InvalidTransactionId, 0, false,
 										 &vacrelstats->latestRemovedXid);
 
 		/*
@@ -1588,14 +1592,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		}
 
 		/*
-		 * It's possible for the value returned by GetOldestXmin() to move
-		 * backwards, so it's not wrong for us to see tuples that appear to
-		 * not be visible to everyone yet, while PD_ALL_VISIBLE is already
-		 * set. The real safe xmin value never moves backwards, but
-		 * GetOldestXmin() is conservative and sometimes returns a value
-		 * that's unnecessarily small, so if we see that contradiction it just
-		 * means that the tuples that we think are not visible to everyone yet
-		 * actually are, and the PD_ALL_VISIBLE flag is correct.
+		 * It's possible for the value returned by
+		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+		 * wrong for us to see tuples that appear to not be visible to
+		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
+		 * xmin value never moves backwards, but
+		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
+		 * returns a value that's unnecessarily small, so if we see that
+		 * contradiction it just means that the tuples that we think are not
+		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
+		 * is correct.
 		 *
 		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
 		 * set, however.
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index a3f77169a79..c6276e6fe86 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -519,7 +519,8 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amgettuple);
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
 
 	/*
 	 * The AM's amgettuple proc finds the next index entry matching the scan
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 2d0f8f4b79a..42993f131db 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -336,9 +336,9 @@ snapshots and registered snapshots as of the deletion are gone; which is
 overly strong, but is simple to implement within Postgres.  When marked
 dead, a deleted page is labeled with the next-transaction counter value.
 VACUUM can reclaim the page for re-use when this transaction number is
-older than RecentGlobalXmin.  As collateral damage, this implementation
-also waits for running XIDs with no snapshots and for snapshots taken
-until the next transaction to allocate an XID commits.
+guaranteed to be "visible to everyone".  As collateral damage, this
+implementation also waits for running XIDs with no snapshots and for
+snapshots taken until the next transaction to allocate an XID commits.
 
 Reclaiming a page doesn't actually change its state on disk --- we simply
 record it in the shared-memory free space map, from which it will be
@@ -405,8 +405,8 @@ page and also the correct place to hold the current value. We can avoid
 the cost of walking down the tree in such common cases.
 
 The optimization works on the assumption that there can only be one
-non-ignorable leaf rightmost page, and so even a RecentGlobalXmin style
-interlock isn't required.  We cannot fail to detect that our hint was
+non-ignorable leaf rightmost page, and so not even a visible-to-everyone
+style interlock is required.  We cannot fail to detect that our hint was
 invalidated, because there can only be one such page in the B-Tree at
 any time. It's possible that the page will be deleted and recycled
 without a backend's cached page also being detected as invalidated, but
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 39b8f17f4b5..37e8c97b0c9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,7 +983,7 @@ _bt_page_recyclable(Page page)
 	 */
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	if (P_ISDELETED(opaque) &&
-		TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
+		GlobalVisCheckRemovableXid(NULL, opaque->btpo.xact))
 		return true;
 	return false;
 }
@@ -2186,8 +2186,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	 * updated links to the target, ReadNewTransactionId() suffices as an
 	 * upper bound.  Any scan having retained a now-stale link is advertising
 	 * in its PGXACT an xmin less than or equal to the value we read here.  It
-	 * will continue to do so, holding back RecentGlobalXmin, for the duration
-	 * of that scan.
+	 * will continue to do so, holding back the xmin horizon, for the duration
+	 * of that scan.
 	 */
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 36294789f3f..9afab9e2111 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -802,6 +802,12 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
+	/*
+	 * XXX: If IndexVacuumInfo contained the heap relation, we could be more
+	 * aggressive about vacuuming non-catalog relations by passing the table
+	 * to GlobalVisCheckRemovableXid().
+	 */
+
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
@@ -811,12 +817,11 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		result = true;
 	}
 	else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
-			 TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
-								   RecentGlobalXmin))
+			 GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
 	{
 		/*
-		 * If oldest btpo.xact in the deleted pages is older than
-		 * RecentGlobalXmin, then at least one deleted page can be recycled.
+		 * If oldest btpo.xact in the deleted pages is visible to everyone,
+		 * then at least one deleted page can be recycled.
 		 */
 		result = true;
 	}
@@ -1227,14 +1232,13 @@ restart:
 				 * own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
-				 * before it finishes could have a RecentGlobalXmin with a
-				 * later xid than the VACUUM's OldestXmin cutoff.  These
-				 * backends might happen to opportunistically mark some index
-				 * tuples LP_DEAD before we reach them, even though they may
-				 * be after our cutoff.  We don't try to kill these "extra"
-				 * index tuples in _bt_delitems_vacuum().  This keep things
-				 * simple, and allows us to always avoid generating our own
-				 * conflicts.
+				 * before it finishes could have a visibility cutoff with a
+				 * later xid than VACUUM's OldestXmin cutoff.  These backends
+				 * might happen to opportunistically mark some index tuples
+				 * LP_DEAD before we reach them, even though they may be after
+				 * our cutoff.  We don't try to kill these "extra" index
+				 * tuples in _bt_delitems_vacuum().  This keeps things simple,
+				 * and allows us to always avoid generating our own conflicts.
 				 */
 				Assert(!BTreeTupleIsPivot(itup));
 				if (!BTreeTupleIsPosting(itup))
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 99d0914e724..c6b6a723dc9 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -926,11 +926,11 @@ btree_xlog_reuse_page(XLogReaderState *record)
 	 * Btree reuse_page records exist to provide a conflict point when we
 	 * reuse pages in the index via the FSM.  That's all they do though.
 	 *
-	 * latestRemovedXid was the page's btpo.xact.  The btpo.xact <
-	 * RecentGlobalXmin test in _bt_page_recyclable() conceptually mirrors the
-	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
-	 * Consequently, one XID value achieves the same exclusion effect on
-	 * master and standby.
+	 * latestRemovedXid was the page's btpo.xact.  The
+	 * GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
+	 * mirrors the pgxact->xmin > limitXmin test in
+	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
+	 * same exclusion effect on master and standby.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index bd98707f3c0..e1c58933f97 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -501,10 +501,14 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemToPlaceholder[MaxIndexTuplesPerPage];
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
+	GlobalVisState *vistest;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
 
+	/* XXX: providing heap relation would allow more pruning */
+	vistest = GlobalVisTestFor(NULL);
+
 	START_CRIT_SECTION();
 
 	/*
@@ -521,7 +525,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 		dt = (SpGistDeadTuple) PageGetItem(page, PageGetItemId(page, i));
 
 		if (dt->tupstate == SPGIST_REDIRECT &&
-			TransactionIdPrecedes(dt->xid, RecentGlobalXmin))
+			GlobalVisTestIsRemovableXid(vistest, dt->xid))
 		{
 			dt->tupstate = SPGIST_PLACEHOLDER;
 			Assert(opaque->nRedirection > 0);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eb9aac5fd39..4e2178dabab 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -257,31 +257,31 @@ simultaneously, we have one backend take ProcArrayLock and clear the XIDs
 of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
-latestCompletedXid variable.  This allows GetSnapshotData to use
-latestCompletedXid + 1 as xmax for its snapshot: there can be no
+latestCompletedFullXid variable.  This allows GetSnapshotData to use
+latestCompletedFullXid + 1 as xmax for its snapshot: there can be no
 transaction >= this xid value that the snapshot needs to consider as
 completed.
 
 In short, then, the rule is that no transaction may exit the set of
-currently-running transactions between the time we fetch latestCompletedXid
+currently-running transactions between the time we fetch latestCompletedFullXid
 and the time we finish building our snapshot.  However, this restriction
 only applies to transactions that have an XID --- read-only transactions
 can end without acquiring ProcArrayLock, since they don't affect anyone
-else's snapshot nor latestCompletedXid.
+else's snapshot nor latestCompletedFullXid.
 
 Transaction start, per se, doesn't have any interlocking with these
 considerations, since we no longer assign an XID immediately at transaction
 start.  But when we do decide to allocate an XID, GetNewTransactionId must
 store the new XID into the shared ProcArray before releasing XidGenLock.
-This ensures that all top-level XIDs <= latestCompletedXid are either
+This ensures that all top-level XIDs <= latestCompletedFullXid are either
 present in the ProcArray, or not running anymore.  (This guarantee doesn't
 apply to subtransaction XIDs, because of the possibility that there's not
 room for them in the subxid array; instead we guarantee that they are
 present or the overflow flag is set.)  If a backend released XidGenLock
 before storing its XID into MyPgXact, then it would be possible for another
-backend to allocate and commit a later XID, causing latestCompletedXid to
+backend to allocate and commit a later XID, causing latestCompletedFullXid to
 pass the first backend's XID, before that value became visible in the
-ProcArray.  That would break GetOldestXmin, as discussed below.
+ProcArray.  That would break ComputeXidHorizons, as discussed below.
 
 We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
 subxid array) without taking ProcArrayLock.  This was once necessary to
@@ -293,42 +293,50 @@ once, rather than assume they can read it multiple times and get the same
 answer each time.  (Use volatile-qualified pointers when doing this, to
 ensure that the C compiler does exactly what you tell it to.)
 
-Another important activity that uses the shared ProcArray is GetOldestXmin,
-which must determine a lower bound for the oldest xmin of any active MVCC
-snapshot, system-wide.  Each individual backend advertises the smallest
-xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
-live snapshots (eg, if it's between transactions or hasn't yet set a
-snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
-valid xmin fields.  It does this with only shared lock on ProcArrayLock,
-which means there is a potential race condition against other backends
-doing GetSnapshotData concurrently: we must be certain that a concurrent
-backend that is about to set its xmin does not compute an xmin less than
-what GetOldestXmin returns.  We ensure that by including all the active
-XIDs into the MIN() calculation, along with the valid xmins.  The rule that
-transactions can't exit without taking exclusive ProcArrayLock ensures that
-concurrent holders of shared ProcArrayLock will compute the same minimum of
-currently-active XIDs: no xact, in particular not the oldest, can exit
-while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
-active XID will be the same as that of any concurrent GetSnapshotData, and
-so it can't produce an overestimate.  If there is no active transaction at
-all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
-for the xmin that might be computed by concurrent or later GetSnapshotData
-calls.  (We know that no XID less than this could be about to appear in
-the ProcArray, because of the XidGenLock interlock discussed above.)
+Another important activity that uses the shared ProcArray is
+ComputeXidHorizons, which must determine a lower bound for the oldest xmin
+of any active MVCC snapshot, system-wide.  Each individual backend
+advertises the smallest xmin of its own snapshots in MyPgXact->xmin, or zero
+if it currently has no live snapshots (eg, if it's between transactions or
+hasn't yet set a snapshot for a new transaction).  ComputeXidHorizons takes
+the MIN() of the valid xmin fields.  It does this with only shared lock on
+ProcArrayLock, which means there is a potential race condition against other
+backends doing GetSnapshotData concurrently: we must be certain that a
+concurrent backend that is about to set its xmin does not compute an xmin
+less than what ComputeXidHorizons determines.  We ensure that by including
+all the active XIDs into the MIN() calculation, along with the valid xmins.
+The rule that transactions can't exit without taking exclusive ProcArrayLock
+ensures that concurrent holders of shared ProcArrayLock will compute the
+same minimum of currently-active XIDs: no xact, in particular not the
+oldest, can exit while we hold shared ProcArrayLock.  So
+ComputeXidHorizons's view of the minimum active XID will be the same as that
+of any concurrent GetSnapshotData, and so it can't produce an overestimate.
+If there is no active transaction at all, ComputeXidHorizons uses
+latestCompletedFullXid + 1, which is a lower bound for the xmin that might
+be computed by concurrent or later GetSnapshotData calls.  (We know that no
+XID less than this could be about to appear in the ProcArray, because of the
+XidGenLock interlock discussed above.)
 
-GetSnapshotData also performs an oldest-xmin calculation (which had better
-match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
-for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
-too expensive.  Note that while it is certain that two concurrent
-executions of GetSnapshotData will compute the same xmin for their own
-snapshots, as argued above, it is not certain that they will arrive at the
-same estimate of RecentGlobalXmin.  This is because we allow XID-less
-transactions to clear their MyPgXact->xmin asynchronously (without taking
-ProcArrayLock), so one execution might see what had been the oldest xmin,
-and another not.  This is OK since RecentGlobalXmin need only be a valid
-lower bound.  As noted above, we are already assuming that fetch/store
-of the xid fields is atomic, so assuming it for xmin as well is no extra
-risk.
+As GetSnapshotData is performance critical, it does not perform an accurate
+oldest-xmin calculation (it used to, until v13). The contents of a snapshot
+only depend on the xids of other backends, not their xmin. As a backend's
+xmin changes much more often than its xid, having GetSnapshotData look at
+xmins can lead to a lot of unnecessary cacheline ping-pong.  Instead
+GetSnapshotData updates approximate thresholds (one that guarantees that all
+deleted rows older than it can be removed, another determining that deleted
+rows newer than it can not be removed). GlobalVisTest* uses those thresholds
+to make invisibility decisions, falling back to ComputeXidHorizons if
+necessary.
+
+Note that while it is certain that two concurrent executions of
+GetSnapshotData will compute the same xmin for their own snapshots, there is
+no such guarantee for the horizons computed by ComputeXidHorizons.  This is
+because we allow XID-less transactions to clear their MyPgXact->xmin
+asynchronously (without taking ProcArrayLock), so one execution might see
+what had been the oldest xmin, and another not.  This is OK since the
+thresholds need only be a valid lower bound.  As noted above, we are already
+assuming that fetch/store of the xid fields is atomic, so assuming it for
+xmin as well is no extra risk.
 
 
 pg_xact and pg_subtrans
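
[Illustration, not part of the patch.] The two-threshold scheme described in
the transam/README text above can be sketched in standalone C. This is a
minimal sketch under stated assumptions, not the patch's implementation: the
names SketchFullXid, SketchVisState, sketch_accurate_horizon,
sketch_recompute and sketch_is_removable are all hypothetical, 64-bit
counters stand in for FullTransactionId, and a global variable stands in for
whatever an accurate horizon computation would return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for FullTransactionId: a plain 64-bit counter. */
typedef uint64_t SketchFullXid;

/* Hypothetical analogue of the approximate thresholds described above. */
typedef struct SketchVisState
{
	SketchFullXid definitely_needed;	/* XIDs >= this are still needed */
	SketchFullXid maybe_needed;			/* XIDs < this are removable */
} SketchVisState;

/* Assumption: stands in for the result of an accurate horizon computation. */
static SketchFullXid sketch_accurate_horizon = 0;

/* Counts how often the expensive path was taken, for demonstration only. */
static int	sketch_recompute_count = 0;

/* Expensive path: tighten the thresholds using the accurate horizon. */
static void
sketch_recompute(SketchVisState *state)
{
	sketch_recompute_count++;

	/* Thresholds only ever move forward in this sketch. */
	if (sketch_accurate_horizon > state->maybe_needed)
		state->maybe_needed = sketch_accurate_horizon;
	if (state->definitely_needed < state->maybe_needed)
		state->definitely_needed = state->maybe_needed;
}

/* Cheap test first; recompute only for XIDs between the two thresholds. */
static bool
sketch_is_removable(SketchVisState *state, SketchFullXid xid)
{
	assert(state->maybe_needed <= state->definitely_needed);

	if (xid < state->maybe_needed)
		return true;			/* definitely removable */
	if (xid >= state->definitely_needed)
		return false;			/* definitely still needed */

	/* Uncertain range: pay for an accurate answer. */
	sketch_recompute(state);
	return xid < state->maybe_needed;
}
```

With thresholds of 50 and 100 and an accurate horizon of 80, XIDs 10 and 100
are answered without recomputation, while XID 60 triggers one recomputation
and is then found removable. The sketch omits any rate limiting of the
expensive path.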
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2570e7086a7..c12e477ecfc 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -566,3 +566,53 @@ GetNewObjectId(void)
 
 	return result;
 }
+
+
+#ifdef USE_ASSERT_CHECKING
+
+/*
+ * Assert that xid is between [oldestXid, nextFullXid], which is the range we
+ * expect XIDs coming from tables etc to be in.
+ *
+ * As ShmemVariableCache->oldestXid could change just after this call without
+ * further precautions, and as a wrapped-around xid could again fall within
+ * the valid range, this assertion can only detect if something is definitely
+ * wrong, but not establish correctness.
+ *
+ * This intentionally does not expose a return value, to avoid code being
+ * introduced that depends on the return value.
+ */
+void
+AssertTransactionInAllowableRange(TransactionId xid)
+{
+	TransactionId oldest_xid;
+	TransactionId next_xid;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* we may see bootstrap / frozen */
+	if (!TransactionIdIsNormal(xid))
+		return;
+
+	/*
+	 * We can't acquire XidGenLock, as this may be called with XidGenLock
+	 * already held (or with other locks that don't allow XidGenLock to be
+	 * nested). That's ok for our purposes though, since we already rely on
+	 * 32bit reads to be atomic. While nextFullXid is 64 bit, we only look at
+	 * the lower 32bit, so a skewed read doesn't hurt.
+	 *
+	 * There's no increased danger of falling outside [oldest, next] by
+	 * accessing them without a lock. xid needs to have been created with
+	 * GetNewTransactionId() in the originating session, and the locks there
+	 * pair with the memory barrier below.  We do however accept xid to be <=
+	 * next_xid, instead of just <, as xid could be from the procarray,
+	 * before we see the updated nextFullXid value.
+	 */
+	pg_memory_barrier();
+	oldest_xid = ShmemVariableCache->oldestXid;
+	next_xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+
+	Assert(TransactionIdFollowsOrEquals(xid, oldest_xid) &&
+		   TransactionIdPrecedesOrEquals(xid, next_xid));
+}
+#endif
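
[Illustration, not part of the patch.] The range check documented in the
varsup.c hunk above relies on wraparound-aware comparison of 32-bit XIDs, and
an xid is inside [oldest, next] only if both bounds hold. The following is a
minimal standalone sketch, assuming only normal XIDs (the special bootstrap
and frozen values are ignored). SketchXid, sketch_xid_precedes and
sketch_xid_in_range are hypothetical names; the signed-difference comparison
is the modulo-2^32 technique PostgreSQL uses for normal XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for TransactionId. */
typedef uint32_t SketchXid;

/*
 * True if a logically precedes b, modulo 2^32.  The unsigned subtraction
 * wraps, and the cast to a signed type (two's complement on all relevant
 * platforms) turns "less than half the XID space behind" into a negative
 * number.
 */
static bool
sketch_xid_precedes(SketchXid a, SketchXid b)
{
	return (int32_t) (a - b) < 0;
}

/* True if oldest <= xid <= next in the circular XID space. */
static bool
sketch_xid_in_range(SketchXid xid, SketchXid oldest, SketchXid next)
{
	/* Both bounds must hold; either one alone is almost always true. */
	return !sketch_xid_precedes(xid, oldest) &&
		!sketch_xid_precedes(next, xid);
}
```

For example, with oldest = 4294967000 and next = 100 (the counter has wrapped),
xid 5 is in range while xid 4294966000 is not.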
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c38bc1412d8..7a7d0dd31ef 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7837,10 +7837,11 @@ StartupXLOG(void)
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
 	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
-	/* also initialize latestCompletedXid, to nextXid - 1 */
+	/* also initialize latestCompletedFullXid, to nextFullXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
-	TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);
+	ShmemVariableCache->latestCompletedFullXid =
+		ShmemVariableCache->nextFullXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedFullXid);
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -9051,7 +9052,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9411,7 +9412,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c816..34b71b6c1c5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1056,7 +1056,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
 
 	/* Prepare for sampling block numbers */
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3a89f8fe1e2..77474b8d7d6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -957,8 +957,25 @@ vacuum_set_xid_limits(Relation rel,
 	 * working on a particular table at any time, and that each vacuum is
 	 * always an independent transaction.
 	 */
-	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+	*oldestXmin = GetOldestNonRemovableTransactionId(rel);
+
+	if (OldSnapshotThresholdActive())
+	{
+		TransactionId limit_xmin;
+		TimestampTz limit_ts;
+
+		if (TransactionIdLimitedForOldSnapshots(*oldestXmin, rel, &limit_xmin, &limit_ts))
+		{
+			/*
+			 * TODO: We should only set the threshold if we are pruning on the
+			 * basis of the increased limits. Not as crucial here as it is for
+			 * opportunistic pruning (which often happens at a much higher
+			 * frequency), but would still be a significant improvement.
+			 */
+			SetOldSnapshotThresholdTimestamp(limit_ts, limit_xmin);
+			*oldestXmin = limit_xmin;
+		}
+	}
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -1347,12 +1364,13 @@ vac_update_datfrozenxid(void)
 	bool		dirty = false;
 
 	/*
-	 * Initialize the "min" calculation with GetOldestXmin, which is a
-	 * reasonable approximation to the minimum relfrozenxid for not-yet-
-	 * committed pg_class entries for new tables; see AddNewRelationTuple().
-	 * So we cannot produce a wrong minimum by starting with this.
+	 * Initialize the "min" calculation with
+	 * GetOldestNonRemovableTransactionId(), which is a reasonable
+	 * approximation to the minimum relfrozenxid for not-yet-committed
+	 * pg_class entries for new tables; see AddNewRelationTuple().  So we
+	 * cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+	newFrozenXid = GetOldestNonRemovableTransactionId(NULL);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
@@ -1683,8 +1701,9 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	StartTransactionCommand();
 
 	/*
-	 * Functions in indexes may want a snapshot set.  Also, setting a snapshot
-	 * ensures that RecentGlobalXmin is kept truly recent.
+	 * Need to acquire a snapshot to prevent pg_subtrans from being truncated,
+	 * cutoff xids in local memory wrapping around, and to have updated xmin
+	 * horizons.
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 
@@ -1707,8 +1726,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 *
 		 * Note: these flags remain set until CommitTransaction or
 		 * AbortTransaction.  We don't want to clear them until we reset
-		 * MyPgXact->xid/xmin, else OldestXmin might appear to go backwards,
-		 * which is probably Not Good.
+		 * MyPgXact->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
+		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e97ffab27d..df1af9354ce 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1878,6 +1878,10 @@ get_database_list(void)
 	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 	 * for anything that reads heap pages, because HOT may decide to prune
 	 * them even if the process doesn't attempt to modify any tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	StartTransactionCommand();
 	(void) GetTransactionSnapshot();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e9871..158b2f3d73b 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -122,6 +122,10 @@ get_subscription_list(void)
 	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 	 * for anything that reads heap pages, because HOT may decide to prune
 	 * them even if the process doesn't attempt to modify any tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	StartTransactionCommand();
 	(void) GetTransactionSnapshot();
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d69fb90132d..86336a10da5 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1181,22 +1181,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 */
 	if (hot_standby_feedback)
 	{
-		TransactionId slot_xmin;
-
-		/*
-		 * Usually GetOldestXmin() would include both global replication slot
-		 * xmin and catalog_xmin in its calculations, but we want to derive
-		 * separate values for each of those. So we ask for an xmin that
-		 * excludes the catalog_xmin.
-		 */
-		xmin = GetOldestXmin(NULL,
-							 PROCARRAY_FLAGS_DEFAULT | PROCARRAY_SLOTS_XMIN);
-
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
-
-		if (TransactionIdIsValid(slot_xmin) &&
-			TransactionIdPrecedes(slot_xmin, xmin))
-			xmin = slot_xmin;
+		GetReplicationHorizons(&xmin, &catalog_xmin);
 	}
 	else
 	{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 122d884f3e4..d8989762d74 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2097,9 +2097,10 @@ ProcessStandbyHSFeedbackMessage(void)
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
-	 * the xmin will be taken into account by GetOldestXmin.  This will hold
-	 * back the removal of dead rows and thereby prevent the generation of
-	 * cleanup conflicts on the standby server.
+	 * the xmin will be taken into account by GetSnapshotData() /
+	 * ComputeXidHorizons().  This will hold back the removal of dead rows and
+	 * thereby prevent the generation of cleanup conflicts on the standby
+	 * server.
 	 *
 	 * There is a small window for a race condition here: although we just
 	 * checked that feedbackXmin precedes nextXid, the nextXid could have
@@ -2112,10 +2113,10 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * own xmin would prevent nextXid from advancing so far.
 	 *
 	 * We don't bother taking the ProcArrayLock here.  Setting the xmin field
-	 * is assumed atomic, and there's no real need to prevent a concurrent
-	 * GetOldestXmin.  (If we're moving our xmin forward, this is obviously
-	 * safe, and if we're moving it backwards, well, the data is at risk
-	 * already since a VACUUM could have just finished calling GetOldestXmin.)
+	 * is assumed atomic, and there's no real need to prevent concurrent
+	 * horizon determinations.  (If we're moving our xmin forward, this is
+	 * obviously safe, and if we're moving it backwards, well, the data is at
+	 * risk already since a VACUUM could already have determined the horizon.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
 	 * otherwise via the walsender's PGXACT entry. We can only track the
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 363000670b2..58f119f9895 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -99,6 +99,143 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+/*
+ * State for the GlobalVisTest* family of functions. Those functions can
+ * e.g. be used to decide if a deleted row can be removed without violating
+ * MVCC semantics: If the deleted row's xmax is not considered to be running
+ * by anyone, the row can be removed.
+ *
+
+ * To avoid slowing down GetSnapshotData(), we don't calculate a precise
+ * cutoff XID while building a snapshot (looking at the frequently changing
+ * xmins scales badly). Instead we compute two boundaries while building the
+ * snapshot:
+ *
+ * 1) definitely_needed, indicating that rows deleted by XIDs >=
+ *    definitely_needed are definitely still visible.
+ *
+ * 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
+ *    definitely be removed
+ *
+ * When testing an XID that falls in between the two (i.e. XID >= maybe_needed
+ * && XID < definitely_needed), the boundaries can be recomputed (using
+ * ComputeXidHorizons()) to get a more accurate answer. This is cheaper than
+ * maintaining an accurate value all the time.
+ *
+ * As it is not cheap to compute accurate boundaries, we limit the number of
+ * times that happens in short succession. See GlobalVisTestShouldUpdate().
+ *
+ *
+ * There are three backend lifetime instances of this struct, optimized for
+ * different types of relations. As e.g. a normal user defined table in one
+ * database is inaccessible to backends connected to another database, a test
+ * specific to a relation can be more aggressive than a test for a shared
+ * relation.  Currently we track three different states:
+ *
+ * 1) GlobalVisSharedRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in any database, nor a
+ *    replication slot's xmin, nor a replication slot's catalog_xmin might
+ *    still consider XID as running.
+ *
+ * 2) GlobalVisCatalogRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in the current
+ *    database, nor a replication slot's xmin, nor a replication slot's
+ *    catalog_xmin might still consider XID as running.
+ *
+ *    I.e. the difference to GlobalVisSharedRels is that
+ *    snapshots in other databases are ignored.
+ *
+ * 3) GlobalVisDataRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in the current
+ *    database, nor a replication slot's xmin consider XID as running.
+ *
+ *    I.e. the difference to GlobalVisCatalogRels is that
+ *    replication slot's catalog_xmin is not taken into account.
+ *
+ * GlobalVisTestFor(relation) returns the appropriate state
+ * for the relation.
+ *
+ * The boundaries are FullTransactionIds instead of TransactionIds to avoid
+ * wraparound dangers. Otherwise there would e.g. exist no procarray state
+ * to prevent maybe_needed from becoming old enough to wrap around after the
+ * GetSnapshotData() call.
+ *
+ * The typedef is in the header.
+ */
+struct GlobalVisState
+{
+	/* XIDs >= are considered running by some backend */
+	FullTransactionId definitely_needed;
+
+	/* XIDs < are not considered to be running by any backend */
+	FullTransactionId maybe_needed;
+};
+
+/*
+ * Result of ComputeXidHorizons().
+ */
+typedef struct ComputeXidHorizonsResult
+{
+	/*
+	 * The value of ShmemVariableCache->latestCompletedFullXid when
+	 * ComputeXidHorizons() held ProcArrayLock.
+	 */
+	FullTransactionId latest_completed;
+
+	/*
+	 * The same for procArray->replication_slot_xmin and
+	 * procArray->replication_slot_catalog_xmin.
+	 */
+	TransactionId slot_xmin;
+	TransactionId slot_catalog_xmin;
+
+	/*
+	 * Oldest xid that any backend might still consider running. This needs to
+	 * include processes running VACUUM, in contrast to the normal visibility
+	 * cutoffs, as vacuum needs to be able to perform pg_subtrans lookups when
+	 * determining visibility, but doesn't care about rows above its xmin
+	 * being removed.
+	 *
+	 * This likely should only be needed to determine whether pg_subtrans can
+	 * be truncated. It currently includes the effects of replication slots,
+	 * for historical reasons. But that could likely be changed.
+	 */
+	TransactionId oldest_considered_running;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in shared
+	 * tables.
+	 *
+	 * This includes the effects of replication slots. If that's not desired,
+	 * look at shared_oldest_nonremovable_raw.
+	 */
+	TransactionId shared_oldest_nonremovable;
+
+	/*
+	 * Oldest xid that may be necessary to retain in shared tables. This is
+	 * the same as shared_oldest_nonremovable, except that it is not affected
+	 * by replication slot's catalog_xmin.
+	 *
+	 * This is mainly useful to be able to send the catalog_xmin to upstream
+	 * streaming replication servers via hot_standby_feedback, so they can
+	 * apply the limit only when accessing catalog tables.
+	 */
+	TransactionId shared_oldest_nonremovable_raw;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in non-shared
+	 * catalog tables.
+	 */
+	TransactionId catalog_oldest_nonremovable;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in normal user
+	 * defined tables.
+	 */
+	TransactionId data_oldest_nonremovable;
+} ComputeXidHorizonsResult;
+
+
 static ProcArrayStruct *procArray;
 
 static PGPROC *allProcs;
@@ -118,6 +255,22 @@ static TransactionId latestObservedXid = InvalidTransactionId;
  */
 static TransactionId standbySnapshotPendingXmin;
 
+/*
+ * State for visibility checks on different types of relations. See struct
+ * GlobalVisState for details. As shared, catalog, and user defined
+ * relations can have different horizons, one such state exists for each.
+ */
+static GlobalVisState GlobalVisSharedRels;
+static GlobalVisState GlobalVisCatalogRels;
+static GlobalVisState GlobalVisDataRels;
+
+/*
+ * This backend's RecentXmin at the last time the accurate xmin horizon was
+ * recomputed, or InvalidTransactionId if it has not. Used to limit how many
+ * times accurate horizons are recomputed. See GlobalVisTestShouldUpdate().
+ */
+static TransactionId ComputeXidHorizonsResultLastXmin;
+
 #ifdef XIDCACHE_DEBUG
 
 /* counters for XidCache measurement */
@@ -175,6 +328,10 @@ static void KnownAssignedXidsReset(void);
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc,
 												   PGXACT *pgxact, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
+static void MaintainLatestCompletedXid(TransactionId latestXid);
+static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
+
+static inline FullTransactionId FullXidViaRelative(FullTransactionId rel, TransactionId xid);
 
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
@@ -351,9 +508,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
 		/* Advance global latestCompletedXid while holding the lock */
-		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-								  latestXid))
-			ShmemVariableCache->latestCompletedXid = latestXid;
+		MaintainLatestCompletedXid(latestXid);
 	}
 	else
 	{
@@ -466,9 +621,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	pgxact->overflowed = false;
 
 	/* Also advance global latestCompletedXid while holding the lock */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  latestXid))
-		ShmemVariableCache->latestCompletedXid = latestXid;
+	MaintainLatestCompletedXid(latestXid);
 }
 
 /*
@@ -623,6 +776,58 @@ ProcArrayClearTransaction(PGPROC *proc)
 	pgxact->overflowed = false;
 }
 
+/*
+ * Update ShmemVariableCache->latestCompletedFullXid to point to latestXid if
+ * currently older.
+ */
+static void
+MaintainLatestCompletedXid(TransactionId latestXid)
+{
+	FullTransactionId cur_latest = ShmemVariableCache->latestCompletedFullXid;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+	Assert(FullTransactionIdIsValid(cur_latest));
+
+	if (TransactionIdPrecedes(XidFromFullTransactionId(cur_latest), latestXid))
+	{
+		ShmemVariableCache->latestCompletedFullXid =
+			FullXidViaRelative(cur_latest, latestXid);
+	}
+
+	Assert(IsBootstrapProcessingMode() ||
+		   FullTransactionIdIsNormal(ShmemVariableCache->latestCompletedFullXid));
+}
+
+/*
+ * Same as MaintainLatestCompletedXid, except for use during WAL replay.
+ */
+static void
+MaintainLatestCompletedXidRecovery(TransactionId latestXid)
+{
+	FullTransactionId cur_latest = ShmemVariableCache->latestCompletedFullXid;
+	FullTransactionId rel;
+
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	Assert(LWLockHeldByMe(ProcArrayLock));
+
+	/*
+	 * Need a FullTransactionId to compare latestXid with. Can't rely on
+	 * latestCompletedFullXid to be initialized in recovery. But in recovery
+	 * it's safe to access nextFullXid without a lock for the startup process.
+	 */
+	rel = ShmemVariableCache->nextFullXid;
+	Assert(FullTransactionIdIsValid(ShmemVariableCache->nextFullXid));
+
+	if (!FullTransactionIdIsValid(cur_latest) ||
+		TransactionIdPrecedes(XidFromFullTransactionId(cur_latest), latestXid))
+	{
+		ShmemVariableCache->latestCompletedFullXid =
+			FullXidViaRelative(rel, latestXid);
+	}
+
+	Assert(FullTransactionIdIsNormal(ShmemVariableCache->latestCompletedFullXid));
+}
+
 /*
  * ProcArrayInitRecovery -- initialize recovery xid mgmt environment
  *
@@ -843,7 +1048,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	 * Now we've got the running xids we need to set the global values that
 	 * are used to track snapshots as they evolve further.
 	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
+	 * - latestCompletedFullXid which will be the xmax for snapshots
 	 * - lastOverflowedXid which shows whether snapshots overflow
 	 * - nextXid
 	 *
@@ -869,14 +1074,11 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 
 	/*
 	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
+	 * logging the snapshot then latestCompletedFullXid may already be higher
+	 * than the value from the snapshot, so check before we use the incoming
+	 * value. It also might not yet be set at all.
 	 */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  running->latestCompletedXid))
-		ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
-
-	Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
+	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
 
 	LWLockRelease(ProcArrayLock);
 
@@ -1050,10 +1252,11 @@ TransactionIdIsInProgress(TransactionId xid)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
-	 * Now that we have the lock, we can check latestCompletedXid; if the
+	 * Now that we have the lock, we can check latestCompletedFullXid; if the
 	 * target Xid is after that, it's surely still running.
 	 */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid, xid))
+	if (TransactionIdPrecedes(XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid),
+							  xid))
 	{
 		LWLockRelease(ProcArrayLock);
 		xc_by_latest_xid_inc();
@@ -1250,159 +1453,183 @@ TransactionIdIsActive(TransactionId xid)
 
 
 /*
- * GetOldestXmin -- returns oldest transaction that was running
- *					when any current transaction was started.
+ * Determine XID horizons.
  *
- * If rel is NULL or a shared relation, all backends are considered, otherwise
- * only backends running in this database are considered.
+ * This is used by wrapper functions like GetOldestNonRemovableTransactionId()
+ * (for VACUUM), GetReplicationHorizons() (for hot_standby_feedback), etc as
+ * well as "internally" by GlobalVisUpdate() (see comment above struct
+ * GlobalVisState).
  *
- * The flags are used to ignore the backends in calculation when any of the
- * corresponding flags is set. Typically, if you want to ignore ones with
- * PROC_IN_VACUUM flag, you can use PROCARRAY_FLAGS_VACUUM.
+ * See ComputeXidHorizonsResult for the various computed horizons.
  *
- * PROCARRAY_SLOTS_XMIN causes GetOldestXmin to ignore the xmin and
- * catalog_xmin of any replication slots that exist in the system when
- * calculating the oldest xmin.
+ * For VACUUM separate horizons (used to decide which deleted tuples must
+ * be preserved) are computed for shared and non-shared tables.  For shared
+ * relations backends in all databases must be considered, but for non-shared
+ * relations that's not required, since only backends in my own database could
+ * ever see the tuples in them. Also, we can ignore concurrently running lazy
+ * VACUUMs because (a) they must be working on other tables, and (b) they
+ * don't need to do snapshot-based lookups.
  *
- * This is used by VACUUM to decide which deleted tuples must be preserved in
- * the passed in table. For shared relations backends in all databases must be
- * considered, but for non-shared relations that's not required, since only
- * backends in my own database could ever see the tuples in them. Also, we can
- * ignore concurrently running lazy VACUUMs because (a) they must be working
- * on other tables, and (b) they don't need to do snapshot-based lookups.
- *
- * This is also used to determine where to truncate pg_subtrans.  For that
- * backends in all databases have to be considered, so rel = NULL has to be
- * passed in.
+ * This also computes a horizon used to truncate pg_subtrans. For that
+ * backends in all databases have to be considered, and concurrently running
+ * lazy VACUUMs cannot be ignored, as they still may perform pg_subtrans
+ * accesses.
  *
  * Note: we include all currently running xids in the set of considered xids.
  * This ensures that if a just-started xact has not yet set its snapshot,
  * when it does set the snapshot it cannot set xmin less than what we compute.
  * See notes in src/backend/access/transam/README.
  *
- * Note: despite the above, it's possible for the calculated value to move
- * backwards on repeated calls. The calculated value is conservative, so that
- * anything older is definitely not considered as running by anyone anymore,
- * but the exact value calculated depends on a number of things. For example,
- * if rel = NULL and there are no transactions running in the current
- * database, GetOldestXmin() returns latestCompletedXid. If a transaction
+ * Note: despite the above, it's possible for the calculated values to move
+ * backwards on repeated calls. The calculated values are conservative, so
+ * that anything older is definitely not considered as running by anyone
+ * anymore, but the exact values calculated depend on a number of things. For
+ * example, if there are no transactions running in the current database, the
+ * horizon for normal tables will be latestCompletedFullXid. If a transaction
  * begins after that, its xmin will include in-progress transactions in other
  * databases that started earlier, so another call will return a lower value.
  * Nonetheless it is safe to vacuum a table in the current database with the
  * first result.  There are also replication-related effects: a walsender
  * process can set its xmin based on transactions that are no longer running
  * in the master but are still being replayed on the standby, thus possibly
- * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * unless the standby uses a replication slot to make its xmin persistent
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
- * The return value is also adjusted with vacuum_defer_cleanup_age, so
- * increasing that setting on the fly is another easy way to make
- * GetOldestXmin() move backwards, with no consequences for data integrity.
+ * making the values go backwards.  In this case there is a possibility that
+ * we lose data that the standby would like to have, but unless the standby
+ * uses a replication slot to make its xmin persistent there is little we can
+ * do about that --- data is only protected if the walsender runs continuously
+ * while queries are executed on the standby.  (The Hot Standby code deals
+ * with such cases by failing standby queries that needed to access
+ * already-removed data, so there's no integrity bug.)  The computed values
+ * are also adjusted with vacuum_defer_cleanup_age, so increasing that setting
+ * on the fly is another easy way to make horizons move backwards, with no
+ * consequences for data integrity.
  */
-TransactionId
-GetOldestXmin(Relation rel, int flags)
+static void
+ComputeXidHorizons(ComputeXidHorizonsResult *h)
 {
 	ProcArrayStruct *arrayP = procArray;
-	TransactionId result;
-	int			index;
-	bool		allDbs;
+	TransactionId kaxmin;
+	bool		in_recovery = RecoveryInProgress();
 
-	TransactionId replication_slot_xmin = InvalidTransactionId;
-	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
-
-	/*
-	 * If we're not computing a relation specific limit, or if a shared
-	 * relation has been passed in, backends in all databases have to be
-	 * considered.
-	 */
-	allDbs = rel == NULL || rel->rd_rel->relisshared;
-
-	/* Cannot look for individual databases during recovery */
-	Assert(allDbs || !RecoveryInProgress());
+	/* inferred after ProcArrayLock is released */
+	h->catalog_oldest_nonremovable = InvalidTransactionId;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	/*
-	 * We initialize the MIN() calculation with latestCompletedXid + 1. This
-	 * is a lower bound for the XIDs that might appear in the ProcArray later,
-	 * and so protects us against overestimating the result due to future
-	 * additions.
-	 */
-	result = ShmemVariableCache->latestCompletedXid;
-	Assert(TransactionIdIsNormal(result));
-	TransactionIdAdvance(result);
+	h->latest_completed = ShmemVariableCache->latestCompletedFullXid;
 
-	for (index = 0; index < arrayP->numProcs; index++)
+	/*
+	 * We initialize the MIN() calculation with latestCompletedFullXid + 1.
+	 * This is a lower bound for the XIDs that might appear in the ProcArray
+	 * later, and so protects us against overestimating the result due to
+	 * future additions.
+	 */
+	{
+		TransactionId initial;
+
+		initial = XidFromFullTransactionId(h->latest_completed);
+		Assert(TransactionIdIsValid(initial));
+		TransactionIdAdvance(initial);
+
+		h->oldest_considered_running = initial;
+		h->shared_oldest_nonremovable = initial;
+		h->data_oldest_nonremovable = initial;
+	}
+
+	/*
+	 * Fetch slot horizons while ProcArrayLock is held - the
+	 * LWLockAcquire/LWLockRelease are a barrier, ensuring this happens inside
+	 * the lock.
+	 */
+	h->slot_xmin = procArray->replication_slot_xmin;
+	h->slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+
+	for (int index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
 		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		TransactionId xid;
+		TransactionId xmin;
 
-		if (pgxact->vacuumFlags & (flags & PROCARRAY_PROC_FLAGS_MASK))
+		/* Fetch xid just once - see GetNewTransactionId */
+		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+
+		/*
+		 * Consider both the transaction's Xmin, and its Xid.
+		 *
+		 * We must check both because a transaction might have an Xmin but not
+		 * (yet) an Xid; conversely, if it has an Xid, that could determine
+		 * some not-yet-set Xmin.
+		 */
+		xmin = TransactionIdOlder(xmin, xid);
+
+		/* if neither is set, this proc doesn't influence the horizon */
+		if (!TransactionIdIsValid(xmin))
 			continue;
 
-		if (allDbs ||
+		/*
+		 * Don't ignore any procs when determining which transactions might be
+		 * considered running.  While slots should ensure logical decoding
+		 * backends are protected even without this check, it can't hurt to
+		 * include them here as well.
+		 */
+		h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running, xmin);
+
+		/*
+		 * Skip over backends either vacuuming (which is ok with rows being
+		 * removed, as long as pg_subtrans is not truncated) or doing logical
+		 * decoding (which manages xmin separately, check below).
+		 */
+		if (pgxact->vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
+			continue;
+
+		/* shared tables need to take backends in all databases into account */
+		h->shared_oldest_nonremovable =
+			TransactionIdOlder(h->shared_oldest_nonremovable, xmin);
+
+		/*
+		 * Normally queries in other databases are ignored for anything but
+		 * the shared horizon. But in recovery we cannot compute an accurate
+		 * per-database horizon as all xids are managed via the
+		 * KnownAssignedXids machinery.
+		 */
+		if (in_recovery ||
 			proc->databaseId == MyDatabaseId ||
 			proc->databaseId == 0)	/* always include WalSender */
 		{
-			/* Fetch xid just once - see GetNewTransactionId */
-			TransactionId xid = UINT32_ACCESS_ONCE(pgxact->xid);
-
-			/* First consider the transaction's own Xid, if any */
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, result))
-				result = xid;
-
-			/*
-			 * Also consider the transaction's Xmin, if set.
-			 *
-			 * We must check both Xid and Xmin because a transaction might
-			 * have an Xmin but not (yet) an Xid; conversely, if it has an
-			 * Xid, that could determine some not-yet-set Xmin.
-			 */
-			xid = UINT32_ACCESS_ONCE(pgxact->xmin);
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, result))
-				result = xid;
+			h->data_oldest_nonremovable =
+				TransactionIdOlder(h->data_oldest_nonremovable, xmin);
 		}
 	}
 
 	/*
-	 * Fetch into local variable while ProcArrayLock is held - the
-	 * LWLockRelease below is a barrier, ensuring this happens inside the
-	 * lock.
+	 * If in recovery, fetch the oldest xid in KnownAssignedXids; it will be
+	 * applied after the lock is released.
 	 */
-	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (in_recovery)
+		kaxmin = KnownAssignedXidsGetOldestXmin();
 
-	if (RecoveryInProgress())
+	/*
+	 * No other information needed, so release the lock immediately. The rest
+	 * of the computations can be done without a lock.
+	 */
+	LWLockRelease(ProcArrayLock);
+
+	if (in_recovery)
 	{
-		/*
-		 * Check to see whether KnownAssignedXids contains an xid value older
-		 * than the main procarray.
-		 */
-		TransactionId kaxmin = KnownAssignedXidsGetOldestXmin();
-
-		LWLockRelease(ProcArrayLock);
-
-		if (TransactionIdIsNormal(kaxmin) &&
-			TransactionIdPrecedes(kaxmin, result))
-			result = kaxmin;
+		h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running, kaxmin);
+		h->shared_oldest_nonremovable =
+			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
+		h->data_oldest_nonremovable =
+			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
 	}
 	else
 	{
 		/*
-		 * No other information needed, so release the lock immediately.
-		 */
-		LWLockRelease(ProcArrayLock);
-
-		/*
-		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age,
-		 * being careful not to generate a "permanent" XID.
+		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age.
 		 *
 		 * vacuum_defer_cleanup_age provides some additional "slop" for the
 		 * benefit of hot standby queries on standby servers.  This is quick
@@ -1414,34 +1641,143 @@ GetOldestXmin(Relation rel, int flags)
 		 * in varsup.c.  Also note that we intentionally don't apply
 		 * vacuum_defer_cleanup_age on standby servers.
 		 */
-		result -= vacuum_defer_cleanup_age;
-		if (!TransactionIdIsNormal(result))
-			result = FirstNormalTransactionId;
+		h->oldest_considered_running =
+			TransactionIdRetreatedBy(h->oldest_considered_running,
+									 vacuum_defer_cleanup_age);
+		h->shared_oldest_nonremovable =
+			TransactionIdRetreatedBy(h->shared_oldest_nonremovable,
+									 vacuum_defer_cleanup_age);
+		h->data_oldest_nonremovable =
+			TransactionIdRetreatedBy(h->data_oldest_nonremovable,
+									 vacuum_defer_cleanup_age);
 	}
 
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
 	 */
-	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
-		TransactionIdIsValid(replication_slot_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_xmin, result))
-		result = replication_slot_xmin;
+	h->shared_oldest_nonremovable =
+		TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
+	h->data_oldest_nonremovable =
+		TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
 
 	/*
-	 * After locks have been released and vacuum_defer_cleanup_age has been
-	 * applied, check whether we need to back up further to make logical
-	 * decoding possible. We need to do so if we're computing the global limit
-	 * (rel = NULL) or if the passed relation is a catalog relation of some
-	 * kind.
+	 * The only difference between catalog / data horizons is that the slot's
+	 * catalog xmin is applied to the catalog one (so catalogs can be accessed
+	 * for logical decoding). Initialize with data horizon, and then back up
+	 * further if necessary. Have to back up the shared horizon as well, since
+	 * that also can contain catalogs.
 	 */
-	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
-		(rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	h->shared_oldest_nonremovable_raw = h->shared_oldest_nonremovable;
+	h->shared_oldest_nonremovable =
+		TransactionIdOlder(h->shared_oldest_nonremovable,
+						   h->slot_catalog_xmin);
+	h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+	h->catalog_oldest_nonremovable =
+		TransactionIdOlder(h->catalog_oldest_nonremovable,
+						   h->slot_catalog_xmin);
 
-	return result;
+	/*
+	 * It's possible that slots / vacuum_defer_cleanup_age backed up the
+	 * horizons further than oldest_considered_running. Fix.
+	 */
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->shared_oldest_nonremovable);
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->catalog_oldest_nonremovable);
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->data_oldest_nonremovable);
+
+	/*
+	 * shared horizons have to be at least as old as the oldest visible in
+	 * current db
+	 */
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->catalog_oldest_nonremovable));
+
+	/*
+	 * Horizons need to ensure that pg_subtrans access is still possible for
+	 * the relevant backends.
+	 */
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->shared_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->catalog_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->data_oldest_nonremovable));
+	Assert(!TransactionIdIsValid(h->slot_xmin) ||
+		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->slot_xmin));
+	Assert(!TransactionIdIsValid(h->slot_catalog_xmin) ||
+		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->slot_catalog_xmin));
+}
+
+/*
+ * Return the oldest XID for which deleted tuples must be preserved in the
+ * passed table.
+ *
+ * If rel is not NULL the horizon may be considerably more recent than
+ * otherwise (i.e. more tuples will be removable). In the NULL case a horizon
+ * that is correct (but not optimal) for all relations will be returned.
+ *
+ * This is used by VACUUM to decide which deleted tuples must be preserved in
+ * the passed in table.
+ */
+TransactionId
+GetOldestNonRemovableTransactionId(Relation rel)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	/* select horizon appropriate for relation */
+	if (rel == NULL || rel->rd_rel->relisshared)
+		return horizons.shared_oldest_nonremovable;
+	else if (RelationIsAccessibleInLogicalDecoding(rel))
+		return horizons.catalog_oldest_nonremovable;
+	else
+		return horizons.data_oldest_nonremovable;
+}
+
+/*
+ * Return the oldest transaction id any currently running backend might still
+ * consider running. This should not be used for visibility / pruning
+ * determinations (see GetOldestNonRemovableTransactionId()), but for
+ * decisions like up to where pg_subtrans can be truncated.
+ */
+TransactionId
+GetOldestTransactionIdConsideredRunning(void)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	return horizons.oldest_considered_running;
+}
+
+/*
+ * Return the visibility horizons for a hot standby feedback message.
+ */
+void
+GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	/*
+	 * Don't want to use shared_oldest_nonremovable here, as that contains the
+	 * effect of replication slot's catalog_xmin. We want to send a separate
+	 * feedback for the catalog horizon, so the primary can remove data table
+	 * contents more aggressively.
+	 */
+	*xmin = horizons.shared_oldest_nonremovable_raw;
+	*catalog_xmin = horizons.slot_catalog_xmin;
 }
 
 /*
@@ -1492,12 +1828,10 @@ GetMaxSnapshotSubxidCount(void)
  *			current transaction (this is the same as MyPgXact->xmin).
  *		RecentXmin: the xmin computed for the most recent snapshot.  XIDs
  *			older than this are known not running any more.
- *		RecentGlobalXmin: the global xmin (oldest TransactionXmin across all
- *			running transactions, except those running LAZY VACUUM).  This is
- *			the same computation done by
- *			GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM).
- *		RecentGlobalDataXmin: the global xmin for non-catalog tables
- *			>= RecentGlobalXmin
+ *
+ * And try to advance the bounds of GlobalVisSharedRels,
+ * GlobalVisCatalogRels, GlobalVisDataRels for
+ * the benefit of GlobalVis*.
  *
  * Note: this function should probably not be called with an argument that's
  * not statically allocated (see xip allocation below).
@@ -1508,11 +1842,12 @@ GetSnapshotData(Snapshot snapshot)
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId xmin;
 	TransactionId xmax;
-	TransactionId globalxmin;
 	int			index;
 	int			count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
+	FullTransactionId latest_completed;
+	TransactionId oldestxid;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1556,13 +1891,16 @@ GetSnapshotData(Snapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
+	latest_completed = ShmemVariableCache->latestCompletedFullXid;
+	oldestxid = ShmemVariableCache->oldestXid;
+
 	/* xmax is always latestCompletedXid + 1 */
-	xmax = ShmemVariableCache->latestCompletedXid;
-	Assert(TransactionIdIsNormal(xmax));
+	xmax = XidFromFullTransactionId(latest_completed);
 	TransactionIdAdvance(xmax);
+	Assert(TransactionIdIsNormal(xmax));
 
 	/* initialize xmin calculation with xmax */
-	globalxmin = xmin = xmax;
+	xmin = xmax;
 
 	snapshot->takenDuringRecovery = RecoveryInProgress();
 
@@ -1591,12 +1929,6 @@ GetSnapshotData(Snapshot snapshot)
 				(PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
-			/* Update globalxmin to be the smallest valid xmin */
-			xid = UINT32_ACCESS_ONCE(pgxact->xmin);
-			if (TransactionIdIsNormal(xid) &&
-				NormalTransactionIdPrecedes(xid, globalxmin))
-				globalxmin = xid;
-
 			/* Fetch xid just once - see GetNewTransactionId */
 			xid = UINT32_ACCESS_ONCE(pgxact->xid);
 
@@ -1712,34 +2044,78 @@ GetSnapshotData(Snapshot snapshot)
 
 	LWLockRelease(ProcArrayLock);
 
-	/*
-	 * Update globalxmin to include actual process xids.  This is a slightly
-	 * different way of computing it than GetOldestXmin uses, but should give
-	 * the same result.
-	 */
-	if (TransactionIdPrecedes(xmin, globalxmin))
-		globalxmin = xmin;
+	/* maintain state for GlobalVis* */
+	{
+		TransactionId def_vis_xid;
+		TransactionId def_vis_xid_data;
+		FullTransactionId def_vis_fxid;
+		FullTransactionId def_vis_fxid_data;
+		FullTransactionId oldestfxid;
 
-	/* Update global variables too */
-	RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
-	if (!TransactionIdIsNormal(RecentGlobalXmin))
-		RecentGlobalXmin = FirstNormalTransactionId;
+		/*
+		 * Converting oldestXid is only safe when xid horizon cannot advance,
+		 * i.e. holding locks. While we don't hold the lock anymore, all the
+		 * necessary data has been gathered with lock held.
+		 */
+		oldestfxid = FullXidViaRelative(latest_completed, oldestxid);
 
-	/* Check whether there's a replication slot requiring an older xmin. */
-	if (TransactionIdIsValid(replication_slot_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
-		RecentGlobalXmin = replication_slot_xmin;
+		/* apply vacuum_defer_cleanup_age */
+		def_vis_xid_data =
+			TransactionIdRetreatedBy(xmin, vacuum_defer_cleanup_age);
 
-	/* Non-catalog tables can be vacuumed if older than this xid */
-	RecentGlobalDataXmin = RecentGlobalXmin;
+		/* Check whether there's a replication slot requiring an older xmin. */
+		def_vis_xid_data =
+			TransactionIdOlder(def_vis_xid_data, replication_slot_xmin);
 
-	/*
-	 * Check whether there's a replication slot requiring an older catalog
-	 * xmin.
-	 */
-	if (TransactionIdIsNormal(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, RecentGlobalXmin))
-		RecentGlobalXmin = replication_slot_catalog_xmin;
+		/*
+		 * Rows in non-shared, non-catalog tables possibly could be vacuumed
+		 * if older than this xid.
+		 */
+		def_vis_xid = def_vis_xid_data;
+
+		/*
+		 * Check whether there's a replication slot requiring an older catalog
+		 * xmin.
+		 */
+		def_vis_xid =
+			TransactionIdOlder(replication_slot_catalog_xmin, def_vis_xid);
+
+		def_vis_fxid = FullXidViaRelative(latest_completed, def_vis_xid);
+		def_vis_fxid_data = FullXidViaRelative(latest_completed, def_vis_xid_data);
+
+		/*
+		 * Check if we can increase upper bound. As a previous
+		 * GlobalVisUpdate() might have computed more aggressive values, don't
+		 * overwrite them if so.
+		 */
+		GlobalVisSharedRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid,
+								   GlobalVisSharedRels.definitely_needed);
+		GlobalVisCatalogRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid,
+								   GlobalVisCatalogRels.definitely_needed);
+		GlobalVisDataRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid_data,
+								   GlobalVisDataRels.definitely_needed);
+
+		/*
+		 * Check if we know that we can initialize or increase the lower
+		 * bound. Currently the only cheap way to do so is to use
+		 * ShmemVariableCache->oldestXid as input.
+		 *
+		 * We should definitely be able to do better. We could e.g. put a
+		 * global lower bound value into ShmemVariableCache.
+		 */
+		GlobalVisSharedRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisSharedRels.maybe_needed,
+								   oldestfxid);
+		GlobalVisCatalogRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
+								   oldestfxid);
+		GlobalVisDataRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
+								   oldestfxid);
+	}
 
 	RecentXmin = xmin;
 
@@ -1986,7 +2362,7 @@ GetRunningTransactionData(void)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	LWLockAcquire(XidGenLock, LW_SHARED);
 
-	latestCompletedXid = ShmemVariableCache->latestCompletedXid;
+	latestCompletedXid = XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid);
 
 	oldestRunningXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
 
@@ -3209,9 +3585,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 		elog(WARNING, "did not find subXID %u in MyProc", xid);
 
 	/* Also advance global latestCompletedXid while holding the lock */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  latestXid))
-		ShmemVariableCache->latestCompletedXid = latestXid;
+	MaintainLatestCompletedXid(latestXid);
 
 	LWLockRelease(ProcArrayLock);
 }
@@ -3238,6 +3612,276 @@ DisplayXidCache(void)
 }
 #endif							/* XIDCACHE_DEBUG */
 
+/*
+ * If rel != NULL, return test state appropriate for relation, otherwise
+ * return state usable for all relations.  The latter may consider XIDs as
+ * not-yet-visible-to-everyone that a state for a specific relation would
+ * already consider visible-to-everyone.
+ *
+ * This needs to be called while a snapshot is active or registered, otherwise
+ * there are wraparound and other dangers.
+ *
+ * See comment for GlobalVisState for details.
+ */
+GlobalVisState *
+GlobalVisTestFor(Relation rel)
+{
+	bool		need_shared;
+	bool		need_catalog;
+	GlobalVisState *state;
+
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(RecentXmin);
+
+	if (!rel)
+		need_shared = need_catalog = true;
+	else
+	{
+		/*
+		 * Other kinds currently don't contain xids, nor always the necessary
+		 * logical decoding markers.
+		 */
+		Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+			   rel->rd_rel->relkind == RELKIND_MATVIEW ||
+			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+
+		need_shared = rel->rd_rel->relisshared || RecoveryInProgress();
+		need_catalog = IsCatalogRelation(rel) || RelationIsAccessibleInLogicalDecoding(rel);
+	}
+
+	if (need_shared)
+		state = &GlobalVisSharedRels;
+	else if (need_catalog)
+		state = &GlobalVisCatalogRels;
+	else
+		state = &GlobalVisDataRels;
+
+	Assert(FullTransactionIdIsValid(state->definitely_needed) &&
+		   FullTransactionIdIsValid(state->maybe_needed));
+
+	return state;
+}
+
+/*
+ * Return true if it's worth updating the accurate maybe_needed boundary.
+ *
+ * As it is somewhat expensive to determine xmin horizons, we don't want to
+ * repeatedly do so when there is a low likelihood of it being beneficial.
+ *
+ * The current heuristic is that we update only if RecentXmin has changed
+ * since the last update. If the oldest currently running transaction has not
+ * finished, it is unlikely that recomputing the horizon would be useful.
+ */
+static bool
+GlobalVisTestShouldUpdate(GlobalVisState *state)
+{
+	/* hasn't been updated yet */
+	if (!TransactionIdIsValid(ComputeXidHorizonsResultLastXmin))
+		return true;
+
+	/*
+	 * If the maybe_needed/definitely_needed boundaries are the same, it's
+	 * unlikely to be beneficial to refresh boundaries.
+	 */
+	if (FullTransactionIdFollowsOrEquals(state->maybe_needed,
+										 state->definitely_needed))
+		return false;
+
+	/* does the last snapshot built have a different xmin? */
+	return RecentXmin != ComputeXidHorizonsResultLastXmin;
+}
+
+/*
+ * Update boundaries in GlobalVis{Shared,Catalog,Data}Rels
+ * using ComputeXidHorizons().
+ */
+static void
+GlobalVisUpdate(void)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	GlobalVisSharedRels.maybe_needed =
+		FullXidViaRelative(horizons.latest_completed,
+						   horizons.shared_oldest_nonremovable);
+	GlobalVisCatalogRels.maybe_needed =
+		FullXidViaRelative(horizons.latest_completed,
+						   horizons.catalog_oldest_nonremovable);
+	GlobalVisDataRels.maybe_needed =
+		FullXidViaRelative(horizons.latest_completed,
+						   horizons.data_oldest_nonremovable);
+
+	/*
+	 * In longer running transactions it's possible that transactions we
+	 * previously needed to treat as running aren't around anymore. So update
+	 * definitely_needed to not be earlier than maybe_needed.
+	 */
+	GlobalVisSharedRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisSharedRels.maybe_needed,
+							   GlobalVisSharedRels.definitely_needed);
+	GlobalVisCatalogRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
+							   GlobalVisCatalogRels.definitely_needed);
+	GlobalVisDataRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
+							   GlobalVisDataRels.definitely_needed);
+
+	ComputeXidHorizonsResultLastXmin = RecentXmin;
+}
+
+/*
+ * Return true if no snapshot still considers fxid to be running.
+ *
+ * The state passed needs to have been initialized for the relation fxid is
+ * from (NULL is also OK), otherwise the result may not be correct.
+ *
+ * See comment for GlobalVisState for details.
+ */
+bool
+GlobalVisTestIsRemovableFullXid(GlobalVisState *state,
+								FullTransactionId fxid)
+{
+	/*
+	 * If fxid is older than maybe_needed bound, it definitely is visible to
+	 * everyone.
+	 */
+	if (FullTransactionIdPrecedes(fxid, state->maybe_needed))
+		return true;
+
+	/*
+	 * If fxid is >= definitely_needed bound, it is very likely to still be
+	 * considered running.
+	 */
+	if (FullTransactionIdFollowsOrEquals(fxid, state->definitely_needed))
+		return false;
+
+	/*
+	 * fxid is between maybe_needed and definitely_needed, i.e. there might or
+	 * might not exist a snapshot considering fxid running. If it makes sense,
+	 * update boundaries and recheck.
+	 */
+	if (GlobalVisTestShouldUpdate(state))
+	{
+		GlobalVisUpdate();
+
+		Assert(FullTransactionIdPrecedes(fxid, state->definitely_needed));
+
+		return FullTransactionIdPrecedes(fxid, state->maybe_needed);
+	}
+	else
+		return false;
+}
+
+/*
+ * Wrapper around GlobalVisTestIsRemovableFullXid() for 32bit xids.
+ *
+ * It is crucial that this only gets called for xids from a source that
+ * protects against xid wraparounds (e.g. from a table and thus protected by
+ * relfrozenxid).
+ */
+bool
+GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid)
+{
+	FullTransactionId fxid;
+
+	/*
+	 * Convert 32 bit argument to FullTransactionId. We can do so safely
+	 * because we know the xid has to, at the very least, be between
+	 * [oldestXid, nextFullXid), i.e. within 2 billion of xid. To avoid taking
+	 * a lock to determine either, we can just compare with
+	 * state->definitely_needed, which was based on those values at the time
+	 * the current snapshot was built.
+	 */
+	fxid = FullXidViaRelative(state->definitely_needed, xid);
+
+	return GlobalVisTestIsRemovableFullXid(state, fxid);
+}
+
+/*
+ * Return FullTransactionId below which all transactions are not considered
+ * running anymore.
+ *
+ * Note: This is less efficient than testing with
+ * GlobalVisTestIsRemovableFullXid as it likely requires building an accurate
+ * cutoff, even in the case all the XIDs compared with the cutoff are outside
+ * [maybe_needed, definitely_needed).
+ */
+FullTransactionId
+GlobalVisTestNonRemovableFullHorizon(GlobalVisState *state)
+{
+	/* acquire accurate horizon if not already done */
+	if (GlobalVisTestShouldUpdate(state))
+		GlobalVisUpdate();
+
+	return state->maybe_needed;
+}
+
+/* Convenience wrapper around GlobalVisTestNonRemovableFullHorizon */
+TransactionId
+GlobalVisTestNonRemovableHorizon(GlobalVisState *state)
+{
+	FullTransactionId cutoff;
+
+	cutoff = GlobalVisTestNonRemovableFullHorizon(state);
+
+	return XidFromFullTransactionId(cutoff);
+}
+
+/*
+ * Convenience wrapper around GlobalVisTestFor() and
+ * GlobalVisTestIsRemovableFullXid(), see their comments.
+ */
+bool
+GlobalVisIsRemovableFullXid(Relation rel, FullTransactionId fxid)
+{
+	GlobalVisState *state;
+
+	state = GlobalVisTestFor(rel);
+
+	return GlobalVisTestIsRemovableFullXid(state, fxid);
+}
+
+/*
+ * Convenience wrapper around GlobalVisTestFor() and
+ * GlobalVisTestIsRemovableXid(), see their comments.
+ */
+bool
+GlobalVisCheckRemovableXid(Relation rel, TransactionId xid)
+{
+	GlobalVisState *state;
+
+	state = GlobalVisTestFor(rel);
+
+	return GlobalVisTestIsRemovableXid(state, xid);
+}
+
+/*
+ * Convert a 32 bit transaction id into 64 bit transaction id, by assuming it
+ * is within MaxTransactionId / 2 of XidFromFullTransactionId(rel).
+ *
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that xid is within MaxTransactionId / 2 xids of
+ * rel. That e.g. can be guaranteed if the caller assures a snapshot is
+ * held by the backend and xid is from a table (where vacuum/freezing ensures
+ * the xid has to be within that range), or if xid is from the procarray and
+ * prevents xid wraparound that way.
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+	TransactionId rel_xid = XidFromFullTransactionId(rel);
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(TransactionIdIsValid(rel_xid));
+
+	/* not guaranteed to find issues, but likely to catch mistakes */
+	AssertTransactionInAllowableRange(xid);
+
+	return FullTransactionIdFromU64(U64FromFullTransactionId(rel)
+									+ (int32) (xid - rel_xid));
+}
+
 
 /* ----------------------------------------------
  *		KnownAssignedTransactionIds sub-module
@@ -3390,9 +4034,7 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
 
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  max_xid))
-		ShmemVariableCache->latestCompletedXid = max_xid;
+	MaintainLatestCompletedXidRecovery(max_xid);
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 4fdcb07d97b..5b479c8d6e9 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5591,14 +5591,15 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * recent); that case motivates not using SnapshotAny here.
 	 *
 	 * A crucial point here is that SnapshotNonVacuumable, with
-	 * RecentGlobalXmin as horizon, yields the inverse of the condition that
-	 * the indexscan will use to decide that index entries are killable (see
-	 * heap_hot_search_buffer()).  Therefore, if the snapshot rejects a tuple
-	 * (or more precisely, all tuples of a HOT chain) and we have to continue
-	 * scanning past it, we know that the indexscan will mark that index entry
-	 * killed.  That means that the next get_actual_variable_endpoint() call
-	 * will not have to re-consider that index entry.  In this way we avoid
-	 * repetitive work when this function is used a lot during planning.
+	 * GlobalVisTestFor(heapRel) as horizon, yields the inverse of the
+	 * condition that the indexscan will use to decide that index entries are
+	 * killable (see heap_hot_search_buffer()).  Therefore, if the snapshot
+	 * rejects a tuple (or more precisely, all tuples of a HOT chain) and we
+	 * have to continue scanning past it, we know that the indexscan will mark
+	 * that index entry killed.  That means that the next
+	 * get_actual_variable_endpoint() call will not have to re-consider that
+	 * index entry.  In this way we avoid repetitive work when this function
+	 * is used a lot during planning.
 	 *
 	 * But using SnapshotNonVacuumable creates a hazard of its own.  In a
 	 * recently-created index, some index entries may point at "broken" HOT
@@ -5610,7 +5611,8 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * or could even be NULL.  We avoid this hazard because we take the data
 	 * from the index entry not the heap.
 	 */
-	InitNonVacuumableSnapshot(SnapshotNonVacuumable, RecentGlobalXmin);
+	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
+							  GlobalVisTestFor(heapRel));
 
 	index_scan = index_beginscan(heapRel, indexRel,
 								 &SnapshotNonVacuumable,
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d5..893be2f3ddb 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -722,6 +722,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	 * is critical for anything that reads heap pages, because HOT may decide
 	 * to prune them even if the process doesn't attempt to modify any
 	 * tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	if (!bootstrap)
 	{
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592ce..ba5d9615c79 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -157,16 +157,9 @@ static Snapshot HistoricSnapshot = NULL;
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
- *
- * RecentGlobalXmin and RecentGlobalDataXmin are initialized to
- * InvalidTransactionId, to ensure that no one tries to use a stale
- * value. Readers should ensure that it has been set to something else
- * before using it.
  */
 TransactionId TransactionXmin = FirstNormalTransactionId;
 TransactionId RecentXmin = FirstNormalTransactionId;
-TransactionId RecentGlobalXmin = InvalidTransactionId;
-TransactionId RecentGlobalDataXmin = InvalidTransactionId;
 
 /* (table, ctid) => (cmin, cmax) mapping during timetravel */
 static HTAB *tuplecid_data = NULL;
@@ -581,9 +574,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	 * Even though we are not going to use the snapshot it computes, we must
 	 * call GetSnapshotData, for two reasons: (1) to be sure that
 	 * CurrentSnapshotData's XID arrays have been allocated, and (2) to update
-	 * RecentXmin and RecentGlobalXmin.  (We could alternatively include those
-	 * two variables in exported snapshot files, but it seems better to have
-	 * snapshot importers compute reasonably up-to-date values for them.)
+	 * the state for GlobalVis*.
 	 */
 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
@@ -956,36 +947,6 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 		return 0;
 }
 
-/*
- * Get current RecentGlobalXmin value, as a FullTransactionId.
- */
-FullTransactionId
-GetFullRecentGlobalXmin(void)
-{
-	FullTransactionId nextxid_full;
-	uint32		nextxid_epoch;
-	TransactionId nextxid_xid;
-	uint32		epoch;
-
-	Assert(TransactionIdIsNormal(RecentGlobalXmin));
-
-	/*
-	 * Compute the epoch from the next XID's epoch. This relies on the fact
-	 * that RecentGlobalXmin must be within the 2 billion XID horizon from the
-	 * next XID.
-	 */
-	nextxid_full = ReadNextFullTransactionId();
-	nextxid_epoch = EpochFromFullTransactionId(nextxid_full);
-	nextxid_xid = XidFromFullTransactionId(nextxid_full);
-
-	if (RecentGlobalXmin > nextxid_xid)
-		epoch = nextxid_epoch - 1;
-	else
-		epoch = nextxid_epoch;
-
-	return FullTransactionIdFromEpochAndXid(epoch, RecentGlobalXmin);
-}
-
 /*
  * SnapshotResetXmin
  *
@@ -1753,106 +1714,157 @@ GetOldSnapshotThresholdTimestamp(void)
 	return threshold_timestamp;
 }
 
-static void
+void
 SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 {
 	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
+	Assert(oldSnapshotControl->threshold_timestamp <= ts);
+	Assert(TransactionIdPrecedesOrEquals(oldSnapshotControl->threshold_xid, xlimit));
 	oldSnapshotControl->threshold_timestamp = ts;
 	oldSnapshotControl->threshold_xid = xlimit;
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+/*
+ * XXX: Magic to make old_snapshot_threshold tests appear "working". They
+ * currently are broken, and discussion of what to do about them is
+ * ongoing. See
+ * https://www.postgresql.org/message-id/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de
+ */
+void
+SnapshotTooOldMagicForTest(void)
+{
+	TimestampTz ts = GetSnapshotCurrentTimestamp();
+
+	Assert(old_snapshot_threshold == 0);
+
+	ts -= 5 * USECS_PER_SEC;
+
+	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
+	oldSnapshotControl->threshold_timestamp = ts;
+	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
+}
+
+/*
+ * If there is a valid mapping for the timestamp, set *xlimitp to
+ * that. Returns whether there is such a mapping.
+ */
+static bool
+GetOldSnapshotFromTimeMapping(TimestampTz ts, TransactionId *xlimitp)
+{
+	bool in_mapping = false;
+
+	Assert(ts == AlignTimestampToMinuteBoundary(ts));
+
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+	if (oldSnapshotControl->count_used > 0
+		&& ts >= oldSnapshotControl->head_timestamp)
+	{
+		int			offset;
+
+		offset = ((ts - oldSnapshotControl->head_timestamp)
+				  / USECS_PER_MINUTE);
+		if (offset > oldSnapshotControl->count_used - 1)
+			offset = oldSnapshotControl->count_used - 1;
+		offset = (oldSnapshotControl->head_offset + offset)
+			% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+		*xlimitp = oldSnapshotControl->xid_by_minute[offset];
+
+		in_mapping = true;
+	}
+
+	LWLockRelease(OldSnapshotTimeMapLock);
+
+	return in_mapping;
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
- * Apply old snapshot limit, if any.  This is intended to be called for page
- * pruning and table vacuuming, to allow old_snapshot_threshold to override
- * the normal global xmin value.  Actual testing for snapshot too old will be
- * based on whether a snapshot timestamp is prior to the threshold timestamp
- * set in this function.
+ * Apply old snapshot limit.  This is intended to be called for page pruning
+ * and table vacuuming, to allow old_snapshot_threshold to override the normal
+ * global xmin value.  Actual testing for snapshot too old will be based on
+ * whether a snapshot timestamp is prior to the threshold timestamp set in
+ * this function.
+ *
+ * If the limited horizon allows a cleanup action that otherwise would not be
+ * possible, SetOldSnapshotThresholdTimestamp(*limit_ts, *limit_xid) needs to
+ * be called before that cleanup action.
  */
-TransactionId
+bool
 TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-									Relation relation)
+									Relation relation,
+									TransactionId *limit_xid,
+									TimestampTz *limit_ts)
 {
-	if (TransactionIdIsNormal(recentXmin)
-		&& old_snapshot_threshold >= 0
-		&& RelationAllowsEarlyPruning(relation))
+	TimestampTz ts;
+	TransactionId xlimit = recentXmin;
+	TransactionId latest_xmin;
+	TimestampTz next_map_update_ts;
+	TimestampTz threshold_timestamp;
+	TransactionId threshold_xid;
+
+	Assert(TransactionIdIsNormal(recentXmin));
+	Assert(OldSnapshotThresholdActive());
+	Assert(limit_ts != NULL && limit_xid != NULL);
+
+	if (!RelationAllowsEarlyPruning(relation))
+		return false;
+
+	ts = GetSnapshotCurrentTimestamp();
+
+	SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
+	latest_xmin = oldSnapshotControl->latest_xmin;
+	next_map_update_ts = oldSnapshotControl->next_map_update;
+	SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
+
+	/*
+	 * Zero threshold always overrides to latest xmin, if valid.  Without
+	 * some heuristic it will find its own snapshot too old on, for
+	 * example, a simple UPDATE -- which would make it useless for most
+	 * testing, but there is no principled way to ensure that it doesn't
+	 * fail in this way.  Use a five-second delay to try to get useful
+	 * testing behavior, but this may need adjustment.
+	 */
+	if (old_snapshot_threshold == 0)
 	{
-		TimestampTz ts = GetSnapshotCurrentTimestamp();
-		TransactionId xlimit = recentXmin;
-		TransactionId latest_xmin;
-		TimestampTz update_ts;
-		bool		same_ts_as_threshold = false;
-
-		SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
-		latest_xmin = oldSnapshotControl->latest_xmin;
-		update_ts = oldSnapshotControl->next_map_update;
-		SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
-
-		/*
-		 * Zero threshold always overrides to latest xmin, if valid.  Without
-		 * some heuristic it will find its own snapshot too old on, for
-		 * example, a simple UPDATE -- which would make it useless for most
-		 * testing, but there is no principled way to ensure that it doesn't
-		 * fail in this way.  Use a five-second delay to try to get useful
-		 * testing behavior, but this may need adjustment.
-		 */
-		if (old_snapshot_threshold == 0)
-		{
-			if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
-				&& TransactionIdFollows(latest_xmin, xlimit))
-				xlimit = latest_xmin;
-
-			ts -= 5 * USECS_PER_SEC;
-			SetOldSnapshotThresholdTimestamp(ts, xlimit);
-
-			return xlimit;
-		}
+		if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
+			&& TransactionIdFollows(latest_xmin, xlimit))
+			xlimit = latest_xmin;
 
+		ts -= 5 * USECS_PER_SEC;
+	}
+	else
+	{
 		ts = AlignTimestampToMinuteBoundary(ts)
 			- (old_snapshot_threshold * USECS_PER_MINUTE);
 
 		/* Check for fast exit without LW locking. */
 		SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
-		if (ts == oldSnapshotControl->threshold_timestamp)
-		{
-			xlimit = oldSnapshotControl->threshold_xid;
-			same_ts_as_threshold = true;
-		}
+		threshold_timestamp = oldSnapshotControl->threshold_timestamp;
+		threshold_xid = oldSnapshotControl->threshold_xid;
 		SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 
-		if (!same_ts_as_threshold)
+		if (ts == threshold_timestamp)
+		{
+			/*
+			 * Current timestamp is in same bucket as the last limit that
+			 * was applied. Reuse.
+			 */
+			xlimit = threshold_xid;
+		}
+		else if (ts == next_map_update_ts)
+		{
+			/*
+			 * FIXME: This branch is super iffy - but that should probably be
+			 * fixed separately.
+			 */
+			xlimit = latest_xmin;
+		}
+		else if (GetOldSnapshotFromTimeMapping(ts, &xlimit))
 		{
-			if (ts == update_ts)
-			{
-				xlimit = latest_xmin;
-				if (NormalTransactionIdFollows(xlimit, recentXmin))
-					SetOldSnapshotThresholdTimestamp(ts, xlimit);
-			}
-			else
-			{
-				LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
-
-				if (oldSnapshotControl->count_used > 0
-					&& ts >= oldSnapshotControl->head_timestamp)
-				{
-					int			offset;
-
-					offset = ((ts - oldSnapshotControl->head_timestamp)
-							  / USECS_PER_MINUTE);
-					if (offset > oldSnapshotControl->count_used - 1)
-						offset = oldSnapshotControl->count_used - 1;
-					offset = (oldSnapshotControl->head_offset + offset)
-						% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
-					xlimit = oldSnapshotControl->xid_by_minute[offset];
-
-					if (NormalTransactionIdFollows(xlimit, recentXmin))
-						SetOldSnapshotThresholdTimestamp(ts, xlimit);
-				}
-
-				LWLockRelease(OldSnapshotTimeMapLock);
-			}
 		}
 
 		/*
@@ -1867,12 +1879,18 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 		if (TransactionIdIsNormal(latest_xmin)
 			&& TransactionIdPrecedes(latest_xmin, xlimit))
 			xlimit = latest_xmin;
-
-		if (NormalTransactionIdFollows(xlimit, recentXmin))
-			return xlimit;
 	}
 
-	return recentXmin;
+	if (TransactionIdIsValid(xlimit) &&
+		TransactionIdFollowsOrEquals(xlimit, recentXmin))
+	{
+		*limit_ts = ts;
+		*limit_xid = xlimit;
+
+		return true;
+	}
+
+	return false;
 }
 
 /*
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ceaaa271680..f8411621043 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -412,10 +412,10 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	Snapshot	snapshot = SnapshotAny;
 
 	/*
-	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
-	 * RecentGlobalXmin/B-Tree page deletion.
+	 * This assertion matches the one in index_getnext_tid().  See page
+	 * recycling/"visible to everyone" notes in nbtree README.
 	 */
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(TransactionIdIsValid(RecentXmin));
 
 	/*
 	 * Initialize state for entire verification operation
@@ -1437,7 +1437,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * does not occur until no possible index scan could land on the page.
 	 * Index scans can follow links with nothing more than their snapshot as
 	 * an interlock and be sure of at least that much.  (See page
-	 * recycling/RecentGlobalXmin notes in nbtree README.)
+	 * recycling/"visible to everyone" notes in nbtree README.)
 	 *
 	 * Furthermore, it's okay if we follow a rightlink and find a half-dead or
 	 * dead (ignorable) page one or more times.  There will either be a
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 0cd1160ceb2..e32645d2d3d 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
 
-	if (all_visible)
-	{
-		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
-	}
-
 	rel = relation_open(relid, AccessShareLock);
 
 	/* Only some relkinds have a visibility map */
 	check_relation_relkind(rel);
 
+	if (all_visible)
+		OldestXmin = GetOldestNonRemovableTransactionId(rel);
+
 	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/*
@@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * From a concurrency point of view, it sort of sucks to
 				 * retake ProcArrayLock here while we're holding the buffer
 				 * exclusively locked, but it should be safe against
-				 * deadlocks, because surely GetOldestXmin() should never take
-				 * a buffer lock. And this shouldn't happen often, so it's
-				 * worth being careful so as to avoid false positives.
+				 * deadlocks, because surely GetOldestNonRemovableTransactionId()
+				 * should never take a buffer lock. And this shouldn't happen
+				 * often, so it's worth being careful so as to avoid false
+				 * positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+				RecomputedOldestXmin = GetOldestNonRemovableTransactionId(rel);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 96d837485fa..e795f0862fb 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
 
-	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(rel);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 525d58e7f01..2d821bd817f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -386,6 +386,7 @@ CompositeTypeStmt
 CompoundAffixFlag
 CompressionAlgorithm
 CompressorState
+ComputeXidHorizonsResult
 ConditionVariable
 ConditionalStack
 ConfigData
@@ -919,6 +920,7 @@ GistSplitVector
 GistTsVectorOptions
 GistVacState
 GlobalTransaction
+GlobalVisState
 GrantRoleStmt
 GrantStmt
 GrantTargetType
-- 
2.25.0.114.g5b0ca878e0
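
The FullXidViaRelative() arithmetic in the patch above can be sketched standalone as follows. The names xid32, xid64 and full_xid_via_relative are made up for illustration and are not the PostgreSQL typedefs. The sketch assumes the usual two's complement result when an out-of-range uint32_t is cast to int32_t, and it is only correct under the precondition the patch comment states: xid has to be within 2^31 of the reference value.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins, not the PostgreSQL typedefs. */
typedef uint32_t xid32;			/* 32-bit xid, wraps around */
typedef uint64_t xid64;			/* (epoch << 32) | xid */

/*
 * Widen a 32-bit xid to 64 bits by placing it relative to a known 64-bit
 * reference value.  Only correct if xid is within 2^31 of rel's low 32 bits.
 */
static xid64
full_xid_via_relative(xid64 rel, xid32 xid)
{
	xid32		rel_xid = (xid32) rel;	/* low 32 bits of the reference */
	int32_t		delta;

	/* 0 is the invalid xid in PostgreSQL; mirror the patch's Assert()s */
	assert(xid != 0);
	assert(rel_xid != 0);

	/*
	 * Modulo-2^32 subtraction reinterpreted as signed: negative if xid is
	 * logically older than rel_xid, positive if newer, even across a
	 * 32-bit wraparound.
	 */
	delta = (int32_t) (xid - rel_xid);

	/* unsigned 64-bit addition of the sign-extended distance */
	return rel + (xid64) (int64_t) delta;
}
```

With a reference in epoch 5 at xid 100, the 32-bit value 0xFFFFFFF0 is 116 xids older and therefore lands in epoch 4, while 200 stays in epoch 5.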

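The two-boundary test in GlobalVisTestIsRemovableFullXid() above can be illustrated with a simplified sketch. VisBounds, is_removable and the demo_* helpers are hypothetical names, plain uint64_t values stand in for FullTransactionId, and the GlobalVisTestShouldUpdate() heuristic is left out, so every uncertain lookup here pays for the expensive recomputation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for GlobalVisState. */
typedef struct VisBounds
{
	uint64_t	maybe_needed;		/* xids below this are removable */
	uint64_t	definitely_needed;	/* xids at or above this are kept */
} VisBounds;

/* Demo stand-in for the expensive ComputeXidHorizons() scan. */
static uint64_t demo_horizon = 0;
static int	demo_horizon_calls = 0;

static uint64_t
demo_compute_horizon(void)
{
	demo_horizon_calls++;
	return demo_horizon;
}

/* Return true if no snapshot can still consider fxid to be running. */
static bool
is_removable(VisBounds *b, uint64_t fxid)
{
	/* cheap answer: older than the lower bound */
	if (fxid < b->maybe_needed)
		return true;

	/* cheap answer: at or past the upper bound */
	if (fxid >= b->definitely_needed)
		return false;

	/* in between: compute an accurate horizon and recheck */
	b->maybe_needed = demo_compute_horizon();
	if (b->definitely_needed < b->maybe_needed)
		b->definitely_needed = b->maybe_needed;

	return fxid < b->maybe_needed;
}
```

Only xids that fall into [maybe_needed, definitely_needed) trigger the recomputation; everything outside that window is answered from the two cached bounds alone.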
v9-0002-snapshot-scalability-Move-PGXACT-xmin-back-to-PGP.patch (text/x-diff; charset=us-ascii)
From 0bb7f412be4e3ea94c7238b15ade837090be22ae Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 8 Apr 2020 04:34:36 -0700
Subject: [PATCH v9 2/6] snapshot scalability: Move PGXACT->xmin back to
 PGPROC.

Now that xmin isn't needed for GetSnapshotData() anymore, it just
leads to unnecessary cacheline ping-pong to have it in PGXACT.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h                  | 10 +++---
 src/backend/access/gist/gistxlog.c          |  2 +-
 src/backend/access/transam/README           |  2 +-
 src/backend/access/transam/twophase.c       |  2 +-
 src/backend/commands/indexcmds.c            |  2 +-
 src/backend/replication/logical/snapbuild.c |  6 ++--
 src/backend/replication/walsender.c         | 10 +++---
 src/backend/storage/ipc/procarray.c         | 36 +++++++++------------
 src/backend/storage/ipc/sinvaladt.c         |  2 +-
 src/backend/storage/lmgr/proc.c             |  4 +--
 src/backend/utils/time/snapmgr.c            | 28 ++++++++--------
 11 files changed, 50 insertions(+), 54 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 23d12c1f72f..3b3936249ab 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -95,6 +95,11 @@ struct PGPROC
 
 	Latch		procLatch;		/* generic latch for process */
 
+	TransactionId xmin;			/* minimal running XID as it was when we were
+								 * starting our xact, excluding LAZY VACUUM:
+								 * vacuum must not remove tuples deleted by
+								 * xid >= xmin ! */
+
 	LocalTransactionId lxid;	/* local id of top-level transaction currently
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
@@ -219,11 +224,6 @@ typedef struct PGXACT
 								 * executed by this proc, if running and XID
 								 * is assigned; else InvalidTransactionId */
 
-	TransactionId xmin;			/* minimal running XID as it was when we were
-								 * starting our xact, excluding LAZY VACUUM:
-								 * vacuum must not remove tuples deleted by
-								 * xid >= xmin ! */
-
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index af4731cff18..19f39e4dc0a 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -389,7 +389,7 @@ gistRedoPageReuse(XLogReaderState *record)
 	 *
 	 * latestRemovedXid was the page's deleteXid.  The
 	 * GlobalVisIsRemovableFullXid(deleteXid) test in gistPageRecyclable()
-	 * conceptually mirrors the pgxact->xmin > limitXmin test in
+	 * conceptually mirrors the PGPROC->xmin > limitXmin test in
 	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
 	 * same exclusion effect on master and standby.
 	 */
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 4e2178dabab..94d8f3fd0a2 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -331,7 +331,7 @@ necessary.
 Note that while it is certain that two concurrent executions of
 GetSnapshotData will compute the same xmin for their own snapshots, there is
 no such guarantee for the horizons computed by ComputeXidHorizons.  This is
-because we allow XID-less transactions to clear their MyPgXact->xmin
+because we allow XID-less transactions to clear their MyProc->xmin
 asynchronously (without taking ProcArrayLock), so one execution might see
 what had been the oldest xmin, and another not.  This is OK since the
 thresholds need only be a valid lower bound.  As noted above, we are already
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2f7d4ed59a8..5867cc60f3e 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -464,7 +464,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
 	proc->lxid = (LocalTransactionId) xid;
 	pgxact->xid = xid;
-	pgxact->xmin = InvalidTransactionId;
+	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
 	proc->pid = 0;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2baca12c5f4..9d741aa03fa 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1535,7 +1535,7 @@ DefineIndex(Oid relationId,
 	StartTransactionCommand();
 
 	/* We should now definitely not be advertising any xmin. */
-	Assert(MyPgXact->xmin == InvalidTransactionId);
+	Assert(MyProc->xmin == InvalidTransactionId);
 
 	/*
 	 * The index is now valid in the sense that it contains all currently
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3089f0d5ddc..e9701ea7221 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -553,8 +553,8 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 		elog(ERROR, "cannot build an initial slot snapshot, not all transactions are monitored anymore");
 
 	/* so we don't overwrite the existing value */
-	if (TransactionIdIsValid(MyPgXact->xmin))
-		elog(ERROR, "cannot build an initial slot snapshot when MyPgXact->xmin already is valid");
+	if (TransactionIdIsValid(MyProc->xmin))
+		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
 
 	snap = SnapBuildBuildSnapshot(builder);
 
@@ -575,7 +575,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	}
 #endif
 
-	MyPgXact->xmin = snap->xmin;
+	MyProc->xmin = snap->xmin;
 
 	/* allocate in transaction context */
 	newxip = (TransactionId *)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d8989762d74..b15faa18194 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1948,7 +1948,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 	ReplicationSlot *slot = MyReplicationSlot;
 
 	SpinLockAcquire(&slot->mutex);
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 
 	/*
 	 * For physical replication we don't need the interlock provided by xmin
@@ -2077,7 +2077,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	if (!TransactionIdIsNormal(feedbackXmin)
 		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
-		MyPgXact->xmin = InvalidTransactionId;
+		MyProc->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
 			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
@@ -2119,7 +2119,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * risk already since a VACUUM could already have determined the horizon.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * otherwise via the walsender's PGPROC entry. We can only track the
 	 * catalog xmin separately when using a slot, so we store the least of the
 	 * two provided when not using a slot.
 	 *
@@ -2132,9 +2132,9 @@ ProcessStandbyHSFeedbackMessage(void)
 	{
 		if (TransactionIdIsNormal(feedbackCatalogXmin)
 			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
-			MyPgXact->xmin = feedbackCatalogXmin;
+			MyProc->xmin = feedbackCatalogXmin;
 		else
-			MyPgXact->xmin = feedbackXmin;
+			MyProc->xmin = feedbackXmin;
 	}
 }
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 58f119f9895..899f936925e 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -588,9 +588,9 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
 		proc->lxid = InvalidLocalTransactionId;
-		pgxact->xmin = InvalidTransactionId;
 		/* must be cleared with xid/xmin: */
 		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false; /* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
@@ -610,9 +610,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 {
 	pgxact->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	pgxact->xmin = InvalidTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+	proc->xmin = InvalidTransactionId;
 	proc->delayChkpt = false; /* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
@@ -764,7 +764,7 @@ ProcArrayClearTransaction(PGPROC *proc)
 	 */
 	pgxact->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	pgxact->xmin = InvalidTransactionId;
+	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
 	/* redundant, but just in case */
@@ -1553,7 +1553,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 
 		/* Fetch xid just once - see GetNewTransactionId */
 		xid = UINT32_ACCESS_ONCE(pgxact->xid);
-		xmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*
 		 * Consider both the transaction's Xmin, and its Xid.
@@ -1825,7 +1825,7 @@ GetMaxSnapshotSubxidCount(void)
  *
  * We also update the following backend-global variables:
  *		TransactionXmin: the oldest xmin of any snapshot in use in the
- *			current transaction (this is the same as MyPgXact->xmin).
+ *			current transaction (this is the same as MyProc->xmin).
  *		RecentXmin: the xmin computed for the most recent snapshot.  XIDs
  *			older than this are known not running any more.
  *
@@ -1887,7 +1887,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
-	 * going to set MyPgXact->xmin.
+	 * going to set MyProc->xmin.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -2039,8 +2039,8 @@ GetSnapshotData(Snapshot snapshot)
 	replication_slot_xmin = procArray->replication_slot_xmin;
 	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
-	if (!TransactionIdIsValid(MyPgXact->xmin))
-		MyPgXact->xmin = TransactionXmin = xmin;
+	if (!TransactionIdIsValid(MyProc->xmin))
+		MyProc->xmin = TransactionXmin = xmin;
 
 	LWLockRelease(ProcArrayLock);
 
@@ -2160,7 +2160,7 @@ GetSnapshotData(Snapshot snapshot)
 }
 
 /*
- * ProcArrayInstallImportedXmin -- install imported xmin into MyPgXact->xmin
+ * ProcArrayInstallImportedXmin -- install imported xmin into MyProc->xmin
  *
  * This is called when installing a snapshot imported from another
  * transaction.  To ensure that OldestXmin doesn't go backwards, we must
@@ -2213,7 +2213,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 		/*
 		 * Likewise, let's just make real sure its xmin does cover us.
 		 */
-		xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+		xid = UINT32_ACCESS_ONCE(proc->xmin);
 		if (!TransactionIdIsNormal(xid) ||
 			!TransactionIdPrecedesOrEquals(xid, xmin))
 			continue;
@@ -2224,7 +2224,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 		 * GetSnapshotData first, we'll be overwriting a valid xmin here, so
 		 * we don't check that.)
 		 */
-		MyPgXact->xmin = TransactionXmin = xmin;
+		MyProc->xmin = TransactionXmin = xmin;
 
 		result = true;
 		break;
@@ -2236,7 +2236,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 }
 
 /*
- * ProcArrayInstallRestoredXmin -- install restored xmin into MyPgXact->xmin
+ * ProcArrayInstallRestoredXmin -- install restored xmin into MyProc->xmin
  *
  * This is like ProcArrayInstallImportedXmin, but we have a pointer to the
  * PGPROC of the transaction from which we imported the snapshot, rather than
@@ -2249,7 +2249,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 {
 	bool		result = false;
 	TransactionId xid;
-	PGXACT	   *pgxact;
 
 	Assert(TransactionIdIsNormal(xmin));
 	Assert(proc != NULL);
@@ -2257,20 +2256,18 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 	/* Get lock so source xact can't end while we're doing this */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	pgxact = &allPgXact[proc->pgprocno];
-
 	/*
 	 * Be certain that the referenced PGPROC has an advertised xmin which is
 	 * no later than the one we're installing, so that the system-wide xmin
 	 * can't go backwards.  Also, make sure it's running in the same database,
 	 * so that the per-database xmin cannot go backwards.
 	 */
-	xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+	xid = UINT32_ACCESS_ONCE(proc->xmin);
 	if (proc->databaseId == MyDatabaseId &&
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		MyPgXact->xmin = TransactionXmin = xmin;
+		MyProc->xmin = TransactionXmin = xmin;
 		result = true;
 	}
 
@@ -2895,7 +2892,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 		if (allDbs || proc->databaseId == MyDatabaseId)
 		{
 			/* Fetch xmin just once - might change on us */
-			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 			if (excludeXmin0 && !TransactionIdIsValid(pxmin))
 				continue;
@@ -2981,7 +2978,6 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
 		/* Exclude prepared transactions */
 		if (proc->pid == 0)
@@ -2991,7 +2987,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 			proc->databaseId == dbOid)
 		{
 			/* Fetch xmin just once - can't change on us, but good coding */
-			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 			/*
 			 * We ignore an invalid pxmin because this means that backend has
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index e5c115b92f2..ad048bc85fa 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -420,7 +420,7 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 			PGXACT	   *xact = &ProcGlobal->allPgXact[proc->pgprocno];
 
 			*xid = xact->xid;
-			*xmin = xact->xmin;
+			*xmin = proc->xmin;
 		}
 	}
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5aa19d3f781..66d25dba7f8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -388,7 +388,7 @@ InitProcess(void)
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyPgXact->xid = InvalidTransactionId;
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
 	MyProc->backendId = InvalidBackendId;
@@ -572,7 +572,7 @@ InitAuxiliaryProcess(void)
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyPgXact->xid = InvalidTransactionId;
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
 	MyProc->roleId = InvalidOid;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ba5d9615c79..e9d3e832c76 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -27,11 +27,11 @@
  * their lifetime is managed separately (as they live longer than one xact.c
  * transaction).
  *
- * These arrangements let us reset MyPgXact->xmin when there are no snapshots
+ * These arrangements let us reset MyProc->xmin when there are no snapshots
  * referenced by this transaction, and advance it when the one with oldest
  * Xmin is no longer referenced.  For simplicity however, only registered
  * snapshots not active snapshots participate in tracking which one is oldest;
- * we don't try to change MyPgXact->xmin except when the active-snapshot
+ * we don't try to change MyProc->xmin except when the active-snapshot
  * stack is empty.
  *
  *
@@ -187,7 +187,7 @@ static ActiveSnapshotElt *OldestActiveSnapshot = NULL;
 
 /*
  * Currently registered Snapshots.  Ordered in a heap by xmin, so that we can
- * quickly find the one with lowest xmin, to advance our MyPgXact->xmin.
+ * quickly find the one with lowest xmin, to advance our MyProc->xmin.
  */
 static int	xmin_cmp(const pairingheap_node *a, const pairingheap_node *b,
 					 void *arg);
@@ -475,7 +475,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 
 		/*
 		 * Make sure the catalog snapshot will be accounted for in decisions
-		 * about advancing PGXACT->xmin.  We could apply RegisterSnapshot, but
+		 * about advancing PGPROC->xmin.  We could apply RegisterSnapshot, but
 		 * that would result in making a physical copy, which is overkill; and
 		 * it would also create a dependency on some resource owner, which we
 		 * do not want for reasons explained at the head of this file. Instead
@@ -596,7 +596,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	/* NB: curcid should NOT be copied, it's a local matter */
 
 	/*
-	 * Now we have to fix what GetSnapshotData did with MyPgXact->xmin and
+	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
 	 * causing the global xmin to go backwards, we have to test that the
 	 * source transaction is still running, and that has to be done
@@ -950,13 +950,13 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 /*
  * SnapshotResetXmin
  *
- * If there are no more snapshots, we can reset our PGXACT->xmin to InvalidXid.
+ * If there are no more snapshots, we can reset our PGPROC->xmin to InvalidXid.
  * Note we can do this without locking because we assume that storing an Xid
  * is atomic.
  *
  * Even if there are some remaining snapshots, we may be able to advance our
- * PGXACT->xmin to some degree.  This typically happens when a portal is
- * dropped.  For efficiency, we only consider recomputing PGXACT->xmin when
+ * PGPROC->xmin to some degree.  This typically happens when a portal is
+ * dropped.  For efficiency, we only consider recomputing PGPROC->xmin when
  * the active snapshot stack is empty; this allows us not to need to track
  * which active snapshot is oldest.
  *
@@ -977,15 +977,15 @@ SnapshotResetXmin(void)
 
 	if (pairingheap_is_empty(&RegisteredSnapshots))
 	{
-		MyPgXact->xmin = InvalidTransactionId;
+		MyProc->xmin = InvalidTransactionId;
 		return;
 	}
 
 	minSnapshot = pairingheap_container(SnapshotData, ph_node,
 										pairingheap_first(&RegisteredSnapshots));
 
-	if (TransactionIdPrecedes(MyPgXact->xmin, minSnapshot->xmin))
-		MyPgXact->xmin = minSnapshot->xmin;
+	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
+		MyProc->xmin = minSnapshot->xmin;
 }
 
 /*
@@ -1132,13 +1132,13 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 
 	/*
 	 * During normal commit processing, we call ProcArrayEndTransaction() to
-	 * reset the MyPgXact->xmin. That call happens prior to the call to
+	 * reset the MyProc->xmin. That call happens prior to the call to
 	 * AtEOXact_Snapshot(), so we need not touch xmin here at all.
 	 */
 	if (resetXmin)
 		SnapshotResetXmin();
 
-	Assert(resetXmin || MyPgXact->xmin == 0);
+	Assert(resetXmin || MyProc->xmin == 0);
 }
 
 
@@ -1830,7 +1830,7 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 	 */
 	if (old_snapshot_threshold == 0)
 	{
-		if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
+		if (TransactionIdPrecedes(latest_xmin, MyProc->xmin)
 			&& TransactionIdFollows(latest_xmin, xlimit))
 			xlimit = latest_xmin;
 
-- 
2.25.0.114.g5b0ca878e0

v9-0003-snapshot-scalability-Move-in-progress-xids-to-Pro.patch (text/x-diff; charset=us-ascii)
From c220cb784858fd53701b71a8e19c129ac1dd00a0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 8 Apr 2020 02:31:33 -0700
Subject: [PATCH v9 3/6] snapshot scalability: Move in-progress xids to
 ProcGlobal->xids[].

This improves performance because GetSnapshotData() always needs to
scan the xids of all procarray entries. Now there's no need to go
through the procArray->pgprocnos indirection anymore.

As the set of running toplevel xids changes rarely compared to the
number of snapshots taken, this substantially increases the likelihood
of most data required for a snapshot already being in L2 cache.  In
read-mostly workloads, scanning the xids[] array will be sufficient to
build a snapshot, as most backends will not have an xid assigned.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
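
Note for readers: the idea in the commit message above (scan a densely
packed array of xids instead of following an indirection to each per-backend
struct, while keeping a mirrored copy in the per-backend struct) can be
sketched roughly as follows. This is a simplified illustration, not the code
from the patch. All Sketch* types, the sketch_* functions, and MAX_PROCS are
hypothetical names invented here; locking is omitted entirely, and new entries
are simply appended, which may differ from how the real procarray orders its
members.

```c
/* Simplified sketch of "mirrored" dense arrays; NOT PostgreSQL's definitions. */
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define MAX_PROCS 8				/* hypothetical fixed capacity */

typedef struct SketchProc
{
	TransactionId xid;			/* per-backend copy of the xid */
	int			pgxactoff;		/* index into the dense array, -1 if absent */
} SketchProc;

typedef struct SketchProcGlobal
{
	TransactionId xids[MAX_PROCS];	/* dense mirror of SketchProc.xid */
	SketchProc *procs[MAX_PROCS];	/* owner of each dense slot */
	int			numProcs;
} SketchProcGlobal;

/* Append a proc; the real code would hold exclusive locks here (omitted). */
static void
sketch_add(SketchProcGlobal *g, SketchProc *p)
{
	assert(g->numProcs < MAX_PROCS);
	p->pgxactoff = g->numProcs;
	g->procs[g->numProcs] = p;
	g->xids[g->numProcs] = p->xid;	/* initialize the mirror from the proc */
	g->numProcs++;
}

/* Remove a proc and compact the arrays; later members get a new offset. */
static void
sketch_remove(SketchProcGlobal *g, SketchProc *p)
{
	for (int i = p->pgxactoff; i < g->numProcs - 1; i++)
	{
		g->xids[i] = g->xids[i + 1];
		g->procs[i] = g->procs[i + 1];
		g->procs[i]->pgxactoff = i;
	}
	g->numProcs--;
	g->xids[g->numProcs] = InvalidTransactionId;
	p->pgxactoff = -1;
}

/* Update both copies together so they stay coherent. */
static void
sketch_set_xid(SketchProcGlobal *g, SketchProc *p, TransactionId xid)
{
	p->xid = xid;
	g->xids[p->pgxactoff] = xid;
}

/* Snapshot-style scan: one tight loop over the dense array, no indirection. */
static int
sketch_collect_running(const SketchProcGlobal *g, TransactionId *out)
{
	int			n = 0;

	for (int i = 0; i < g->numProcs; i++)
	{
		TransactionId xid = g->xids[i];

		if (xid != InvalidTransactionId)
			out[n++] = xid;
	}
	return n;
}
```

In this sketch, removing one member shifts the offsets of every later member,
which is why an offset read earlier cannot be trusted unless something
prevents concurrent additions and removals. The per-backend copy lets a
backend inspect its own xid without touching the dense array at all.
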
 src/include/storage/proc.h                  |  66 ++++-
 src/backend/access/heap/heapam_visibility.c |   8 +-
 src/backend/access/transam/README           |  33 +--
 src/backend/access/transam/clog.c           |   8 +-
 src/backend/access/transam/twophase.c       |  31 +--
 src/backend/access/transam/varsup.c         |  20 +-
 src/backend/commands/vacuum.c               |   2 +-
 src/backend/storage/ipc/procarray.c         | 283 +++++++++++++-------
 src/backend/storage/ipc/sinvaladt.c         |   4 +-
 src/backend/storage/lmgr/lock.c             |   3 +-
 src/backend/storage/lmgr/proc.c             |  26 +-
 11 files changed, 323 insertions(+), 161 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3b3936249ab..f240bc7521e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -83,6 +83,17 @@ struct XidCache
  * distinguished from a real one at need by the fact that it has pid == 0.
  * The semaphore and lock-activity fields in a prepared-xact PGPROC are unused,
  * but its myProcLocks[] lists are valid.
+ *
+ * Mirrored fields:
+ *
+ * Some fields in PGPROC (see "mirrored in ..." comment) are mirrored into an
+ * element of more densely packed ProcGlobal arrays. These arrays are indexed
+ * by PGPROC->pgxactoff. Both copies need to be maintained coherently.
+ *
+ * NB: The pgxactoff indexed value can *never* be accessed without holding
+ * locks.
+ *
+ * See PROC_HDR for details.
  */
 struct PGPROC
 {
@@ -95,6 +106,12 @@ struct PGPROC
 
 	Latch		procLatch;		/* generic latch for process */
 
+
+	TransactionId xid;			/* id of top-level transaction currently being
+								 * executed by this proc, if running and XID
+								 * is assigned; else InvalidTransactionId.
+								 * mirrored in ProcGlobal->xids[pgxactoff] */
+
 	TransactionId xmin;			/* minimal running XID as it was when we were
 								 * starting our xact, excluding LAZY VACUUM:
 								 * vacuum must not remove tuples deleted by
@@ -104,6 +121,9 @@ struct PGPROC
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
 	int			pid;			/* Backend's process ID; 0 if prepared xact */
+
+	int			pgxactoff;		/* offset into various ProcGlobal->arrays
+								 * with data mirrored from this PGPROC */
 	int			pgprocno;
 
 	/* These fields are zero while a backend is still starting up: */
@@ -220,10 +240,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	TransactionId xid;			/* id of top-level transaction currently being
-								 * executed by this proc, if running and XID
-								 * is assigned; else InvalidTransactionId */
-
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
@@ -232,6 +248,44 @@ typedef struct PGXACT
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
+ *
+ * Adding/Removing an entry into the procarray requires holding *both*
+ * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
+ * needed because the dense arrays (see below) are accessed from
+ * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
+ * further contention by both using the same lock. Adding/Removing a procarray
+ * entry is much less frequent.
+ *
+ * Some fields in PGPROC are mirrored into more densely packed arrays (like
+ * xids), with one entry for each backend. These arrays only contain entries
+ * for PGPROCs that have been added to the shared array with
+ * ProcArrayAdd().
+ *
+ * The dense arrays are indexed by PGPROC->pgxactoff. Any concurrent
+ * ProcArrayAdd() / ProcArrayRemove() can cause the pgxactoff of a procarray
+ * member to change.  Therefore it is only safe to use PGPROC->pgxactoff to
+ * access the dense array while holding either ProcArrayLock or XidGenLock.
+ *
+ * The data is mirrored to the separate arrays for three reasons: First, to
+ * allow for as tight loops accessing the data as possible. Second, to prevent
+ * updates of frequently changing data (e.g. xmin) from invalidating
+ * cachelines also containing less frequently changing data (e.g. xid,
+ * vacuumFlags). Third, to condense frequently accessed data into as few
+ * cachelines as possible.
+ *
+ * The reason to still have the mirrored data in PGPROC is that doing so
+ * avoids having to hold the locks mentioned above. That is particularly
+ * important for a backend checking its own values, which it often can
+ * safely do without any locking.  A secondary benefit is that
+ * unnecessary access to the dense array can often be avoided at commit time,
+ * by checking if the PGPROC value indicates that state needs to be reset.
+ *
+ * As long as a PGPROC is in the procarray, the mirrored values need to be
+ * maintained in both places in a coherent manner.
+ *
+ * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), the data
+ * in the dense arrays is initialized from the PGPROC while it already holds
+ * ProcArrayLock.
  */
 typedef struct PROC_HDR
 {
@@ -239,6 +293,10 @@ typedef struct PROC_HDR
 	PGPROC	   *allProcs;
 	/* Array of PGXACT structures (not including dummies for prepared txns) */
 	PGXACT	   *allPgXact;
+
+	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
+	TransactionId *xids;
+
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index b25b3e429ed..10848649c0c 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -11,12 +11,12 @@
  * shared buffer content lock on the buffer containing the tuple.
  *
  * NOTE: When using a non-MVCC snapshot, we must check
- * TransactionIdIsInProgress (which looks in the PGXACT array)
+ * TransactionIdIsInProgress (which looks in the PGPROC array)
  * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
  * pg_xact).  Otherwise we have a race condition: we might decide that a
  * just-committed transaction crashed, because none of the tests succeed.
  * xact.c is careful to record commit/abort in pg_xact before it unsets
- * MyPgXact->xid in the PGXACT array.  That fixes that problem, but it
+ * MyProc->xid in the PGPROC array.  That fixes that problem, but it
  * also means there is a window where TransactionIdIsInProgress and
  * TransactionIdDidCommit will both return true.  If we check only
  * TransactionIdDidCommit, we could consider a tuple committed when a
@@ -956,7 +956,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
  * coding where we tried to set the hint bits as soon as possible, we instead
  * did TransactionIdIsInProgress in each call --- to no avail, as long as the
  * inserting/deleting transaction was still running --- which was more cycles
- * and more contention on the PGXACT array.
+ * and more contention on ProcArrayLock.
  */
 static bool
 HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
@@ -1445,7 +1445,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
  *	HeapTupleSatisfiesMVCC) and, therefore, any hint bits that can be set
  *	should already be set.  We assume that if no hint bits are set, the xmin
  *	or xmax transaction is still running.  This is therefore faster than
- *	HeapTupleSatisfiesVacuum, because we don't consult PGXACT nor CLOG.
+ *	HeapTupleSatisfiesVacuum, because we consult neither procarray nor CLOG.
  *	It's okay to return false when in doubt, but we must return true only
  *	if the tuple is removable.
  */
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 94d8f3fd0a2..c46fc3cc194 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -251,10 +251,10 @@ enforce, and it assists with some other issues as explained below.)  The
 implementation of this is that GetSnapshotData takes the ProcArrayLock in
 shared mode (so that multiple backends can take snapshots in parallel),
 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
-while clearing MyPgXact->xid at transaction end (either commit or abort).
-(To reduce context switching, when multiple transactions commit nearly
-simultaneously, we have one backend take ProcArrayLock and clear the XIDs
-of multiple processes at once.)
+while clearing the ProcGlobal->xids[] entry at transaction end (either
+commit or abort). (To reduce context switching, when multiple transactions
+commit nearly simultaneously, we have one backend take ProcArrayLock and
+clear the XIDs of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
 latestCompletedFullXid variable.  This allows GetSnapshotData to use
@@ -278,12 +278,13 @@ present in the ProcArray, or not running anymore.  (This guarantee doesn't
 apply to subtransaction XIDs, because of the possibility that there's not
 room for them in the subxid array; instead we guarantee that they are
 present or the overflow flag is set.)  If a backend released XidGenLock
-before storing its XID into MyPgXact, then it would be possible for another
-backend to allocate and commit a later XID, causing latestCompletedFullXid to
-pass the first backend's XID, before that value became visible in the
-ProcArray.  That would break ComputeXidHorizons, as discussed below.
+before storing its XID into ProcGlobal->xids[], then it would be possible for
+another backend to allocate and commit a later XID, causing
+latestCompletedFullXid to pass the first backend's XID, before that value
+became visible in the ProcArray.  That would break ComputeXidHorizons,
+as discussed below.
 
-We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
+We allow GetNewTransactionId to store the XID into ProcGlobal->xids[] (or the
 subxid array) without taking ProcArrayLock.  This was once necessary to
 avoid deadlock; while that is no longer the case, it's still beneficial for
 performance.  We are thereby relying on fetch/store of an XID to be atomic,
@@ -382,13 +383,13 @@ Top-level transactions do not have a parent, so they leave their pg_subtrans
 entries set to the default value of zero (InvalidTransactionId).
 
 pg_subtrans is used to check whether the transaction in question is still
-running --- the main Xid of a transaction is recorded in the PGXACT struct,
-but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
-in shared memory, so we have to store them on disk.  Note, however, that for
-each transaction we keep a "cache" of Xids that are known to be part of the
-transaction tree, so we can skip looking at pg_subtrans unless we know the
-cache has been overflowed.  See storage/ipc/procarray.c for the gory details.
-
+running --- the main Xid of a transaction is recorded in ProcGlobal->xids[],
+with a copy in PGPROC->xid, but since we allow arbitrary nesting of
+subtransactions, we can't fit all Xids in shared memory, so we have to store
+them on disk.  Note, however, that for each transaction we keep a "cache" of
+Xids that are known to be part of the transaction tree, so we can skip looking
+at pg_subtrans unless we know the cache has been overflowed.  See
+storage/ipc/procarray.c for the gory details.
 slru.c is the supporting mechanism for both pg_xact and pg_subtrans.  It
 implements the LRU policy for in-memory buffer pages.  The high-level routines
 for pg_xact are implemented in transam.c, while the low-level functions are in
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f8e7670f8da..c920f565a39 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -285,15 +285,15 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 	 * updates for multiple backends so that the number of times
 	 * CLogControlLock needs to be acquired is reduced.
 	 *
-	 * For this optimization to be safe, the XID in MyPgXact and the subxids
-	 * in MyProc must be the same as the ones for which we're setting the
-	 * status.  Check that this is the case.
+	 * For this optimization to be safe, the XID and subxids in MyProc must be
+	 * the same as the ones for which we're setting the status.  Check that
+	 * this is the case.
 	 *
 	 * For this optimization to be efficient, we shouldn't have too many
 	 * sub-XIDs and all of the XIDs for which we're adjusting clog should be
 	 * on the same page.  Check those conditions, too.
 	 */
-	if (all_xact_same_page && xid == MyPgXact->xid &&
+	if (all_xact_same_page && xid == MyProc->xid &&
 		nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT &&
 		nsubxids == MyPgXact->nxids &&
 		memcmp(subxids, MyProc->subxids.xids,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5867cc60f3e..353f13ef489 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -351,7 +351,7 @@ AtAbort_Twophase(void)
 
 /*
  * This is called after we have finished transferring state to the prepared
- * PGXACT entry.
+ * PGPROC entry.
  */
 void
 PostPrepare_Twophase(void)
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->waitStatus = STATUS_OK;
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
 	proc->lxid = (LocalTransactionId) xid;
-	pgxact->xid = xid;
+	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
@@ -768,7 +768,6 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 	{
 		GlobalTransaction gxact = &status->array[status->currIdx++];
 		PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 		Datum		values[5];
 		bool		nulls[5];
 		HeapTuple	tuple;
@@ -783,7 +782,7 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 		MemSet(values, 0, sizeof(values));
 		MemSet(nulls, 0, sizeof(nulls));
 
-		values[0] = TransactionIdGetDatum(pgxact->xid);
+		values[0] = TransactionIdGetDatum(proc->xid);
 		values[1] = CStringGetTextDatum(gxact->gid);
 		values[2] = TimestampTzGetDatum(gxact->prepared_at);
 		values[3] = ObjectIdGetDatum(gxact->owner);
@@ -829,9 +828,8 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
-		if (pgxact->xid == xid)
+		if (gxact->xid == xid)
 		{
 			result = gxact;
 			break;
@@ -987,8 +985,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
-	TransactionId xid = pgxact->xid;
+	TransactionId xid = gxact->xid;
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileNode *commitrels;
@@ -1140,15 +1137,15 @@ EndPrepare(GlobalTransaction gxact)
 
 	/*
 	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyPgXact as not running our XID (which it will do immediately after
+	 * MyProc as not running our XID (which it will do immediately after
 	 * this function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
-	 * prepared XID.  This must happen before we clear the XID from MyPgXact,
-	 * else there is a window where the XID is not running according to
-	 * TransactionIdIsInProgress, and onlookers would be entitled to assume
-	 * the xact crashed.  Instead we have a window where the same XID appears
-	 * twice in ProcArray, which is OK.
+	 * prepared XID.  This must happen before we clear the XID from MyProc /
+	 * ProcGlobal->xids[], else there is a window where the XID is not running
+	 * according to TransactionIdIsInProgress, and onlookers would be entitled
+	 * to assume the xact crashed.  Instead we have a window where the same
+	 * XID appears twice in ProcArray, which is OK.
 	 */
 	MarkAsPrepared(gxact, false);
 
@@ -1401,7 +1398,6 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
-	PGXACT	   *pgxact;
 	TransactionId xid;
 	char	   *buf;
 	char	   *bufptr;
@@ -1420,8 +1416,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
-	xid = pgxact->xid;
+	xid = gxact->xid;
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1723,7 +1718,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		/*
-		 * Note that we are using gxact not pgxact so this works in recovery
+		 * Note that we are using gxact not PGPROC so this works in recovery
 		 * also
 		 */
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index c12e477ecfc..8869b8a6866 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -38,7 +38,8 @@ VariableCache ShmemVariableCache = NULL;
  * Allocate the next FullTransactionId for a new transaction or
  * subtransaction.
  *
- * The new XID is also stored into MyPgXact before returning.
+ * The new XID is also stored into MyProc->xid/ProcGlobal->xids[] before
+ * returning.
  *
  * Note: when this is called, we are actually already inside a valid
  * transaction, since XIDs are now not allocated until the transaction
@@ -65,7 +66,8 @@ GetNewTransactionId(bool isSubXact)
 	if (IsBootstrapProcessingMode())
 	{
 		Assert(!isSubXact);
-		MyPgXact->xid = BootstrapTransactionId;
+		MyProc->xid = BootstrapTransactionId;
+		ProcGlobal->xids[MyProc->pgxactoff] = BootstrapTransactionId;
 		return FullTransactionIdFromEpochAndXid(0, BootstrapTransactionId);
 	}
 
@@ -190,10 +192,10 @@ GetNewTransactionId(bool isSubXact)
 	 * latestCompletedXid is present in the ProcArray, which is essential for
 	 * correct OldestXmin tracking; see src/backend/access/transam/README.
 	 *
-	 * Note that readers of PGXACT xid fields should be careful to fetch the
-	 * value only once, rather than assume they can read a value multiple
-	 * times and get the same answer each time.  Note we are assuming that
-	 * TransactionId and int fetch/store are atomic.
+	 * Note that readers of ProcGlobal->xids/PGPROC->xid should be careful
+	 * to fetch the value for each proc only once, rather than assume they can
+	 * read a value multiple times and get the same answer each time.  Note we
+	 * are assuming that TransactionId and int fetch/store are atomic.
 	 *
 	 * The same comments apply to the subxact xid count and overflow fields.
 	 *
@@ -219,7 +221,11 @@ GetNewTransactionId(bool isSubXact)
 	 * answer later on when someone does have a reason to inquire.)
 	 */
 	if (!isSubXact)
-		MyPgXact->xid = xid;	/* LWLockRelease acts as barrier */
+	{
+		/* LWLockRelease acts as barrier */
+		MyProc->xid = xid;
+		ProcGlobal->xids[MyProc->pgxactoff] = xid;
+	}
 	else
 	{
 		int			nxids = MyPgXact->nxids;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 77474b8d7d6..83548cfa5ec 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1726,7 +1726,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 *
 		 * Note: these flags remain set until CommitTransaction or
 		 * AbortTransaction.  We don't want to clear them until we reset
-		 * MyPgXact->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
+		 * MyProc->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
 		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 899f936925e..5f0d3ee962e 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -9,8 +9,9 @@
  * one is as a means of determining the set of currently running transactions.
  *
  * Because of various subtle race conditions it is critical that a backend
- * hold the correct locks while setting or clearing its MyPgXact->xid field.
- * See notes in src/backend/access/transam/README.
+ * hold the correct locks while setting or clearing its xid (in
+ * ProcGlobal->xids[]/MyProc->xid).  See notes in
+ * src/backend/access/transam/README.
  *
  * The process arrays now also include structures representing prepared
  * transactions.  The xid and subxids fields of these are valid, as are the
@@ -333,6 +334,7 @@ static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
 
 static inline FullTransactionId FullXidViaRelative(FullTransactionId rel, TransactionId xid);
 
+
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
  */
@@ -437,7 +439,9 @@ ProcArrayAdd(PGPROC *proc)
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
 
+	/* See ProcGlobal comment explaining why both locks are held */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	if (arrayP->numProcs >= arrayP->maxProcs)
 	{
@@ -446,7 +450,6 @@ ProcArrayAdd(PGPROC *proc)
 		 * fixed supply of PGPROC structs too, and so we should have failed
 		 * earlier.)
 		 */
-		LWLockRelease(ProcArrayLock);
 		ereport(FATAL,
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
@@ -472,10 +475,25 @@ ProcArrayAdd(PGPROC *proc)
 	}
 
 	memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index],
-			(arrayP->numProcs - index) * sizeof(int));
+			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
+	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+
 	arrayP->pgprocnos[index] = proc->pgprocno;
+	ProcGlobal->xids[index] = proc->xid;
+
 	arrayP->numProcs++;
 
+	for (; index < arrayP->numProcs; index++)
+	{
+		allProcs[arrayP->pgprocnos[index]].pgxactoff = index;
+	}
+
+	/*
+	 * Release in reversed acquisition order, to reduce frequency of having to
+	 * wait for XidGenLock while holding ProcArrayLock.
+	 */
+	LWLockRelease(XidGenLock);
 	LWLockRelease(ProcArrayLock);
 }
 
@@ -501,36 +519,59 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		DisplayXidCache();
 #endif
 
+	/* See ProcGlobal comment explaining why both locks are held */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
+
+	Assert(ProcGlobal->allProcs[arrayP->pgprocnos[proc->pgxactoff]].pgxactoff == proc->pgxactoff);
 
 	if (TransactionIdIsValid(latestXid))
 	{
-		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
 
 		/* Advance global latestCompletedXid while holding the lock */
 		MaintainLatestCompletedXid(latestXid);
+
+		ProcGlobal->xids[proc->pgxactoff] = InvalidTransactionId;
 	}
 	else
 	{
 		/* Shouldn't be trying to remove a live transaction here */
-		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
 	}
 
+	Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
+
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		if (arrayP->pgprocnos[index] == proc->pgprocno)
 		{
 			/* Keep the PGPROC array sorted. See notes above */
 			memmove(&arrayP->pgprocnos[index], &arrayP->pgprocnos[index + 1],
-					(arrayP->numProcs - index - 1) * sizeof(int));
+					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
+			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+
 			arrayP->pgprocnos[arrayP->numProcs - 1] = -1;	/* for debugging */
 			arrayP->numProcs--;
+
+			for (; index < arrayP->numProcs; index++)
+			{
+				allProcs[arrayP->pgprocnos[index]].pgxactoff--;
+			}
+
+			/*
+			 * Release in reversed acquisition order, to reduce frequency of
+			 * having to wait for XidGenLock while holding ProcArrayLock.
+			 */
+			LWLockRelease(XidGenLock);
 			LWLockRelease(ProcArrayLock);
 			return;
 		}
 	}
 
 	/* Oops */
+	LWLockRelease(XidGenLock);
 	LWLockRelease(ProcArrayLock);
 
 	elog(LOG, "failed to find proc %p in ProcArray", proc);
@@ -563,7 +604,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * else is taking a snapshot.  See discussion in
 		 * src/backend/access/transam/README.
 		 */
-		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(TransactionIdIsValid(proc->xid));
 
 		/*
 		 * If we can immediately acquire ProcArrayLock, we clear our own XID
@@ -585,7 +626,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * anyone else's calculation of a snapshot.  We might change their
 		 * estimate of global xmin, but that's OK.
 		 */
-		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(!TransactionIdIsValid(proc->xid));
 
 		proc->lxid = InvalidLocalTransactionId;
 		/* must be cleared with xid/xmin: */
@@ -608,7 +649,13 @@ static inline void
 ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 								TransactionId latestXid)
 {
-	pgxact->xid = InvalidTransactionId;
+	size_t		pgxactoff = proc->pgxactoff;
+
+	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
+	Assert(ProcGlobal->xids[pgxactoff] == proc->xid);
+
+	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
+	proc->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
@@ -644,7 +691,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	uint32		wakeidx;
 
 	/* We should definitely have an XID to clear. */
-	Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+	Assert(TransactionIdIsValid(proc->xid));
 
 	/* Add ourselves to the list of processes needing a group XID clear. */
 	proc->procArrayGroupMember = true;
@@ -749,20 +796,28 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
  * This is used after successfully preparing a 2-phase transaction.  We are
  * not actually reporting the transaction's XID as no longer running --- it
  * will still appear as running because the 2PC's gxact is in the ProcArray
- * too.  We just have to clear out our own PGXACT.
+ * too.  We just have to clear out our own PGPROC.
  */
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
 	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
+	size_t		pgxactoff;
 
 	/*
-	 * We can skip locking ProcArrayLock here, because this action does not
-	 * actually change anyone's view of the set of running XIDs: our entry is
-	 * duplicate with the gxact that has already been inserted into the
-	 * ProcArray.
+	 * We can skip locking ProcArrayLock exclusively here, because this action
+	 * does not actually change anyone's view of the set of running XIDs: our
+	 * entry is a duplicate of the gxact that has already been inserted into
+	 * the ProcArray.  We still need the lock in shared mode, so that
+	 * proc->pgxactoff cannot change underneath us.
 	 */
-	pgxact->xid = InvalidTransactionId;
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	pgxactoff = proc->pgxactoff;
+
+	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
+	proc->xid = InvalidTransactionId;
+
 	proc->lxid = InvalidLocalTransactionId;
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
@@ -774,6 +829,8 @@ ProcArrayClearTransaction(PGPROC *proc)
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
 	pgxact->overflowed = false;
+
+	LWLockRelease(ProcArrayLock);
 }
 
 /*
@@ -1167,7 +1224,7 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * there are four possibilities for finding a running transaction:
  *
  * 1. The given Xid is a main transaction Id.  We will find this out cheaply
- * by looking at the PGXACT struct for each backend.
+ * by looking at ProcGlobal->xids.
  *
  * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
@@ -1176,25 +1233,27 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * if the Xid is running on the master.
  *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to PGXACT or KnownAssignedXids.  This is the
- * slowest way, but sadly it has to be done always if the others failed,
- * unless we see that the cached subxact sets are complete (none have
+ * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * This is the slowest way, but sadly it has to be done always if the others
+ * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
  * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
  * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
- * PGXACT again anyway; see GetNewTransactionId).
+ * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
 bool
 TransactionIdIsInProgress(TransactionId xid)
 {
 	static TransactionId *xids = NULL;
+	static TransactionId *other_xids;
 	int			nxids = 0;
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId topxid;
-	int			i,
-				j;
+	int			mypgxactoff;
+	size_t		numProcs;
+	int			j;
 
 	/*
 	 * Don't bother checking a transaction older than RecentXmin; it could not
@@ -1249,6 +1308,8 @@ TransactionIdIsInProgress(TransactionId xid)
 					 errmsg("out of memory")));
 	}
 
+	other_xids = ProcGlobal->xids;
+
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
@@ -1264,20 +1325,22 @@ TransactionIdIsInProgress(TransactionId xid)
 	}
 
 	/* No shortcuts, gotta grovel through the array */
-	for (i = 0; i < arrayP->numProcs; i++)
+	mypgxactoff = MyProc->pgxactoff;
+	numProcs = arrayP->numProcs;
+	for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 	{
-		int			pgprocno = arrayP->pgprocnos[i];
-		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		int			pgprocno;
+		PGXACT	   *pgxact;
+		PGPROC	   *proc;
 		TransactionId pxid;
 		int			pxids;
 
-		/* Ignore my own proc --- dealt with it above */
-		if (proc == MyProc)
+		/* Ignore ourselves --- dealt with it above */
+		if (pgxactoff == mypgxactoff)
 			continue;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = UINT32_ACCESS_ONCE(pgxact->xid);
+		pxid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -1302,8 +1365,12 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
+		pgprocno = arrayP->pgprocnos[pgxactoff];
+		pgxact = &allPgXact[pgprocno];
 		pxids = pgxact->nxids;
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
+		pgprocno = arrayP->pgprocnos[pgxactoff];
+		proc = &allProcs[pgprocno];
 		for (j = pxids - 1; j >= 0; j--)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
@@ -1334,7 +1401,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (RecoveryInProgress())
 	{
-		/* none of the PGXACT entries should have XIDs in hot standby mode */
+		/* none of the PGPROC entries should have XIDs in hot standby mode */
 		Assert(nxids == 0);
 
 		if (KnownAssignedXidExists(xid))
@@ -1389,7 +1456,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	Assert(TransactionIdIsValid(topxid));
 	if (!TransactionIdEquals(topxid, xid))
 	{
-		for (i = 0; i < nxids; i++)
+		for (int i = 0; i < nxids; i++)
 		{
 			if (TransactionIdEquals(xids[i], topxid))
 				return true;
@@ -1412,6 +1479,7 @@ TransactionIdIsActive(TransactionId xid)
 {
 	bool		result = false;
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	int			i;
 
 	/*
@@ -1427,11 +1495,10 @@ TransactionIdIsActive(TransactionId xid)
 	{
 		int			pgprocno = arrayP->pgprocnos[i];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId pxid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = UINT32_ACCESS_ONCE(pgxact->xid);
+		pxid = UINT32_ACCESS_ONCE(other_xids[i]);
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -1509,6 +1576,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId kaxmin;
 	bool		in_recovery = RecoveryInProgress();
+	TransactionId *other_xids = ProcGlobal->xids;
 
 	/* inferred after ProcArrayLock is released */
 	h->catalog_oldest_nonremovable = InvalidTransactionId;
@@ -1552,7 +1620,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		TransactionId xmin;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*
@@ -1840,14 +1908,17 @@ Snapshot
 GetSnapshotData(Snapshot snapshot)
 {
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	TransactionId xmin;
 	TransactionId xmax;
-	int			index;
-	int			count = 0;
+	size_t		count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
 	FullTransactionId latest_completed;
 	TransactionId oldestxid;
+	int			mypgxactoff;
+	TransactionId myxid;
+
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1892,6 +1963,10 @@ GetSnapshotData(Snapshot snapshot)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	latest_completed = ShmemVariableCache->latestCompletedFullXid;
+	mypgxactoff = MyProc->pgxactoff;
+	myxid = other_xids[mypgxactoff];
+	Assert(myxid == MyProc->xid);
+
 	oldestxid = ShmemVariableCache->oldestXid;
 
 	/* xmax is always latestCompletedXid + 1 */
@@ -1902,57 +1977,79 @@ GetSnapshotData(Snapshot snapshot)
 	/* initialize xmin calculation with xmax */
 	xmin = xmax;
 
+	/* take own xid into account, saves a check inside the loop */
+	if (TransactionIdIsNormal(myxid) && NormalTransactionIdPrecedes(myxid, xmin))
+		xmin = myxid;
+
 	snapshot->takenDuringRecovery = RecoveryInProgress();
 
 	if (!snapshot->takenDuringRecovery)
 	{
+		size_t		numProcs = arrayP->numProcs;
+		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
-		int			numProcs;
 
 		/*
-		 * Spin over procArray checking xid, xmin, and subxids.  The goal is
-		 * to gather all active xids, find the lowest xmin, and try to record
-		 * subxids.
+		 * First collect set of pgxactoff/xids that need to be included in the
+		 * snapshot.
 		 */
-		numProcs = arrayP->numProcs;
-		for (index = 0; index < numProcs; index++)
+		for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 		{
-			int			pgprocno = pgprocnos[index];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			TransactionId xid;
+			/* Fetch xid just once - see GetNewTransactionId */
+			TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
+			int			pgprocno;
+			PGXACT	   *pgxact;
+			uint8		vacuumFlags;
+
+			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+
+			/*
+			 * If the transaction has no XID assigned, we can skip it; it
+			 * won't have sub-XIDs either.
+			 */
+			if (likely(xid == InvalidTransactionId))
+				continue;
+
+			/*
+			 * We don't include our own XIDs (if any) in the snapshot. They
+			 * need to be included in the xmin computation, but we did so
+			 * outside the loop.
+			 */
+			if (pgxactoff == mypgxactoff)
+				continue;
+
+			/*
+			 * The only way we are able to get here with a non-normal xid
+			 * is during bootstrap - with this backend using
+			 * BootstrapTransactionId. But the above test should filter
+			 * that out.
+			 */
+			Assert(TransactionIdIsNormal(xid));
+
+			/*
+			 * If the XID is >= xmax, we can skip it; such transactions will
+			 * be treated as running anyway (and any sub-XIDs will also be >=
+			 * xmax).
+			 */
+			if (!NormalTransactionIdPrecedes(xid, xmax))
+				continue;
+
+			pgprocno = pgprocnos[pgxactoff];
+			pgxact = &allPgXact[pgprocno];
+			vacuumFlags = pgxact->vacuumFlags;
 
 			/*
 			 * Skip over backends doing logical decoding which manages xmin
 			 * separately (check below) and ones running LAZY VACUUM.
 			 */
-			if (pgxact->vacuumFlags &
-				(PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
+			if (vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
-			/* Fetch xid just once - see GetNewTransactionId */
-			xid = UINT32_ACCESS_ONCE(pgxact->xid);
-
-			/*
-			 * If the transaction has no XID assigned, we can skip it; it
-			 * won't have sub-XIDs either.  If the XID is >= xmax, we can also
-			 * skip it; such transactions will be treated as running anyway
-			 * (and any sub-XIDs will also be >= xmax).
-			 */
-			if (!TransactionIdIsNormal(xid)
-				|| !NormalTransactionIdPrecedes(xid, xmax))
-				continue;
-
-			/*
-			 * We don't include our own XIDs (if any) in the snapshot, but we
-			 * must include them in xmin.
-			 */
 			if (NormalTransactionIdPrecedes(xid, xmin))
 				xmin = xid;
-			if (pgxact == MyPgXact)
-				continue;
 
 			/* Add XID to snapshot. */
-			snapshot->xip[count++] = xid;
+			xip[count++] = xid;
 
 			/*
 			 * Save subtransaction XIDs if possible (if we've already
@@ -1975,9 +2072,9 @@ GetSnapshotData(Snapshot snapshot)
 					suboverflowed = true;
 				else
 				{
-					int			nxids = pgxact->nxids;
+					int			nsubxids = pgxact->nxids;
 
-					if (nxids > 0)
+					if (nsubxids > 0)
 					{
 						PGPROC	   *proc = &allProcs[pgprocno];
 
@@ -1985,8 +2082,8 @@ GetSnapshotData(Snapshot snapshot)
 
 						memcpy(snapshot->subxip + subcount,
 							   (void *) proc->subxids.xids,
-							   nxids * sizeof(TransactionId));
-						subcount += nxids;
+							   nsubxids * sizeof(TransactionId));
+						subcount += nsubxids;
 					}
 				}
 			}
@@ -2118,6 +2215,7 @@ GetSnapshotData(Snapshot snapshot)
 	}
 
 	RecentXmin = xmin;
+	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
 
 	snapshot->xmin = xmin;
 	snapshot->xmax = xmax;
@@ -2280,7 +2378,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * GetRunningTransactionData -- returns information about running transactions.
  *
  * Similar to GetSnapshotData but returns more information. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes and
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes and
  * prepared transactions.
  *
  * We acquire XidGenLock and ProcArrayLock, but the caller is responsible for
@@ -2295,7 +2393,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * This is never executed during recovery so there is no need to look at
  * KnownAssignedXids.
  *
- * Dummy PGXACTs from prepared transaction are included, meaning that this
+ * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
  * entries here to not hold on ProcArrayLock more than necessary.
@@ -2314,6 +2412,7 @@ GetRunningTransactionData(void)
 	static RunningTransactionsData CurrentRunningXactsData;
 
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	RunningTransactions CurrentRunningXacts = &CurrentRunningXactsData;
 	TransactionId latestCompletedXid;
 	TransactionId oldestRunningXid;
@@ -2373,7 +2472,7 @@ GetRunningTransactionData(void)
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 		/*
 		 * We don't need to store transactions that don't have a TransactionId
@@ -2470,7 +2569,7 @@ GetRunningTransactionData(void)
  * GetOldestActiveTransactionId()
  *
  * Similar to GetSnapshotData but returns just oldestActiveXid. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes.
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes.
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
@@ -2485,6 +2584,7 @@ TransactionId
 GetOldestActiveTransactionId(void)
 {
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	TransactionId oldestRunningXid;
 	int			index;
 
@@ -2507,12 +2607,10 @@ GetOldestActiveTransactionId(void)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		int			pgprocno = arrayP->pgprocnos[index];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 		if (!TransactionIdIsNormal(xid))
 			continue;
@@ -2590,8 +2688,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 * If we're not in recovery, we walk over the procarray and collect the
 	 * lowest xid. Since we're called with ProcArrayLock held and have
 	 * acquired XidGenLock, no entries can vanish concurrently, since
-	 * PGXACT->xid is only set with XidGenLock held and only cleared with
-	 * ProcArrayLock held.
+	 * ProcGlobal->xids[i] is only set with XidGenLock held and only cleared
+	 * with ProcArrayLock held.
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
@@ -2600,17 +2698,17 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 */
 	if (!recovery_in_progress)
 	{
+		TransactionId *other_xids = ProcGlobal->xids;
+
 		/*
-		 * Spin over procArray collecting all min(PGXACT->xid)
+		 * Spin over procArray collecting min(ProcGlobal->xids[i])
 		 */
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
-			int			pgprocno = arrayP->pgprocnos[index];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
 			TransactionId xid;
 
 			/* Fetch xid just once - see GetNewTransactionId */
-			xid = UINT32_ACCESS_ONCE(pgxact->xid);
+			xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 			if (!TransactionIdIsNormal(xid))
 				continue;
@@ -2798,6 +2896,7 @@ BackendXidGetPid(TransactionId xid)
 {
 	int			result = 0;
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	int			index;
 
 	if (xid == InvalidTransactionId)	/* never match invalid xid */
@@ -2809,9 +2908,8 @@ BackendXidGetPid(TransactionId xid)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
-		if (pgxact->xid == xid)
+		if (other_xids[index] == xid)
 		{
 			result = proc->pid;
 			break;
@@ -3091,7 +3189,6 @@ MinimumActiveBackends(int min)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
 		/*
 		 * Since we're not holding a lock, need to be prepared to deal with
@@ -3108,7 +3205,7 @@ MinimumActiveBackends(int min)
 			continue;			/* do not count deleted entries */
 		if (proc == MyProc)
 			continue;			/* do not count myself */
-		if (pgxact->xid == InvalidTransactionId)
+		if (proc->xid == InvalidTransactionId)
 			continue;			/* do not count if no XID assigned */
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
@@ -3534,8 +3631,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 *
 	 * Note that we do not have to be careful about memory ordering of our own
 	 * reads wrt. GetNewTransactionId() here - only this process can modify
-	 * relevant fields of MyProc/MyPgXact.  But we do have to be careful about
-	 * our own writes being well ordered.
+	 * relevant fields of MyProc/ProcGlobal->xids[].  But we do have to be
+	 * careful about our own writes being well ordered.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
@@ -3888,7 +3985,7 @@ FullXidViaRelative(FullTransactionId rel, TransactionId xid)
  * In Hot Standby mode, we maintain a list of transactions that are (or were)
  * running in the master at the current point in WAL.  These XIDs must be
  * treated as running by standby transactions, even though they are not in
- * the standby server's PGXACT array.
+ * the standby server's PGPROC array.
  *
  * We record all XIDs that we know have been assigned.  That includes all the
  * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
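
To make the bookkeeping in the procarray.c changes above easier to follow, here is a minimal standalone sketch of the dense-array idea. It is not the patch's code: `DenseProcArray`, `Proc`, `dense_add`, `dense_remove`, `dense_set_xid` and `dense_collect` are hypothetical names, locking is omitted entirely, entries are kept sorted by slot number for simplicity, and xids are compared with a plain `<` rather than the wraparound-aware comparison PostgreSQL uses. It only shows how a dense `xids[]` mirror and a per-proc offset (like `pgxactoff`) can be kept in sync on add/remove, and how a scan can then touch nothing but the dense array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROCS 8
#define INVALID_XID 0u

/* Hypothetical stand-ins; these are not the PostgreSQL definitions. */
typedef uint32_t Xid;

typedef struct Proc
{
    Xid xid;   /* per-backend copy of the xid */
    int off;   /* index into the dense arrays (plays the role of pgxactoff) */
} Proc;

typedef struct DenseProcArray
{
    int  nprocs;
    int  procnos[MAX_PROCS];  /* dense index -> slot in procs[] */
    Xid  xids[MAX_PROCS];     /* dense mirror of procs[procno].xid */
    Proc procs[MAX_PROCS];    /* sparse per-backend structs */
} DenseProcArray;

/* Insert slot 'procno', keeping procnos[] sorted; returns its dense index. */
static int
dense_add(DenseProcArray *a, int procno)
{
    int idx = 0;

    assert(a->nprocs < MAX_PROCS);
    while (idx < a->nprocs && a->procnos[idx] < procno)
        idx++;

    /* Shift both arrays in lockstep so they stay index-compatible. */
    memmove(&a->procnos[idx + 1], &a->procnos[idx],
            (a->nprocs - idx) * sizeof(*a->procnos));
    memmove(&a->xids[idx + 1], &a->xids[idx],
            (a->nprocs - idx) * sizeof(*a->xids));

    a->procnos[idx] = procno;
    a->xids[idx] = a->procs[procno].xid;
    a->nprocs++;

    /* Every entry at or after idx moved, so refresh its offset. */
    for (int i = idx; i < a->nprocs; i++)
        a->procs[a->procnos[i]].off = i;
    return idx;
}

/* Remove slot 'procno' and close the gap in both dense arrays. */
static void
dense_remove(DenseProcArray *a, int procno)
{
    int idx = a->procs[procno].off;

    assert(idx >= 0 && idx < a->nprocs && a->procnos[idx] == procno);
    memmove(&a->procnos[idx], &a->procnos[idx + 1],
            (a->nprocs - idx - 1) * sizeof(*a->procnos));
    memmove(&a->xids[idx], &a->xids[idx + 1],
            (a->nprocs - idx - 1) * sizeof(*a->xids));
    a->nprocs--;

    for (int i = idx; i < a->nprocs; i++)
        a->procs[a->procnos[i]].off = i;
}

/* Write the xid to both the per-proc copy and the dense mirror. */
static void
dense_set_xid(DenseProcArray *a, int procno, Xid xid)
{
    a->procs[procno].xid = xid;
    a->xids[a->procs[procno].off] = xid;
}

/* Collect other entries' valid xids below xmax, reading only xids[]. */
static size_t
dense_collect(const DenseProcArray *a, int my_off, Xid xmax, Xid *out)
{
    size_t count = 0;

    for (int off = 0; off < a->nprocs; off++)
    {
        Xid xid = a->xids[off];

        if (xid == INVALID_XID || off == my_off || xid >= xmax)
            continue;
        out[count++] = xid;
    }
    return count;
}
```

The point of the layout is that the common case in the scan (no xid assigned) is decided from a contiguous array, without dereferencing the per-backend struct at all.
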
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index ad048bc85fa..a9477ccb4a3 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -417,9 +417,7 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 
 		if (proc != NULL)
 		{
-			PGXACT	   *xact = &ProcGlobal->allPgXact[proc->pgprocno];
-
-			*xid = xact->xid;
+			*xid = proc->xid;
 			*xmin = proc->xmin;
 		}
 	}
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index efb44a25c42..c3c0149a754 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3974,9 +3974,8 @@ GetRunningTransactionLocks(int *nlocks)
 			proclock->tag.myLock->tag.locktag_type == LOCKTAG_RELATION)
 		{
 			PGPROC	   *proc = proclock->tag.myProc;
-			PGXACT	   *pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
 			LOCK	   *lock = proclock->tag.myLock;
-			TransactionId xid = pgxact->xid;
+			TransactionId xid = proc->xid;
 
 			/*
 			 * Don't record locks for transactions if we know they have
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 66d25dba7f8..8cd25c83e2b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -102,21 +102,18 @@ Size
 ProcGlobalShmemSize(void)
 {
 	Size		size = 0;
+	Size		TotalProcs =
+		add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
 
 	/* ProcGlobal */
 	size = add_size(size, sizeof(PROC_HDR));
-	/* MyProcs, including autovacuum workers and launcher */
-	size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
-	/* AuxiliaryProcs */
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
-	/* Prepared xacts */
-	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
-	/* ProcStructLock */
+	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, sizeof(slock_t));
 
 	size = add_size(size, mul_size(MaxBackends, sizeof(PGXACT)));
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
 
 	return size;
 }
@@ -216,6 +213,17 @@ InitProcGlobal(void)
 	MemSet(pgxacts, 0, TotalProcs * sizeof(PGXACT));
 	ProcGlobal->allPgXact = pgxacts;
 
+	/*
+	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
+	 * PROC_HDR.
+	 *
+	 * XXX: It might make sense to increase padding for these arrays, given
+	 * how hotly they are accessed.
+	 */
+	ProcGlobal->xids =
+		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+
 	for (i = 0; i < TotalProcs; i++)
 	{
 		/* Common initialization for all PGPROCs, regardless of type. */
@@ -387,7 +395,7 @@ InitProcess(void)
 	MyProc->lxid = InvalidLocalTransactionId;
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
-	MyPgXact->xid = InvalidTransactionId;
+	MyProc->xid = InvalidTransactionId;
 	MyProc->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
@@ -571,7 +579,7 @@ InitAuxiliaryProcess(void)
 	MyProc->lxid = InvalidLocalTransactionId;
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
-	MyPgXact->xid = InvalidTransactionId;
+	MyProc->xid = InvalidTransactionId;
 	MyProc->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
-- 
2.25.0.114.g5b0ca878e0

v9-0004-snapshot-scalability-Move-PGXACT-vacuumFlags-to-P.patch (text/x-diff; charset=us-ascii)
From 11fc0006702be92009e6b8d3f93c265178c5a55f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 8 Apr 2020 02:32:15 -0700
Subject: [PATCH v9 4/6] snapshot scalability: Move PGXACT->vacuumFlags to
 ProcGlobal->vacuumFlags.

Similar to the previous commit, this increases the chance that data
frequently needed by GetSnapshotData() stays in L2 cache. As we now
take care to not unnecessarily write to ProcGlobal->vacuumFlags, there
should be very few modifications to the ProcGlobal->vacuumFlags array.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h                | 12 ++++-
 src/backend/access/transam/twophase.c     |  2 +-
 src/backend/commands/analyze.c            | 10 ++--
 src/backend/commands/vacuum.c             |  5 +-
 src/backend/postmaster/autovacuum.c       |  6 +--
 src/backend/replication/logical/logical.c |  3 +-
 src/backend/replication/slot.c            |  3 +-
 src/backend/storage/ipc/procarray.c       | 66 ++++++++++++++---------
 src/backend/storage/lmgr/deadlock.c       |  4 +-
 src/backend/storage/lmgr/proc.c           | 16 +++---
 10 files changed, 79 insertions(+), 48 deletions(-)
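
The write-avoidance idea from the commit message can be sketched in isolation.
This is a minimal, lock-free illustration, not the patch's code: DemoProc,
demo_dense_flags, demo_dense_writes and the DEMO_* bits are hypothetical
stand-ins for PGPROC, ProcGlobal->vacuumFlags[] and the PROC_* flags, and the
real code performs these updates while holding ProcArrayLock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PROC_* bits and PROC_VACUUM_STATE_MASK. */
#define DEMO_IS_AUTOVACUUM	0x01
#define DEMO_IN_VACUUM		0x02
#define DEMO_STATE_MASK		(DEMO_IN_VACUUM)

#define DEMO_MAX_PROCS		8

typedef struct DemoProc
{
	int			pgxactoff;		/* this proc's slot in the dense array */
	uint8_t		vacuumFlags;	/* private copy, cheap for the owner to test */
} DemoProc;

/* Dense mirror that snapshot-style readers scan sequentially. */
static uint8_t demo_dense_flags[DEMO_MAX_PROCS];

/* Counts stores into the dense array so skipped writes are observable. */
static int	demo_dense_writes = 0;

/* Set bits in the private copy and publish the result to the mirror. */
static void
demo_set_flags(DemoProc *proc, uint8_t bits)
{
	proc->vacuumFlags |= bits;
	demo_dense_flags[proc->pgxactoff] = proc->vacuumFlags;
	demo_dense_writes++;
}

/*
 * End-of-transaction cleanup: touch the shared array only when a
 * transaction-scoped bit is actually set, so the common case leaves the
 * shared cachelines clean.
 */
static void
demo_end_transaction(DemoProc *proc)
{
	if (proc->vacuumFlags & DEMO_STATE_MASK)
	{
		assert(proc->vacuumFlags == demo_dense_flags[proc->pgxactoff]);
		proc->vacuumFlags &= (uint8_t) ~DEMO_STATE_MASK;
		demo_dense_flags[proc->pgxactoff] = proc->vacuumFlags;
		demo_dense_writes++;
	}
}
```

The private copy is what makes the test cheap: the owning backend decides
whether a store is needed without reading the shared array at all.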

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index f240bc7521e..2bfb05840c5 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -41,7 +41,7 @@ struct XidCache
 };
 
 /*
- * Flags for PGXACT->vacuumFlags
+ * Flags for ProcGlobal->vacuumFlags[]
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -161,6 +161,9 @@ struct PGPROC
 
 	bool		delayChkpt;		/* true if this proc delays checkpoint start */
 
+	uint8		vacuumFlags;    /* this backend's vacuum flags, see PROC_*
+							     * above. mirrored in
+							     * ProcGlobal->vacuumFlags[pgxactoff] */
 	/*
 	 * Info to allow us to wait for synchronous replication, if needed.
 	 * waitLSN is InvalidXLogRecPtr if not waiting; set only by user backend.
@@ -240,7 +243,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
 	uint8		nxids;
@@ -297,6 +299,12 @@ typedef struct PROC_HDR
 	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
 	TransactionId *xids;
 
+	/*
+	 * Array mirroring PGPROC.vacuumFlags for each PGPROC currently in the
+	 * procarray.
+	 */
+	uint8	   *vacuumFlags;
+
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 353f13ef489..3e71ab24bb4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -466,7 +466,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
-	pgxact->vacuumFlags = 0;
+	proc->vacuumFlags = 0;
 	proc->pid = 0;
 	proc->backendId = InvalidBackendId;
 	proc->databaseId = databaseid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 34b71b6c1c5..2c1b956b76b 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -250,7 +250,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 	 * OK, let's do it.  First let other backends know I'm in ANALYZE.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags |= PROC_IN_ANALYZE;
+	MyProc->vacuumFlags |= PROC_IN_ANALYZE;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 	LWLockRelease(ProcArrayLock);
 	pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE,
 								  RelationGetRelid(onerel));
@@ -281,11 +282,12 @@ analyze_rel(Oid relid, RangeVar *relation,
 	pgstat_progress_end_command();
 
 	/*
-	 * Reset my PGXACT flag.  Note: we need this here, and not in vacuum_rel,
-	 * because the vacuum flag is cleared by the end-of-xact code.
+	 * Reset vacuumFlags we set earlier.  Note: we need this here, and not in
+	 * vacuum_rel, because the vacuum flag is cleared by the end-of-xact code.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags &= ~PROC_IN_ANALYZE;
+	MyProc->vacuumFlags &= ~PROC_IN_ANALYZE;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 83548cfa5ec..8f2b975a40e 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1730,9 +1730,10 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
+		MyProc->vacuumFlags |= PROC_IN_VACUUM;
 		if (params->is_wraparound)
-			MyPgXact->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+			MyProc->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 		LWLockRelease(ProcArrayLock);
 	}
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index df1af9354ce..465f8893cd5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2494,7 +2494,7 @@ do_autovacuum(void)
 						   tab->at_datname, tab->at_nspname, tab->at_relname);
 			EmitErrorReport();
 
-			/* this resets the PGXACT flags too */
+			/* this resets ProcGlobal->vacuumFlags[i] too */
 			AbortOutOfAnyTransaction();
 			FlushErrorState();
 			MemoryContextResetAndDeleteChildren(PortalContext);
@@ -2510,7 +2510,7 @@ do_autovacuum(void)
 
 		did_vacuum = true;
 
-		/* the PGXACT flags are reset at the next end of transaction */
+		/* ProcGlobal->vacuumFlags[i] are reset at the next end of xact */
 
 		/* be tidy */
 deleted:
@@ -2687,7 +2687,7 @@ perform_work_item(AutoVacuumWorkItem *workitem)
 				   cur_datname, cur_nspname, cur_relname);
 		EmitErrorReport();
 
-		/* this resets the PGXACT flags too */
+		/* this resets ProcGlobal->vacuumFlags[i] too */
 		AbortOutOfAnyTransaction();
 		FlushErrorState();
 		MemoryContextResetAndDeleteChildren(PortalContext);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253583b..756cd2b8470 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -163,7 +163,8 @@ StartupDecodingContext(List *output_plugin_options,
 	if (!IsTransactionOrTransactionBlock())
 	{
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyPgXact->vacuumFlags |= PROC_IN_LOGICAL_DECODING;
+		MyProc->vacuumFlags |= PROC_IN_LOGICAL_DECODING;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 		LWLockRelease(ProcArrayLock);
 	}
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index abae74c9a59..769f6bf3cde 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -479,7 +479,8 @@ ReplicationSlotRelease(void)
 
 	/* might not have been set when we've been a plain slot */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags &= ~PROC_IN_LOGICAL_DECODING;
+	MyProc->vacuumFlags &= ~PROC_IN_LOGICAL_DECODING;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 5f0d3ee962e..b8a60e7ef43 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -478,9 +478,12 @@ ProcArrayAdd(PGPROC *proc)
 			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
 	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+	memmove(&ProcGlobal->vacuumFlags[index + 1], &ProcGlobal->vacuumFlags[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->vacuumFlags));
 
 	arrayP->pgprocnos[index] = proc->pgprocno;
 	ProcGlobal->xids[index] = proc->xid;
+	ProcGlobal->vacuumFlags[index] = proc->vacuumFlags;
 
 	arrayP->numProcs++;
 
@@ -541,6 +544,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	}
 
 	Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff] == 0));
+	ProcGlobal->vacuumFlags[proc->pgxactoff] = 0;
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
@@ -551,6 +555,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
 			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+			memmove(&ProcGlobal->vacuumFlags[index], &ProcGlobal->vacuumFlags[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->vacuumFlags));
 
 			arrayP->pgprocnos[arrayP->numProcs - 1] = -1;	/* for debugging */
 			arrayP->numProcs--;
@@ -629,14 +635,24 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		Assert(!TransactionIdIsValid(proc->xid));
 
 		proc->lxid = InvalidLocalTransactionId;
-		/* must be cleared with xid/xmin: */
-		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
 		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false; /* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
 		Assert(pgxact->nxids == 0);
 		Assert(pgxact->overflowed == false);
+
+		/* must be cleared with xid/xmin: */
+		/* avoid unnecessarily dirtying shared cachelines */
+		if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
+		{
+			Assert(!LWLockHeldByMe(ProcArrayLock));
+			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+			Assert(proc->vacuumFlags == ProcGlobal->vacuumFlags[proc->pgxactoff]);
+			proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+			ProcGlobal->vacuumFlags[proc->pgxactoff] = proc->vacuumFlags;
+			LWLockRelease(ProcArrayLock);
+		}
 	}
 }
 
@@ -657,12 +673,18 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
 	proc->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	/* must be cleared with xid/xmin: */
-	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
 	proc->xmin = InvalidTransactionId;
 	proc->delayChkpt = false; /* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
+	/* must be cleared with xid/xmin: */
+	/* avoid unnecessarily dirtying shared cachelines */
+	if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
+	{
+		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+		ProcGlobal->vacuumFlags[proc->pgxactoff] = proc->vacuumFlags;
+	}
+
 	/* Clear the subtransaction-XID cache too while holding the lock */
 	pgxact->nxids = 0;
 	pgxact->overflowed = false;
@@ -822,9 +844,8 @@ ProcArrayClearTransaction(PGPROC *proc)
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
-	/* redundant, but just in case */
-	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-	proc->delayChkpt = false;
+	Assert(!(proc->vacuumFlags & PROC_VACUUM_STATE_MASK));
+	Assert(!proc->delayChkpt);
 
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
@@ -1615,7 +1636,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 		TransactionId xid;
 		TransactionId xmin;
 
@@ -1632,10 +1653,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 */
 		xmin = TransactionIdOlder(xmin, xid);
 
-		/* if neither is set, this proc doesn't influence the horizon */
-		if (!TransactionIdIsValid(xmin))
-			continue;
-
 		/*
 		 * Don't ignore any procs when determining which transactions might be
 		 * considered running.  While slots should ensure logical decoding
@@ -1650,7 +1667,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * removed, as long as pg_subtrans is not truncated) or doing logical
 		 * decoding (which manages xmin separately, check below).
 		 */
-		if (pgxact->vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
+		if (vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
 			continue;
 
 		/* shared tables need to take backends in all database into account */
@@ -1988,6 +2005,7 @@ GetSnapshotData(Snapshot snapshot)
 		size_t		numProcs = arrayP->numProcs;
 		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
+		uint8	   *allVacuumFlags = ProcGlobal->vacuumFlags;
 
 		/*
 		 * First collect set of pgxactoff/xids that need to be included in the
@@ -1997,8 +2015,6 @@ GetSnapshotData(Snapshot snapshot)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
 			TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
-			int			pgprocno;
-			PGXACT	   *pgxact;
 			uint8		vacuumFlags;
 
 			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
@@ -2034,14 +2050,11 @@ GetSnapshotData(Snapshot snapshot)
 			if (!NormalTransactionIdPrecedes(xid, xmax))
 				continue;
 
-			pgprocno = pgprocnos[pgxactoff];
-			pgxact = &allPgXact[pgprocno];
-			vacuumFlags = pgxact->vacuumFlags;
-
 			/*
 			 * Skip over backends doing logical decoding which manages xmin
 			 * separately (check below) and ones running LAZY VACUUM.
 			 */
+			vacuumFlags = allVacuumFlags[pgxactoff];
 			if (vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
@@ -2068,6 +2081,9 @@ GetSnapshotData(Snapshot snapshot)
 			 */
 			if (!suboverflowed)
 			{
+				int			pgprocno = pgprocnos[pgxactoff];
+				PGXACT	   *pgxact = &allPgXact[pgprocno];
+
 				if (pgxact->overflowed)
 					suboverflowed = true;
 				else
@@ -2286,11 +2302,11 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 		TransactionId xid;
 
 		/* Ignore procs running LAZY VACUUM */
-		if (pgxact->vacuumFlags & PROC_IN_VACUUM)
+		if (vacuumFlags & PROC_IN_VACUUM)
 			continue;
 
 		/* We are only interested in the specific virtual transaction. */
@@ -2979,12 +2995,12 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 
 		if (proc == MyProc)
 			continue;
 
-		if (excludeVacuum & pgxact->vacuumFlags)
+		if (excludeVacuum & vacuumFlags)
 			continue;
 
 		if (allDbs || proc->databaseId == MyDatabaseId)
@@ -3399,7 +3415,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
 			PGPROC	   *proc = &allProcs[pgprocno];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
+			uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 
 			if (proc->databaseId != databaseId)
 				continue;
@@ -3413,7 +3429,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 			else
 			{
 				(*nbackends)++;
-				if ((pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				if ((vacuumFlags & PROC_IS_AUTOVACUUM) &&
 					nautovacs < MAXAUTOVACPIDS)
 					autovac_pids[nautovacs++] = proc->pid;
 			}
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index beedc7947db..e1246b8a4da 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -544,7 +544,6 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 {
 	PGPROC	   *proc;
 	LOCK	   *lock = checkProc->waitLock;
-	PGXACT	   *pgxact;
 	PROCLOCK   *proclock;
 	SHM_QUEUE  *procLocks;
 	LockMethod	lockMethodTable;
@@ -582,7 +581,6 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 		PGPROC	   *leader;
 
 		proc = proclock->tag.myProc;
-		pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
 		leader = proc->lockGroupLeader == NULL ? proc : proc->lockGroupLeader;
 
 		/* A proc never blocks itself or any other lock group member */
@@ -630,7 +628,7 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 					 * ProcArrayLock.
 					 */
 					if (checkProc == MyProc &&
-						pgxact->vacuumFlags & PROC_IS_AUTOVACUUM)
+						proc->vacuumFlags & PROC_IS_AUTOVACUUM)
 						blocking_autovacuum_proc = proc;
 
 					/* We're done looking at this proclock */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 8cd25c83e2b..a557f63e2b3 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -114,6 +114,7 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
 
 	return size;
 }
@@ -223,6 +224,8 @@ InitProcGlobal(void)
 	ProcGlobal->xids =
 		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->vacuumFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
+	MemSet(ProcGlobal->vacuumFlags, 0, TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 
 	for (i = 0; i < TotalProcs; i++)
 	{
@@ -405,10 +408,10 @@ InitProcess(void)
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
 	MyProc->delayChkpt = false;
-	MyPgXact->vacuumFlags = 0;
+	MyProc->vacuumFlags = 0;
 	/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
 	if (IsAutoVacuumWorkerProcess())
-		MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
+		MyProc->vacuumFlags |= PROC_IS_AUTOVACUUM;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
 	MyProc->waitLock = NULL;
@@ -587,7 +590,7 @@ InitAuxiliaryProcess(void)
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
 	MyProc->delayChkpt = false;
-	MyPgXact->vacuumFlags = 0;
+	MyProc->vacuumFlags = 0;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
 	MyProc->waitLock = NULL;
@@ -1323,7 +1326,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		if (deadlock_state == DS_BLOCKED_BY_AUTOVACUUM && allow_autovacuum_cancel)
 		{
 			PGPROC	   *autovac = GetBlockingAutoVacuumPgproc();
-			PGXACT	   *autovac_pgxact = &ProcGlobal->allPgXact[autovac->pgprocno];
+			uint8		vacuumFlags;
 
 			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
@@ -1331,8 +1334,9 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 * Only do it if the worker is not working to protect against Xid
 			 * wraparound.
 			 */
-			if ((autovac_pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
-				!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
+			vacuumFlags = ProcGlobal->vacuumFlags[autovac->pgxactoff];
+			if ((vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				!(vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
 			{
 				int			pid = autovac->pid;
 				StringInfoData locktagbuf;
-- 
2.25.0.114.g5b0ca878e0

v9-0005-snapshot-scalability-Move-subxact-info-to-ProcGlo.patch (text/x-diff; charset=us-ascii)
From 0fd82af2eedf9499d50902bbd0ec2f467b0d747a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 8 Apr 2020 02:16:43 -0700
Subject: [PATCH v9 5/6] snapshot scalability: Move subxact info to ProcGlobal,
 remove PGXACT.

Similar to the previous changes, this increases the chance that data
frequently needed by GetSnapshotData() stays in L2 cache. In many
workloads subtransactions are very rare, and this makes the check for
that cheaper.

As this removes the last member of PGXACT, there is no need to keep it
around anymore.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h            |  34 ++++---
 src/backend/access/transam/clog.c     |   7 +-
 src/backend/access/transam/twophase.c |  17 ++--
 src/backend/access/transam/varsup.c   |  15 ++-
 src/backend/storage/ipc/procarray.c   | 128 ++++++++++++++------------
 src/backend/storage/lmgr/proc.c       |  24 +----
 src/tools/pgindent/typedefs.list      |   1 -
 7 files changed, 113 insertions(+), 113 deletions(-)
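
The dense arrays only help if every one of them is shifted in lockstep when an
entry is added or removed. A minimal sketch of that bookkeeping follows,
assuming hypothetical DemoArrays / DemoSubxidStatus types in place of
ProcGlobal->xids[] and ProcGlobal->subxidStates[]. It omits locking and the
fix-up of each proc's stored array offset that the real code has to perform
after shifting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CAPACITY 8

/* Hypothetical analogue of XidCacheStatus. */
typedef struct DemoSubxidStatus
{
	uint8_t		count;			/* number of cached subxids */
	bool		overflowed;		/* has the per-proc cache overflowed? */
} DemoSubxidStatus;

/* Parallel dense arrays; entry i of each describes the same proc. */
typedef struct DemoArrays
{
	int			numProcs;
	uint32_t	xids[DEMO_CAPACITY];
	DemoSubxidStatus subxidStates[DEMO_CAPACITY];
} DemoArrays;

/* Open a gap at 'index' in every array, then fill it. */
static void
demo_insert(DemoArrays *a, int index, uint32_t xid, DemoSubxidStatus st)
{
	size_t		ntail;

	assert(a->numProcs < DEMO_CAPACITY);
	assert(index >= 0 && index <= a->numProcs);

	ntail = (size_t) (a->numProcs - index);
	memmove(&a->xids[index + 1], &a->xids[index],
			ntail * sizeof(a->xids[0]));
	memmove(&a->subxidStates[index + 1], &a->subxidStates[index],
			ntail * sizeof(a->subxidStates[0]));

	a->xids[index] = xid;
	a->subxidStates[index] = st;
	a->numProcs++;
}

/* Close the gap at 'index' in every array. */
static void
demo_remove(DemoArrays *a, int index)
{
	size_t		ntail;

	assert(index >= 0 && index < a->numProcs);

	ntail = (size_t) (a->numProcs - index - 1);
	memmove(&a->xids[index], &a->xids[index + 1],
			ntail * sizeof(a->xids[0]));
	memmove(&a->subxidStates[index], &a->subxidStates[index + 1],
			ntail * sizeof(a->subxidStates[0]));
	a->numProcs--;
}

/* Reader side: one pass over contiguous memory, no per-proc pointer chase. */
static bool
demo_any_overflowed(const DemoArrays *a)
{
	for (int i = 0; i < a->numProcs; i++)
	{
		if (a->subxidStates[i].overflowed)
			return true;
	}
	return false;
}
```

Forgetting one array in either function would silently pair one proc's xid
with another proc's subxid status, which is why the patch adds a memmove per
array in both ProcArrayAdd() and ProcArrayRemove().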

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2bfb05840c5..8b6361517bb 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -35,6 +35,14 @@
  */
 #define PGPROC_MAX_CACHED_SUBXIDS 64	/* XXX guessed-at value */
 
+typedef struct XidCacheStatus
+{
+	/* number of cached subxids, never more than PGPROC_MAX_CACHED_SUBXIDS */
+	uint8	count;
+	/* has PGPROC->subxids overflowed */
+	bool	overflowed;
+} XidCacheStatus;
+
 struct XidCache
 {
 	TransactionId xids[PGPROC_MAX_CACHED_SUBXIDS];
@@ -181,6 +189,8 @@ struct PGPROC
 	 */
 	SHM_QUEUE	myProcLocks[NUM_LOCK_PARTITIONS];
 
+	XidCacheStatus subxidStatus; /* mirrored with
+								  * ProcGlobal->subxidStates[i] */
 	struct XidCache subxids;	/* cache for subtransaction XIDs */
 
 	/* Support for group XID clearing. */
@@ -231,22 +241,6 @@ struct PGPROC
 
 
 extern PGDLLIMPORT PGPROC *MyProc;
-extern PGDLLIMPORT struct PGXACT *MyPgXact;
-
-/*
- * Prior to PostgreSQL 9.2, the fields below were stored as part of the
- * PGPROC.  However, benchmarking revealed that packing these particular
- * members into a separate array as tightly as possible sped up GetSnapshotData
- * considerably on systems with many CPU cores, by reducing the number of
- * cache lines needing to be fetched.  Thus, think very carefully before adding
- * anything else here.
- */
-typedef struct PGXACT
-{
-	bool		overflowed;
-
-	uint8		nxids;
-} PGXACT;
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
@@ -293,12 +287,16 @@ typedef struct PROC_HDR
 {
 	/* Array of PGPROC structures (not including dummies for prepared txns) */
 	PGPROC	   *allProcs;
-	/* Array of PGXACT structures (not including dummies for prepared txns) */
-	PGXACT	   *allPgXact;
 
 	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
 	TransactionId *xids;
 
+	/*
+	 * Array mirroring PGPROC.subxidStatus for each PGPROC currently in the
+	 * procarray.
+	 */
+	XidCacheStatus *subxidStates;
+
 	/*
 	 * Array mirroring PGPROC.vacuumFlags for each PGPROC currently in the
 	 * procarray.
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index c920f565a39..92c451a0673 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -295,7 +295,7 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 	 */
 	if (all_xact_same_page && xid == MyProc->xid &&
 		nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT &&
-		nsubxids == MyPgXact->nxids &&
+		nsubxids == MyProc->subxidStatus.count &&
 		memcmp(subxids, MyProc->subxids.xids,
 			   nsubxids * sizeof(TransactionId)) == 0)
 	{
@@ -510,16 +510,15 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 	while (nextidx != INVALID_PGPROCNO)
 	{
 		PGPROC	   *proc = &ProcGlobal->allProcs[nextidx];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[nextidx];
 
 		/*
 		 * Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
 		 * should not use group XID status update mechanism.
 		 */
-		Assert(pgxact->nxids <= THRESHOLD_SUBTRANS_CLOG_OPT);
+		Assert(proc->subxidStatus.count <= THRESHOLD_SUBTRANS_CLOG_OPT);
 
 		TransactionIdSetPageStatusInternal(proc->clogGroupMemberXid,
-										   pgxact->nxids,
+										   proc->subxidStatus.count,
 										   proc->subxids.xids,
 										   proc->clogGroupMemberXidStatus,
 										   proc->clogGroupMemberLsn,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 3e71ab24bb4..dc57050f942 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -21,9 +21,9 @@
  *		GIDs and aborts the transaction if there already is a global
  *		transaction in prepared state with the same GID.
  *
- *		A global transaction (gxact) also has dummy PGXACT and PGPROC; this is
- *		what keeps the XID considered running by TransactionIdIsInProgress.
- *		It is also convenient as a PGPROC to hook the gxact's locks to.
+ *		A global transaction (gxact) also has a dummy PGPROC; this is what
+ *		keeps the XID considered running by TransactionIdIsInProgress.  It is
+ *		also convenient as a PGPROC to hook the gxact's locks to.
  *
  *		Information to recover prepared transactions in case of crash is
  *		now stored in WAL for the common case. In some cases there will be
@@ -447,14 +447,12 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 					TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	PGPROC	   *proc;
-	PGXACT	   *pgxact;
 	int			i;
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
 	Assert(gxact != NULL);
 	proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
 	/* Initialize the PGPROC entry */
 	MemSet(proc, 0, sizeof(PGPROC));
@@ -480,8 +478,8 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
 		SHMQueueInit(&(proc->myProcLocks[i]));
 	/* subxid data must be filled later by GXactLoadSubxactData */
-	pgxact->overflowed = false;
-	pgxact->nxids = 0;
+	proc->subxidStatus.count = 0;
+	proc->subxidStatus.overflowed = false;
 
 	gxact->prepared_at = prepared_at;
 	gxact->xid = xid;
@@ -510,19 +508,18 @@ GXactLoadSubxactData(GlobalTransaction gxact, int nsubxacts,
 					 TransactionId *children)
 {
 	PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
 	/* We need no extra lock since the GXACT isn't valid yet */
 	if (nsubxacts > PGPROC_MAX_CACHED_SUBXIDS)
 	{
-		pgxact->overflowed = true;
+		proc->subxidStatus.overflowed = true;
 		nsubxacts = PGPROC_MAX_CACHED_SUBXIDS;
 	}
 	if (nsubxacts > 0)
 	{
 		memcpy(proc->subxids.xids, children,
 			   nsubxacts * sizeof(TransactionId));
-		pgxact->nxids = nsubxacts;
+		proc->subxidStatus.count = nsubxacts;
 	}
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 8869b8a6866..b87b8c0c8c6 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -222,22 +222,31 @@ GetNewTransactionId(bool isSubXact)
 	 */
 	if (!isSubXact)
 	{
+		Assert(ProcGlobal->subxidStates[MyProc->pgxactoff].count == 0);
+		Assert(!ProcGlobal->subxidStates[MyProc->pgxactoff].overflowed);
+		Assert(MyProc->subxidStatus.count == 0);
+		Assert(!MyProc->subxidStatus.overflowed);
+
 		/* LWLockRelease acts as barrier */
 		MyProc->xid = xid;
 		ProcGlobal->xids[MyProc->pgxactoff] = xid;
 	}
 	else
 	{
-		int			nxids = MyPgXact->nxids;
+		XidCacheStatus *substat = &ProcGlobal->subxidStates[MyProc->pgxactoff];
+		int			nxids = MyProc->subxidStatus.count;
+
+		Assert(substat->count == MyProc->subxidStatus.count);
+		Assert(substat->overflowed == MyProc->subxidStatus.overflowed);
 
 		if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
 		{
 			MyProc->subxids.xids[nxids] = xid;
 			pg_write_barrier();
-			MyPgXact->nxids = nxids + 1;
+			MyProc->subxidStatus.count = substat->count = nxids + 1;
 		}
 		else
-			MyPgXact->overflowed = true;
+			MyProc->subxidStatus.overflowed = substat->overflowed = true;
 	}
 
 	LWLockRelease(XidGenLock);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index b8a60e7ef43..3a28fed05fd 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4,9 +4,10 @@
  *	  POSTGRES process array code.
  *
  *
- * This module maintains arrays of the PGPROC and PGXACT structures for all
- * active backends.  Although there are several uses for this, the principal
- * one is as a means of determining the set of currently running transactions.
+ * This module maintains arrays of PGPROC substructures, as well as associated
+ * arrays in ProcGlobal, for all active backends.  Although there are several
+ * uses for this, the principal one is as a means of determining the set of
+ * currently running transactions.
  *
  * Because of various subtle race conditions it is critical that a backend
  * hold the correct locks while setting or clearing its xid (in
@@ -85,7 +86,7 @@ typedef struct ProcArrayStruct
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
 	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGXACT
+	 * similar reasons to tracking overflowing cached subxids in PGPROC
 	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
 	 * lock to read it.
 	 */
@@ -96,7 +97,7 @@ typedef struct ProcArrayStruct
 	/* oldest catalog xmin of any replication slot */
 	TransactionId replication_slot_catalog_xmin;
 
-	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
+	/* indexes into allProcs[], has PROCARRAY_MAXPROCS entries */
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
@@ -240,7 +241,6 @@ typedef struct ComputeXidHorizonsResult
 static ProcArrayStruct *procArray;
 
 static PGPROC *allProcs;
-static PGXACT *allPgXact;
 
 /*
  * Bookkeeping for tracking emulated transactions in recovery
@@ -326,8 +326,7 @@ static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
 static TransactionId KnownAssignedXidsGetOldestXmin(void);
 static void KnownAssignedXidsDisplay(int trace_level);
 static void KnownAssignedXidsReset(void);
-static inline void ProcArrayEndTransactionInternal(PGPROC *proc,
-												   PGXACT *pgxact, TransactionId latestXid);
+static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
 static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
@@ -411,7 +410,6 @@ CreateSharedProcArray(void)
 	}
 
 	allProcs = ProcGlobal->allProcs;
-	allPgXact = ProcGlobal->allPgXact;
 
 	/* Create or attach to the KnownAssignedXids arrays too, if needed */
 	if (EnableHotStandby)
@@ -478,11 +476,14 @@ ProcArrayAdd(PGPROC *proc)
 			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
 	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+	memmove(&ProcGlobal->subxidStates[index + 1], &ProcGlobal->subxidStates[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->subxidStates));
 	memmove(&ProcGlobal->vacuumFlags[index + 1], &ProcGlobal->vacuumFlags[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->vacuumFlags));
 
 	arrayP->pgprocnos[index] = proc->pgprocno;
 	ProcGlobal->xids[index] = proc->xid;
+	ProcGlobal->subxidStates[index] = proc->subxidStatus;
 	ProcGlobal->vacuumFlags[index] = proc->vacuumFlags;
 
 	arrayP->numProcs++;
@@ -536,6 +537,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		MaintainLatestCompletedXid(latestXid);
 
 		ProcGlobal->xids[proc->pgxactoff] = 0;
+		ProcGlobal->subxidStates[proc->pgxactoff].overflowed = false;
+		ProcGlobal->subxidStates[proc->pgxactoff].count = 0;
 	}
 	else
 	{
@@ -544,6 +547,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	}
 
 	Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff] == 0));
+	Assert(ProcGlobal->subxidStates[proc->pgxactoff].count == 0);
+	Assert(!ProcGlobal->subxidStates[proc->pgxactoff].overflowed);
 	ProcGlobal->vacuumFlags[proc->pgxactoff] = 0;
 
 	for (index = 0; index < arrayP->numProcs; index++)
@@ -555,6 +560,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
 			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+			memmove(&ProcGlobal->subxidStates[index], &ProcGlobal->subxidStates[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->subxidStates));
 			memmove(&ProcGlobal->vacuumFlags[index], &ProcGlobal->vacuumFlags[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->vacuumFlags));
 
@@ -600,8 +607,6 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 {
-	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
-
 	if (TransactionIdIsValid(latestXid))
 	{
 		/*
@@ -619,7 +624,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 */
 		if (LWLockConditionalAcquire(ProcArrayLock, LW_EXCLUSIVE))
 		{
-			ProcArrayEndTransactionInternal(proc, pgxact, latestXid);
+			ProcArrayEndTransactionInternal(proc, latestXid);
 			LWLockRelease(ProcArrayLock);
 		}
 		else
@@ -633,15 +638,14 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * estimate of global xmin, but that's OK.
 		 */
 		Assert(!TransactionIdIsValid(proc->xid));
+		Assert(proc->subxidStatus.count == 0);
+		Assert(!proc->subxidStatus.overflowed);
 
 		proc->lxid = InvalidLocalTransactionId;
 		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false; /* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
-		Assert(pgxact->nxids == 0);
-		Assert(pgxact->overflowed == false);
-
 		/* must be cleared with xid/xmin: */
 		/* avoid unnecessarily dirtying shared cachelines */
 		if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
@@ -662,8 +666,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
  * We don't do any locking here; caller must handle that.
  */
 static inline void
-ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
-								TransactionId latestXid)
+ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
@@ -686,8 +689,15 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	}
 
 	/* Clear the subtransaction-XID cache too while holding the lock */
-	pgxact->nxids = 0;
-	pgxact->overflowed = false;
+	Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatus.count &&
+		   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatus.overflowed);
+	if (proc->subxidStatus.count > 0 || proc->subxidStatus.overflowed)
+	{
+		ProcGlobal->subxidStates[pgxactoff].count = 0;
+		ProcGlobal->subxidStates[pgxactoff].overflowed = false;
+		proc->subxidStatus.count = 0;
+		proc->subxidStatus.overflowed = false;
+	}
 
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
@@ -777,9 +787,8 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	while (nextidx != INVALID_PGPROCNO)
 	{
 		PGPROC	   *proc = &allProcs[nextidx];
-		PGXACT	   *pgxact = &allPgXact[nextidx];
 
-		ProcArrayEndTransactionInternal(proc, pgxact, proc->procArrayGroupMemberXid);
+		ProcArrayEndTransactionInternal(proc, proc->procArrayGroupMemberXid);
 
 		/* Move to next proc in list. */
 		nextidx = pg_atomic_read_u32(&proc->procArrayGroupNext);
@@ -823,7 +832,6 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
-	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
 	size_t		pgxactoff;
 
 	/*
@@ -848,8 +856,15 @@ ProcArrayClearTransaction(PGPROC *proc)
 	Assert(!proc->delayChkpt);
 
 	/* Clear the subtransaction-XID cache too */
-	pgxact->nxids = 0;
-	pgxact->overflowed = false;
+	Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatus.count &&
+		   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatus.overflowed);
+	if (proc->subxidStatus.count > 0 || proc->subxidStatus.overflowed)
+	{
+		ProcGlobal->subxidStates[pgxactoff].count = 0;
+		ProcGlobal->subxidStates[pgxactoff].overflowed = false;
+		proc->subxidStatus.count = 0;
+		proc->subxidStatus.overflowed = false;
+	}
 
 	LWLockRelease(ProcArrayLock);
 }
@@ -1269,6 +1284,7 @@ TransactionIdIsInProgress(TransactionId xid)
 {
 	static TransactionId *xids = NULL;
 	static TransactionId *other_xids;
+	XidCacheStatus *other_subxidstates;
 	int			nxids = 0;
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId topxid;
@@ -1330,6 +1346,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	}
 
 	other_xids = ProcGlobal->xids;
+	other_subxidstates = ProcGlobal->subxidStates;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -1351,7 +1368,6 @@ TransactionIdIsInProgress(TransactionId xid)
 	for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 	{
 		int			pgprocno;
-		PGXACT	   *pgxact;
 		PGPROC	   *proc;
 		TransactionId pxid;
 		int			pxids;
@@ -1386,9 +1402,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
-		pgprocno = arrayP->pgprocnos[pgxactoff];
-		pgxact = &allPgXact[pgprocno];
-		pxids = pgxact->nxids;
+		pxids = other_subxidstates[pgxactoff].count;
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
 		pgprocno = arrayP->pgprocnos[pgxactoff];
 		proc = &allProcs[pgprocno];
@@ -1412,7 +1426,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		 * we hold ProcArrayLock.  So we can't miss an Xid that we need to
 		 * worry about.)
 		 */
-		if (pgxact->overflowed)
+		if (other_subxidstates[pgxactoff].overflowed)
 			xids[nxids++] = pxid;
 	}
 
@@ -2005,6 +2019,7 @@ GetSnapshotData(Snapshot snapshot)
 		size_t		numProcs = arrayP->numProcs;
 		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
+		XidCacheStatus *subxidStates = ProcGlobal->subxidStates;
 		uint8	   *allVacuumFlags = ProcGlobal->vacuumFlags;
 
 		/*
@@ -2081,17 +2096,16 @@ GetSnapshotData(Snapshot snapshot)
 			 */
 			if (!suboverflowed)
 			{
-				int			pgprocno = pgprocnos[pgxactoff];
-				PGXACT	   *pgxact = &allPgXact[pgprocno];
 
-				if (pgxact->overflowed)
+				if (subxidStates[pgxactoff].overflowed)
 					suboverflowed = true;
 				else
 				{
-					int			nsubxids = pgxact->nxids;
+					int			nsubxids = subxidStates[pgxactoff].count;
 
 					if (nsubxids > 0)
 					{
+						int			pgprocno = pgprocnos[pgxactoff];
 						PGPROC	   *proc = &allProcs[pgprocno];
 
 						pg_read_barrier();	/* pairs with GetNewTransactionId */
@@ -2483,8 +2497,6 @@ GetRunningTransactionData(void)
 	 */
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		int			pgprocno = arrayP->pgprocnos[index];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2505,7 +2517,7 @@ GetRunningTransactionData(void)
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
-		if (pgxact->overflowed)
+		if (ProcGlobal->subxidStates[index].overflowed)
 			suboverflowed = true;
 
 		/*
@@ -2525,27 +2537,28 @@ GetRunningTransactionData(void)
 	 */
 	if (!suboverflowed)
 	{
+		XidCacheStatus *other_subxidstates = ProcGlobal->subxidStates;
+
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
 			PGPROC	   *proc = &allProcs[pgprocno];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			int			nxids;
+			int			nsubxids;
 
 			/*
 			 * Save subtransaction XIDs. Other backends can't add or remove
 			 * entries while we're holding XidGenLock.
 			 */
-			nxids = pgxact->nxids;
-			if (nxids > 0)
+			nsubxids = other_subxidstates[index].count;
+			if (nsubxids > 0)
 			{
 				/* barrier not really required, as XidGenLock is held, but ... */
 				pg_read_barrier();	/* pairs with GetNewTransactionId */
 
 				memcpy(&xids[count], (void *) proc->subxids.xids,
-					   nxids * sizeof(TransactionId));
-				count += nxids;
-				subcount += nxids;
+					   nsubxids * sizeof(TransactionId));
+				count += nsubxids;
+				subcount += nsubxids;
 
 				/*
 				 * Top-level XID of a transaction is always less than any of
@@ -3612,14 +3625,6 @@ ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
 	LWLockRelease(ProcArrayLock);
 }
 
-
-#define XidCacheRemove(i) \
-	do { \
-		MyProc->subxids.xids[i] = MyProc->subxids.xids[MyPgXact->nxids - 1]; \
-		pg_write_barrier(); \
-		MyPgXact->nxids--; \
-	} while (0)
-
 /*
  * XidCacheRemoveRunningXids
  *
@@ -3635,6 +3640,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 {
 	int			i,
 				j;
+	XidCacheStatus *mysubxidstat;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -3652,6 +3658,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
+	mysubxidstat = &ProcGlobal->subxidStates[MyProc->pgxactoff];
+
 	/*
 	 * Under normal circumstances xid and xids[] will be in increasing order,
 	 * as will be the entries in subxids.  Scan backwards to avoid O(N^2)
@@ -3661,11 +3669,14 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	{
 		TransactionId anxid = xids[i];
 
-		for (j = MyPgXact->nxids - 1; j >= 0; j--)
+		for (j = MyProc->subxidStatus.count - 1; j >= 0; j--)
 		{
 			if (TransactionIdEquals(MyProc->subxids.xids[j], anxid))
 			{
-				XidCacheRemove(j);
+				MyProc->subxids.xids[j] = MyProc->subxids.xids[MyProc->subxidStatus.count - 1];
+				pg_write_barrier();
+				mysubxidstat->count--;
+				MyProc->subxidStatus.count--;
 				break;
 			}
 		}
@@ -3677,20 +3688,23 @@ XidCacheRemoveRunningXids(TransactionId xid,
 		 * error during AbortSubTransaction.  So instead of Assert, emit a
 		 * debug warning.
 		 */
-		if (j < 0 && !MyPgXact->overflowed)
+		if (j < 0 && !MyProc->subxidStatus.overflowed)
 			elog(WARNING, "did not find subXID %u in MyProc", anxid);
 	}
 
-	for (j = MyPgXact->nxids - 1; j >= 0; j--)
+	for (j = MyProc->subxidStatus.count - 1; j >= 0; j--)
 	{
 		if (TransactionIdEquals(MyProc->subxids.xids[j], xid))
 		{
-			XidCacheRemove(j);
+			MyProc->subxids.xids[j] = MyProc->subxids.xids[MyProc->subxidStatus.count - 1];
+			pg_write_barrier();
+			mysubxidstat->count--;
+			MyProc->subxidStatus.count--;
 			break;
 		}
 	}
 	/* Ordinarily we should have found it, unless the cache has overflowed */
-	if (j < 0 && !MyPgXact->overflowed)
+	if (j < 0 && !MyProc->subxidStatus.overflowed)
 		elog(WARNING, "did not find subXID %u in MyProc", xid);
 
 	/* Also advance global latestCompletedXid while holding the lock */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index a557f63e2b3..5d4d756fbde 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -63,9 +63,8 @@ int			LockTimeout = 0;
 int			IdleInTransactionSessionTimeout = 0;
 bool		log_lock_waits = false;
 
-/* Pointer to this process's PGPROC and PGXACT structs, if any */
+/* Pointer to this process's PGPROC struct, if any */
 PGPROC	   *MyProc = NULL;
-PGXACT	   *MyPgXact = NULL;
 
 /*
  * This spinlock protects the freelist of recycled PGPROC structures.
@@ -110,10 +109,8 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, sizeof(slock_t));
 
-	size = add_size(size, mul_size(MaxBackends, sizeof(PGXACT)));
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
-	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
 
 	return size;
@@ -161,7 +158,6 @@ void
 InitProcGlobal(void)
 {
 	PGPROC	   *procs;
-	PGXACT	   *pgxacts;
 	int			i,
 				j;
 	bool		found;
@@ -202,18 +198,6 @@ InitProcGlobal(void)
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
 	ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
 
-	/*
-	 * Also allocate a separate array of PGXACT structures.  This is separate
-	 * from the main PGPROC array so that the most heavily accessed data is
-	 * stored contiguously in memory in as few cache lines as possible. This
-	 * provides significant performance benefits, especially on a
-	 * multiprocessor system.  There is one PGXACT structure for every PGPROC
-	 * structure.
-	 */
-	pgxacts = (PGXACT *) ShmemAlloc(TotalProcs * sizeof(PGXACT));
-	MemSet(pgxacts, 0, TotalProcs * sizeof(PGXACT));
-	ProcGlobal->allPgXact = pgxacts;
-
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
@@ -224,6 +208,8 @@ InitProcGlobal(void)
 	ProcGlobal->xids =
 		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
 	ProcGlobal->vacuumFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 	MemSet(ProcGlobal->vacuumFlags, 0, TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 
@@ -372,7 +358,6 @@ InitProcess(void)
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
 	}
-	MyPgXact = &ProcGlobal->allPgXact[MyProc->pgprocno];
 
 	/*
 	 * Cross-check that the PGPROC is of the type we expect; if this were not
@@ -569,7 +554,6 @@ InitAuxiliaryProcess(void)
 	((volatile PGPROC *) auxproc)->pid = MyProcPid;
 
 	MyProc = auxproc;
-	MyPgXact = &ProcGlobal->allPgXact[auxproc->pgprocno];
 
 	SpinLockRelease(ProcStructLock);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2d821bd817f..3da5a9c6ab2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1510,7 +1510,6 @@ PGSetenvStatusType
 PGShmemHeader
 PGTransactionStatusType
 PGVerbosity
-PGXACT
 PG_Locale_Strategy
 PG_Lock_Status
 PG_init_t
-- 
2.25.0.114.g5b0ca878e0

v9-0006-snapshot-scalability-cache-snapshots-using-a-xact.patch (text/x-diff; charset=us-ascii)
From 1bc7beaefb4879ad1bffe17904ed5a37722c4aff Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v9 6/6] snapshot scalability: cache snapshots using a xact
 completion counter.

Previous commits made it faster/more scalable to compute snapshots. But not
building a snapshot is still faster. Now that GetSnapshotData() does not
maintain RecentGlobal* anymore, that is actually not too hard:

This commit introduces xactCompletionCount, which tracks the number of
top-level transactions with xids (i.e. which may have modified the database)
that completed in some form since the start of the server.

We can avoid rebuilding the snapshot's contents whenever the current
xactCompletionCount is the same as it was when the snapshot was originally
built.

Currently this check happens while holding ProcArrayLock. While it's likely
possible to perform the check before acquiring ProcArrayLock, it's too
complicated for now.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/access/transam.h                |   9 ++
 src/include/utils/snapshot.h                |   7 ++
 src/backend/replication/logical/snapbuild.c |   1 +
 src/backend/storage/ipc/procarray.c         | 125 ++++++++++++++++----
 src/backend/utils/time/snapmgr.c            |   4 +
 5 files changed, 126 insertions(+), 20 deletions(-)

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 94ba797f026..7ea0235b3e6 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -231,6 +231,15 @@ typedef struct VariableCacheData
 	FullTransactionId latestCompletedFullXid;	/* newest full XID that has
 												 * committed or aborted */
 
+	/*
+	 * Number of top-level transactions with xids (i.e. which may have
+	 * modified the database) that completed in some form since the start of
+	 * the server. This currently is solely used to check whether
+	 * GetSnapshotData() needs to recompute the contents of the snapshot, or
+	 * not. There are likely other users of this.  Always at least 1.
+	 */
+	uint64 xactCompletionCount;
+
 	/*
 	 * These fields are protected by CLogTruncationLock
 	 */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 35b1f05bea6..dea072e5edf 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -207,6 +207,13 @@ typedef struct SnapshotData
 
 	TimestampTz whenTaken;		/* timestamp when snapshot was taken */
 	XLogRecPtr	lsn;			/* position in the WAL stream when taken */
+
+	/*
+	 * The transaction completion count at the time GetSnapshotData() built
+	 * this snapshot. Allows us to avoid re-computing static snapshots when no
+	 * transactions completed since the last GetSnapshotData().
+	 */
+	uint64		snapXactCompletionCount;
 } SnapshotData;
 
 #endif							/* SNAPSHOT_H */
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e9701ea7221..9d5d68f3fa7 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -524,6 +524,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->curcid = FirstCommandId;
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
+	snapshot->snapXactCompletionCount = 0;
 
 	return snapshot;
 }
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 3a28fed05fd..b6a2149bede 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -407,6 +407,7 @@ CreateSharedProcArray(void)
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+		ShmemVariableCache->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
@@ -536,6 +537,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		/* Advance global latestCompletedXid while holding the lock */
 		MaintainLatestCompletedXid(latestXid);
 
+		/* Same with xactCompletionCount  */
+		ShmemVariableCache->xactCompletionCount++;
+
 		ProcGlobal->xids[proc->pgxactoff] = 0;
 		ProcGlobal->subxidStates[proc->pgxactoff].overflowed = false;
 		ProcGlobal->subxidStates[proc->pgxactoff].count = 0;
@@ -670,6 +674,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
+	Assert(LWLockHeldByMe(ProcArrayLock));
 	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
 	Assert(ProcGlobal->xids[pgxactoff] == proc->xid);
 
@@ -701,6 +706,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
+
+	/* Same with xactCompletionCount  */
+	ShmemVariableCache->xactCompletionCount++;
 }
 
 /*
@@ -1901,6 +1909,93 @@ GetMaxSnapshotSubxidCount(void)
 	return TOTAL_MAX_CACHED_SUBXIDS;
 }
 
+/*
+ * Initialize old_snapshot_threshold specific parts of a newly built snapshot.
+ */
+static void
+GetSnapshotDataInitOldSnapshot(Snapshot snapshot)
+{
+	if (!OldSnapshotThresholdActive())
+	{
+		/*
+		 * If not using "snapshot too old" feature, fill related fields with
+		 * dummy values that don't require any locking.
+		 */
+		snapshot->lsn = InvalidXLogRecPtr;
+		snapshot->whenTaken = 0;
+	}
+	else
+	{
+		/*
+		 * Capture the current time and WAL stream location in case this
+		 * snapshot becomes old enough to need to fall back on the special
+		 * "old snapshot" logic.
+		 */
+		snapshot->lsn = GetXLogInsertRecPtr();
+		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
+		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, snapshot->xmin);
+	}
+}
+
+/*
+ * Helper function for GetSnapshotData() that checks if the bulk of the
+ * visibility information in the snapshot is still valid. If so, it updates
+ * the fields that need to change and returns true. Otherwise it returns
+ * false.
+ *
+ * This very likely can be evolved to not need ProcArrayLock held (at very
+ * least in the case we already hold a snapshot), but that's for another day.
+ */
+static bool
+GetSnapshotDataReuse(Snapshot snapshot)
+{
+	uint64 curXactCompletionCount;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+
+	if (unlikely(snapshot->snapXactCompletionCount == 0))
+		return false;
+
+	curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
+	if (curXactCompletionCount != snapshot->snapXactCompletionCount)
+		return false;
+
+	/*
+	 * If the current xactCompletionCount is still the same as it was at the
+	 * time the snapshot was built, we can be sure that rebuilding the
+	 * contents of the snapshot the hard way would result in the same snapshot
+	 * contents:
+	 *
+	 * As explained in transam/README, the set of xids considered running by
+	 * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
+	 * contents only depend on transactions with xids and xactCompletionCount
+	 * is incremented whenever a transaction with an xid finishes (while
+	 * holding ProcArrayLock exclusively). Thus the xactCompletionCount check
+	 * ensures we would detect if the snapshot would have changed.
+	 *
+	 * As the snapshot contents are the same as they were before, it is safe
+	 * to re-enter the snapshot's xmin into the PGPROC array. None of the rows
+	 * visible under the snapshot could already have been removed (that'd
+	 * require the set of running transactions to change) and it fulfills the
+	 * requirement that concurrent GetSnapshotData() calls yield the same
+	 * xmin.
+	 */
+	if (!TransactionIdIsValid(MyProc->xmin))
+		MyProc->xmin = TransactionXmin = snapshot->xmin;
+
+	RecentXmin = snapshot->xmin;
+	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
+
+	snapshot->curcid = GetCurrentCommandId(false);
+	snapshot->active_count = 0;
+	snapshot->regd_count = 0;
+	snapshot->copied = false;
+
+	GetSnapshotDataInitOldSnapshot(snapshot);
+
+	return true;
+}
+
 /*
  * GetSnapshotData -- returns information about running transactions.
  *
@@ -1949,6 +2044,7 @@ GetSnapshotData(Snapshot snapshot)
 	TransactionId oldestxid;
 	int			mypgxactoff;
 	TransactionId myxid;
+	uint64		curXactCompletionCount;
 
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
@@ -1993,12 +2089,19 @@ GetSnapshotData(Snapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
+	if (GetSnapshotDataReuse(snapshot))
+	{
+		LWLockRelease(ProcArrayLock);
+		return snapshot;
+	}
+
 	latest_completed = ShmemVariableCache->latestCompletedFullXid;
 	mypgxactoff = MyProc->pgxactoff;
 	myxid = other_xids[mypgxactoff];
 	Assert(myxid == MyProc->xid);
 
 	oldestxid = ShmemVariableCache->oldestXid;
+	curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
 
 	/* xmax is always latestCompletedXid + 1 */
 	xmax = XidFromFullTransactionId(latest_completed);
@@ -2252,6 +2355,7 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->xcnt = count;
 	snapshot->subxcnt = subcount;
 	snapshot->suboverflowed = suboverflowed;
+	snapshot->snapXactCompletionCount = curXactCompletionCount;
 
 	snapshot->curcid = GetCurrentCommandId(false);
 
@@ -2263,26 +2367,7 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
 
-	if (old_snapshot_threshold < 0)
-	{
-		/*
-		 * If not using "snapshot too old" feature, fill related fields with
-		 * dummy values that don't require any locking.
-		 */
-		snapshot->lsn = InvalidXLogRecPtr;
-		snapshot->whenTaken = 0;
-	}
-	else
-	{
-		/*
-		 * Capture the current time and WAL stream location in case this
-		 * snapshot becomes old enough to need to fall back on the special
-		 * "old snapshot" logic.
-		 */
-		snapshot->lsn = GetXLogInsertRecPtr();
-		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
-		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
-	}
+	GetSnapshotDataInitOldSnapshot(snapshot);
 
 	return snapshot;
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index e9d3e832c76..98735f0c4b7 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -595,6 +595,8 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
 	/* NB: curcid should NOT be copied, it's a local matter */
 
+	CurrentSnapshot->snapXactCompletionCount = 0;
+
 	/*
 	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
@@ -670,6 +672,7 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->snapXactCompletionCount = 0;
 
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
@@ -2207,6 +2210,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
-- 
2.25.0.114.g5b0ca878e0

#43Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Andres Freund (#42)
Re: Improving connection scalability: GetSnapshotData()

On Wed, Apr 8, 2020 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:

Realistically it's still 2-3 hours of proof-reading.

This makes me sad :(

Can we ask RMT to extend feature freeze for this particular patchset?
I think it's reasonable assuming extreme importance of this patchset.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#44Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#40)
Re: Improving connection scalability: GetSnapshotData()

On Tue, Apr 7, 2020 at 4:27 PM Andres Freund <andres@anarazel.de> wrote:

The main reason is that we want to be able to cheaply check the current
state of the variables (mostly when checking a backend's own state). We
can't access the "dense" ones without holding a lock, but we e.g. don't
want to make ProcArrayEndTransactionInternal() take a lock just to check
if vacuumFlags is set.

It turns out to also be good for performance to have the copy for
another reason: The "dense" arrays share cachelines with other
backends. That's worth it because it allows to make GetSnapshotData(),
by far the most frequent operation, touch fewer cache lines. But it also
means that it's more likely that a backend's "dense" array entry isn't
in a local cpu cache (it'll be pulled out of there when modified in
another backend). In many cases we don't need the shared entry at commit
etc time though, we just need to check if it is set - and most of the
time it won't be. The local entry allows to do that cheaply.

Basically it makes sense to access the PGPROC variable when checking a
single backend's data, especially when we have to look at the PGPROC for
other reasons already. It makes sense to look at the "dense" arrays if
we need to look at many / most entries, because we then benefit from the
reduced indirection and better cross-process cacheability.

That's a good explanation. I think it should be in the comments or a
README somewhere.
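
To make the quoted reasoning concrete, here is a minimal sketch of the "private copy plus dense mirror" pattern. It is not the patch's code: Backend, dense_flags, dense_writes and the three functions are hypothetical stand-ins for PGPROC, ProcGlobal->vacuumFlags and their users, and all locking is left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BACKENDS 8

/* Hypothetical per-backend struct; stands in for PGPROC. */
typedef struct Backend
{
	size_t		dense_off;	/* index into the dense array, like pgxactoff */
	uint8_t		flags;		/* private copy, like PGPROC->vacuumFlags */
} Backend;

/* Hypothetical dense mirror; stands in for ProcGlobal->vacuumFlags. */
static uint8_t dense_flags[MAX_BACKENDS];

/* Counts writes to the shared dense array, for illustration only. */
static unsigned dense_writes;

/* Setting a flag updates the private copy and the dense mirror together. */
static void
set_flags(Backend *b, uint8_t flags)
{
	b->flags = flags;
	dense_flags[b->dense_off] = flags;
	dense_writes++;
}

/*
 * End of transaction: consult the private copy first and touch the shared
 * dense entry only if something is actually set.  In the common case this
 * avoids pulling in (and dirtying) a cacheline shared with other backends.
 */
static void
clear_flags_at_commit(Backend *b)
{
	if (b->flags != 0)
	{
		b->flags = 0;
		dense_flags[b->dense_off] = 0;
		dense_writes++;
	}
}

/* A scan over many backends reads only the dense array, no indirection. */
static size_t
count_flagged(size_t nbackends)
{
	size_t		n = 0;

	for (size_t i = 0; i < nbackends; i++)
	{
		if (dense_flags[i] != 0)
			n++;
	}
	return n;
}
```

A backend that never set a flag calls clear_flags_at_commit() without writing to dense_flags at all, while count_flagged() walks contiguous memory. That is the trade-off described above: single-backend checks go through the private copy, scans over many backends go through the dense array.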

How about:
/*
* If the current xactCompletionCount is still the same as it was at the
* time the snapshot was built, we can be sure that rebuilding the
* contents of the snapshot the hard way would result in the same snapshot
* contents:
*
* As explained in transam/README, the set of xids considered running by
* GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
* contents only depend on transactions with xids and xactCompletionCount
* is incremented whenever a transaction with an xid finishes (while
* holding ProcArrayLock exclusively). Thus the xactCompletionCount check
* ensures we would detect if the snapshot would have changed.
*
* As the snapshot contents are the same as they were before, it is safe
* to re-enter the snapshot's xmin into the PGPROC array. None of the rows
* visible under the snapshot could already have been removed (that'd
* require the set of running transactions to change) and it fulfills the
* requirement that concurrent GetSnapshotData() calls yield the same
* xmin.
*/

That's nice and clear.
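
For readers following along, the rule in that comment can be sketched as a small standalone model. This is not the patch's code: SharedState, Snap and the functions below are hypothetical, locking is omitted, and xids are compared with plain "<" (no wraparound handling).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_XIDS 16

/* Hypothetical shared state; stands in for the procarray plus VariableCache. */
typedef struct SharedState
{
	uint32_t	running[MAX_XIDS];	/* xids of in-progress transactions */
	int			nrunning;
	uint32_t	latest_completed;	/* newest xid that has finished */
	uint64_t	completion_count;	/* starts at 1; 0 means "never built" */
} SharedState;

/* Hypothetical snapshot; built_at plays the role of snapXactCompletionCount. */
typedef struct Snap
{
	uint32_t	xip[MAX_XIDS];
	int			xcnt;
	uint32_t	xmax;
	uint64_t	built_at;
} Snap;

static void
shared_init(SharedState *s, uint32_t latest_completed)
{
	s->nrunning = 0;
	s->latest_completed = latest_completed;
	s->completion_count = 1;
}

/* Starting a transaction does not bump the counter: new xids are >= xmax. */
static void
xact_begin(SharedState *s, uint32_t xid)
{
	assert(s->nrunning < MAX_XIDS);
	s->running[s->nrunning++] = xid;
}

/* Finishing a transaction with an xid is what invalidates cached snapshots. */
static void
xact_end(SharedState *s, uint32_t xid)
{
	for (int i = 0; i < s->nrunning; i++)
	{
		if (s->running[i] == xid)
		{
			s->running[i] = s->running[--s->nrunning];
			if (xid > s->latest_completed)
				s->latest_completed = xid;
			s->completion_count++;
			return;
		}
	}
}

/* Returns true if the snapshot was rebuilt, false if it was reused as-is. */
static bool
take_snapshot(const SharedState *s, Snap *snap)
{
	if (snap->built_at != 0 && snap->built_at == s->completion_count)
		return false;

	snap->xmax = s->latest_completed + 1;
	snap->xcnt = 0;
	for (int i = 0; i < s->nrunning; i++)
	{
		if (s->running[i] < snap->xmax)
			snap->xip[snap->xcnt++] = s->running[i];
	}
	snap->built_at = s->completion_count;
	return true;
}
```

In this model a transaction that merely starts leaves an existing snapshot reusable, because its xid is not below xmax and would not have been listed anyway. Only xact_end() changes what a rebuild would produce, which is why it is the only place that bumps the counter.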

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#45Jonathan S. Katz
jkatz@postgresql.org
In reply to: Alexander Korotkov (#43)
Re: Improving connection scalability: GetSnapshotData()

On 4/8/20 8:59 AM, Alexander Korotkov wrote:

On Wed, Apr 8, 2020 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:

Realistically it's still 2-3 hours of proof-reading.

This makes me sad :(

Can we ask RMT to extend feature freeze for this particular patchset?
I think it's reasonable assuming extreme importance of this patchset.

One of the features of RMT responsibilities[1] (https://wiki.postgresql.org/wiki/Release_Management_Team#History) is to be "hands off" as
much as possible, so perhaps a reverse ask: how would people feel about
this patch going into PG13, knowing that the commit would come after the
feature freeze date?

My 2¢, with RMT hat off:

As mentioned earlier[2] (/messages/by-id/6be8c321-68ea-a865-d8d0-50a3af616463@postgresql.org), we know that connection scalability is a major
pain point with PostgreSQL and any effort that can help alleviate that
is a huge win, even in incremental gains. Andres et al.'s experimentation
shows that this is more than incremental gains, and will certainly make a
huge difference in people's PostgreSQL experience. It is one of those
features where you can "plug in and win" -- you get a performance
benefit just by upgrading. That is not insignificant.

However, I also want to ensure that we are fair: in the past there have
also been other patches that have been "oh-so-close" to commit before
feature freeze but have not made it in (an example escapes me at the
moment). Therefore, we really need to have consensus among ourselves
that the right decision is to allow this to go in after feature freeze.

Did this come in (very) late into the development cycle? Yes, and I
think normally that's enough to give cause for pause. But I could also
argue that Andres is fixing a "bug" with PostgreSQL (probably several
bugs ;) -- and perhaps the fixes can't be backpatched
per se, but they do improve the overall stability and usability of
PostgreSQL and it'd be a shame if we have to wait on them.

Lastly, with the ongoing world events, perhaps time that could have been
dedicated to this and other patches likely affected their completion. I
know most things in my life take way longer than they used to (e.g.
taking out the trash/recycles has gone from a 15s to 240s routine). The
same could be said about other patches as well, but this one has a far
greater impact (a double-edged sword, of course) given it's a feature
that everyone uses in PostgreSQL ;)

So with my RMT hat off, I say +1 to allowing this post feature freeze,
though within a reasonable window.

My 2¢, with RMT hat on:

I believe in [2] I outlined a path for including the patch even at
this stage in the game. If it is indeed committed, I think we
immediately put it on the "Recheck at mid-Beta" list. I know it's not as
trivial to change as something like "Determine if jit="on" by default"
(not picking on Andres, I just remember that example from RMT 11), but
it at least provides a notable reminder that we need to ensure we test
this thoroughly, and point people to really hammer it during beta.

So with my RMT hat on, I say +0 but with a ;)

Thanks,

Jonathan

[1]: https://wiki.postgresql.org/wiki/Release_Management_Team#History
[2]: /messages/by-id/6be8c321-68ea-a865-d8d0-50a3af616463@postgresql.org

#46Robert Haas
robertmhaas@gmail.com
In reply to: Jonathan S. Katz (#45)
Re: Improving connection scalability: GetSnapshotData()

On Wed, Apr 8, 2020 at 9:27 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:

One of the features of RMT responsibilities[1] is to be "hands off" as
much as possible, so perhaps a reverse ask: how would people feel about
this patch going into PG13, knowing that the commit would come after the
feature freeze date?

Letting something be committed after feature freeze, or at any other
time, is just a risk vs. reward trade-off. Every patch carries some
chance of breaking stuff or making things worse. And every patch has a
chance of making something better that people care about.

On general principle, I would categorize this as a moderate-risk
patch. It doesn't change SQL syntax like, e.g. MERGE, nor does it
touch the on-disk format, like, e.g. INSERT .. ON CONFLICT UPDATE. The
changes are relatively localized, unlike, e.g. parallel query. Those
are all things that reduce risk. On the other hand, it's a brand new
patch which has not been thoroughly reviewed by anyone. Moreover,
shakedown time will be minimal because we're so late in the release
cycle. If it has subtle synchronization problems or if it regresses
performance badly in some cases, we might not find out about any of
that until after release. While in theory we could revert it any time,
since no SQL syntax or on-disk format is affected, in practice it will
be difficult to do that if it's making life better for some people and
worse for others.

I don't know what the right thing to do is. I agree with everyone who
says this is a very important problem, and I have the highest respect
for Andres's technical ability. On the other hand, I have been around
here long enough to know that deciding whether to allow late commits
on the basis of how much we like the feature is a bad plan, because it
takes into account only the upside of a commit, and ignores the
possible downside risk. Typically, the commit is late because the
feature was rushed to completion at the last minute, which can have an
effect on quality. I can say, having read through the patches
yesterday, that they don't suck, but I can't say that they're fully
correct. That's not to say that we shouldn't decide to take them, but
it is a concern to be taken seriously. We have made mistakes before in
what we shipped that had serious implications for many users and for
the project; we should all be wary of making more such mistakes. I am
not trying to say that solving problems and making stuff better is NOT
important, just that every coin has two sides.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#47Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#44)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-08 09:24:13 -0400, Robert Haas wrote:

On Tue, Apr 7, 2020 at 4:27 PM Andres Freund <andres@anarazel.de> wrote:

The main reason is that we want to be able to cheaply check the current
state of the variables (mostly when checking a backend's own state). We
can't access the "dense" ones without holding a lock, but we e.g. don't
want to make ProcArrayEndTransactionInternal() take a lock just to check
if vacuumFlags is set.

It turns out to also be good for performance to have the copy for
another reason: The "dense" arrays share cachelines with other
backends. That's worth it because it allows to make GetSnapshotData(),
by far the most frequent operation, touch fewer cache lines. But it also
means that it's more likely that a backend's "dense" array entry isn't
in a local cpu cache (it'll be pulled out of there when modified in
another backend). In many cases we don't need the shared entry at commit
etc time though, we just need to check if it is set - and most of the
time it won't be. The local entry allows to do that cheaply.

Basically it makes sense to access the PGPROC variable when checking a
single backend's data, especially when we have to look at the PGPROC for
other reasons already. It makes sense to look at the "dense" arrays if
we need to look at many / most entries, because we then benefit from the
reduced indirection and better cross-process cacheability.

That's a good explanation. I think it should be in the comments or a
README somewhere.

I had a briefer version in the PROC_HDR comment. I've just expanded it
to:
*
* The denser separate arrays are beneficial for three main reasons: First, to
* allow for as tight loops accessing the data as possible. Second, to prevent
* updates of frequently changing data (e.g. xmin) from invalidating
* cachelines also containing less frequently changing data (e.g. xid,
* vacuumFlags). Third, to condense frequently accessed data into as few
* cachelines as possible.
*
* There are two main reasons to have the data mirrored between these dense
* arrays and PGPROC. First, as explained above, a PGPROC's array entries can
* only be accessed with either ProcArrayLock or XidGenLock held, whereas the
* PGPROC entries do not require that (obviously there may still be locking
* requirements around the individual field, separate from the concerns
* here). That is particularly important for a backend to efficiently check
* its own values, which it often can safely do without locking. Second, the
* PGPROC fields allow to avoid unnecessary accesses and modification to the
* dense arrays. A backend's own PGPROC is more likely to be in a local cache,
* whereas the cachelines for the dense array will be modified by other
* backends (often removing it from the cache for other cores/sockets). At
* commit/abort time a check of the PGPROC value can avoid accessing/dirtying
* the corresponding array value.
*
* Basically it makes sense to access the PGPROC variable when checking a
* single backend's data, especially when already looking at the PGPROC for
* other reasons. It makes sense to look at the "dense" arrays if we
* need to look at many / most entries, because we then benefit from the
* reduced indirection and better cross-process cache-ability.
*
* When entering a PGPROC for 2PC transactions with ProcArrayAdd(), the data
* in the dense arrays is initialized from the PGPROC while it already holds
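
To make the mirroring argument concrete, here is a minimal sketch of the pattern the comment describes. Everything in it is assumed for illustration: Backend, DenseArrays and the helper functions are hypothetical names, the required locking is left out, and dense_writes exists only to make the avoided accesses visible.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BACKENDS 8

/* Hypothetical per-backend struct holding the mirrored copy. */
typedef struct Backend
{
	int			dense_index;	/* this backend's slot in the dense array */
	uint8_t		vacuum_flags;	/* mirror of the dense array entry */
} Backend;

/* Hypothetical dense array, shared by all backends. */
typedef struct DenseArrays
{
	uint8_t		vacuum_flags[MAX_BACKENDS];
	int			dense_writes;	/* instrumentation for this sketch only */
} DenseArrays;

/* Setting a flag updates the mirror and the shared entry together. */
static void
set_flags(DenseArrays *dense, Backend *backend, uint8_t flags)
{
	assert(backend->dense_index >= 0 && backend->dense_index < MAX_BACKENDS);
	backend->vacuum_flags = flags;
	dense->vacuum_flags[backend->dense_index] = flags;
	dense->dense_writes++;
}

/* At commit/abort: check the local mirror first, touch the array only if set. */
static void
end_transaction(DenseArrays *dense, Backend *backend)
{
	if (backend->vacuum_flags == 0)
		return;					/* common case: shared cacheline untouched */

	backend->vacuum_flags = 0;
	dense->vacuum_flags[backend->dense_index] = 0;
	dense->dense_writes++;
}

/* Looking at many entries: a tight loop over the dense array, no indirection. */
static int
count_flagged(const DenseArrays *dense, int nbackends)
{
	int			count = 0;

	for (int i = 0; i < nbackends; i++)
		if (dense->vacuum_flags[i] != 0)
			count++;
	return count;
}
```

The single-backend path reads only the backend's own struct in the common case, while the scan over all backends stays a loop over contiguous memory.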

Greetings,

Andres Freund

#48Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#46)
Re: Improving connection scalability: GetSnapshotData()

On Wed, Apr 8, 2020 at 09:44:16AM -0400, Robert Haas wrote:

I don't know what the right thing to do is. I agree with everyone who
says this is a very important problem, and I have the highest respect
for Andres's technical ability. On the other hand, I have been around
here long enough to know that deciding whether to allow late commits
on the basis of how much we like the feature is a bad plan, because it
takes into account only the upside of a commit, and ignores the
possible downside risk. Typically, the commit is late because the
feature was rushed to completion at the last minute, which can have an
effect on quality. I can say, having read through the patches
yesterday, that they don't suck, but I can't say that they're fully
correct. That's not to say that we shouldn't decide to take them, but
it is a concern to be taken seriously. We have made mistakes before in
what we shipped that had serious implications for many users and for
the project; we should all be wary of making more such mistakes. I am
not trying to say that solving problems and making stuff better is NOT
important, just that every coin has two sides.

If we don't commit this, where does this leave us with the
old_snapshot_threshold feature? We remove it in back branches and have
no working version in PG 13? That seems kind of bad.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#49Andres Freund
andres@anarazel.de
In reply to: Jonathan S. Katz (#45)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-08 09:26:42 -0400, Jonathan S. Katz wrote:

On 4/8/20 8:59 AM, Alexander Korotkov wrote:

On Wed, Apr 8, 2020 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:

Realistically it's still 2-3 hours of proof-reading.

This makes me sad :(

Can we ask RMT to extend feature freeze for this particular patchset?
I think it's reasonable assuming extreme importance of this patchset.

One of the features of RMT responsibilities[1] is to be "hands off" as
much as possible, so perhaps a reverse ask: how would people feel about
this patch going into PG13, knowing that the commit would come after the
feature freeze date?

I'm obviously biased, so I don't think there's much point in responding
directly to that question. But I thought it could be helpful if I
described my thoughts about where the patchset is:

What made me not commit it "earlier" yesterday was not that I had/have
any substantial concerns about the technical details of the patch. But
that there were a few too many comments that didn't yet sound quite
right, that the commit messages didn't yet explain the architecture
/ benefits well enough, and that I noticed that a few variable names
were too easy to be misunderstood by others.

By 5 AM I had addressed most of that, except that some technical details
weren't yet mentioned in the commit messages ([1], they are documented
in the code). I also produce enough typos / odd grammar when fully
awake, so even though I did proof read my changes, I thought that I need
to do that again while awake.

There have been no substantial code changes since yesterday. The
variable renaming prompted by Robert (which I agree is an improvement),
as well as reducing the diff size by deferring some readability
improvements (probably also a good idea) did however produce quite a few
conflicts in subsequent patches that I needed to resolve. Another awake
read-through to confirm that I resolved them correctly seemed the
responsible thing to do before a commit.

Lastly, with the ongoing world events, perhaps time that could have been
dedicated to this and other patches likely affected their completion. I
know most things in my life take way longer than they used to (e.g.
taking out the trash/recycles has gone from a 15s to 240s routine). The
same could be said about other patches as well, but this one has a far
greater impact (a double-edged sword, of course) given it's a feature
that everyone uses in PostgreSQL ;)

I'm obviously not alone in that, so I agree that it's not an argument
pro/con anything.

But this definitely is the case for me. Leaving aside the general dread,
not having a quiet home-office, nor good exercise, is definitely not
helping.

I'm really bummed that I didn't have the cycles to help get the shared
memory stats patch ready as well. It's clearly not yet there (but
improved a lot during the CF). But it's been around for so long, and
there's so many improvements blocked by the current stats
infrastructure...

[1]: the "mirroring" of values between dense arrays and PGPROC, the
changed locking regimen for ProcArrayAdd/Remove, the widening of
lastCompletedXid to be a 64bit xid
[2]: /messages/by-id/20200407121503.zltbpqmdesurflnm@alap3.anarazel.de

Greetings,

Andres Freund

#50Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#46)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-08 09:44:16 -0400, Robert Haas wrote:

Moreover, shakedown time will be minimal because we're so late in the
release cycle

My impression increasingly is that there's very little actual shakedown
before beta :(. As e.g. evidenced by the fact that 2PC did basically not
work for several months until I did new benchmarks for this patch.

I don't know what to do about that, but...

Greetings,

Andres Freund

#51Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#48)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-04-08 18:06:23 -0400, Bruce Momjian wrote:

If we don't commit this, where does this leave us with the
old_snapshot_threshold feature? We remove it in back branches and have
no working version in PG 13? That seems kind of bad.

I don't think this patch changes the situation for
old_snapshot_threshold in a meaningful way.

Sure, this patch makes old_snapshot_threshold scale better, and triggers
fewer unnecessary query cancellations. But there still are wrong query
results, the tests still don't test anything meaningful, and the
determination of which query is cancelled is still wrong.

- Andres

#52Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#51)
Re: Improving connection scalability: GetSnapshotData()

On Wed, Apr 8, 2020 at 03:25:34PM -0700, Andres Freund wrote:

Hi,

On 2020-04-08 18:06:23 -0400, Bruce Momjian wrote:

If we don't commit this, where does this leave us with the
old_snapshot_threshold feature? We remove it in back branches and have
no working version in PG 13? That seems kind of bad.

I don't think this patch changes the situation for
old_snapshot_threshold in a meaningful way.

Sure, this patch makes old_snapshot_threshold scale better, and triggers
fewer unnecessary query cancellations. But there still are wrong query
results, the tests still don't test anything meaningful, and the
determination of which query is cancelled is still wrong.

Oh, OK, so it still needs to be disabled. I was hoping we could paint
this as a fix.

Based on Robert's analysis of the risk (no SQL syntax, no storage
changes), I think, if you are willing to keep working at this until the
final release, it is reasonable to apply it.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#53Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#49)
Re: Improving connection scalability: GetSnapshotData()

On Wed, Apr 08, 2020 at 03:17:41PM -0700, Andres Freund wrote:

On 2020-04-08 09:26:42 -0400, Jonathan S. Katz wrote:

Lastly, with the ongoing world events, perhaps time that could have been
dedicated to this and other patches likely affected their completion. I
know most things in my life take way longer than they used to (e.g.
taking out the trash/recycles has gone from a 15s to 240s routine). The
same could be said about other patches as well, but this one has a far
greater impact (a double-edged sword, of course) given it's a feature
that everyone uses in PostgreSQL ;)

I'm obviously not alone in that, so I agree that it's not an argument
pro/con anything.

But this definitely is the case for me. Leaving aside the general dread,
not having a quiet home-office, nor good exercise, is definitely not
helping.

Another factor to be careful of is that by committing a new feature in
a release cycle, you actually need to think about the extra amount of
resources you may need to address comments and issues about it in time
during the beta/stability period, and that more care is likely needed
if you commit something at the end of the cycle. On top of that,
currently, that's a bit hard to plan one or two weeks ahead if help is
needed to stabilize something you worked on. I am pretty sure that
we'll be able to sort things out with a collective effort though.
--
Michael

#54Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Michael Paquier (#53)
Re: Improving connection scalability: GetSnapshotData()

Hello, hackers.
Andres, nice work!

Sorry for the off-topic question.

Some of my work [1] related to the support of index hint bits on
standby is highly interfering with this patch.
Is it safe to consider it committed and start rebasing on top of the patches?

Thanks,
Michail.

[1]: /messages/by-id/CANtu0ojmkN_6P7CQWsZ=uEgeFnSmpCiqCxyYaHnhYpTZHj7Ubw@mail.gmail.com

#55Daniel Gustafsson
daniel@yesql.se
In reply to: Andres Freund (#50)
Re: Improving connection scalability: GetSnapshotData()

This patch no longer applies to HEAD, please submit a rebased version. I've
marked it Waiting on Author in the meantime.

cheers ./daniel

#56Andres Freund
andres@anarazel.de
In reply to: Daniel Gustafsson (#55)
6 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-07-01 14:42:59 +0200, Daniel Gustafsson wrote:

This patch no longer applies to HEAD, please submit a rebased version. I've
marked it Waiting on Author in the meantime.

Thanks!

Here's a rebased version. There's a good bit of commit message
polishing and some code and comment cleanup compared to the last
version. Oh, and obviously the conflicts are resolved.

It could make sense to split the conversion of
VariableCacheData->latestCompletedXid to FullTransactionId out from 0001
into its own commit. Not sure...

I've played with splitting 0003, to have the "infrastructure" pieces
separate, but I think it's not worth it. Without a user the changes look
weird and it's hard to have the comment make sense.

Greetings,

Andres Freund

Attachments:

v11-0001-snapshot-scalability-Don-t-compute-global-horizo.patch (text/x-diff; charset=us-ascii)
From 6f8b0d0cc0ac6fd882e4d3bb287ef6670d3da5ff Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v11 1/6] snapshot scalability: Don't compute global horizons
 when building snapshots.

To make GetSnapshotData() more scalable, it cannot look at each proc's
xmin (see Discussion link below). Due to the frequency at which xmins are
updated, that just does not scale.

Without accessing xmins GetSnapshotData() cannot calculate accurate thresholds
as it has so far. But we don't really have to: The horizons don't actually
change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaced with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >=
   definitely_needed are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated
by GetSnapshotData()), it is likely that further tests can benefit from an
earlier computation of accurate horizons.
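
As a rough illustration of the two-threshold test described above (not the code in this patch), the logic can be sketched as follows. The sketch assumes 64bit xids so wraparound can be ignored, leaves out the rate limiting of recomputation, and uses hypothetical names throughout: VisState, is_removable() and demo_horizon(), which stands in for the expensive accurate horizon computation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state: both boundaries as 64bit xids. */
typedef struct VisState
{
	uint64_t	maybe_needed;		/* xids below this are surely removable */
	uint64_t	definitely_needed;	/* xids at or above this are surely needed */
} VisState;

/* Stand-in for the expensive accurate computation; returns a fixed value. */
static int	demo_horizon_calls = 0;

static uint64_t
demo_horizon(void)
{
	demo_horizon_calls++;
	return 150;
}

/* Can rows deleted by xid be removed? Recomputes only for the middle range. */
static bool
is_removable(VisState *state, uint64_t xid, uint64_t (*compute_horizon) (void))
{
	uint64_t	accurate;

	assert(state->maybe_needed <= state->definitely_needed);

	if (xid < state->maybe_needed)
		return true;			/* cheap: definitely removable */
	if (xid >= state->definitely_needed)
		return false;			/* cheap: definitely still needed */

	/* In between: compute accurate boundaries, never moving them backwards. */
	accurate = compute_horizon();
	if (accurate > state->maybe_needed)
		state->maybe_needed = accurate;
	if (state->definitely_needed < state->maybe_needed)
		state->definitely_needed = state->maybe_needed;

	return xid < state->maybe_needed;
}
```

Because the updated maybe_needed is kept in the state, a later test against a nearby xid can be answered without another accurate computation.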

To avoid regressing performance when old_snapshot_threshold is set (as
that requires an accurate horizon to be computed),
heap_page_prune_opt() doesn't unconditionally call
TransactionIdLimitedForOldSnapshots() anymore. Both the computation of
the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) are now only done when necessary to
remove tuples.

Subsequent commits will take further advantage of the fact that
GetSnapshotData() will not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the
tests currently are not meaningful, and it seems best to address them
separately.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/access/ginblock.h               |    4 +-
 src/include/access/heapam.h                 |   11 +-
 src/include/access/transam.h                |  100 +-
 src/include/storage/bufpage.h               |    6 -
 src/include/storage/proc.h                  |    8 -
 src/include/storage/procarray.h             |   39 +-
 src/include/utils/snapmgr.h                 |   37 +-
 src/include/utils/snapshot.h                |    6 +
 src/backend/access/gin/ginvacuum.c          |   26 +
 src/backend/access/gist/gistutil.c          |    8 +-
 src/backend/access/gist/gistxlog.c          |   10 +-
 src/backend/access/heap/heapam.c            |   15 +-
 src/backend/access/heap/heapam_handler.c    |   24 +-
 src/backend/access/heap/heapam_visibility.c |   79 +-
 src/backend/access/heap/pruneheap.c         |  207 +++-
 src/backend/access/heap/vacuumlazy.c        |   24 +-
 src/backend/access/index/indexam.c          |    3 +-
 src/backend/access/nbtree/README            |   10 +-
 src/backend/access/nbtree/nbtpage.c         |    4 +-
 src/backend/access/nbtree/nbtree.c          |   28 +-
 src/backend/access/nbtree/nbtxlog.c         |   10 +-
 src/backend/access/spgist/spgvacuum.c       |    6 +-
 src/backend/access/transam/README           |   92 +-
 src/backend/access/transam/varsup.c         |   50 +
 src/backend/access/transam/xlog.c           |   11 +-
 src/backend/commands/analyze.c              |    2 +-
 src/backend/commands/vacuum.c               |   41 +-
 src/backend/postmaster/autovacuum.c         |    4 +
 src/backend/replication/logical/launcher.c  |    4 +
 src/backend/replication/walreceiver.c       |   17 +-
 src/backend/replication/walsender.c         |   15 +-
 src/backend/storage/ipc/procarray.c         | 1021 +++++++++++++++----
 src/backend/utils/adt/selfuncs.c            |   20 +-
 src/backend/utils/init/postinit.c           |    4 +
 src/backend/utils/time/snapmgr.c            |  258 ++---
 contrib/amcheck/verify_nbtree.c             |    8 +-
 contrib/pg_visibility/pg_visibility.c       |   18 +-
 contrib/pgstattuple/pgstatapprox.c          |    2 +-
 src/tools/pgindent/typedefs.list            |    2 +
 39 files changed, 1626 insertions(+), 608 deletions(-)

diff --git a/src/include/access/ginblock.h b/src/include/access/ginblock.h
index 3f64fd572e3..fe66a95226b 100644
--- a/src/include/access/ginblock.h
+++ b/src/include/access/ginblock.h
@@ -12,6 +12,7 @@
 
 #include "access/transam.h"
 #include "storage/block.h"
+#include "storage/bufpage.h"
 #include "storage/itemptr.h"
 #include "storage/off.h"
 
@@ -134,8 +135,7 @@ typedef struct GinMetaPageData
  */
 #define GinPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
 #define GinPageSetDeleteXid(page, xid) ( ((PageHeader) (page))->pd_prune_xid = xid)
-#define GinPageIsRecyclable(page) ( PageIsNew(page) || (GinPageIsDeleted(page) \
-	&& TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalXmin)))
+extern bool GinPageIsRecyclable(Page page);
 
 /*
  * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f279edc4734..ef2fcb55a71 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -172,9 +172,12 @@ extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
 														 int nitems);
 
 /* in heap/pruneheap.c */
+struct GlobalVisState;
 extern void heap_page_prune_opt(Relation relation, Buffer buffer);
 extern int	heap_page_prune(Relation relation, Buffer buffer,
-							TransactionId OldestXmin,
+							struct GlobalVisState *vistest,
+							TransactionId limited_oldest_xmin,
+							TimestampTz limited_oldest_ts,
 							bool report_stats, TransactionId *latestRemovedXid);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
@@ -201,11 +204,15 @@ extern TM_Result HeapTupleSatisfiesUpdate(HeapTuple stup, CommandId curcid,
 										  Buffer buffer);
 extern HTSV_Result HeapTupleSatisfiesVacuum(HeapTuple stup, TransactionId OldestXmin,
 											Buffer buffer);
+extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple stup, Buffer buffer,
+												   TransactionId *dead_after);
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 								 uint16 infomask, TransactionId xid);
 extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
-extern bool HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin);
+struct GlobalVisState;
+extern bool HeapTupleIsSurelyDead(struct GlobalVisState *vistest,
+								  HeapTuple htup);
 
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index a91a0c7487d..6ec84b54599 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -54,6 +54,8 @@
 #define FullTransactionIdFollowsOrEquals(a, b) ((a).value >= (b).value)
 #define FullTransactionIdIsValid(x)		TransactionIdIsValid(XidFromFullTransactionId(x))
 #define InvalidFullTransactionId		FullTransactionIdFromEpochAndXid(0, InvalidTransactionId)
+#define FirstNormalFullTransactionId	FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId)
+#define FullTransactionIdIsNormal(x)	FullTransactionIdFollowsOrEquals(x, FirstNormalFullTransactionId)
 
 /*
  * A 64 bit value that contains an epoch and a TransactionId.  This is
@@ -93,15 +95,48 @@ FullTransactionIdFromU64(uint64 value)
 			(dest) = FirstNormalTransactionId; \
 	} while(0)
 
-/* advance a FullTransactionId variable, stepping over special XIDs */
+/*
+ * Advance a FullTransactionId variable, stepping over xids that would appear
+ * to be special only when viewed as 32bit XIDs.
+ */
 static inline void
 FullTransactionIdAdvance(FullTransactionId *dest)
 {
 	dest->value++;
+
+	/*
+	 * In contrast to 32bit XIDs don't step over the "actual" special xids.
+	 * For 64bit xids these can't be reached as part of a wraparound as they
+	 * can in the 32bit case.
+	 */
+	if (FullTransactionIdPrecedes(*dest, FirstNormalFullTransactionId))
+		return;
+
+	/*
+	 * But we do need to step over XIDs that'd appear special only for 32bit
+	 * XIDs.
+	 */
 	while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
 		dest->value++;
 }
 
+/*
+ * Retreat a FullTransactionId variable, stepping over xids that would appear
+ * to be special only when viewed as 32bit XIDs.
+ */
+static inline void
+FullTransactionIdRetreat(FullTransactionId *dest)
+{
+	dest->value--;
+
+	/* see FullTransactionIdAdvance() */
+	if (FullTransactionIdPrecedes(*dest, FirstNormalFullTransactionId))
+		return;
+
+	while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
+		dest->value--;
+}
+
 /* back up a transaction ID variable, handling wraparound correctly */
 #define TransactionIdRetreat(dest)	\
 	do { \
@@ -193,8 +228,8 @@ typedef struct VariableCacheData
 	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
-	TransactionId latestCompletedXid;	/* newest XID that has committed or
-										 * aborted */
+	FullTransactionId latestCompletedFullXid;	/* newest full XID that has
+												 * committed or aborted */
 
 	/*
 	 * These fields are protected by XactTruncationLock
@@ -244,6 +279,12 @@ extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertTransactionInAllowableRange(TransactionId xid);
+#else
+#define AssertTransactionInAllowableRange(xid) ((void)true)
+#endif
+
 /*
  * Some frontend programs include this header.  For compilers that emit static
  * inline functions even when they're unused, that leads to unsatisfied
@@ -260,6 +301,59 @@ ReadNewTransactionId(void)
 	return XidFromFullTransactionId(ReadNextFullTransactionId());
 }
 
+/* return transaction ID backed up by amount, handling wraparound correctly */
+static inline TransactionId
+TransactionIdRetreatedBy(TransactionId xid, uint32 amount)
+{
+	xid -= amount;
+
+	while (xid < FirstNormalTransactionId)
+		xid--;
+
+	return xid;
+}
+
+/* return the older of the two IDs */
+static inline TransactionId
+TransactionIdOlder(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdPrecedes(a, b))
+		return a;
+	return b;
+}
+
+/* return the older of the two IDs, assuming they're both normal */
+static inline TransactionId
+NormalTransactionIdOlder(TransactionId a, TransactionId b)
+{
+	Assert(TransactionIdIsNormal(a));
+	Assert(TransactionIdIsNormal(b));
+	if (NormalTransactionIdPrecedes(a, b))
+		return a;
+	return b;
+}
+
+/* return the newer of the two IDs */
+static inline FullTransactionId
+FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
+{
+	if (!FullTransactionIdIsValid(a))
+		return b;
+
+	if (!FullTransactionIdIsValid(b))
+		return a;
+
+	if (FullTransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 #endif							/* FRONTEND */
 
 #endif							/* TRANSAM_H */
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 3f88683a059..51b8f994ac0 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -389,12 +389,6 @@ PageValidateSpecialPointer(Page page)
 #define PageClearAllVisible(page) \
 	(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
 
-#define PageIsPrunable(page, oldestxmin) \
-( \
-	AssertMacro(TransactionIdIsNormal(oldestxmin)), \
-	TransactionIdIsValid(((PageHeader) (page))->pd_prune_xid) && \
-	TransactionIdPrecedes(((PageHeader) (page))->pd_prune_xid, oldestxmin) \
-)
 #define PageSetPrunable(page, xid) \
 do { \
 	Assert(TransactionIdIsNormal(xid)); \
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index b20e2ad4f6a..08f006f782e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -42,13 +42,6 @@ struct XidCache
 
 /*
  * Flags for PGXACT->vacuumFlags
- *
- * Note: If you modify these flags, you need to modify PROCARRAY_XXX flags
- * in src/include/storage/procarray.h.
- *
- * PROC_RESERVED may later be assigned for use in vacuumFlags, but its value is
- * used for PROCARRAY_SLOTS_XMIN in procarray.h, so GetOldestXmin won't be able
- * to match and ignore processes with this flag set.
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -56,7 +49,6 @@ struct XidCache
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
-#define		PROC_RESERVED				0x20	/* reserved for procarray */
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a5c7d0c0644..ea8a876ca45 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -20,41 +20,6 @@
 #include "utils/snapshot.h"
 
 
-/*
- * These are to implement PROCARRAY_FLAGS_XXX
- *
- * Note: These flags are cloned from PROC_XXX flags in src/include/storage/proc.h
- * to avoid forcing to include proc.h when including procarray.h. So if you modify
- * PROC_XXX flags, you need to modify these flags.
- */
-#define		PROCARRAY_VACUUM_FLAG			0x02	/* currently running lazy
-													 * vacuum */
-#define		PROCARRAY_ANALYZE_FLAG			0x04	/* currently running
-													 * analyze */
-#define		PROCARRAY_LOGICAL_DECODING_FLAG 0x10	/* currently doing logical
-													 * decoding outside xact */
-
-#define		PROCARRAY_SLOTS_XMIN			0x20	/* replication slot xmin,
-													 * catalog_xmin */
-/*
- * Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
- * PGXACT->vacuumFlags. Other flags are used for different purposes and
- * have no corresponding PROC flag equivalent.
- */
-#define		PROCARRAY_PROC_FLAGS_MASK	(PROCARRAY_VACUUM_FLAG | \
-										 PROCARRAY_ANALYZE_FLAG | \
-										 PROCARRAY_LOGICAL_DECODING_FLAG)
-
-/* Use the following flags as an input "flags" to GetOldestXmin function */
-/* Consider all backends except for logical decoding ones which manage xmin separately */
-#define		PROCARRAY_FLAGS_DEFAULT			PROCARRAY_LOGICAL_DECODING_FLAG
-/* Ignore vacuum backends */
-#define		PROCARRAY_FLAGS_VACUUM			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_VACUUM_FLAG
-/* Ignore analyze backends */
-#define		PROCARRAY_FLAGS_ANALYZE			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_ANALYZE_FLAG
-/* Ignore both vacuum and analyze backends */
-#define		PROCARRAY_FLAGS_VACUUM_ANALYZE	PROCARRAY_FLAGS_DEFAULT | PROCARRAY_VACUUM_FLAG | PROCARRAY_ANALYZE_FLAG
-
 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
 extern void ProcArrayAdd(PGPROC *proc);
@@ -88,9 +53,11 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, int flags);
+extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
+extern TransactionId GetOldestTransactionIdConsideredRunning(void);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
+extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);
 
 extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
 extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index ffb4ba3adfb..b6b403e2931 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -52,13 +52,12 @@ extern Size SnapMgrShmemSize(void);
 extern void SnapMgrInit(void);
 extern TimestampTz GetSnapshotCurrentTimestamp(void);
 extern TimestampTz GetOldSnapshotThresholdTimestamp(void);
+extern void SnapshotTooOldMagicForTest(void);
 
 extern bool FirstSnapshotSet;
 
 extern PGDLLIMPORT TransactionId TransactionXmin;
 extern PGDLLIMPORT TransactionId RecentXmin;
-extern PGDLLIMPORT TransactionId RecentGlobalXmin;
-extern PGDLLIMPORT TransactionId RecentGlobalDataXmin;
 
 /* Variables representing various special snapshot semantics */
 extern PGDLLIMPORT SnapshotData SnapshotSelfData;
@@ -78,11 +77,12 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 
 /*
  * Similarly, some initialization is required for a NonVacuumable snapshot.
- * The caller must supply the xmin horizon to use (e.g., RecentGlobalXmin).
+ * The caller must supply the visibility cutoff state to use (cf.
+ * GlobalVisTestFor()).
  */
-#define InitNonVacuumableSnapshot(snapshotdata, xmin_horizon)  \
+#define InitNonVacuumableSnapshot(snapshotdata, vistestp)  \
 	((snapshotdata).snapshot_type = SNAPSHOT_NON_VACUUMABLE, \
-	 (snapshotdata).xmin = (xmin_horizon))
+	 (snapshotdata).vistest = (vistestp))
 
 /*
  * Similarly, some initialization is required for SnapshotToast.  We need
@@ -98,6 +98,11 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 	((snapshot)->snapshot_type == SNAPSHOT_MVCC || \
 	 (snapshot)->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 
+static inline bool
+OldSnapshotThresholdActive(void)
+{
+	return old_snapshot_threshold >= 0;
+}
 
 extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
@@ -121,8 +126,6 @@ extern void UnregisterSnapshot(Snapshot snapshot);
 extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
 extern void UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner);
 
-extern FullTransactionId GetFullRecentGlobalXmin(void);
-
 extern void AtSubCommit_Snapshot(int level);
 extern void AtSubAbort_Snapshot(int level);
 extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
@@ -131,13 +134,29 @@ extern void ImportSnapshot(const char *idstr);
 extern bool XactHasExportedSnapshots(void);
 extern void DeleteAllExportedSnapshotFiles(void);
 extern bool ThereAreNoPriorRegisteredSnapshots(void);
-extern TransactionId TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-														 Relation relation);
+extern bool TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
+												Relation relation,
+												TransactionId *limit_xid,
+												TimestampTz *limit_ts);
+extern void SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit);
 extern void MaintainOldSnapshotTimeMapping(TimestampTz whenTaken,
 										   TransactionId xmin);
 
 extern char *ExportSnapshot(Snapshot snapshot);
 
+/*
+ * These live in procarray.c because they're intimately linked to the
+ * procarray contents, but thematically they better fit into snapmgr.h.
+ */
+typedef struct GlobalVisState GlobalVisState;
+extern GlobalVisState *GlobalVisTestFor(Relation rel);
+extern bool GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid);
+extern bool GlobalVisTestIsRemovableFullXid(GlobalVisState *state, FullTransactionId fxid);
+extern FullTransactionId GlobalVisTestNonRemovableFullHorizon(GlobalVisState *state);
+extern TransactionId GlobalVisTestNonRemovableHorizon(GlobalVisState *state);
+extern bool GlobalVisCheckRemovableXid(Relation rel, TransactionId xid);
+extern bool GlobalVisIsRemovableFullXid(Relation rel, FullTransactionId fxid);
+
 /*
  * Utility functions for implementing visibility routines in table AMs.
  */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 4796edb63aa..35b1f05bea6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -192,6 +192,12 @@ typedef struct SnapshotData
 	 */
 	uint32		speculativeToken;
 
+	/*
+	 * For SNAPSHOT_NON_VACUUMABLE (and hopefully more in the future) this is
+	 * used to determine whether a row could be vacuumed.
+	 */
+	struct GlobalVisState *vistest;
+
 	/*
 	 * Book-keeping information, used by the snapshot manager
 	 */
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 8ae4fd95a7b..9cd6638df62 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -793,3 +793,29 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 	return stats;
 }
+
+/*
+ * Return whether Page can safely be recycled.
+ */
+bool
+GinPageIsRecyclable(Page page)
+{
+	TransactionId delete_xid;
+
+	if (PageIsNew(page))
+		return true;
+
+	if (!GinPageIsDeleted(page))
+		return false;
+
+	delete_xid = GinPageGetDeleteXid(page);
+
+	if (!TransactionIdIsValid(delete_xid))
+		return true;
+
+	/*
+	 * If no backend could still view delete_xid as running, all scans
+	 * concurrent with ginDeletePage() must have finished.
+	 */
+	return GlobalVisCheckRemovableXid(NULL, delete_xid);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 765329bbcd4..bfda7fbe3d5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -891,15 +891,13 @@ gistPageRecyclable(Page page)
 		 * As long as that can happen, we must keep the deleted page around as
 		 * a tombstone.
 		 *
-		 * Compare the deletion XID with RecentGlobalXmin. If deleteXid <
-		 * RecentGlobalXmin, then no scan that's still in progress could have
+		 * For that check if the deletion XID could still be visible to
+		 * anyone. If not, then no scan that's still in progress could have
 		 * seen its downlink, and we can recycle it.
 		 */
 		FullTransactionId deletexid_full = GistPageGetDeleteXid(page);
-		FullTransactionId recentxmin_full = GetFullRecentGlobalXmin();
 
-		if (FullTransactionIdPrecedes(deletexid_full, recentxmin_full))
-			return true;
+		return GlobalVisIsRemovableFullXid(NULL, deletexid_full);
 	}
 	return false;
 }
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3f0effd5e42..3167305ac00 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -387,11 +387,11 @@ gistRedoPageReuse(XLogReaderState *record)
 	 * PAGE_REUSE records exist to provide a conflict point when we reuse
 	 * pages in the index via the FSM.  That's all they do though.
 	 *
-	 * latestRemovedXid was the page's deleteXid.  The deleteXid <
-	 * RecentGlobalXmin test in gistPageRecyclable() conceptually mirrors the
-	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
-	 * Consequently, one XID value achieves the same exclusion effect on
-	 * primary and standby.
+	 * latestRemovedXid was the page's deleteXid.  The
+	 * GlobalVisIsRemovableFullXid(deleteXid) test in gistPageRecyclable()
+	 * conceptually mirrors the pgxact->xmin > limitXmin test in
+	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
+	 * same exclusion effect on primary and standby.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4cd46a..a8804351bee 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1496,6 +1496,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		at_chain_start;
 	bool		valid;
 	bool		skip;
+	GlobalVisState *vistest = NULL;
 
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
@@ -1506,7 +1507,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	at_chain_start = first_call;
 	skip = !first_call;
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
 	Assert(BufferGetBlockNumber(buffer) == blkno);
 
 	/* Scan through possible multiple members of HOT-chain */
@@ -1595,9 +1597,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		 * Note: if you change the criterion here for what is "dead", fix the
 		 * planner's get_actual_variable_range() function to match.
 		 */
-		if (all_dead && *all_dead &&
-			!HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
-			*all_dead = false;
+		if (all_dead && *all_dead)
+		{
+			if (!vistest)
+				vistest = GlobalVisTestFor(relation);
+
+			if (!HeapTupleIsSurelyDead(vistest, heapTuple))
+				*all_dead = false;
+		}
 
 		/*
 		 * Check to see if HOT chain continues past this tuple; if so fetch
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 56b35622f1a..659fc4d8697 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1201,7 +1201,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
-		OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
+		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
 
 	if (!scan)
 	{
@@ -1242,6 +1242,17 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	hscan = (HeapScanDesc) scan;
 
+	/*
+	 * Must have called GetOldestNonRemovableTransactionId() if using
+	 * SnapshotAny.  Shouldn't have for an MVCC snapshot. (It's especially
+	 * worth checking this for parallel builds, since ambuild routines that
+	 * support parallel builds must work these details out for themselves.)
+	 */
+	Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+		   !TransactionIdIsValid(OldestXmin));
+	Assert(snapshot == SnapshotAny || !anyvisible);
+
 	/* Publish number of blocks to scan */
 	if (progress)
 	{
@@ -1261,17 +1272,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 									 nblocks);
 	}
 
-	/*
-	 * Must call GetOldestXmin() with SnapshotAny.  Should never call
-	 * GetOldestXmin() with MVCC snapshot. (It's especially worth checking
-	 * this for parallel builds, since ambuild routines that support parallel
-	 * builds must work these details out for themselves.)
-	 */
-	Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
-	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
-		   !TransactionIdIsValid(OldestXmin));
-	Assert(snapshot == SnapshotAny || !anyvisible);
-
 	/* set our scan endpoints */
 	if (!allow_sync)
 		heap_setscanlimits(scan, start_blockno, numblocks);
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aab..b25b3e429ed 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1154,19 +1154,56 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
  *	we mainly want to know is if a tuple is potentially visible to *any*
  *	running transaction.  If so, it can't be removed yet by VACUUM.
  *
- * OldestXmin is a cutoff XID (obtained from GetOldestXmin()).  Tuples
- * deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
- * still be visible to some open transaction, so we can't remove them,
- * even if we see that the deleting transaction has committed.
+ * OldestXmin is a cutoff XID (obtained from
+ * GetOldestNonRemovableTransactionId()).  Tuples deleted by XIDs >=
+ * OldestXmin are deemed "recently dead"; they might still be visible to some
+ * open transaction, so we can't remove them, even if we see that the deleting
+ * transaction has committed.
  */
 HTSV_Result
 HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 						 Buffer buffer)
+{
+	TransactionId dead_after = InvalidTransactionId;
+	HTSV_Result res;
+
+	res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);
+
+	if (res == HEAPTUPLE_RECENTLY_DEAD)
+	{
+		Assert(TransactionIdIsValid(dead_after));
+
+		if (TransactionIdPrecedes(dead_after, OldestXmin))
+			res = HEAPTUPLE_DEAD;
+	}
+	else
+		Assert(!TransactionIdIsValid(dead_after));
+
+	return res;
+}
+
+/*
+ * Work horse for HeapTupleSatisfiesVacuum and similar routines.
+ *
+ * In contrast to HeapTupleSatisfiesVacuum this routine, when encountering a
+ * tuple that could still be visible to some backend, stores the xid that
+ * needs to be compared with the horizon in *dead_after, and returns
+ * HEAPTUPLE_RECENTLY_DEAD. The caller then can perform the comparison with
+ * the horizon.  This is e.g. useful when comparing with different horizons.
+ *
+ * Note: HEAPTUPLE_DEAD can still be returned here, e.g. if the inserting
+ * transaction aborted.
+ */
+HTSV_Result
+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *dead_after)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
 	Assert(ItemPointerIsValid(&htup->t_self));
 	Assert(htup->t_tableOid != InvalidOid);
+	Assert(dead_after != NULL);
+
+	*dead_after = InvalidTransactionId;
 
 	/*
 	 * Has inserting transaction committed?
@@ -1323,17 +1360,15 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 		else if (TransactionIdDidCommit(xmax))
 		{
 			/*
-			 * The multixact might still be running due to lockers.  If the
-			 * updater is below the xid horizon, we have to return DEAD
-			 * regardless -- otherwise we could end up with a tuple where the
-			 * updater has to be removed due to the horizon, but is not pruned
-			 * away.  It's not a problem to prune that tuple, because any
-			 * remaining lockers will also be present in newer tuple versions.
+			 * The multixact might still be running due to lockers.  Need to
+			 * allow for pruning if below the xid horizon regardless --
+			 * otherwise we could end up with a tuple where the updater has to
+			 * be removed due to the horizon, but is not pruned away.  It's
+			 * not a problem to prune that tuple, because any remaining
+			 * lockers will also be present in newer tuple versions.
 			 */
-			if (!TransactionIdPrecedes(xmax, OldestXmin))
-				return HEAPTUPLE_RECENTLY_DEAD;
-
-			return HEAPTUPLE_DEAD;
+			*dead_after = xmax;
+			return HEAPTUPLE_RECENTLY_DEAD;
 		}
 		else if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple), false))
 		{
@@ -1372,14 +1407,11 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 	}
 
 	/*
-	 * Deleter committed, but perhaps it was recent enough that some open
-	 * transactions could still see the tuple.
+	 * Deleter committed, allow caller to check if it was recent enough that
+	 * some open transactions could still see the tuple.
 	 */
-	if (!TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin))
-		return HEAPTUPLE_RECENTLY_DEAD;
-
-	/* Otherwise, it's dead and removable */
-	return HEAPTUPLE_DEAD;
+	*dead_after = HeapTupleHeaderGetRawXmax(tuple);
+	return HEAPTUPLE_RECENTLY_DEAD;
 }
 
 
@@ -1418,7 +1450,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
  *	if the tuple is removable.
  */
 bool
-HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
+HeapTupleIsSurelyDead(GlobalVisState *vistest, HeapTuple htup)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
@@ -1459,7 +1491,8 @@ HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
 		return false;
 
 	/* Deleter committed, so tuple is dead if the XID is old enough. */
-	return TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin);
+	return GlobalVisTestIsRemovableXid(vistest,
+									   HeapTupleHeaderGetRawXmax(tuple));
 }
 
 /*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 256df4de105..00a3cb106aa 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -23,12 +23,30 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
+#include "utils/snapmgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
 /* Working data for heap_page_prune and subroutines */
 typedef struct
 {
+	Relation	rel;
+
+	/* tuple visibility test, initialized for the relation */
+	GlobalVisState *vistest;
+
+	/*
+	 * Thresholds set by TransactionIdLimitedForOldSnapshots() if they have
+	 * been computed (done on demand, and only if
+	 * OldSnapshotThresholdActive()). The first time a tuple is about to be
+	 * removed based on the limited horizon, old_snap_used is set to true, and
+	 * SetOldSnapshotThresholdTimestamp() is called. See
+	 * heap_prune_satisfies_vacuum().
+	 */
+	TimestampTz old_snap_ts;
+	TransactionId old_snap_xmin;
+	bool		old_snap_used;
+
 	TransactionId new_prune_xid;	/* new prune hint value for page */
 	TransactionId latestRemovedXid; /* latest xid to be removed by this prune */
 	int			nredirected;	/* numbers of entries in arrays below */
@@ -43,9 +61,8 @@ typedef struct
 } PruneState;
 
 /* Local functions */
-static int	heap_prune_chain(Relation relation, Buffer buffer,
+static int	heap_prune_chain(Buffer buffer,
 							 OffsetNumber rootoffnum,
-							 TransactionId OldestXmin,
 							 PruneState *prstate);
 static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid);
 static void heap_prune_record_redirect(PruneState *prstate,
@@ -65,16 +82,16 @@ static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
  * if there's not any use in pruning.
  *
  * Caller must have pin on the buffer, and must *not* have a lock on it.
- *
- * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
- * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
  */
 void
 heap_page_prune_opt(Relation relation, Buffer buffer)
 {
 	Page		page = BufferGetPage(buffer);
+	TransactionId prune_xid;
+	GlobalVisState *vistest;
+	TransactionId limited_xmin = InvalidTransactionId;
+	TimestampTz limited_ts = 0;
 	Size		minfree;
-	TransactionId OldestXmin;
 
 	/*
 	 * We can't write WAL in recovery mode, so there's no point trying to
@@ -85,37 +102,55 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		return;
 
 	/*
-	 * Use the appropriate xmin horizon for this relation. If it's a proper
-	 * catalog relation or a user defined, additional, catalog relation, we
-	 * need to use the horizon that includes slots, otherwise the data-only
-	 * horizon can be used. Note that the toast relation of user defined
-	 * relations are *not* considered catalog relations.
+	 * XXX: Magic to make old_snapshot_threshold tests appear "working". They
+	 * currently are broken, and discussion of what to do about them is
+	 * ongoing. See
+	 * https://www.postgresql.org/message-id/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de
+	 */
+	if (old_snapshot_threshold == 0)
+		SnapshotTooOldMagicForTest();
+
+	/*
+	 * First check whether there's any chance there's something to prune;
+	 * determining the appropriate horizon is a waste if there's no prune_xid
+	 * (i.e. no updates/deletes left potentially dead tuples around).
+	 */
+	prune_xid = ((PageHeader) page)->pd_prune_xid;
+	if (!TransactionIdIsValid(prune_xid))
+		return;
+
+	/*
+	 * Check whether prune_xid indicates that there may be dead rows that can
+	 * be cleaned up.
 	 *
-	 * It is OK to apply the old snapshot limit before acquiring the cleanup
+	 * It is OK to check the old snapshot limit before acquiring the cleanup
 	 * lock because the worst that can happen is that we are not quite as
 	 * aggressive about the cleanup (by however many transaction IDs are
 	 * consumed between this point and acquiring the lock).  This allows us to
 	 * save significant overhead in the case where the page is found not to be
 	 * prunable.
-	 */
-	if (IsCatalogRelation(relation) ||
-		RelationIsAccessibleInLogicalDecoding(relation))
-		OldestXmin = RecentGlobalXmin;
-	else
-		OldestXmin =
-			TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
-												relation);
-
-	Assert(TransactionIdIsValid(OldestXmin));
-
-	/*
-	 * Let's see if we really need pruning.
 	 *
-	 * Forget it if page is not hinted to contain something prunable that's
-	 * older than OldestXmin.
+	 * Even if old_snapshot_threshold is set, we first check whether the page
+	 * can be pruned without it, both because
+	 * TransactionIdLimitedForOldSnapshots() is not cheap, and because not
+	 * relying on old_snapshot_threshold unnecessarily avoids causing
+	 * conflicts.
 	 */
-	if (!PageIsPrunable(page, OldestXmin))
-		return;
+	vistest = GlobalVisTestFor(relation);
+
+	if (!GlobalVisTestIsRemovableXid(vistest, prune_xid))
+	{
+		if (!OldSnapshotThresholdActive())
+			return;
+
+		if (!TransactionIdLimitedForOldSnapshots(GlobalVisTestNonRemovableHorizon(vistest),
+												 relation,
+												 &limited_xmin, &limited_ts))
+			return;
+
+		if (!TransactionIdPrecedes(prune_xid, limited_xmin))
+			return;
+	}
 
 	/*
 	 * We prune when a previous UPDATE failed to find enough space on the page
@@ -151,7 +186,9 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 															 * needed */
 
 			/* OK to prune */
-			(void) heap_page_prune(relation, buffer, OldestXmin, true, &ignore);
+			(void) heap_page_prune(relation, buffer, vistest,
+								   limited_xmin, limited_ts,
+								   true, &ignore);
 		}
 
 		/* And release buffer lock */
@@ -165,8 +202,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * Caller must have pin and buffer cleanup lock on the page.
  *
- * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
- * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ * vistest is used to distinguish whether tuples are DEAD or RECENTLY_DEAD
+ * (see heap_prune_satisfies_vacuum and
+ * HeapTupleSatisfiesVacuum). old_snap_xmin / old_snap_ts need to
+ * either have been set by TransactionIdLimitedForOldSnapshots, or
+ * be InvalidTransactionId/0 respectively.
  *
  * If report_stats is true then we send the number of reclaimed heap-only
  * tuples to pgstats.  (This must be false during vacuum, since vacuum will
@@ -177,7 +217,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * latestRemovedXid.
  */
 int
-heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
+heap_page_prune(Relation relation, Buffer buffer,
+				GlobalVisState *vistest,
+				TransactionId old_snap_xmin,
+				TimestampTz old_snap_ts,
 				bool report_stats, TransactionId *latestRemovedXid)
 {
 	int			ndeleted = 0;
@@ -198,6 +241,11 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 	 * initialize the rest of our working state.
 	 */
 	prstate.new_prune_xid = InvalidTransactionId;
+	prstate.rel = relation;
+	prstate.vistest = vistest;
+	prstate.old_snap_xmin = old_snap_xmin;
+	prstate.old_snap_ts = old_snap_ts;
+	prstate.old_snap_used = false;
 	prstate.latestRemovedXid = *latestRemovedXid;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
@@ -220,9 +268,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 			continue;
 
 		/* Process this item or chain of items */
-		ndeleted += heap_prune_chain(relation, buffer, offnum,
-									 OldestXmin,
-									 &prstate);
+		ndeleted += heap_prune_chain(buffer, offnum, &prstate);
 	}
 
 	/* Any error while applying the changes is critical */
@@ -323,6 +369,85 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 }
 
 
+/*
+ * Perform visibility checks for heap pruning.
+ *
+ * This is more complicated than just using GlobalVisTestIsRemovableXid()
+ * because of old_snapshot_threshold. We only want to increase the threshold
+ * that triggers errors for old snapshots when we actually decide to remove a
+ * row based on the limited horizon.
+ *
+ * Due to its cost we also only want to call
+ * TransactionIdLimitedForOldSnapshots() if necessary, i.e. we might not have
+ * done so in heap_page_prune_opt() if pd_prune_xid was old enough. But we
+ * still want to be able to remove rows that are too new to be removed
+ * according to prstate->vistest, but that can be removed based on
+ * old_snapshot_threshold. So we call TransactionIdLimitedForOldSnapshots() on
+ * demand in here, if appropriate.
+ */
+static HTSV_Result
+heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
+{
+	HTSV_Result res;
+	TransactionId dead_after;
+
+	res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);
+
+	if (res != HEAPTUPLE_RECENTLY_DEAD)
+		return res;
+
+	/*
+	 * If we are already relying on the limited xmin, there is no need to
+	 * delay doing so anymore.
+	 */
+	if (prstate->old_snap_used)
+	{
+		Assert(TransactionIdIsValid(prstate->old_snap_xmin));
+
+		if (TransactionIdPrecedes(dead_after, prstate->old_snap_xmin))
+			res = HEAPTUPLE_DEAD;
+		return res;
+	}
+
+	/*
+	 * First check if GlobalVisTestIsRemovableXid() is sufficient to find the
+	 * row dead. If not, and old_snapshot_threshold is enabled, try to use the
+	 * lowered horizon.
+	 */
+	if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
+		res = HEAPTUPLE_DEAD;
+	else if (OldSnapshotThresholdActive())
+	{
+		/* haven't determined the limited horizon yet, request it */
+		if (!TransactionIdIsValid(prstate->old_snap_xmin))
+		{
+			TransactionId horizon =
+			GlobalVisTestNonRemovableHorizon(prstate->vistest);
+
+			TransactionIdLimitedForOldSnapshots(horizon, prstate->rel,
+												&prstate->old_snap_xmin,
+												&prstate->old_snap_ts);
+		}
+
+		if (TransactionIdIsValid(prstate->old_snap_xmin) &&
+			TransactionIdPrecedes(dead_after, prstate->old_snap_xmin))
+		{
+			/*
+			 * About to remove row based on snapshot_too_old. Need to raise
+			 * the threshold so problematic accesses would error.
+			 */
+			Assert(!prstate->old_snap_used);
+			SetOldSnapshotThresholdTimestamp(prstate->old_snap_ts,
+											 prstate->old_snap_xmin);
+			prstate->old_snap_used = true;
+			res = HEAPTUPLE_DEAD;
+		}
+	}
+
+	return res;
+}
+
+
 /*
  * Prune specified line pointer or a HOT chain originating at line pointer.
  *
@@ -349,9 +474,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  * Returns the number of tuples (to be) deleted from the page.
  */
 static int
-heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
-				 TransactionId OldestXmin,
-				 PruneState *prstate)
+heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 {
 	int			ndeleted = 0;
 	Page		dp = (Page) BufferGetPage(buffer);
@@ -366,7 +489,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 				i;
 	HeapTupleData tup;
 
-	tup.t_tableOid = RelationGetRelid(relation);
+	tup.t_tableOid = RelationGetRelid(prstate->rel);
 
 	rootlp = PageGetItemId(dp, rootoffnum);
 
@@ -401,7 +524,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 			 * either here or while following a chain below.  Whichever path
 			 * gets there first will mark the tuple unused.
 			 */
-			if (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer)
+			if (heap_prune_satisfies_vacuum(prstate, &tup, buffer)
 				== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
 			{
 				heap_prune_record_unused(prstate, rootoffnum);
@@ -485,7 +608,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 		 */
 		tupdead = recent_dead = false;
 
-		switch (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer))
+		switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer))
 		{
 			case HEAPTUPLE_DEAD:
 				tupdead = true;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bbc4598f75..44e2224dd55 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -788,6 +788,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	GlobalVisState *vistest;
 
 	pg_rusage_init(&ru0);
 
@@ -816,6 +817,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->nonempty_pages = 0;
 	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
+	vistest = GlobalVisTestFor(onerel);
+
 	/*
 	 * Initialize state for a parallel vacuum.  As of now, only one worker can
 	 * be used for an index, so we invoke parallelism only if there are at
@@ -1239,7 +1242,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 *
 		 * We count tuples removed by the pruning step as removed by VACUUM.
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
+		tups_vacuumed += heap_page_prune(onerel, buf, vistest, false,
+										 InvalidTransactionId, 0,
 										 &vacrelstats->latestRemovedXid);
 
 		/*
@@ -1596,14 +1600,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		}
 
 		/*
-		 * It's possible for the value returned by GetOldestXmin() to move
-		 * backwards, so it's not wrong for us to see tuples that appear to
-		 * not be visible to everyone yet, while PD_ALL_VISIBLE is already
-		 * set. The real safe xmin value never moves backwards, but
-		 * GetOldestXmin() is conservative and sometimes returns a value
-		 * that's unnecessarily small, so if we see that contradiction it just
-		 * means that the tuples that we think are not visible to everyone yet
-		 * actually are, and the PD_ALL_VISIBLE flag is correct.
+		 * It's possible for the value returned by
+		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+		 * wrong for us to see tuples that appear to not be visible to
+		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
+		 * xmin value never moves backwards, but
+		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
+		 * returns a value that's unnecessarily small, so if we see that
+		 * contradiction it just means that the tuples that we think are not
+		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
+		 * is correct.
 		 *
 		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
 		 * set, however.
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 6b9750c244a..3fb8688f8f4 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -519,7 +519,8 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amgettuple);
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
 
 	/*
 	 * The AM's amgettuple proc finds the next index entry matching the scan
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 32ad9e339a2..cf3dba96008 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -342,9 +342,9 @@ snapshots and registered snapshots as of the deletion are gone; which is
 overly strong, but is simple to implement within Postgres.  When marked
 dead, a deleted page is labeled with the next-transaction counter value.
 VACUUM can reclaim the page for re-use when this transaction number is
-older than RecentGlobalXmin.  As collateral damage, this implementation
-also waits for running XIDs with no snapshots and for snapshots taken
-until the next transaction to allocate an XID commits.
+guaranteed to be "visible to everyone".  As collateral damage, this
+implementation also waits for running XIDs with no snapshots and for
+snapshots taken until the next transaction to allocate an XID commits.
 
 Reclaiming a page doesn't actually change its state on disk --- we simply
 record it in the shared-memory free space map, from which it will be
@@ -411,8 +411,8 @@ page and also the correct place to hold the current value. We can avoid
 the cost of walking down the tree in such common cases.
 
 The optimization works on the assumption that there can only be one
-non-ignorable leaf rightmost page, and so even a RecentGlobalXmin style
-interlock isn't required.  We cannot fail to detect that our hint was
+non-ignorable leaf rightmost page, and so not even a visible-to-everyone
+style interlock is required.  We cannot fail to detect that our hint was
 invalidated, because there can only be one such page in the B-Tree at
 any time. It's possible that the page will be deleted and recycled
 without a backend's cached page also being detected as invalidated, but
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 75628e0eb98..9e6376f2c2b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -990,7 +990,7 @@ _bt_page_recyclable(Page page)
 	 */
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	if (P_ISDELETED(opaque) &&
-		TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
+		GlobalVisCheckRemovableXid(NULL, opaque->btpo.xact))
 		return true;
 	return false;
 }
@@ -2209,7 +2209,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * updated links to the target, ReadNewTransactionId() suffices as an
 	 * upper bound.  Any scan having retained a now-stale link is advertising
 	 * in its PGXACT an xmin less than or equal to the value we read here.  It
-	 * will continue to do so, holding back RecentGlobalXmin, for the duration
+	 * will continue to do so, holding back the xmin horizon, for the duration
 	 * of that scan.
 	 */
 	page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e947addef6b..b59ba02a32f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -807,6 +807,12 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
+	/*
+	 * XXX: If IndexVacuumInfo contained the heap relation, we could be more
+	 * aggressive about vacuuming non catalog relations by passing the table
+	 * to GlobalVisCheckRemovableXid().
+	 */
+
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
@@ -816,13 +822,12 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		result = true;
 	}
 	else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
-			 TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
-								   RecentGlobalXmin))
+			 GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
 	{
 		/*
 		 * If any oldest btpo.xact from a previously deleted page in the index
-		 * is older than RecentGlobalXmin, then at least one deleted page can
-		 * be recycled -- don't skip cleanup.
+		 * is visible to everyone, then at least one deleted page can be
+		 * recycled -- don't skip cleanup.
 		 */
 		result = true;
 	}
@@ -1276,14 +1281,13 @@ backtrack:
 				 * own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
-				 * before it finishes could have a RecentGlobalXmin with a
-				 * later xid than the VACUUM's OldestXmin cutoff.  These
-				 * backends might happen to opportunistically mark some index
-				 * tuples LP_DEAD before we reach them, even though they may
-				 * be after our cutoff.  We don't try to kill these "extra"
-				 * index tuples in _bt_delitems_vacuum().  This keep things
-				 * simple, and allows us to always avoid generating our own
-				 * conflicts.
+				 * before it finishes could have a visibility cutoff with a
+				 * later xid than VACUUM's OldestXmin cutoff.  These backends
+				 * might happen to opportunistically mark some index tuples
+				 * LP_DEAD before we reach them, even though they may be after
+				 * our cutoff.  We don't try to kill these "extra" index
+				 * tuples in _bt_delitems_vacuum().  This keeps things simple,
+				 * and allows us to always avoid generating our own conflicts.
 				 */
 				Assert(!BTreeTupleIsPivot(itup));
 				if (!BTreeTupleIsPosting(itup))
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 5d346da84fd..b097e98c3ba 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -928,11 +928,11 @@ btree_xlog_reuse_page(XLogReaderState *record)
 	 * Btree reuse_page records exist to provide a conflict point when we
 	 * reuse pages in the index via the FSM.  That's all they do though.
 	 *
-	 * latestRemovedXid was the page's btpo.xact.  The btpo.xact <
-	 * RecentGlobalXmin test in _bt_page_recyclable() conceptually mirrors the
-	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
-	 * Consequently, one XID value achieves the same exclusion effect on
-	 * primary and standby.
+	 * latestRemovedXid was the page's btpo.xact.  The
+	 * GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
+	 * mirrors the pgxact->xmin > limitXmin test in
+	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
+	 * same exclusion effect on primary and standby.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index bd98707f3c0..e1c58933f97 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -501,10 +501,14 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemToPlaceholder[MaxIndexTuplesPerPage];
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
+	GlobalVisState *vistest;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
 
+	/* XXX: providing heap relation would allow more pruning */
+	vistest = GlobalVisTestFor(NULL);
+
 	START_CRIT_SECTION();
 
 	/*
@@ -521,7 +525,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 		dt = (SpGistDeadTuple) PageGetItem(page, PageGetItemId(page, i));
 
 		if (dt->tupstate == SPGIST_REDIRECT &&
-			TransactionIdPrecedes(dt->xid, RecentGlobalXmin))
+			GlobalVisTestIsRemovableXid(vistest, dt->xid))
 		{
 			dt->tupstate = SPGIST_PLACEHOLDER;
 			Assert(opaque->nRedirection > 0);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eb9aac5fd39..4e2178dabab 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -257,31 +257,31 @@ simultaneously, we have one backend take ProcArrayLock and clear the XIDs
 of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
-latestCompletedXid variable.  This allows GetSnapshotData to use
-latestCompletedXid + 1 as xmax for its snapshot: there can be no
+latestCompletedFullXid variable.  This allows GetSnapshotData to use
+latestCompletedFullXid + 1 as xmax for its snapshot: there can be no
 transaction >= this xid value that the snapshot needs to consider as
 completed.
 
 In short, then, the rule is that no transaction may exit the set of
-currently-running transactions between the time we fetch latestCompletedXid
+currently-running transactions between the time we fetch latestCompletedFullXid
 and the time we finish building our snapshot.  However, this restriction
 only applies to transactions that have an XID --- read-only transactions
 can end without acquiring ProcArrayLock, since they don't affect anyone
-else's snapshot nor latestCompletedXid.
+else's snapshot nor latestCompletedFullXid.
 
 Transaction start, per se, doesn't have any interlocking with these
 considerations, since we no longer assign an XID immediately at transaction
 start.  But when we do decide to allocate an XID, GetNewTransactionId must
 store the new XID into the shared ProcArray before releasing XidGenLock.
-This ensures that all top-level XIDs <= latestCompletedXid are either
+This ensures that all top-level XIDs <= latestCompletedFullXid are either
 present in the ProcArray, or not running anymore.  (This guarantee doesn't
 apply to subtransaction XIDs, because of the possibility that there's not
 room for them in the subxid array; instead we guarantee that they are
 present or the overflow flag is set.)  If a backend released XidGenLock
 before storing its XID into MyPgXact, then it would be possible for another
-backend to allocate and commit a later XID, causing latestCompletedXid to
+backend to allocate and commit a later XID, causing latestCompletedFullXid to
 pass the first backend's XID, before that value became visible in the
-ProcArray.  That would break GetOldestXmin, as discussed below.
+ProcArray.  That would break ComputeXidHorizons, as discussed below.
 
 We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
 subxid array) without taking ProcArrayLock.  This was once necessary to
@@ -293,42 +293,50 @@ once, rather than assume they can read it multiple times and get the same
 answer each time.  (Use volatile-qualified pointers when doing this, to
 ensure that the C compiler does exactly what you tell it to.)
 
-Another important activity that uses the shared ProcArray is GetOldestXmin,
-which must determine a lower bound for the oldest xmin of any active MVCC
-snapshot, system-wide.  Each individual backend advertises the smallest
-xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
-live snapshots (eg, if it's between transactions or hasn't yet set a
-snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
-valid xmin fields.  It does this with only shared lock on ProcArrayLock,
-which means there is a potential race condition against other backends
-doing GetSnapshotData concurrently: we must be certain that a concurrent
-backend that is about to set its xmin does not compute an xmin less than
-what GetOldestXmin returns.  We ensure that by including all the active
-XIDs into the MIN() calculation, along with the valid xmins.  The rule that
-transactions can't exit without taking exclusive ProcArrayLock ensures that
-concurrent holders of shared ProcArrayLock will compute the same minimum of
-currently-active XIDs: no xact, in particular not the oldest, can exit
-while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
-active XID will be the same as that of any concurrent GetSnapshotData, and
-so it can't produce an overestimate.  If there is no active transaction at
-all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
-for the xmin that might be computed by concurrent or later GetSnapshotData
-calls.  (We know that no XID less than this could be about to appear in
-the ProcArray, because of the XidGenLock interlock discussed above.)
+Another important activity that uses the shared ProcArray is
+ComputeXidHorizons, which must determine a lower bound for the oldest xmin
+of any active MVCC snapshot, system-wide.  Each individual backend
+advertises the smallest xmin of its own snapshots in MyPgXact->xmin, or zero
+if it currently has no live snapshots (eg, if it's between transactions or
+hasn't yet set a snapshot for a new transaction).  ComputeXidHorizons takes
+the MIN() of the valid xmin fields.  It does this with only shared lock on
+ProcArrayLock, which means there is a potential race condition against other
+backends doing GetSnapshotData concurrently: we must be certain that a
+concurrent backend that is about to set its xmin does not compute an xmin
+less than what ComputeXidHorizons determines.  We ensure that by including
+all the active XIDs into the MIN() calculation, along with the valid xmins.
+The rule that transactions can't exit without taking exclusive ProcArrayLock
+ensures that concurrent holders of shared ProcArrayLock will compute the
+same minimum of currently-active XIDs: no xact, in particular not the
+oldest, can exit while we hold shared ProcArrayLock.  So
+ComputeXidHorizons's view of the minimum active XID will be the same as that
+of any concurrent GetSnapshotData, and so it can't produce an overestimate.
+If there is no active transaction at all, ComputeXidHorizons uses
+latestCompletedFullXid + 1, which is a lower bound for the xmin that might
+be computed by concurrent or later GetSnapshotData calls.  (We know that no
+XID less than this could be about to appear in the ProcArray, because of the
+XidGenLock interlock discussed above.)
 
-GetSnapshotData also performs an oldest-xmin calculation (which had better
-match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
-for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
-too expensive.  Note that while it is certain that two concurrent
-executions of GetSnapshotData will compute the same xmin for their own
-snapshots, as argued above, it is not certain that they will arrive at the
-same estimate of RecentGlobalXmin.  This is because we allow XID-less
-transactions to clear their MyPgXact->xmin asynchronously (without taking
-ProcArrayLock), so one execution might see what had been the oldest xmin,
-and another not.  This is OK since RecentGlobalXmin need only be a valid
-lower bound.  As noted above, we are already assuming that fetch/store
-of the xid fields is atomic, so assuming it for xmin as well is no extra
-risk.
+As GetSnapshotData is performance critical, it does not perform an accurate
+oldest-xmin calculation (it used to, until v13). The contents of a snapshot
+only depend on the xids of other backends, not their xmin. As a backend's
+xmin changes much more often than its xid, having GetSnapshotData look at
+xmins can lead to a lot of unnecessary cacheline ping-pong.  Instead
+GetSnapshotData updates approximate thresholds (one that guarantees that all
+deleted rows older than it can be removed, another determining that deleted
+rows newer than it cannot be removed). GlobalVisTest* uses those thresholds
+to make invisibility decisions, falling back to ComputeXidHorizons if
+necessary.
+
+Note that while it is certain that two concurrent executions of
+GetSnapshotData will compute the same xmin for their own snapshots, there is
+no such guarantee for the horizons computed by ComputeXidHorizons.  This is
+because we allow XID-less transactions to clear their MyPgXact->xmin
+asynchronously (without taking ProcArrayLock), so one execution might see
+what had been the oldest xmin, and another not.  This is OK since the
+thresholds need only be a valid lower bound.  As noted above, we are already
+assuming that fetch/store of the xid fields is atomic, so assuming it for
+xmin as well is no extra risk.
 
 
 pg_xact and pg_subtrans
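
The two approximate thresholds described in the README text above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `VisState`, `xid_is_removable` and `fake_accurate_horizon` are hypothetical names, XIDs are plain 64-bit integers, and the callback merely stands in for an accurate computation such as ComputeXidHorizons. The sketch also omits any limit on how often the accurate computation runs. None of this is the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in: 64-bit XIDs, so wraparound is ignored here. */
typedef uint64_t FullXid;

typedef struct VisState
{
    FullXid definitely_needed;  /* XIDs >= this are still needed */
    FullXid maybe_needed;       /* XIDs < this can be removed */
} VisState;

/* Hypothetical callback standing in for an accurate horizon computation. */
typedef FullXid (*AccurateHorizonFn) (void);

/* Counts how often the (expensive) accurate path was taken. */
static int accurate_calls = 0;

/* Pretend the accurate horizon is always 150 in this sketch. */
static FullXid
fake_accurate_horizon(void)
{
    accurate_calls++;
    return 150;
}

/*
 * Decide whether a row deleted by xid can be removed.  The cheap
 * thresholds answer most cases; only XIDs in between them trigger the
 * accurate computation.
 */
static bool
xid_is_removable(VisState *state, FullXid xid, AccurateHorizonFn recompute)
{
    FullXid horizon;

    if (xid < state->maybe_needed)
        return true;            /* definitely removable */
    if (xid >= state->definitely_needed)
        return false;           /* definitely still needed */

    /* In between: get an accurate answer and remember it. */
    horizon = recompute();
    if (horizon > state->maybe_needed)
        state->maybe_needed = horizon;

    return xid < state->maybe_needed;
}
```

Only XIDs that fall between the two thresholds pay for the accurate computation, and once the lower threshold has advanced, later tests against nearby XIDs are cheap again.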
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index e14b53bf9e3..00b8e4e50d7 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -566,3 +566,53 @@ GetNewObjectId(void)
 
 	return result;
 }
+
+
+#ifdef USE_ASSERT_CHECKING
+
+/*
+ * Assert that xid is within [oldestXid, nextFullXid], which is the range we
+ * expect XIDs coming from tables etc to be in.
+ *
+ * As ShmemVariableCache->oldestXid could change just after this call without
+ * further precautions, and as a wrapped-around xid could again fall within
+ * the valid range, this assertion can only detect if something is definitely
+ * wrong, but not establish correctness.
+ *
+ * This intentionally does not expose a return value, to avoid code being
+ * introduced that depends on the return value.
+ */
+void
+AssertTransactionInAllowableRange(TransactionId xid)
+{
+	TransactionId oldest_xid;
+	TransactionId next_xid;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* we may see bootstrap / frozen */
+	if (!TransactionIdIsNormal(xid))
+		return;
+
+	/*
+	 * We can't acquire XidGenLock, as this may be called with XidGenLock
+	 * already held (or with other locks that don't allow XidGenLock to be
+	 * nested). That's ok for our purposes though, since we already rely on
+	 * 32bit reads to be atomic. While nextFullXid is 64 bit, we only look at
+	 * the lower 32bit, so a skewed read doesn't hurt.
+	 *
+	 * There's no increased danger of falling outside [oldest, next] by
+	 * accessing them without a lock. xid needs to have been created with
+	 * GetNewTransactionId() in the originating session, and the locks there
+	 * pair with the memory barrier below.  We do however accept xid <=
+	 * next_xid, instead of just <, as xid could be from the procarray,
+	 * before we see the updated nextFullXid value.
+	 */
+	pg_memory_barrier();
+	oldest_xid = ShmemVariableCache->oldestXid;
+	next_xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+
+	Assert(TransactionIdFollowsOrEquals(xid, oldest_xid) ||
+		   TransactionIdPrecedesOrEquals(xid, next_xid));
+}
+#endif
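
The caveat in the comment above, that a wrapped-around xid can again fall within the valid range, follows from how 32-bit XIDs are compared and from what is lost when a 64-bit XID is truncated. A minimal sketch, assuming simplified stand-ins: `xid_precedes` and `xid_from_full` are hypothetical names, special XIDs (bootstrap / frozen) are ignored, and the conversion of the unsigned difference to `int32_t` relies on the usual two's-complement behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;

/*
 * Circular comparison: a precedes b if the difference, interpreted as a
 * signed 32-bit value, is negative.  Special XIDs are not handled here.
 */
static bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/* Truncate a 64-bit XID to its low 32 bits, discarding the epoch. */
static Xid
xid_from_full(uint64_t full)
{
    return (Xid) full;
}
```

Because two 64-bit XIDs that are exactly 2^32 apart truncate to the same 32-bit value, a range check on 32-bit XIDs can flag values that are definitely wrong, but cannot prove that a value is current.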
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0a97b1d37fb..4ec414e93ef 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7866,10 +7866,11 @@ StartupXLOG(void)
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
 	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
-	/* also initialize latestCompletedXid, to nextXid - 1 */
+	/* also initialize latestCompletedFullXid, to nextFullXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
-	TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);
+	ShmemVariableCache->latestCompletedFullXid =
+		ShmemVariableCache->nextFullXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedFullXid);
 	LWLockRelease(ProcArrayLock);
 
 	/*
@@ -9099,7 +9100,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9459,7 +9460,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c816..34b71b6c1c5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1056,7 +1056,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
 
 	/* Prepare for sampling block numbers */
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 576c7e63e99..22228f5684f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -955,8 +955,25 @@ vacuum_set_xid_limits(Relation rel,
 	 * working on a particular table at any time, and that each vacuum is
 	 * always an independent transaction.
 	 */
-	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+	*oldestXmin = GetOldestNonRemovableTransactionId(rel);
+
+	if (OldSnapshotThresholdActive())
+	{
+		TransactionId limit_xmin;
+		TimestampTz limit_ts;
+
+		if (TransactionIdLimitedForOldSnapshots(*oldestXmin, rel, &limit_xmin, &limit_ts))
+		{
+			/*
+			 * TODO: We should only set the threshold if we are pruning on the
+			 * basis of the increased limits. Not as crucial here as it is for
+			 * opportunistic pruning (which often happens at a much higher
+			 * frequency), but would still be a significant improvement.
+			 */
+			SetOldSnapshotThresholdTimestamp(limit_ts, limit_xmin);
+			*oldestXmin = limit_xmin;
+		}
+	}
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -1345,12 +1362,13 @@ vac_update_datfrozenxid(void)
 	bool		dirty = false;
 
 	/*
-	 * Initialize the "min" calculation with GetOldestXmin, which is a
-	 * reasonable approximation to the minimum relfrozenxid for not-yet-
-	 * committed pg_class entries for new tables; see AddNewRelationTuple().
-	 * So we cannot produce a wrong minimum by starting with this.
+	 * Initialize the "min" calculation with
+	 * GetOldestNonRemovableTransactionId(), which is a reasonable
+	 * approximation to the minimum relfrozenxid for not-yet-committed
+	 * pg_class entries for new tables; see AddNewRelationTuple().  So we
+	 * cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+	newFrozenXid = GetOldestNonRemovableTransactionId(NULL);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
@@ -1681,8 +1699,9 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	StartTransactionCommand();
 
 	/*
-	 * Functions in indexes may want a snapshot set.  Also, setting a snapshot
-	 * ensures that RecentGlobalXmin is kept truly recent.
+	 * Need to acquire a snapshot to prevent pg_subtrans from being truncated,
+	 * cutoff xids in local memory wrapping around, and to have updated xmin
+	 * horizons.
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 
@@ -1705,8 +1724,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 *
 		 * Note: these flags remain set until CommitTransaction or
 		 * AbortTransaction.  We don't want to clear them until we reset
-		 * MyPgXact->xid/xmin, else OldestXmin might appear to go backwards,
-		 * which is probably Not Good.
+		 * MyPgXact->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
+		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60e..ac97e28be19 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1877,6 +1877,10 @@ get_database_list(void)
 	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 	 * for anything that reads heap pages, because HOT may decide to prune
 	 * them even if the process doesn't attempt to modify any tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	StartTransactionCommand();
 	(void) GetTransactionSnapshot();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index aec885e9871..158b2f3d73b 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -122,6 +122,10 @@ get_subscription_list(void)
 	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 	 * for anything that reads heap pages, because HOT may decide to prune
 	 * them even if the process doesn't attempt to modify any tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	StartTransactionCommand();
 	(void) GetTransactionSnapshot();
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d5a9b568a68..7c11e1ab44c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1181,22 +1181,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 */
 	if (hot_standby_feedback)
 	{
-		TransactionId slot_xmin;
-
-		/*
-		 * Usually GetOldestXmin() would include both global replication slot
-		 * xmin and catalog_xmin in its calculations, but we want to derive
-		 * separate values for each of those. So we ask for an xmin that
-		 * excludes the catalog_xmin.
-		 */
-		xmin = GetOldestXmin(NULL,
-							 PROCARRAY_FLAGS_DEFAULT | PROCARRAY_SLOTS_XMIN);
-
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
-
-		if (TransactionIdIsValid(slot_xmin) &&
-			TransactionIdPrecedes(slot_xmin, xmin))
-			xmin = slot_xmin;
+		GetReplicationHorizons(&xmin, &catalog_xmin);
 	}
 	else
 	{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5e2210dd7bd..fd370d52b66 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2116,9 +2116,10 @@ ProcessStandbyHSFeedbackMessage(void)
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
-	 * the xmin will be taken into account by GetOldestXmin.  This will hold
-	 * back the removal of dead rows and thereby prevent the generation of
-	 * cleanup conflicts on the standby server.
+	 * the xmin will be taken into account by GetSnapshotData() /
+	 * ComputeXidHorizons().  This will hold back the removal of dead rows and
+	 * thereby prevent the generation of cleanup conflicts on the standby
+	 * server.
 	 *
 	 * There is a small window for a race condition here: although we just
 	 * checked that feedbackXmin precedes nextXid, the nextXid could have
@@ -2131,10 +2132,10 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * own xmin would prevent nextXid from advancing so far.
 	 *
 	 * We don't bother taking the ProcArrayLock here.  Setting the xmin field
-	 * is assumed atomic, and there's no real need to prevent a concurrent
-	 * GetOldestXmin.  (If we're moving our xmin forward, this is obviously
-	 * safe, and if we're moving it backwards, well, the data is at risk
-	 * already since a VACUUM could have just finished calling GetOldestXmin.)
+	 * is assumed atomic, and there's no real need to prevent concurrent
+	 * horizon determinations.  (If we're moving our xmin forward, this is
+	 * obviously safe, and if we're moving it backwards, well, the data is at
+	 * risk already since a VACUUM could already have determined the horizon.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
 	 * otherwise via the walsender's PGXACT entry. We can only track the
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index b4485335644..c011387ba90 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -99,6 +99,142 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+/*
+ * State for the GlobalVisTest* family of functions. Those functions can
+ * e.g. be used to decide if a deleted row can be removed without violating
+ * MVCC semantics: If the deleted row's xmax is not considered to be running
+ * by anyone, the row can be removed.
+ *
+ * To avoid slowing down GetSnapshotData(), we don't calculate a precise
+ * cutoff XID while building a snapshot (looking at the frequently changing
+ * xmins scales badly). Instead we compute two boundaries while building the
+ * snapshot:
+ *
+ * 1) definitely_needed, indicating that rows deleted by XIDs >=
+ *    definitely_needed are definitely still visible.
+ *
+ * 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
+ *    definitely be removed.
+ *
+ * When testing an XID that falls in between the two (i.e. XID >= maybe_needed
+ * && XID < definitely_needed), the boundaries can be recomputed (using
+ * ComputeXidHorizons()) to get a more accurate answer. This is cheaper than
+ * maintaining an accurate value all the time.
+ *
+ * As it is not cheap to compute accurate boundaries, we limit the number of
+ * times that happens in short succession. See GlobalVisTestShouldUpdate().
+ *
+ *
+ * There are three backend lifetime instances of this struct, optimized for
+ * different types of relations. As e.g. a normal user defined table in one
+ * database is inaccessible to backends connected to another database, a test
+ * specific to a relation can be more aggressive than a test for a shared
+ * relation.  Currently we track three different states:
+ *
+ * 1) GlobalVisSharedRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in any database, nor a
+ *    replication slot's xmin, nor a replication slot's catalog_xmin might
+ *    still consider XID as running.
+ *
+ * 2) GlobalVisCatalogRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in the current
+ *    database, nor a replication slot's xmin, nor a replication slot's
+ *    catalog_xmin might still consider XID as running.
+ *
+ *    I.e. the difference to GlobalVisSharedRels is that
+ *    snapshots in other databases are ignored.
+ *
+ * 3) GlobalVisDataRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in the current
+ *    database, nor a replication slot's xmin consider XID as running.
+ *
+ *    I.e. the difference to GlobalVisCatalogRels is that
+ *    replication slot's catalog_xmin is not taken into account.
+ *
+ * GlobalVisTestFor(relation) returns the appropriate state
+ * for the relation.
+ *
+ * The boundaries are FullTransactionIds instead of TransactionIds to avoid
+ * wraparound dangers. Otherwise there would e.g. be no procarray state
+ * preventing maybe_needed from becoming old enough to wrap around after the
+ * GetSnapshotData() call.
+ *
+ * The typedef is in the header.
+ */
+struct GlobalVisState
+{
+	/* XIDs >= are considered running by some backend */
+	FullTransactionId definitely_needed;
+
+	/* XIDs < are not considered to be running by any backend */
+	FullTransactionId maybe_needed;
+};
+
+/*
+ * Result of ComputeXidHorizons().
+ */
+typedef struct ComputeXidHorizonsResult
+{
+	/*
+	 * The value of ShmemVariableCache->latestCompletedFullXid when
+	 * ComputeXidHorizons() held ProcArrayLock.
+	 */
+	FullTransactionId latest_completed;
+
+	/*
+	 * The same for procArray->replication_slot_xmin and
+	 * procArray->replication_slot_catalog_xmin.
+	 */
+	TransactionId slot_xmin;
+	TransactionId slot_catalog_xmin;
+
+	/*
+	 * Oldest xid that any backend might still consider running. This needs to
+	 * include processes running VACUUM, in contrast to the normal visibility
+	 * cutoffs, as vacuum needs to be able to perform pg_subtrans lookups when
+	 * determining visibility, but doesn't care about rows above its xmin
+	 * being removed.
+	 *
+	 * This likely should only be needed to determine whether pg_subtrans can
+	 * be truncated. It currently includes the effects of replication slots,
+	 * for historical reasons. But that could likely be changed.
+	 */
+	TransactionId oldest_considered_running;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in shared
+	 * tables.
+	 *
+	 * This includes the effects of replication slots. If that's not desired,
+	 * look at shared_oldest_nonremovable_raw.
+	 */
+	TransactionId shared_oldest_nonremovable;
+
+	/*
+	 * Oldest xid that may be necessary to retain in shared tables. This is
+	 * the same as shared_oldest_nonremovable, except that it is not affected
+	 * by replication slot's catalog_xmin.
+	 *
+	 * This is mainly useful to be able to send the catalog_xmin to upstream
+	 * streaming replication servers via hot_standby_feedback, so they can
+	 * apply the limit only when accessing catalog tables.
+	 */
+	TransactionId shared_oldest_nonremovable_raw;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in non-shared
+	 * catalog tables.
+	 */
+	TransactionId catalog_oldest_nonremovable;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in normal user
+	 * defined tables.
+	 */
+	TransactionId data_oldest_nonremovable;
+} ComputeXidHorizonsResult;
+
+
 static ProcArrayStruct *procArray;
 
 static PGPROC *allProcs;
@@ -118,6 +254,22 @@ static TransactionId latestObservedXid = InvalidTransactionId;
  */
 static TransactionId standbySnapshotPendingXmin;
 
+/*
+ * State for visibility checks on different types of relations. See struct
+ * GlobalVisState for details. As shared, catalog, and user defined
+ * relations can have different horizons, one such state exists for each.
+ */
+static GlobalVisState GlobalVisSharedRels;
+static GlobalVisState GlobalVisCatalogRels;
+static GlobalVisState GlobalVisDataRels;
+
+/*
+ * This backend's RecentXmin at the last time the accurate xmin horizon was
+ * recomputed, or InvalidTransactionId if it has not. Used to limit how many
+ * times accurate horizons are recomputed. See GlobalVisTestShouldUpdate().
+ */
+static TransactionId ComputeXidHorizonsResultLastXmin;
+
 #ifdef XIDCACHE_DEBUG
 
 /* counters for XidCache measurement */
@@ -175,6 +327,10 @@ static void KnownAssignedXidsReset(void);
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc,
 												   PGXACT *pgxact, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
+static void MaintainLatestCompletedXid(TransactionId latestXid);
+static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
+
+static inline FullTransactionId FullXidViaRelative(FullTransactionId rel, TransactionId xid);
 
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
@@ -349,9 +505,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
 		/* Advance global latestCompletedXid while holding the lock */
-		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-								  latestXid))
-			ShmemVariableCache->latestCompletedXid = latestXid;
+		MaintainLatestCompletedXid(latestXid);
 	}
 	else
 	{
@@ -464,9 +618,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	pgxact->overflowed = false;
 
 	/* Also advance global latestCompletedXid while holding the lock */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  latestXid))
-		ShmemVariableCache->latestCompletedXid = latestXid;
+	MaintainLatestCompletedXid(latestXid);
 }
 
 /*
@@ -621,6 +773,58 @@ ProcArrayClearTransaction(PGPROC *proc)
 	pgxact->overflowed = false;
 }
 
+/*
+ * Update ShmemVariableCache->latestCompletedFullXid to point to latestXid if
+ * currently older.
+ */
+static void
+MaintainLatestCompletedXid(TransactionId latestXid)
+{
+	FullTransactionId cur_latest = ShmemVariableCache->latestCompletedFullXid;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+	Assert(FullTransactionIdIsValid(cur_latest));
+
+	if (TransactionIdPrecedes(XidFromFullTransactionId(cur_latest), latestXid))
+	{
+		ShmemVariableCache->latestCompletedFullXid =
+			FullXidViaRelative(cur_latest, latestXid);
+	}
+
+	Assert(IsBootstrapProcessingMode() ||
+		   FullTransactionIdIsNormal(ShmemVariableCache->latestCompletedFullXid));
+}
+
+/*
+ * Same as MaintainLatestCompletedXid, except for use during WAL replay.
+ */
+static void
+MaintainLatestCompletedXidRecovery(TransactionId latestXid)
+{
+	FullTransactionId cur_latest = ShmemVariableCache->latestCompletedFullXid;
+	FullTransactionId rel;
+
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	Assert(LWLockHeldByMe(ProcArrayLock));
+
+	/*
+	 * Need a FullTransactionId to compare latestXid with. Can't rely on
+	 * latestCompletedFullXid to be initialized in recovery. But in recovery
+	 * it's safe to access nextFullXid without a lock for the startup process.
+	 */
+	rel = ShmemVariableCache->nextFullXid;
+	Assert(FullTransactionIdIsValid(ShmemVariableCache->nextFullXid));
+
+	if (!FullTransactionIdIsValid(cur_latest) ||
+		TransactionIdPrecedes(XidFromFullTransactionId(cur_latest), latestXid))
+	{
+		ShmemVariableCache->latestCompletedFullXid =
+			FullXidViaRelative(rel, latestXid);
+	}
+
+	Assert(FullTransactionIdIsNormal(ShmemVariableCache->latestCompletedFullXid));
+}
+
 /*
  * ProcArrayInitRecovery -- initialize recovery xid mgmt environment
  *
@@ -841,7 +1045,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	 * Now we've got the running xids we need to set the global values that
 	 * are used to track snapshots as they evolve further.
 	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
+	 * - latestCompletedFullXid which will be the xmax for snapshots
 	 * - lastOverflowedXid which shows whether snapshots overflow
 	 * - nextXid
 	 *
@@ -867,14 +1071,11 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 
 	/*
 	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
+	 * logging the snapshot then latestCompletedFullXid may already be higher
+	 * than the value from the snapshot, so check before we use the incoming
+	 * value. It also might not yet be set at all.
 	 */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  running->latestCompletedXid))
-		ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
-
-	Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
+	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
 
 	LWLockRelease(ProcArrayLock);
 
@@ -1048,10 +1249,11 @@ TransactionIdIsInProgress(TransactionId xid)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
-	 * Now that we have the lock, we can check latestCompletedXid; if the
+	 * Now that we have the lock, we can check latestCompletedFullXid; if the
 	 * target Xid is after that, it's surely still running.
 	 */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid, xid))
+	if (TransactionIdPrecedes(XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid),
+							  xid))
 	{
 		LWLockRelease(ProcArrayLock);
 		xc_by_latest_xid_inc();
@@ -1248,159 +1450,183 @@ TransactionIdIsActive(TransactionId xid)
 
 
 /*
- * GetOldestXmin -- returns oldest transaction that was running
- *					when any current transaction was started.
+ * Determine XID horizons.
  *
- * If rel is NULL or a shared relation, all backends are considered, otherwise
- * only backends running in this database are considered.
+ * This is used by wrapper functions like GetOldestNonRemovableTransactionId()
+ * (for VACUUM), GetReplicationHorizons() (for hot_standby_feedback), etc as
+ * well as "internally" by GlobalVisUpdate() (see comment above struct
+ * GlobalVisState).
  *
- * The flags are used to ignore the backends in calculation when any of the
- * corresponding flags is set. Typically, if you want to ignore ones with
- * PROC_IN_VACUUM flag, you can use PROCARRAY_FLAGS_VACUUM.
+ * See ComputeXidHorizonsResult for the various computed horizons.
  *
- * PROCARRAY_SLOTS_XMIN causes GetOldestXmin to ignore the xmin and
- * catalog_xmin of any replication slots that exist in the system when
- * calculating the oldest xmin.
+ * For VACUUM, separate horizons (used to decide which deleted tuples must
+ * be preserved) are computed for shared and non-shared tables.  For shared
+ * relations backends in all databases must be considered, but for non-shared
+ * relations that's not required, since only backends in my own database could
+ * ever see the tuples in them. Also, we can ignore concurrently running lazy
+ * VACUUMs because (a) they must be working on other tables, and (b) they
+ * don't need to do snapshot-based lookups.
  *
- * This is used by VACUUM to decide which deleted tuples must be preserved in
- * the passed in table. For shared relations backends in all databases must be
- * considered, but for non-shared relations that's not required, since only
- * backends in my own database could ever see the tuples in them. Also, we can
- * ignore concurrently running lazy VACUUMs because (a) they must be working
- * on other tables, and (b) they don't need to do snapshot-based lookups.
- *
- * This is also used to determine where to truncate pg_subtrans.  For that
- * backends in all databases have to be considered, so rel = NULL has to be
- * passed in.
+ * This also computes a horizon used to truncate pg_subtrans. For that
+ * backends in all databases have to be considered, and concurrently running
+ * lazy VACUUMs cannot be ignored, as they still may perform pg_subtrans
+ * accesses.
  *
  * Note: we include all currently running xids in the set of considered xids.
  * This ensures that if a just-started xact has not yet set its snapshot,
  * when it does set the snapshot it cannot set xmin less than what we compute.
  * See notes in src/backend/access/transam/README.
  *
- * Note: despite the above, it's possible for the calculated value to move
- * backwards on repeated calls. The calculated value is conservative, so that
- * anything older is definitely not considered as running by anyone anymore,
- * but the exact value calculated depends on a number of things. For example,
- * if rel = NULL and there are no transactions running in the current
- * database, GetOldestXmin() returns latestCompletedXid. If a transaction
+ * Note: despite the above, it's possible for the calculated values to move
+ * backwards on repeated calls. The calculated values are conservative, so
+ * that anything older is definitely not considered as running by anyone
+ * anymore, but the exact values calculated depend on a number of things. For
+ * example, if there are no transactions running in the current database, the
+ * horizon for normal tables will be latestCompletedFullXid. If a transaction
  * begins after that, its xmin will include in-progress transactions in other
  * databases that started earlier, so another call will return a lower value.
  * Nonetheless it is safe to vacuum a table in the current database with the
  * first result.  There are also replication-related effects: a walsender
  * process can set its xmin based on transactions that are no longer running
  * on the primary but are still being replayed on the standby, thus possibly
- * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * unless the standby uses a replication slot to make its xmin persistent
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
- * The return value is also adjusted with vacuum_defer_cleanup_age, so
- * increasing that setting on the fly is another easy way to make
- * GetOldestXmin() move backwards, with no consequences for data integrity.
+ * making the values go backwards.  In this case there is a possibility that
+ * we lose data that the standby would like to have, but unless the standby
+ * uses a replication slot to make its xmin persistent there is little we can
+ * do about that --- data is only protected if the walsender runs continuously
+ * while queries are executed on the standby.  (The Hot Standby code deals
+ * with such cases by failing standby queries that needed to access
+ * already-removed data, so there's no integrity bug.)  The computed values
+ * are also adjusted with vacuum_defer_cleanup_age, so increasing that setting
+ * on the fly is another easy way to make horizons move backwards, with no
+ * consequences for data integrity.
  */
-TransactionId
-GetOldestXmin(Relation rel, int flags)
+static void
+ComputeXidHorizons(ComputeXidHorizonsResult *h)
 {
 	ProcArrayStruct *arrayP = procArray;
-	TransactionId result;
-	int			index;
-	bool		allDbs;
+	TransactionId kaxmin;
+	bool		in_recovery = RecoveryInProgress();
 
-	TransactionId replication_slot_xmin = InvalidTransactionId;
-	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
-
-	/*
-	 * If we're not computing a relation specific limit, or if a shared
-	 * relation has been passed in, backends in all databases have to be
-	 * considered.
-	 */
-	allDbs = rel == NULL || rel->rd_rel->relisshared;
-
-	/* Cannot look for individual databases during recovery */
-	Assert(allDbs || !RecoveryInProgress());
+	/* inferred after ProcArrayLock is released */
+	h->catalog_oldest_nonremovable = InvalidTransactionId;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	/*
-	 * We initialize the MIN() calculation with latestCompletedXid + 1. This
-	 * is a lower bound for the XIDs that might appear in the ProcArray later,
-	 * and so protects us against overestimating the result due to future
-	 * additions.
-	 */
-	result = ShmemVariableCache->latestCompletedXid;
-	Assert(TransactionIdIsNormal(result));
-	TransactionIdAdvance(result);
+	h->latest_completed = ShmemVariableCache->latestCompletedFullXid;
 
-	for (index = 0; index < arrayP->numProcs; index++)
+	/*
+	 * We initialize the MIN() calculation with latestCompletedFullXid + 1.
+	 * This is a lower bound for the XIDs that might appear in the ProcArray
+	 * later, and so protects us against overestimating the result due to
+	 * future additions.
+	 */
+	{
+		TransactionId initial;
+
+		initial = XidFromFullTransactionId(h->latest_completed);
+		Assert(TransactionIdIsValid(initial));
+		TransactionIdAdvance(initial);
+
+		h->oldest_considered_running = initial;
+		h->shared_oldest_nonremovable = initial;
+		h->data_oldest_nonremovable = initial;
+	}
+
+	/*
+	 * Fetch slot horizons while ProcArrayLock is held - the
+	 * LWLockAcquire/LWLockRelease are a barrier, ensuring this happens inside
+	 * the lock.
+	 */
+	h->slot_xmin = procArray->replication_slot_xmin;
+	h->slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+
+	for (int index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
 		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		TransactionId xid;
+		TransactionId xmin;
 
-		if (pgxact->vacuumFlags & (flags & PROCARRAY_PROC_FLAGS_MASK))
+		/* Fetch xid and xmin just once - see GetNewTransactionId */
+		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+
+		/*
+		 * Consider both the transaction's Xmin, and its Xid.
+		 *
+		 * We must check both because a transaction might have an Xmin but not
+		 * (yet) an Xid; conversely, if it has an Xid, that could determine
+		 * some not-yet-set Xmin.
+		 */
+		xmin = TransactionIdOlder(xmin, xid);
+
+		/* if neither is set, this proc doesn't influence the horizon */
+		if (!TransactionIdIsValid(xmin))
 			continue;
 
-		if (allDbs ||
+		/*
+		 * Don't ignore any procs when determining which transactions might be
+		 * considered running.  While slots should ensure logical decoding
+		 * backends are protected even without this check, it can't hurt to
+		 * include them here as well.
+		 */
+		h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running, xmin);
+
+		/*
+		 * Skip over backends either vacuuming (which is ok with rows being
+		 * removed, as long as pg_subtrans is not truncated) or doing logical
+		 * decoding (which manages xmin separately, check below).
+		 */
+		if (pgxact->vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
+			continue;
+
+		/* shared tables need to take backends in all databases into account */
+		h->shared_oldest_nonremovable =
+			TransactionIdOlder(h->shared_oldest_nonremovable, xmin);
+
+		/*
+		 * Normally queries in other databases are ignored for anything but
+		 * the shared horizon. But in recovery we cannot compute an accurate
+		 * per-database horizon as all xids are managed via the
+		 * KnownAssignedXids machinery.
+		 */
+		if (in_recovery ||
 			proc->databaseId == MyDatabaseId ||
 			proc->databaseId == 0)	/* always include WalSender */
 		{
-			/* Fetch xid just once - see GetNewTransactionId */
-			TransactionId xid = UINT32_ACCESS_ONCE(pgxact->xid);
-
-			/* First consider the transaction's own Xid, if any */
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, result))
-				result = xid;
-
-			/*
-			 * Also consider the transaction's Xmin, if set.
-			 *
-			 * We must check both Xid and Xmin because a transaction might
-			 * have an Xmin but not (yet) an Xid; conversely, if it has an
-			 * Xid, that could determine some not-yet-set Xmin.
-			 */
-			xid = UINT32_ACCESS_ONCE(pgxact->xmin);
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, result))
-				result = xid;
+			h->data_oldest_nonremovable =
+				TransactionIdOlder(h->data_oldest_nonremovable, xmin);
 		}
 	}
 
 	/*
-	 * Fetch into local variable while ProcArrayLock is held - the
-	 * LWLockRelease below is a barrier, ensuring this happens inside the
-	 * lock.
+	 * If in recovery, fetch the oldest xid in KnownAssignedXids; it will be
+	 * applied after the lock is released.
 	 */
-	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (in_recovery)
+		kaxmin = KnownAssignedXidsGetOldestXmin();
 
-	if (RecoveryInProgress())
+	/*
+	 * No other information needed, so release the lock immediately. The rest
+	 * of the computations can be done without a lock.
+	 */
+	LWLockRelease(ProcArrayLock);
+
+	if (in_recovery)
 	{
-		/*
-		 * Check to see whether KnownAssignedXids contains an xid value older
-		 * than the main procarray.
-		 */
-		TransactionId kaxmin = KnownAssignedXidsGetOldestXmin();
-
-		LWLockRelease(ProcArrayLock);
-
-		if (TransactionIdIsNormal(kaxmin) &&
-			TransactionIdPrecedes(kaxmin, result))
-			result = kaxmin;
+		h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running, kaxmin);
+		h->shared_oldest_nonremovable =
+			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
+		h->data_oldest_nonremovable =
+			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
 	}
 	else
 	{
 		/*
-		 * No other information needed, so release the lock immediately.
-		 */
-		LWLockRelease(ProcArrayLock);
-
-		/*
-		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age,
-		 * being careful not to generate a "permanent" XID.
+		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age.
 		 *
 		 * vacuum_defer_cleanup_age provides some additional "slop" for the
 		 * benefit of hot standby queries on standby servers.  This is quick
@@ -1412,34 +1638,143 @@ GetOldestXmin(Relation rel, int flags)
 		 * in varsup.c.  Also note that we intentionally don't apply
 		 * vacuum_defer_cleanup_age on standby servers.
 		 */
-		result -= vacuum_defer_cleanup_age;
-		if (!TransactionIdIsNormal(result))
-			result = FirstNormalTransactionId;
+		h->oldest_considered_running =
+			TransactionIdRetreatedBy(h->oldest_considered_running,
+									 vacuum_defer_cleanup_age);
+		h->shared_oldest_nonremovable =
+			TransactionIdRetreatedBy(h->shared_oldest_nonremovable,
+									 vacuum_defer_cleanup_age);
+		h->data_oldest_nonremovable =
+			TransactionIdRetreatedBy(h->data_oldest_nonremovable,
+									 vacuum_defer_cleanup_age);
 	}
 
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
 	 */
-	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
-		TransactionIdIsValid(replication_slot_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_xmin, result))
-		result = replication_slot_xmin;
+	h->shared_oldest_nonremovable =
+		TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
+	h->data_oldest_nonremovable =
+		TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
 
 	/*
-	 * After locks have been released and vacuum_defer_cleanup_age has been
-	 * applied, check whether we need to back up further to make logical
-	 * decoding possible. We need to do so if we're computing the global limit
-	 * (rel = NULL) or if the passed relation is a catalog relation of some
-	 * kind.
+	 * The only difference between catalog / data horizons is that the slot's
+	 * catalog xmin is applied to the catalog one (so catalogs can be accessed
+	 * for logical decoding). Initialize with data horizon, and then back up
+	 * further if necessary. Have to back up the shared horizon as well, since
+	 * that also can contain catalogs.
 	 */
-	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
-		(rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	h->shared_oldest_nonremovable_raw = h->shared_oldest_nonremovable;
+	h->shared_oldest_nonremovable =
+		TransactionIdOlder(h->shared_oldest_nonremovable,
+						   h->slot_catalog_xmin);
+	h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+	h->catalog_oldest_nonremovable =
+		TransactionIdOlder(h->catalog_oldest_nonremovable,
+						   h->slot_catalog_xmin);
 
-	return result;
+	/*
+	 * It's possible that slots / vacuum_defer_cleanup_age backed up the
+	 * horizons further than oldest_considered_running. Fix.
+	 */
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->shared_oldest_nonremovable);
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->catalog_oldest_nonremovable);
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->data_oldest_nonremovable);
+
+	/*
+	 * shared horizons have to be at least as old as the oldest visible in
+	 * current db
+	 */
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->catalog_oldest_nonremovable));
+
+	/*
+	 * Horizons need to ensure that pg_subtrans access is still possible for
+	 * the relevant backends.
+	 */
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->shared_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->catalog_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->data_oldest_nonremovable));
+	Assert(!TransactionIdIsValid(h->slot_xmin) ||
+		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->slot_xmin));
+	Assert(!TransactionIdIsValid(h->slot_catalog_xmin) ||
+		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->slot_catalog_xmin));
+}
+
+/*
+ * Return the oldest XID for which deleted tuples must be preserved in the
+ * passed table.
+ *
+ * If rel is not NULL the horizon may be considerably more recent than
+ * otherwise (i.e. more tuples will be removable). In the NULL case a horizon
+ * that is correct (but not optimal) for all relations will be returned.
+ *
+ * This is used by VACUUM to decide which deleted tuples must be preserved in
+ * the passed in table.
+ */
+TransactionId
+GetOldestNonRemovableTransactionId(Relation rel)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	/* select horizon appropriate for relation */
+	if (rel == NULL || rel->rd_rel->relisshared)
+		return horizons.shared_oldest_nonremovable;
+	else if (RelationIsAccessibleInLogicalDecoding(rel))
+		return horizons.catalog_oldest_nonremovable;
+	else
+		return horizons.data_oldest_nonremovable;
+}
+
+/*
+ * Return the oldest transaction id any currently running backend might still
+ * consider running. This should not be used for visibility / pruning
+ * determinations (see GetOldestNonRemovableTransactionId()), but for
+ * decisions like up to where pg_subtrans can be truncated.
+ */
+TransactionId
+GetOldestTransactionIdConsideredRunning(void)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	return horizons.oldest_considered_running;
+}
+
+/*
+ * Return the visibility horizons for a hot standby feedback message.
+ */
+void
+GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	/*
+	 * Don't want to use shared_oldest_nonremovable here, as that contains the
+	 * effect of replication slot's catalog_xmin. We want to send a separate
+	 * feedback for the catalog horizon, so the primary can remove data table
+	 * contents more aggressively.
+	 */
+	*xmin = horizons.shared_oldest_nonremovable_raw;
+	*catalog_xmin = horizons.slot_catalog_xmin;
 }
 
 /*
@@ -1490,12 +1825,10 @@ GetMaxSnapshotSubxidCount(void)
  *			current transaction (this is the same as MyPgXact->xmin).
  *		RecentXmin: the xmin computed for the most recent snapshot.  XIDs
  *			older than this are known not running any more.
- *		RecentGlobalXmin: the global xmin (oldest TransactionXmin across all
- *			running transactions, except those running LAZY VACUUM).  This is
- *			the same computation done by
- *			GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM).
- *		RecentGlobalDataXmin: the global xmin for non-catalog tables
- *			>= RecentGlobalXmin
+ *
+ * Also try to advance the bounds of GlobalVisSharedRels,
+ * GlobalVisCatalogRels and GlobalVisDataRels for the benefit of the
+ * GlobalVisTest* family of functions.
  *
  * Note: this function should probably not be called with an argument that's
  * not statically allocated (see xip allocation below).
@@ -1506,11 +1839,12 @@ GetSnapshotData(Snapshot snapshot)
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId xmin;
 	TransactionId xmax;
-	TransactionId globalxmin;
 	int			index;
 	int			count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
+	FullTransactionId latest_completed;
+	TransactionId oldestxid;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1554,13 +1888,16 @@ GetSnapshotData(Snapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
+	latest_completed = ShmemVariableCache->latestCompletedFullXid;
+	oldestxid = ShmemVariableCache->oldestXid;
+
 	/* xmax is always latestCompletedXid + 1 */
-	xmax = ShmemVariableCache->latestCompletedXid;
-	Assert(TransactionIdIsNormal(xmax));
+	xmax = XidFromFullTransactionId(latest_completed);
 	TransactionIdAdvance(xmax);
+	Assert(TransactionIdIsNormal(xmax));
 
 	/* initialize xmin calculation with xmax */
-	globalxmin = xmin = xmax;
+	xmin = xmax;
 
 	snapshot->takenDuringRecovery = RecoveryInProgress();
 
@@ -1589,12 +1926,6 @@ GetSnapshotData(Snapshot snapshot)
 				(PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
-			/* Update globalxmin to be the smallest valid xmin */
-			xid = UINT32_ACCESS_ONCE(pgxact->xmin);
-			if (TransactionIdIsNormal(xid) &&
-				NormalTransactionIdPrecedes(xid, globalxmin))
-				globalxmin = xid;
-
 			/* Fetch xid just once - see GetNewTransactionId */
 			xid = UINT32_ACCESS_ONCE(pgxact->xid);
 
@@ -1710,34 +2041,78 @@ GetSnapshotData(Snapshot snapshot)
 
 	LWLockRelease(ProcArrayLock);
 
-	/*
-	 * Update globalxmin to include actual process xids.  This is a slightly
-	 * different way of computing it than GetOldestXmin uses, but should give
-	 * the same result.
-	 */
-	if (TransactionIdPrecedes(xmin, globalxmin))
-		globalxmin = xmin;
+	/* maintain state for GlobalVis* */
+	{
+		TransactionId def_vis_xid;
+		TransactionId def_vis_xid_data;
+		FullTransactionId def_vis_fxid;
+		FullTransactionId def_vis_fxid_data;
+		FullTransactionId oldestfxid;
 
-	/* Update global variables too */
-	RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
-	if (!TransactionIdIsNormal(RecentGlobalXmin))
-		RecentGlobalXmin = FirstNormalTransactionId;
+		/*
+		 * Converting oldestXid is only safe when xid horizon cannot advance,
+		 * i.e. holding locks. While we don't hold the lock anymore, all the
+		 * necessary data has been gathered with lock held.
+		 */
+		oldestfxid = FullXidViaRelative(latest_completed, oldestxid);
 
-	/* Check whether there's a replication slot requiring an older xmin. */
-	if (TransactionIdIsValid(replication_slot_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
-		RecentGlobalXmin = replication_slot_xmin;
+		/* apply vacuum_defer_cleanup_age */
+		def_vis_xid_data =
+			TransactionIdRetreatedBy(xmin, vacuum_defer_cleanup_age);
 
-	/* Non-catalog tables can be vacuumed if older than this xid */
-	RecentGlobalDataXmin = RecentGlobalXmin;
+		/* Check whether there's a replication slot requiring an older xmin. */
+		def_vis_xid_data =
+			TransactionIdOlder(def_vis_xid_data, replication_slot_xmin);
 
-	/*
-	 * Check whether there's a replication slot requiring an older catalog
-	 * xmin.
-	 */
-	if (TransactionIdIsNormal(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, RecentGlobalXmin))
-		RecentGlobalXmin = replication_slot_catalog_xmin;
+		/*
+		 * Rows in non-shared, non-catalog tables possibly could be vacuumed
+		 * if older than this xid.
+		 */
+		def_vis_xid = def_vis_xid_data;
+
+		/*
+		 * Check whether there's a replication slot requiring an older catalog
+		 * xmin.
+		 */
+		def_vis_xid =
+			TransactionIdOlder(replication_slot_catalog_xmin, def_vis_xid);
+
+		def_vis_fxid = FullXidViaRelative(latest_completed, def_vis_xid);
+		def_vis_fxid_data = FullXidViaRelative(latest_completed, def_vis_xid_data);
+
+		/*
+		 * Check if we can increase upper bound. As a previous
+		 * GlobalVisUpdate() might have computed more aggressive values, don't
+		 * overwrite them if so.
+		 */
+		GlobalVisSharedRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid,
+								   GlobalVisSharedRels.definitely_needed);
+		GlobalVisCatalogRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid,
+								   GlobalVisCatalogRels.definitely_needed);
+		GlobalVisDataRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid_data,
+								   GlobalVisDataRels.definitely_needed);
+
+		/*
+		 * Check if we know that we can initialize or increase the lower
+		 * bound. Currently the only cheap way to do so is to use
+		 * ShmemVariableCache->oldestXid as input.
+		 *
+		 * We should definitely be able to do better. We could e.g. put a
+		 * global lower bound value into ShmemVariableCache.
+		 */
+		GlobalVisSharedRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisSharedRels.maybe_needed,
+								   oldestfxid);
+		GlobalVisCatalogRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
+								   oldestfxid);
+		GlobalVisDataRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
+								   oldestfxid);
+	}
 
 	RecentXmin = xmin;
 
@@ -1984,7 +2359,7 @@ GetRunningTransactionData(void)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	LWLockAcquire(XidGenLock, LW_SHARED);
 
-	latestCompletedXid = ShmemVariableCache->latestCompletedXid;
+	latestCompletedXid = XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid);
 
 	oldestRunningXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
 
@@ -3207,9 +3582,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 		elog(WARNING, "did not find subXID %u in MyProc", xid);
 
 	/* Also advance global latestCompletedXid while holding the lock */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  latestXid))
-		ShmemVariableCache->latestCompletedXid = latestXid;
+	MaintainLatestCompletedXid(latestXid);
 
 	LWLockRelease(ProcArrayLock);
 }
@@ -3236,6 +3609,276 @@ DisplayXidCache(void)
 }
 #endif							/* XIDCACHE_DEBUG */
 
+/*
+ * If rel != NULL, return test state appropriate for relation, otherwise
+ * return state usable for all relations.  The latter may consider XIDs as
+ * not-yet-visible-to-everyone that a state for a specific relation would
+ * already consider visible-to-everyone.
+ *
+ * This needs to be called while a snapshot is active or registered, otherwise
+ * there are wraparound and other dangers.
+ *
+ * See comment for GlobalVisState for details.
+ */
+GlobalVisState *
+GlobalVisTestFor(Relation rel)
+{
+	bool		need_shared;
+	bool		need_catalog;
+	GlobalVisState *state;
+
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(RecentXmin);
+
+	if (!rel)
+		need_shared = need_catalog = true;
+	else
+	{
+		/*
+		 * Other kinds currently don't contain xids, nor always the necessary
+		 * logical decoding markers.
+		 */
+		Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+			   rel->rd_rel->relkind == RELKIND_MATVIEW ||
+			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+
+		need_shared = rel->rd_rel->relisshared || RecoveryInProgress();
+		need_catalog = IsCatalogRelation(rel) || RelationIsAccessibleInLogicalDecoding(rel);
+	}
+
+	if (need_shared)
+		state = &GlobalVisSharedRels;
+	else if (need_catalog)
+		state = &GlobalVisCatalogRels;
+	else
+		state = &GlobalVisDataRels;
+
+	Assert(FullTransactionIdIsValid(state->definitely_needed) &&
+		   FullTransactionIdIsValid(state->maybe_needed));
+
+	return state;
+}
+
+/*
+ * Return true if it's worth updating the accurate maybe_needed boundary.
+ *
+ * As it is somewhat expensive to determine xmin horizons, we don't want to
+ * repeatedly do so when there is a low likelihood of it being beneficial.
+ *
+ * The current heuristic is that we update only if RecentXmin has changed
+ * since the last update. If the oldest currently running transaction has not
+ * finished, it is unlikely that recomputing the horizon would be useful.
+ */
+static bool
+GlobalVisTestShouldUpdate(GlobalVisState *state)
+{
+	/* hasn't been updated yet */
+	if (!TransactionIdIsValid(ComputeXidHorizonsResultLastXmin))
+		return true;
+
+	/*
+	 * If the maybe_needed/definitely_needed boundaries are the same, it's
+	 * unlikely to be beneficial to refresh boundaries.
+	 */
+	if (FullTransactionIdFollowsOrEquals(state->maybe_needed,
+										 state->definitely_needed))
+		return false;
+
+	/* does the last snapshot built have a different xmin? */
+	return RecentXmin != ComputeXidHorizonsResultLastXmin;
+}
+
+/*
+ * Update boundaries in GlobalVis{Shared,Catalog,Data}Rels
+ * using ComputeXidHorizons().
+ */
+static void
+GlobalVisUpdate(void)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	GlobalVisSharedRels.maybe_needed =
+		FullXidViaRelative(horizons.latest_completed,
+						   horizons.shared_oldest_nonremovable);
+	GlobalVisCatalogRels.maybe_needed =
+		FullXidViaRelative(horizons.latest_completed,
+						   horizons.catalog_oldest_nonremovable);
+	GlobalVisDataRels.maybe_needed =
+		FullXidViaRelative(horizons.latest_completed,
+						   horizons.data_oldest_nonremovable);
+
+	/*
+	 * In longer running transactions it's possible that transactions we
+	 * previously needed to treat as running aren't around anymore. So update
+	 * definitely_needed to not be earlier than maybe_needed.
+	 */
+	GlobalVisSharedRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisSharedRels.maybe_needed,
+							   GlobalVisSharedRels.definitely_needed);
+	GlobalVisCatalogRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
+							   GlobalVisCatalogRels.definitely_needed);
+	GlobalVisDataRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
+							   GlobalVisDataRels.definitely_needed);
+
+	ComputeXidHorizonsResultLastXmin = RecentXmin;
+}
+
+/*
+ * Return true if no snapshot still considers fxid to be running.
+ *
+ * The state passed needs to have been initialized for the relation fxid is
+ * from (NULL is also OK), otherwise the result may not be correct.
+ *
+ * See comment for GlobalVisState for details.
+ */
+bool
+GlobalVisTestIsRemovableFullXid(GlobalVisState *state,
+								FullTransactionId fxid)
+{
+	/*
+	 * If fxid is older than maybe_needed bound, it definitely is visible to
+	 * everyone.
+	 */
+	if (FullTransactionIdPrecedes(fxid, state->maybe_needed))
+		return true;
+
+	/*
+	 * If fxid is >= definitely_needed bound, it is very likely to still be
+	 * considered running.
+	 */
+	if (FullTransactionIdFollowsOrEquals(fxid, state->definitely_needed))
+		return false;
+
+	/*
+	 * fxid is between maybe_needed and definitely_needed, i.e. there might or
+	 * might not exist a snapshot considering fxid running. If it makes sense,
+	 * update boundaries and recheck.
+	 */
+	if (GlobalVisTestShouldUpdate(state))
+	{
+		GlobalVisUpdate();
+
+		Assert(FullTransactionIdPrecedes(fxid, state->definitely_needed));
+
+		return FullTransactionIdPrecedes(fxid, state->maybe_needed);
+	}
+	else
+		return false;
+}
+
+/*
+ * Wrapper around GlobalVisTestIsRemovableFullXid() for 32bit xids.
+ *
+ * It is crucial that this only gets called for xids from a source that
+ * protects against xid wraparounds (e.g. from a table and thus protected by
+ * relfrozenxid).
+ */
+bool
+GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid)
+{
+	FullTransactionId fxid;
+
+	/*
+	 * Convert 32 bit argument to FullTransactionId. We can do so safely
+	 * because we know the xid has to, at the very least, be between
+	 * [oldestXid, nextFullXid), i.e. within 2 billion of xid. To avoid taking
+	 * a lock to determine either, we can just compare with
+	 * state->definitely_needed, which was based on those values at the time
+	 * the current snapshot was built.
+	 */
+	fxid = FullXidViaRelative(state->definitely_needed, xid);
+
+	return GlobalVisTestIsRemovableFullXid(state, fxid);
+}
+
+/*
+ * Return FullTransactionId below which all transactions are not considered
+ * running anymore.
+ *
+ * Note: This is less efficient than testing with
+ * GlobalVisTestIsRemovableFullXid as it likely requires building an accurate
+ * cutoff, even in the case all the XIDs compared with the cutoff are outside
+ * [maybe_needed, definitely_needed).
+ */
+FullTransactionId
+GlobalVisTestNonRemovableFullHorizon(GlobalVisState *state)
+{
+	/* acquire accurate horizon if not already done */
+	if (GlobalVisTestShouldUpdate(state))
+		GlobalVisUpdate();
+
+	return state->maybe_needed;
+}
+
+/* Convenience wrapper around GlobalVisTestNonRemovableFullHorizon */
+TransactionId
+GlobalVisTestNonRemovableHorizon(GlobalVisState *state)
+{
+	FullTransactionId cutoff;
+
+	cutoff = GlobalVisTestNonRemovableFullHorizon(state);
+
+	return XidFromFullTransactionId(cutoff);
+}
+
+/*
+ * Convenience wrapper around GlobalVisTestFor() and
+ * GlobalVisTestIsRemovableFullXid(), see their comments.
+ */
+bool
+GlobalVisIsRemovableFullXid(Relation rel, FullTransactionId fxid)
+{
+	GlobalVisState *state;
+
+	state = GlobalVisTestFor(rel);
+
+	return GlobalVisTestIsRemovableFullXid(state, fxid);
+}
+
+/*
+ * Convenience wrapper around GlobalVisTestFor() and
+ * GlobalVisTestIsRemovableXid(), see their comments.
+ */
+bool
+GlobalVisCheckRemovableXid(Relation rel, TransactionId xid)
+{
+	GlobalVisState *state;
+
+	state = GlobalVisTestFor(rel);
+
+	return GlobalVisTestIsRemovableXid(state, xid);
+}
+
+/*
+ * Convert a 32 bit transaction id into 64 bit transaction id, by assuming it
+ * is within MaxTransactionId / 2 of XidFromFullTransactionId(rel).
+ *
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that xid is within MaxTransactionId / 2 xids of
+ * rel. That e.g. can be guaranteed if the caller assures a snapshot is
+ * held by the backend and xid is from a table (where vacuum/freezing ensures
+ * the xid has to be within that range), or if xid is from the procarray and
+ * prevents xid wraparound that way.
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+	TransactionId rel_xid = XidFromFullTransactionId(rel);
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(TransactionIdIsValid(rel_xid));
+
+	/* not guaranteed to find issues, but likely to catch mistakes */
+	AssertTransactionInAllowableRange(xid);
+
+	return FullTransactionIdFromU64(U64FromFullTransactionId(rel)
+									+ (int32) (xid - rel_xid));
+}
+
 
 /* ----------------------------------------------
  *		KnownAssignedTransactionIds sub-module
@@ -3388,9 +4031,7 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
 
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  max_xid))
-		ShmemVariableCache->latestCompletedXid = max_xid;
+	MaintainLatestCompletedXidRecovery(max_xid);
 
 	LWLockRelease(ProcArrayLock);
 }
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index be08eb48148..2d4ec92c2a1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5783,14 +5783,15 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * recent); that case motivates not using SnapshotAny here.
 	 *
 	 * A crucial point here is that SnapshotNonVacuumable, with
-	 * RecentGlobalXmin as horizon, yields the inverse of the condition that
-	 * the indexscan will use to decide that index entries are killable (see
-	 * heap_hot_search_buffer()).  Therefore, if the snapshot rejects a tuple
-	 * (or more precisely, all tuples of a HOT chain) and we have to continue
-	 * scanning past it, we know that the indexscan will mark that index entry
-	 * killed.  That means that the next get_actual_variable_endpoint() call
-	 * will not have to re-consider that index entry.  In this way we avoid
-	 * repetitive work when this function is used a lot during planning.
+	 * GlobalVisTestFor(heapRel) as horizon, yields the inverse of the
+	 * condition that the indexscan will use to decide that index entries are
+	 * killable (see heap_hot_search_buffer()).  Therefore, if the snapshot
+	 * rejects a tuple (or more precisely, all tuples of a HOT chain) and we
+	 * have to continue scanning past it, we know that the indexscan will mark
+	 * that index entry killed.  That means that the next
+	 * get_actual_variable_endpoint() call will not have to re-consider that
+	 * index entry.  In this way we avoid repetitive work when this function
+	 * is used a lot during planning.
 	 *
 	 * But using SnapshotNonVacuumable creates a hazard of its own.  In a
 	 * recently-created index, some index entries may point at "broken" HOT
@@ -5802,7 +5803,8 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * or could even be NULL.  We avoid this hazard because we take the data
 	 * from the index entry not the heap.
 	 */
-	InitNonVacuumableSnapshot(SnapshotNonVacuumable, RecentGlobalXmin);
+	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
+							  GlobalVisTestFor(heapRel));
 
 	index_scan = index_beginscan(heapRel, indexRel,
 								 &SnapshotNonVacuumable,
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d5..893be2f3ddb 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -722,6 +722,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	 * is critical for anything that reads heap pages, because HOT may decide
 	 * to prune them even if the process doesn't attempt to modify any
 	 * tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	if (!bootstrap)
 	{
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6b6c8571e23..76578868cf9 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -157,16 +157,9 @@ static Snapshot HistoricSnapshot = NULL;
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
- *
- * RecentGlobalXmin and RecentGlobalDataXmin are initialized to
- * InvalidTransactionId, to ensure that no one tries to use a stale
- * value. Readers should ensure that it has been set to something else
- * before using it.
  */
 TransactionId TransactionXmin = FirstNormalTransactionId;
 TransactionId RecentXmin = FirstNormalTransactionId;
-TransactionId RecentGlobalXmin = InvalidTransactionId;
-TransactionId RecentGlobalDataXmin = InvalidTransactionId;
 
 /* (table, ctid) => (cmin, cmax) mapping during timetravel */
 static HTAB *tuplecid_data = NULL;
@@ -581,9 +574,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	 * Even though we are not going to use the snapshot it computes, we must
 	 * call GetSnapshotData, for two reasons: (1) to be sure that
 	 * CurrentSnapshotData's XID arrays have been allocated, and (2) to update
-	 * RecentXmin and RecentGlobalXmin.  (We could alternatively include those
-	 * two variables in exported snapshot files, but it seems better to have
-	 * snapshot importers compute reasonably up-to-date values for them.)
+	 * the state for GlobalVis*.
 	 */
 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
@@ -956,36 +947,6 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 		return 0;
 }
 
-/*
- * Get current RecentGlobalXmin value, as a FullTransactionId.
- */
-FullTransactionId
-GetFullRecentGlobalXmin(void)
-{
-	FullTransactionId nextxid_full;
-	uint32		nextxid_epoch;
-	TransactionId nextxid_xid;
-	uint32		epoch;
-
-	Assert(TransactionIdIsNormal(RecentGlobalXmin));
-
-	/*
-	 * Compute the epoch from the next XID's epoch. This relies on the fact
-	 * that RecentGlobalXmin must be within the 2 billion XID horizon from the
-	 * next XID.
-	 */
-	nextxid_full = ReadNextFullTransactionId();
-	nextxid_epoch = EpochFromFullTransactionId(nextxid_full);
-	nextxid_xid = XidFromFullTransactionId(nextxid_full);
-
-	if (RecentGlobalXmin > nextxid_xid)
-		epoch = nextxid_epoch - 1;
-	else
-		epoch = nextxid_epoch;
-
-	return FullTransactionIdFromEpochAndXid(epoch, RecentGlobalXmin);
-}
-
 /*
  * SnapshotResetXmin
  *
@@ -1753,106 +1714,157 @@ GetOldSnapshotThresholdTimestamp(void)
 	return threshold_timestamp;
 }
 
-static void
+void
 SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 {
 	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
+	Assert(oldSnapshotControl->threshold_timestamp <= ts);
+	Assert(TransactionIdPrecedesOrEquals(oldSnapshotControl->threshold_xid, xlimit));
 	oldSnapshotControl->threshold_timestamp = ts;
 	oldSnapshotControl->threshold_xid = xlimit;
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+/*
+ * XXX: Magic to make old_snapshot_threshold tests appear "working". They
+ * are currently broken, and discussion of what to do about them is
+ * ongoing. See
+ * https://www.postgresql.org/message-id/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de
+ */
+void
+SnapshotTooOldMagicForTest(void)
+{
+	TimestampTz ts = GetSnapshotCurrentTimestamp();
+
+	Assert(old_snapshot_threshold == 0);
+
+	ts -= 5 * USECS_PER_SEC;
+
+	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
+	oldSnapshotControl->threshold_timestamp = ts;
+	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
+}
+
+/*
+ * If there is a valid mapping for the timestamp, set *xlimitp to
+ * that. Returns whether there is such a mapping.
+ */
+static bool
+GetOldSnapshotFromTimeMapping(TimestampTz ts, TransactionId *xlimitp)
+{
+	bool in_mapping = false;
+
+	Assert(ts == AlignTimestampToMinuteBoundary(ts));
+
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+	if (oldSnapshotControl->count_used > 0
+		&& ts >= oldSnapshotControl->head_timestamp)
+	{
+		int			offset;
+
+		offset = ((ts - oldSnapshotControl->head_timestamp)
+				  / USECS_PER_MINUTE);
+		if (offset > oldSnapshotControl->count_used - 1)
+			offset = oldSnapshotControl->count_used - 1;
+		offset = (oldSnapshotControl->head_offset + offset)
+			% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+		*xlimitp = oldSnapshotControl->xid_by_minute[offset];
+
+		in_mapping = true;
+	}
+
+	LWLockRelease(OldSnapshotTimeMapLock);
+
+	return in_mapping;
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
- * Apply old snapshot limit, if any.  This is intended to be called for page
- * pruning and table vacuuming, to allow old_snapshot_threshold to override
- * the normal global xmin value.  Actual testing for snapshot too old will be
- * based on whether a snapshot timestamp is prior to the threshold timestamp
- * set in this function.
+ * Apply old snapshot limit.  This is intended to be called for page pruning
+ * and table vacuuming, to allow old_snapshot_threshold to override the normal
+ * global xmin value.  Actual testing for snapshot too old will be based on
+ * whether a snapshot timestamp is prior to the threshold timestamp set in
+ * this function.
+ *
+ * If the limited horizon allows a cleanup action that otherwise would not be
+ * possible, SetOldSnapshotThresholdTimestamp(*limit_ts, *limit_xid) needs to
+ * be called before that cleanup action.
  */
-TransactionId
+bool
 TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-									Relation relation)
+									Relation relation,
+									TransactionId *limit_xid,
+									TimestampTz *limit_ts)
 {
-	if (TransactionIdIsNormal(recentXmin)
-		&& old_snapshot_threshold >= 0
-		&& RelationAllowsEarlyPruning(relation))
+	TimestampTz ts;
+	TransactionId xlimit = recentXmin;
+	TransactionId latest_xmin;
+	TimestampTz next_map_update_ts;
+	TimestampTz threshold_timestamp;
+	TransactionId threshold_xid;
+
+	Assert(TransactionIdIsNormal(recentXmin));
+	Assert(OldSnapshotThresholdActive());
+	Assert(limit_ts != NULL && limit_xid != NULL);
+
+	if (!RelationAllowsEarlyPruning(relation))
+		return false;
+
+	ts = GetSnapshotCurrentTimestamp();
+
+	SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
+	latest_xmin = oldSnapshotControl->latest_xmin;
+	next_map_update_ts = oldSnapshotControl->next_map_update;
+	SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
+
+	/*
+	 * Zero threshold always overrides to latest xmin, if valid.  Without
+	 * some heuristic it will find its own snapshot too old on, for
+	 * example, a simple UPDATE -- which would make it useless for most
+	 * testing, but there is no principled way to ensure that it doesn't
+	 * fail in this way.  Use a five-second delay to try to get useful
+	 * testing behavior, but this may need adjustment.
+	 */
+	if (old_snapshot_threshold == 0)
 	{
-		TimestampTz ts = GetSnapshotCurrentTimestamp();
-		TransactionId xlimit = recentXmin;
-		TransactionId latest_xmin;
-		TimestampTz update_ts;
-		bool		same_ts_as_threshold = false;
-
-		SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
-		latest_xmin = oldSnapshotControl->latest_xmin;
-		update_ts = oldSnapshotControl->next_map_update;
-		SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
-
-		/*
-		 * Zero threshold always overrides to latest xmin, if valid.  Without
-		 * some heuristic it will find its own snapshot too old on, for
-		 * example, a simple UPDATE -- which would make it useless for most
-		 * testing, but there is no principled way to ensure that it doesn't
-		 * fail in this way.  Use a five-second delay to try to get useful
-		 * testing behavior, but this may need adjustment.
-		 */
-		if (old_snapshot_threshold == 0)
-		{
-			if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
-				&& TransactionIdFollows(latest_xmin, xlimit))
-				xlimit = latest_xmin;
-
-			ts -= 5 * USECS_PER_SEC;
-			SetOldSnapshotThresholdTimestamp(ts, xlimit);
-
-			return xlimit;
-		}
+		if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
+			&& TransactionIdFollows(latest_xmin, xlimit))
+			xlimit = latest_xmin;
 
+		ts -= 5 * USECS_PER_SEC;
+	}
+	else
+	{
 		ts = AlignTimestampToMinuteBoundary(ts)
 			- (old_snapshot_threshold * USECS_PER_MINUTE);
 
 		/* Check for fast exit without LW locking. */
 		SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
-		if (ts == oldSnapshotControl->threshold_timestamp)
-		{
-			xlimit = oldSnapshotControl->threshold_xid;
-			same_ts_as_threshold = true;
-		}
+		threshold_timestamp = oldSnapshotControl->threshold_timestamp;
+		threshold_xid = oldSnapshotControl->threshold_xid;
 		SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 
-		if (!same_ts_as_threshold)
+		if (ts == threshold_timestamp)
+		{
+			/*
+			 * Current timestamp is in same bucket as the last limit that
+			 * was applied. Reuse.
+			 */
+			xlimit = threshold_xid;
+		}
+		else if (ts == next_map_update_ts)
+		{
+			/*
+			 * FIXME: This branch is super iffy - but that should probably be
+			 * fixed separately.
+			 */
+			xlimit = latest_xmin;
+		}
+		else if (GetOldSnapshotFromTimeMapping(ts, &xlimit))
 		{
-			if (ts == update_ts)
-			{
-				xlimit = latest_xmin;
-				if (NormalTransactionIdFollows(xlimit, recentXmin))
-					SetOldSnapshotThresholdTimestamp(ts, xlimit);
-			}
-			else
-			{
-				LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
-
-				if (oldSnapshotControl->count_used > 0
-					&& ts >= oldSnapshotControl->head_timestamp)
-				{
-					int			offset;
-
-					offset = ((ts - oldSnapshotControl->head_timestamp)
-							  / USECS_PER_MINUTE);
-					if (offset > oldSnapshotControl->count_used - 1)
-						offset = oldSnapshotControl->count_used - 1;
-					offset = (oldSnapshotControl->head_offset + offset)
-						% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
-					xlimit = oldSnapshotControl->xid_by_minute[offset];
-
-					if (NormalTransactionIdFollows(xlimit, recentXmin))
-						SetOldSnapshotThresholdTimestamp(ts, xlimit);
-				}
-
-				LWLockRelease(OldSnapshotTimeMapLock);
-			}
 		}
 
 		/*
@@ -1867,12 +1879,18 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 		if (TransactionIdIsNormal(latest_xmin)
 			&& TransactionIdPrecedes(latest_xmin, xlimit))
 			xlimit = latest_xmin;
-
-		if (NormalTransactionIdFollows(xlimit, recentXmin))
-			return xlimit;
 	}
 
-	return recentXmin;
+	if (TransactionIdIsValid(xlimit) &&
+		TransactionIdFollowsOrEquals(xlimit, recentXmin))
+	{
+		*limit_ts = ts;
+		*limit_xid = xlimit;
+
+		return true;
+	}
+
+	return false;
 }
 
 /*
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index e4d501a85d1..76306976c2a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -419,10 +419,10 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 			 RelationGetRelationName(rel));
 
 	/*
-	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
-	 * RecentGlobalXmin/B-Tree page deletion.
+	 * This assertion matches the one in index_getnext_tid().  See page
+	 * recycling/"visible to everyone" notes in nbtree README.
 	 */
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(TransactionIdIsValid(RecentXmin));
 
 	/*
 	 * Initialize state for entire verification operation
@@ -1441,7 +1441,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * does not occur until no possible index scan could land on the page.
 	 * Index scans can follow links with nothing more than their snapshot as
 	 * an interlock and be sure of at least that much.  (See page
-	 * recycling/RecentGlobalXmin notes in nbtree README.)
+	 * recycling/"visible to everyone" notes in nbtree README.)
 	 *
 	 * Furthermore, it's okay if we follow a rightlink and find a half-dead or
 	 * dead (ignorable) page one or more times.  There will either be a
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 68d580ed1e0..37206c50a21 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
 
-	if (all_visible)
-	{
-		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
-	}
-
 	rel = relation_open(relid, AccessShareLock);
 
 	/* Only some relkinds have a visibility map */
 	check_relation_relkind(rel);
 
+	if (all_visible)
+		OldestXmin = GetOldestNonRemovableTransactionId(rel);
+
 	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/*
@@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * From a concurrency point of view, it sort of sucks to
 				 * retake ProcArrayLock here while we're holding the buffer
 				 * exclusively locked, but it should be safe against
-				 * deadlocks, because surely GetOldestXmin() should never take
-				 * a buffer lock. And this shouldn't happen often, so it's
-				 * worth being careful so as to avoid false positives.
+				 * deadlocks, because surely GetOldestNonRemovableTransactionId()
+				 * should never take a buffer lock. And this shouldn't happen
+				 * often, so it's worth being careful so as to avoid false
+				 * positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+				RecomputedOldestXmin = GetOldestNonRemovableTransactionId(rel);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index dbc0fa11f61..3a99333d443 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
 
-	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(rel);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7eaaad1e140..b4948ac675f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -395,6 +395,7 @@ CompositeTypeStmt
 CompoundAffixFlag
 CompressionAlgorithm
 CompressorState
+ComputeXidHorizonsResult
 ConditionVariable
 ConditionalStack
 ConfigData
@@ -930,6 +931,7 @@ GistSplitVector
 GistTsVectorOptions
 GistVacState
 GlobalTransaction
+GlobalVisState
 GrantRoleStmt
 GrantStmt
 GrantTargetType
-- 
2.25.0.114.g5b0ca878e0
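
For readers skimming the patch above, the two-boundary test in
GlobalVisTestIsRemovableFullXid() and the 32-to-64 bit widening in
FullXidViaRelative() can be sketched in isolation. This is a minimal
standalone sketch, not the patch's code: Xid32, Xid64, VisBounds,
RefreshFn, widen_relative(), is_removable() and demo_refresh() are
hypothetical names, plain integers stand in for TransactionId and
FullTransactionId, and the GlobalVisTestShouldUpdate() heuristic is
reduced to "refresh whenever a callback is supplied".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for TransactionId / FullTransactionId. */
typedef uint32_t Xid32;
typedef uint64_t Xid64;

/* Hypothetical, reduced analogue of GlobalVisState. */
typedef struct VisBounds
{
	Xid64		maybe_needed;		/* below this: certainly removable */
	Xid64		definitely_needed;	/* at or above this: treated as needed */
} VisBounds;

/* Hypothetical callback standing in for the expensive horizon computation. */
typedef Xid64 (*RefreshFn) (void);

/*
 * Widen a 32 bit xid relative to a 64 bit reference value.  Only valid if
 * xid is known to be within 2^31 of the reference; the signed 32 bit
 * difference then gives the distance in either direction.  (Assumes the
 * usual two's complement conversion to int32_t.)
 */
static Xid64
widen_relative(Xid64 rel, Xid32 xid)
{
	Xid32		rel_xid = (Xid32) rel;

	return rel + (Xid64) (int64_t) (int32_t) (xid - rel_xid);
}

/*
 * Two-boundary test: answer cheaply when fxid is outside
 * [maybe_needed, definitely_needed), and only pay for an accurate horizon
 * when fxid falls in between.
 */
static bool
is_removable(VisBounds *b, Xid64 fxid, RefreshFn refresh)
{
	if (fxid < b->maybe_needed)
		return true;
	if (fxid >= b->definitely_needed)
		return false;

	/* In between: without a way to refresh, be conservative. */
	if (refresh == NULL)
		return false;

	b->maybe_needed = refresh();
	/* keep definitely_needed from falling behind maybe_needed */
	if (b->definitely_needed < b->maybe_needed)
		b->definitely_needed = b->maybe_needed;

	return fxid < b->maybe_needed;
}

/* Demo callback: pretends the accurate horizon turned out to be 150. */
static Xid64
demo_refresh(void)
{
	return 150;
}
```

With bounds {100, 200}, an fxid of 50 or 250 is decided without any
refresh; 120 is only reported removable once demo_refresh() has moved
maybe_needed up to 150.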

v11-0002-snapshot-scalability-Move-PGXACT-xmin-back-to-PG.patch (text/x-diff; charset=us-ascii)
From ea877d637845b5941b7cbe63214c50334785f251 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v11 2/6] snapshot scalability: Move PGXACT->xmin back to
 PGPROC.

Now that xmin isn't needed for GetSnapshotData() anymore, it leads to
unnecessary cacheline ping-pong to have it in PGXACT as it is updated
more frequently than the other PGXACT members.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h                  | 10 +++---
 src/backend/access/gist/gistxlog.c          |  2 +-
 src/backend/access/nbtree/nbtpage.c         |  2 +-
 src/backend/access/transam/README           |  2 +-
 src/backend/access/transam/twophase.c       |  2 +-
 src/backend/commands/indexcmds.c            |  2 +-
 src/backend/replication/logical/snapbuild.c |  6 ++--
 src/backend/replication/walsender.c         | 10 +++---
 src/backend/storage/ipc/procarray.c         | 36 +++++++++------------
 src/backend/storage/ipc/sinvaladt.c         |  2 +-
 src/backend/storage/lmgr/proc.c             |  4 +--
 src/backend/utils/time/snapmgr.c            | 28 ++++++++--------
 12 files changed, 51 insertions(+), 55 deletions(-)
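
Reader's note (not part of the patch): the cacheline argument in the commit
message can be illustrated with a toy layout model. This is only a sketch
under stated assumptions, not PostgreSQL code. The ToyPgXactOld struct, the
toy_* helpers and the 64-byte line size are all invented for the example. It
computes which cacheline a given field of a given array element falls on
when small per-backend structs are packed into one array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines, as on most current x86-64 parts. */
#define TOY_CACHELINE_SIZE 64

typedef uint32_t TransactionId;

/* Hypothetical stand-in for the pre-patch PGXACT, with xmin next to xid. */
typedef struct ToyPgXactOld
{
	TransactionId xid;
	TransactionId xmin;
	uint8_t		vacuumFlags;
	uint8_t		overflowed;
	uint8_t		nxids;
} ToyPgXactOld;

/* Number of whole array elements that fit into one cacheline. */
static size_t
toy_entries_per_line(size_t elem_size)
{
	return TOY_CACHELINE_SIZE / elem_size;
}

/*
 * Cacheline index (relative to a line-aligned array start) of the byte at
 * field_off inside array element idx.
 */
static size_t
toy_line_of(size_t elem_size, size_t idx, size_t field_off)
{
	return (idx * elem_size + field_off) / TOY_CACHELINE_SIZE;
}
```

In this model the xmin of element 1 and the xid of element 0 land on the
same line, so a store to one backend's xmin invalidates the line another CPU
needs in order to read its neighbour's xid. Moving xmin into the much larger
PGPROC takes those frequent stores out of the densely packed array.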

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 08f006f782e..286c9a9aec3 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -102,6 +102,11 @@ struct PGPROC
 
 	Latch		procLatch;		/* generic latch for process */
 
+	TransactionId xmin;			/* minimal running XID as it was when we were
+								 * starting our xact, excluding LAZY VACUUM:
+								 * vacuum must not remove tuples deleted by
+								 * xid >= xmin ! */
+
 	LocalTransactionId lxid;	/* local id of top-level transaction currently
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
@@ -224,11 +229,6 @@ typedef struct PGXACT
 								 * executed by this proc, if running and XID
 								 * is assigned; else InvalidTransactionId */
 
-	TransactionId xmin;			/* minimal running XID as it was when we were
-								 * starting our xact, excluding LAZY VACUUM:
-								 * vacuum must not remove tuples deleted by
-								 * xid >= xmin ! */
-
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3167305ac00..b6603cd73cf 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -389,7 +389,7 @@ gistRedoPageReuse(XLogReaderState *record)
 	 *
 	 * latestRemovedXid was the page's deleteXid.  The
 	 * GlobalVisIsRemovableFullXid(deleteXid) test in gistPageRecyclable()
-	 * conceptually mirrors the pgxact->xmin > limitXmin test in
+	 * conceptually mirrors the PGPROC->xmin > limitXmin test in
 	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
 	 * same exclusion effect on primary and standby.
 	 */
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9e6376f2c2b..c88ca4221a4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2208,7 +2208,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * we're in VACUUM and would not otherwise have an XID.  Having already
 	 * updated links to the target, ReadNewTransactionId() suffices as an
 	 * upper bound.  Any scan having retained a now-stale link is advertising
-	 * in its PGXACT an xmin less than or equal to the value we read here.  It
+	 * in its PGPROC an xmin less than or equal to the value we read here.  It
 	 * will continue to do so, holding back the xmin horizon, for the duration
 	 * of that scan.
 	 */
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 4e2178dabab..94d8f3fd0a2 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -331,7 +331,7 @@ necessary.
 Note that while it is certain that two concurrent executions of
 GetSnapshotData will compute the same xmin for their own snapshots, there is
 no such guarantee for the horizons computed by ComputeXidHorizons.  This is
-because we allow XID-less transactions to clear their MyPgXact->xmin
+because we allow XID-less transactions to clear their MyProc->xmin
 asynchronously (without taking ProcArrayLock), so one execution might see
 what had been the oldest xmin, and another not.  This is OK since the
 thresholds need only be a valid lower bound.  As noted above, we are already
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9b2e59bf0ec..ae7c1a4c172 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -464,7 +464,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
 	proc->lxid = (LocalTransactionId) xid;
 	pgxact->xid = xid;
-	pgxact->xmin = InvalidTransactionId;
+	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
 	proc->pid = 0;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2baca12c5f4..9d741aa03fa 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1535,7 +1535,7 @@ DefineIndex(Oid relationId,
 	StartTransactionCommand();
 
 	/* We should now definitely not be advertising any xmin. */
-	Assert(MyPgXact->xmin == InvalidTransactionId);
+	Assert(MyProc->xmin == InvalidTransactionId);
 
 	/*
 	 * The index is now valid in the sense that it contains all currently
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3089f0d5ddc..e9701ea7221 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -553,8 +553,8 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 		elog(ERROR, "cannot build an initial slot snapshot, not all transactions are monitored anymore");
 
 	/* so we don't overwrite the existing value */
-	if (TransactionIdIsValid(MyPgXact->xmin))
-		elog(ERROR, "cannot build an initial slot snapshot when MyPgXact->xmin already is valid");
+	if (TransactionIdIsValid(MyProc->xmin))
+		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
 
 	snap = SnapBuildBuildSnapshot(builder);
 
@@ -575,7 +575,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	}
 #endif
 
-	MyPgXact->xmin = snap->xmin;
+	MyProc->xmin = snap->xmin;
 
 	/* allocate in transaction context */
 	newxip = (TransactionId *)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fd370d52b66..06da4b4352a 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1967,7 +1967,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 	ReplicationSlot *slot = MyReplicationSlot;
 
 	SpinLockAcquire(&slot->mutex);
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 
 	/*
 	 * For physical replication we don't need the interlock provided by xmin
@@ -2096,7 +2096,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	if (!TransactionIdIsNormal(feedbackXmin)
 		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
-		MyPgXact->xmin = InvalidTransactionId;
+		MyProc->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
 			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
@@ -2138,7 +2138,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * risk already since a VACUUM could already have determined the horizon.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * otherwise via the walsender's PGPROC entry. We can only track the
 	 * catalog xmin separately when using a slot, so we store the least of the
 	 * two provided when not using a slot.
 	 *
@@ -2151,9 +2151,9 @@ ProcessStandbyHSFeedbackMessage(void)
 	{
 		if (TransactionIdIsNormal(feedbackCatalogXmin)
 			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
-			MyPgXact->xmin = feedbackCatalogXmin;
+			MyProc->xmin = feedbackCatalogXmin;
 		else
-			MyPgXact->xmin = feedbackXmin;
+			MyProc->xmin = feedbackXmin;
 	}
 }
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index c011387ba90..980ca2cc2af 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -585,9 +585,9 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
 		proc->lxid = InvalidLocalTransactionId;
-		pgxact->xmin = InvalidTransactionId;
 		/* must be cleared with xid/xmin: */
 		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false;	/* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
@@ -607,9 +607,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 {
 	pgxact->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	pgxact->xmin = InvalidTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+	proc->xmin = InvalidTransactionId;
 	proc->delayChkpt = false;	/* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
@@ -761,7 +761,7 @@ ProcArrayClearTransaction(PGPROC *proc)
 	 */
 	pgxact->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	pgxact->xmin = InvalidTransactionId;
+	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
 	/* redundant, but just in case */
@@ -1550,7 +1550,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 
 		/* Fetch xid just once - see GetNewTransactionId */
 		xid = UINT32_ACCESS_ONCE(pgxact->xid);
-		xmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*
 		 * Consider both the transaction's Xmin, and its Xid.
@@ -1822,7 +1822,7 @@ GetMaxSnapshotSubxidCount(void)
  *
  * We also update the following backend-global variables:
  *		TransactionXmin: the oldest xmin of any snapshot in use in the
- *			current transaction (this is the same as MyPgXact->xmin).
+ *			current transaction (this is the same as MyProc->xmin).
  *		RecentXmin: the xmin computed for the most recent snapshot.  XIDs
  *			older than this are known not running any more.
  *
@@ -1884,7 +1884,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
-	 * going to set MyPgXact->xmin.
+	 * going to set MyProc->xmin.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -2036,8 +2036,8 @@ GetSnapshotData(Snapshot snapshot)
 	replication_slot_xmin = procArray->replication_slot_xmin;
 	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
-	if (!TransactionIdIsValid(MyPgXact->xmin))
-		MyPgXact->xmin = TransactionXmin = xmin;
+	if (!TransactionIdIsValid(MyProc->xmin))
+		MyProc->xmin = TransactionXmin = xmin;
 
 	LWLockRelease(ProcArrayLock);
 
@@ -2157,7 +2157,7 @@ GetSnapshotData(Snapshot snapshot)
 }
 
 /*
- * ProcArrayInstallImportedXmin -- install imported xmin into MyPgXact->xmin
+ * ProcArrayInstallImportedXmin -- install imported xmin into MyProc->xmin
  *
  * This is called when installing a snapshot imported from another
  * transaction.  To ensure that OldestXmin doesn't go backwards, we must
@@ -2210,7 +2210,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 		/*
 		 * Likewise, let's just make real sure its xmin does cover us.
 		 */
-		xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+		xid = UINT32_ACCESS_ONCE(proc->xmin);
 		if (!TransactionIdIsNormal(xid) ||
 			!TransactionIdPrecedesOrEquals(xid, xmin))
 			continue;
@@ -2221,7 +2221,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 		 * GetSnapshotData first, we'll be overwriting a valid xmin here, so
 		 * we don't check that.)
 		 */
-		MyPgXact->xmin = TransactionXmin = xmin;
+		MyProc->xmin = TransactionXmin = xmin;
 
 		result = true;
 		break;
@@ -2233,7 +2233,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 }
 
 /*
- * ProcArrayInstallRestoredXmin -- install restored xmin into MyPgXact->xmin
+ * ProcArrayInstallRestoredXmin -- install restored xmin into MyProc->xmin
  *
  * This is like ProcArrayInstallImportedXmin, but we have a pointer to the
  * PGPROC of the transaction from which we imported the snapshot, rather than
@@ -2246,7 +2246,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 {
 	bool		result = false;
 	TransactionId xid;
-	PGXACT	   *pgxact;
 
 	Assert(TransactionIdIsNormal(xmin));
 	Assert(proc != NULL);
@@ -2254,20 +2253,18 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 	/* Get lock so source xact can't end while we're doing this */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	pgxact = &allPgXact[proc->pgprocno];
-
 	/*
 	 * Be certain that the referenced PGPROC has an advertised xmin which is
 	 * no later than the one we're installing, so that the system-wide xmin
 	 * can't go backwards.  Also, make sure it's running in the same database,
 	 * so that the per-database xmin cannot go backwards.
 	 */
-	xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+	xid = UINT32_ACCESS_ONCE(proc->xmin);
 	if (proc->databaseId == MyDatabaseId &&
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		MyPgXact->xmin = TransactionXmin = xmin;
+		MyProc->xmin = TransactionXmin = xmin;
 		result = true;
 	}
 
@@ -2892,7 +2889,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 		if (allDbs || proc->databaseId == MyDatabaseId)
 		{
 			/* Fetch xmin just once - might change on us */
-			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 			if (excludeXmin0 && !TransactionIdIsValid(pxmin))
 				continue;
@@ -2978,7 +2975,6 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
 		/* Exclude prepared transactions */
 		if (proc->pid == 0)
@@ -2988,7 +2984,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 			proc->databaseId == dbOid)
 		{
 			/* Fetch xmin just once - can't change on us, but good coding */
-			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 			/*
 			 * We ignore an invalid pxmin because this means that backend has
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index e5c115b92f2..ad048bc85fa 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -420,7 +420,7 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 			PGXACT	   *xact = &ProcGlobal->allPgXact[proc->pgprocno];
 
 			*xid = xact->xid;
-			*xmin = xact->xmin;
+			*xmin = proc->xmin;
 		}
 	}
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e57fcd25388..de346cd87fc 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -388,7 +388,7 @@ InitProcess(void)
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyPgXact->xid = InvalidTransactionId;
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
 	MyProc->backendId = InvalidBackendId;
@@ -572,7 +572,7 @@ InitAuxiliaryProcess(void)
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyPgXact->xid = InvalidTransactionId;
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
 	MyProc->roleId = InvalidOid;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 76578868cf9..689a3b6a597 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -27,11 +27,11 @@
  * their lifetime is managed separately (as they live longer than one xact.c
  * transaction).
  *
- * These arrangements let us reset MyPgXact->xmin when there are no snapshots
+ * These arrangements let us reset MyProc->xmin when there are no snapshots
  * referenced by this transaction, and advance it when the one with oldest
  * Xmin is no longer referenced.  For simplicity however, only registered
  * snapshots not active snapshots participate in tracking which one is oldest;
- * we don't try to change MyPgXact->xmin except when the active-snapshot
+ * we don't try to change MyProc->xmin except when the active-snapshot
  * stack is empty.
  *
  *
@@ -187,7 +187,7 @@ static ActiveSnapshotElt *OldestActiveSnapshot = NULL;
 
 /*
  * Currently registered Snapshots.  Ordered in a heap by xmin, so that we can
- * quickly find the one with lowest xmin, to advance our MyPgXact->xmin.
+ * quickly find the one with lowest xmin, to advance our MyProc->xmin.
  */
 static int	xmin_cmp(const pairingheap_node *a, const pairingheap_node *b,
 					 void *arg);
@@ -475,7 +475,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 
 		/*
 		 * Make sure the catalog snapshot will be accounted for in decisions
-		 * about advancing PGXACT->xmin.  We could apply RegisterSnapshot, but
+		 * about advancing PGPROC->xmin.  We could apply RegisterSnapshot, but
 		 * that would result in making a physical copy, which is overkill; and
 		 * it would also create a dependency on some resource owner, which we
 		 * do not want for reasons explained at the head of this file. Instead
@@ -596,7 +596,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	/* NB: curcid should NOT be copied, it's a local matter */
 
 	/*
-	 * Now we have to fix what GetSnapshotData did with MyPgXact->xmin and
+	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
 	 * causing the global xmin to go backwards, we have to test that the
 	 * source transaction is still running, and that has to be done
@@ -950,13 +950,13 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 /*
  * SnapshotResetXmin
  *
- * If there are no more snapshots, we can reset our PGXACT->xmin to InvalidXid.
+ * If there are no more snapshots, we can reset our PGPROC->xmin to InvalidXid.
  * Note we can do this without locking because we assume that storing an Xid
  * is atomic.
  *
  * Even if there are some remaining snapshots, we may be able to advance our
- * PGXACT->xmin to some degree.  This typically happens when a portal is
- * dropped.  For efficiency, we only consider recomputing PGXACT->xmin when
+ * PGPROC->xmin to some degree.  This typically happens when a portal is
+ * dropped.  For efficiency, we only consider recomputing PGPROC->xmin when
  * the active snapshot stack is empty; this allows us not to need to track
  * which active snapshot is oldest.
  *
@@ -977,15 +977,15 @@ SnapshotResetXmin(void)
 
 	if (pairingheap_is_empty(&RegisteredSnapshots))
 	{
-		MyPgXact->xmin = InvalidTransactionId;
+		MyProc->xmin = InvalidTransactionId;
 		return;
 	}
 
 	minSnapshot = pairingheap_container(SnapshotData, ph_node,
 										pairingheap_first(&RegisteredSnapshots));
 
-	if (TransactionIdPrecedes(MyPgXact->xmin, minSnapshot->xmin))
-		MyPgXact->xmin = minSnapshot->xmin;
+	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
+		MyProc->xmin = minSnapshot->xmin;
 }
 
 /*
@@ -1132,13 +1132,13 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 
 	/*
 	 * During normal commit processing, we call ProcArrayEndTransaction() to
-	 * reset the MyPgXact->xmin. That call happens prior to the call to
+	 * reset the MyProc->xmin. That call happens prior to the call to
 	 * AtEOXact_Snapshot(), so we need not touch xmin here at all.
 	 */
 	if (resetXmin)
 		SnapshotResetXmin();
 
-	Assert(resetXmin || MyPgXact->xmin == 0);
+	Assert(resetXmin || MyProc->xmin == 0);
 }
 
 
@@ -1830,7 +1830,7 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 	 */
 	if (old_snapshot_threshold == 0)
 	{
-		if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
+		if (TransactionIdPrecedes(latest_xmin, MyProc->xmin)
 			&& TransactionIdFollows(latest_xmin, xlimit))
 			xlimit = latest_xmin;
 
-- 
2.25.0.114.g5b0ca878e0

v11-0003-snapshot-scalability-Introduce-dense-array-of-in.patch (text/x-diff; charset=us-ascii)
From 1609ab74104aa845df655807695ed341a0a5dbe1 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v11 3/6] snapshot scalability: Introduce dense array of
 in-progress xids.

The new array contains the xids for all connected backends / in-use
PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT
arrays which can have unused entries interspersed).

This improves performance because GetSnapshotData() always needs to
scan the xids of all live procarray entries and now there's no need to
go through the procArray->pgprocnos indirection anymore.

As the set of running top-level xids changes rarely, compared to the
number of snapshots taken, this substantially increases the likelihood
of most data required for a snapshot being in L2 cache.  In
read-mostly workloads scanning the xids[] array will be sufficient to
build a snapshot, as most backends will not have an xid assigned.

To keep the xid array dense, ProcArrayRemove() needs to move the entries
behind the to-be-removed proc's entry one position further up in the
array. Obviously, array entries must not be moved while a backend sets
its xid, so locking needs to prevent ProcArrayRemove() from shifting
entries while a backend modifies its xid.

To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot
spot already - ProcArrayAdd() / ProcArrayRemove() now need to hold
XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray
entry is not a very frequent operation, even taking 2PC into account.

Due to the above, the dense array entries can only be read or modified
while holding ProcArrayLock and/or XidGenLock. This prevents a
concurrent ProcArrayRemove() from shifting the dense array while it is
being accessed.

While the new dense array is very good when needing to look at all
xids, it is less suitable when accessing a single backend's xid. In
particular it would be problematic to have to acquire a lock to access
a backend's own xid. Therefore a backend's xid is not just stored in
the dense array, but also in PGPROC. This also allows a backend to
only access the shared xid value when the backend has acquired an
xid.

The infrastructure added in this commit will be used for the remaining
PGXACT fields in subsequent commits. They are kept separate to make
review easier.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h                  |  79 +++++-
 src/backend/access/heap/heapam_visibility.c |   8 +-
 src/backend/access/transam/README           |  33 +--
 src/backend/access/transam/clog.c           |   8 +-
 src/backend/access/transam/twophase.c       |  31 +--
 src/backend/access/transam/varsup.c         |  20 +-
 src/backend/commands/vacuum.c               |   2 +-
 src/backend/storage/ipc/procarray.c         | 282 +++++++++++++-------
 src/backend/storage/ipc/sinvaladt.c         |   4 +-
 src/backend/storage/lmgr/lock.c             |   3 +-
 src/backend/storage/lmgr/proc.c             |  26 +-
 11 files changed, 335 insertions(+), 161 deletions(-)
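
Reader's note (not part of the patch): the dense-array bookkeeping described
in the commit message can be sketched with a toy model. This is a
simplified illustration, not the code in procarray.c. The ToyProc and
ToyProcArray types and all toy_* functions are hypothetical, the fixed
capacity is arbitrary, new entries are simply appended, and locking is
omitted entirely (the real code requires ProcArrayLock and/or XidGenLock
around these accesses).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define TOY_MAX_PROCS 8			/* arbitrary capacity for the sketch */

typedef struct ToyProc
{
	TransactionId xid;			/* local copy, mirrored in xids[pgxactoff] */
	int			pgxactoff;		/* index into the dense arrays, -1 if absent */
} ToyProc;

typedef struct ToyProcArray
{
	int			numProcs;
	ToyProc    *procs[TOY_MAX_PROCS];	/* dense, parallel to xids[] */
	TransactionId xids[TOY_MAX_PROCS];	/* dense mirror of ToyProc.xid */
} ToyProcArray;

/* Append proc to the dense arrays; returns its offset or -1 if full. */
static int
toy_add(ToyProcArray *arr, ToyProc *proc)
{
	if (arr->numProcs >= TOY_MAX_PROCS)
		return -1;
	proc->pgxactoff = arr->numProcs;
	arr->procs[arr->numProcs] = proc;
	arr->xids[arr->numProcs] = proc->xid;
	arr->numProcs++;
	return proc->pgxactoff;
}

/* Remove proc, shifting later entries up and fixing their pgxactoff. */
static void
toy_remove(ToyProcArray *arr, ToyProc *proc)
{
	int			off = proc->pgxactoff;

	assert(off >= 0 && off < arr->numProcs && arr->procs[off] == proc);
	for (int i = off; i < arr->numProcs - 1; i++)
	{
		arr->procs[i] = arr->procs[i + 1];
		arr->xids[i] = arr->xids[i + 1];
		arr->procs[i]->pgxactoff = i;
	}
	arr->numProcs--;
	arr->procs[arr->numProcs] = NULL;
	arr->xids[arr->numProcs] = InvalidTransactionId;
	proc->pgxactoff = -1;
}

/* Assign an xid: both the local copy and the dense mirror are updated. */
static void
toy_set_xid(ToyProcArray *arr, ToyProc *proc, TransactionId xid)
{
	proc->xid = xid;
	arr->xids[proc->pgxactoff] = xid;
}

/* End of xact: consult the local copy first; returns 1 if the mirror was written. */
static int
toy_end_xact(ToyProcArray *arr, ToyProc *proc)
{
	if (proc->xid == InvalidTransactionId)
		return 0;				/* no xid: shared array stays untouched */
	proc->xid = InvalidTransactionId;
	arr->xids[proc->pgxactoff] = InvalidTransactionId;
	return 1;
}

/* Snapshot-style scan: a tight loop over xids[] with no indirection. */
static int
toy_count_running(const ToyProcArray *arr)
{
	int			n = 0;

	for (int i = 0; i < arr->numProcs; i++)
		if (arr->xids[i] != InvalidTransactionId)
			n++;
	return n;
}
```

The point of the model is that toy_remove() changes pgxactoff for every
later entry, which is why the offset may only be used under a lock, while
toy_end_xact() shows how the PGPROC copy lets an xid-less backend skip the
shared array altogether.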

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 286c9a9aec3..b828cecd185 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -90,6 +90,17 @@ typedef enum
  * distinguished from a real one at need by the fact that it has pid == 0.
  * The semaphore and lock-activity fields in a prepared-xact PGPROC are unused,
  * but its myProcLocks[] lists are valid.
+ *
+ * Mirrored fields:
+ *
+ * Some fields in PGPROC (see "mirrored in ..." comment) are mirrored into an
+ * element of more densely packed ProcGlobal arrays. These arrays are indexed
+ * by PGPROC->pgxactoff. Both copies need to be maintained coherently.
+ *
+ * NB: The pgxactoff indexed value can *never* be accessed without holding
+ * locks.
+ *
+ * See PROC_HDR for details.
  */
 struct PGPROC
 {
@@ -102,6 +113,12 @@ struct PGPROC
 
 	Latch		procLatch;		/* generic latch for process */
 
+
+	TransactionId xid;			/* id of top-level transaction currently being
+								 * executed by this proc, if running and XID
+								 * is assigned; else InvalidTransactionId.
+								 * mirrored in ProcGlobal->xids[pgxactoff] */
+
 	TransactionId xmin;			/* minimal running XID as it was when we were
 								 * starting our xact, excluding LAZY VACUUM:
 								 * vacuum must not remove tuples deleted by
@@ -111,6 +128,9 @@ struct PGPROC
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
 	int			pid;			/* Backend's process ID; 0 if prepared xact */
+
+	int			pgxactoff;		/* offset into various ProcGlobal->arrays
+								 * with data mirrored from this PGPROC */
 	int			pgprocno;
 
 	/* These fields are zero while a backend is still starting up: */
@@ -225,10 +245,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	TransactionId xid;			/* id of top-level transaction currently being
-								 * executed by this proc, if running and XID
-								 * is assigned; else InvalidTransactionId */
-
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
@@ -237,6 +253,57 @@ typedef struct PGXACT
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
+ *
+ * Adding/Removing an entry into the procarray requires holding *both*
+ * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
+ * needed because the dense arrays (see below) are accessed from
+ * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
+ * further contention by both using the same lock. Adding/Removing a procarray
+ * entry is much less frequent.
+ *
+ * Some fields in PGPROC are mirrored into more densely packed arrays (like
+ * xids), with one entry for each backend. These arrays only contain entries
+ * for PGPROCs that have been added to the shared array with ProcArrayAdd()
+ * (in contrast to PGPROC array which has unused PGPROCs interspersed).
+ *
+ * The dense arrays are indexed by PGPROC->pgxactoff. Any concurrent
+ * ProcArrayAdd() / ProcArrayRemove() can cause the pgxactoff of a procarray
+ * member to change.  Therefore it is only safe to use PGPROC->pgxactoff to
+ * access the dense array while holding either ProcArrayLock or XidGenLock.
+ *
+ * As long as a PGPROC is in the procarray, the mirrored values need to be
+ * maintained in both places in a coherent manner.
+ *
+ * The denser separate arrays are beneficial for three main reasons: First, to
+ * allow for as tight loops accessing the data as possible. Second, to prevent
+ * updates of frequently changing data (e.g. xmin) from invalidating
+ * cachelines also containing less frequently changing data (e.g. xid,
+ * vacuumFlags). Third, to condense frequently accessed data into as few
+ * cachelines as possible.
+ *
+ * There are two main reasons to have the data mirrored between these dense
+ * arrays and PGPROC. First, as explained above, a PGPROC's array entries can
+ * only be accessed with either ProcArrayLock or XidGenLock held, whereas the
+ * PGPROC entries do not require that (obviously there may still be locking
+ * requirements around the individual field, separate from the concerns
+ * here). That is particularly important for a backend to efficiently check
+ * its own values, which it often can safely do without locking.  Second, the
+ * PGPROC fields allow avoiding unnecessary accesses and modifications to the
+ * dense arrays. A backend's own PGPROC is more likely to be in a local cache,
+ * whereas the cachelines for the dense array will be modified by other
+ * backends (often removing it from the cache for other cores/sockets). At
+ * commit/abort time a check of the PGPROC value can avoid accessing/dirtying
+ * the corresponding array value.
+ *
+ * Basically it makes sense to access the PGPROC variable when checking a
+ * single backend's data, especially when already looking at the PGPROC for
+ * other reasons.  It makes sense to look at the "dense" arrays if we
+ * need to look at many / most entries, because we then benefit from the
+ * reduced indirection and better cross-process cache-ability.
+ *
+ * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), the data
+ * in the dense arrays is initialized from the PGPROC while it already holds
+ * ProcArrayLock.
  */
 typedef struct PROC_HDR
 {
@@ -244,6 +311,10 @@ typedef struct PROC_HDR
 	PGPROC	   *allProcs;
 	/* Array of PGXACT structures (not including dummies for prepared txns) */
 	PGXACT	   *allPgXact;
+
+	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
+	TransactionId *xids;
+
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index b25b3e429ed..10848649c0c 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -11,12 +11,12 @@
  * shared buffer content lock on the buffer containing the tuple.
  *
  * NOTE: When using a non-MVCC snapshot, we must check
- * TransactionIdIsInProgress (which looks in the PGXACT array)
+ * TransactionIdIsInProgress (which looks in the PGPROC array)
  * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
  * pg_xact).  Otherwise we have a race condition: we might decide that a
  * just-committed transaction crashed, because none of the tests succeed.
  * xact.c is careful to record commit/abort in pg_xact before it unsets
- * MyPgXact->xid in the PGXACT array.  That fixes that problem, but it
+ * MyProc->xid in the PGPROC array.  That fixes that problem, but it
  * also means there is a window where TransactionIdIsInProgress and
  * TransactionIdDidCommit will both return true.  If we check only
  * TransactionIdDidCommit, we could consider a tuple committed when a
@@ -956,7 +956,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
  * coding where we tried to set the hint bits as soon as possible, we instead
  * did TransactionIdIsInProgress in each call --- to no avail, as long as the
  * inserting/deleting transaction was still running --- which was more cycles
- * and more contention on the PGXACT array.
+ * and more contention on ProcArrayLock.
  */
 static bool
 HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
@@ -1445,7 +1445,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
  *	HeapTupleSatisfiesMVCC) and, therefore, any hint bits that can be set
  *	should already be set.  We assume that if no hint bits are set, the xmin
  *	or xmax transaction is still running.  This is therefore faster than
- *	HeapTupleSatisfiesVacuum, because we don't consult PGXACT nor CLOG.
+ *	HeapTupleSatisfiesVacuum, because we consult neither procarray nor CLOG.
  *	It's okay to return false when in doubt, but we must return true only
  *	if the tuple is removable.
  */
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 94d8f3fd0a2..c46fc3cc194 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -251,10 +251,10 @@ enforce, and it assists with some other issues as explained below.)  The
 implementation of this is that GetSnapshotData takes the ProcArrayLock in
 shared mode (so that multiple backends can take snapshots in parallel),
 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
-while clearing MyPgXact->xid at transaction end (either commit or abort).
-(To reduce context switching, when multiple transactions commit nearly
-simultaneously, we have one backend take ProcArrayLock and clear the XIDs
-of multiple processes at once.)
+while clearing the ProcGlobal->xids[] entry at transaction end (either
+commit or abort). (To reduce context switching, when multiple transactions
+commit nearly simultaneously, we have one backend take ProcArrayLock and
+clear the XIDs of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
 latestCompletedFullXid variable.  This allows GetSnapshotData to use
@@ -278,12 +278,13 @@ present in the ProcArray, or not running anymore.  (This guarantee doesn't
 apply to subtransaction XIDs, because of the possibility that there's not
 room for them in the subxid array; instead we guarantee that they are
 present or the overflow flag is set.)  If a backend released XidGenLock
-before storing its XID into MyPgXact, then it would be possible for another
-backend to allocate and commit a later XID, causing latestCompletedFullXid to
-pass the first backend's XID, before that value became visible in the
-ProcArray.  That would break ComputeXidHorizons, as discussed below.
+before storing its XID into ProcGlobal->xids[], then it would be possible for
+another backend to allocate and commit a later XID, causing
+latestCompletedFullXid to pass the first backend's XID, before that value
+became visible in the ProcArray.  That would break ComputeXidHorizons,
+as discussed below.
 
-We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
+We allow GetNewTransactionId to store the XID into ProcGlobal->xids[] (or the
 subxid array) without taking ProcArrayLock.  This was once necessary to
 avoid deadlock; while that is no longer the case, it's still beneficial for
 performance.  We are thereby relying on fetch/store of an XID to be atomic,
@@ -382,13 +383,14 @@ Top-level transactions do not have a parent, so they leave their pg_subtrans
 entries set to the default value of zero (InvalidTransactionId).
 
 pg_subtrans is used to check whether the transaction in question is still
-running --- the main Xid of a transaction is recorded in the PGXACT struct,
-but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
-in shared memory, so we have to store them on disk.  Note, however, that for
-each transaction we keep a "cache" of Xids that are known to be part of the
-transaction tree, so we can skip looking at pg_subtrans unless we know the
-cache has been overflowed.  See storage/ipc/procarray.c for the gory details.
+running --- the main Xid of a transaction is recorded in ProcGlobal->xids[],
+with a copy in PGPROC->xid, but since we allow arbitrary nesting of
+subtransactions, we can't fit all Xids in shared memory, so we have to store
+them on disk.  Note, however, that for each transaction we keep a "cache" of
+Xids that are known to be part of the transaction tree, so we can skip looking
+at pg_subtrans unless we know the cache has been overflowed.  See
+storage/ipc/procarray.c for the gory details.
 
 slru.c is the supporting mechanism for both pg_xact and pg_subtrans.  It
 implements the LRU policy for in-memory buffer pages.  The high-level routines
 for pg_xact are implemented in transam.c, while the low-level functions are in
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae017..5198a0cef68 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -285,15 +285,15 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 	 * updates for multiple backends so that the number of times XactSLRULock
 	 * needs to be acquired is reduced.
 	 *
-	 * For this optimization to be safe, the XID in MyPgXact and the subxids
-	 * in MyProc must be the same as the ones for which we're setting the
-	 * status.  Check that this is the case.
+	 * For this optimization to be safe, the XID and subxids in MyProc must be
+	 * the same as the ones for which we're setting the status.  Check that
+	 * this is the case.
 	 *
 	 * For this optimization to be efficient, we shouldn't have too many
 	 * sub-XIDs and all of the XIDs for which we're adjusting clog should be
 	 * on the same page.  Check those conditions, too.
 	 */
-	if (all_xact_same_page && xid == MyPgXact->xid &&
+	if (all_xact_same_page && xid == MyProc->xid &&
 		nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT &&
 		nsubxids == MyPgXact->nxids &&
 		memcmp(subxids, MyProc->subxids.xids,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ae7c1a4c172..d073eb07d23 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -351,7 +351,7 @@ AtAbort_Twophase(void)
 
 /*
  * This is called after we have finished transferring state to the prepared
- * PGXACT entry.
+ * PGPROC entry.
  */
 void
 PostPrepare_Twophase(void)
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->waitStatus = PROC_WAIT_STATUS_OK;
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
 	proc->lxid = (LocalTransactionId) xid;
-	pgxact->xid = xid;
+	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
@@ -768,7 +768,6 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 	{
 		GlobalTransaction gxact = &status->array[status->currIdx++];
 		PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 		Datum		values[5];
 		bool		nulls[5];
 		HeapTuple	tuple;
@@ -783,7 +782,7 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 		MemSet(values, 0, sizeof(values));
 		MemSet(nulls, 0, sizeof(nulls));
 
-		values[0] = TransactionIdGetDatum(pgxact->xid);
+		values[0] = TransactionIdGetDatum(proc->xid);
 		values[1] = CStringGetTextDatum(gxact->gid);
 		values[2] = TimestampTzGetDatum(gxact->prepared_at);
 		values[3] = ObjectIdGetDatum(gxact->owner);
@@ -829,9 +828,8 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
-		if (pgxact->xid == xid)
+		if (gxact->xid == xid)
 		{
 			result = gxact;
 			break;
@@ -987,8 +985,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
-	TransactionId xid = pgxact->xid;
+	TransactionId xid = gxact->xid;
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileNode *commitrels;
@@ -1140,15 +1137,15 @@ EndPrepare(GlobalTransaction gxact)
 
 	/*
 	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyPgXact as not running our XID (which it will do immediately after
+	 * MyProc as not running our XID (which it will do immediately after
 	 * this function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
-	 * prepared XID.  This must happen before we clear the XID from MyPgXact,
-	 * else there is a window where the XID is not running according to
-	 * TransactionIdIsInProgress, and onlookers would be entitled to assume
-	 * the xact crashed.  Instead we have a window where the same XID appears
-	 * twice in ProcArray, which is OK.
+	 * prepared XID.  This must happen before we clear the XID from MyProc /
+	 * ProcGlobal->xids[], else there is a window where the XID is not running
+	 * according to TransactionIdIsInProgress, and onlookers would be entitled
+	 * to assume the xact crashed.  Instead we have a window where the same
+	 * XID appears twice in ProcArray, which is OK.
 	 */
 	MarkAsPrepared(gxact, false);
 
@@ -1404,7 +1401,6 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
-	PGXACT	   *pgxact;
 	TransactionId xid;
 	char	   *buf;
 	char	   *bufptr;
@@ -1423,8 +1419,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
-	xid = pgxact->xid;
+	xid = gxact->xid;
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1726,7 +1721,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		/*
-		 * Note that we are using gxact not pgxact so this works in recovery
+		 * Note that we are using gxact not PGPROC so this works in recovery
 		 * also
 		 */
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 00b8e4e50d7..ab376f2fe22 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -38,7 +38,8 @@ VariableCache ShmemVariableCache = NULL;
  * Allocate the next FullTransactionId for a new transaction or
  * subtransaction.
  *
- * The new XID is also stored into MyPgXact before returning.
+ * The new XID is also stored into MyProc->xid/ProcGlobal->xids[] before
+ * returning.
  *
  * Note: when this is called, we are actually already inside a valid
  * transaction, since XIDs are now not allocated until the transaction
@@ -65,7 +66,8 @@ GetNewTransactionId(bool isSubXact)
 	if (IsBootstrapProcessingMode())
 	{
 		Assert(!isSubXact);
-		MyPgXact->xid = BootstrapTransactionId;
+		MyProc->xid = BootstrapTransactionId;
+		ProcGlobal->xids[MyProc->pgxactoff] = BootstrapTransactionId;
 		return FullTransactionIdFromEpochAndXid(0, BootstrapTransactionId);
 	}
 
@@ -190,10 +192,10 @@ GetNewTransactionId(bool isSubXact)
 	 * latestCompletedXid is present in the ProcArray, which is essential for
 	 * correct OldestXmin tracking; see src/backend/access/transam/README.
 	 *
-	 * Note that readers of PGXACT xid fields should be careful to fetch the
-	 * value only once, rather than assume they can read a value multiple
-	 * times and get the same answer each time.  Note we are assuming that
-	 * TransactionId and int fetch/store are atomic.
+	 * Note that readers of ProcGlobal->xids/PGPROC->xid should be careful
+	 * to fetch the value for each proc only once, rather than assume they can
+	 * read a value multiple times and get the same answer each time.  Note we
+	 * are assuming that TransactionId and int fetch/store are atomic.
 	 *
 	 * The same comments apply to the subxact xid count and overflow fields.
 	 *
@@ -219,7 +221,11 @@ GetNewTransactionId(bool isSubXact)
 	 * answer later on when someone does have a reason to inquire.)
 	 */
 	if (!isSubXact)
-		MyPgXact->xid = xid;	/* LWLockRelease acts as barrier */
+	{
+		/* LWLockRelease acts as barrier */
+		MyProc->xid = xid;
+		ProcGlobal->xids[MyProc->pgxactoff] = xid;
+	}
 	else
 	{
 		int			nxids = MyPgXact->nxids;
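
The varsup.c comment above asks readers of ProcGlobal->xids[]/PGPROC->xid to
fetch each value only once, because the owning backend may change it at any
time. A minimal sketch of that reading pattern follows. The macro and the
names demo_xids and demo_slot_matches are hypothetical stand-ins (the
definition of UINT32_ACCESS_ONCE is not shown in this excerpt); the point is
only that every later test uses the one local copy.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in: a volatile-qualified read, so the compiler emits
 * exactly one load and cannot re-read the shared slot later. */
#define DEMO_ACCESS_ONCE(var) (*((volatile uint32_t *) &(var)))

#define DEMO_INVALID_XID 0u

/* Stand-in for a shared, densely packed xid array. */
static uint32_t demo_xids[4];

/* Returns 1 if the slot held 'target' at the single moment it was read. */
static int
demo_slot_matches(int slot, uint32_t target)
{
	/* Read once; both checks below use the same local value. */
	uint32_t	seen = DEMO_ACCESS_ONCE(demo_xids[slot]);

	if (seen == DEMO_INVALID_XID)
		return 0;
	return seen == target;
}
```

If the function instead tested demo_xids[slot] twice, a concurrent store
between the two reads could make the validity check and the comparison
disagree about which value they saw.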
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 22228f5684f..648e12c78d8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1724,7 +1724,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 *
 		 * Note: these flags remain set until CommitTransaction or
 		 * AbortTransaction.  We don't want to clear them until we reset
-		 * MyPgXact->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
+		 * MyProc->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
 		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
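
The procarray.c changes that follow keep ProcGlobal->xids[] densely packed in
the same order as arrayP->pgprocnos[], and store each entry's position back
into the owning PGPROC as pgxactoff. A self-contained sketch of that
bookkeeping, under simplifying assumptions: no locking, a fixed-size array,
and hypothetical names (DemoProc, demo_add, demo_remove) that are not part of
the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROCS 8

typedef uint32_t Xid;

/* Hypothetical stand-in for PGPROC: only the fields the sketch needs. */
typedef struct DemoProc
{
	Xid			xid;		/* copy of the dense-array entry */
	int			off;		/* index into the dense arrays, like pgxactoff */
} DemoProc;

static DemoProc procs[MAX_PROCS];	/* indexed by procno */
static int	procnos[MAX_PROCS];		/* dense, kept sorted by procno */
static Xid	xids[MAX_PROCS];		/* dense, parallel to procnos[] */
static int	nprocs = 0;

static void
demo_add(int procno)
{
	int			index;

	assert(nprocs < MAX_PROCS);

	/* Find the insertion point that keeps procnos[] sorted. */
	for (index = 0; index < nprocs; index++)
	{
		if (procnos[index] > procno)
			break;
	}

	/* Shift both parallel arrays up by one slot. */
	memmove(&procnos[index + 1], &procnos[index],
			(nprocs - index) * sizeof(*procnos));
	memmove(&xids[index + 1], &xids[index],
			(nprocs - index) * sizeof(*xids));

	procnos[index] = procno;
	xids[index] = procs[procno].xid;
	nprocs++;

	/* Every entry at or after index moved: refresh its back-offset. */
	for (; index < nprocs; index++)
		procs[procnos[index]].off = index;
}

static void
demo_remove(int procno)
{
	int			index = procs[procno].off;

	assert(procnos[index] == procno);

	/* Close the gap in both parallel arrays. */
	memmove(&procnos[index], &procnos[index + 1],
			(nprocs - index - 1) * sizeof(*procnos));
	memmove(&xids[index], &xids[index + 1],
			(nprocs - index - 1) * sizeof(*xids));
	nprocs--;

	/* Entries after the removed one shifted down by one slot. */
	for (; index < nprocs; index++)
		procs[procnos[index]].off = index;
}
```

The payoff of the extra bookkeeping is that a scan over all running xids
walks one contiguous array instead of chasing a pointer per backend, while a
backend can still find its own slot in O(1) through its stored offset.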
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 980ca2cc2af..a9b32565367 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -9,8 +9,9 @@
  * one is as a means of determining the set of currently running transactions.
  *
  * Because of various subtle race conditions it is critical that a backend
- * hold the correct locks while setting or clearing its MyPgXact->xid field.
- * See notes in src/backend/access/transam/README.
+ * hold the correct locks while setting or clearing its xid (in
+ * ProcGlobal->xids[]/MyProc->xid).  See notes in
+ * src/backend/access/transam/README.
  *
  * The process arrays now also include structures representing prepared
  * transactions.  The xid and subxids fields of these are valid, as are the
@@ -434,7 +435,9 @@ ProcArrayAdd(PGPROC *proc)
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
 
+	/* See ProcGlobal comment explaining why both locks are held */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	if (arrayP->numProcs >= arrayP->maxProcs)
 	{
@@ -443,7 +446,6 @@ ProcArrayAdd(PGPROC *proc)
 		 * fixed supply of PGPROC structs too, and so we should have failed
 		 * earlier.)
 		 */
-		LWLockRelease(ProcArrayLock);
 		ereport(FATAL,
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
@@ -469,10 +471,25 @@ ProcArrayAdd(PGPROC *proc)
 	}
 
 	memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index],
-			(arrayP->numProcs - index) * sizeof(int));
+			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
+	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+
 	arrayP->pgprocnos[index] = proc->pgprocno;
+	ProcGlobal->xids[index] = proc->xid;
+
 	arrayP->numProcs++;
 
+	for (; index < arrayP->numProcs; index++)
+	{
+		allProcs[arrayP->pgprocnos[index]].pgxactoff = index;
+	}
+
+	/*
+	 * Release in reversed acquisition order, to reduce frequency of having to
+	 * wait for XidGenLock while holding ProcArrayLock.
+	 */
+	LWLockRelease(XidGenLock);
 	LWLockRelease(ProcArrayLock);
 }
 
@@ -498,36 +515,59 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		DisplayXidCache();
 #endif
 
+	/* See ProcGlobal comment explaining why both locks are held */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
+
+	Assert(ProcGlobal->allProcs[arrayP->pgprocnos[proc->pgxactoff]].pgxactoff == proc->pgxactoff);
 
 	if (TransactionIdIsValid(latestXid))
 	{
-		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
 
 		/* Advance global latestCompletedXid while holding the lock */
 		MaintainLatestCompletedXid(latestXid);
+
+		ProcGlobal->xids[proc->pgxactoff] = InvalidTransactionId;
 	}
 	else
 	{
 		/* Shouldn't be trying to remove a live transaction here */
-		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
 	}
 
+	Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
+
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		if (arrayP->pgprocnos[index] == proc->pgprocno)
 		{
 			/* Keep the PGPROC array sorted. See notes above */
 			memmove(&arrayP->pgprocnos[index], &arrayP->pgprocnos[index + 1],
-					(arrayP->numProcs - index - 1) * sizeof(int));
+					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
+			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+
 			arrayP->pgprocnos[arrayP->numProcs - 1] = -1;	/* for debugging */
 			arrayP->numProcs--;
+
+			for (; index < arrayP->numProcs; index++)
+			{
+				allProcs[arrayP->pgprocnos[index]].pgxactoff--;
+			}
+
+			/*
+			 * Release in reversed acquisition order, to reduce frequency of
+			 * having to wait for XidGenLock while holding ProcArrayLock.
+			 */
+			LWLockRelease(XidGenLock);
 			LWLockRelease(ProcArrayLock);
 			return;
 		}
 	}
 
 	/* Oops */
+	LWLockRelease(XidGenLock);
 	LWLockRelease(ProcArrayLock);
 
 	elog(LOG, "failed to find proc %p in ProcArray", proc);
@@ -560,7 +600,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * else is taking a snapshot.  See discussion in
 		 * src/backend/access/transam/README.
 		 */
-		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(TransactionIdIsValid(proc->xid));
 
 		/*
 		 * If we can immediately acquire ProcArrayLock, we clear our own XID
@@ -582,7 +622,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * anyone else's calculation of a snapshot.  We might change their
 		 * estimate of global xmin, but that's OK.
 		 */
-		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(!TransactionIdIsValid(proc->xid));
 
 		proc->lxid = InvalidLocalTransactionId;
 		/* must be cleared with xid/xmin: */
@@ -605,7 +645,13 @@ static inline void
 ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 								TransactionId latestXid)
 {
-	pgxact->xid = InvalidTransactionId;
+	size_t		pgxactoff = proc->pgxactoff;
+
+	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
+	Assert(ProcGlobal->xids[pgxactoff] == proc->xid);
+
+	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
+	proc->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
@@ -641,7 +687,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	uint32		wakeidx;
 
 	/* We should definitely have an XID to clear. */
-	Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+	Assert(TransactionIdIsValid(proc->xid));
 
 	/* Add ourselves to the list of processes needing a group XID clear. */
 	proc->procArrayGroupMember = true;
@@ -746,20 +792,28 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
  * This is used after successfully preparing a 2-phase transaction.  We are
  * not actually reporting the transaction's XID as no longer running --- it
  * will still appear as running because the 2PC's gxact is in the ProcArray
- * too.  We just have to clear out our own PGXACT.
+ * too.  We just have to clear out our own PGPROC.
  */
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
 	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
+	size_t		pgxactoff;
 
 	/*
-	 * We can skip locking ProcArrayLock here, because this action does not
-	 * actually change anyone's view of the set of running XIDs: our entry is
-	 * duplicate with the gxact that has already been inserted into the
-	 * ProcArray.
+	 * We can skip locking ProcArrayLock exclusively here, because this action
+	 * does not actually change anyone's view of the set of running XIDs: our
+	 * entry is duplicate with the gxact that has already been inserted into
+	 * the ProcArray.  But we do need it in shared mode so that
+	 * proc->pgxactoff stays the same.
 	 */
-	pgxact->xid = InvalidTransactionId;
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	pgxactoff = proc->pgxactoff;
+
+	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
+	proc->xid = InvalidTransactionId;
+
 	proc->lxid = InvalidLocalTransactionId;
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
@@ -771,6 +825,8 @@ ProcArrayClearTransaction(PGPROC *proc)
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
 	pgxact->overflowed = false;
+
+	LWLockRelease(ProcArrayLock);
 }
 
 /*
@@ -1164,7 +1220,7 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * there are four possibilities for finding a running transaction:
  *
  * 1. The given Xid is a main transaction Id.  We will find this out cheaply
- * by looking at the PGXACT struct for each backend.
+ * by looking at ProcGlobal->xids.
  *
  * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
@@ -1173,25 +1229,27 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * if the Xid is running on the primary.
  *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to PGXACT or KnownAssignedXids.  This is the
- * slowest way, but sadly it has to be done always if the others failed,
- * unless we see that the cached subxact sets are complete (none have
+ * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * This is the slowest way, but sadly it has to be done always if the others
+ * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
  * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
  * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
- * PGXACT again anyway; see GetNewTransactionId).
+ * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
 bool
 TransactionIdIsInProgress(TransactionId xid)
 {
 	static TransactionId *xids = NULL;
+	static TransactionId *other_xids;
 	int			nxids = 0;
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId topxid;
-	int			i,
-				j;
+	int			mypgxactoff;
+	size_t		numProcs;
+	int			j;
 
 	/*
 	 * Don't bother checking a transaction older than RecentXmin; it could not
@@ -1246,6 +1304,8 @@ TransactionIdIsInProgress(TransactionId xid)
 					 errmsg("out of memory")));
 	}
 
+	other_xids = ProcGlobal->xids;
+
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
@@ -1261,20 +1321,22 @@ TransactionIdIsInProgress(TransactionId xid)
 	}
 
 	/* No shortcuts, gotta grovel through the array */
-	for (i = 0; i < arrayP->numProcs; i++)
+	mypgxactoff = MyProc->pgxactoff;
+	numProcs = arrayP->numProcs;
+	for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 	{
-		int			pgprocno = arrayP->pgprocnos[i];
-		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		int			pgprocno;
+		PGXACT	   *pgxact;
+		PGPROC	   *proc;
 		TransactionId pxid;
 		int			pxids;
 
-		/* Ignore my own proc --- dealt with it above */
-		if (proc == MyProc)
+		/* Ignore ourselves --- dealt with it above */
+		if (pgxactoff == mypgxactoff)
 			continue;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = UINT32_ACCESS_ONCE(pgxact->xid);
+		pxid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -1299,8 +1361,12 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
+		pgprocno = arrayP->pgprocnos[pgxactoff];
+		pgxact = &allPgXact[pgprocno];
 		pxids = pgxact->nxids;
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
+		pgprocno = arrayP->pgprocnos[pgxactoff];
+		proc = &allProcs[pgprocno];
 		for (j = pxids - 1; j >= 0; j--)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
@@ -1331,7 +1397,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (RecoveryInProgress())
 	{
-		/* none of the PGXACT entries should have XIDs in hot standby mode */
+		/* none of the PGPROC entries should have XIDs in hot standby mode */
 		Assert(nxids == 0);
 
 		if (KnownAssignedXidExists(xid))
@@ -1386,7 +1452,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	Assert(TransactionIdIsValid(topxid));
 	if (!TransactionIdEquals(topxid, xid))
 	{
-		for (i = 0; i < nxids; i++)
+		for (int i = 0; i < nxids; i++)
 		{
 			if (TransactionIdEquals(xids[i], topxid))
 				return true;
@@ -1409,6 +1475,7 @@ TransactionIdIsActive(TransactionId xid)
 {
 	bool		result = false;
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	int			i;
 
 	/*
@@ -1424,11 +1491,10 @@ TransactionIdIsActive(TransactionId xid)
 	{
 		int			pgprocno = arrayP->pgprocnos[i];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId pxid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = UINT32_ACCESS_ONCE(pgxact->xid);
+		pxid = UINT32_ACCESS_ONCE(other_xids[i]);
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -1506,6 +1572,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId kaxmin;
 	bool		in_recovery = RecoveryInProgress();
+	TransactionId *other_xids = ProcGlobal->xids;
 
 	/* inferred after ProcArrayLock is released */
 	h->catalog_oldest_nonremovable = InvalidTransactionId;
@@ -1549,7 +1616,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		TransactionId xmin;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*
@@ -1837,14 +1904,17 @@ Snapshot
 GetSnapshotData(Snapshot snapshot)
 {
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	TransactionId xmin;
 	TransactionId xmax;
-	int			index;
-	int			count = 0;
+	size_t		count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
 	FullTransactionId latest_completed;
 	TransactionId oldestxid;
+	int			mypgxactoff;
+	TransactionId myxid;
+
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1889,6 +1959,10 @@ GetSnapshotData(Snapshot snapshot)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	latest_completed = ShmemVariableCache->latestCompletedFullXid;
+	mypgxactoff = MyProc->pgxactoff;
+	myxid = other_xids[mypgxactoff];
+	Assert(myxid == MyProc->xid);
+
 	oldestxid = ShmemVariableCache->oldestXid;
 
 	/* xmax is always latestCompletedXid + 1 */
@@ -1899,57 +1973,79 @@ GetSnapshotData(Snapshot snapshot)
 	/* initialize xmin calculation with xmax */
 	xmin = xmax;
 
+	/* take own xid into account, saves a check inside the loop */
+	if (TransactionIdIsNormal(myxid) && NormalTransactionIdPrecedes(myxid, xmin))
+		xmin = myxid;
+
 	snapshot->takenDuringRecovery = RecoveryInProgress();
 
 	if (!snapshot->takenDuringRecovery)
 	{
+		size_t		numProcs = arrayP->numProcs;
+		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
-		int			numProcs;
 
 		/*
-		 * Spin over procArray checking xid, xmin, and subxids.  The goal is
-		 * to gather all active xids, find the lowest xmin, and try to record
-		 * subxids.
+		 * First collect set of pgxactoff/xids that need to be included in the
+		 * snapshot.
 		 */
-		numProcs = arrayP->numProcs;
-		for (index = 0; index < numProcs; index++)
+		for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 		{
-			int			pgprocno = pgprocnos[index];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			TransactionId xid;
+			/* Fetch xid just once - see GetNewTransactionId */
+			TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
+			int			pgprocno;
+			PGXACT	   *pgxact;
+			uint8		vacuumFlags;
+
+			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+
+			/*
+			 * If the transaction has no XID assigned, we can skip it; it
+			 * won't have sub-XIDs either.
+			 */
+			if (likely(xid == InvalidTransactionId))
+				continue;
+
+			/*
+			 * We don't include our own XIDs (if any) in the snapshot.  Our
+			 * xid needs to be included in the xmin computation, but we did
+			 * so outside the loop.
+			 */
+			if (pgxactoff == mypgxactoff)
+				continue;
+
+			/*
+			 * The only way we are able to get here with a non-normal xid
+			 * is during bootstrap - with this backend using
+			 * BootstrapTransactionId. But the above test should filter
+			 * that out.
+			 */
+			Assert(TransactionIdIsNormal(xid));
+
+			/*
+			 * If the XID is >= xmax, we can skip it; such transactions will
+			 * be treated as running anyway (and any sub-XIDs will also be >=
+			 * xmax).
+			 */
+			if (!NormalTransactionIdPrecedes(xid, xmax))
+				continue;
+
+			pgprocno = pgprocnos[pgxactoff];
+			pgxact = &allPgXact[pgprocno];
+			vacuumFlags = pgxact->vacuumFlags;
 
 			/*
 			 * Skip over backends doing logical decoding which manages xmin
 			 * separately (check below) and ones running LAZY VACUUM.
 			 */
-			if (pgxact->vacuumFlags &
-				(PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
+			if (vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
-			/* Fetch xid just once - see GetNewTransactionId */
-			xid = UINT32_ACCESS_ONCE(pgxact->xid);
-
-			/*
-			 * If the transaction has no XID assigned, we can skip it; it
-			 * won't have sub-XIDs either.  If the XID is >= xmax, we can also
-			 * skip it; such transactions will be treated as running anyway
-			 * (and any sub-XIDs will also be >= xmax).
-			 */
-			if (!TransactionIdIsNormal(xid)
-				|| !NormalTransactionIdPrecedes(xid, xmax))
-				continue;
-
-			/*
-			 * We don't include our own XIDs (if any) in the snapshot, but we
-			 * must include them in xmin.
-			 */
 			if (NormalTransactionIdPrecedes(xid, xmin))
 				xmin = xid;
-			if (pgxact == MyPgXact)
-				continue;
 
 			/* Add XID to snapshot. */
-			snapshot->xip[count++] = xid;
+			xip[count++] = xid;
 
 			/*
 			 * Save subtransaction XIDs if possible (if we've already
@@ -1972,9 +2068,9 @@ GetSnapshotData(Snapshot snapshot)
 					suboverflowed = true;
 				else
 				{
-					int			nxids = pgxact->nxids;
+					int			nsubxids = pgxact->nxids;
 
-					if (nxids > 0)
+					if (nsubxids > 0)
 					{
 						PGPROC	   *proc = &allProcs[pgprocno];
 
@@ -1982,8 +2078,8 @@ GetSnapshotData(Snapshot snapshot)
 
 						memcpy(snapshot->subxip + subcount,
 							   (void *) proc->subxids.xids,
-							   nxids * sizeof(TransactionId));
-						subcount += nxids;
+							   nsubxids * sizeof(TransactionId));
+						subcount += nsubxids;
 					}
 				}
 			}
@@ -2115,6 +2211,7 @@ GetSnapshotData(Snapshot snapshot)
 	}
 
 	RecentXmin = xmin;
+	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
 
 	snapshot->xmin = xmin;
 	snapshot->xmax = xmax;
@@ -2277,7 +2374,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * GetRunningTransactionData -- returns information about running transactions.
  *
  * Similar to GetSnapshotData but returns more information. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes and
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes and
  * prepared transactions.
  *
  * We acquire XidGenLock and ProcArrayLock, but the caller is responsible for
@@ -2292,7 +2389,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * This is never executed during recovery so there is no need to look at
  * KnownAssignedXids.
  *
- * Dummy PGXACTs from prepared transaction are included, meaning that this
+ * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
  * entries here to not hold on ProcArrayLock more than necessary.
@@ -2311,6 +2408,7 @@ GetRunningTransactionData(void)
 	static RunningTransactionsData CurrentRunningXactsData;
 
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	RunningTransactions CurrentRunningXacts = &CurrentRunningXactsData;
 	TransactionId latestCompletedXid;
 	TransactionId oldestRunningXid;
@@ -2370,7 +2468,7 @@ GetRunningTransactionData(void)
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 		/*
 		 * We don't need to store transactions that don't have a TransactionId
@@ -2467,7 +2565,7 @@ GetRunningTransactionData(void)
  * GetOldestActiveTransactionId()
  *
  * Similar to GetSnapshotData but returns just oldestActiveXid. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes.
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes.
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
@@ -2482,6 +2580,7 @@ TransactionId
 GetOldestActiveTransactionId(void)
 {
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	TransactionId oldestRunningXid;
 	int			index;
 
@@ -2504,12 +2603,10 @@ GetOldestActiveTransactionId(void)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		int			pgprocno = arrayP->pgprocnos[index];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 		if (!TransactionIdIsNormal(xid))
 			continue;
@@ -2587,8 +2684,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 * If we're not in recovery, we walk over the procarray and collect the
 	 * lowest xid. Since we're called with ProcArrayLock held and have
 	 * acquired XidGenLock, no entries can vanish concurrently, since
-	 * PGXACT->xid is only set with XidGenLock held and only cleared with
-	 * ProcArrayLock held.
+	 * ProcGlobal->xids[i] is only set with XidGenLock held and only cleared
+	 * with ProcArrayLock held.
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
@@ -2597,17 +2694,17 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 */
 	if (!recovery_in_progress)
 	{
+		TransactionId *other_xids = ProcGlobal->xids;
+
 		/*
-		 * Spin over procArray collecting all min(PGXACT->xid)
+		 * Spin over procArray collecting min(ProcGlobal->xids[i])
 		 */
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
-			int			pgprocno = arrayP->pgprocnos[index];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
 			TransactionId xid;
 
 			/* Fetch xid just once - see GetNewTransactionId */
-			xid = UINT32_ACCESS_ONCE(pgxact->xid);
+			xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 			if (!TransactionIdIsNormal(xid))
 				continue;
@@ -2795,6 +2892,7 @@ BackendXidGetPid(TransactionId xid)
 {
 	int			result = 0;
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	int			index;
 
 	if (xid == InvalidTransactionId)	/* never match invalid xid */
@@ -2806,9 +2904,8 @@ BackendXidGetPid(TransactionId xid)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
-		if (pgxact->xid == xid)
+		if (other_xids[index] == xid)
 		{
 			result = proc->pid;
 			break;
@@ -3088,7 +3185,6 @@ MinimumActiveBackends(int min)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
 		/*
 		 * Since we're not holding a lock, need to be prepared to deal with
@@ -3105,7 +3201,7 @@ MinimumActiveBackends(int min)
 			continue;			/* do not count deleted entries */
 		if (proc == MyProc)
 			continue;			/* do not count myself */
-		if (pgxact->xid == InvalidTransactionId)
+		if (proc->xid == InvalidTransactionId)
 			continue;			/* do not count if no XID assigned */
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
@@ -3531,8 +3627,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 *
 	 * Note that we do not have to be careful about memory ordering of our own
 	 * reads wrt. GetNewTransactionId() here - only this process can modify
-	 * relevant fields of MyProc/MyPgXact.  But we do have to be careful about
-	 * our own writes being well ordered.
+	 * relevant fields of MyProc/ProcGlobal->xids[].  But we do have to be
+	 * careful about our own writes being well ordered.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
@@ -3885,7 +3981,7 @@ FullXidViaRelative(FullTransactionId rel, TransactionId xid)
  * In Hot Standby mode, we maintain a list of transactions that are (or were)
  * running on the primary at the current point in WAL.  These XIDs must be
  * treated as running by standby transactions, even though they are not in
- * the standby server's PGXACT array.
+ * the standby server's PGPROC array.
  *
  * We record all XIDs that we know have been assigned.  That includes all the
  * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index ad048bc85fa..a9477ccb4a3 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -417,9 +417,7 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 
 		if (proc != NULL)
 		{
-			PGXACT	   *xact = &ProcGlobal->allPgXact[proc->pgprocno];
-
-			*xid = xact->xid;
+			*xid = proc->xid;
 			*xmin = proc->xmin;
 		}
 	}
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 95989ce79bd..d86566f4554 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3974,9 +3974,8 @@ GetRunningTransactionLocks(int *nlocks)
 			proclock->tag.myLock->tag.locktag_type == LOCKTAG_RELATION)
 		{
 			PGPROC	   *proc = proclock->tag.myProc;
-			PGXACT	   *pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
 			LOCK	   *lock = proclock->tag.myLock;
-			TransactionId xid = pgxact->xid;
+			TransactionId xid = proc->xid;
 
 			/*
 			 * Don't record locks for transactions if we know they have
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index de346cd87fc..7fad49544ce 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -102,21 +102,18 @@ Size
 ProcGlobalShmemSize(void)
 {
 	Size		size = 0;
+	Size		TotalProcs =
+		add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
 
 	/* ProcGlobal */
 	size = add_size(size, sizeof(PROC_HDR));
-	/* MyProcs, including autovacuum workers and launcher */
-	size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
-	/* AuxiliaryProcs */
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
-	/* Prepared xacts */
-	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
-	/* ProcStructLock */
+	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, sizeof(slock_t));
 
 	size = add_size(size, mul_size(MaxBackends, sizeof(PGXACT)));
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
 
 	return size;
 }
@@ -216,6 +213,17 @@ InitProcGlobal(void)
 	MemSet(pgxacts, 0, TotalProcs * sizeof(PGXACT));
 	ProcGlobal->allPgXact = pgxacts;
 
+	/*
+	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
+	 * PROC_HDR.
+	 *
+	 * XXX: It might make sense to increase padding for these arrays, given
+	 * how hotly they are accessed.
+	 */
+	ProcGlobal->xids =
+		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+
 	for (i = 0; i < TotalProcs; i++)
 	{
 		/* Common initialization for all PGPROCs, regardless of type. */
@@ -387,7 +395,7 @@ InitProcess(void)
 	MyProc->lxid = InvalidLocalTransactionId;
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
-	MyPgXact->xid = InvalidTransactionId;
+	MyProc->xid = InvalidTransactionId;
 	MyProc->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
@@ -571,7 +579,7 @@ InitAuxiliaryProcess(void)
 	MyProc->lxid = InvalidLocalTransactionId;
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
-	MyPgXact->xid = InvalidTransactionId;
+	MyProc->xid = InvalidTransactionId;
 	MyProc->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
-- 
2.25.0.114.g5b0ca878e0
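
The patch above keeps a dense array (ProcGlobal->xids) in step with the per-process PGPROC.xid field, indexed by each process's current offset. A minimal sketch of that idea follows. All names in it (DemoProc, DemoProcArray, demo_add, demo_remove, demo_set_xid, demo_oldest_xid) are hypothetical and not part of PostgreSQL; locking is omitted, and xids are compared with a plain `<`, so wraparound is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_PROCS 8

typedef uint32_t DemoXid;		/* 0 plays the role of InvalidTransactionId */

typedef struct DemoProc
{
	DemoXid		xid;			/* per-process copy of the xid */
	int			pgxactoff;		/* current offset into the dense arrays */
} DemoProc;

typedef struct DemoProcArray
{
	int			numProcs;
	DemoProc   *procs[DEMO_MAX_PROCS];	/* stands in for pgprocnos[] */
	DemoXid		xids[DEMO_MAX_PROCS];	/* dense mirror of procs[i]->xid */
} DemoProcArray;

/* Insert proc at position index, shifting later entries up by one. */
static void
demo_add(DemoProcArray *a, DemoProc *proc, int index)
{
	int			i;

	assert(a->numProcs < DEMO_MAX_PROCS);
	assert(index >= 0 && index <= a->numProcs);

	memmove(&a->procs[index + 1], &a->procs[index],
			(a->numProcs - index) * sizeof(*a->procs));
	memmove(&a->xids[index + 1], &a->xids[index],
			(a->numProcs - index) * sizeof(*a->xids));

	a->procs[index] = proc;
	a->xids[index] = proc->xid;
	a->numProcs++;

	/* every entry at or after index now lives at a new offset */
	for (i = index; i < a->numProcs; i++)
		a->procs[i]->pgxactoff = i;
}

/* Remove proc, closing the gap so the arrays stay dense. */
static void
demo_remove(DemoProcArray *a, DemoProc *proc)
{
	int			index = proc->pgxactoff;
	int			i;

	assert(a->procs[index] == proc);

	memmove(&a->procs[index], &a->procs[index + 1],
			(a->numProcs - index - 1) * sizeof(*a->procs));
	memmove(&a->xids[index], &a->xids[index + 1],
			(a->numProcs - index - 1) * sizeof(*a->xids));
	a->numProcs--;

	for (i = index; i < a->numProcs; i++)
		a->procs[i]->pgxactoff = i;
}

/* Update both copies together so they never disagree. */
static void
demo_set_xid(DemoProcArray *a, DemoProc *proc, DemoXid xid)
{
	proc->xid = xid;
	a->xids[proc->pgxactoff] = xid;
}

/* Scan only the dense array; no per-process struct is touched. */
static DemoXid
demo_oldest_xid(const DemoProcArray *a)
{
	DemoXid		oldest = 0;
	int			i;

	for (i = 0; i < a->numProcs; i++)
	{
		DemoXid		xid = a->xids[i];

		if (xid == 0)
			continue;			/* no xid assigned */
		if (oldest == 0 || xid < oldest)
			oldest = xid;
	}
	return oldest;
}
```

The point of the layout is that a scan such as demo_oldest_xid() reads one contiguous array rather than following an index into a larger per-process struct, which is the access pattern the hunks above switch to via other_xids[index].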

v11-0004-snapshot-scalability-Move-PGXACT-vacuumFlags-to-.patch (text/x-diff; charset=us-ascii)
From f692283f4f8bd307830bffae7c826e10af113074 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v11 4/6] snapshot scalability: Move PGXACT->vacuumFlags to
 ProcGlobal->vacuumFlags.

Similar to the previous commit this increases the chance that data
frequently needed by GetSnapshotData() stays in L2 cache. As we now
take care to not unnecessarily write to ProcGlobal->vacuumFlags, there
should be very few modifications to the ProcGlobal->vacuumFlags array.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h                | 12 ++++-
 src/backend/access/transam/twophase.c     |  2 +-
 src/backend/commands/analyze.c            | 10 ++--
 src/backend/commands/vacuum.c             |  5 +-
 src/backend/postmaster/autovacuum.c       |  6 +--
 src/backend/replication/logical/logical.c |  3 +-
 src/backend/replication/slot.c            |  3 +-
 src/backend/storage/ipc/procarray.c       | 66 ++++++++++++++---------
 src/backend/storage/lmgr/deadlock.c       |  4 +-
 src/backend/storage/lmgr/proc.c           | 16 +++---
 10 files changed, 79 insertions(+), 48 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index b828cecd185..ffb775939ed 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -41,7 +41,7 @@ struct XidCache
 };
 
 /*
- * Flags for PGXACT->vacuumFlags
+ * Flags for ProcGlobal->vacuumFlags[]
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -168,6 +168,9 @@ struct PGPROC
 
 	bool		delayChkpt;		/* true if this proc delays checkpoint start */
 
+	uint8		vacuumFlags;    /* this backend's vacuum flags, see PROC_*
+								 * above. mirrored in
+								 * ProcGlobal->vacuumFlags[pgxactoff] */
 	/*
 	 * Info to allow us to wait for synchronous replication, if needed.
 	 * waitLSN is InvalidXLogRecPtr if not waiting; set only by user backend.
@@ -245,7 +248,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
 	uint8		nxids;
@@ -315,6 +317,12 @@ typedef struct PROC_HDR
 	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
 	TransactionId *xids;
 
+	/*
+	 * Array mirroring PGPROC.vacuumFlags for each PGPROC currently in the
+	 * procarray.
+	 */
+	uint8	   *vacuumFlags;
+
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d073eb07d23..3371ebd8896 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -466,7 +466,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
-	pgxact->vacuumFlags = 0;
+	proc->vacuumFlags = 0;
 	proc->pid = 0;
 	proc->backendId = InvalidBackendId;
 	proc->databaseId = databaseid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 34b71b6c1c5..2c1b956b76b 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -250,7 +250,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 	 * OK, let's do it.  First let other backends know I'm in ANALYZE.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags |= PROC_IN_ANALYZE;
+	MyProc->vacuumFlags |= PROC_IN_ANALYZE;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 	LWLockRelease(ProcArrayLock);
 	pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE,
 								  RelationGetRelid(onerel));
@@ -281,11 +282,12 @@ analyze_rel(Oid relid, RangeVar *relation,
 	pgstat_progress_end_command();
 
 	/*
-	 * Reset my PGXACT flag.  Note: we need this here, and not in vacuum_rel,
-	 * because the vacuum flag is cleared by the end-of-xact code.
+	 * Reset vacuumFlags we set early.  Note: we need this here, and not in
+	 * vacuum_rel, because the vacuum flag is cleared by the end-of-xact code.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags &= ~PROC_IN_ANALYZE;
+	MyProc->vacuumFlags &= ~PROC_IN_ANALYZE;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 648e12c78d8..aba13c31d1b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1728,9 +1728,10 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
+		MyProc->vacuumFlags |= PROC_IN_VACUUM;
 		if (params->is_wraparound)
-			MyPgXact->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+			MyProc->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 		LWLockRelease(ProcArrayLock);
 	}
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ac97e28be19..c6ec657a936 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2493,7 +2493,7 @@ do_autovacuum(void)
 						   tab->at_datname, tab->at_nspname, tab->at_relname);
 			EmitErrorReport();
 
-			/* this resets the PGXACT flags too */
+			/* this resets ProcGlobal->vacuumFlags[i] too */
 			AbortOutOfAnyTransaction();
 			FlushErrorState();
 			MemoryContextResetAndDeleteChildren(PortalContext);
@@ -2509,7 +2509,7 @@ do_autovacuum(void)
 
 		did_vacuum = true;
 
-		/* the PGXACT flags are reset at the next end of transaction */
+		/* ProcGlobal->vacuumFlags[i] are reset at the next end of xact */
 
 		/* be tidy */
 deleted:
@@ -2686,7 +2686,7 @@ perform_work_item(AutoVacuumWorkItem *workitem)
 				   cur_datname, cur_nspname, cur_relname);
 		EmitErrorReport();
 
-		/* this resets the PGXACT flags too */
+		/* this resets ProcGlobal->vacuumFlags[i] too */
 		AbortOutOfAnyTransaction();
 		FlushErrorState();
 		MemoryContextResetAndDeleteChildren(PortalContext);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0e..b416562ee2a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -163,7 +163,8 @@ StartupDecodingContext(List *output_plugin_options,
 	if (!IsTransactionOrTransactionBlock())
 	{
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyPgXact->vacuumFlags |= PROC_IN_LOGICAL_DECODING;
+		MyProc->vacuumFlags |= PROC_IN_LOGICAL_DECODING;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 		LWLockRelease(ProcArrayLock);
 	}
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 57bbb6288c6..ca46256f9d0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -520,7 +520,8 @@ ReplicationSlotRelease(void)
 
 	/* might not have been set when we've been a plain slot */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags &= ~PROC_IN_LOGICAL_DECODING;
+	MyProc->vacuumFlags &= ~PROC_IN_LOGICAL_DECODING;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index a9b32565367..e72a0705abb 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -474,9 +474,12 @@ ProcArrayAdd(PGPROC *proc)
 			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
 	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+	memmove(&ProcGlobal->vacuumFlags[index + 1], &ProcGlobal->vacuumFlags[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->vacuumFlags));
 
 	arrayP->pgprocnos[index] = proc->pgprocno;
 	ProcGlobal->xids[index] = proc->xid;
+	ProcGlobal->vacuumFlags[index] = proc->vacuumFlags;
 
 	arrayP->numProcs++;
 
@@ -537,6 +540,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	}
 
 	Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff] == 0));
+	ProcGlobal->vacuumFlags[proc->pgxactoff] = 0;
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
@@ -547,6 +551,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
 			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+			memmove(&ProcGlobal->vacuumFlags[index], &ProcGlobal->vacuumFlags[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->vacuumFlags));
 
 			arrayP->pgprocnos[arrayP->numProcs - 1] = -1;	/* for debugging */
 			arrayP->numProcs--;
@@ -625,14 +631,24 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		Assert(!TransactionIdIsValid(proc->xid));
 
 		proc->lxid = InvalidLocalTransactionId;
-		/* must be cleared with xid/xmin: */
-		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
 		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false;	/* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
 		Assert(pgxact->nxids == 0);
 		Assert(pgxact->overflowed == false);
+
+		/* must be cleared with xid/xmin: */
+		/* avoid unnecessarily dirtying shared cachelines */
+		if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
+		{
+			Assert(!LWLockHeldByMe(ProcArrayLock));
+			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+			Assert(proc->vacuumFlags == ProcGlobal->vacuumFlags[proc->pgxactoff]);
+			proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+			ProcGlobal->vacuumFlags[proc->pgxactoff] = proc->vacuumFlags;
+			LWLockRelease(ProcArrayLock);
+		}
 	}
 }
 
@@ -653,12 +669,18 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
 	proc->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	/* must be cleared with xid/xmin: */
-	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
 	proc->xmin = InvalidTransactionId;
 	proc->delayChkpt = false;	/* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
+	/* must be cleared with xid/xmin: */
+	/* avoid unnecessarily dirtying shared cachelines */
+	if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
+	{
+		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+		ProcGlobal->vacuumFlags[proc->pgxactoff] = proc->vacuumFlags;
+	}
+
 	/* Clear the subtransaction-XID cache too while holding the lock */
 	pgxact->nxids = 0;
 	pgxact->overflowed = false;
@@ -818,9 +840,8 @@ ProcArrayClearTransaction(PGPROC *proc)
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
-	/* redundant, but just in case */
-	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-	proc->delayChkpt = false;
+	Assert(!(proc->vacuumFlags & PROC_VACUUM_STATE_MASK));
+	Assert(!proc->delayChkpt);
 
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
@@ -1611,7 +1632,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 		TransactionId xid;
 		TransactionId xmin;
 
@@ -1628,10 +1649,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 */
 		xmin = TransactionIdOlder(xmin, xid);
 
-		/* if neither is set, this proc doesn't influence the horizon */
-		if (!TransactionIdIsValid(xmin))
-			continue;
-
 		/*
 		 * Don't ignore any procs when determining which transactions might be
 		 * considered running.  While slots should ensure logical decoding
@@ -1646,7 +1663,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * removed, as long as pg_subtrans is not truncated) or doing logical
 		 * decoding (which manages xmin separately, check below).
 		 */
-		if (pgxact->vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
+		if (vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
 			continue;
 
 		/* shared tables need to take backends in all database into account */
@@ -1984,6 +2001,7 @@ GetSnapshotData(Snapshot snapshot)
 		size_t		numProcs = arrayP->numProcs;
 		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
+		uint8	   *allVacuumFlags = ProcGlobal->vacuumFlags;
 
 		/*
 		 * First collect set of pgxactoff/xids that need to be included in the
@@ -1993,8 +2011,6 @@ GetSnapshotData(Snapshot snapshot)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
 			TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
-			int			pgprocno;
-			PGXACT	   *pgxact;
 			uint8		vacuumFlags;
 
 			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
@@ -2030,14 +2046,11 @@ GetSnapshotData(Snapshot snapshot)
 			if (!NormalTransactionIdPrecedes(xid, xmax))
 				continue;
 
-			pgprocno = pgprocnos[pgxactoff];
-			pgxact = &allPgXact[pgprocno];
-			vacuumFlags = pgxact->vacuumFlags;
-
 			/*
 			 * Skip over backends doing logical decoding which manages xmin
 			 * separately (check below) and ones running LAZY VACUUM.
 			 */
+			vacuumFlags = allVacuumFlags[pgxactoff];
 			if (vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
@@ -2064,6 +2077,9 @@ GetSnapshotData(Snapshot snapshot)
 			 */
 			if (!suboverflowed)
 			{
+				int			pgprocno = pgprocnos[pgxactoff];
+				PGXACT	   *pgxact = &allPgXact[pgprocno];
+
 				if (pgxact->overflowed)
 					suboverflowed = true;
 				else
@@ -2282,11 +2298,11 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		int			vacuumFlags = ProcGlobal->vacuumFlags[index];
 		TransactionId xid;
 
 		/* Ignore procs running LAZY VACUUM */
-		if (pgxact->vacuumFlags & PROC_IN_VACUUM)
+		if (vacuumFlags & PROC_IN_VACUUM)
 			continue;
 
 		/* We are only interested in the specific virtual transaction. */
@@ -2975,12 +2991,12 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 
 		if (proc == MyProc)
 			continue;
 
-		if (excludeVacuum & pgxact->vacuumFlags)
+		if (excludeVacuum & vacuumFlags)
 			continue;
 
 		if (allDbs || proc->databaseId == MyDatabaseId)
@@ -3395,7 +3411,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
 			PGPROC	   *proc = &allProcs[pgprocno];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
+			uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 
 			if (proc->databaseId != databaseId)
 				continue;
@@ -3409,7 +3425,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 			else
 			{
 				(*nbackends)++;
-				if ((pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				if ((vacuumFlags & PROC_IS_AUTOVACUUM) &&
 					nautovacs < MAXAUTOVACPIDS)
 					autovac_pids[nautovacs++] = proc->pid;
 			}
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index beedc7947db..e1246b8a4da 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -544,7 +544,6 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 {
 	PGPROC	   *proc;
 	LOCK	   *lock = checkProc->waitLock;
-	PGXACT	   *pgxact;
 	PROCLOCK   *proclock;
 	SHM_QUEUE  *procLocks;
 	LockMethod	lockMethodTable;
@@ -582,7 +581,6 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 		PGPROC	   *leader;
 
 		proc = proclock->tag.myProc;
-		pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
 		leader = proc->lockGroupLeader == NULL ? proc : proc->lockGroupLeader;
 
 		/* A proc never blocks itself or any other lock group member */
@@ -630,7 +628,7 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 					 * ProcArrayLock.
 					 */
 					if (checkProc == MyProc &&
-						pgxact->vacuumFlags & PROC_IS_AUTOVACUUM)
+						proc->vacuumFlags & PROC_IS_AUTOVACUUM)
 						blocking_autovacuum_proc = proc;
 
 					/* We're done looking at this proclock */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 7fad49544ce..f6113b2d243 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -114,6 +114,7 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
 
 	return size;
 }
@@ -223,6 +224,8 @@ InitProcGlobal(void)
 	ProcGlobal->xids =
 		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->vacuumFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
+	MemSet(ProcGlobal->vacuumFlags, 0, TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 
 	for (i = 0; i < TotalProcs; i++)
 	{
@@ -405,10 +408,10 @@ InitProcess(void)
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
 	MyProc->delayChkpt = false;
-	MyPgXact->vacuumFlags = 0;
+	MyProc->vacuumFlags = 0;
 	/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
 	if (IsAutoVacuumWorkerProcess())
-		MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
+		MyProc->vacuumFlags |= PROC_IS_AUTOVACUUM;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
 	MyProc->waitLock = NULL;
@@ -587,7 +590,7 @@ InitAuxiliaryProcess(void)
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
 	MyProc->delayChkpt = false;
-	MyPgXact->vacuumFlags = 0;
+	MyProc->vacuumFlags = 0;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
 	MyProc->waitLock = NULL;
@@ -1323,7 +1326,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		if (deadlock_state == DS_BLOCKED_BY_AUTOVACUUM && allow_autovacuum_cancel)
 		{
 			PGPROC	   *autovac = GetBlockingAutoVacuumPgproc();
-			PGXACT	   *autovac_pgxact = &ProcGlobal->allPgXact[autovac->pgprocno];
+			uint8		vacuumFlags;
 
 			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
@@ -1331,8 +1334,9 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 * Only do it if the worker is not working to protect against Xid
 			 * wraparound.
 			 */
-			if ((autovac_pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
-				!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
+			vacuumFlags = ProcGlobal->vacuumFlags[autovac->pgxactoff];
+			if ((vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				!(vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
 			{
 				int			pid = autovac->pid;
 				StringInfoData locktagbuf;
-- 
2.25.0.114.g5b0ca878e0
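
The commit message above says the patch takes care not to write to ProcGlobal->vacuumFlags unnecessarily. A minimal sketch of that test-before-write pattern follows. The names (DEMO_IS_AUTOVACUUM, DEMO_IN_VACUUM, DEMO_STATE_MASK, demo_clear_state, demo_shared_writes) are hypothetical, the flag set is reduced to two bits for illustration, and the locking the real code requires is left out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_IS_AUTOVACUUM	0x01	/* long-lived flag, kept across xacts */
#define DEMO_IN_VACUUM		0x02	/* per-transaction state flag */
#define DEMO_STATE_MASK		DEMO_IN_VACUUM

/* Counts stores to the shared array, only so the effect is observable. */
static int	demo_shared_writes = 0;

/*
 * Clear the per-transaction state bits at end of transaction.
 *
 * The shared array is stored to only when a state bit is actually set.  In
 * the common case (no vacuum-related state) nothing is written, so the
 * cache line holding shared_flags[off] is not dirtied and other CPUs
 * scanning the array keep their cached copy.
 */
static void
demo_clear_state(uint8_t *local_flags, uint8_t *shared_flags, int off)
{
	if (*local_flags & DEMO_STATE_MASK)
	{
		*local_flags &= (uint8_t) ~DEMO_STATE_MASK;
		shared_flags[off] = *local_flags;
		demo_shared_writes++;
	}
}
```

The unconditional form (`flags &= ~MASK`) produces the same value but always stores, which is what the removed pgxact->vacuumFlags lines in the hunks above did.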

v11-0005-snapshot-scalability-Move-subxact-info-to-ProcGl.patch (text/x-diff; charset=us-ascii)
From 8c58389951790366394c1ec82ac4ad1493e60030 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v11 5/6] snapshot scalability: Move subxact info to
 ProcGlobal, remove PGXACT.

Similar to the previous changes this increases the chance that data
frequently needed by GetSnapshotData() stays in L2 cache. In many
workloads subtransactions are very rare, and this makes the check for
that considerably cheaper.

As this removes the last member of PGXACT, there is no need to keep it
around anymore.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h            |  34 ++++---
 src/backend/access/transam/clog.c     |   7 +-
 src/backend/access/transam/twophase.c |  17 ++--
 src/backend/access/transam/varsup.c   |  15 ++-
 src/backend/storage/ipc/procarray.c   | 128 ++++++++++++++------------
 src/backend/storage/lmgr/proc.c       |  24 +----
 src/tools/pgindent/typedefs.list      |   1 -
 7 files changed, 113 insertions(+), 113 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ffb775939ed..36fe5253a15 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -35,6 +35,14 @@
  */
 #define PGPROC_MAX_CACHED_SUBXIDS 64	/* XXX guessed-at value */
 
+typedef struct XidCacheStatus
+{
+	/* number of cached subxids, never more than PGPROC_MAX_CACHED_SUBXIDS */
+	uint8	count;
+	/* has PGPROC->subxids overflowed */
+	bool	overflowed;
+} XidCacheStatus;
+
 struct XidCache
 {
 	TransactionId xids[PGPROC_MAX_CACHED_SUBXIDS];
@@ -188,6 +196,8 @@ struct PGPROC
 	 */
 	SHM_QUEUE	myProcLocks[NUM_LOCK_PARTITIONS];
 
+	XidCacheStatus subxidStatus; /* mirrored with
+								  * ProcGlobal->subxidStates[i] */
 	struct XidCache subxids;	/* cache for subtransaction XIDs */
 
 	/* Support for group XID clearing. */
@@ -236,22 +246,6 @@ struct PGPROC
 
 
 extern PGDLLIMPORT PGPROC *MyProc;
-extern PGDLLIMPORT struct PGXACT *MyPgXact;
-
-/*
- * Prior to PostgreSQL 9.2, the fields below were stored as part of the
- * PGPROC.  However, benchmarking revealed that packing these particular
- * members into a separate array as tightly as possible sped up GetSnapshotData
- * considerably on systems with many CPU cores, by reducing the number of
- * cache lines needing to be fetched.  Thus, think very carefully before adding
- * anything else here.
- */
-typedef struct PGXACT
-{
-	bool		overflowed;
-
-	uint8		nxids;
-} PGXACT;
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
@@ -311,12 +305,16 @@ typedef struct PROC_HDR
 {
 	/* Array of PGPROC structures (not including dummies for prepared txns) */
 	PGPROC	   *allProcs;
-	/* Array of PGXACT structures (not including dummies for prepared txns) */
-	PGXACT	   *allPgXact;
 
 	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
 	TransactionId *xids;
 
+	/*
+	 * Array mirroring PGPROC.subxidStatus for each PGPROC currently in the
+	 * procarray.
+	 */
+	XidCacheStatus *subxidStates;
+
 	/*
 	 * Array mirroring PGPROC.vacuumFlags for each PGPROC currently in the
 	 * procarray.
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 5198a0cef68..a3095ced3fb 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -295,7 +295,7 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 	 */
 	if (all_xact_same_page && xid == MyProc->xid &&
 		nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT &&
-		nsubxids == MyPgXact->nxids &&
+		nsubxids == MyProc->subxidStatus.count &&
 		memcmp(subxids, MyProc->subxids.xids,
 			   nsubxids * sizeof(TransactionId)) == 0)
 	{
@@ -510,16 +510,15 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 	while (nextidx != INVALID_PGPROCNO)
 	{
 		PGPROC	   *proc = &ProcGlobal->allProcs[nextidx];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[nextidx];
 
 		/*
 		 * Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
 		 * should not use group XID status update mechanism.
 		 */
-		Assert(pgxact->nxids <= THRESHOLD_SUBTRANS_CLOG_OPT);
+		Assert(proc->subxidStatus.count <= THRESHOLD_SUBTRANS_CLOG_OPT);
 
 		TransactionIdSetPageStatusInternal(proc->clogGroupMemberXid,
-										   pgxact->nxids,
+										   proc->subxidStatus.count,
 										   proc->subxids.xids,
 										   proc->clogGroupMemberXidStatus,
 										   proc->clogGroupMemberLsn,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 3371ebd8896..2642eaf99de 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -21,9 +21,9 @@
  *		GIDs and aborts the transaction if there already is a global
  *		transaction in prepared state with the same GID.
  *
- *		A global transaction (gxact) also has dummy PGXACT and PGPROC; this is
- *		what keeps the XID considered running by TransactionIdIsInProgress.
- *		It is also convenient as a PGPROC to hook the gxact's locks to.
+ *		A global transaction (gxact) also has dummy PGPROC; this is what keeps
+ *		the XID considered running by TransactionIdIsInProgress.  It is also
+ *		convenient as a PGPROC to hook the gxact's locks to.
  *
  *		Information to recover prepared transactions in case of crash is
  *		now stored in WAL for the common case. In some cases there will be
@@ -447,14 +447,12 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 					TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	PGPROC	   *proc;
-	PGXACT	   *pgxact;
 	int			i;
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
 	Assert(gxact != NULL);
 	proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
 	/* Initialize the PGPROC entry */
 	MemSet(proc, 0, sizeof(PGPROC));
@@ -480,8 +478,8 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
 		SHMQueueInit(&(proc->myProcLocks[i]));
 	/* subxid data must be filled later by GXactLoadSubxactData */
-	pgxact->overflowed = false;
-	pgxact->nxids = 0;
+	proc->subxidStatus.count = 0;
+	proc->subxidStatus.overflowed = 0;
 
 	gxact->prepared_at = prepared_at;
 	gxact->xid = xid;
@@ -510,19 +508,18 @@ GXactLoadSubxactData(GlobalTransaction gxact, int nsubxacts,
 					 TransactionId *children)
 {
 	PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
 	/* We need no extra lock since the GXACT isn't valid yet */
 	if (nsubxacts > PGPROC_MAX_CACHED_SUBXIDS)
 	{
-		pgxact->overflowed = true;
+		proc->subxidStatus.overflowed = true;
 		nsubxacts = PGPROC_MAX_CACHED_SUBXIDS;
 	}
 	if (nsubxacts > 0)
 	{
 		memcpy(proc->subxids.xids, children,
 			   nsubxacts * sizeof(TransactionId));
-		pgxact->nxids = nsubxacts;
+		proc->subxidStatus.count = nsubxacts;
 	}
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index ab376f2fe22..97471ffe488 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -222,22 +222,31 @@ GetNewTransactionId(bool isSubXact)
 	 */
 	if (!isSubXact)
 	{
+		Assert(ProcGlobal->subxidStates[MyProc->pgxactoff].count == 0);
+		Assert(!ProcGlobal->subxidStates[MyProc->pgxactoff].overflowed);
+		Assert(MyProc->subxidStatus.count == 0);
+		Assert(!MyProc->subxidStatus.overflowed);
+
 		/* LWLockRelease acts as barrier */
 		MyProc->xid = xid;
 		ProcGlobal->xids[MyProc->pgxactoff] = xid;
 	}
 	else
 	{
-		int			nxids = MyPgXact->nxids;
+		XidCacheStatus *substat = &ProcGlobal->subxidStates[MyProc->pgxactoff];
+		int			nxids = MyProc->subxidStatus.count;
+
+		Assert(substat->count == MyProc->subxidStatus.count);
+		Assert(substat->overflowed == MyProc->subxidStatus.overflowed);
 
 		if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
 		{
 			MyProc->subxids.xids[nxids] = xid;
 			pg_write_barrier();
-			MyPgXact->nxids = nxids + 1;
+			MyProc->subxidStatus.count = substat->count = nxids + 1;
 		}
 		else
-			MyPgXact->overflowed = true;
+			MyProc->subxidStatus.overflowed = substat->overflowed = true;
 	}
 
 	LWLockRelease(XidGenLock);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e72a0705abb..bf3e5b65dc7 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4,9 +4,10 @@
  *	  POSTGRES process array code.
  *
  *
- * This module maintains arrays of the PGPROC and PGXACT structures for all
- * active backends.  Although there are several uses for this, the principal
- * one is as a means of determining the set of currently running transactions.
+ * This module maintains arrays of PGPROC substructures, as well as associated
+ * arrays in ProcGlobal, for all active backends.  Although there are several
+ * uses for this, the principal one is as a means of determining the set of
+ * currently running transactions.
  *
  * Because of various subtle race conditions it is critical that a backend
  * hold the correct locks while setting or clearing its xid (in
@@ -85,7 +86,7 @@ typedef struct ProcArrayStruct
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
 	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGXACT
+	 * similar reasons to tracking overflowing cached subxids in PGPROC
 	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
 	 * lock to read it.
 	 */
@@ -96,7 +97,7 @@ typedef struct ProcArrayStruct
 	/* oldest catalog xmin of any replication slot */
 	TransactionId replication_slot_catalog_xmin;
 
-	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
+	/* indexes into allProcs[], has PROCARRAY_MAXPROCS entries */
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
@@ -239,7 +240,6 @@ typedef struct ComputeXidHorizonsResult
 static ProcArrayStruct *procArray;
 
 static PGPROC *allProcs;
-static PGXACT *allPgXact;
 
 /*
  * Bookkeeping for tracking emulated transactions in recovery
@@ -325,8 +325,7 @@ static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
 static TransactionId KnownAssignedXidsGetOldestXmin(void);
 static void KnownAssignedXidsDisplay(int trace_level);
 static void KnownAssignedXidsReset(void);
-static inline void ProcArrayEndTransactionInternal(PGPROC *proc,
-												   PGXACT *pgxact, TransactionId latestXid);
+static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
 static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
@@ -409,7 +408,6 @@ CreateSharedProcArray(void)
 	}
 
 	allProcs = ProcGlobal->allProcs;
-	allPgXact = ProcGlobal->allPgXact;
 
 	/* Create or attach to the KnownAssignedXids arrays too, if needed */
 	if (EnableHotStandby)
@@ -474,11 +472,14 @@ ProcArrayAdd(PGPROC *proc)
 			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
 	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+	memmove(&ProcGlobal->subxidStates[index + 1], &ProcGlobal->subxidStates[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->subxidStates));
 	memmove(&ProcGlobal->vacuumFlags[index + 1], &ProcGlobal->vacuumFlags[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->vacuumFlags));
 
 	arrayP->pgprocnos[index] = proc->pgprocno;
 	ProcGlobal->xids[index] = proc->xid;
+	ProcGlobal->subxidStates[index] = proc->subxidStatus;
 	ProcGlobal->vacuumFlags[index] = proc->vacuumFlags;
 
 	arrayP->numProcs++;
@@ -532,6 +533,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		MaintainLatestCompletedXid(latestXid);
 
 		ProcGlobal->xids[proc->pgxactoff] = 0;
+		ProcGlobal->subxidStates[proc->pgxactoff].overflowed = false;
+		ProcGlobal->subxidStates[proc->pgxactoff].count = 0;
 	}
 	else
 	{
@@ -540,6 +543,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	}
 
 	Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff] == 0));
+	Assert(TransactionIdIsValid(ProcGlobal->subxidStates[proc->pgxactoff].count == 0));
+	Assert(TransactionIdIsValid(ProcGlobal->subxidStates[proc->pgxactoff].overflowed == false));
 	ProcGlobal->vacuumFlags[proc->pgxactoff] = 0;
 
 	for (index = 0; index < arrayP->numProcs; index++)
@@ -551,6 +556,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
 			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+			memmove(&ProcGlobal->subxidStates[index], &ProcGlobal->subxidStates[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->subxidStates));
 			memmove(&ProcGlobal->vacuumFlags[index], &ProcGlobal->vacuumFlags[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->vacuumFlags));
 
@@ -596,8 +603,6 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 {
-	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
-
 	if (TransactionIdIsValid(latestXid))
 	{
 		/*
@@ -615,7 +620,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 */
 		if (LWLockConditionalAcquire(ProcArrayLock, LW_EXCLUSIVE))
 		{
-			ProcArrayEndTransactionInternal(proc, pgxact, latestXid);
+			ProcArrayEndTransactionInternal(proc, latestXid);
 			LWLockRelease(ProcArrayLock);
 		}
 		else
@@ -629,15 +634,14 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * estimate of global xmin, but that's OK.
 		 */
 		Assert(!TransactionIdIsValid(proc->xid));
+		Assert(proc->subxidStatus.count == 0);
+		Assert(!proc->subxidStatus.overflowed);
 
 		proc->lxid = InvalidLocalTransactionId;
 		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false;	/* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
-		Assert(pgxact->nxids == 0);
-		Assert(pgxact->overflowed == false);
-
 		/* must be cleared with xid/xmin: */
 		/* avoid unnecessarily dirtying shared cachelines */
 		if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
@@ -658,8 +662,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
  * We don't do any locking here; caller must handle that.
  */
 static inline void
-ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
-								TransactionId latestXid)
+ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
@@ -682,8 +685,15 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	}
 
 	/* Clear the subtransaction-XID cache too while holding the lock */
-	pgxact->nxids = 0;
-	pgxact->overflowed = false;
+	Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatus.count &&
+		   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatus.overflowed);
+	if (proc->subxidStatus.count > 0 || proc->subxidStatus.overflowed)
+	{
+		ProcGlobal->subxidStates[pgxactoff].count = 0;
+		ProcGlobal->subxidStates[pgxactoff].overflowed = false;
+		proc->subxidStatus.count = 0;
+		proc->subxidStatus.overflowed = false;
+	}
 
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
@@ -773,9 +783,8 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	while (nextidx != INVALID_PGPROCNO)
 	{
 		PGPROC	   *proc = &allProcs[nextidx];
-		PGXACT	   *pgxact = &allPgXact[nextidx];
 
-		ProcArrayEndTransactionInternal(proc, pgxact, proc->procArrayGroupMemberXid);
+		ProcArrayEndTransactionInternal(proc, proc->procArrayGroupMemberXid);
 
 		/* Move to next proc in list. */
 		nextidx = pg_atomic_read_u32(&proc->procArrayGroupNext);
@@ -819,7 +828,6 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
-	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
 	size_t		pgxactoff;
 
 	/*
@@ -844,8 +852,15 @@ ProcArrayClearTransaction(PGPROC *proc)
 	Assert(!proc->delayChkpt);
 
 	/* Clear the subtransaction-XID cache too */
-	pgxact->nxids = 0;
-	pgxact->overflowed = false;
+	Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatus.count &&
+		   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatus.overflowed);
+	if (proc->subxidStatus.count > 0 || proc->subxidStatus.overflowed)
+	{
+		ProcGlobal->subxidStates[pgxactoff].count = 0;
+		ProcGlobal->subxidStates[pgxactoff].overflowed = false;
+		proc->subxidStatus.count = 0;
+		proc->subxidStatus.overflowed = false;
+	}
 
 	LWLockRelease(ProcArrayLock);
 }
@@ -1265,6 +1280,7 @@ TransactionIdIsInProgress(TransactionId xid)
 {
 	static TransactionId *xids = NULL;
 	static TransactionId *other_xids;
+	XidCacheStatus *other_subxidstates;
 	int			nxids = 0;
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId topxid;
@@ -1326,6 +1342,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	}
 
 	other_xids = ProcGlobal->xids;
+	other_subxidstates = ProcGlobal->subxidStates;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -1347,7 +1364,6 @@ TransactionIdIsInProgress(TransactionId xid)
 	for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 	{
 		int			pgprocno;
-		PGXACT	   *pgxact;
 		PGPROC	   *proc;
 		TransactionId pxid;
 		int			pxids;
@@ -1382,9 +1398,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
-		pgprocno = arrayP->pgprocnos[pgxactoff];
-		pgxact = &allPgXact[pgprocno];
-		pxids = pgxact->nxids;
+		pxids = other_subxidstates[pgxactoff].count;
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
 		pgprocno = arrayP->pgprocnos[pgxactoff];
 		proc = &allProcs[pgprocno];
@@ -1408,7 +1422,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		 * we hold ProcArrayLock.  So we can't miss an Xid that we need to
 		 * worry about.)
 		 */
-		if (pgxact->overflowed)
+		if (other_subxidstates[pgxactoff].overflowed)
 			xids[nxids++] = pxid;
 	}
 
@@ -2001,6 +2015,7 @@ GetSnapshotData(Snapshot snapshot)
 		size_t		numProcs = arrayP->numProcs;
 		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
+		XidCacheStatus *subxidStates = ProcGlobal->subxidStates;
 		uint8	   *allVacuumFlags = ProcGlobal->vacuumFlags;
 
 		/*
@@ -2077,17 +2092,16 @@ GetSnapshotData(Snapshot snapshot)
 			 */
 			if (!suboverflowed)
 			{
-				int			pgprocno = pgprocnos[pgxactoff];
-				PGXACT	   *pgxact = &allPgXact[pgprocno];
 
-				if (pgxact->overflowed)
+				if (subxidStates[pgxactoff].overflowed)
 					suboverflowed = true;
 				else
 				{
-					int			nsubxids = pgxact->nxids;
+					int			nsubxids = subxidStates[pgxactoff].count;
 
 					if (nsubxids > 0)
 					{
+						int			pgprocno = pgprocnos[pgxactoff];
 						PGPROC	   *proc = &allProcs[pgprocno];
 
 						pg_read_barrier();	/* pairs with GetNewTransactionId */
@@ -2479,8 +2493,6 @@ GetRunningTransactionData(void)
 	 */
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		int			pgprocno = arrayP->pgprocnos[index];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2501,7 +2513,7 @@ GetRunningTransactionData(void)
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
-		if (pgxact->overflowed)
+		if (ProcGlobal->subxidStates[index].overflowed)
 			suboverflowed = true;
 
 		/*
@@ -2521,27 +2533,28 @@ GetRunningTransactionData(void)
 	 */
 	if (!suboverflowed)
 	{
+		XidCacheStatus *other_subxidstates = ProcGlobal->subxidStates;
+
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
 			PGPROC	   *proc = &allProcs[pgprocno];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			int			nxids;
+			int			nsubxids;
 
 			/*
 			 * Save subtransaction XIDs. Other backends can't add or remove
 			 * entries while we're holding XidGenLock.
 			 */
-			nxids = pgxact->nxids;
-			if (nxids > 0)
+			nsubxids = other_subxidstates[index].count;
+			if (nsubxids > 0)
 			{
 				/* barrier not really required, as XidGenLock is held, but ... */
 				pg_read_barrier();	/* pairs with GetNewTransactionId */
 
 				memcpy(&xids[count], (void *) proc->subxids.xids,
-					   nxids * sizeof(TransactionId));
-				count += nxids;
-				subcount += nxids;
+					   nsubxids * sizeof(TransactionId));
+				count += nsubxids;
+				subcount += nsubxids;
 
 				/*
 				 * Top-level XID of a transaction is always less than any of
@@ -3608,14 +3621,6 @@ ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
 	LWLockRelease(ProcArrayLock);
 }
 
-
-#define XidCacheRemove(i) \
-	do { \
-		MyProc->subxids.xids[i] = MyProc->subxids.xids[MyPgXact->nxids - 1]; \
-		pg_write_barrier(); \
-		MyPgXact->nxids--; \
-	} while (0)
-
 /*
  * XidCacheRemoveRunningXids
  *
@@ -3631,6 +3636,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 {
 	int			i,
 				j;
+	XidCacheStatus *mysubxidstat;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -3648,6 +3654,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
+	mysubxidstat = &ProcGlobal->subxidStates[MyProc->pgxactoff];
+
 	/*
 	 * Under normal circumstances xid and xids[] will be in increasing order,
 	 * as will be the entries in subxids.  Scan backwards to avoid O(N^2)
@@ -3657,11 +3665,14 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	{
 		TransactionId anxid = xids[i];
 
-		for (j = MyPgXact->nxids - 1; j >= 0; j--)
+		for (j = MyProc->subxidStatus.count - 1; j >= 0; j--)
 		{
 			if (TransactionIdEquals(MyProc->subxids.xids[j], anxid))
 			{
-				XidCacheRemove(j);
+				MyProc->subxids.xids[j] = MyProc->subxids.xids[MyProc->subxidStatus.count - 1];
+				pg_write_barrier();
+				mysubxidstat->count--;
+				MyProc->subxidStatus.count--;
 				break;
 			}
 		}
@@ -3673,20 +3684,23 @@ XidCacheRemoveRunningXids(TransactionId xid,
 		 * error during AbortSubTransaction.  So instead of Assert, emit a
 		 * debug warning.
 		 */
-		if (j < 0 && !MyPgXact->overflowed)
+		if (j < 0 && !MyProc->subxidStatus.overflowed)
 			elog(WARNING, "did not find subXID %u in MyProc", anxid);
 	}
 
-	for (j = MyPgXact->nxids - 1; j >= 0; j--)
+	for (j = MyProc->subxidStatus.count - 1; j >= 0; j--)
 	{
 		if (TransactionIdEquals(MyProc->subxids.xids[j], xid))
 		{
-			XidCacheRemove(j);
+			MyProc->subxids.xids[j] = MyProc->subxids.xids[MyProc->subxidStatus.count - 1];
+			pg_write_barrier();
+			mysubxidstat->count--;
+			MyProc->subxidStatus.count--;
 			break;
 		}
 	}
 	/* Ordinarily we should have found it, unless the cache has overflowed */
-	if (j < 0 && !MyPgXact->overflowed)
+	if (j < 0 && !MyProc->subxidStatus.overflowed)
 		elog(WARNING, "did not find subXID %u in MyProc", xid);
 
 	/* Also advance global latestCompletedXid while holding the lock */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f6113b2d243..aa9fbd80545 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -63,9 +63,8 @@ int			LockTimeout = 0;
 int			IdleInTransactionSessionTimeout = 0;
 bool		log_lock_waits = false;
 
-/* Pointer to this process's PGPROC and PGXACT structs, if any */
+/* Pointer to this process's PGPROC struct, if any */
 PGPROC	   *MyProc = NULL;
-PGXACT	   *MyPgXact = NULL;
 
 /*
  * This spinlock protects the freelist of recycled PGPROC structures.
@@ -110,10 +109,8 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, sizeof(slock_t));
 
-	size = add_size(size, mul_size(MaxBackends, sizeof(PGXACT)));
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
-	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
 
 	return size;
@@ -161,7 +158,6 @@ void
 InitProcGlobal(void)
 {
 	PGPROC	   *procs;
-	PGXACT	   *pgxacts;
 	int			i,
 				j;
 	bool		found;
@@ -202,18 +198,6 @@ InitProcGlobal(void)
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
 	ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
 
-	/*
-	 * Also allocate a separate array of PGXACT structures.  This is separate
-	 * from the main PGPROC array so that the most heavily accessed data is
-	 * stored contiguously in memory in as few cache lines as possible. This
-	 * provides significant performance benefits, especially on a
-	 * multiprocessor system.  There is one PGXACT structure for every PGPROC
-	 * structure.
-	 */
-	pgxacts = (PGXACT *) ShmemAlloc(TotalProcs * sizeof(PGXACT));
-	MemSet(pgxacts, 0, TotalProcs * sizeof(PGXACT));
-	ProcGlobal->allPgXact = pgxacts;
-
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
@@ -224,6 +208,8 @@ InitProcGlobal(void)
 	ProcGlobal->xids =
 		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
 	ProcGlobal->vacuumFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 	MemSet(ProcGlobal->vacuumFlags, 0, TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 
@@ -372,7 +358,6 @@ InitProcess(void)
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
 	}
-	MyPgXact = &ProcGlobal->allPgXact[MyProc->pgprocno];
 
 	/*
 	 * Cross-check that the PGPROC is of the type we expect; if this were not
@@ -569,7 +554,6 @@ InitAuxiliaryProcess(void)
 	((volatile PGPROC *) auxproc)->pid = MyProcPid;
 
 	MyProc = auxproc;
-	MyPgXact = &ProcGlobal->allPgXact[auxproc->pgprocno];
 
 	SpinLockRelease(ProcStructLock);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b4948ac675f..3d990463ce9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1536,7 +1536,6 @@ PGSetenvStatusType
 PGShmemHeader
 PGTransactionStatusType
 PGVerbosity
-PGXACT
 PG_Locale_Strategy
 PG_Lock_Status
 PG_init_t
-- 
2.25.0.114.g5b0ca878e0
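
The patch above keeps each backend's subxid cache status in two places: in its own PGPROC (subxidStatus) and in a dense ProcGlobal->subxidStates array indexed by pgxactoff. A minimal, single-threaded sketch of that mirroring idea follows. All Demo* and demo_* names are hypothetical stand-ins rather than PostgreSQL APIs, the cache size is shrunk for illustration, and locking and memory barriers are deliberately omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_CACHED_SUBXIDS 4	/* hypothetical, far smaller than the real cache */

/* Hypothetical stand-in for XidCacheStatus. */
typedef struct DemoXidCacheStatus
{
	uint8_t		count;
	bool		overflowed;
} DemoXidCacheStatus;

/* Hypothetical stand-in for the relevant parts of PGPROC. */
typedef struct DemoProc
{
	int			off;			/* this backend's slot in the dense array */
	DemoXidCacheStatus status;	/* backend-local copy of the status */
	uint32_t	subxids[DEMO_MAX_CACHED_SUBXIDS];
} DemoProc;

/* Record a new subxid, keeping the local copy and the dense mirror in sync. */
static void
demo_record_subxid(DemoProc *proc, DemoXidCacheStatus *dense, uint32_t xid)
{
	DemoXidCacheStatus *mirror = &dense[proc->off];
	uint8_t		n = proc->status.count;

	assert(mirror->count == n);
	if (n < DEMO_MAX_CACHED_SUBXIDS)
	{
		proc->subxids[n] = xid;
		/* the patch issues pg_write_barrier() here; omitted in this sketch */
		proc->status.count = mirror->count = (uint8_t) (n + 1);
	}
	else
		proc->status.overflowed = mirror->overflowed = true;
}

/*
 * Collect cached subxids by scanning only the dense array; a DemoProc is
 * touched only when its count is nonzero.  Assumes procs[i].off == i.
 * Sets *overflowed if any scanned entry has overflowed its cache.
 */
static int
demo_collect_subxids(const DemoProc *procs, const DemoXidCacheStatus *dense,
					 int nprocs, uint32_t *out, bool *overflowed)
{
	int			total = 0;

	for (int i = 0; i < nprocs; i++)
	{
		if (dense[i].overflowed)
		{
			*overflowed = true;
			continue;
		}
		for (int j = 0; j < dense[i].count; j++)
			out[total++] = procs[i].subxids[j];
	}
	return total;
}
```

The point of the layout is that a reader scanning all backends walks one small contiguous array and only follows the indirection into the per-backend struct when there is something to copy.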

v11-0006-snapshot-scalability-cache-snapshots-using-a-xac.patch (text/x-diff; charset=us-ascii)
From 1ad313d240fca83e4fd9f21d595792aaa3bbdade Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v11 6/6] snapshot scalability: cache snapshots using a xact
 completion counter.

Previous commits made it faster/more scalable to compute snapshots. But not
building a snapshot is still faster. Now that GetSnapshotData() does not
maintain RecentGlobal* anymore, that is actually not too hard:

This commit introduces xactCompletionCount, which tracks the number of
top-level transactions with xids (i.e. which may have modified the database)
that completed in some form since the start of the server.

We can avoid rebuilding the snapshot's contents whenever the current
xactCompletionCount is the same as it was when the snapshot was
originally built.  Currently this check happens while holding
ProcArrayLock. While it's likely possible to perform the check before
acquiring ProcArrayLock, it's too complicated for now.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/access/transam.h                |   9 ++
 src/include/utils/snapshot.h                |   7 ++
 src/backend/replication/logical/snapbuild.c |   1 +
 src/backend/storage/ipc/procarray.c         | 125 ++++++++++++++++----
 src/backend/utils/time/snapmgr.c            |   4 +
 5 files changed, 126 insertions(+), 20 deletions(-)

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 6ec84b54599..bc8d2b00bb7 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -231,6 +231,15 @@ typedef struct VariableCacheData
 	FullTransactionId latestCompletedFullXid;	/* newest full XID that has
 												 * committed or aborted */
 
+	/*
+	 * Number of top-level transactions with xids (i.e. which may have
+	 * modified the database) that completed in some form since the start of
+	 * the server. This currently is solely used to check whether
+	 * GetSnapshotData() needs to recompute the contents of the snapshot, or
+	 * not. There are likely other users of this.  Always above 0.
+	 */
+	uint64 xactCompletionCount;
+
 	/*
 	 * These fields are protected by XactTruncationLock
 	 */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 35b1f05bea6..dea072e5edf 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -207,6 +207,13 @@ typedef struct SnapshotData
 
 	TimestampTz whenTaken;		/* timestamp when snapshot was taken */
 	XLogRecPtr	lsn;			/* position in the WAL stream when taken */
+
+	/*
+	 * The transaction completion count at the time GetSnapshotData() built
+	 * this snapshot. Allows us to avoid re-computing static snapshots when no
+	 * transactions completed since the last GetSnapshotData().
+	 */
+	uint64		snapXactCompletionCount;
 } SnapshotData;
 
 #endif							/* SNAPSHOT_H */
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e9701ea7221..9d5d68f3fa7 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -524,6 +524,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->curcid = FirstCommandId;
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
+	snapshot->snapXactCompletionCount = 0;
 
 	return snapshot;
 }
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bf3e5b65dc7..cc88111a904 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -405,6 +405,7 @@ CreateSharedProcArray(void)
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+		ShmemVariableCache->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
@@ -532,6 +533,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		/* Advance global latestCompletedXid while holding the lock */
 		MaintainLatestCompletedXid(latestXid);
 
+		/* Same with xactCompletionCount  */
+		ShmemVariableCache->xactCompletionCount++;
+
 		ProcGlobal->xids[proc->pgxactoff] = 0;
 		ProcGlobal->subxidStates[proc->pgxactoff].overflowed = false;
 		ProcGlobal->subxidStates[proc->pgxactoff].count = 0;
@@ -666,6 +670,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
+	Assert(LWLockHeldByMe(ProcArrayLock));
 	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
 	Assert(ProcGlobal->xids[pgxactoff] == proc->xid);
 
@@ -697,6 +702,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
+
+	/* Same with xactCompletionCount  */
+	ShmemVariableCache->xactCompletionCount++;
 }
 
 /*
@@ -1897,6 +1905,93 @@ GetMaxSnapshotSubxidCount(void)
 	return TOTAL_MAX_CACHED_SUBXIDS;
 }
 
+/*
+ * Initialize old_snapshot_threshold specific parts of a newly built snapshot.
+ */
+static void
+GetSnapshotDataInitOldSnapshot(Snapshot snapshot)
+{
+	if (!OldSnapshotThresholdActive())
+	{
+		/*
+		 * If not using "snapshot too old" feature, fill related fields with
+		 * dummy values that don't require any locking.
+		 */
+		snapshot->lsn = InvalidXLogRecPtr;
+		snapshot->whenTaken = 0;
+	}
+	else
+	{
+		/*
+		 * Capture the current time and WAL stream location in case this
+		 * snapshot becomes old enough to need to fall back on the special
+		 * "old snapshot" logic.
+		 */
+		snapshot->lsn = GetXLogInsertRecPtr();
+		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
+		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, snapshot->xmin);
+	}
+}
+
+/*
+ * Helper function for GetSnapshotData() that checks if the bulk of the
+ * visibility information in the snapshot is still valid. If so, it updates
+ * the fields that need to change and returns true. Otherwise it returns
+ * false.
+ *
+ * This very likely can be evolved to not need ProcArrayLock held (at very
+ * least in the case we already hold a snapshot), but that's for another day.
+ */
+static bool
+GetSnapshotDataReuse(Snapshot snapshot)
+{
+	uint64 curXactCompletionCount;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+
+	if (unlikely(snapshot->snapXactCompletionCount == 0))
+		return false;
+
+	curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
+	if (curXactCompletionCount != snapshot->snapXactCompletionCount)
+		return false;
+
+	/*
+	 * If the current xactCompletionCount is still the same as it was at the
+	 * time the snapshot was built, we can be sure that rebuilding the
+	 * contents of the snapshot the hard way would result in the same snapshot
+	 * contents:
+	 *
+	 * As explained in transam/README, the set of xids considered running by
+	 * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
+	 * contents only depend on transactions with xids and xactCompletionCount
+	 * is incremented whenever a transaction with an xid finishes (while
+	 * holding ProcArrayLock exclusively). Thus the xactCompletionCount check
+	 * ensures we would detect if the snapshot would have changed.
+	 *
+	 * As the snapshot contents are the same as they were before, it is safe
+	 * to re-enter the snapshot's xmin into the PGPROC array. None of the rows
+	 * visible under the snapshot could already have been removed (that'd
+	 * require the set of running transactions to change) and it fulfills the
+	 * requirement that concurrent GetSnapshotData() calls yield the same
+	 * xmin.
+	 */
+	if (!TransactionIdIsValid(MyProc->xmin))
+		MyProc->xmin = TransactionXmin = snapshot->xmin;
+
+	RecentXmin = snapshot->xmin;
+	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
+
+	snapshot->curcid = GetCurrentCommandId(false);
+	snapshot->active_count = 0;
+	snapshot->regd_count = 0;
+	snapshot->copied = false;
+
+	GetSnapshotDataInitOldSnapshot(snapshot);
+
+	return true;
+}
+
 /*
  * GetSnapshotData -- returns information about running transactions.
  *
@@ -1945,6 +2040,7 @@ GetSnapshotData(Snapshot snapshot)
 	TransactionId oldestxid;
 	int			mypgxactoff;
 	TransactionId myxid;
+	uint64		curXactCompletionCount;
 
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
@@ -1989,12 +2085,19 @@ GetSnapshotData(Snapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
+	if (GetSnapshotDataReuse(snapshot))
+	{
+		LWLockRelease(ProcArrayLock);
+		return snapshot;
+	}
+
 	latest_completed = ShmemVariableCache->latestCompletedFullXid;
 	mypgxactoff = MyProc->pgxactoff;
 	myxid = other_xids[mypgxactoff];
 	Assert(myxid == MyProc->xid);
 
 	oldestxid = ShmemVariableCache->oldestXid;
+	curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
 
 	/* xmax is always latestCompletedXid + 1 */
 	xmax = XidFromFullTransactionId(latest_completed);
@@ -2248,6 +2351,7 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->xcnt = count;
 	snapshot->subxcnt = subcount;
 	snapshot->suboverflowed = suboverflowed;
+	snapshot->snapXactCompletionCount = curXactCompletionCount;
 
 	snapshot->curcid = GetCurrentCommandId(false);
 
@@ -2259,26 +2363,7 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
 
-	if (old_snapshot_threshold < 0)
-	{
-		/*
-		 * If not using "snapshot too old" feature, fill related fields with
-		 * dummy values that don't require any locking.
-		 */
-		snapshot->lsn = InvalidXLogRecPtr;
-		snapshot->whenTaken = 0;
-	}
-	else
-	{
-		/*
-		 * Capture the current time and WAL stream location in case this
-		 * snapshot becomes old enough to need to fall back on the special
-		 * "old snapshot" logic.
-		 */
-		snapshot->lsn = GetXLogInsertRecPtr();
-		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
-		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
-	}
+	GetSnapshotDataInitOldSnapshot(snapshot);
 
 	return snapshot;
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 689a3b6a597..09ea03c2063 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -595,6 +595,8 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
 	/* NB: curcid should NOT be copied, it's a local matter */
 
+	CurrentSnapshot->snapXactCompletionCount = 0;
+
 	/*
 	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
@@ -670,6 +672,7 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->snapXactCompletionCount = 0;
 
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
@@ -2207,6 +2210,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
-- 
2.25.0.114.g5b0ca878e0

#57Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#56)
Re: Improving connection scalability: GetSnapshotData()

On 2020-Jul-15, Andres Freund wrote:

It could make sense to split the conversion of
VariableCacheData->latestCompletedXid to FullTransactionId out from 0001
into its own commit. Not sure...

+1, the commit is large enough and that change can be had in advance.

Note you forward-declare struct GlobalVisState twice in heapam.h.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#58Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#57)
7 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-07-15 21:33:06 -0400, Alvaro Herrera wrote:

On 2020-Jul-15, Andres Freund wrote:

It could make sense to split the conversion of
VariableCacheData->latestCompletedXid to FullTransactionId out from 0001
into its own commit. Not sure...

+1, the commit is large enough and that change can be had in advance.

I've done that in the attached.

I wonder if somebody has an opinion on renaming latestCompletedXid to
latestCompletedFullXid. That's the pattern we already had (cf
nextFullXid), but it also leads to pretty long lines and quite a few
comment etc changes.

I'm somewhat inclined to remove the "Full" out of the variable, and to
also do that for nextFullXid. I feel like including it in the variable
name is basically a poor copy of the (also not great) C type system. If
we hadn't made FullTransactionId a struct (and thus incompatible with
TransactionId) I'd see it differently, but we have ...
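
To illustrate the type-system point, here is a minimal sketch using simplified stand-ins for the PostgreSQL definitions (the real ones live in transam.h; the layout shown and the helper compute_xmax() are assumptions made for this example). Because FullTransactionId is a struct, assigning it directly to a TransactionId is a compile error, so the conversion has to be spelled out whether or not "Full" appears in the variable name:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL definitions, for illustration only. */
typedef uint32_t TransactionId;

typedef struct FullTransactionId
{
	uint64_t	value;			/* epoch in the high 32 bits, xid in the low 32 */
} FullTransactionId;

static inline FullTransactionId
FullTransactionIdFromEpochAndXid(uint32_t epoch, TransactionId xid)
{
	FullTransactionId result;

	result.value = ((uint64_t) epoch << 32) | xid;
	return result;
}

static inline TransactionId
XidFromFullTransactionId(FullTransactionId fxid)
{
	/* Truncation to 32 bits drops the epoch and keeps the xid. */
	return (TransactionId) fxid.value;
}

/*
 * Hypothetical helper mirroring the "xmax is latest completed + 1" step.
 * It ignores special xids and wraparound, which the real code must handle.
 */
static inline TransactionId
compute_xmax(FullTransactionId latestCompletedXid)
{
	TransactionId xid = XidFromFullTransactionId(latestCompletedXid);

	/*
	 * "TransactionId xmax = latestCompletedXid;" would not compile here:
	 * a struct cannot be assigned to an integer type.
	 */
	assert(xid != UINT32_MAX);	/* this sketch does not model wraparound */
	return xid + 1;
}
```

With these stand-ins, compute_xmax(FullTransactionIdFromEpochAndXid(1, 100)) yields 101, and the compiler rejects any attempt to skip the explicit conversion.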

Note you forward-declare struct GlobalVisState twice in heapam.h.

Oh, fixed, thanks.

I've also fixed a correctness bug that Thomas's cfbot found (and he
personally pointed out). There were occasional make check runs with
vacuum erroring out. That turned out to be because it was possible for
the horizon used to make decisions in heap_page_prune() and
lazy_scan_heap() to differ a bit. I've started a thread about my
concerns around the fragility of that logic [1]. The code around that
can use a bit more polish, I think. I mainly wanted to post a new
version so that the patch separated out above can be looked at.

Greetings,

Andres Freund

[1]: /messages/by-id/20200723181018.neey2jd3u7rfrfrn@alap3.anarazel.de

Attachments:

v12-0007-snapshot-scalability-cache-snapshots-using-a-xac.patch (text/x-diff; charset=us-ascii)
From 20b70d9d49ab785299df156e1c424a506a4ba2f0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v12 7/7] snapshot scalability: cache snapshots using a xact
 completion counter.

Previous commits made it faster/more scalable to compute snapshots. But not
building a snapshot is still faster. Now that GetSnapshotData() does not
maintain RecentGlobal* anymore, that is actually not too hard:

This commit introduces xactCompletionCount, which tracks the number of
top-level transactions with xids (i.e. which may have modified the database)
that completed in some form since the start of the server.

We can avoid rebuilding the snapshot's contents whenever the current
xactCompletionCount is the same as it was when the snapshot was
originally built.  Currently this check happens while holding
ProcArrayLock. While it's likely possible to perform the check before
acquiring ProcArrayLock, it's too complicated for now.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/access/transam.h                |   9 ++
 src/include/utils/snapshot.h                |   7 ++
 src/backend/replication/logical/snapbuild.c |   1 +
 src/backend/storage/ipc/procarray.c         | 125 ++++++++++++++++----
 src/backend/utils/time/snapmgr.c            |   4 +
 5 files changed, 126 insertions(+), 20 deletions(-)

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index cf221582cd0..12fb7487b93 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -231,6 +231,15 @@ typedef struct VariableCacheData
 	FullTransactionId latestCompletedFullXid;	/* newest full XID that has
 												 * committed or aborted */
 
+	/*
+	 * Number of top-level transactions with xids (i.e. which may have
+	 * modified the database) that completed in some form since the start of
+	 * the server. This currently is solely used to check whether
+	 * GetSnapshotData() needs to recompute the contents of the snapshot, or
+	 * not. There are likely other users of this.  Always at least 1.
+	 */
+	uint64 xactCompletionCount;
+
 	/*
 	 * These fields are protected by XactTruncationLock
 	 */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 35b1f05bea6..dea072e5edf 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -207,6 +207,13 @@ typedef struct SnapshotData
 
 	TimestampTz whenTaken;		/* timestamp when snapshot was taken */
 	XLogRecPtr	lsn;			/* position in the WAL stream when taken */
+
+	/*
+	 * The transaction completion count at the time GetSnapshotData() built
+	 * this snapshot. Lets us avoid re-computing static snapshots when no
+	 * transactions completed since the last GetSnapshotData().
+	 */
+	uint64		snapXactCompletionCount;
 } SnapshotData;
 
 #endif							/* SNAPSHOT_H */
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e9701ea7221..9d5d68f3fa7 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -524,6 +524,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->curcid = FirstCommandId;
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
+	snapshot->snapXactCompletionCount = 0;
 
 	return snapshot;
 }
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4f6da7c86a3..5078225c1ec 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -406,6 +406,7 @@ CreateSharedProcArray(void)
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+		ShmemVariableCache->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
@@ -533,6 +534,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		/* Advance global latestCompletedXid while holding the lock */
 		MaintainLatestCompletedXid(latestXid);
 
+		/* Same with xactCompletionCount  */
+		ShmemVariableCache->xactCompletionCount++;
+
 		ProcGlobal->xids[proc->pgxactoff] = 0;
 		ProcGlobal->subxidStates[proc->pgxactoff].overflowed = false;
 		ProcGlobal->subxidStates[proc->pgxactoff].count = 0;
@@ -667,6 +671,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
+	Assert(LWLockHeldByMe(ProcArrayLock));
 	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
 	Assert(ProcGlobal->xids[pgxactoff] == proc->xid);
 
@@ -698,6 +703,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
+
+	/* Same with xactCompletionCount  */
+	ShmemVariableCache->xactCompletionCount++;
 }
 
 /*
@@ -1910,6 +1918,93 @@ GetMaxSnapshotSubxidCount(void)
 	return TOTAL_MAX_CACHED_SUBXIDS;
 }
 
+/*
+ * Initialize old_snapshot_threshold specific parts of a newly built snapshot.
+ */
+static void
+GetSnapshotDataInitOldSnapshot(Snapshot snapshot)
+{
+	if (!OldSnapshotThresholdActive())
+	{
+		/*
+		 * If not using "snapshot too old" feature, fill related fields with
+		 * dummy values that don't require any locking.
+		 */
+		snapshot->lsn = InvalidXLogRecPtr;
+		snapshot->whenTaken = 0;
+	}
+	else
+	{
+		/*
+		 * Capture the current time and WAL stream location in case this
+		 * snapshot becomes old enough to need to fall back on the special
+		 * "old snapshot" logic.
+		 */
+		snapshot->lsn = GetXLogInsertRecPtr();
+		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
+		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, snapshot->xmin);
+	}
+}
+
+/*
+ * Helper function for GetSnapshotData() that checks if the bulk of the
+ * visibility information in the snapshot is still valid. If so, it updates
+ * the fields that need to change and returns true. Otherwise it returns
+ * false.
+ *
+ * This very likely can be evolved to not need ProcArrayLock held (at very
+ * least in the case we already hold a snapshot), but that's for another day.
+ */
+static bool
+GetSnapshotDataReuse(Snapshot snapshot)
+{
+	uint64 curXactCompletionCount;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+
+	if (unlikely(snapshot->snapXactCompletionCount == 0))
+		return false;
+
+	curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
+	if (curXactCompletionCount != snapshot->snapXactCompletionCount)
+		return false;
+
+	/*
+	 * If the current xactCompletionCount is still the same as it was at the
+	 * time the snapshot was built, we can be sure that rebuilding the
+	 * contents of the snapshot the hard way would result in the same snapshot
+	 * contents:
+	 *
+	 * As explained in transam/README, the set of xids considered running by
+	 * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
+	 * contents only depend on transactions with xids and xactCompletionCount
+	 * is incremented whenever a transaction with an xid finishes (while
+	 * holding ProcArrayLock exclusively). Thus the xactCompletionCount check
+	 * ensures we would detect if the snapshot would have changed.
+	 *
+	 * As the snapshot contents are the same as it was before, it is safe
+	 * to re-enter the snapshot's xmin into the PGPROC array. None of the rows
+	 * visible under the snapshot could already have been removed (that'd
+	 * require the set of running transactions to change) and it fulfills the
+	 * requirement that concurrent GetSnapshotData() calls yield the same
+	 * xmin.
+	 */
+	if (!TransactionIdIsValid(MyProc->xmin))
+		MyProc->xmin = TransactionXmin = snapshot->xmin;
+
+	RecentXmin = snapshot->xmin;
+	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
+
+	snapshot->curcid = GetCurrentCommandId(false);
+	snapshot->active_count = 0;
+	snapshot->regd_count = 0;
+	snapshot->copied = false;
+
+	GetSnapshotDataInitOldSnapshot(snapshot);
+
+	return true;
+}
+
 /*
  * GetSnapshotData -- returns information about running transactions.
  *
@@ -1958,6 +2053,7 @@ GetSnapshotData(Snapshot snapshot)
 	TransactionId oldestxid;
 	int			mypgxactoff;
 	TransactionId myxid;
+	uint64		curXactCompletionCount;
 
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
@@ -2002,12 +2098,19 @@ GetSnapshotData(Snapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
+	if (GetSnapshotDataReuse(snapshot))
+	{
+		LWLockRelease(ProcArrayLock);
+		return snapshot;
+	}
+
 	latest_completed = ShmemVariableCache->latestCompletedFullXid;
 	mypgxactoff = MyProc->pgxactoff;
 	myxid = other_xids[mypgxactoff];
 	Assert(myxid == MyProc->xid);
 
 	oldestxid = ShmemVariableCache->oldestXid;
+	curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
 
 	/* xmax is always latestCompletedFullXid + 1 */
 	xmax = XidFromFullTransactionId(latest_completed);
@@ -2261,6 +2364,7 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->xcnt = count;
 	snapshot->subxcnt = subcount;
 	snapshot->suboverflowed = suboverflowed;
+	snapshot->snapXactCompletionCount = curXactCompletionCount;
 
 	snapshot->curcid = GetCurrentCommandId(false);
 
@@ -2272,26 +2376,7 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
 
-	if (old_snapshot_threshold < 0)
-	{
-		/*
-		 * If not using "snapshot too old" feature, fill related fields with
-		 * dummy values that don't require any locking.
-		 */
-		snapshot->lsn = InvalidXLogRecPtr;
-		snapshot->whenTaken = 0;
-	}
-	else
-	{
-		/*
-		 * Capture the current time and WAL stream location in case this
-		 * snapshot becomes old enough to need to fall back on the special
-		 * "old snapshot" logic.
-		 */
-		snapshot->lsn = GetXLogInsertRecPtr();
-		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
-		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
-	}
+	GetSnapshotDataInitOldSnapshot(snapshot);
 
 	return snapshot;
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 689a3b6a597..09ea03c2063 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -595,6 +595,8 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
 	/* NB: curcid should NOT be copied, it's a local matter */
 
+	CurrentSnapshot->snapXactCompletionCount = 0;
+
 	/*
 	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
@@ -670,6 +672,7 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->snapXactCompletionCount = 0;
 
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
@@ -2207,6 +2210,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
-- 
2.25.0.114.g5b0ca878e0
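
The reuse check in the patch above can be modeled in isolation. The following is a minimal single-threaded sketch, not the patch's code: SharedState, ModelSnapshot and the model_* functions are hypothetical names, there is no locking (the real code relies on ProcArrayLock), and xid wraparound is ignored. It shows why only transaction completion needs to bump the counter: a transaction that starts later gets an xid at or above the cached snapshot's xmax, so it is treated as running anyway.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_XACTS 8

/* Hypothetical stand-in for the shared state guarded by ProcArrayLock. */
typedef struct SharedState
{
	uint64_t	xactCompletionCount;	/* starts at 1, bumped per completion */
	uint32_t	latestCompletedXid;
	uint32_t	running[MODEL_MAX_XACTS];	/* xids of running transactions */
	int			nrunning;
	int			rebuilds;		/* instrumentation: number of full rebuilds */
} SharedState;

typedef struct ModelSnapshot
{
	uint32_t	xmax;			/* xids >= xmax count as running */
	uint32_t	xip[MODEL_MAX_XACTS];	/* running xids below xmax */
	int			xcnt;
	uint64_t	snapXactCompletionCount;	/* 0 means "never reuse" */
} ModelSnapshot;

static void
model_init(SharedState *shared)
{
	shared->xactCompletionCount = 1;
	shared->latestCompletedXid = 0;
	shared->nrunning = 0;
	shared->rebuilds = 0;
}

/* Starting a transaction does not touch the completion counter. */
static void
model_begin_xact(SharedState *shared, uint32_t xid)
{
	assert(shared->nrunning < MODEL_MAX_XACTS);
	shared->running[shared->nrunning++] = xid;
}

/* Completing a transaction with an xid invalidates cached snapshots. */
static void
model_end_xact(SharedState *shared, uint32_t xid)
{
	for (int i = 0; i < shared->nrunning; i++)
	{
		if (shared->running[i] == xid)
		{
			shared->running[i] = shared->running[--shared->nrunning];
			break;
		}
	}
	if (xid > shared->latestCompletedXid)
		shared->latestCompletedXid = xid;
	shared->xactCompletionCount++;
}

/* Returns true if the cached contents were reused, false if rebuilt. */
static bool
model_get_snapshot(SharedState *shared, ModelSnapshot *snap)
{
	if (snap->snapXactCompletionCount != 0 &&
		snap->snapXactCompletionCount == shared->xactCompletionCount)
		return true;

	snap->xmax = shared->latestCompletedXid + 1;
	snap->xcnt = 0;
	for (int i = 0; i < shared->nrunning; i++)
	{
		if (shared->running[i] < snap->xmax)
			snap->xip[snap->xcnt++] = shared->running[i];
	}
	snap->snapXactCompletionCount = shared->xactCompletionCount;
	shared->rebuilds++;
	return false;
}
```

In this model a zero-initialized ModelSnapshot is always rebuilt on first use, matching the patch's convention that copied or restored snapshots set snapXactCompletionCount to 0 so they are never reused.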

v12-0006-snapshot-scalability-Move-subxact-info-to-ProcGl.patch (text/x-diff; charset=us-ascii)
From c03ecdb557f5fe3d335f32cb540c032b3ee9a41b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v12 6/7] snapshot scalability: Move subxact info to
 ProcGlobal, remove PGXACT.

Similar to the previous changes this increases the chance that data
frequently needed by GetSnapshotData() stays in L2 cache. In many
workloads subtransactions are very rare, and this makes the check for
that considerably cheaper.

As this removes the last member of PGXACT, there is no need to keep it
around anymore.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h            |  34 ++++---
 src/backend/access/transam/clog.c     |   7 +-
 src/backend/access/transam/twophase.c |  17 ++--
 src/backend/access/transam/varsup.c   |  15 ++-
 src/backend/storage/ipc/procarray.c   | 128 ++++++++++++++------------
 src/backend/storage/lmgr/proc.c       |  24 +----
 src/tools/pgindent/typedefs.list      |   1 -
 7 files changed, 113 insertions(+), 113 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ffb775939ed..36fe5253a15 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -35,6 +35,14 @@
  */
 #define PGPROC_MAX_CACHED_SUBXIDS 64	/* XXX guessed-at value */
 
+typedef struct XidCacheStatus
+{
+	/* number of cached subxids, never more than PGPROC_MAX_CACHED_SUBXIDS */
+	uint8	count;
+	/* has PGPROC->subxids overflowed */
+	bool	overflowed;
+} XidCacheStatus;
+
 struct XidCache
 {
 	TransactionId xids[PGPROC_MAX_CACHED_SUBXIDS];
@@ -188,6 +196,8 @@ struct PGPROC
 	 */
 	SHM_QUEUE	myProcLocks[NUM_LOCK_PARTITIONS];
 
+	XidCacheStatus subxidStatus; /* mirrored with
+								  * ProcGlobal->subxidStates[i] */
 	struct XidCache subxids;	/* cache for subtransaction XIDs */
 
 	/* Support for group XID clearing. */
@@ -236,22 +246,6 @@ struct PGPROC
 
 
 extern PGDLLIMPORT PGPROC *MyProc;
-extern PGDLLIMPORT struct PGXACT *MyPgXact;
-
-/*
- * Prior to PostgreSQL 9.2, the fields below were stored as part of the
- * PGPROC.  However, benchmarking revealed that packing these particular
- * members into a separate array as tightly as possible sped up GetSnapshotData
- * considerably on systems with many CPU cores, by reducing the number of
- * cache lines needing to be fetched.  Thus, think very carefully before adding
- * anything else here.
- */
-typedef struct PGXACT
-{
-	bool		overflowed;
-
-	uint8		nxids;
-} PGXACT;
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
@@ -311,12 +305,16 @@ typedef struct PROC_HDR
 {
 	/* Array of PGPROC structures (not including dummies for prepared txns) */
 	PGPROC	   *allProcs;
-	/* Array of PGXACT structures (not including dummies for prepared txns) */
-	PGXACT	   *allPgXact;
 
 	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
 	TransactionId *xids;
 
+	/*
+	 * Array mirroring PGPROC.subxidStatus for each PGPROC currently in the
+	 * procarray.
+	 */
+	XidCacheStatus *subxidStates;
+
 	/*
 	 * Array mirroring PGPROC.vacuumFlags for each PGPROC currently in the
 	 * procarray.
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 5198a0cef68..a3095ced3fb 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -295,7 +295,7 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 	 */
 	if (all_xact_same_page && xid == MyProc->xid &&
 		nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT &&
-		nsubxids == MyPgXact->nxids &&
+		nsubxids == MyProc->subxidStatus.count &&
 		memcmp(subxids, MyProc->subxids.xids,
 			   nsubxids * sizeof(TransactionId)) == 0)
 	{
@@ -510,16 +510,15 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 	while (nextidx != INVALID_PGPROCNO)
 	{
 		PGPROC	   *proc = &ProcGlobal->allProcs[nextidx];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[nextidx];
 
 		/*
 		 * Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
 		 * should not use group XID status update mechanism.
 		 */
-		Assert(pgxact->nxids <= THRESHOLD_SUBTRANS_CLOG_OPT);
+		Assert(proc->subxidStatus.count <= THRESHOLD_SUBTRANS_CLOG_OPT);
 
 		TransactionIdSetPageStatusInternal(proc->clogGroupMemberXid,
-										   pgxact->nxids,
+										   proc->subxidStatus.count,
 										   proc->subxids.xids,
 										   proc->clogGroupMemberXidStatus,
 										   proc->clogGroupMemberLsn,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 3371ebd8896..2642eaf99de 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -21,9 +21,9 @@
  *		GIDs and aborts the transaction if there already is a global
  *		transaction in prepared state with the same GID.
  *
- *		A global transaction (gxact) also has dummy PGXACT and PGPROC; this is
- *		what keeps the XID considered running by TransactionIdIsInProgress.
- *		It is also convenient as a PGPROC to hook the gxact's locks to.
+ *		A global transaction (gxact) also has dummy PGPROC; this is what keeps
+ *		the XID considered running by TransactionIdIsInProgress.  It is also
+ *		convenient as a PGPROC to hook the gxact's locks to.
  *
  *		Information to recover prepared transactions in case of crash is
  *		now stored in WAL for the common case. In some cases there will be
@@ -447,14 +447,12 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 					TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	PGPROC	   *proc;
-	PGXACT	   *pgxact;
 	int			i;
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
 	Assert(gxact != NULL);
 	proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
 	/* Initialize the PGPROC entry */
 	MemSet(proc, 0, sizeof(PGPROC));
@@ -480,8 +478,8 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
 		SHMQueueInit(&(proc->myProcLocks[i]));
 	/* subxid data must be filled later by GXactLoadSubxactData */
-	pgxact->overflowed = false;
-	pgxact->nxids = 0;
+	proc->subxidStatus.count = 0;
+	proc->subxidStatus.overflowed = false;
 
 	gxact->prepared_at = prepared_at;
 	gxact->xid = xid;
@@ -510,19 +508,18 @@ GXactLoadSubxactData(GlobalTransaction gxact, int nsubxacts,
 					 TransactionId *children)
 {
 	PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
 	/* We need no extra lock since the GXACT isn't valid yet */
 	if (nsubxacts > PGPROC_MAX_CACHED_SUBXIDS)
 	{
-		pgxact->overflowed = true;
+		proc->subxidStatus.overflowed = true;
 		nsubxacts = PGPROC_MAX_CACHED_SUBXIDS;
 	}
 	if (nsubxacts > 0)
 	{
 		memcpy(proc->subxids.xids, children,
 			   nsubxacts * sizeof(TransactionId));
-		pgxact->nxids = nsubxacts;
+		proc->subxidStatus.count = nsubxacts;
 	}
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 73167054e61..a2b1ac7cfb3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -222,22 +222,31 @@ GetNewTransactionId(bool isSubXact)
 	 */
 	if (!isSubXact)
 	{
+		Assert(ProcGlobal->subxidStates[MyProc->pgxactoff].count == 0);
+		Assert(!ProcGlobal->subxidStates[MyProc->pgxactoff].overflowed);
+		Assert(MyProc->subxidStatus.count == 0);
+		Assert(!MyProc->subxidStatus.overflowed);
+
 		/* LWLockRelease acts as barrier */
 		MyProc->xid = xid;
 		ProcGlobal->xids[MyProc->pgxactoff] = xid;
 	}
 	else
 	{
-		int			nxids = MyPgXact->nxids;
+		XidCacheStatus *substat = &ProcGlobal->subxidStates[MyProc->pgxactoff];
+		int			nxids = MyProc->subxidStatus.count;
+
+		Assert(substat->count == MyProc->subxidStatus.count);
+		Assert(substat->overflowed == MyProc->subxidStatus.overflowed);
 
 		if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
 		{
 			MyProc->subxids.xids[nxids] = xid;
 			pg_write_barrier();
-			MyPgXact->nxids = nxids + 1;
+			MyProc->subxidStatus.count = substat->count = nxids + 1;
 		}
 		else
-			MyPgXact->overflowed = true;
+			MyProc->subxidStatus.overflowed = substat->overflowed = true;
 	}
 
 	LWLockRelease(XidGenLock);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index dc46b98f5fd..4f6da7c86a3 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4,9 +4,10 @@
  *	  POSTGRES process array code.
  *
  *
- * This module maintains arrays of the PGPROC and PGXACT structures for all
- * active backends.  Although there are several uses for this, the principal
- * one is as a means of determining the set of currently running transactions.
+ * This module maintains arrays of PGPROC substructures, as well as associated
+ * arrays in ProcGlobal, for all active backends.  Although there are several
+ * uses for this, the principal one is as a means of determining the set of
+ * currently running transactions.
  *
  * Because of various subtle race conditions it is critical that a backend
  * hold the correct locks while setting or clearing its xid (in
@@ -85,7 +86,7 @@ typedef struct ProcArrayStruct
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
 	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGXACT
+	 * similar reasons to tracking overflowing cached subxids in PGPROC
 	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
 	 * lock to read it.
 	 */
@@ -96,7 +97,7 @@ typedef struct ProcArrayStruct
 	/* oldest catalog xmin of any replication slot */
 	TransactionId replication_slot_catalog_xmin;
 
-	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
+	/* indexes into allProcs[], has PROCARRAY_MAXPROCS entries */
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
@@ -239,7 +240,6 @@ typedef struct ComputeXidHorizonsResult
 static ProcArrayStruct *procArray;
 
 static PGPROC *allProcs;
-static PGXACT *allPgXact;
 
 /*
  * Bookkeeping for tracking emulated transactions in recovery
@@ -325,8 +325,7 @@ static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
 static TransactionId KnownAssignedXidsGetOldestXmin(void);
 static void KnownAssignedXidsDisplay(int trace_level);
 static void KnownAssignedXidsReset(void);
-static inline void ProcArrayEndTransactionInternal(PGPROC *proc,
-												   PGXACT *pgxact, TransactionId latestXid);
+static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
 static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
@@ -410,7 +409,6 @@ CreateSharedProcArray(void)
 	}
 
 	allProcs = ProcGlobal->allProcs;
-	allPgXact = ProcGlobal->allPgXact;
 
 	/* Create or attach to the KnownAssignedXids arrays too, if needed */
 	if (EnableHotStandby)
@@ -475,11 +473,14 @@ ProcArrayAdd(PGPROC *proc)
 			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
 	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+	memmove(&ProcGlobal->subxidStates[index + 1], &ProcGlobal->subxidStates[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->subxidStates));
 	memmove(&ProcGlobal->vacuumFlags[index + 1], &ProcGlobal->vacuumFlags[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->vacuumFlags));
 
 	arrayP->pgprocnos[index] = proc->pgprocno;
 	ProcGlobal->xids[index] = proc->xid;
+	ProcGlobal->subxidStates[index] = proc->subxidStatus;
 	ProcGlobal->vacuumFlags[index] = proc->vacuumFlags;
 
 	arrayP->numProcs++;
@@ -533,6 +534,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		MaintainLatestCompletedXid(latestXid);
 
 		ProcGlobal->xids[proc->pgxactoff] = 0;
+		ProcGlobal->subxidStates[proc->pgxactoff].overflowed = false;
+		ProcGlobal->subxidStates[proc->pgxactoff].count = 0;
 	}
 	else
 	{
@@ -541,6 +544,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	}
 
 	Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff] == 0));
+	Assert(ProcGlobal->subxidStates[proc->pgxactoff].count == 0);
+	Assert(!ProcGlobal->subxidStates[proc->pgxactoff].overflowed);
 	ProcGlobal->vacuumFlags[proc->pgxactoff] = 0;
 
 	for (index = 0; index < arrayP->numProcs; index++)
@@ -552,6 +557,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
 			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+			memmove(&ProcGlobal->subxidStates[index], &ProcGlobal->subxidStates[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->subxidStates));
 			memmove(&ProcGlobal->vacuumFlags[index], &ProcGlobal->vacuumFlags[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->vacuumFlags));
 
@@ -597,8 +604,6 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 {
-	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
-
 	if (TransactionIdIsValid(latestXid))
 	{
 		/*
@@ -616,7 +621,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 */
 		if (LWLockConditionalAcquire(ProcArrayLock, LW_EXCLUSIVE))
 		{
-			ProcArrayEndTransactionInternal(proc, pgxact, latestXid);
+			ProcArrayEndTransactionInternal(proc, latestXid);
 			LWLockRelease(ProcArrayLock);
 		}
 		else
@@ -630,15 +635,14 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * estimate of global xmin, but that's OK.
 		 */
 		Assert(!TransactionIdIsValid(proc->xid));
+		Assert(proc->subxidStatus.count == 0);
+		Assert(!proc->subxidStatus.overflowed);
 
 		proc->lxid = InvalidLocalTransactionId;
 		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false;	/* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
-		Assert(pgxact->nxids == 0);
-		Assert(pgxact->overflowed == false);
-
 		/* must be cleared with xid/xmin: */
 		/* avoid unnecessarily dirtying shared cachelines */
 		if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
@@ -659,8 +663,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
  * We don't do any locking here; caller must handle that.
  */
 static inline void
-ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
-								TransactionId latestXid)
+ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
@@ -683,8 +686,15 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	}
 
 	/* Clear the subtransaction-XID cache too while holding the lock */
-	pgxact->nxids = 0;
-	pgxact->overflowed = false;
+	Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatus.count &&
+		   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatus.overflowed);
+	if (proc->subxidStatus.count > 0 || proc->subxidStatus.overflowed)
+	{
+		ProcGlobal->subxidStates[pgxactoff].count = 0;
+		ProcGlobal->subxidStates[pgxactoff].overflowed = false;
+		proc->subxidStatus.count = 0;
+		proc->subxidStatus.overflowed = false;
+	}
 
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
@@ -774,9 +784,8 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	while (nextidx != INVALID_PGPROCNO)
 	{
 		PGPROC	   *proc = &allProcs[nextidx];
-		PGXACT	   *pgxact = &allPgXact[nextidx];
 
-		ProcArrayEndTransactionInternal(proc, pgxact, proc->procArrayGroupMemberXid);
+		ProcArrayEndTransactionInternal(proc, proc->procArrayGroupMemberXid);
 
 		/* Move to next proc in list. */
 		nextidx = pg_atomic_read_u32(&proc->procArrayGroupNext);
@@ -820,7 +829,6 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
-	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
 	size_t		pgxactoff;
 
 	/*
@@ -845,8 +853,15 @@ ProcArrayClearTransaction(PGPROC *proc)
 	Assert(!proc->delayChkpt);
 
 	/* Clear the subtransaction-XID cache too */
-	pgxact->nxids = 0;
-	pgxact->overflowed = false;
+	Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatus.count &&
+		   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatus.overflowed);
+	if (proc->subxidStatus.count > 0 || proc->subxidStatus.overflowed)
+	{
+		ProcGlobal->subxidStates[pgxactoff].count = 0;
+		ProcGlobal->subxidStates[pgxactoff].overflowed = false;
+		proc->subxidStatus.count = 0;
+		proc->subxidStatus.overflowed = false;
+	}
 
 	LWLockRelease(ProcArrayLock);
 }
@@ -1267,6 +1282,7 @@ TransactionIdIsInProgress(TransactionId xid)
 {
 	static TransactionId *xids = NULL;
 	static TransactionId *other_xids;
+	XidCacheStatus *other_subxidstates;
 	int			nxids = 0;
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId topxid;
@@ -1328,6 +1344,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	}
 
 	other_xids = ProcGlobal->xids;
+	other_subxidstates = ProcGlobal->subxidStates;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -1349,7 +1366,6 @@ TransactionIdIsInProgress(TransactionId xid)
 	for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 	{
 		int			pgprocno;
-		PGXACT	   *pgxact;
 		PGPROC	   *proc;
 		TransactionId pxid;
 		int			pxids;
@@ -1384,9 +1400,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
-		pgprocno = arrayP->pgprocnos[pgxactoff];
-		pgxact = &allPgXact[pgprocno];
-		pxids = pgxact->nxids;
+		pxids = other_subxidstates[pgxactoff].count;
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
 		pgprocno = arrayP->pgprocnos[pgxactoff];
 		proc = &allProcs[pgprocno];
@@ -1410,7 +1424,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		 * we hold ProcArrayLock.  So we can't miss an Xid that we need to
 		 * worry about.)
 		 */
-		if (pgxact->overflowed)
+		if (other_subxidstates[pgxactoff].overflowed)
 			xids[nxids++] = pxid;
 	}
 
@@ -2014,6 +2028,7 @@ GetSnapshotData(Snapshot snapshot)
 		size_t		numProcs = arrayP->numProcs;
 		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
+		XidCacheStatus *subxidStates = ProcGlobal->subxidStates;
 		uint8	   *allVacuumFlags = ProcGlobal->vacuumFlags;
 
 		/*
@@ -2090,17 +2105,16 @@ GetSnapshotData(Snapshot snapshot)
 			 */
 			if (!suboverflowed)
 			{
-				int			pgprocno = pgprocnos[pgxactoff];
-				PGXACT	   *pgxact = &allPgXact[pgprocno];
 
-				if (pgxact->overflowed)
+				if (subxidStates[pgxactoff].overflowed)
 					suboverflowed = true;
 				else
 				{
-					int			nsubxids = pgxact->nxids;
+					int			nsubxids = subxidStates[pgxactoff].count;
 
 					if (nsubxids > 0)
 					{
+						int			pgprocno = pgprocnos[pgxactoff];
 						PGPROC	   *proc = &allProcs[pgprocno];
 
 						pg_read_barrier();	/* pairs with GetNewTransactionId */
@@ -2493,8 +2507,6 @@ GetRunningTransactionData(void)
 	 */
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		int			pgprocno = arrayP->pgprocnos[index];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2515,7 +2527,7 @@ GetRunningTransactionData(void)
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
-		if (pgxact->overflowed)
+		if (ProcGlobal->subxidStates[index].overflowed)
 			suboverflowed = true;
 
 		/*
@@ -2535,27 +2547,28 @@ GetRunningTransactionData(void)
 	 */
 	if (!suboverflowed)
 	{
+		XidCacheStatus *other_subxidstates = ProcGlobal->subxidStates;
+
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
 			PGPROC	   *proc = &allProcs[pgprocno];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			int			nxids;
+			int			nsubxids;
 
 			/*
 			 * Save subtransaction XIDs. Other backends can't add or remove
 			 * entries while we're holding XidGenLock.
 			 */
-			nxids = pgxact->nxids;
-			if (nxids > 0)
+			nsubxids = other_subxidstates[index].count;
+			if (nsubxids > 0)
 			{
 				/* barrier not really required, as XidGenLock is held, but ... */
 				pg_read_barrier();	/* pairs with GetNewTransactionId */
 
 				memcpy(&xids[count], (void *) proc->subxids.xids,
-					   nxids * sizeof(TransactionId));
-				count += nxids;
-				subcount += nxids;
+					   nsubxids * sizeof(TransactionId));
+				count += nsubxids;
+				subcount += nsubxids;
 
 				/*
 				 * Top-level XID of a transaction is always less than any of
@@ -3622,14 +3635,6 @@ ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
 	LWLockRelease(ProcArrayLock);
 }
 
-
-#define XidCacheRemove(i) \
-	do { \
-		MyProc->subxids.xids[i] = MyProc->subxids.xids[MyPgXact->nxids - 1]; \
-		pg_write_barrier(); \
-		MyPgXact->nxids--; \
-	} while (0)
-
 /*
  * XidCacheRemoveRunningXids
  *
@@ -3645,6 +3650,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 {
 	int			i,
 				j;
+	XidCacheStatus *mysubxidstat;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -3662,6 +3668,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
+	mysubxidstat = &ProcGlobal->subxidStates[MyProc->pgxactoff];
+
 	/*
 	 * Under normal circumstances xid and xids[] will be in increasing order,
 	 * as will be the entries in subxids.  Scan backwards to avoid O(N^2)
@@ -3671,11 +3679,14 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	{
 		TransactionId anxid = xids[i];
 
-		for (j = MyPgXact->nxids - 1; j >= 0; j--)
+		for (j = MyProc->subxidStatus.count - 1; j >= 0; j--)
 		{
 			if (TransactionIdEquals(MyProc->subxids.xids[j], anxid))
 			{
-				XidCacheRemove(j);
+				MyProc->subxids.xids[j] = MyProc->subxids.xids[MyProc->subxidStatus.count - 1];
+				pg_write_barrier();
+				mysubxidstat->count--;
+				MyProc->subxidStatus.count--;
 				break;
 			}
 		}
@@ -3687,20 +3698,23 @@ XidCacheRemoveRunningXids(TransactionId xid,
 		 * error during AbortSubTransaction.  So instead of Assert, emit a
 		 * debug warning.
 		 */
-		if (j < 0 && !MyPgXact->overflowed)
+		if (j < 0 && !MyProc->subxidStatus.overflowed)
 			elog(WARNING, "did not find subXID %u in MyProc", anxid);
 	}
 
-	for (j = MyPgXact->nxids - 1; j >= 0; j--)
+	for (j = MyProc->subxidStatus.count - 1; j >= 0; j--)
 	{
 		if (TransactionIdEquals(MyProc->subxids.xids[j], xid))
 		{
-			XidCacheRemove(j);
+			MyProc->subxids.xids[j] = MyProc->subxids.xids[MyProc->subxidStatus.count - 1];
+			pg_write_barrier();
+			mysubxidstat->count--;
+			MyProc->subxidStatus.count--;
 			break;
 		}
 	}
 	/* Ordinarily we should have found it, unless the cache has overflowed */
-	if (j < 0 && !MyPgXact->overflowed)
+	if (j < 0 && !MyProc->subxidStatus.overflowed)
 		elog(WARNING, "did not find subXID %u in MyProc", xid);
 
 	/* Also advance global latestCompletedFullXid while holding the lock */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f6113b2d243..aa9fbd80545 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -63,9 +63,8 @@ int			LockTimeout = 0;
 int			IdleInTransactionSessionTimeout = 0;
 bool		log_lock_waits = false;
 
-/* Pointer to this process's PGPROC and PGXACT structs, if any */
+/* Pointer to this process's PGPROC struct, if any */
 PGPROC	   *MyProc = NULL;
-PGXACT	   *MyPgXact = NULL;
 
 /*
  * This spinlock protects the freelist of recycled PGPROC structures.
@@ -110,10 +109,8 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, sizeof(slock_t));
 
-	size = add_size(size, mul_size(MaxBackends, sizeof(PGXACT)));
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
-	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
 
 	return size;
@@ -161,7 +158,6 @@ void
 InitProcGlobal(void)
 {
 	PGPROC	   *procs;
-	PGXACT	   *pgxacts;
 	int			i,
 				j;
 	bool		found;
@@ -202,18 +198,6 @@ InitProcGlobal(void)
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
 	ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
 
-	/*
-	 * Also allocate a separate array of PGXACT structures.  This is separate
-	 * from the main PGPROC array so that the most heavily accessed data is
-	 * stored contiguously in memory in as few cache lines as possible. This
-	 * provides significant performance benefits, especially on a
-	 * multiprocessor system.  There is one PGXACT structure for every PGPROC
-	 * structure.
-	 */
-	pgxacts = (PGXACT *) ShmemAlloc(TotalProcs * sizeof(PGXACT));
-	MemSet(pgxacts, 0, TotalProcs * sizeof(PGXACT));
-	ProcGlobal->allPgXact = pgxacts;
-
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
@@ -224,6 +208,8 @@ InitProcGlobal(void)
 	ProcGlobal->xids =
 		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
 	ProcGlobal->vacuumFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 	MemSet(ProcGlobal->vacuumFlags, 0, TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 
@@ -372,7 +358,6 @@ InitProcess(void)
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
 	}
-	MyPgXact = &ProcGlobal->allPgXact[MyProc->pgprocno];
 
 	/*
 	 * Cross-check that the PGPROC is of the type we expect; if this were not
@@ -569,7 +554,6 @@ InitAuxiliaryProcess(void)
 	((volatile PGPROC *) auxproc)->pid = MyProcPid;
 
 	MyProc = auxproc;
-	MyPgXact = &ProcGlobal->allPgXact[auxproc->pgprocno];
 
 	SpinLockRelease(ProcStructLock);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b4948ac675f..3d990463ce9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1536,7 +1536,6 @@ PGSetenvStatusType
 PGShmemHeader
 PGTransactionStatusType
 PGVerbosity
-PGXACT
 PG_Locale_Strategy
 PG_Lock_Status
 PG_init_t
-- 
2.25.0.114.g5b0ca878e0

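The hunks above replace the PGXACT nxids/overflowed fields with an XidCacheStatus that lives in two places: a private copy in the PGPROC (proc->subxidStatus) and a mirror in the dense ProcGlobal->subxidStates array that snapshot builders scan. The following is a minimal single-threaded sketch of that mirroring and of the swap-with-last removal that used to live in the XidCacheRemove macro. It is not the patch's code: ToyProc, dense_subxid_states, toy_add_subxid, toy_remove_subxid and the array sizes are invented for illustration, and the locking and memory barriers of the real code appear only as comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CACHED_SUBXIDS 64	/* assumption: toy stand-in for the real limit */
#define MAX_PROCS 8				/* assumption: toy-sized proc array */

typedef uint32_t TransactionId;

typedef struct XidCacheStatus
{
	uint8_t		count;			/* number of cached subxids */
	bool		overflowed;		/* true if the cache could not hold them all */
} XidCacheStatus;

/* Toy PGPROC: owns the subxid cache plus a private copy of its status. */
typedef struct ToyProc
{
	size_t		pgxactoff;		/* this proc's index into the dense array */
	XidCacheStatus subxidStatus;	/* private mirror of the dense entry */
	TransactionId subxids[MAX_CACHED_SUBXIDS];
} ToyProc;

/* Dense array, one entry per proc, scanned without touching each ToyProc. */
static XidCacheStatus dense_subxid_states[MAX_PROCS];

/* Cache a new subxid, or mark the cache as overflowed if it is full. */
static inline void
toy_add_subxid(ToyProc *proc, TransactionId xid)
{
	if (proc->subxidStatus.count < MAX_CACHED_SUBXIDS)
	{
		proc->subxids[proc->subxidStatus.count] = xid;
		/* the real code needs a write barrier before publishing the count */
		proc->subxidStatus.count++;
		dense_subxid_states[proc->pgxactoff].count = proc->subxidStatus.count;
	}
	else
	{
		proc->subxidStatus.overflowed = true;
		dense_subxid_states[proc->pgxactoff].overflowed = true;
	}
}

/*
 * Remove xid by moving the last cached entry into its slot, then shrink
 * both copies of the count. Returns false if xid was not cached.
 */
static inline bool
toy_remove_subxid(ToyProc *proc, TransactionId xid)
{
	for (int j = (int) proc->subxidStatus.count - 1; j >= 0; j--)
	{
		if (proc->subxids[j] == xid)
		{
			proc->subxids[j] = proc->subxids[proc->subxidStatus.count - 1];
			/* the real code issues pg_write_barrier() here */
			dense_subxid_states[proc->pgxactoff].count--;
			proc->subxidStatus.count--;
			return true;
		}
	}
	return false;
}
```

The point of keeping both copies is that the owning backend can read its own state from its PGPROC, while scans over all backends read only the contiguous array and look up the PGPROC only when count > 0.
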
v12-0001-Track-latest-completed-xid-as-a-FullTransactionI.patch (text/x-diff; charset=us-ascii)
From f2a0cebcdd202fbe39b5f0aa8c2034df1b6a5ccd Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 23 Jul 2020 14:19:48 -0700
Subject: [PATCH v12 1/7] Track latest completed xid as a FullTransactionId.

The reason for doing so is that a subsequent commit will need that to
avoid wraparound issues. As the subsequent change is large this was
split out for easier review.

The reason this is not a perfectly straightforward change is that we do
not want to track 64bit xids in the procarray or the WAL. Therefore we
need to advance latestCompletedXid in relation to 32 bit xids. The
code for that is now centralized in MaintainLatestCompletedXid*.
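
As an illustration of widening a 32 bit xid against a 64 bit reference value, here is a minimal sketch of the arithmetic. It assumes the xid lies within 2^31 of the reference. FullXid and full_xid_via_relative are hypothetical stand-ins for FullTransactionId and the FullXidViaRelative() helper added by this patch; the sketch leaves out the assertions and any handling of special xids.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-in for FullTransactionId: epoch in the high 32 bits. */
typedef struct FullXid
{
	uint64_t	value;
} FullXid;

/*
 * Widen xid to 64 bits by interpreting its distance from the low 32 bits
 * of rel as a signed 32 bit offset. Only meaningful if xid is known to be
 * within 2^31 of rel.
 */
static inline FullXid
full_xid_via_relative(FullXid rel, TransactionId xid)
{
	TransactionId rel_xid = (TransactionId) rel.value;	/* low 32 bits */
	int32_t		diff = (int32_t) (xid - rel_xid);		/* signed distance */
	FullXid		result = {rel.value + (uint64_t) (int64_t) diff};

	return result;
}
```

With a reference of epoch 1, xid 5, the xid 10 maps to epoch 1, xid 10, while the xid 0xFFFFFFF0 is treated as lying 21 xids before the reference and therefore lands in epoch 0.
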

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/access/transam.h        |  37 ++++++-
 src/backend/access/transam/README   |  16 +--
 src/backend/access/transam/varsup.c |  54 +++++++++-
 src/backend/access/transam/xlog.c   |   7 +-
 src/backend/storage/ipc/procarray.c | 150 +++++++++++++++++++++-------
 5 files changed, 212 insertions(+), 52 deletions(-)

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index a91a0c7487d..9502217932e 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -54,6 +54,8 @@
 #define FullTransactionIdFollowsOrEquals(a, b) ((a).value >= (b).value)
 #define FullTransactionIdIsValid(x)		TransactionIdIsValid(XidFromFullTransactionId(x))
 #define InvalidFullTransactionId		FullTransactionIdFromEpochAndXid(0, InvalidTransactionId)
+#define FirstNormalFullTransactionId	FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId)
+#define FullTransactionIdIsNormal(x)	FullTransactionIdFollowsOrEquals(x, FirstNormalFullTransactionId)
 
 /*
  * A 64 bit value that contains an epoch and a TransactionId.  This is
@@ -102,6 +104,31 @@ FullTransactionIdAdvance(FullTransactionId *dest)
 		dest->value++;
 }
 
+/*
+ * Retreat a FullTransactionId variable, stepping over xids that would appear
+ * to be special only when viewed as 32bit XIDs.
+ */
+static inline void
+FullTransactionIdRetreat(FullTransactionId *dest)
+{
+	dest->value--;
+
+	/*
+	 * In contrast to 32bit XIDs don't step over the "actual" special xids.
+	 * For 64bit xids these can't be reached as part of a wraparound as they
+	 * can in the 32bit case.
+	 */
+	if (FullTransactionIdPrecedes(*dest, FirstNormalFullTransactionId))
+		return;
+
+	/*
+	 * But we do need to step over XIDs that'd appear special only for 32bit
+	 * XIDs.
+	 */
+	while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
+		dest->value--;
+}
+
 /* back up a transaction ID variable, handling wraparound correctly */
 #define TransactionIdRetreat(dest)	\
 	do { \
@@ -193,8 +220,8 @@ typedef struct VariableCacheData
 	/*
 	 * These fields are protected by ProcArrayLock.
 	 */
-	TransactionId latestCompletedXid;	/* newest XID that has committed or
-										 * aborted */
+	FullTransactionId latestCompletedFullXid;	/* newest full XID that has
+												 * committed or aborted */
 
 	/*
 	 * These fields are protected by XactTruncationLock
@@ -244,6 +271,12 @@ extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
 extern bool ForceTransactionIdLimitUpdate(void);
 extern Oid	GetNewObjectId(void);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertTransactionIdInAllowableRange(TransactionId xid);
+#else
+#define AssertTransactionIdInAllowableRange(xid) ((void)true)
+#endif
+
 /*
  * Some frontend programs include this header.  For compilers that emit static
  * inline functions even when they're unused, that leads to unsatisfied
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eb9aac5fd39..c06e52f9cd0 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -257,31 +257,31 @@ simultaneously, we have one backend take ProcArrayLock and clear the XIDs
 of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
-latestCompletedXid variable.  This allows GetSnapshotData to use
-latestCompletedXid + 1 as xmax for its snapshot: there can be no
+latestCompletedFullXid variable.  This allows GetSnapshotData to use
+latestCompletedFullXid + 1 as xmax for its snapshot: there can be no
 transaction >= this xid value that the snapshot needs to consider as
 completed.
 
 In short, then, the rule is that no transaction may exit the set of
-currently-running transactions between the time we fetch latestCompletedXid
+currently-running transactions between the time we fetch latestCompletedFullXid
 and the time we finish building our snapshot.  However, this restriction
 only applies to transactions that have an XID --- read-only transactions
 can end without acquiring ProcArrayLock, since they don't affect anyone
-else's snapshot nor latestCompletedXid.
+else's snapshot nor latestCompletedFullXid.
 
 Transaction start, per se, doesn't have any interlocking with these
 considerations, since we no longer assign an XID immediately at transaction
 start.  But when we do decide to allocate an XID, GetNewTransactionId must
 store the new XID into the shared ProcArray before releasing XidGenLock.
-This ensures that all top-level XIDs <= latestCompletedXid are either
+This ensures that all top-level XIDs <= latestCompletedFullXid are either
 present in the ProcArray, or not running anymore.  (This guarantee doesn't
 apply to subtransaction XIDs, because of the possibility that there's not
 room for them in the subxid array; instead we guarantee that they are
 present or the overflow flag is set.)  If a backend released XidGenLock
 before storing its XID into MyPgXact, then it would be possible for another
-backend to allocate and commit a later XID, causing latestCompletedXid to
+backend to allocate and commit a later XID, causing latestCompletedFullXid to
 pass the first backend's XID, before that value became visible in the
-ProcArray.  That would break GetOldestXmin, as discussed below.
+ProcArray.  That would break ComputeXidHorizons, as discussed below.
 
 We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
 subxid array) without taking ProcArrayLock.  This was once necessary to
@@ -311,7 +311,7 @@ currently-active XIDs: no xact, in particular not the oldest, can exit
 while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
 active XID will be the same as that of any concurrent GetSnapshotData, and
 so it can't produce an overestimate.  If there is no active transaction at
-all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
+all, GetOldestXmin returns latestCompletedFullXid + 1, which is a lower bound
 for the xmin that might be computed by concurrent or later GetSnapshotData
 calls.  (We know that no XID less than this could be about to appear in
 the ProcArray, because of the XidGenLock interlock discussed above.)
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index e14b53bf9e3..66eb74aa9f8 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -187,8 +187,8 @@ GetNewTransactionId(bool isSubXact)
 	/*
 	 * We must store the new XID into the shared ProcArray before releasing
 	 * XidGenLock.  This ensures that every active XID older than
-	 * latestCompletedXid is present in the ProcArray, which is essential for
-	 * correct OldestXmin tracking; see src/backend/access/transam/README.
+	 * latestCompletedFullXid is present in the ProcArray, which is essential
+	 * for correct OldestXmin tracking; see src/backend/access/transam/README.
 	 *
 	 * Note that readers of PGXACT xid fields should be careful to fetch the
 	 * value only once, rather than assume they can read a value multiple
@@ -566,3 +566,53 @@ GetNewObjectId(void)
 
 	return result;
 }
+
+
+#ifdef USE_ASSERT_CHECKING
+
+/*
+ * Assert that xid is between [oldestXid, nextFullXid], which is the range we
+ * expect XIDs coming from tables etc to be in.
+ *
+ * As ShmemVariableCache->oldestXid could change just after this call without
+ * further precautions, and as a wrapped-around xid could again fall within
+ * the valid range, this assertion can only detect if something is definitely
+ * wrong, but not establish correctness.
+ *
+ * This intentionally does not expose a return value, to avoid code being
+ * introduced that depends on the return value.
+ */
+void
+AssertTransactionIdInAllowableRange(TransactionId xid)
+{
+	TransactionId oldest_xid;
+	TransactionId next_xid;
+
+	Assert(TransactionIdIsValid(xid));
+
+	/* we may see bootstrap / frozen */
+	if (!TransactionIdIsNormal(xid))
+		return;
+
+	/*
+	 * We can't acquire XidGenLock, as this may be called with XidGenLock
+	 * already held (or with other locks that don't allow XidGenLock to be
+	 * nested). That's ok for our purposes though, since we already rely on
+	 * 32bit reads to be atomic. While nextFullXid is 64 bit, we only look at
+	 * the lower 32bit, so a skewed read doesn't hurt.
+	 *
+	 * There's no increased danger of falling outside [oldest, next] by
+	 * accessing them without a lock. xid needs to have been created with
+	 * GetNewTransactionId() in the originating session, and the locks there
+	 * pair with the memory barrier below.  We do however accept xid to be <=
+	 * next_xid, instead of just <, as xid could be from the procarray,
+	 * before we see the updated nextFullXid value.
+	 */
+	pg_memory_barrier();
+	oldest_xid = ShmemVariableCache->oldestXid;
+	next_xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+
+	Assert(TransactionIdFollowsOrEquals(xid, oldest_xid) ||
+		   TransactionIdPrecedesOrEquals(xid, next_xid));
+}
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 184c6672f3b..e1a043763cf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7866,10 +7866,11 @@ StartupXLOG(void)
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
 	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
-	/* also initialize latestCompletedXid, to nextXid - 1 */
+	/* also initialize latestCompletedFullXid, to nextFullXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
-	TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);
+	ShmemVariableCache->latestCompletedFullXid =
+		ShmemVariableCache->nextFullXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedFullXid);
 	LWLockRelease(ProcArrayLock);
 
 	/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index b4485335644..82798760752 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -175,6 +175,10 @@ static void KnownAssignedXidsReset(void);
 static inline void ProcArrayEndTransactionInternal(PGPROC *proc,
 												   PGXACT *pgxact, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
+static void MaintainLatestCompletedXid(TransactionId latestXid);
+static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
+
+static inline FullTransactionId FullXidViaRelative(FullTransactionId rel, TransactionId xid);
 
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
@@ -349,9 +353,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
 		/* Advance global latestCompletedXid while holding the lock */
-		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-								  latestXid))
-			ShmemVariableCache->latestCompletedXid = latestXid;
+		MaintainLatestCompletedXid(latestXid);
 	}
 	else
 	{
@@ -464,9 +466,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	pgxact->overflowed = false;
 
 	/* Also advance global latestCompletedXid while holding the lock */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  latestXid))
-		ShmemVariableCache->latestCompletedXid = latestXid;
+	MaintainLatestCompletedXid(latestXid);
 }
 
 /*
@@ -621,6 +621,59 @@ ProcArrayClearTransaction(PGPROC *proc)
 	pgxact->overflowed = false;
 }
 
+/*
+ * Update ShmemVariableCache->latestCompletedFullXid to point to latestXid if
+ * currently older.
+ */
+static void
+MaintainLatestCompletedXid(TransactionId latestXid)
+{
+	FullTransactionId cur_latest = ShmemVariableCache->latestCompletedFullXid;
+
+	Assert(FullTransactionIdIsValid(cur_latest));
+	Assert(!RecoveryInProgress());
+	Assert(LWLockHeldByMe(ProcArrayLock));
+
+	if (TransactionIdPrecedes(XidFromFullTransactionId(cur_latest), latestXid))
+	{
+		ShmemVariableCache->latestCompletedFullXid =
+			FullXidViaRelative(cur_latest, latestXid);
+	}
+
+	Assert(IsBootstrapProcessingMode() ||
+		   FullTransactionIdIsNormal(ShmemVariableCache->latestCompletedFullXid));
+}
+
+/*
+ * Same as MaintainLatestCompletedXid, except for use during WAL replay.
+ */
+static void
+MaintainLatestCompletedXidRecovery(TransactionId latestXid)
+{
+	FullTransactionId cur_latest = ShmemVariableCache->latestCompletedFullXid;
+	FullTransactionId rel;
+
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	Assert(LWLockHeldByMe(ProcArrayLock));
+
+	/*
+	 * Need a FullTransactionId to compare latestXid with. Can't rely on
+	 * latestCompletedFullXid to be initialized in recovery. But in recovery
+	 * it's safe to access nextFullXid without a lock for the startup process.
+	 */
+	rel = ShmemVariableCache->nextFullXid;
+	Assert(FullTransactionIdIsValid(ShmemVariableCache->nextFullXid));
+
+	if (!FullTransactionIdIsValid(cur_latest) ||
+		TransactionIdPrecedes(XidFromFullTransactionId(cur_latest), latestXid))
+	{
+		ShmemVariableCache->latestCompletedFullXid =
+			FullXidViaRelative(rel, latestXid);
+	}
+
+	Assert(FullTransactionIdIsNormal(ShmemVariableCache->latestCompletedFullXid));
+}
+
 /*
  * ProcArrayInitRecovery -- initialize recovery xid mgmt environment
  *
@@ -841,7 +894,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	 * Now we've got the running xids we need to set the global values that
 	 * are used to track snapshots as they evolve further.
 	 *
-	 * - latestCompletedXid which will be the xmax for snapshots
+	 * - latestCompletedFullXid which will be the xmax for snapshots
 	 * - lastOverflowedXid which shows whether snapshots overflow
 	 * - nextXid
 	 *
@@ -867,14 +920,11 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 
 	/*
 	 * If a transaction wrote a commit record in the gap between taking and
-	 * logging the snapshot then latestCompletedXid may already be higher than
-	 * the value from the snapshot, so check before we use the incoming value.
+	 * logging the snapshot then latestCompletedFullXid may already be higher
+	 * than the value from the snapshot, so check before we use the incoming
+	 * value. It also might not yet be set at all.
 	 */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  running->latestCompletedXid))
-		ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
-
-	Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
+	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
 
 	LWLockRelease(ProcArrayLock);
 
@@ -1048,10 +1098,11 @@ TransactionIdIsInProgress(TransactionId xid)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
-	 * Now that we have the lock, we can check latestCompletedXid; if the
+	 * Now that we have the lock, we can check latestCompletedFullXid; if the
 	 * target Xid is after that, it's surely still running.
 	 */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid, xid))
+	if (TransactionIdPrecedes(XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid),
+							  xid))
 	{
 		LWLockRelease(ProcArrayLock);
 		xc_by_latest_xid_inc();
@@ -1283,7 +1334,7 @@ TransactionIdIsActive(TransactionId xid)
  * anything older is definitely not considered as running by anyone anymore,
  * but the exact value calculated depends on a number of things. For example,
  * if rel = NULL and there are no transactions running in the current
- * database, GetOldestXmin() returns latestCompletedXid. If a transaction
+ * database, GetOldestXmin() returns latestCompletedFullXid. If a transaction
  * begins after that, its xmin will include in-progress transactions in other
  * databases that started earlier, so another call will return a lower value.
  * Nonetheless it is safe to vacuum a table in the current database with the
@@ -1325,14 +1376,14 @@ GetOldestXmin(Relation rel, int flags)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
-	 * We initialize the MIN() calculation with latestCompletedXid + 1. This
-	 * is a lower bound for the XIDs that might appear in the ProcArray later,
-	 * and so protects us against overestimating the result due to future
-	 * additions.
+	 * We initialize the MIN() calculation with latestCompletedFullXid + 1.
+	 * This is a lower bound for the XIDs that might appear in the ProcArray
+	 * later, and so protects us against overestimating the result due to
+	 * future additions.
 	 */
-	result = ShmemVariableCache->latestCompletedXid;
-	Assert(TransactionIdIsNormal(result));
+	result = XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid);
 	TransactionIdAdvance(result);
+	Assert(TransactionIdIsNormal(result));
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
@@ -1511,6 +1562,7 @@ GetSnapshotData(Snapshot snapshot)
 	int			count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
+	FullTransactionId latest_completed;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1554,10 +1606,11 @@ GetSnapshotData(Snapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	/* xmax is always latestCompletedXid + 1 */
-	xmax = ShmemVariableCache->latestCompletedXid;
-	Assert(TransactionIdIsNormal(xmax));
+	latest_completed = ShmemVariableCache->latestCompletedFullXid;
+	/* xmax is always latestCompletedFullXid + 1 */
+	xmax = XidFromFullTransactionId(latest_completed);
 	TransactionIdAdvance(xmax);
+	Assert(TransactionIdIsNormal(xmax));
 
 	/* initialize xmin calculation with xmax */
 	globalxmin = xmin = xmax;
@@ -1984,9 +2037,10 @@ GetRunningTransactionData(void)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	LWLockAcquire(XidGenLock, LW_SHARED);
 
-	latestCompletedXid = ShmemVariableCache->latestCompletedXid;
-
-	oldestRunningXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+	latestCompletedXid =
+		XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid);
+	oldestRunningXid =
+		XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
 
 	/*
 	 * Spin over procArray collecting all xids
@@ -3206,10 +3260,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	if (j < 0 && !MyPgXact->overflowed)
 		elog(WARNING, "did not find subXID %u in MyProc", xid);
 
-	/* Also advance global latestCompletedXid while holding the lock */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  latestXid))
-		ShmemVariableCache->latestCompletedXid = latestXid;
+	/* Also advance global latestCompletedFullXid while holding the lock */
+	MaintainLatestCompletedXid(latestXid);
 
 	LWLockRelease(ProcArrayLock);
 }
@@ -3236,6 +3288,32 @@ DisplayXidCache(void)
 }
 #endif							/* XIDCACHE_DEBUG */
 
+/*
+ * Convert a 32 bit transaction id into 64 bit transaction id, by assuming it
+ * is within MaxTransactionId / 2 of XidFromFullTransactionId(rel).
+ *
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that xid is within MaxTransactionId / 2 xids of
+ * rel. That e.g. can be guaranteed if the caller assures a snapshot is
+ * held by the backend and xid is from a table (where vacuum/freezing ensures
+ * the xid has to be within that range), or if xid is from the procarray and
+ * prevents xid wraparound that way.
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+	TransactionId rel_xid = XidFromFullTransactionId(rel);
+
+	Assert(TransactionIdIsValid(xid));
+	Assert(TransactionIdIsValid(rel_xid));
+
+	/* not guaranteed to find issues, but likely to catch mistakes */
+	AssertTransactionIdInAllowableRange(xid);
+
+	return FullTransactionIdFromU64(U64FromFullTransactionId(rel)
+									+ (int32) (xid - rel_xid));
+}
+
 
 /* ----------------------------------------------
  *		KnownAssignedTransactionIds sub-module
@@ -3387,10 +3465,8 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 
 	KnownAssignedXidsRemoveTree(xid, nsubxids, subxids);
 
-	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
-	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-							  max_xid))
-		ShmemVariableCache->latestCompletedXid = max_xid;
+	/* As in ProcArrayEndTransaction, advance latestCompletedFullXid */
+	MaintainLatestCompletedXidRecovery(max_xid);
 
 	LWLockRelease(ProcArrayLock);
 }
-- 
2.25.0.114.g5b0ca878e0

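The FullTransactionIdRetreat() helper added to transam.h in the patch above steps backwards over 64 bit values whose low 32 bits would look like a special xid. Below is a minimal sketch of that behaviour using a plain uint64_t wrapper. FullXid, full_xid_retreat and FIRST_NORMAL_XID are names made up for this example; the value 3 mirrors FirstNormalTransactionId.

```c
#include <assert.h>
#include <stdint.h>

#define FIRST_NORMAL_XID 3		/* xids 0, 1 and 2 are special */

/* Hypothetical stand-in for FullTransactionId. */
typedef struct FullXid
{
	uint64_t	value;
} FullXid;

/*
 * Step back by one, then keep stepping while the low 32 bits would look
 * like a special xid. Values that are special as full 64 bit xids (below
 * FIRST_NORMAL_XID) are left alone, since they cannot be reached by a
 * wraparound.
 */
static inline void
full_xid_retreat(FullXid *dest)
{
	dest->value--;

	if (dest->value < FIRST_NORMAL_XID)
		return;

	while ((uint32_t) dest->value < FIRST_NORMAL_XID)
		dest->value--;
}
```

Retreating from epoch 1, xid 3 therefore skips epoch 1's xids 2, 1 and 0 and ends at epoch 0, xid 0xFFFFFFFF, whereas retreating from the 64 bit value 3 simply yields 2.
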
v12-0002-snapshot-scalability-Don-t-compute-global-horizo.patch (text/x-diff; charset=us-ascii)
From 4ef7b66bbdde897a3c669abacb0eae387a5d6e37 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v12 2/7] snapshot scalability: Don't compute global horizons
 when building snapshots.

To make GetSnapshotData() more scalable, it cannot look at each proc's
xmin (see Discussion link below). Due to the frequency at which xmins are
updated, that just does not scale.

Without accessing xmins GetSnapshotData() cannot calculate accurate thresholds
as it has so far. But we don't really have to: The horizons don't actually
change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaces with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >=
   definitely_needed are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed.
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated
by GetSnapshotData()), it is likely that further tests can benefit from an
earlier computation of accurate horizons.
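
The two-threshold test described above can be sketched roughly as follows.
This is only an illustration, not the code in this patch: VisState,
vis_update(), vis_is_removable() and accurate_horizon are hypothetical names,
XIDs are plain 64-bit integers so wraparound is ignored, and the limit on how
often the accurate boundaries are recomputed is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for GlobalVisState, using 64-bit XIDs. */
typedef struct VisState
{
	uint64_t	definitely_needed;	/* XIDs >= this: rows still visible */
	uint64_t	maybe_needed;		/* XIDs < this: rows can be removed */
} VisState;

/* Hypothetical stand-in for the result of the expensive, accurate scan. */
static uint64_t accurate_horizon = 0;

/* Counts how often the expensive path was taken. */
static int	recompute_calls = 0;

/* Recompute the boundaries; in the sketch this just copies a global. */
static void
vis_update(VisState *state)
{
	recompute_calls++;
	state->maybe_needed = accurate_horizon;
	if (state->definitely_needed < state->maybe_needed)
		state->definitely_needed = state->maybe_needed;
	assert(state->maybe_needed <= state->definitely_needed);
}

/* Return true if rows deleted by xid can be removed. */
static bool
vis_is_removable(VisState *state, uint64_t xid)
{
	/* Cheap answer: older than anything that might still be needed. */
	if (xid < state->maybe_needed)
		return true;

	/* Cheap answer: definitely still visible to someone. */
	if (xid >= state->definitely_needed)
		return false;

	/* In between: make the boundaries more accurate, then retest. */
	vis_update(state);
	return xid < state->maybe_needed;
}
```

Because the boundaries are kept after a recomputation, a later test for an
XID below the new maybe_needed is answered without taking the expensive path
again.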

To avoid regressing performance when old_snapshot_threshold is set (as
that requires an accurate horizon to be computed),
heap_page_prune_opt() doesn't unconditionally call
TransactionIdLimitedForOldSnapshots() anymore. Both the computation of
the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) are now only done when necessary to
remove tuples.

Subsequent commits will take further advantage of the fact that
GetSnapshotData() will not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the
tests currently are not meaningful, and it seems best to address them
separately.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/access/ginblock.h               |   4 +-
 src/include/access/heapam.h                 |  10 +-
 src/include/access/transam.h                |  79 +-
 src/include/storage/bufpage.h               |   6 -
 src/include/storage/proc.h                  |   8 -
 src/include/storage/procarray.h             |  39 +-
 src/include/utils/snapmgr.h                 |  37 +-
 src/include/utils/snapshot.h                |   6 +
 src/backend/access/gin/ginvacuum.c          |  26 +
 src/backend/access/gist/gistutil.c          |   8 +-
 src/backend/access/gist/gistxlog.c          |  10 +-
 src/backend/access/heap/heapam.c            |  15 +-
 src/backend/access/heap/heapam_handler.c    |  24 +-
 src/backend/access/heap/heapam_visibility.c |  79 +-
 src/backend/access/heap/pruneheap.c         | 207 ++++-
 src/backend/access/heap/vacuumlazy.c        |  24 +-
 src/backend/access/index/indexam.c          |   3 +-
 src/backend/access/nbtree/README            |  10 +-
 src/backend/access/nbtree/nbtpage.c         |   4 +-
 src/backend/access/nbtree/nbtree.c          |  28 +-
 src/backend/access/nbtree/nbtxlog.c         |  10 +-
 src/backend/access/spgist/spgvacuum.c       |   6 +-
 src/backend/access/transam/README           |  78 +-
 src/backend/access/transam/xlog.c           |   4 +-
 src/backend/commands/analyze.c              |   2 +-
 src/backend/commands/vacuum.c               |  41 +-
 src/backend/postmaster/autovacuum.c         |   4 +
 src/backend/replication/logical/launcher.c  |   4 +
 src/backend/replication/walreceiver.c       |  17 +-
 src/backend/replication/walsender.c         |  15 +-
 src/backend/storage/ipc/procarray.c         | 902 ++++++++++++++++----
 src/backend/utils/adt/selfuncs.c            |  20 +-
 src/backend/utils/init/postinit.c           |   4 +
 src/backend/utils/time/snapmgr.c            | 258 +++---
 contrib/amcheck/verify_nbtree.c             |   8 +-
 contrib/pg_visibility/pg_visibility.c       |  18 +-
 contrib/pgstattuple/pgstatapprox.c          |   2 +-
 src/tools/pgindent/typedefs.list            |   2 +
 38 files changed, 1449 insertions(+), 573 deletions(-)

diff --git a/src/include/access/ginblock.h b/src/include/access/ginblock.h
index 3f64fd572e3..fe66a95226b 100644
--- a/src/include/access/ginblock.h
+++ b/src/include/access/ginblock.h
@@ -12,6 +12,7 @@
 
 #include "access/transam.h"
 #include "storage/block.h"
+#include "storage/bufpage.h"
 #include "storage/itemptr.h"
 #include "storage/off.h"
 
@@ -134,8 +135,7 @@ typedef struct GinMetaPageData
  */
 #define GinPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
 #define GinPageSetDeleteXid(page, xid) ( ((PageHeader) (page))->pd_prune_xid = xid)
-#define GinPageIsRecyclable(page) ( PageIsNew(page) || (GinPageIsDeleted(page) \
-	&& TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalXmin)))
+extern bool GinPageIsRecyclable(Page page);
 
 /*
  * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f279edc4734..232db64ecdf 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -172,9 +172,12 @@ extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
 														 int nitems);
 
 /* in heap/pruneheap.c */
+struct GlobalVisState;
 extern void heap_page_prune_opt(Relation relation, Buffer buffer);
 extern int	heap_page_prune(Relation relation, Buffer buffer,
-							TransactionId OldestXmin,
+							struct GlobalVisState *vistest,
+							TransactionId limited_oldest_xmin,
+							TimestampTz limited_oldest_ts,
 							bool report_stats, TransactionId *latestRemovedXid);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
@@ -201,11 +204,14 @@ extern TM_Result HeapTupleSatisfiesUpdate(HeapTuple stup, CommandId curcid,
 										  Buffer buffer);
 extern HTSV_Result HeapTupleSatisfiesVacuum(HeapTuple stup, TransactionId OldestXmin,
 											Buffer buffer);
+extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple stup, Buffer buffer,
+												   TransactionId *dead_after);
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 								 uint16 infomask, TransactionId xid);
 extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
-extern bool HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin);
+extern bool HeapTupleIsSurelyDead(struct GlobalVisState *vistest,
+								  HeapTuple htup);
 
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 9502217932e..cf221582cd0 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -95,15 +95,6 @@ FullTransactionIdFromU64(uint64 value)
 			(dest) = FirstNormalTransactionId; \
 	} while(0)
 
-/* advance a FullTransactionId variable, stepping over special XIDs */
-static inline void
-FullTransactionIdAdvance(FullTransactionId *dest)
-{
-	dest->value++;
-	while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
-		dest->value++;
-}
-
 /*
  * Retreat a FullTransactionId variable, stepping over xids that would appear
  * to be special only when viewed as 32bit XIDs.
@@ -129,6 +120,23 @@ FullTransactionIdRetreat(FullTransactionId *dest)
 		dest->value--;
 }
 
+/*
+ * Advance a FullTransactionId variable, stepping over xids that would appear
+ * to be special only when viewed as 32bit XIDs.
+ */
+static inline void
+FullTransactionIdAdvance(FullTransactionId *dest)
+{
+	dest->value++;
+
+	/* see FullTransactionIdRetreat() */
+	if (FullTransactionIdPrecedes(*dest, FirstNormalFullTransactionId))
+		return;
+
+	while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
+		dest->value++;
+}
+
 /* back up a transaction ID variable, handling wraparound correctly */
 #define TransactionIdRetreat(dest)	\
 	do { \
@@ -293,6 +301,59 @@ ReadNewTransactionId(void)
 	return XidFromFullTransactionId(ReadNextFullTransactionId());
 }
 
+/* return transaction ID backed up by amount, handling wraparound correctly */
+static inline TransactionId
+TransactionIdRetreatedBy(TransactionId xid, uint32 amount)
+{
+	xid -= amount;
+
+	while (xid < FirstNormalTransactionId)
+		xid--;
+
+	return xid;
+}
+
+/* return the older of the two IDs */
+static inline TransactionId
+TransactionIdOlder(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdPrecedes(a, b))
+		return a;
+	return b;
+}
+
+/* return the older of the two IDs, assuming they're both normal */
+static inline TransactionId
+NormalTransactionIdOlder(TransactionId a, TransactionId b)
+{
+	Assert(TransactionIdIsNormal(a));
+	Assert(TransactionIdIsNormal(b));
+	if (NormalTransactionIdPrecedes(a, b))
+		return a;
+	return b;
+}
+
+/* return the newer of the two IDs */
+static inline FullTransactionId
+FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
+{
+	if (!FullTransactionIdIsValid(a))
+		return b;
+
+	if (!FullTransactionIdIsValid(b))
+		return a;
+
+	if (FullTransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 #endif							/* FRONTEND */
 
 #endif							/* TRANSAM_H */
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 3f88683a059..51b8f994ac0 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -389,12 +389,6 @@ PageValidateSpecialPointer(Page page)
 #define PageClearAllVisible(page) \
 	(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
 
-#define PageIsPrunable(page, oldestxmin) \
-( \
-	AssertMacro(TransactionIdIsNormal(oldestxmin)), \
-	TransactionIdIsValid(((PageHeader) (page))->pd_prune_xid) && \
-	TransactionIdPrecedes(((PageHeader) (page))->pd_prune_xid, oldestxmin) \
-)
 #define PageSetPrunable(page, xid) \
 do { \
 	Assert(TransactionIdIsNormal(xid)); \
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index b20e2ad4f6a..08f006f782e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -42,13 +42,6 @@ struct XidCache
 
 /*
  * Flags for PGXACT->vacuumFlags
- *
- * Note: If you modify these flags, you need to modify PROCARRAY_XXX flags
- * in src/include/storage/procarray.h.
- *
- * PROC_RESERVED may later be assigned for use in vacuumFlags, but its value is
- * used for PROCARRAY_SLOTS_XMIN in procarray.h, so GetOldestXmin won't be able
- * to match and ignore processes with this flag set.
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -56,7 +49,6 @@ struct XidCache
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
-#define		PROC_RESERVED				0x20	/* reserved for procarray */
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a5c7d0c0644..ea8a876ca45 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -20,41 +20,6 @@
 #include "utils/snapshot.h"
 
 
-/*
- * These are to implement PROCARRAY_FLAGS_XXX
- *
- * Note: These flags are cloned from PROC_XXX flags in src/include/storage/proc.h
- * to avoid forcing to include proc.h when including procarray.h. So if you modify
- * PROC_XXX flags, you need to modify these flags.
- */
-#define		PROCARRAY_VACUUM_FLAG			0x02	/* currently running lazy
-													 * vacuum */
-#define		PROCARRAY_ANALYZE_FLAG			0x04	/* currently running
-													 * analyze */
-#define		PROCARRAY_LOGICAL_DECODING_FLAG 0x10	/* currently doing logical
-													 * decoding outside xact */
-
-#define		PROCARRAY_SLOTS_XMIN			0x20	/* replication slot xmin,
-													 * catalog_xmin */
-/*
- * Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
- * PGXACT->vacuumFlags. Other flags are used for different purposes and
- * have no corresponding PROC flag equivalent.
- */
-#define		PROCARRAY_PROC_FLAGS_MASK	(PROCARRAY_VACUUM_FLAG | \
-										 PROCARRAY_ANALYZE_FLAG | \
-										 PROCARRAY_LOGICAL_DECODING_FLAG)
-
-/* Use the following flags as an input "flags" to GetOldestXmin function */
-/* Consider all backends except for logical decoding ones which manage xmin separately */
-#define		PROCARRAY_FLAGS_DEFAULT			PROCARRAY_LOGICAL_DECODING_FLAG
-/* Ignore vacuum backends */
-#define		PROCARRAY_FLAGS_VACUUM			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_VACUUM_FLAG
-/* Ignore analyze backends */
-#define		PROCARRAY_FLAGS_ANALYZE			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_ANALYZE_FLAG
-/* Ignore both vacuum and analyze backends */
-#define		PROCARRAY_FLAGS_VACUUM_ANALYZE	PROCARRAY_FLAGS_DEFAULT | PROCARRAY_VACUUM_FLAG | PROCARRAY_ANALYZE_FLAG
-
 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
 extern void ProcArrayAdd(PGPROC *proc);
@@ -88,9 +53,11 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, int flags);
+extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
+extern TransactionId GetOldestTransactionIdConsideredRunning(void);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
+extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);
 
 extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
 extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index ffb4ba3adfb..b6b403e2931 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -52,13 +52,12 @@ extern Size SnapMgrShmemSize(void);
 extern void SnapMgrInit(void);
 extern TimestampTz GetSnapshotCurrentTimestamp(void);
 extern TimestampTz GetOldSnapshotThresholdTimestamp(void);
+extern void SnapshotTooOldMagicForTest(void);
 
 extern bool FirstSnapshotSet;
 
 extern PGDLLIMPORT TransactionId TransactionXmin;
 extern PGDLLIMPORT TransactionId RecentXmin;
-extern PGDLLIMPORT TransactionId RecentGlobalXmin;
-extern PGDLLIMPORT TransactionId RecentGlobalDataXmin;
 
 /* Variables representing various special snapshot semantics */
 extern PGDLLIMPORT SnapshotData SnapshotSelfData;
@@ -78,11 +77,12 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 
 /*
  * Similarly, some initialization is required for a NonVacuumable snapshot.
- * The caller must supply the xmin horizon to use (e.g., RecentGlobalXmin).
+ * The caller must supply the visibility cutoff state to use (c.f.
+ * GlobalVisTestFor()).
  */
-#define InitNonVacuumableSnapshot(snapshotdata, xmin_horizon)  \
+#define InitNonVacuumableSnapshot(snapshotdata, vistestp)  \
 	((snapshotdata).snapshot_type = SNAPSHOT_NON_VACUUMABLE, \
-	 (snapshotdata).xmin = (xmin_horizon))
+	 (snapshotdata).vistest = (vistestp))
 
 /*
  * Similarly, some initialization is required for SnapshotToast.  We need
@@ -98,6 +98,11 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 	((snapshot)->snapshot_type == SNAPSHOT_MVCC || \
 	 (snapshot)->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 
+static inline bool
+OldSnapshotThresholdActive(void)
+{
+	return old_snapshot_threshold >= 0;
+}
 
 extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
@@ -121,8 +126,6 @@ extern void UnregisterSnapshot(Snapshot snapshot);
 extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
 extern void UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner);
 
-extern FullTransactionId GetFullRecentGlobalXmin(void);
-
 extern void AtSubCommit_Snapshot(int level);
 extern void AtSubAbort_Snapshot(int level);
 extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
@@ -131,13 +134,29 @@ extern void ImportSnapshot(const char *idstr);
 extern bool XactHasExportedSnapshots(void);
 extern void DeleteAllExportedSnapshotFiles(void);
 extern bool ThereAreNoPriorRegisteredSnapshots(void);
-extern TransactionId TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-														 Relation relation);
+extern bool TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
+												Relation relation,
+												TransactionId *limit_xid,
+												TimestampTz *limit_ts);
+extern void SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit);
 extern void MaintainOldSnapshotTimeMapping(TimestampTz whenTaken,
 										   TransactionId xmin);
 
 extern char *ExportSnapshot(Snapshot snapshot);
 
+/*
+ * These live in procarray.c because they're intimately linked to the
+ * procarray contents, but thematically they better fit into snapmgr.h.
+ */
+typedef struct GlobalVisState GlobalVisState;
+extern GlobalVisState *GlobalVisTestFor(Relation rel);
+extern bool GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid);
+extern bool GlobalVisTestIsRemovableFullXid(GlobalVisState *state, FullTransactionId fxid);
+extern FullTransactionId GlobalVisTestNonRemovableFullHorizon(GlobalVisState *state);
+extern TransactionId GlobalVisTestNonRemovableHorizon(GlobalVisState *state);
+extern bool GlobalVisCheckRemovableXid(Relation rel, TransactionId xid);
+extern bool GlobalVisIsRemovableFullXid(Relation rel, FullTransactionId fxid);
+
 /*
  * Utility functions for implementing visibility routines in table AMs.
  */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 4796edb63aa..35b1f05bea6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -192,6 +192,12 @@ typedef struct SnapshotData
 	 */
 	uint32		speculativeToken;
 
+	/*
+	 * For SNAPSHOT_NON_VACUUMABLE (and hopefully more in the future) this is
+	 * used to determine whether a row could be vacuumed.
+	 */
+	struct GlobalVisState *vistest;
+
 	/*
 	 * Book-keeping information, used by the snapshot manager
 	 */
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 8ae4fd95a7b..9cd6638df62 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -793,3 +793,29 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 	return stats;
 }
+
+/*
+ * Return whether Page can safely be recycled.
+ */
+bool
+GinPageIsRecyclable(Page page)
+{
+	TransactionId delete_xid;
+
+	if (PageIsNew(page))
+		return true;
+
+	if (!GinPageIsDeleted(page))
+		return false;
+
+	delete_xid = GinPageGetDeleteXid(page);
+
+	if (!TransactionIdIsValid(delete_xid))
+		return true;
+
+	/*
+	 * If no backend could still view delete_xid as running, all scans
+	 * concurrent with ginDeletePage() must have finished.
+	 */
+	return GlobalVisCheckRemovableXid(NULL, delete_xid);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 765329bbcd4..bfda7fbe3d5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -891,15 +891,13 @@ gistPageRecyclable(Page page)
 		 * As long as that can happen, we must keep the deleted page around as
 		 * a tombstone.
 		 *
-		 * Compare the deletion XID with RecentGlobalXmin. If deleteXid <
-		 * RecentGlobalXmin, then no scan that's still in progress could have
+		 * For that, check if the deletion XID could still be visible to
+		 * anyone. If not, then no scan that's still in progress could have
 		 * seen its downlink, and we can recycle it.
 		 */
 		FullTransactionId deletexid_full = GistPageGetDeleteXid(page);
-		FullTransactionId recentxmin_full = GetFullRecentGlobalXmin();
 
-		if (FullTransactionIdPrecedes(deletexid_full, recentxmin_full))
-			return true;
+		return GlobalVisIsRemovableFullXid(NULL, deletexid_full);
 	}
 	return false;
 }
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3f0effd5e42..3167305ac00 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -387,11 +387,11 @@ gistRedoPageReuse(XLogReaderState *record)
 	 * PAGE_REUSE records exist to provide a conflict point when we reuse
 	 * pages in the index via the FSM.  That's all they do though.
 	 *
-	 * latestRemovedXid was the page's deleteXid.  The deleteXid <
-	 * RecentGlobalXmin test in gistPageRecyclable() conceptually mirrors the
-	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
-	 * Consequently, one XID value achieves the same exclusion effect on
-	 * primary and standby.
+	 * latestRemovedXid was the page's deleteXid.  The
+	 * GlobalVisIsRemovableFullXid(deleteXid) test in gistPageRecyclable()
+	 * conceptually mirrors the pgxact->xmin > limitXmin test in
+	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
+	 * same exclusion effect on primary and standby.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d881f4cd46a..a8804351bee 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1496,6 +1496,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		at_chain_start;
 	bool		valid;
 	bool		skip;
+	GlobalVisState *vistest = NULL;
 
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
@@ -1506,7 +1507,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	at_chain_start = first_call;
 	skip = !first_call;
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
 	Assert(BufferGetBlockNumber(buffer) == blkno);
 
 	/* Scan through possible multiple members of HOT-chain */
@@ -1595,9 +1597,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		 * Note: if you change the criterion here for what is "dead", fix the
 		 * planner's get_actual_variable_range() function to match.
 		 */
-		if (all_dead && *all_dead &&
-			!HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
-			*all_dead = false;
+		if (all_dead && *all_dead)
+		{
+			if (!vistest)
+				vistest = GlobalVisTestFor(relation);
+
+			if (!HeapTupleIsSurelyDead(vistest, heapTuple))
+				*all_dead = false;
+		}
 
 		/*
 		 * Check to see if HOT chain continues past this tuple; if so fetch
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8f2e5379210..8c0d601e6d9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1202,7 +1202,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
-		OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
+		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
 
 	if (!scan)
 	{
@@ -1243,6 +1243,17 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	hscan = (HeapScanDesc) scan;
 
+	/*
+	 * Must have called GetOldestNonRemovableTransactionId() if using
+	 * SnapshotAny.  Shouldn't have for an MVCC snapshot. (It's especially
+	 * worth checking this for parallel builds, since ambuild routines that
+	 * support parallel builds must work these details out for themselves.)
+	 */
+	Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+		   !TransactionIdIsValid(OldestXmin));
+	Assert(snapshot == SnapshotAny || !anyvisible);
+
 	/* Publish number of blocks to scan */
 	if (progress)
 	{
@@ -1262,17 +1273,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 									 nblocks);
 	}
 
-	/*
-	 * Must call GetOldestXmin() with SnapshotAny.  Should never call
-	 * GetOldestXmin() with MVCC snapshot. (It's especially worth checking
-	 * this for parallel builds, since ambuild routines that support parallel
-	 * builds must work these details out for themselves.)
-	 */
-	Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
-	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
-		   !TransactionIdIsValid(OldestXmin));
-	Assert(snapshot == SnapshotAny || !anyvisible);
-
 	/* set our scan endpoints */
 	if (!allow_sync)
 		heap_setscanlimits(scan, start_blockno, numblocks);
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index dba10890aab..b25b3e429ed 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1154,19 +1154,56 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
  *	we mainly want to know is if a tuple is potentially visible to *any*
  *	running transaction.  If so, it can't be removed yet by VACUUM.
  *
- * OldestXmin is a cutoff XID (obtained from GetOldestXmin()).  Tuples
- * deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
- * still be visible to some open transaction, so we can't remove them,
- * even if we see that the deleting transaction has committed.
+ * OldestXmin is a cutoff XID (obtained from
+ * GetOldestNonRemovableTransactionId()).  Tuples deleted by XIDs >=
+ * OldestXmin are deemed "recently dead"; they might still be visible to some
+ * open transaction, so we can't remove them, even if we see that the deleting
+ * transaction has committed.
  */
 HTSV_Result
 HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 						 Buffer buffer)
+{
+	TransactionId dead_after = InvalidTransactionId;
+	HTSV_Result res;
+
+	res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);
+
+	if (res == HEAPTUPLE_RECENTLY_DEAD)
+	{
+		Assert(TransactionIdIsValid(dead_after));
+
+		if (TransactionIdPrecedes(dead_after, OldestXmin))
+			res = HEAPTUPLE_DEAD;
+	}
+	else
+		Assert(!TransactionIdIsValid(dead_after));
+
+	return res;
+}
+
+/*
+ * Work horse for HeapTupleSatisfiesVacuum and similar routines.
+ *
+ * In contrast to HeapTupleSatisfiesVacuum this routine, when encountering a
+ * tuple that could still be visible to some backend, stores the xid that
+ * needs to be compared with the horizon in *dead_after, and returns
+ * HEAPTUPLE_RECENTLY_DEAD. The caller then can perform the comparison with
+ * the horizon.  This is e.g. useful when comparing with different horizons.
+ *
+ * Note: HEAPTUPLE_DEAD can still be returned here, e.g. if the inserting
+ * transaction aborted.
+ */
+HTSV_Result
+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *dead_after)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
 	Assert(ItemPointerIsValid(&htup->t_self));
 	Assert(htup->t_tableOid != InvalidOid);
+	Assert(dead_after != NULL);
+
+	*dead_after = InvalidTransactionId;
 
 	/*
 	 * Has inserting transaction committed?
@@ -1323,17 +1360,15 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 		else if (TransactionIdDidCommit(xmax))
 		{
 			/*
-			 * The multixact might still be running due to lockers.  If the
-			 * updater is below the xid horizon, we have to return DEAD
-			 * regardless -- otherwise we could end up with a tuple where the
-			 * updater has to be removed due to the horizon, but is not pruned
-			 * away.  It's not a problem to prune that tuple, because any
-			 * remaining lockers will also be present in newer tuple versions.
+			 * The multixact might still be running due to lockers.  Need to
+			 * allow for pruning if below the xid horizon regardless --
+			 * otherwise we could end up with a tuple where the updater has to
+			 * be removed due to the horizon, but is not pruned away.  It's
+			 * not a problem to prune that tuple, because any remaining
+			 * lockers will also be present in newer tuple versions.
 			 */
-			if (!TransactionIdPrecedes(xmax, OldestXmin))
-				return HEAPTUPLE_RECENTLY_DEAD;
-
-			return HEAPTUPLE_DEAD;
+			*dead_after = xmax;
+			return HEAPTUPLE_RECENTLY_DEAD;
 		}
 		else if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple), false))
 		{
@@ -1372,14 +1407,11 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 	}
 
 	/*
-	 * Deleter committed, but perhaps it was recent enough that some open
-	 * transactions could still see the tuple.
+	 * Deleter committed, allow caller to check if it was recent enough that
+	 * some open transactions could still see the tuple.
 	 */
-	if (!TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin))
-		return HEAPTUPLE_RECENTLY_DEAD;
-
-	/* Otherwise, it's dead and removable */
-	return HEAPTUPLE_DEAD;
+	*dead_after = HeapTupleHeaderGetRawXmax(tuple);
+	return HEAPTUPLE_RECENTLY_DEAD;
 }
 
 
@@ -1418,7 +1450,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
  *	if the tuple is removable.
  */
 bool
-HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
+HeapTupleIsSurelyDead(GlobalVisState *vistest, HeapTuple htup)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
@@ -1459,7 +1491,8 @@ HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
 		return false;
 
 	/* Deleter committed, so tuple is dead if the XID is old enough. */
-	return TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin);
+	return GlobalVisTestIsRemovableXid(vistest,
+									   HeapTupleHeaderGetRawXmax(tuple));
 }
 
 /*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 256df4de105..00a3cb106aa 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -23,12 +23,30 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
+#include "utils/snapmgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
 /* Working data for heap_page_prune and subroutines */
 typedef struct
 {
+	Relation	rel;
+
+	/* tuple visibility test, initialized for the relation */
+	GlobalVisState *vistest;
+
+	/*
+	 * Thresholds set by TransactionIdLimitedForOldSnapshots() if they have
+	 * been computed (done on demand, and only if
+	 * OldSnapshotThresholdActive()). The first time a tuple is about to be
+	 * removed based on the limited horizon, old_snap_used is set to true, and
+	 * SetOldSnapshotThresholdTimestamp() is called. See
+	 * heap_prune_satisfies_vacuum().
+	 */
+	TimestampTz old_snap_ts;
+	TransactionId old_snap_xmin;
+	bool		old_snap_used;
+
 	TransactionId new_prune_xid;	/* new prune hint value for page */
 	TransactionId latestRemovedXid; /* latest xid to be removed by this prune */
 	int			nredirected;	/* numbers of entries in arrays below */
@@ -43,9 +61,8 @@ typedef struct
 } PruneState;
 
 /* Local functions */
-static int	heap_prune_chain(Relation relation, Buffer buffer,
+static int	heap_prune_chain(Buffer buffer,
 							 OffsetNumber rootoffnum,
-							 TransactionId OldestXmin,
 							 PruneState *prstate);
 static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid);
 static void heap_prune_record_redirect(PruneState *prstate,
@@ -65,16 +82,16 @@ static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
  * if there's not any use in pruning.
  *
  * Caller must have pin on the buffer, and must *not* have a lock on it.
- *
- * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
- * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
  */
 void
 heap_page_prune_opt(Relation relation, Buffer buffer)
 {
 	Page		page = BufferGetPage(buffer);
+	TransactionId prune_xid;
+	GlobalVisState *vistest;
+	TransactionId limited_xmin = InvalidTransactionId;
+	TimestampTz limited_ts = 0;
 	Size		minfree;
-	TransactionId OldestXmin;
 
 	/*
 	 * We can't write WAL in recovery mode, so there's no point trying to
@@ -85,37 +102,55 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		return;
 
 	/*
-	 * Use the appropriate xmin horizon for this relation. If it's a proper
-	 * catalog relation or a user defined, additional, catalog relation, we
-	 * need to use the horizon that includes slots, otherwise the data-only
-	 * horizon can be used. Note that the toast relation of user defined
-	 * relations are *not* considered catalog relations.
+	 * XXX: Magic to make old_snapshot_threshold tests appear "working". They
+	 * currently are broken, and discussion of what to do about them is
+	 * ongoing. See
+	 * https://www.postgresql.org/message-id/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de
+	 */
+	if (old_snapshot_threshold == 0)
+		SnapshotTooOldMagicForTest();
+
+	/*
+	 * First check whether there's any chance there's something to prune;
+	 * determining the appropriate horizon is a waste if there's no prune_xid
+	 * (i.e. no updates/deletes left potentially dead tuples around).
+	 */
+	prune_xid = ((PageHeader) page)->pd_prune_xid;
+	if (!TransactionIdIsValid(prune_xid))
+		return;
+
+	/*
+	 * Check whether prune_xid indicates that there may be dead rows that can
+	 * be cleaned up.
 	 *
-	 * It is OK to apply the old snapshot limit before acquiring the cleanup
+	 * It is OK to check the old snapshot limit before acquiring the cleanup
 	 * lock because the worst that can happen is that we are not quite as
 	 * aggressive about the cleanup (by however many transaction IDs are
 	 * consumed between this point and acquiring the lock).  This allows us to
 	 * save significant overhead in the case where the page is found not to be
 	 * prunable.
-	 */
-	if (IsCatalogRelation(relation) ||
-		RelationIsAccessibleInLogicalDecoding(relation))
-		OldestXmin = RecentGlobalXmin;
-	else
-		OldestXmin =
-			TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
-												relation);
-
-	Assert(TransactionIdIsValid(OldestXmin));
-
-	/*
-	 * Let's see if we really need pruning.
 	 *
-	 * Forget it if page is not hinted to contain something prunable that's
-	 * older than OldestXmin.
+	 * Even if old_snapshot_threshold is set, we first check whether the page
+	 * can be pruned without it, both because
+	 * TransactionIdLimitedForOldSnapshots() is not cheap, and because not
+	 * unnecessarily relying on old_snapshot_threshold avoids causing
+	 * conflicts.
 	 */
-	if (!PageIsPrunable(page, OldestXmin))
-		return;
+	vistest = GlobalVisTestFor(relation);
+
+	if (!GlobalVisTestIsRemovableXid(vistest, prune_xid))
+	{
+		if (!OldSnapshotThresholdActive())
+			return;
+
+		if (!TransactionIdLimitedForOldSnapshots(GlobalVisTestNonRemovableHorizon(vistest),
+												 relation,
+												 &limited_xmin, &limited_ts))
+			return;
+
+		if (!TransactionIdPrecedes(prune_xid, limited_xmin))
+			return;
+	}
 
 	/*
 	 * We prune when a previous UPDATE failed to find enough space on the page
@@ -151,7 +186,9 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 															 * needed */
 
 			/* OK to prune */
-			(void) heap_page_prune(relation, buffer, OldestXmin, true, &ignore);
+			(void) heap_page_prune(relation, buffer, vistest,
+								   limited_xmin, limited_ts,
+								   true, &ignore);
 		}
 
 		/* And release buffer lock */
@@ -165,8 +202,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * Caller must have pin and buffer cleanup lock on the page.
  *
- * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
- * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ * vistest is used to distinguish whether tuples are DEAD or RECENTLY_DEAD
+ * (see heap_prune_satisfies_vacuum and
+ * HeapTupleSatisfiesVacuum). old_snap_xmin / old_snap_ts need to
+ * either have been set by TransactionIdLimitedForOldSnapshots, or be
+ * InvalidTransactionId/0 respectively.
  *
  * If report_stats is true then we send the number of reclaimed heap-only
  * tuples to pgstats.  (This must be false during vacuum, since vacuum will
@@ -177,7 +217,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * latestRemovedXid.
  */
 int
-heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
+heap_page_prune(Relation relation, Buffer buffer,
+				GlobalVisState *vistest,
+				TransactionId old_snap_xmin,
+				TimestampTz old_snap_ts,
 				bool report_stats, TransactionId *latestRemovedXid)
 {
 	int			ndeleted = 0;
@@ -198,6 +241,11 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 	 * initialize the rest of our working state.
 	 */
 	prstate.new_prune_xid = InvalidTransactionId;
+	prstate.rel = relation;
+	prstate.vistest = vistest;
+	prstate.old_snap_xmin = old_snap_xmin;
+	prstate.old_snap_ts = old_snap_ts;
+	prstate.old_snap_used = false;
 	prstate.latestRemovedXid = *latestRemovedXid;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
@@ -220,9 +268,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 			continue;
 
 		/* Process this item or chain of items */
-		ndeleted += heap_prune_chain(relation, buffer, offnum,
-									 OldestXmin,
-									 &prstate);
+		ndeleted += heap_prune_chain(buffer, offnum, &prstate);
 	}
 
 	/* Any error while applying the changes is critical */
@@ -323,6 +369,85 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 }
 
 
+/*
+ * Perform visibility checks for heap pruning.
+ *
+ * This is more complicated than just using GlobalVisTestIsRemovableXid()
+ * because of old_snapshot_threshold. We only want to increase the threshold
+ * that triggers errors for old snapshots when we actually decide to remove a
+ * row based on the limited horizon.
+ *
+ * Due to its cost we also only want to call
+ * TransactionIdLimitedForOldSnapshots() if necessary, i.e. we might not have
+ * done so in heap_page_prune_opt() if pd_prune_xid was old enough. But we
+ * still want to be able to remove rows that are too new to be removed
+ * according to prstate->vistest, but that can be removed based on
+ * old_snapshot_threshold. So we call TransactionIdLimitedForOldSnapshots() on
+ * demand in here, if appropriate.
+ */
+static HTSV_Result
+heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
+{
+	HTSV_Result res;
+	TransactionId dead_after;
+
+	res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);
+
+	if (res != HEAPTUPLE_RECENTLY_DEAD)
+		return res;
+
+	/*
+	 * If we are already relying on the limited xmin, there is no need to
+	 * delay doing so anymore.
+	 */
+	if (prstate->old_snap_used)
+	{
+		Assert(TransactionIdIsValid(prstate->old_snap_xmin));
+
+		if (TransactionIdPrecedes(dead_after, prstate->old_snap_xmin))
+			res = HEAPTUPLE_DEAD;
+		return res;
+	}
+
+	/*
+	 * First check if GlobalVisTestIsRemovableXid() is sufficient to find the
+	 * row dead. If not, and old_snapshot_threshold is enabled, try to use the
+	 * lowered horizon.
+	 */
+	if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
+		res = HEAPTUPLE_DEAD;
+	else if (OldSnapshotThresholdActive())
+	{
+		/* haven't determined the limited horizon yet, request it */
+		if (!TransactionIdIsValid(prstate->old_snap_xmin))
+		{
+			TransactionId horizon =
+			GlobalVisTestNonRemovableHorizon(prstate->vistest);
+
+			TransactionIdLimitedForOldSnapshots(horizon, prstate->rel,
+												&prstate->old_snap_xmin,
+												&prstate->old_snap_ts);
+		}
+
+		if (TransactionIdIsValid(prstate->old_snap_xmin) &&
+			TransactionIdPrecedes(dead_after, prstate->old_snap_xmin))
+		{
+			/*
+			 * About to remove row based on snapshot_too_old. Need to raise
+			 * the threshold so problematic accesses would error.
+			 */
+			Assert(!prstate->old_snap_used);
+			SetOldSnapshotThresholdTimestamp(prstate->old_snap_ts,
+											 prstate->old_snap_xmin);
+			prstate->old_snap_used = true;
+			res = HEAPTUPLE_DEAD;
+		}
+	}
+
+	return res;
+}
+
+
 /*
  * Prune specified line pointer or a HOT chain originating at line pointer.
  *
@@ -349,9 +474,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  * Returns the number of tuples (to be) deleted from the page.
  */
 static int
-heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
-				 TransactionId OldestXmin,
-				 PruneState *prstate)
+heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 {
 	int			ndeleted = 0;
 	Page		dp = (Page) BufferGetPage(buffer);
@@ -366,7 +489,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 				i;
 	HeapTupleData tup;
 
-	tup.t_tableOid = RelationGetRelid(relation);
+	tup.t_tableOid = RelationGetRelid(prstate->rel);
 
 	rootlp = PageGetItemId(dp, rootoffnum);
 
@@ -401,7 +524,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 			 * either here or while following a chain below.  Whichever path
 			 * gets there first will mark the tuple unused.
 			 */
-			if (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer)
+			if (heap_prune_satisfies_vacuum(prstate, &tup, buffer)
 				== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
 			{
 				heap_prune_record_unused(prstate, rootoffnum);
@@ -485,7 +608,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 		 */
 		tupdead = recent_dead = false;
 
-		switch (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer))
+		switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer))
 		{
 			case HEAPTUPLE_DEAD:
 				tupdead = true;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bbc4598f75..44e2224dd55 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -788,6 +788,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	GlobalVisState *vistest;
 
 	pg_rusage_init(&ru0);
 
@@ -816,6 +817,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->nonempty_pages = 0;
 	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
+	vistest = GlobalVisTestFor(onerel);
+
 	/*
 	 * Initialize state for a parallel vacuum.  As of now, only one worker can
 	 * be used for an index, so we invoke parallelism only if there are at
@@ -1239,7 +1242,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 *
 		 * We count tuples removed by the pruning step as removed by VACUUM.
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
+		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
+										 InvalidTransactionId, 0, false,
 										 &vacrelstats->latestRemovedXid);
 
 		/*
@@ -1596,14 +1600,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		}
 
 		/*
-		 * It's possible for the value returned by GetOldestXmin() to move
-		 * backwards, so it's not wrong for us to see tuples that appear to
-		 * not be visible to everyone yet, while PD_ALL_VISIBLE is already
-		 * set. The real safe xmin value never moves backwards, but
-		 * GetOldestXmin() is conservative and sometimes returns a value
-		 * that's unnecessarily small, so if we see that contradiction it just
-		 * means that the tuples that we think are not visible to everyone yet
-		 * actually are, and the PD_ALL_VISIBLE flag is correct.
+		 * It's possible for the value returned by
+		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+		 * wrong for us to see tuples that appear to not be visible to
+		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
+		 * xmin value never moves backwards, but
+		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
+		 * returns a value that's unnecessarily small, so if we see that
+		 * contradiction it just means that the tuples that we think are not
+		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
+		 * is correct.
 		 *
 		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
 		 * set, however.
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 6b9750c244a..3fb8688f8f4 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -519,7 +519,8 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amgettuple);
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
 
 	/*
 	 * The AM's amgettuple proc finds the next index entry matching the scan
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 32ad9e339a2..cf3dba96008 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -342,9 +342,9 @@ snapshots and registered snapshots as of the deletion are gone; which is
 overly strong, but is simple to implement within Postgres.  When marked
 dead, a deleted page is labeled with the next-transaction counter value.
 VACUUM can reclaim the page for re-use when this transaction number is
-older than RecentGlobalXmin.  As collateral damage, this implementation
-also waits for running XIDs with no snapshots and for snapshots taken
-until the next transaction to allocate an XID commits.
+guaranteed to be "visible to everyone".  As collateral damage, this
+implementation also waits for running XIDs with no snapshots and for
+snapshots taken until the next transaction to allocate an XID commits.
 
 Reclaiming a page doesn't actually change its state on disk --- we simply
 record it in the shared-memory free space map, from which it will be
@@ -411,8 +411,8 @@ page and also the correct place to hold the current value. We can avoid
 the cost of walking down the tree in such common cases.
 
 The optimization works on the assumption that there can only be one
-non-ignorable leaf rightmost page, and so even a RecentGlobalXmin style
-interlock isn't required.  We cannot fail to detect that our hint was
+non-ignorable leaf rightmost page, and so not even a visible-to-everyone
+style interlock is required.  We cannot fail to detect that our hint was
 invalidated, because there can only be one such page in the B-Tree at
 any time. It's possible that the page will be deleted and recycled
 without a backend's cached page also being detected as invalidated, but
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 70bac0052fc..d18b2722693 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1097,7 +1097,7 @@ _bt_page_recyclable(Page page)
 	 */
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	if (P_ISDELETED(opaque) &&
-		TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
+		GlobalVisCheckRemovableXid(NULL, opaque->btpo.xact))
 		return true;
 	return false;
 }
@@ -2316,7 +2316,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * updated links to the target, ReadNewTransactionId() suffices as an
 	 * upper bound.  Any scan having retained a now-stale link is advertising
 	 * in its PGXACT an xmin less than or equal to the value we read here.  It
-	 * will continue to do so, holding back RecentGlobalXmin, for the duration
+	 * will continue to do so, holding back the xmin horizon, for the duration
 	 * of that scan.
 	 */
 	page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index d65f4357cc8..922774bcf3b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -807,6 +807,12 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
+	/*
+	 * XXX: If IndexVacuumInfo contained the heap relation, we could be more
+	 * aggressive about vacuuming non-catalog relations by passing the table
+	 * to GlobalVisCheckRemovableXid().
+	 */
+
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
@@ -816,13 +822,12 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		result = true;
 	}
 	else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
-			 TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
-								   RecentGlobalXmin))
+			 GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
 	{
 		/*
 		 * If any oldest btpo.xact from a previously deleted page in the index
-		 * is older than RecentGlobalXmin, then at least one deleted page can
-		 * be recycled -- don't skip cleanup.
+		 * is visible to everyone, then at least one deleted page can be
+		 * recycled -- don't skip cleanup.
 		 */
 		result = true;
 	}
@@ -1275,14 +1280,13 @@ backtrack:
 				 * own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
-				 * before it finishes could have a RecentGlobalXmin with a
-				 * later xid than the VACUUM's OldestXmin cutoff.  These
-				 * backends might happen to opportunistically mark some index
-				 * tuples LP_DEAD before we reach them, even though they may
-				 * be after our cutoff.  We don't try to kill these "extra"
-				 * index tuples in _bt_delitems_vacuum().  This keep things
-				 * simple, and allows us to always avoid generating our own
-				 * conflicts.
+				 * before it finishes could have a visibility cutoff with a
+				 * later xid than VACUUM's OldestXmin cutoff.  These backends
+				 * might happen to opportunistically mark some index tuples
+				 * LP_DEAD before we reach them, even though they may be after
+				 * our cutoff.  We don't try to kill these "extra" index
+				 * tuples in _bt_delitems_vacuum().  This keeps things simple,
+				 * and allows us to always avoid generating our own conflicts.
 				 */
 				Assert(!BTreeTupleIsPivot(itup));
 				if (!BTreeTupleIsPosting(itup))
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 5d346da84fd..b097e98c3ba 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -928,11 +928,11 @@ btree_xlog_reuse_page(XLogReaderState *record)
 	 * Btree reuse_page records exist to provide a conflict point when we
 	 * reuse pages in the index via the FSM.  That's all they do though.
 	 *
-	 * latestRemovedXid was the page's btpo.xact.  The btpo.xact <
-	 * RecentGlobalXmin test in _bt_page_recyclable() conceptually mirrors the
-	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
-	 * Consequently, one XID value achieves the same exclusion effect on
-	 * primary and standby.
+	 * latestRemovedXid was the page's btpo.xact.  The
+	 * GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
+	 * mirrors the pgxact->xmin > limitXmin test in
+	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
+	 * same exclusion effect on primary and standby.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index bd98707f3c0..e1c58933f97 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -501,10 +501,14 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemToPlaceholder[MaxIndexTuplesPerPage];
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
+	GlobalVisState *vistest;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
 
+	/* XXX: providing heap relation would allow more pruning */
+	vistest = GlobalVisTestFor(NULL);
+
 	START_CRIT_SECTION();
 
 	/*
@@ -521,7 +525,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 		dt = (SpGistDeadTuple) PageGetItem(page, PageGetItemId(page, i));
 
 		if (dt->tupstate == SPGIST_REDIRECT &&
-			TransactionIdPrecedes(dt->xid, RecentGlobalXmin))
+			GlobalVisTestIsRemovableXid(vistest, dt->xid))
 		{
 			dt->tupstate = SPGIST_PLACEHOLDER;
 			Assert(opaque->nRedirection > 0);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index c06e52f9cd0..4e2178dabab 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -293,42 +293,50 @@ once, rather than assume they can read it multiple times and get the same
 answer each time.  (Use volatile-qualified pointers when doing this, to
 ensure that the C compiler does exactly what you tell it to.)
 
-Another important activity that uses the shared ProcArray is GetOldestXmin,
-which must determine a lower bound for the oldest xmin of any active MVCC
-snapshot, system-wide.  Each individual backend advertises the smallest
-xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
-live snapshots (eg, if it's between transactions or hasn't yet set a
-snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
-valid xmin fields.  It does this with only shared lock on ProcArrayLock,
-which means there is a potential race condition against other backends
-doing GetSnapshotData concurrently: we must be certain that a concurrent
-backend that is about to set its xmin does not compute an xmin less than
-what GetOldestXmin returns.  We ensure that by including all the active
-XIDs into the MIN() calculation, along with the valid xmins.  The rule that
-transactions can't exit without taking exclusive ProcArrayLock ensures that
-concurrent holders of shared ProcArrayLock will compute the same minimum of
-currently-active XIDs: no xact, in particular not the oldest, can exit
-while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
-active XID will be the same as that of any concurrent GetSnapshotData, and
-so it can't produce an overestimate.  If there is no active transaction at
-all, GetOldestXmin returns latestCompletedFullXid + 1, which is a lower bound
-for the xmin that might be computed by concurrent or later GetSnapshotData
-calls.  (We know that no XID less than this could be about to appear in
-the ProcArray, because of the XidGenLock interlock discussed above.)
+Another important activity that uses the shared ProcArray is
+ComputeXidHorizons, which must determine a lower bound for the oldest xmin
+of any active MVCC snapshot, system-wide.  Each individual backend
+advertises the smallest xmin of its own snapshots in MyPgXact->xmin, or zero
+if it currently has no live snapshots (eg, if it's between transactions or
+hasn't yet set a snapshot for a new transaction).  ComputeXidHorizons takes
+the MIN() of the valid xmin fields.  It does this with only shared lock on
+ProcArrayLock, which means there is a potential race condition against other
+backends doing GetSnapshotData concurrently: we must be certain that a
+concurrent backend that is about to set its xmin does not compute an xmin
+less than what ComputeXidHorizons determines.  We ensure that by including
+all the active XIDs into the MIN() calculation, along with the valid xmins.
+The rule that transactions can't exit without taking exclusive ProcArrayLock
+ensures that concurrent holders of shared ProcArrayLock will compute the
+same minimum of currently-active XIDs: no xact, in particular not the
+oldest, can exit while we hold shared ProcArrayLock.  So
+ComputeXidHorizons's view of the minimum active XID will be the same as that
+of any concurrent GetSnapshotData, and so it can't produce an overestimate.
+If there is no active transaction at all, ComputeXidHorizons uses
+latestCompletedFullXid + 1, which is a lower bound for the xmin that might
+be computed by concurrent or later GetSnapshotData calls.  (We know that no
+XID less than this could be about to appear in the ProcArray, because of the
+XidGenLock interlock discussed above.)
 
-GetSnapshotData also performs an oldest-xmin calculation (which had better
-match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
-for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
-too expensive.  Note that while it is certain that two concurrent
-executions of GetSnapshotData will compute the same xmin for their own
-snapshots, as argued above, it is not certain that they will arrive at the
-same estimate of RecentGlobalXmin.  This is because we allow XID-less
-transactions to clear their MyPgXact->xmin asynchronously (without taking
-ProcArrayLock), so one execution might see what had been the oldest xmin,
-and another not.  This is OK since RecentGlobalXmin need only be a valid
-lower bound.  As noted above, we are already assuming that fetch/store
-of the xid fields is atomic, so assuming it for xmin as well is no extra
-risk.
+As GetSnapshotData is performance critical, it does not perform an accurate
+oldest-xmin calculation (it used to, until v13). The contents of a snapshot
+only depend on the xids of other backends, not their xmin. As a backend's
+xmin changes much more often than its xid, having GetSnapshotData look at
+xmins can lead to a lot of unnecessary cacheline ping-pong.  Instead
+GetSnapshotData updates approximate thresholds (one that guarantees that all
+deleted rows older than it can be removed, another determining that deleted
+rows newer than it can not be removed). GlobalVisTest* uses those thresholds
+to make invisibility decisions, falling back to ComputeXidHorizons if
+necessary.
+
+Note that while it is certain that two concurrent executions of
+GetSnapshotData will compute the same xmin for their own snapshots, there is
+no such guarantee for the horizons computed by ComputeXidHorizons.  This is
+because we allow XID-less transactions to clear their MyPgXact->xmin
+asynchronously (without taking ProcArrayLock), so one execution might see
+what had been the oldest xmin, and another not.  This is OK since the
+thresholds need only be a valid lower bound.  As noted above, we are already
+assuming that fetch/store of the xid fields is atomic, so assuming it for
+xmin as well is no extra risk.
 
 
 pg_xact and pg_subtrans
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e1a043763cf..4420d59f26d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9100,7 +9100,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9460,7 +9460,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c816..34b71b6c1c5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1056,7 +1056,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
 
 	/* Prepare for sampling block numbers */
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 576c7e63e99..22228f5684f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -955,8 +955,25 @@ vacuum_set_xid_limits(Relation rel,
 	 * working on a particular table at any time, and that each vacuum is
 	 * always an independent transaction.
 	 */
-	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+	*oldestXmin = GetOldestNonRemovableTransactionId(rel);
+
+	if (OldSnapshotThresholdActive())
+	{
+		TransactionId limit_xmin;
+		TimestampTz limit_ts;
+
+		if (TransactionIdLimitedForOldSnapshots(*oldestXmin, rel, &limit_xmin, &limit_ts))
+		{
+			/*
+			 * TODO: We should only set the threshold if we are pruning on the
+			 * basis of the increased limits. Not as crucial here as it is for
+			 * opportunistic pruning (which often happens at a much higher
+			 * frequency), but would still be a significant improvement.
+			 */
+			SetOldSnapshotThresholdTimestamp(limit_ts, limit_xmin);
+			*oldestXmin = limit_xmin;
+		}
+	}
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -1345,12 +1362,13 @@ vac_update_datfrozenxid(void)
 	bool		dirty = false;
 
 	/*
-	 * Initialize the "min" calculation with GetOldestXmin, which is a
-	 * reasonable approximation to the minimum relfrozenxid for not-yet-
-	 * committed pg_class entries for new tables; see AddNewRelationTuple().
-	 * So we cannot produce a wrong minimum by starting with this.
+	 * Initialize the "min" calculation with
+	 * GetOldestNonRemovableTransactionId(), which is a reasonable
+	 * approximation to the minimum relfrozenxid for not-yet-committed
+	 * pg_class entries for new tables; see AddNewRelationTuple().  So we
+	 * cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+	newFrozenXid = GetOldestNonRemovableTransactionId(NULL);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
@@ -1681,8 +1699,9 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	StartTransactionCommand();
 
 	/*
-	 * Functions in indexes may want a snapshot set.  Also, setting a snapshot
-	 * ensures that RecentGlobalXmin is kept truly recent.
+	 * Need to acquire a snapshot to prevent pg_subtrans from being truncated,
+	 * cutoff xids in local memory wrapping around, and to have updated xmin
+	 * horizons.
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 
@@ -1705,8 +1724,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 *
 		 * Note: these flags remain set until CommitTransaction or
 		 * AbortTransaction.  We don't want to clear them until we reset
-		 * MyPgXact->xid/xmin, else OldestXmin might appear to go backwards,
-		 * which is probably Not Good.
+		 * MyPgXact->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
+		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60e..ac97e28be19 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1877,6 +1877,10 @@ get_database_list(void)
 	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 	 * for anything that reads heap pages, because HOT may decide to prune
 	 * them even if the process doesn't attempt to modify any tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	StartTransactionCommand();
 	(void) GetTransactionSnapshot();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index ff985b9b24c..bdaf0312d63 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -122,6 +122,10 @@ get_subscription_list(void)
 	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 	 * for anything that reads heap pages, because HOT may decide to prune
 	 * them even if the process doesn't attempt to modify any tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	StartTransactionCommand();
 	(void) GetTransactionSnapshot();
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d5a9b568a68..7c11e1ab44c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1181,22 +1181,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 */
 	if (hot_standby_feedback)
 	{
-		TransactionId slot_xmin;
-
-		/*
-		 * Usually GetOldestXmin() would include both global replication slot
-		 * xmin and catalog_xmin in its calculations, but we want to derive
-		 * separate values for each of those. So we ask for an xmin that
-		 * excludes the catalog_xmin.
-		 */
-		xmin = GetOldestXmin(NULL,
-							 PROCARRAY_FLAGS_DEFAULT | PROCARRAY_SLOTS_XMIN);
-
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
-
-		if (TransactionIdIsValid(slot_xmin) &&
-			TransactionIdPrecedes(slot_xmin, xmin))
-			xmin = slot_xmin;
+		GetReplicationHorizons(&xmin, &catalog_xmin);
 	}
 	else
 	{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5e2210dd7bd..fd370d52b66 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2116,9 +2116,10 @@ ProcessStandbyHSFeedbackMessage(void)
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
-	 * the xmin will be taken into account by GetOldestXmin.  This will hold
-	 * back the removal of dead rows and thereby prevent the generation of
-	 * cleanup conflicts on the standby server.
+	 * the xmin will be taken into account by GetSnapshotData() /
+	 * ComputeXidHorizons().  This will hold back the removal of dead rows and
+	 * thereby prevent the generation of cleanup conflicts on the standby
+	 * server.
 	 *
 	 * There is a small window for a race condition here: although we just
 	 * checked that feedbackXmin precedes nextXid, the nextXid could have
@@ -2131,10 +2132,10 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * own xmin would prevent nextXid from advancing so far.
 	 *
 	 * We don't bother taking the ProcArrayLock here.  Setting the xmin field
-	 * is assumed atomic, and there's no real need to prevent a concurrent
-	 * GetOldestXmin.  (If we're moving our xmin forward, this is obviously
-	 * safe, and if we're moving it backwards, well, the data is at risk
-	 * already since a VACUUM could have just finished calling GetOldestXmin.)
+	 * is assumed atomic, and there's no real need to prevent concurrent
+	 * horizon determinations.  (If we're moving our xmin forward, this is
+	 * obviously safe, and if we're moving it backwards, well, the data is at
+	 * risk already since a VACUUM could already have determined the horizon.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
 	 * otherwise via the walsender's PGXACT entry. We can only track the
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 82798760752..20115e2f63f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -99,6 +99,142 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+/*
+ * State for the GlobalVisTest* family of functions. Those functions can
+ * e.g. be used to decide if a deleted row can be removed without violating
+ * MVCC semantics: If the deleted row's xmax is not considered to be running
+ * by anyone, the row can be removed.
+ *
+ * To avoid slowing down GetSnapshotData(), we don't calculate a precise
+ * cutoff XID while building a snapshot (looking at the frequently changing
+ * xmins scales badly). Instead we compute two boundaries while building the
+ * snapshot:
+ *
+ * 1) definitely_needed, indicating that rows deleted by XIDs >=
+ *    definitely_needed are definitely still visible.
+ *
+ * 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
+ *    definitely be removed.
+ *
+ * When testing an XID that falls in between the two (i.e. XID >= maybe_needed
+ * && XID < definitely_needed), the boundaries can be recomputed (using
+ * ComputeXidHorizons()) to get a more accurate answer. This is cheaper than
+ * maintaining an accurate value all the time.
+ *
+ * As it is not cheap to compute accurate boundaries, we limit the number of
+ * times that happens in short succession. See GlobalVisTestShouldUpdate().
+ *
+ *
+ * There are three backend lifetime instances of this struct, optimized for
+ * different types of relations. As e.g. a normal user defined table in one
+ * database is inaccessible to backends connected to another database, a test
+ * specific to a relation can be more aggressive than a test for a shared
+ * relation.  Currently we track three different states:
+ *
+ * 1) GlobalVisSharedRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in any database, nor a
+ *    replication slot's xmin, nor a replication slot's catalog_xmin might
+ *    still consider XID as running.
+ *
+ * 2) GlobalVisCatalogRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in the current
+ *    database, nor a replication slot's xmin, nor a replication slot's
+ *    catalog_xmin might still consider XID as running.
+ *
+ *    I.e. the difference to GlobalVisSharedRels is that
+ *    snapshots in other databases are ignored.
+ *
+ * 3) GlobalVisDataRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in the current
+ *    database, nor a replication slot's xmin consider XID as running.
+ *
+ *    I.e. the difference to GlobalVisCatalogRels is that
+ *    replication slot's catalog_xmin is not taken into account.
+ *
+ * GlobalVisTestFor(relation) returns the appropriate state
+ * for the relation.
+ *
+ * The boundaries are FullTransactionIds instead of TransactionIds to avoid
+ * wraparound dangers. Otherwise there would e.g. exist no procarray state
+ * to prevent maybe_needed from becoming old enough to wrap around after the
+ * GetSnapshotData() call.
+ *
+ * The typedef is in the header.
+ */
+struct GlobalVisState
+{
+	/* XIDs >= are considered running by some backend */
+	FullTransactionId definitely_needed;
+
+	/* XIDs < are not considered to be running by any backend */
+	FullTransactionId maybe_needed;
+};
+
+/*
+ * Result of ComputeXidHorizons().
+ */
+typedef struct ComputeXidHorizonsResult
+{
+	/*
+	 * The value of ShmemVariableCache->latestCompletedFullXid when
+	 * ComputeXidHorizons() held ProcArrayLock.
+	 */
+	FullTransactionId latest_completed;
+
+	/*
+	 * The same for procArray->replication_slot_xmin and
+	 * procArray->replication_slot_catalog_xmin.
+	 */
+	TransactionId slot_xmin;
+	TransactionId slot_catalog_xmin;
+
+	/*
+	 * Oldest xid that any backend might still consider running. This needs to
+	 * include processes running VACUUM, in contrast to the normal visibility
+	 * cutoffs, as vacuum needs to be able to perform pg_subtrans lookups when
+	 * determining visibility, but doesn't care about rows above its xmin
+	 * being removed.
+	 *
+	 * This likely should only be needed to determine whether pg_subtrans can
+	 * be truncated. It currently includes the effects of replication slots,
+	 * for historical reasons. But that could likely be changed.
+	 */
+	TransactionId oldest_considered_running;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in shared
+	 * tables.
+	 *
+	 * This includes the effects of replication slots. If that's not desired,
+	 * look at shared_oldest_nonremovable_raw.
+	 */
+	TransactionId shared_oldest_nonremovable;
+
+	/*
+	 * Oldest xid that may be necessary to retain in shared tables. This is
+	 * the same as shared_oldest_nonremovable, except that it is not affected
+	 * by replication slot's catalog_xmin.
+	 *
+	 * This is mainly useful to be able to send the catalog_xmin to upstream
+	 * streaming replication servers via hot_standby_feedback, so they can
+	 * apply the limit only when accessing catalog tables.
+	 */
+	TransactionId shared_oldest_nonremovable_raw;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in non-shared
+	 * catalog tables.
+	 */
+	TransactionId catalog_oldest_nonremovable;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in normal user
+	 * defined tables.
+	 */
+	TransactionId data_oldest_nonremovable;
+} ComputeXidHorizonsResult;
+
+
 static ProcArrayStruct *procArray;
 
 static PGPROC *allProcs;
@@ -118,6 +254,22 @@ static TransactionId latestObservedXid = InvalidTransactionId;
  */
 static TransactionId standbySnapshotPendingXmin;
 
+/*
+ * State for visibility checks on different types of relations. See struct
+ * GlobalVisState for details. As shared, catalog, and user defined
+ * relations can have different horizons, one such state exists for each.
+ */
+static GlobalVisState GlobalVisSharedRels;
+static GlobalVisState GlobalVisCatalogRels;
+static GlobalVisState GlobalVisDataRels;
+
+/*
+ * This backend's RecentXmin at the last time the accurate xmin horizon was
+ * recomputed, or InvalidTransactionId if it has not. Used to limit how many
+ * times accurate horizons are recomputed. See GlobalVisTestShouldUpdate().
+ */
+static TransactionId ComputeXidHorizonsResultLastXmin;
+
 #ifdef XIDCACHE_DEBUG
 
 /* counters for XidCache measurement */
@@ -179,6 +331,7 @@ static void MaintainLatestCompletedXid(TransactionId latestXid);
 static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
 
 static inline FullTransactionId FullXidViaRelative(FullTransactionId rel, TransactionId xid);
+static void GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons);
 
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
@@ -1299,159 +1452,191 @@ TransactionIdIsActive(TransactionId xid)
 
 
 /*
- * GetOldestXmin -- returns oldest transaction that was running
- *					when any current transaction was started.
+ * Determine XID horizons.
  *
- * If rel is NULL or a shared relation, all backends are considered, otherwise
- * only backends running in this database are considered.
+ * This is used by wrapper functions like GetOldestNonRemovableTransactionId()
+ * (for VACUUM), GetReplicationHorizons() (for hot_standby_feedback), etc as
+ * well as "internally" by GlobalVisUpdate() (see comment above struct
+ * GlobalVisState).
  *
- * The flags are used to ignore the backends in calculation when any of the
- * corresponding flags is set. Typically, if you want to ignore ones with
- * PROC_IN_VACUUM flag, you can use PROCARRAY_FLAGS_VACUUM.
+ * See the definition of ComputeXidHorizonsResult for the various computed
+ * horizons.
  *
- * PROCARRAY_SLOTS_XMIN causes GetOldestXmin to ignore the xmin and
- * catalog_xmin of any replication slots that exist in the system when
- * calculating the oldest xmin.
+ * For VACUUM separate horizons (used to decide which deleted tuples must
+ * be preserved) for shared and non-shared tables are computed.  For shared
+ * relations backends in all databases must be considered, but for non-shared
+ * relations that's not required, since only backends in my own database could
+ * ever see the tuples in them. Also, we can ignore concurrently running lazy
+ * VACUUMs because (a) they must be working on other tables, and (b) they
+ * don't need to do snapshot-based lookups.
  *
- * This is used by VACUUM to decide which deleted tuples must be preserved in
- * the passed in table. For shared relations backends in all databases must be
- * considered, but for non-shared relations that's not required, since only
- * backends in my own database could ever see the tuples in them. Also, we can
- * ignore concurrently running lazy VACUUMs because (a) they must be working
- * on other tables, and (b) they don't need to do snapshot-based lookups.
- *
- * This is also used to determine where to truncate pg_subtrans.  For that
- * backends in all databases have to be considered, so rel = NULL has to be
- * passed in.
+ * This also computes a horizon used to truncate pg_subtrans. For that
+ * backends in all databases have to be considered, and concurrently running
+ * lazy VACUUMs cannot be ignored, as they still may perform pg_subtrans
+ * accesses.
  *
  * Note: we include all currently running xids in the set of considered xids.
  * This ensures that if a just-started xact has not yet set its snapshot,
  * when it does set the snapshot it cannot set xmin less than what we compute.
  * See notes in src/backend/access/transam/README.
  *
- * Note: despite the above, it's possible for the calculated value to move
- * backwards on repeated calls. The calculated value is conservative, so that
- * anything older is definitely not considered as running by anyone anymore,
- * but the exact value calculated depends on a number of things. For example,
- * if rel = NULL and there are no transactions running in the current
- * database, GetOldestXmin() returns latestCompletedFullXid. If a transaction
+ * Note: despite the above, it's possible for the calculated values to move
+ * backwards on repeated calls. The calculated values are conservative, so
+ * that anything older is definitely not considered as running by anyone
+ * anymore, but the exact values calculated depend on a number of things. For
+ * example, if there are no transactions running in the current database, the
+ * horizon for normal tables will be latestCompletedFullXid. If a transaction
  * begins after that, its xmin will include in-progress transactions in other
  * databases that started earlier, so another call will return a lower value.
  * Nonetheless it is safe to vacuum a table in the current database with the
  * first result.  There are also replication-related effects: a walsender
  * process can set its xmin based on transactions that are no longer running
  * on the primary but are still being replayed on the standby, thus possibly
- * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * unless the standby uses a replication slot to make its xmin persistent
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
- * The return value is also adjusted with vacuum_defer_cleanup_age, so
- * increasing that setting on the fly is another easy way to make
- * GetOldestXmin() move backwards, with no consequences for data integrity.
+ * making the values go backwards.  In this case there is a possibility that
+ * we lose data that the standby would like to have, but unless the standby
+ * uses a replication slot to make its xmin persistent there is little we can
+ * do about that --- data is only protected if the walsender runs continuously
+ * while queries are executed on the standby.  (The Hot Standby code deals
+ * with such cases by failing standby queries that needed to access
+ * already-removed data, so there's no integrity bug.)  The computed values
+ * are also adjusted with vacuum_defer_cleanup_age, so increasing that setting
+ * on the fly is another easy way to make horizons move backwards, with no
+ * consequences for data integrity.
+ *
+ * Note: the approximate horizons (see definition of GlobalVisState) are
+ * updated by the computations done here. That's currently required for
+ * correctness and a small optimization. Without doing so it's possible that
+ * heap vacuum's call to heap_page_prune() uses a more conservative horizon
+ * than later when deciding which tuples can be removed - which the code
+ * doesn't expect (breaking HOT).
  */
-TransactionId
-GetOldestXmin(Relation rel, int flags)
+static void
+ComputeXidHorizons(ComputeXidHorizonsResult *h)
 {
 	ProcArrayStruct *arrayP = procArray;
-	TransactionId result;
-	int			index;
-	bool		allDbs;
+	TransactionId kaxmin;
+	bool		in_recovery = RecoveryInProgress();
 
-	TransactionId replication_slot_xmin = InvalidTransactionId;
-	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
-
-	/*
-	 * If we're not computing a relation specific limit, or if a shared
-	 * relation has been passed in, backends in all databases have to be
-	 * considered.
-	 */
-	allDbs = rel == NULL || rel->rd_rel->relisshared;
-
-	/* Cannot look for individual databases during recovery */
-	Assert(allDbs || !RecoveryInProgress());
+	/* inferred after ProcArrayLock is released */
+	h->catalog_oldest_nonremovable = InvalidTransactionId;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
+	h->latest_completed = ShmemVariableCache->latestCompletedFullXid;
+
 	/*
 	 * We initialize the MIN() calculation with latestCompletedFullXid + 1.
 	 * This is a lower bound for the XIDs that might appear in the ProcArray
 	 * later, and so protects us against overestimating the result due to
 	 * future additions.
 	 */
-	result = XidFromFullTransactionId(ShmemVariableCache->latestCompletedFullXid);
-	TransactionIdAdvance(result);
-	Assert(TransactionIdIsNormal(result));
+	{
+		TransactionId initial;
 
-	for (index = 0; index < arrayP->numProcs; index++)
+		initial = XidFromFullTransactionId(h->latest_completed);
+		Assert(TransactionIdIsValid(initial));
+		TransactionIdAdvance(initial);
+
+		h->oldest_considered_running = initial;
+		h->shared_oldest_nonremovable = initial;
+		h->data_oldest_nonremovable = initial;
+	}
+
+	/*
+	 * Fetch slot horizons while ProcArrayLock is held - the
+	 * LWLockAcquire/LWLockRelease are a barrier, ensuring this happens inside
+	 * the lock.
+	 */
+	h->slot_xmin = procArray->replication_slot_xmin;
+	h->slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+
+	for (int index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
 		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		TransactionId xid;
+		TransactionId xmin;
 
-		if (pgxact->vacuumFlags & (flags & PROCARRAY_PROC_FLAGS_MASK))
+		/* Fetch xid just once - see GetNewTransactionId */
+		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+
+		/*
+		 * Consider both the transaction's Xmin, and its Xid.
+		 *
+		 * We must check both because a transaction might have an Xmin but not
+		 * (yet) an Xid; conversely, if it has an Xid, that could determine
+		 * some not-yet-set Xmin.
+		 */
+		xmin = TransactionIdOlder(xmin, xid);
+
+		/* if neither is set, this proc doesn't influence the horizon */
+		if (!TransactionIdIsValid(xmin))
 			continue;
 
-		if (allDbs ||
+		/*
+		 * Don't ignore any procs when determining which transactions might be
+		 * considered running.  While slots should ensure logical decoding
+		 * backends are protected even without this check, it can't hurt to
+		 * include them here as well.
+		 */
+		h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running, xmin);
+
+		/*
+		 * Skip over backends either vacuuming (which is ok with rows being
+		 * removed, as long as pg_subtrans is not truncated) or doing logical
+		 * decoding (which manages xmin separately, check below).
+		 */
+		if (pgxact->vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
+			continue;
+
+		/* shared tables need to take backends in all databases into account */
+		h->shared_oldest_nonremovable =
+			TransactionIdOlder(h->shared_oldest_nonremovable, xmin);
+
+		/*
+		 * Normally queries in other databases are ignored for anything but
+		 * the shared horizon. But in recovery we cannot compute an accurate
+		 * per-database horizon as all xids are managed via the
+		 * KnownAssignedXids machinery.
+		 */
+		if (in_recovery ||
 			proc->databaseId == MyDatabaseId ||
 			proc->databaseId == 0)	/* always include WalSender */
 		{
-			/* Fetch xid just once - see GetNewTransactionId */
-			TransactionId xid = UINT32_ACCESS_ONCE(pgxact->xid);
-
-			/* First consider the transaction's own Xid, if any */
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, result))
-				result = xid;
-
-			/*
-			 * Also consider the transaction's Xmin, if set.
-			 *
-			 * We must check both Xid and Xmin because a transaction might
-			 * have an Xmin but not (yet) an Xid; conversely, if it has an
-			 * Xid, that could determine some not-yet-set Xmin.
-			 */
-			xid = UINT32_ACCESS_ONCE(pgxact->xmin);
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, result))
-				result = xid;
+			h->data_oldest_nonremovable =
+				TransactionIdOlder(h->data_oldest_nonremovable, xmin);
 		}
 	}
 
 	/*
-	 * Fetch into local variable while ProcArrayLock is held - the
-	 * LWLockRelease below is a barrier, ensuring this happens inside the
-	 * lock.
+	 * If in recovery, fetch the oldest xid in KnownAssignedXids; it will be
+	 * applied after the lock is released.
 	 */
-	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (in_recovery)
+		kaxmin = KnownAssignedXidsGetOldestXmin();
 
-	if (RecoveryInProgress())
+	/*
+	 * No other information needed, so release the lock immediately. The rest
+	 * of the computations can be done without a lock.
+	 */
+	LWLockRelease(ProcArrayLock);
+
+	if (in_recovery)
 	{
-		/*
-		 * Check to see whether KnownAssignedXids contains an xid value older
-		 * than the main procarray.
-		 */
-		TransactionId kaxmin = KnownAssignedXidsGetOldestXmin();
-
-		LWLockRelease(ProcArrayLock);
-
-		if (TransactionIdIsNormal(kaxmin) &&
-			TransactionIdPrecedes(kaxmin, result))
-			result = kaxmin;
+		h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running, kaxmin);
+		h->shared_oldest_nonremovable =
+			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
+		h->data_oldest_nonremovable =
+			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
 	}
 	else
 	{
 		/*
-		 * No other information needed, so release the lock immediately.
-		 */
-		LWLockRelease(ProcArrayLock);
-
-		/*
-		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age,
-		 * being careful not to generate a "permanent" XID.
+		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age.
 		 *
 		 * vacuum_defer_cleanup_age provides some additional "slop" for the
 		 * benefit of hot standby queries on standby servers.  This is quick
@@ -1463,34 +1648,146 @@ GetOldestXmin(Relation rel, int flags)
 		 * in varsup.c.  Also note that we intentionally don't apply
 		 * vacuum_defer_cleanup_age on standby servers.
 		 */
-		result -= vacuum_defer_cleanup_age;
-		if (!TransactionIdIsNormal(result))
-			result = FirstNormalTransactionId;
+		h->oldest_considered_running =
+			TransactionIdRetreatedBy(h->oldest_considered_running,
+									 vacuum_defer_cleanup_age);
+		h->shared_oldest_nonremovable =
+			TransactionIdRetreatedBy(h->shared_oldest_nonremovable,
+									 vacuum_defer_cleanup_age);
+		h->data_oldest_nonremovable =
+			TransactionIdRetreatedBy(h->data_oldest_nonremovable,
+									 vacuum_defer_cleanup_age);
 	}
 
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
 	 */
-	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
-		TransactionIdIsValid(replication_slot_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_xmin, result))
-		result = replication_slot_xmin;
+	h->shared_oldest_nonremovable =
+		TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
+	h->data_oldest_nonremovable =
+		TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
 
 	/*
-	 * After locks have been released and vacuum_defer_cleanup_age has been
-	 * applied, check whether we need to back up further to make logical
-	 * decoding possible. We need to do so if we're computing the global limit
-	 * (rel = NULL) or if the passed relation is a catalog relation of some
-	 * kind.
+	 * The only difference between catalog / data horizons is that the slot's
+	 * catalog xmin is applied to the catalog one (so catalogs can be accessed
+	 * for logical decoding). Initialize with data horizon, and then back up
+	 * further if necessary. Have to back up the shared horizon as well, since
+	 * that also can contain catalogs.
 	 */
-	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
-		(rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	h->shared_oldest_nonremovable_raw = h->shared_oldest_nonremovable;
+	h->shared_oldest_nonremovable =
+		TransactionIdOlder(h->shared_oldest_nonremovable,
+						   h->slot_catalog_xmin);
+	h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+	h->catalog_oldest_nonremovable =
+		TransactionIdOlder(h->catalog_oldest_nonremovable,
+						   h->slot_catalog_xmin);
 
-	return result;
+	/*
+	 * It's possible that slots / vacuum_defer_cleanup_age backed up the
+	 * horizons further than oldest_considered_running. Fix.
+	 */
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->shared_oldest_nonremovable);
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->catalog_oldest_nonremovable);
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->data_oldest_nonremovable);
+
+	/*
+	 * shared horizons have to be at least as old as the oldest visible in
+	 * current db
+	 */
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->catalog_oldest_nonremovable));
+
+	/*
+	 * Horizons need to ensure that pg_subtrans access is still possible for
+	 * the relevant backends.
+	 */
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->shared_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->catalog_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->data_oldest_nonremovable));
+	Assert(!TransactionIdIsValid(h->slot_xmin) ||
+		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->slot_xmin));
+	Assert(!TransactionIdIsValid(h->slot_catalog_xmin) ||
+		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->slot_catalog_xmin));
+
+	/* update approximate horizons with the computed horizons */
+	GlobalVisUpdateApply(h);
+}
+
+/*
+ * Return the oldest XID for which deleted tuples must be preserved in the
+ * passed table.
+ *
+ * If rel is not NULL the horizon may be considerably more recent than
+ * otherwise (i.e. more tuples will be removable). In the NULL case a horizon
+ * that is correct (but not optimal) for all relations will be returned.
+ *
+ * This is used by VACUUM to decide which deleted tuples must be preserved in
+ * the passed in table.
+ */
+TransactionId
+GetOldestNonRemovableTransactionId(Relation rel)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	/* select horizon appropriate for relation */
+	if (rel == NULL || rel->rd_rel->relisshared)
+		return horizons.shared_oldest_nonremovable;
+	else if (RelationIsAccessibleInLogicalDecoding(rel))
+		return horizons.catalog_oldest_nonremovable;
+	else
+		return horizons.data_oldest_nonremovable;
+}
+
+/*
+ * Return the oldest transaction id any currently running backend might still
+ * consider running. This should not be used for visibility / pruning
+ * determinations (see GetOldestNonRemovableTransactionId()), but for
+ * decisions like up to where pg_subtrans can be truncated.
+ */
+TransactionId
+GetOldestTransactionIdConsideredRunning(void)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	return horizons.oldest_considered_running;
+}
+
+/*
+ * Return the visibility horizons for a hot standby feedback message.
+ */
+void
+GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	/*
+	 * Don't want to use shared_oldest_nonremovable here, as that contains the
+	 * effect of replication slot's catalog_xmin. We want to send a separate
+	 * feedback for the catalog horizon, so the primary can remove data table
+	 * contents more aggressively.
+	 */
+	*xmin = horizons.shared_oldest_nonremovable_raw;
+	*catalog_xmin = horizons.slot_catalog_xmin;
 }
 
 /*
@@ -1541,12 +1838,10 @@ GetMaxSnapshotSubxidCount(void)
  *			current transaction (this is the same as MyPgXact->xmin).
  *		RecentXmin: the xmin computed for the most recent snapshot.  XIDs
  *			older than this are known not running any more.
- *		RecentGlobalXmin: the global xmin (oldest TransactionXmin across all
- *			running transactions, except those running LAZY VACUUM).  This is
- *			the same computation done by
- *			GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM).
- *		RecentGlobalDataXmin: the global xmin for non-catalog tables
- *			>= RecentGlobalXmin
+ *
+ * And try to advance the bounds of GlobalVisSharedRels,
+ * GlobalVisCatalogRels, GlobalVisDataRels for
+ * the benefit of the GlobalVisTest* family of functions.
  *
  * Note: this function should probably not be called with an argument that's
  * not statically allocated (see xip allocation below).
@@ -1557,12 +1852,12 @@ GetSnapshotData(Snapshot snapshot)
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId xmin;
 	TransactionId xmax;
-	TransactionId globalxmin;
 	int			index;
 	int			count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
 	FullTransactionId latest_completed;
+	TransactionId oldestxid;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1607,13 +1902,15 @@ GetSnapshotData(Snapshot snapshot)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	latest_completed = ShmemVariableCache->latestCompletedFullXid;
+	oldestxid = ShmemVariableCache->oldestXid;
+
 	/* xmax is always latestCompletedFullXid + 1 */
 	xmax = XidFromFullTransactionId(latest_completed);
 	TransactionIdAdvance(xmax);
 	Assert(TransactionIdIsNormal(xmax));
 
 	/* initialize xmin calculation with xmax */
-	globalxmin = xmin = xmax;
+	xmin = xmax;
 
 	snapshot->takenDuringRecovery = RecoveryInProgress();
 
@@ -1642,12 +1939,6 @@ GetSnapshotData(Snapshot snapshot)
 				(PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
-			/* Update globalxmin to be the smallest valid xmin */
-			xid = UINT32_ACCESS_ONCE(pgxact->xmin);
-			if (TransactionIdIsNormal(xid) &&
-				NormalTransactionIdPrecedes(xid, globalxmin))
-				globalxmin = xid;
-
 			/* Fetch xid just once - see GetNewTransactionId */
 			xid = UINT32_ACCESS_ONCE(pgxact->xid);
 
@@ -1763,34 +2054,78 @@ GetSnapshotData(Snapshot snapshot)
 
 	LWLockRelease(ProcArrayLock);
 
-	/*
-	 * Update globalxmin to include actual process xids.  This is a slightly
-	 * different way of computing it than GetOldestXmin uses, but should give
-	 * the same result.
-	 */
-	if (TransactionIdPrecedes(xmin, globalxmin))
-		globalxmin = xmin;
+	/* maintain state for GlobalVis* */
+	{
+		TransactionId def_vis_xid;
+		TransactionId def_vis_xid_data;
+		FullTransactionId def_vis_fxid;
+		FullTransactionId def_vis_fxid_data;
+		FullTransactionId oldestfxid;
 
-	/* Update global variables too */
-	RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
-	if (!TransactionIdIsNormal(RecentGlobalXmin))
-		RecentGlobalXmin = FirstNormalTransactionId;
+		/*
+		 * Converting oldestXid is only safe when xid horizon cannot advance,
+		 * i.e. holding locks. While we don't hold the lock anymore, all the
+		 * necessary data has been gathered with lock held.
+		 */
+		oldestfxid = FullXidViaRelative(latest_completed, oldestxid);
 
-	/* Check whether there's a replication slot requiring an older xmin. */
-	if (TransactionIdIsValid(replication_slot_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
-		RecentGlobalXmin = replication_slot_xmin;
+		/* apply vacuum_defer_cleanup_age */
+		def_vis_xid_data =
+			TransactionIdRetreatedBy(xmin, vacuum_defer_cleanup_age);
 
-	/* Non-catalog tables can be vacuumed if older than this xid */
-	RecentGlobalDataXmin = RecentGlobalXmin;
+		/* Check whether there's a replication slot requiring an older xmin. */
+		def_vis_xid_data =
+			TransactionIdOlder(def_vis_xid_data, replication_slot_xmin);
 
-	/*
-	 * Check whether there's a replication slot requiring an older catalog
-	 * xmin.
-	 */
-	if (TransactionIdIsNormal(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, RecentGlobalXmin))
-		RecentGlobalXmin = replication_slot_catalog_xmin;
+		/*
+		 * Rows in non-shared, non-catalog tables possibly could be vacuumed
+		 * if older than this xid.
+		 */
+		def_vis_xid = def_vis_xid_data;
+
+		/*
+		 * Check whether there's a replication slot requiring an older catalog
+		 * xmin.
+		 */
+		def_vis_xid =
+			TransactionIdOlder(replication_slot_catalog_xmin, def_vis_xid);
+
+		def_vis_fxid = FullXidViaRelative(latest_completed, def_vis_xid);
+		def_vis_fxid_data = FullXidViaRelative(latest_completed, def_vis_xid_data);
+
+		/*
+		 * Check if we can increase upper bound. As a previous
+		 * GlobalVisUpdate() might have computed more aggressive values, don't
+		 * overwrite them if so.
+		 */
+		GlobalVisSharedRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid,
+								   GlobalVisSharedRels.definitely_needed);
+		GlobalVisCatalogRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid,
+								   GlobalVisCatalogRels.definitely_needed);
+		GlobalVisDataRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid_data,
+								   GlobalVisDataRels.definitely_needed);
+
+		/*
+		 * Check if we know that we can initialize or increase the lower
+		 * bound. Currently the only cheap way to do so is to use
+		 * ShmemVariableCache->oldestXid as input.
+		 *
+		 * We should definitely be able to do better. We could e.g. put a
+		 * global lower bound value into ShmemVariableCache.
+		 */
+		GlobalVisSharedRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisSharedRels.maybe_needed,
+								   oldestfxid);
+		GlobalVisCatalogRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
+								   oldestfxid);
+		GlobalVisDataRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
+								   oldestfxid);
+	}
 
 	RecentXmin = xmin;
 
@@ -3288,6 +3623,255 @@ DisplayXidCache(void)
 }
 #endif							/* XIDCACHE_DEBUG */
 
+/*
+ * If rel != NULL, return test state appropriate for relation, otherwise
+ * return state usable for all relations.  The latter may consider XIDs as
+ * not-yet-visible-to-everyone that a state for a specific relation would
+ * already consider visible-to-everyone.
+ *
+ * This needs to be called while a snapshot is active or registered, otherwise
+ * there are wraparound and other dangers.
+ *
+ * See comment for GlobalVisState for details.
+ */
+GlobalVisState *
+GlobalVisTestFor(Relation rel)
+{
+	bool		need_shared;
+	bool		need_catalog;
+	GlobalVisState *state;
+
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(RecentXmin);
+
+	if (!rel)
+		need_shared = need_catalog = true;
+	else
+	{
+		/*
+		 * Other kinds currently don't contain xids, nor always the necessary
+		 * logical decoding markers.
+		 */
+		Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+			   rel->rd_rel->relkind == RELKIND_MATVIEW ||
+			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+
+		need_shared = rel->rd_rel->relisshared || RecoveryInProgress();
+		need_catalog = IsCatalogRelation(rel) || RelationIsAccessibleInLogicalDecoding(rel);
+	}
+
+	if (need_shared)
+		state = &GlobalVisSharedRels;
+	else if (need_catalog)
+		state = &GlobalVisCatalogRels;
+	else
+		state = &GlobalVisDataRels;
+
+	Assert(FullTransactionIdIsValid(state->definitely_needed) &&
+		   FullTransactionIdIsValid(state->maybe_needed));
+
+	return state;
+}
+
+/*
+ * Return true if it's worth updating the accurate maybe_needed boundary.
+ *
+ * As it is somewhat expensive to determine xmin horizons, we don't want to
+ * repeatedly do so when there is a low likelihood of it being beneficial.
+ *
+ * The current heuristic is that we update only if RecentXmin has changed
+ * since the last update. If the oldest currently running transaction has not
+ * finished, it is unlikely that recomputing the horizon would be useful.
+ */
+static bool
+GlobalVisTestShouldUpdate(GlobalVisState *state)
+{
+	/* hasn't been updated yet */
+	if (!TransactionIdIsValid(ComputeXidHorizonsResultLastXmin))
+		return true;
+
+	/*
+	 * If the maybe_needed/definitely_needed boundaries are the same, it's
+	 * unlikely to be beneficial to refresh boundaries.
+	 */
+	if (FullTransactionIdFollowsOrEquals(state->maybe_needed,
+										 state->definitely_needed))
+		return false;
+
+	/* does the last snapshot built have a different xmin? */
+	return RecentXmin != ComputeXidHorizonsResultLastXmin;
+}
+
+static void
+GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
+{
+	GlobalVisSharedRels.maybe_needed =
+		FullXidViaRelative(horizons->latest_completed,
+						   horizons->shared_oldest_nonremovable);
+	GlobalVisCatalogRels.maybe_needed =
+		FullXidViaRelative(horizons->latest_completed,
+						   horizons->catalog_oldest_nonremovable);
+	GlobalVisDataRels.maybe_needed =
+		FullXidViaRelative(horizons->latest_completed,
+						   horizons->data_oldest_nonremovable);
+
+	/*
+	 * In longer running transactions it's possible that transactions we
+	 * previously needed to treat as running aren't around anymore. So update
+	 * definitely_needed to not be earlier than maybe_needed.
+	 */
+	GlobalVisSharedRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisSharedRels.maybe_needed,
+							   GlobalVisSharedRels.definitely_needed);
+	GlobalVisCatalogRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
+							   GlobalVisCatalogRels.definitely_needed);
+	GlobalVisDataRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
+							   GlobalVisDataRels.definitely_needed);
+
+	ComputeXidHorizonsResultLastXmin = RecentXmin;
+}
+
+/*
+ * Update boundaries in GlobalVis{Shared,Catalog,Data}Rels
+ * using ComputeXidHorizons().
+ */
+static void
+GlobalVisUpdate(void)
+{
+	ComputeXidHorizonsResult horizons;
+
+	/* updates the horizons as a side-effect */
+	ComputeXidHorizons(&horizons);
+}
+
+/*
+ * Return true if no snapshot still considers fxid to be running.
+ *
+ * The state passed needs to have been initialized for the relation fxid is
+ * from (NULL is also OK), otherwise the result may not be correct.
+ *
+ * See comment for GlobalVisState for details.
+ */
+bool
+GlobalVisTestIsRemovableFullXid(GlobalVisState *state,
+								FullTransactionId fxid)
+{
+	/*
+	 * If fxid is older than maybe_needed bound, it definitely is visible to
+	 * everyone.
+	 */
+	if (FullTransactionIdPrecedes(fxid, state->maybe_needed))
+		return true;
+
+	/*
+	 * If fxid is >= definitely_needed bound, it is very likely to still be
+	 * considered running.
+	 */
+	if (FullTransactionIdFollowsOrEquals(fxid, state->definitely_needed))
+		return false;
+
+	/*
+	 * fxid is between maybe_needed and definitely_needed, i.e. there might or
+	 * might not exist a snapshot considering fxid running. If it makes sense,
+	 * update boundaries and recheck.
+	 */
+	if (GlobalVisTestShouldUpdate(state))
+	{
+		GlobalVisUpdate();
+
+		Assert(FullTransactionIdPrecedes(fxid, state->definitely_needed));
+
+		return FullTransactionIdPrecedes(fxid, state->maybe_needed);
+	}
+	else
+		return false;
+}
+
+/*
+ * Wrapper around GlobalVisTestIsRemovableFullXid() for 32bit xids.
+ *
+ * It is crucial that this only gets called for xids from a source that
+ * protects against xid wraparounds (e.g. from a table and thus protected by
+ * relfrozenxid).
+ */
+bool
+GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid)
+{
+	FullTransactionId fxid;
+
+	/*
+	 * Convert 32 bit argument to FullTransactionId. We can do so safely
+	 * because we know the xid has to, at the very least, be between
+	 * [oldestXid, nextFullXid), i.e. within 2 billion of xid. To avoid taking
+	 * a lock to determine either, we can just compare with
+	 * state->definitely_needed, which was based on those values at the time
+	 * the current snapshot was built.
+	 */
+	fxid = FullXidViaRelative(state->definitely_needed, xid);
+
+	return GlobalVisTestIsRemovableFullXid(state, fxid);
+}
+
+/*
+ * Return FullTransactionId below which all transactions are not considered
+ * running anymore.
+ *
+ * Note: This is less efficient than testing with
+ * GlobalVisTestIsRemovableFullXid as it likely requires building an accurate
+ * cutoff, even in the case all the XIDs compared with the cutoff are outside
+ * [maybe_needed, definitely_needed).
+ */
+FullTransactionId
+GlobalVisTestNonRemovableFullHorizon(GlobalVisState *state)
+{
+	/* acquire accurate horizon if not already done */
+	if (GlobalVisTestShouldUpdate(state))
+		GlobalVisUpdate();
+
+	return state->maybe_needed;
+}
+
+/* Convenience wrapper around GlobalVisTestNonRemovableFullHorizon */
+TransactionId
+GlobalVisTestNonRemovableHorizon(GlobalVisState *state)
+{
+	FullTransactionId cutoff;
+
+	cutoff = GlobalVisTestNonRemovableFullHorizon(state);
+
+	return XidFromFullTransactionId(cutoff);
+}
+
+/*
+ * Convenience wrapper around GlobalVisTestFor() and
+ * GlobalVisTestIsRemovableFullXid(), see their comments.
+ */
+bool
+GlobalVisIsRemovableFullXid(Relation rel, FullTransactionId fxid)
+{
+	GlobalVisState *state;
+
+	state = GlobalVisTestFor(rel);
+
+	return GlobalVisTestIsRemovableFullXid(state, fxid);
+}
+
+/*
+ * Convenience wrapper around GlobalVisTestFor() and
+ * GlobalVisTestIsRemovableXid(), see their comments.
+ */
+bool
+GlobalVisCheckRemovableXid(Relation rel, TransactionId xid)
+{
+	GlobalVisState *state;
+
+	state = GlobalVisTestFor(rel);
+
+	return GlobalVisTestIsRemovableXid(state, xid);
+}
+
 /*
  * Convert a 32 bit transaction id into 64 bit transaction id, by assuming it
  * is within MaxTransactionId / 2 of XidFromFullTransactionId(rel).
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 53d974125fd..00c7afc66fc 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5786,14 +5786,15 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * recent); that case motivates not using SnapshotAny here.
 	 *
 	 * A crucial point here is that SnapshotNonVacuumable, with
-	 * RecentGlobalXmin as horizon, yields the inverse of the condition that
-	 * the indexscan will use to decide that index entries are killable (see
-	 * heap_hot_search_buffer()).  Therefore, if the snapshot rejects a tuple
-	 * (or more precisely, all tuples of a HOT chain) and we have to continue
-	 * scanning past it, we know that the indexscan will mark that index entry
-	 * killed.  That means that the next get_actual_variable_endpoint() call
-	 * will not have to re-consider that index entry.  In this way we avoid
-	 * repetitive work when this function is used a lot during planning.
+	 * GlobalVisTestFor(heapRel) as horizon, yields the inverse of the
+	 * condition that the indexscan will use to decide that index entries are
+	 * killable (see heap_hot_search_buffer()).  Therefore, if the snapshot
+	 * rejects a tuple (or more precisely, all tuples of a HOT chain) and we
+	 * have to continue scanning past it, we know that the indexscan will mark
+	 * that index entry killed.  That means that the next
+	 * get_actual_variable_endpoint() call will not have to re-consider that
+	 * index entry.  In this way we avoid repetitive work when this function
+	 * is used a lot during planning.
 	 *
 	 * But using SnapshotNonVacuumable creates a hazard of its own.  In a
 	 * recently-created index, some index entries may point at "broken" HOT
@@ -5805,7 +5806,8 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * or could even be NULL.  We avoid this hazard because we take the data
 	 * from the index entry not the heap.
 	 */
-	InitNonVacuumableSnapshot(SnapshotNonVacuumable, RecentGlobalXmin);
+	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
+							  GlobalVisTestFor(heapRel));
 
 	index_scan = index_beginscan(heapRel, indexRel,
 								 &SnapshotNonVacuumable,
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d5..893be2f3ddb 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -722,6 +722,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	 * is critical for anything that reads heap pages, because HOT may decide
 	 * to prune them even if the process doesn't attempt to modify any
 	 * tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	if (!bootstrap)
 	{
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6b6c8571e23..76578868cf9 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -157,16 +157,9 @@ static Snapshot HistoricSnapshot = NULL;
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
- *
- * RecentGlobalXmin and RecentGlobalDataXmin are initialized to
- * InvalidTransactionId, to ensure that no one tries to use a stale
- * value. Readers should ensure that it has been set to something else
- * before using it.
  */
 TransactionId TransactionXmin = FirstNormalTransactionId;
 TransactionId RecentXmin = FirstNormalTransactionId;
-TransactionId RecentGlobalXmin = InvalidTransactionId;
-TransactionId RecentGlobalDataXmin = InvalidTransactionId;
 
 /* (table, ctid) => (cmin, cmax) mapping during timetravel */
 static HTAB *tuplecid_data = NULL;
@@ -581,9 +574,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	 * Even though we are not going to use the snapshot it computes, we must
 	 * call GetSnapshotData, for two reasons: (1) to be sure that
 	 * CurrentSnapshotData's XID arrays have been allocated, and (2) to update
-	 * RecentXmin and RecentGlobalXmin.  (We could alternatively include those
-	 * two variables in exported snapshot files, but it seems better to have
-	 * snapshot importers compute reasonably up-to-date values for them.)
+	 * the state for GlobalVis*.
 	 */
 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
@@ -956,36 +947,6 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 		return 0;
 }
 
-/*
- * Get current RecentGlobalXmin value, as a FullTransactionId.
- */
-FullTransactionId
-GetFullRecentGlobalXmin(void)
-{
-	FullTransactionId nextxid_full;
-	uint32		nextxid_epoch;
-	TransactionId nextxid_xid;
-	uint32		epoch;
-
-	Assert(TransactionIdIsNormal(RecentGlobalXmin));
-
-	/*
-	 * Compute the epoch from the next XID's epoch. This relies on the fact
-	 * that RecentGlobalXmin must be within the 2 billion XID horizon from the
-	 * next XID.
-	 */
-	nextxid_full = ReadNextFullTransactionId();
-	nextxid_epoch = EpochFromFullTransactionId(nextxid_full);
-	nextxid_xid = XidFromFullTransactionId(nextxid_full);
-
-	if (RecentGlobalXmin > nextxid_xid)
-		epoch = nextxid_epoch - 1;
-	else
-		epoch = nextxid_epoch;
-
-	return FullTransactionIdFromEpochAndXid(epoch, RecentGlobalXmin);
-}
-
 /*
  * SnapshotResetXmin
  *
@@ -1753,106 +1714,157 @@ GetOldSnapshotThresholdTimestamp(void)
 	return threshold_timestamp;
 }
 
-static void
+void
 SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 {
 	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
+	Assert(oldSnapshotControl->threshold_timestamp <= ts);
+	Assert(TransactionIdPrecedesOrEquals(oldSnapshotControl->threshold_xid, xlimit));
 	oldSnapshotControl->threshold_timestamp = ts;
 	oldSnapshotControl->threshold_xid = xlimit;
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+/*
+ * XXX: Magic to make old_snapshot_threshold tests appear "working". They
+ * currently are broken, and discussion of what to do about them is
+ * ongoing. See
+ * https://www.postgresql.org/message-id/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de
+ */
+void
+SnapshotTooOldMagicForTest(void)
+{
+	TimestampTz ts = GetSnapshotCurrentTimestamp();
+
+	Assert(old_snapshot_threshold == 0);
+
+	ts -= 5 * USECS_PER_SEC;
+
+	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
+	oldSnapshotControl->threshold_timestamp = ts;
+	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
+}
+
+/*
+ * If there is a valid mapping for the timestamp, set *xlimitp to
+ * that. Returns whether there is such a mapping.
+ */
+static bool
+GetOldSnapshotFromTimeMapping(TimestampTz ts, TransactionId *xlimitp)
+{
+	bool in_mapping = false;
+
+	Assert(ts == AlignTimestampToMinuteBoundary(ts));
+
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+	if (oldSnapshotControl->count_used > 0
+		&& ts >= oldSnapshotControl->head_timestamp)
+	{
+		int			offset;
+
+		offset = ((ts - oldSnapshotControl->head_timestamp)
+				  / USECS_PER_MINUTE);
+		if (offset > oldSnapshotControl->count_used - 1)
+			offset = oldSnapshotControl->count_used - 1;
+		offset = (oldSnapshotControl->head_offset + offset)
+			% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+		*xlimitp = oldSnapshotControl->xid_by_minute[offset];
+
+		in_mapping = true;
+	}
+
+	LWLockRelease(OldSnapshotTimeMapLock);
+
+	return in_mapping;
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
- * Apply old snapshot limit, if any.  This is intended to be called for page
- * pruning and table vacuuming, to allow old_snapshot_threshold to override
- * the normal global xmin value.  Actual testing for snapshot too old will be
- * based on whether a snapshot timestamp is prior to the threshold timestamp
- * set in this function.
+ * Apply old snapshot limit.  This is intended to be called for page pruning
+ * and table vacuuming, to allow old_snapshot_threshold to override the normal
+ * global xmin value.  Actual testing for snapshot too old will be based on
+ * whether a snapshot timestamp is prior to the threshold timestamp set in
+ * this function.
+ *
+ * If the limited horizon allows a cleanup action that otherwise would not be
+ * possible, SetOldSnapshotThresholdTimestamp(*limit_ts, *limit_xid) needs to
+ * be called before that cleanup action.
  */
-TransactionId
+bool
 TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-									Relation relation)
+									Relation relation,
+									TransactionId *limit_xid,
+									TimestampTz *limit_ts)
 {
-	if (TransactionIdIsNormal(recentXmin)
-		&& old_snapshot_threshold >= 0
-		&& RelationAllowsEarlyPruning(relation))
+	TimestampTz ts;
+	TransactionId xlimit = recentXmin;
+	TransactionId latest_xmin;
+	TimestampTz next_map_update_ts;
+	TimestampTz threshold_timestamp;
+	TransactionId threshold_xid;
+
+	Assert(TransactionIdIsNormal(recentXmin));
+	Assert(OldSnapshotThresholdActive());
+	Assert(limit_ts != NULL && limit_xid != NULL);
+
+	if (!RelationAllowsEarlyPruning(relation))
+		return false;
+
+	ts = GetSnapshotCurrentTimestamp();
+
+	SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
+	latest_xmin = oldSnapshotControl->latest_xmin;
+	next_map_update_ts = oldSnapshotControl->next_map_update;
+	SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
+
+	/*
+	 * Zero threshold always overrides to latest xmin, if valid.  Without
+	 * some heuristic it will find its own snapshot too old on, for
+	 * example, a simple UPDATE -- which would make it useless for most
+	 * testing, but there is no principled way to ensure that it doesn't
+	 * fail in this way.  Use a five-second delay to try to get useful
+	 * testing behavior, but this may need adjustment.
+	 */
+	if (old_snapshot_threshold == 0)
 	{
-		TimestampTz ts = GetSnapshotCurrentTimestamp();
-		TransactionId xlimit = recentXmin;
-		TransactionId latest_xmin;
-		TimestampTz update_ts;
-		bool		same_ts_as_threshold = false;
-
-		SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
-		latest_xmin = oldSnapshotControl->latest_xmin;
-		update_ts = oldSnapshotControl->next_map_update;
-		SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
-
-		/*
-		 * Zero threshold always overrides to latest xmin, if valid.  Without
-		 * some heuristic it will find its own snapshot too old on, for
-		 * example, a simple UPDATE -- which would make it useless for most
-		 * testing, but there is no principled way to ensure that it doesn't
-		 * fail in this way.  Use a five-second delay to try to get useful
-		 * testing behavior, but this may need adjustment.
-		 */
-		if (old_snapshot_threshold == 0)
-		{
-			if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
-				&& TransactionIdFollows(latest_xmin, xlimit))
-				xlimit = latest_xmin;
-
-			ts -= 5 * USECS_PER_SEC;
-			SetOldSnapshotThresholdTimestamp(ts, xlimit);
-
-			return xlimit;
-		}
+		if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
+			&& TransactionIdFollows(latest_xmin, xlimit))
+			xlimit = latest_xmin;
 
+		ts -= 5 * USECS_PER_SEC;
+	}
+	else
+	{
 		ts = AlignTimestampToMinuteBoundary(ts)
 			- (old_snapshot_threshold * USECS_PER_MINUTE);
 
 		/* Check for fast exit without LW locking. */
 		SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
-		if (ts == oldSnapshotControl->threshold_timestamp)
-		{
-			xlimit = oldSnapshotControl->threshold_xid;
-			same_ts_as_threshold = true;
-		}
+		threshold_timestamp = oldSnapshotControl->threshold_timestamp;
+		threshold_xid = oldSnapshotControl->threshold_xid;
 		SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 
-		if (!same_ts_as_threshold)
+		if (ts == threshold_timestamp)
+		{
+			/*
+			 * Current timestamp is in same bucket as the last limit that
+			 * was applied. Reuse.
+			 */
+			xlimit = threshold_xid;
+		}
+		else if (ts == next_map_update_ts)
+		{
+			/*
+			 * FIXME: This branch is super iffy - but that should probably be
+			 * fixed separately.
+			 */
+			xlimit = latest_xmin;
+		}
+		else if (GetOldSnapshotFromTimeMapping(ts, &xlimit))
 		{
-			if (ts == update_ts)
-			{
-				xlimit = latest_xmin;
-				if (NormalTransactionIdFollows(xlimit, recentXmin))
-					SetOldSnapshotThresholdTimestamp(ts, xlimit);
-			}
-			else
-			{
-				LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
-
-				if (oldSnapshotControl->count_used > 0
-					&& ts >= oldSnapshotControl->head_timestamp)
-				{
-					int			offset;
-
-					offset = ((ts - oldSnapshotControl->head_timestamp)
-							  / USECS_PER_MINUTE);
-					if (offset > oldSnapshotControl->count_used - 1)
-						offset = oldSnapshotControl->count_used - 1;
-					offset = (oldSnapshotControl->head_offset + offset)
-						% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
-					xlimit = oldSnapshotControl->xid_by_minute[offset];
-
-					if (NormalTransactionIdFollows(xlimit, recentXmin))
-						SetOldSnapshotThresholdTimestamp(ts, xlimit);
-				}
-
-				LWLockRelease(OldSnapshotTimeMapLock);
-			}
 		}
 
 		/*
@@ -1867,12 +1879,18 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 		if (TransactionIdIsNormal(latest_xmin)
 			&& TransactionIdPrecedes(latest_xmin, xlimit))
 			xlimit = latest_xmin;
-
-		if (NormalTransactionIdFollows(xlimit, recentXmin))
-			return xlimit;
 	}
 
-	return recentXmin;
+	if (TransactionIdIsValid(xlimit) &&
+		TransactionIdFollowsOrEquals(xlimit, recentXmin))
+	{
+		*limit_ts = ts;
+		*limit_xid = xlimit;
+
+		return true;
+	}
+
+	return false;
 }
 
 /*
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index e4d501a85d1..76306976c2a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -419,10 +419,10 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 			 RelationGetRelationName(rel));
 
 	/*
-	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
-	 * RecentGlobalXmin/B-Tree page deletion.
+	 * This assertion matches the one in index_getnext_tid().  See page
+	 * recycling/"visible to everyone" notes in nbtree README.
 	 */
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(TransactionIdIsValid(RecentXmin));
 
 	/*
 	 * Initialize state for entire verification operation
@@ -1441,7 +1441,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * does not occur until no possible index scan could land on the page.
 	 * Index scans can follow links with nothing more than their snapshot as
 	 * an interlock and be sure of at least that much.  (See page
-	 * recycling/RecentGlobalXmin notes in nbtree README.)
+	 * recycling/"visible to everyone" notes in nbtree README.)
 	 *
 	 * Furthermore, it's okay if we follow a rightlink and find a half-dead or
 	 * dead (ignorable) page one or more times.  There will either be a
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 68d580ed1e0..37206c50a21 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
 
-	if (all_visible)
-	{
-		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
-	}
-
 	rel = relation_open(relid, AccessShareLock);
 
 	/* Only some relkinds have a visibility map */
 	check_relation_relkind(rel);
 
+	if (all_visible)
+		OldestXmin = GetOldestNonRemovableTransactionId(rel);
+
 	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/*
@@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * From a concurrency point of view, it sort of sucks to
 				 * retake ProcArrayLock here while we're holding the buffer
 				 * exclusively locked, but it should be safe against
-				 * deadlocks, because surely GetOldestXmin() should never take
-				 * a buffer lock. And this shouldn't happen often, so it's
-				 * worth being careful so as to avoid false positives.
+				 * deadlocks, because surely GetOldestNonRemovableTransactionId()
+				 * should never take a buffer lock. And this shouldn't happen
+				 * often, so it's worth being careful so as to avoid false
+				 * positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+				RecomputedOldestXmin = GetOldestNonRemovableTransactionId(rel);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index dbc0fa11f61..3a99333d443 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
 
-	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(rel);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7eaaad1e140..b4948ac675f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -395,6 +395,7 @@ CompositeTypeStmt
 CompoundAffixFlag
 CompressionAlgorithm
 CompressorState
+ComputeXidHorizonsResult
 ConditionVariable
 ConditionalStack
 ConfigData
@@ -930,6 +931,7 @@ GistSplitVector
 GistTsVectorOptions
 GistVacState
 GlobalTransaction
+GlobalVisState
 GrantRoleStmt
 GrantStmt
 GrantTargetType
-- 
2.25.0.114.g5b0ca878e0
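
For readers following the patch above, here is a minimal standalone sketch of the two-bound test that GlobalVisTestIsRemovableFullXid() performs. It is not the patch's code: the Sketch* names are hypothetical, a plain uint64_t stands in for FullTransactionId, the accurate_horizon field stands in for what ComputeXidHorizons() would return, and the GlobalVisTestShouldUpdate() heuristic is left out, so the sketch recomputes whenever an xid falls between the two bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for FullTransactionId: a plain 64 bit counter. */
typedef uint64_t SketchFullXid;

/* Hypothetical stand-in for GlobalVisState plus some bookkeeping. */
typedef struct SketchVisState
{
	SketchFullXid maybe_needed;			/* xids below this are removable */
	SketchFullXid definitely_needed;	/* xids at or above this are not */
	SketchFullXid accurate_horizon;		/* assumption: result an expensive
										 * recomputation would produce */
	int			recomputations;			/* how often we paid for accuracy */
} SketchVisState;

/* Stand-in for GlobalVisUpdate(): adopt the accurate lower bound. */
static void
sketch_update(SketchVisState *state)
{
	state->recomputations++;
	state->maybe_needed = state->accurate_horizon;

	/* keep definitely_needed from being earlier than maybe_needed */
	if (state->definitely_needed < state->maybe_needed)
		state->definitely_needed = state->maybe_needed;

	assert(state->maybe_needed <= state->definitely_needed);
}

/* Return true if no snapshot can still consider fxid to be running. */
static bool
sketch_is_removable(SketchVisState *state, SketchFullXid fxid)
{
	/* cheap answer: older than the lower bound */
	if (fxid < state->maybe_needed)
		return true;

	/* cheap answer: at or past the upper bound */
	if (fxid >= state->definitely_needed)
		return false;

	/* in between: only now compute an accurate bound and recheck */
	sketch_update(state);
	assert(fxid < state->definitely_needed);

	return fxid < state->maybe_needed;
}
```

With bounds of 100 and 200 and an accurate horizon of 150, xids 50 and 250 are answered without any recomputation; only an xid such as 120 triggers one, after which the lower bound is 150. That is the intended effect: the expensive horizon computation is deferred until a test actually needs it.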

v12-0003-snapshot-scalability-Move-PGXACT-xmin-back-to-PG.patch (text/x-diff; charset=us-ascii)
From 6b9fe7f54a2853e1b72ce5880873291864a3b13d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v12 3/7] snapshot scalability: Move PGXACT->xmin back to
 PGPROC.

Now that xmin isn't needed for GetSnapshotData() anymore, it leads to
unnecessary cacheline ping-pong to have it in PGXACT as it is updated
more frequently than the other PGXACT members.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h                  | 10 +++---
 src/backend/access/gist/gistxlog.c          |  2 +-
 src/backend/access/nbtree/nbtpage.c         |  2 +-
 src/backend/access/transam/README           |  2 +-
 src/backend/access/transam/twophase.c       |  2 +-
 src/backend/commands/indexcmds.c            |  2 +-
 src/backend/replication/logical/snapbuild.c |  6 ++--
 src/backend/replication/walsender.c         | 10 +++---
 src/backend/storage/ipc/procarray.c         | 36 +++++++++------------
 src/backend/storage/ipc/sinvaladt.c         |  2 +-
 src/backend/storage/lmgr/proc.c             |  4 +--
 src/backend/utils/time/snapmgr.c            | 28 ++++++++--------
 12 files changed, 51 insertions(+), 55 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 08f006f782e..286c9a9aec3 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -102,6 +102,11 @@ struct PGPROC
 
 	Latch		procLatch;		/* generic latch for process */
 
+	TransactionId xmin;			/* minimal running XID as it was when we were
+								 * starting our xact, excluding LAZY VACUUM:
+								 * vacuum must not remove tuples deleted by
+								 * xid >= xmin ! */
+
 	LocalTransactionId lxid;	/* local id of top-level transaction currently
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
@@ -224,11 +229,6 @@ typedef struct PGXACT
 								 * executed by this proc, if running and XID
 								 * is assigned; else InvalidTransactionId */
 
-	TransactionId xmin;			/* minimal running XID as it was when we were
-								 * starting our xact, excluding LAZY VACUUM:
-								 * vacuum must not remove tuples deleted by
-								 * xid >= xmin ! */
-
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3167305ac00..b6603cd73cf 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -389,7 +389,7 @@ gistRedoPageReuse(XLogReaderState *record)
 	 *
 	 * latestRemovedXid was the page's deleteXid.  The
 	 * GlobalVisIsRemovableFullXid(deleteXid) test in gistPageRecyclable()
-	 * conceptually mirrors the pgxact->xmin > limitXmin test in
+	 * conceptually mirrors the PGPROC->xmin > limitXmin test in
 	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
 	 * same exclusion effect on primary and standby.
 	 */
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index d18b2722693..d567c51c6f2 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2315,7 +2315,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * we're in VACUUM and would not otherwise have an XID.  Having already
 	 * updated links to the target, ReadNewTransactionId() suffices as an
 	 * upper bound.  Any scan having retained a now-stale link is advertising
-	 * in its PGXACT an xmin less than or equal to the value we read here.  It
+	 * in its PGPROC an xmin less than or equal to the value we read here.  It
 	 * will continue to do so, holding back the xmin horizon, for the duration
 	 * of that scan.
 	 */
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 4e2178dabab..94d8f3fd0a2 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -331,7 +331,7 @@ necessary.
 Note that while it is certain that two concurrent executions of
 GetSnapshotData will compute the same xmin for their own snapshots, there is
 no such guarantee for the horizons computed by ComputeXidHorizons.  This is
-because we allow XID-less transactions to clear their MyPgXact->xmin
+because we allow XID-less transactions to clear their MyProc->xmin
 asynchronously (without taking ProcArrayLock), so one execution might see
 what had been the oldest xmin, and another not.  This is OK since the
 thresholds need only be a valid lower bound.  As noted above, we are already
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9b2e59bf0ec..ae7c1a4c172 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -464,7 +464,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
 	proc->lxid = (LocalTransactionId) xid;
 	pgxact->xid = xid;
-	pgxact->xmin = InvalidTransactionId;
+	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
 	proc->pid = 0;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2baca12c5f4..9d741aa03fa 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1535,7 +1535,7 @@ DefineIndex(Oid relationId,
 	StartTransactionCommand();
 
 	/* We should now definitely not be advertising any xmin. */
-	Assert(MyPgXact->xmin == InvalidTransactionId);
+	Assert(MyProc->xmin == InvalidTransactionId);
 
 	/*
 	 * The index is now valid in the sense that it contains all currently
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3089f0d5ddc..e9701ea7221 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -553,8 +553,8 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 		elog(ERROR, "cannot build an initial slot snapshot, not all transactions are monitored anymore");
 
 	/* so we don't overwrite the existing value */
-	if (TransactionIdIsValid(MyPgXact->xmin))
-		elog(ERROR, "cannot build an initial slot snapshot when MyPgXact->xmin already is valid");
+	if (TransactionIdIsValid(MyProc->xmin))
+		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
 
 	snap = SnapBuildBuildSnapshot(builder);
 
@@ -575,7 +575,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	}
 #endif
 
-	MyPgXact->xmin = snap->xmin;
+	MyProc->xmin = snap->xmin;
 
 	/* allocate in transaction context */
 	newxip = (TransactionId *)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fd370d52b66..06da4b4352a 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1967,7 +1967,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 	ReplicationSlot *slot = MyReplicationSlot;
 
 	SpinLockAcquire(&slot->mutex);
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 
 	/*
 	 * For physical replication we don't need the interlock provided by xmin
@@ -2096,7 +2096,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	if (!TransactionIdIsNormal(feedbackXmin)
 		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
-		MyPgXact->xmin = InvalidTransactionId;
+		MyProc->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
 			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
@@ -2138,7 +2138,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * risk already since a VACUUM could already have determined the horizon.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * otherwise via the walsender's PGPROC entry. We can only track the
 	 * catalog xmin separately when using a slot, so we store the least of the
 	 * two provided when not using a slot.
 	 *
@@ -2151,9 +2151,9 @@ ProcessStandbyHSFeedbackMessage(void)
 	{
 		if (TransactionIdIsNormal(feedbackCatalogXmin)
 			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
-			MyPgXact->xmin = feedbackCatalogXmin;
+			MyProc->xmin = feedbackCatalogXmin;
 		else
-			MyPgXact->xmin = feedbackXmin;
+			MyProc->xmin = feedbackXmin;
 	}
 }
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 20115e2f63f..164cf0cabc2 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -586,9 +586,9 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
 		proc->lxid = InvalidLocalTransactionId;
-		pgxact->xmin = InvalidTransactionId;
 		/* must be cleared with xid/xmin: */
 		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false;	/* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
@@ -608,9 +608,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 {
 	pgxact->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	pgxact->xmin = InvalidTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+	proc->xmin = InvalidTransactionId;
 	proc->delayChkpt = false;	/* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
@@ -762,7 +762,7 @@ ProcArrayClearTransaction(PGPROC *proc)
 	 */
 	pgxact->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	pgxact->xmin = InvalidTransactionId;
+	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
 	/* redundant, but just in case */
@@ -1560,7 +1560,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 
 		/* Fetch xid just once - see GetNewTransactionId */
 		xid = UINT32_ACCESS_ONCE(pgxact->xid);
-		xmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*
 		 * Consider both the transaction's Xmin, and its Xid.
@@ -1835,7 +1835,7 @@ GetMaxSnapshotSubxidCount(void)
  *
  * We also update the following backend-global variables:
  *		TransactionXmin: the oldest xmin of any snapshot in use in the
- *			current transaction (this is the same as MyPgXact->xmin).
+ *			current transaction (this is the same as MyProc->xmin).
  *		RecentXmin: the xmin computed for the most recent snapshot.  XIDs
  *			older than this are known not running any more.
  *
@@ -1897,7 +1897,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
-	 * going to set MyPgXact->xmin.
+	 * going to set MyProc->xmin.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -2049,8 +2049,8 @@ GetSnapshotData(Snapshot snapshot)
 	replication_slot_xmin = procArray->replication_slot_xmin;
 	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
-	if (!TransactionIdIsValid(MyPgXact->xmin))
-		MyPgXact->xmin = TransactionXmin = xmin;
+	if (!TransactionIdIsValid(MyProc->xmin))
+		MyProc->xmin = TransactionXmin = xmin;
 
 	LWLockRelease(ProcArrayLock);
 
@@ -2170,7 +2170,7 @@ GetSnapshotData(Snapshot snapshot)
 }
 
 /*
- * ProcArrayInstallImportedXmin -- install imported xmin into MyPgXact->xmin
+ * ProcArrayInstallImportedXmin -- install imported xmin into MyProc->xmin
  *
  * This is called when installing a snapshot imported from another
  * transaction.  To ensure that OldestXmin doesn't go backwards, we must
@@ -2223,7 +2223,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 		/*
 		 * Likewise, let's just make real sure its xmin does cover us.
 		 */
-		xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+		xid = UINT32_ACCESS_ONCE(proc->xmin);
 		if (!TransactionIdIsNormal(xid) ||
 			!TransactionIdPrecedesOrEquals(xid, xmin))
 			continue;
@@ -2234,7 +2234,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 		 * GetSnapshotData first, we'll be overwriting a valid xmin here, so
 		 * we don't check that.)
 		 */
-		MyPgXact->xmin = TransactionXmin = xmin;
+		MyProc->xmin = TransactionXmin = xmin;
 
 		result = true;
 		break;
@@ -2246,7 +2246,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 }
 
 /*
- * ProcArrayInstallRestoredXmin -- install restored xmin into MyPgXact->xmin
+ * ProcArrayInstallRestoredXmin -- install restored xmin into MyProc->xmin
  *
  * This is like ProcArrayInstallImportedXmin, but we have a pointer to the
  * PGPROC of the transaction from which we imported the snapshot, rather than
@@ -2259,7 +2259,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 {
 	bool		result = false;
 	TransactionId xid;
-	PGXACT	   *pgxact;
 
 	Assert(TransactionIdIsNormal(xmin));
 	Assert(proc != NULL);
@@ -2267,20 +2266,18 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 	/* Get lock so source xact can't end while we're doing this */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	pgxact = &allPgXact[proc->pgprocno];
-
 	/*
 	 * Be certain that the referenced PGPROC has an advertised xmin which is
 	 * no later than the one we're installing, so that the system-wide xmin
 	 * can't go backwards.  Also, make sure it's running in the same database,
 	 * so that the per-database xmin cannot go backwards.
 	 */
-	xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+	xid = UINT32_ACCESS_ONCE(proc->xmin);
 	if (proc->databaseId == MyDatabaseId &&
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		MyPgXact->xmin = TransactionXmin = xmin;
+		MyProc->xmin = TransactionXmin = xmin;
 		result = true;
 	}
 
@@ -2906,7 +2903,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 		if (allDbs || proc->databaseId == MyDatabaseId)
 		{
 			/* Fetch xmin just once - might change on us */
-			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 			if (excludeXmin0 && !TransactionIdIsValid(pxmin))
 				continue;
@@ -2992,7 +2989,6 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
 		/* Exclude prepared transactions */
 		if (proc->pid == 0)
@@ -3002,7 +2998,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 			proc->databaseId == dbOid)
 		{
 			/* Fetch xmin just once - can't change on us, but good coding */
-			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 			/*
 			 * We ignore an invalid pxmin because this means that backend has
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index e5c115b92f2..ad048bc85fa 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -420,7 +420,7 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 			PGXACT	   *xact = &ProcGlobal->allPgXact[proc->pgprocno];
 
 			*xid = xact->xid;
-			*xmin = xact->xmin;
+			*xmin = proc->xmin;
 		}
 	}
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e57fcd25388..de346cd87fc 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -388,7 +388,7 @@ InitProcess(void)
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyPgXact->xid = InvalidTransactionId;
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
 	MyProc->backendId = InvalidBackendId;
@@ -572,7 +572,7 @@ InitAuxiliaryProcess(void)
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyPgXact->xid = InvalidTransactionId;
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
 	MyProc->roleId = InvalidOid;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 76578868cf9..689a3b6a597 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -27,11 +27,11 @@
  * their lifetime is managed separately (as they live longer than one xact.c
  * transaction).
  *
- * These arrangements let us reset MyPgXact->xmin when there are no snapshots
+ * These arrangements let us reset MyProc->xmin when there are no snapshots
  * referenced by this transaction, and advance it when the one with oldest
  * Xmin is no longer referenced.  For simplicity however, only registered
  * snapshots not active snapshots participate in tracking which one is oldest;
- * we don't try to change MyPgXact->xmin except when the active-snapshot
+ * we don't try to change MyProc->xmin except when the active-snapshot
  * stack is empty.
  *
  *
@@ -187,7 +187,7 @@ static ActiveSnapshotElt *OldestActiveSnapshot = NULL;
 
 /*
  * Currently registered Snapshots.  Ordered in a heap by xmin, so that we can
- * quickly find the one with lowest xmin, to advance our MyPgXact->xmin.
+ * quickly find the one with lowest xmin, to advance our MyProc->xmin.
  */
 static int	xmin_cmp(const pairingheap_node *a, const pairingheap_node *b,
 					 void *arg);
@@ -475,7 +475,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 
 		/*
 		 * Make sure the catalog snapshot will be accounted for in decisions
-		 * about advancing PGXACT->xmin.  We could apply RegisterSnapshot, but
+		 * about advancing PGPROC->xmin.  We could apply RegisterSnapshot, but
 		 * that would result in making a physical copy, which is overkill; and
 		 * it would also create a dependency on some resource owner, which we
 		 * do not want for reasons explained at the head of this file. Instead
@@ -596,7 +596,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	/* NB: curcid should NOT be copied, it's a local matter */
 
 	/*
-	 * Now we have to fix what GetSnapshotData did with MyPgXact->xmin and
+	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
 	 * causing the global xmin to go backwards, we have to test that the
 	 * source transaction is still running, and that has to be done
@@ -950,13 +950,13 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 /*
  * SnapshotResetXmin
  *
- * If there are no more snapshots, we can reset our PGXACT->xmin to InvalidXid.
+ * If there are no more snapshots, we can reset our PGPROC->xmin to InvalidXid.
  * Note we can do this without locking because we assume that storing an Xid
  * is atomic.
  *
  * Even if there are some remaining snapshots, we may be able to advance our
- * PGXACT->xmin to some degree.  This typically happens when a portal is
- * dropped.  For efficiency, we only consider recomputing PGXACT->xmin when
+ * PGPROC->xmin to some degree.  This typically happens when a portal is
+ * dropped.  For efficiency, we only consider recomputing PGPROC->xmin when
  * the active snapshot stack is empty; this allows us not to need to track
  * which active snapshot is oldest.
  *
@@ -977,15 +977,15 @@ SnapshotResetXmin(void)
 
 	if (pairingheap_is_empty(&RegisteredSnapshots))
 	{
-		MyPgXact->xmin = InvalidTransactionId;
+		MyProc->xmin = InvalidTransactionId;
 		return;
 	}
 
 	minSnapshot = pairingheap_container(SnapshotData, ph_node,
 										pairingheap_first(&RegisteredSnapshots));
 
-	if (TransactionIdPrecedes(MyPgXact->xmin, minSnapshot->xmin))
-		MyPgXact->xmin = minSnapshot->xmin;
+	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
+		MyProc->xmin = minSnapshot->xmin;
 }
 
 /*
@@ -1132,13 +1132,13 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 
 	/*
 	 * During normal commit processing, we call ProcArrayEndTransaction() to
-	 * reset the MyPgXact->xmin. That call happens prior to the call to
+	 * reset the MyProc->xmin. That call happens prior to the call to
 	 * AtEOXact_Snapshot(), so we need not touch xmin here at all.
 	 */
 	if (resetXmin)
 		SnapshotResetXmin();
 
-	Assert(resetXmin || MyPgXact->xmin == 0);
+	Assert(resetXmin || MyProc->xmin == 0);
 }
 
 
@@ -1830,7 +1830,7 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 	 */
 	if (old_snapshot_threshold == 0)
 	{
-		if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
+		if (TransactionIdPrecedes(latest_xmin, MyProc->xmin)
 			&& TransactionIdFollows(latest_xmin, xlimit))
 			xlimit = latest_xmin;
 
-- 
2.25.0.114.g5b0ca878e0

v12-0004-snapshot-scalability-Introduce-dense-array-of-in.patch (text/x-diff; charset=us-ascii)
From 5d7d5db62127541219715f4c2d726dc643ded956 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v12 4/7] snapshot scalability: Introduce dense array of
 in-progress xids.

The new array contains the xids for all connected backends / in-use
PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT
arrays which can have unused entries interspersed).

This improves performance because GetSnapshotData() always needs to
scan the xids of all live procarray entries and now there's no need to
go through the procArray->pgprocnos indirection anymore.

As the set of running top-level xids changes rarely, compared to the
number of snapshots taken, this substantially increases the likelihood
of most data required for a snapshot being in L2 cache.  In
read-mostly workloads scanning the xids[] array will be sufficient to
build a snapshot, as most backends will not have an xid assigned.

To keep the xid array dense, ProcArrayRemove() needs to move the entries
behind the to-be-removed proc's entry further up in the array. Obviously
moving array entries cannot happen while a backend sets its
xid. I.e. locking needs to prevent array entries from being moved while
a backend modifies its xid.

To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot
spot already - ProcArrayAdd() / ProcArrayRemove() now need to hold
XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray
entry is not a very frequent operation, even taking 2PC into account.

Due to the above, the dense array entries can only be read or modified
while holding ProcArrayLock and/or XidGenLock. This prevents a
concurrent ProcArrayRemove() from shifting the dense array while it is
accessed concurrently.

While the new dense array is very good when needing to look at all
xids it is less suitable when accessing a single backend's xid. In
particular it would be problematic to have to acquire a lock to access
a backend's own xid. Therefore a backend's xid is not just stored in
the dense array, but also in PGPROC. This also allows a backend to
only access the shared xid value when the backend has acquired an
xid.

The infrastructure added in this commit will be used for the remaining
PGXACT fields in subsequent commits. They are kept separate to make
review easier.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
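The commit message above describes the dense xids array, the shifting done
on removal, and the mirroring of a backend's xid in its own PGPROC. What
follows is a minimal, self-contained sketch of that idea, not the code in
this patch. All names (DemoProc, dense_xids, demo_proc_add, and so on) are
hypothetical, locking is omitted (callers are assumed to hold the locks the
message names), and entries are simply appended, which may differ from the
ordering the real ProcArrayAdd() maintains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define MAX_PROCS 8

/* Hypothetical, simplified stand-in for PGPROC. */
typedef struct DemoProc
{
	TransactionId xid;			/* backend's own copy of its xid */
	int			pgxactoff;		/* index of this proc's dense-array slot */
} DemoProc;

/* Dense arrays: only slots [0, num_procs) are in use, with no gaps. */
static TransactionId dense_xids[MAX_PROCS];
static DemoProc *dense_procs[MAX_PROCS];
static int	num_procs = 0;

/* Add a proc at the end; assumes the caller holds both locks exclusively. */
static void
demo_proc_add(DemoProc *proc)
{
	assert(num_procs < MAX_PROCS);
	proc->pgxactoff = num_procs;
	dense_procs[num_procs] = proc;
	dense_xids[num_procs] = proc->xid;	/* initialize mirror from the proc */
	num_procs++;
}

/* Remove a proc and shift later entries up so the arrays stay dense. */
static void
demo_proc_remove(DemoProc *proc)
{
	int			off = proc->pgxactoff;

	assert(off >= 0 && off < num_procs && dense_procs[off] == proc);
	for (int i = off; i < num_procs - 1; i++)
	{
		dense_procs[i] = dense_procs[i + 1];
		dense_xids[i] = dense_xids[i + 1];
		dense_procs[i]->pgxactoff = i;	/* offsets of moved procs change */
	}
	num_procs--;
	dense_procs[num_procs] = NULL;
	dense_xids[num_procs] = InvalidTransactionId;
	proc->pgxactoff = -1;
}

/* Set a backend's xid; both copies are updated together. */
static void
demo_set_xid(DemoProc *proc, TransactionId xid)
{
	proc->xid = xid;
	dense_xids[proc->pgxactoff] = xid;
}

/* Snapshot-style scan: one tight loop over the dense array only. */
static int
demo_collect_running(TransactionId *out)
{
	int			n = 0;

	for (int i = 0; i < num_procs; i++)
	{
		if (dense_xids[i] != InvalidTransactionId)
			out[n++] = dense_xids[i];
	}
	return n;
}
```

Because removal rewrites pgxactoff for every later entry, a stored offset is
only meaningful while removals are blocked, which is why the sketch assumes
the caller holds a lock. Reading proc->xid directly needs no offset at all,
which matches the reason the message gives for keeping the per-backend copy.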
 src/include/storage/proc.h                  |  79 +++++-
 src/backend/access/heap/heapam_visibility.c |   8 +-
 src/backend/access/transam/README           |  33 +--
 src/backend/access/transam/clog.c           |   8 +-
 src/backend/access/transam/twophase.c       |  31 +--
 src/backend/access/transam/varsup.c         |  20 +-
 src/backend/commands/vacuum.c               |   2 +-
 src/backend/storage/ipc/procarray.c         | 282 +++++++++++++-------
 src/backend/storage/ipc/sinvaladt.c         |   4 +-
 src/backend/storage/lmgr/lock.c             |   3 +-
 src/backend/storage/lmgr/proc.c             |  26 +-
 11 files changed, 335 insertions(+), 161 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 286c9a9aec3..b828cecd185 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -90,6 +90,17 @@ typedef enum
  * distinguished from a real one at need by the fact that it has pid == 0.
  * The semaphore and lock-activity fields in a prepared-xact PGPROC are unused,
  * but its myProcLocks[] lists are valid.
+ *
+ * Mirrored fields:
+ *
+ * Some fields in PGPROC (see "mirrored in ..." comment) are mirrored into an
+ * element of more densely packed ProcGlobal arrays. These arrays are indexed
+ * by PGPROC->pgxactoff. Both copies need to be maintained coherently.
+ *
+ * NB: The pgxactoff indexed value can *never* be accessed without holding
+ * locks.
+ *
+ * See PROC_HDR for details.
  */
 struct PGPROC
 {
@@ -102,6 +113,12 @@ struct PGPROC
 
 	Latch		procLatch;		/* generic latch for process */
 
+
+	TransactionId xid;			/* id of top-level transaction currently being
+								 * executed by this proc, if running and XID
+								 * is assigned; else InvalidTransactionId.
+								 * mirrored in ProcGlobal->xids[pgxactoff] */
+
 	TransactionId xmin;			/* minimal running XID as it was when we were
 								 * starting our xact, excluding LAZY VACUUM:
 								 * vacuum must not remove tuples deleted by
@@ -111,6 +128,9 @@ struct PGPROC
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
 	int			pid;			/* Backend's process ID; 0 if prepared xact */
+
+	int			pgxactoff;		/* offset into various ProcGlobal->arrays
+								 * with data mirrored from this PGPROC */
 	int			pgprocno;
 
 	/* These fields are zero while a backend is still starting up: */
@@ -225,10 +245,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	TransactionId xid;			/* id of top-level transaction currently being
-								 * executed by this proc, if running and XID
-								 * is assigned; else InvalidTransactionId */
-
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
@@ -237,6 +253,57 @@ typedef struct PGXACT
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
+ *
+ * Adding/Removing an entry into the procarray requires holding *both*
+ * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
+ * needed because the dense arrays (see below) are accessed from
+ * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
+ * further contention by both using the same lock. Adding/Removing a procarray
+ * entry is much less frequent.
+ *
+ * Some fields in PGPROC are mirrored into more densely packed arrays (like
+ * xids), with one entry for each backend. These arrays only contain entries
+ * for PGPROCs that have been added to the shared array with ProcArrayAdd()
+ * (in contrast to PGPROC array which has unused PGPROCs interspersed).
+ *
+ * The dense arrays are indexed by PGPROC->pgxactoff. Any concurrent
+ * ProcArrayAdd() / ProcArrayRemove() can cause the pgxactoff of a procarray
+ * member to change.  Therefore it is only safe to use PGPROC->pgxactoff to
+ * access the dense array while holding either ProcArrayLock or XidGenLock.
+ *
+ * As long as a PGPROC is in the procarray, the mirrored values need to be
+ * maintained in both places in a coherent manner.
+ *
+ * The denser separate arrays are beneficial for three main reasons: First, to
+ * allow for as tight loops accessing the data as possible. Second, to prevent
+ * updates of frequently changing data (e.g. xmin) from invalidating
+ * cachelines also containing less frequently changing data (e.g. xid,
+ * vacuumFlags). Third, to condense frequently accessed data into as few
+ * cachelines as possible.
+ *
+ * There are two main reasons to have the data mirrored between these dense
+ * arrays and PGPROC. First, as explained above, a PGPROC's array entries can
+ * only be accessed with either ProcArrayLock or XidGenLock held, whereas the
+ * PGPROC entries do not require that (obviously there may still be locking
+ * requirements around the individual field, separate from the concerns
+ * here). That is particularly important for a backend to efficiently check
+ * its own values, which it often can safely do without locking.  Second, the
+ * PGPROC fields help avoid unnecessary accesses and modifications to the
+ * dense arrays. A backend's own PGPROC is more likely to be in a local cache,
+ * whereas the cachelines for the dense array will be modified by other
+ * backends (often removing it from the cache for other cores/sockets). At
+ * commit/abort time a check of the PGPROC value can avoid accessing/dirtying
+ * the corresponding array value.
+ *
+ * Basically it makes sense to access the PGPROC variable when checking a
+ * single backend's data, especially when already looking at the PGPROC for
+ * other reasons.  It makes sense to look at the "dense" arrays if we
+ * need to look at many / most entries, because we then benefit from the
+ * reduced indirection and better cross-process cache-ability.
+ *
+ * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), the data
+ * in the dense arrays is initialized from the PGPROC while it already holds
+ * ProcArrayLock.
  */
 typedef struct PROC_HDR
 {
@@ -244,6 +311,10 @@ typedef struct PROC_HDR
 	PGPROC	   *allProcs;
 	/* Array of PGXACT structures (not including dummies for prepared txns) */
 	PGXACT	   *allPgXact;
+
+	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
+	TransactionId *xids;
+
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index b25b3e429ed..10848649c0c 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -11,12 +11,12 @@
  * shared buffer content lock on the buffer containing the tuple.
  *
  * NOTE: When using a non-MVCC snapshot, we must check
- * TransactionIdIsInProgress (which looks in the PGXACT array)
+ * TransactionIdIsInProgress (which looks in the PGPROC array)
  * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
  * pg_xact).  Otherwise we have a race condition: we might decide that a
  * just-committed transaction crashed, because none of the tests succeed.
  * xact.c is careful to record commit/abort in pg_xact before it unsets
- * MyPgXact->xid in the PGXACT array.  That fixes that problem, but it
+ * MyProc->xid in the PGPROC array.  That fixes that problem, but it
  * also means there is a window where TransactionIdIsInProgress and
  * TransactionIdDidCommit will both return true.  If we check only
  * TransactionIdDidCommit, we could consider a tuple committed when a
@@ -956,7 +956,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
  * coding where we tried to set the hint bits as soon as possible, we instead
  * did TransactionIdIsInProgress in each call --- to no avail, as long as the
  * inserting/deleting transaction was still running --- which was more cycles
- * and more contention on the PGXACT array.
+ * and more contention on ProcArrayLock.
  */
 static bool
 HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
@@ -1445,7 +1445,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
  *	HeapTupleSatisfiesMVCC) and, therefore, any hint bits that can be set
  *	should already be set.  We assume that if no hint bits are set, the xmin
  *	or xmax transaction is still running.  This is therefore faster than
- *	HeapTupleSatisfiesVacuum, because we don't consult PGXACT nor CLOG.
+ *	HeapTupleSatisfiesVacuum, because we consult neither procarray nor CLOG.
  *	It's okay to return false when in doubt, but we must return true only
  *	if the tuple is removable.
  */
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 94d8f3fd0a2..c46fc3cc194 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -251,10 +251,10 @@ enforce, and it assists with some other issues as explained below.)  The
 implementation of this is that GetSnapshotData takes the ProcArrayLock in
 shared mode (so that multiple backends can take snapshots in parallel),
 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
-while clearing MyPgXact->xid at transaction end (either commit or abort).
-(To reduce context switching, when multiple transactions commit nearly
-simultaneously, we have one backend take ProcArrayLock and clear the XIDs
-of multiple processes at once.)
+while clearing the ProcGlobal->xids[] entry at transaction end (either
+commit or abort). (To reduce context switching, when multiple transactions
+commit nearly simultaneously, we have one backend take ProcArrayLock and
+clear the XIDs of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
 latestCompletedFullXid variable.  This allows GetSnapshotData to use
@@ -278,12 +278,13 @@ present in the ProcArray, or not running anymore.  (This guarantee doesn't
 apply to subtransaction XIDs, because of the possibility that there's not
 room for them in the subxid array; instead we guarantee that they are
 present or the overflow flag is set.)  If a backend released XidGenLock
-before storing its XID into MyPgXact, then it would be possible for another
-backend to allocate and commit a later XID, causing latestCompletedFullXid to
-pass the first backend's XID, before that value became visible in the
-ProcArray.  That would break ComputeXidHorizons, as discussed below.
+before storing its XID into ProcGlobal->xids[], then it would be possible for
+another backend to allocate and commit a later XID, causing
+latestCompletedFullXid to pass the first backend's XID, before that value
+became visible in the ProcArray.  That would break ComputeXidHorizons,
+as discussed below.
 
-We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
+We allow GetNewTransactionId to store the XID into ProcGlobal->xids[] (or the
 subxid array) without taking ProcArrayLock.  This was once necessary to
 avoid deadlock; while that is no longer the case, it's still beneficial for
 performance.  We are thereby relying on fetch/store of an XID to be atomic,
@@ -382,13 +383,13 @@ Top-level transactions do not have a parent, so they leave their pg_subtrans
 entries set to the default value of zero (InvalidTransactionId).
 
 pg_subtrans is used to check whether the transaction in question is still
-running --- the main Xid of a transaction is recorded in the PGXACT struct,
-but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
-in shared memory, so we have to store them on disk.  Note, however, that for
-each transaction we keep a "cache" of Xids that are known to be part of the
-transaction tree, so we can skip looking at pg_subtrans unless we know the
-cache has been overflowed.  See storage/ipc/procarray.c for the gory details.
-
+running --- the main Xid of a transaction is recorded in ProcGlobal->xids[],
+with a copy in PGPROC->xid, but since we allow arbitrary nesting of
+subtransactions, we can't fit all Xids in shared memory, so we have to store
+them on disk.  Note, however, that for each transaction we keep a "cache" of
+Xids that are known to be part of the transaction tree, so we can skip looking
+at pg_subtrans unless we know the cache has been overflowed.  See
+storage/ipc/procarray.c for the gory details.
 slru.c is the supporting mechanism for both pg_xact and pg_subtrans.  It
 implements the LRU policy for in-memory buffer pages.  The high-level routines
 for pg_xact are implemented in transam.c, while the low-level functions are in
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae017..5198a0cef68 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -285,15 +285,15 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 	 * updates for multiple backends so that the number of times XactSLRULock
 	 * needs to be acquired is reduced.
 	 *
-	 * For this optimization to be safe, the XID in MyPgXact and the subxids
-	 * in MyProc must be the same as the ones for which we're setting the
-	 * status.  Check that this is the case.
+	 * For this optimization to be safe, the XID and subxids in MyProc must be
+	 * the same as the ones for which we're setting the status.  Check that
+	 * this is the case.
 	 *
 	 * For this optimization to be efficient, we shouldn't have too many
 	 * sub-XIDs and all of the XIDs for which we're adjusting clog should be
 	 * on the same page.  Check those conditions, too.
 	 */
-	if (all_xact_same_page && xid == MyPgXact->xid &&
+	if (all_xact_same_page && xid == MyProc->xid &&
 		nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT &&
 		nsubxids == MyPgXact->nxids &&
 		memcmp(subxids, MyProc->subxids.xids,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ae7c1a4c172..d073eb07d23 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -351,7 +351,7 @@ AtAbort_Twophase(void)
 
 /*
  * This is called after we have finished transferring state to the prepared
- * PGXACT entry.
+ * PGPROC entry.
  */
 void
 PostPrepare_Twophase(void)
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->waitStatus = PROC_WAIT_STATUS_OK;
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
 	proc->lxid = (LocalTransactionId) xid;
-	pgxact->xid = xid;
+	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
@@ -768,7 +768,6 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 	{
 		GlobalTransaction gxact = &status->array[status->currIdx++];
 		PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 		Datum		values[5];
 		bool		nulls[5];
 		HeapTuple	tuple;
@@ -783,7 +782,7 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 		MemSet(values, 0, sizeof(values));
 		MemSet(nulls, 0, sizeof(nulls));
 
-		values[0] = TransactionIdGetDatum(pgxact->xid);
+		values[0] = TransactionIdGetDatum(proc->xid);
 		values[1] = CStringGetTextDatum(gxact->gid);
 		values[2] = TimestampTzGetDatum(gxact->prepared_at);
 		values[3] = ObjectIdGetDatum(gxact->owner);
@@ -829,9 +828,8 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
-		if (pgxact->xid == xid)
+		if (gxact->xid == xid)
 		{
 			result = gxact;
 			break;
@@ -987,8 +985,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
-	TransactionId xid = pgxact->xid;
+	TransactionId xid = gxact->xid;
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileNode *commitrels;
@@ -1140,15 +1137,15 @@ EndPrepare(GlobalTransaction gxact)
 
 	/*
 	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyPgXact as not running our XID (which it will do immediately after
+	 * MyProc as not running our XID (which it will do immediately after
 	 * this function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
-	 * prepared XID.  This must happen before we clear the XID from MyPgXact,
-	 * else there is a window where the XID is not running according to
-	 * TransactionIdIsInProgress, and onlookers would be entitled to assume
-	 * the xact crashed.  Instead we have a window where the same XID appears
-	 * twice in ProcArray, which is OK.
+	 * prepared XID.  This must happen before we clear the XID from MyProc /
+	 * ProcGlobal->xids[], else there is a window where the XID is not running
+	 * according to TransactionIdIsInProgress, and onlookers would be entitled
+	 * to assume the xact crashed.  Instead we have a window where the same
+	 * XID appears twice in ProcArray, which is OK.
 	 */
 	MarkAsPrepared(gxact, false);
 
@@ -1404,7 +1401,6 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
-	PGXACT	   *pgxact;
 	TransactionId xid;
 	char	   *buf;
 	char	   *bufptr;
@@ -1423,8 +1419,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
-	xid = pgxact->xid;
+	xid = gxact->xid;
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1726,7 +1721,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		/*
-		 * Note that we are using gxact not pgxact so this works in recovery
+		 * Note that we are using gxact not PGPROC so this works in recovery
 		 * also
 		 */
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 66eb74aa9f8..73167054e61 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -38,7 +38,8 @@ VariableCache ShmemVariableCache = NULL;
  * Allocate the next FullTransactionId for a new transaction or
  * subtransaction.
  *
- * The new XID is also stored into MyPgXact before returning.
+ * The new XID is also stored into MyProc->xid/ProcGlobal->xids[] before
+ * returning.
  *
  * Note: when this is called, we are actually already inside a valid
  * transaction, since XIDs are now not allocated until the transaction
@@ -65,7 +66,8 @@ GetNewTransactionId(bool isSubXact)
 	if (IsBootstrapProcessingMode())
 	{
 		Assert(!isSubXact);
-		MyPgXact->xid = BootstrapTransactionId;
+		MyProc->xid = BootstrapTransactionId;
+		ProcGlobal->xids[MyProc->pgxactoff] = BootstrapTransactionId;
 		return FullTransactionIdFromEpochAndXid(0, BootstrapTransactionId);
 	}
 
@@ -190,10 +192,10 @@ GetNewTransactionId(bool isSubXact)
 	 * latestCompletedFullXid is present in the ProcArray, which is essential
 	 * for correct OldestXmin tracking; see src/backend/access/transam/README.
 	 *
-	 * Note that readers of PGXACT xid fields should be careful to fetch the
-	 * value only once, rather than assume they can read a value multiple
-	 * times and get the same answer each time.  Note we are assuming that
-	 * TransactionId and int fetch/store are atomic.
+	 * Note that readers of ProcGlobal->xids/PGPROC->xid should be careful
+	 * to fetch the value for each proc only once, rather than assume they can
+	 * read a value multiple times and get the same answer each time.  Note we
+	 * are assuming that TransactionId and int fetch/store are atomic.
 	 *
 	 * The same comments apply to the subxact xid count and overflow fields.
 	 *
@@ -219,7 +221,11 @@ GetNewTransactionId(bool isSubXact)
 	 * answer later on when someone does have a reason to inquire.)
 	 */
 	if (!isSubXact)
-		MyPgXact->xid = xid;	/* LWLockRelease acts as barrier */
+	{
+		/* LWLockRelease acts as barrier */
+		MyProc->xid = xid;
+		ProcGlobal->xids[MyProc->pgxactoff] = xid;
+	}
 	else
 	{
 		int			nxids = MyPgXact->nxids;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 22228f5684f..648e12c78d8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1724,7 +1724,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 *
 		 * Note: these flags remain set until CommitTransaction or
 		 * AbortTransaction.  We don't want to clear them until we reset
-		 * MyPgXact->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
+		 * MyProc->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
 		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 164cf0cabc2..eeccb2eac7f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -9,8 +9,9 @@
  * one is as a means of determining the set of currently running transactions.
  *
  * Because of various subtle race conditions it is critical that a backend
- * hold the correct locks while setting or clearing its MyPgXact->xid field.
- * See notes in src/backend/access/transam/README.
+ * hold the correct locks while setting or clearing its xid (in
+ * ProcGlobal->xids[]/MyProc->xid).  See notes in
+ * src/backend/access/transam/README.
  *
  * The process arrays now also include structures representing prepared
  * transactions.  The xid and subxids fields of these are valid, as are the
@@ -435,7 +436,9 @@ ProcArrayAdd(PGPROC *proc)
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
 
+	/* See ProcGlobal comment explaining why both locks are held */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	if (arrayP->numProcs >= arrayP->maxProcs)
 	{
@@ -444,7 +447,6 @@ ProcArrayAdd(PGPROC *proc)
 		 * fixed supply of PGPROC structs too, and so we should have failed
 		 * earlier.)
 		 */
-		LWLockRelease(ProcArrayLock);
 		ereport(FATAL,
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
@@ -470,10 +472,25 @@ ProcArrayAdd(PGPROC *proc)
 	}
 
 	memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index],
-			(arrayP->numProcs - index) * sizeof(int));
+			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
+	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+
 	arrayP->pgprocnos[index] = proc->pgprocno;
+	ProcGlobal->xids[index] = proc->xid;
+
 	arrayP->numProcs++;
 
+	for (; index < arrayP->numProcs; index++)
+	{
+		allProcs[arrayP->pgprocnos[index]].pgxactoff = index;
+	}
+
+	/*
+	 * Release in reversed acquisition order, to reduce frequency of having to
+	 * wait for XidGenLock while holding ProcArrayLock.
+	 */
+	LWLockRelease(XidGenLock);
 	LWLockRelease(ProcArrayLock);
 }
 
@@ -499,36 +516,59 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		DisplayXidCache();
 #endif
 
+	/* See ProcGlobal comment explaining why both locks are held */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
+
+	Assert(ProcGlobal->allProcs[arrayP->pgprocnos[proc->pgxactoff]].pgxactoff == proc->pgxactoff);
 
 	if (TransactionIdIsValid(latestXid))
 	{
-		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
 
 		/* Advance global latestCompletedXid while holding the lock */
 		MaintainLatestCompletedXid(latestXid);
+
+		ProcGlobal->xids[proc->pgxactoff] = InvalidTransactionId;
 	}
 	else
 	{
 		/* Shouldn't be trying to remove a live transaction here */
-		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
 	}
 
+	Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
+
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		if (arrayP->pgprocnos[index] == proc->pgprocno)
 		{
 			/* Keep the PGPROC array sorted. See notes above */
 			memmove(&arrayP->pgprocnos[index], &arrayP->pgprocnos[index + 1],
-					(arrayP->numProcs - index - 1) * sizeof(int));
+					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
+			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+
 			arrayP->pgprocnos[arrayP->numProcs - 1] = -1;	/* for debugging */
 			arrayP->numProcs--;
+
+			for (; index < arrayP->numProcs; index++)
+			{
+				allProcs[arrayP->pgprocnos[index]].pgxactoff--;
+			}
+
+			/*
+			 * Release in reversed acquisition order, to reduce frequency of
+			 * having to wait for XidGenLock while holding ProcArrayLock.
+			 */
+			LWLockRelease(XidGenLock);
 			LWLockRelease(ProcArrayLock);
 			return;
 		}
 	}
 
 	/* Oops */
+	LWLockRelease(XidGenLock);
 	LWLockRelease(ProcArrayLock);
 
 	elog(LOG, "failed to find proc %p in ProcArray", proc);
@@ -561,7 +601,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * else is taking a snapshot.  See discussion in
 		 * src/backend/access/transam/README.
 		 */
-		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(TransactionIdIsValid(proc->xid));
 
 		/*
 		 * If we can immediately acquire ProcArrayLock, we clear our own XID
@@ -583,7 +623,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * anyone else's calculation of a snapshot.  We might change their
 		 * estimate of global xmin, but that's OK.
 		 */
-		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(!TransactionIdIsValid(proc->xid));
 
 		proc->lxid = InvalidLocalTransactionId;
 		/* must be cleared with xid/xmin: */
@@ -606,7 +646,13 @@ static inline void
 ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 								TransactionId latestXid)
 {
-	pgxact->xid = InvalidTransactionId;
+	size_t		pgxactoff = proc->pgxactoff;
+
+	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
+	Assert(ProcGlobal->xids[pgxactoff] == proc->xid);
+
+	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
+	proc->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
@@ -642,7 +688,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	uint32		wakeidx;
 
 	/* We should definitely have an XID to clear. */
-	Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+	Assert(TransactionIdIsValid(proc->xid));
 
 	/* Add ourselves to the list of processes needing a group XID clear. */
 	proc->procArrayGroupMember = true;
@@ -747,20 +793,28 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
  * This is used after successfully preparing a 2-phase transaction.  We are
  * not actually reporting the transaction's XID as no longer running --- it
  * will still appear as running because the 2PC's gxact is in the ProcArray
- * too.  We just have to clear out our own PGXACT.
+ * too.  We just have to clear out our own PGPROC.
  */
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
 	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
+	size_t		pgxactoff;
 
 	/*
-	 * We can skip locking ProcArrayLock here, because this action does not
-	 * actually change anyone's view of the set of running XIDs: our entry is
-	 * duplicate with the gxact that has already been inserted into the
-	 * ProcArray.
+	 * We can skip locking ProcArrayLock exclusively here, because this action
+	 * does not actually change anyone's view of the set of running XIDs: our
+	 * entry is a duplicate of the gxact that has already been inserted into
+	 * the ProcArray.  We still need the lock in shared mode so that
+	 * proc->pgxactoff stays the same.
 	 */
-	pgxact->xid = InvalidTransactionId;
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	pgxactoff = proc->pgxactoff;
+
+	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
+	proc->xid = InvalidTransactionId;
+
 	proc->lxid = InvalidLocalTransactionId;
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
@@ -772,6 +826,8 @@ ProcArrayClearTransaction(PGPROC *proc)
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
 	pgxact->overflowed = false;
+
+	LWLockRelease(ProcArrayLock);
 }
 
 /*
@@ -1166,7 +1222,7 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * there are four possibilities for finding a running transaction:
  *
  * 1. The given Xid is a main transaction Id.  We will find this out cheaply
- * by looking at the PGXACT struct for each backend.
+ * by looking at ProcGlobal->xids.
  *
  * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
@@ -1175,25 +1231,27 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * if the Xid is running on the primary.
  *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to PGXACT or KnownAssignedXids.  This is the
- * slowest way, but sadly it has to be done always if the others failed,
- * unless we see that the cached subxact sets are complete (none have
+ * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * This is the slowest way, but sadly it has to be done always if the others
+ * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
  * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
  * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
- * PGXACT again anyway; see GetNewTransactionId).
+ * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
 bool
 TransactionIdIsInProgress(TransactionId xid)
 {
 	static TransactionId *xids = NULL;
+	static TransactionId *other_xids;
 	int			nxids = 0;
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId topxid;
-	int			i,
-				j;
+	int			mypgxactoff;
+	size_t		numProcs;
+	int			j;
 
 	/*
 	 * Don't bother checking a transaction older than RecentXmin; it could not
@@ -1248,6 +1306,8 @@ TransactionIdIsInProgress(TransactionId xid)
 					 errmsg("out of memory")));
 	}
 
+	other_xids = ProcGlobal->xids;
+
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
@@ -1263,20 +1323,22 @@ TransactionIdIsInProgress(TransactionId xid)
 	}
 
 	/* No shortcuts, gotta grovel through the array */
-	for (i = 0; i < arrayP->numProcs; i++)
+	mypgxactoff = MyProc->pgxactoff;
+	numProcs = arrayP->numProcs;
+	for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 	{
-		int			pgprocno = arrayP->pgprocnos[i];
-		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		int			pgprocno;
+		PGXACT	   *pgxact;
+		PGPROC	   *proc;
 		TransactionId pxid;
 		int			pxids;
 
-		/* Ignore my own proc --- dealt with it above */
-		if (proc == MyProc)
+		/* Ignore ourselves --- dealt with it above */
+		if (pgxactoff == mypgxactoff)
 			continue;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = UINT32_ACCESS_ONCE(pgxact->xid);
+		pxid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -1301,8 +1363,12 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
+		pgprocno = arrayP->pgprocnos[pgxactoff];
+		pgxact = &allPgXact[pgprocno];
 		pxids = pgxact->nxids;
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
+		pgprocno = arrayP->pgprocnos[pgxactoff];
+		proc = &allProcs[pgprocno];
 		for (j = pxids - 1; j >= 0; j--)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
@@ -1333,7 +1399,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (RecoveryInProgress())
 	{
-		/* none of the PGXACT entries should have XIDs in hot standby mode */
+		/* none of the PGPROC entries should have XIDs in hot standby mode */
 		Assert(nxids == 0);
 
 		if (KnownAssignedXidExists(xid))
@@ -1388,7 +1454,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	Assert(TransactionIdIsValid(topxid));
 	if (!TransactionIdEquals(topxid, xid))
 	{
-		for (i = 0; i < nxids; i++)
+		for (int i = 0; i < nxids; i++)
 		{
 			if (TransactionIdEquals(xids[i], topxid))
 				return true;
@@ -1411,6 +1477,7 @@ TransactionIdIsActive(TransactionId xid)
 {
 	bool		result = false;
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	int			i;
 
 	/*
@@ -1426,11 +1493,10 @@ TransactionIdIsActive(TransactionId xid)
 	{
 		int			pgprocno = arrayP->pgprocnos[i];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId pxid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = UINT32_ACCESS_ONCE(pgxact->xid);
+		pxid = UINT32_ACCESS_ONCE(other_xids[i]);
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -1516,6 +1582,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId kaxmin;
 	bool		in_recovery = RecoveryInProgress();
+	TransactionId *other_xids = ProcGlobal->xids;
 
 	/* inferred after ProcArrayLock is released */
 	h->catalog_oldest_nonremovable = InvalidTransactionId;
@@ -1559,7 +1626,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		TransactionId xmin;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[proc->pgxactoff]);
 		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*
@@ -1850,14 +1917,17 @@ Snapshot
 GetSnapshotData(Snapshot snapshot)
 {
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	TransactionId xmin;
 	TransactionId xmax;
-	int			index;
-	int			count = 0;
+	size_t		count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
 	FullTransactionId latest_completed;
 	TransactionId oldestxid;
+	int			mypgxactoff;
+	TransactionId myxid;
+
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1902,6 +1972,10 @@ GetSnapshotData(Snapshot snapshot)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	latest_completed = ShmemVariableCache->latestCompletedFullXid;
+	mypgxactoff = MyProc->pgxactoff;
+	myxid = other_xids[mypgxactoff];
+	Assert(myxid == MyProc->xid);
+
 	oldestxid = ShmemVariableCache->oldestXid;
 
 	/* xmax is always latestCompletedFullXid + 1 */
@@ -1912,57 +1986,79 @@ GetSnapshotData(Snapshot snapshot)
 	/* initialize xmin calculation with xmax */
 	xmin = xmax;
 
+	/* take own xid into account, saves a check inside the loop */
+	if (TransactionIdIsNormal(myxid) && NormalTransactionIdPrecedes(myxid, xmin))
+		xmin = myxid;
+
 	snapshot->takenDuringRecovery = RecoveryInProgress();
 
 	if (!snapshot->takenDuringRecovery)
 	{
+		size_t		numProcs = arrayP->numProcs;
+		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
-		int			numProcs;
 
 		/*
-		 * Spin over procArray checking xid, xmin, and subxids.  The goal is
-		 * to gather all active xids, find the lowest xmin, and try to record
-		 * subxids.
+		 * First collect set of pgxactoff/xids that need to be included in the
+		 * snapshot.
 		 */
-		numProcs = arrayP->numProcs;
-		for (index = 0; index < numProcs; index++)
+		for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 		{
-			int			pgprocno = pgprocnos[index];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			TransactionId xid;
+			/* Fetch xid just once - see GetNewTransactionId */
+			TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
+			int			pgprocno;
+			PGXACT	   *pgxact;
+			uint8		vacuumFlags;
+
+			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+
+			/*
+			 * If the transaction has no XID assigned, we can skip it; it
+			 * won't have sub-XIDs either.
+			 */
+			if (likely(xid == InvalidTransactionId))
+				continue;
+
+			/*
+			 * We don't include our own XIDs (if any) in the snapshot. It
+			 * needs to be included in the xmin computation, but we did so
+			 * outside the loop.
+			 */
+			if (pgxactoff == mypgxactoff)
+				continue;
+
+			/*
+			 * The only way we are able to get here with a non-normal xid
+			 * is during bootstrap - with this backend using
+			 * BootstrapTransactionId. But the above test should filter
+			 * that out.
+			 */
+			Assert(TransactionIdIsNormal(xid));
+
+			/*
+			 * If the XID is >= xmax, we can skip it; such transactions will
+			 * be treated as running anyway (and any sub-XIDs will also be >=
+			 * xmax).
+			 */
+			if (!NormalTransactionIdPrecedes(xid, xmax))
+				continue;
+
+			pgprocno = pgprocnos[pgxactoff];
+			pgxact = &allPgXact[pgprocno];
+			vacuumFlags = pgxact->vacuumFlags;
 
 			/*
 			 * Skip over backends doing logical decoding which manages xmin
 			 * separately (check below) and ones running LAZY VACUUM.
 			 */
-			if (pgxact->vacuumFlags &
-				(PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
+			if (vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
-			/* Fetch xid just once - see GetNewTransactionId */
-			xid = UINT32_ACCESS_ONCE(pgxact->xid);
-
-			/*
-			 * If the transaction has no XID assigned, we can skip it; it
-			 * won't have sub-XIDs either.  If the XID is >= xmax, we can also
-			 * skip it; such transactions will be treated as running anyway
-			 * (and any sub-XIDs will also be >= xmax).
-			 */
-			if (!TransactionIdIsNormal(xid)
-				|| !NormalTransactionIdPrecedes(xid, xmax))
-				continue;
-
-			/*
-			 * We don't include our own XIDs (if any) in the snapshot, but we
-			 * must include them in xmin.
-			 */
 			if (NormalTransactionIdPrecedes(xid, xmin))
 				xmin = xid;
-			if (pgxact == MyPgXact)
-				continue;
 
 			/* Add XID to snapshot. */
-			snapshot->xip[count++] = xid;
+			xip[count++] = xid;
 
 			/*
 			 * Save subtransaction XIDs if possible (if we've already
@@ -1985,9 +2081,9 @@ GetSnapshotData(Snapshot snapshot)
 					suboverflowed = true;
 				else
 				{
-					int			nxids = pgxact->nxids;
+					int			nsubxids = pgxact->nxids;
 
-					if (nxids > 0)
+					if (nsubxids > 0)
 					{
 						PGPROC	   *proc = &allProcs[pgprocno];
 
@@ -1995,8 +2091,8 @@ GetSnapshotData(Snapshot snapshot)
 
 						memcpy(snapshot->subxip + subcount,
 							   (void *) proc->subxids.xids,
-							   nxids * sizeof(TransactionId));
-						subcount += nxids;
+							   nsubxids * sizeof(TransactionId));
+						subcount += nsubxids;
 					}
 				}
 			}
@@ -2128,6 +2224,7 @@ GetSnapshotData(Snapshot snapshot)
 	}
 
 	RecentXmin = xmin;
+	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
 
 	snapshot->xmin = xmin;
 	snapshot->xmax = xmax;
@@ -2290,7 +2387,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * GetRunningTransactionData -- returns information about running transactions.
  *
  * Similar to GetSnapshotData but returns more information. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes and
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes and
  * prepared transactions.
  *
  * We acquire XidGenLock and ProcArrayLock, but the caller is responsible for
@@ -2305,7 +2402,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * This is never executed during recovery so there is no need to look at
  * KnownAssignedXids.
  *
- * Dummy PGXACTs from prepared transaction are included, meaning that this
+ * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
  * entries here to not hold on ProcArrayLock more than necessary.
@@ -2324,6 +2421,7 @@ GetRunningTransactionData(void)
 	static RunningTransactionsData CurrentRunningXactsData;
 
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	RunningTransactions CurrentRunningXacts = &CurrentRunningXactsData;
 	TransactionId latestCompletedXid;
 	TransactionId oldestRunningXid;
@@ -2384,7 +2482,7 @@ GetRunningTransactionData(void)
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 		/*
 		 * We don't need to store transactions that don't have a TransactionId
@@ -2481,7 +2579,7 @@ GetRunningTransactionData(void)
  * GetOldestActiveTransactionId()
  *
  * Similar to GetSnapshotData but returns just oldestActiveXid. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes.
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes.
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
@@ -2496,6 +2594,7 @@ TransactionId
 GetOldestActiveTransactionId(void)
 {
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	TransactionId oldestRunningXid;
 	int			index;
 
@@ -2518,12 +2617,10 @@ GetOldestActiveTransactionId(void)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		int			pgprocno = arrayP->pgprocnos[index];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 		if (!TransactionIdIsNormal(xid))
 			continue;
@@ -2601,8 +2698,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 * If we're not in recovery, we walk over the procarray and collect the
 	 * lowest xid. Since we're called with ProcArrayLock held and have
 	 * acquired XidGenLock, no entries can vanish concurrently, since
-	 * PGXACT->xid is only set with XidGenLock held and only cleared with
-	 * ProcArrayLock held.
+	 * ProcGlobal->xids[i] is only set with XidGenLock held and only cleared
+	 * with ProcArrayLock held.
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
@@ -2611,17 +2708,17 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 */
 	if (!recovery_in_progress)
 	{
+		TransactionId *other_xids = ProcGlobal->xids;
+
 		/*
-		 * Spin over procArray collecting all min(PGXACT->xid)
+		 * Spin over procArray collecting min(ProcGlobal->xids[i])
 		 */
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
-			int			pgprocno = arrayP->pgprocnos[index];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
 			TransactionId xid;
 
 			/* Fetch xid just once - see GetNewTransactionId */
-			xid = UINT32_ACCESS_ONCE(pgxact->xid);
+			xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 			if (!TransactionIdIsNormal(xid))
 				continue;
@@ -2809,6 +2906,7 @@ BackendXidGetPid(TransactionId xid)
 {
 	int			result = 0;
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	int			index;
 
 	if (xid == InvalidTransactionId)	/* never match invalid xid */
@@ -2820,9 +2918,8 @@ BackendXidGetPid(TransactionId xid)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
-		if (pgxact->xid == xid)
+		if (other_xids[index] == xid)
 		{
 			result = proc->pid;
 			break;
@@ -3102,7 +3199,6 @@ MinimumActiveBackends(int min)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
 		/*
 		 * Since we're not holding a lock, need to be prepared to deal with
@@ -3119,7 +3215,7 @@ MinimumActiveBackends(int min)
 			continue;			/* do not count deleted entries */
 		if (proc == MyProc)
 			continue;			/* do not count myself */
-		if (pgxact->xid == InvalidTransactionId)
+		if (proc->xid == InvalidTransactionId)
 			continue;			/* do not count if no XID assigned */
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
@@ -3545,8 +3641,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 *
 	 * Note that we do not have to be careful about memory ordering of our own
 	 * reads wrt. GetNewTransactionId() here - only this process can modify
-	 * relevant fields of MyProc/MyPgXact.  But we do have to be careful about
-	 * our own writes being well ordered.
+	 * relevant fields of MyProc/ProcGlobal->xids[].  But we do have to be
+	 * careful about our own writes being well ordered.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
@@ -3904,7 +4000,7 @@ FullXidViaRelative(FullTransactionId rel, TransactionId xid)
  * In Hot Standby mode, we maintain a list of transactions that are (or were)
  * running on the primary at the current point in WAL.  These XIDs must be
  * treated as running by standby transactions, even though they are not in
- * the standby server's PGXACT array.
+ * the standby server's PGPROC array.
  *
  * We record all XIDs that we know have been assigned.  That includes all the
  * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index ad048bc85fa..a9477ccb4a3 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -417,9 +417,7 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 
 		if (proc != NULL)
 		{
-			PGXACT	   *xact = &ProcGlobal->allPgXact[proc->pgprocno];
-
-			*xid = xact->xid;
+			*xid = proc->xid;
 			*xmin = proc->xmin;
 		}
 	}
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 95989ce79bd..d86566f4554 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3974,9 +3974,8 @@ GetRunningTransactionLocks(int *nlocks)
 			proclock->tag.myLock->tag.locktag_type == LOCKTAG_RELATION)
 		{
 			PGPROC	   *proc = proclock->tag.myProc;
-			PGXACT	   *pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
 			LOCK	   *lock = proclock->tag.myLock;
-			TransactionId xid = pgxact->xid;
+			TransactionId xid = proc->xid;
 
 			/*
 			 * Don't record locks for transactions if we know they have
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index de346cd87fc..7fad49544ce 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -102,21 +102,18 @@ Size
 ProcGlobalShmemSize(void)
 {
 	Size		size = 0;
+	Size		TotalProcs =
+		add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
 
 	/* ProcGlobal */
 	size = add_size(size, sizeof(PROC_HDR));
-	/* MyProcs, including autovacuum workers and launcher */
-	size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
-	/* AuxiliaryProcs */
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
-	/* Prepared xacts */
-	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
-	/* ProcStructLock */
+	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, sizeof(slock_t));
 
 	size = add_size(size, mul_size(MaxBackends, sizeof(PGXACT)));
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
 
 	return size;
 }
@@ -216,6 +213,17 @@ InitProcGlobal(void)
 	MemSet(pgxacts, 0, TotalProcs * sizeof(PGXACT));
 	ProcGlobal->allPgXact = pgxacts;
 
+	/*
+	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
+	 * PROC_HDR.
+	 *
+	 * XXX: It might make sense to increase padding for these arrays, given
+	 * how hotly they are accessed.
+	 */
+	ProcGlobal->xids =
+		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+
 	for (i = 0; i < TotalProcs; i++)
 	{
 		/* Common initialization for all PGPROCs, regardless of type. */
@@ -387,7 +395,7 @@ InitProcess(void)
 	MyProc->lxid = InvalidLocalTransactionId;
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
-	MyPgXact->xid = InvalidTransactionId;
+	MyProc->xid = InvalidTransactionId;
 	MyProc->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
@@ -571,7 +579,7 @@ InitAuxiliaryProcess(void)
 	MyProc->lxid = InvalidLocalTransactionId;
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
-	MyPgXact->xid = InvalidTransactionId;
+	MyProc->xid = InvalidTransactionId;
 	MyProc->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
-- 
2.25.0.114.g5b0ca878e0
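
The dense-array idea in the patch above can be illustrated outside of PostgreSQL. What follows is only a minimal sketch, not the patch's code: `DemoProc`, `DemoProcArray`, `demo_add()`, `demo_remove()`, `demo_oldest_xid()` and `MAX_PROCS` are hypothetical names invented for illustration. It assumes no locking and compares xids with a plain `<` rather than wraparound-aware logic. It shows the hot field being mirrored into a contiguous array that is shifted with `memmove()` on insert and removal, so a scan touches only that array instead of one large struct per backend.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROCS 8             /* hypothetical capacity for this sketch */

typedef uint32_t TransactionId; /* 0 stands in for InvalidTransactionId */

/* Hypothetical stand-in for PGPROC: a large struct with one hot field. */
typedef struct DemoProc
{
    TransactionId xid;          /* per-backend copy of the xid */
    int         pgxactoff;      /* offset into the dense arrays, -1 if absent */
    char        cold[120];      /* stands in for rarely-read fields */
} DemoProc;

typedef struct DemoProcArray
{
    int         numProcs;
    int         procnos[MAX_PROCS];     /* dense offset -> slot in procs[] */
    TransactionId xids[MAX_PROCS];      /* dense mirror of DemoProc.xid */
    DemoProc    procs[MAX_PROCS];
} DemoProcArray;

/* Insert a backend, keeping the dense arrays sorted by procno. */
void
demo_add(DemoProcArray *a, int procno, TransactionId xid)
{
    int         index;
    int         i;

    assert(a->numProcs < MAX_PROCS);

    for (index = 0; index < a->numProcs; index++)
        if (a->procnos[index] > procno)
            break;

    /* shift both arrays in lockstep so offsets stay aligned */
    memmove(&a->procnos[index + 1], &a->procnos[index],
            (a->numProcs - index) * sizeof(*a->procnos));
    memmove(&a->xids[index + 1], &a->xids[index],
            (a->numProcs - index) * sizeof(*a->xids));

    a->procs[procno].xid = xid;
    a->procnos[index] = procno;
    a->xids[index] = xid;
    a->numProcs++;

    /* every entry at or after index now lives at a new offset */
    for (i = index; i < a->numProcs; i++)
        a->procs[a->procnos[i]].pgxactoff = i;
}

/* Remove a backend and close the gap in both dense arrays. */
void
demo_remove(DemoProcArray *a, int procno)
{
    int         index = a->procs[procno].pgxactoff;
    int         i;

    assert(index >= 0 && index < a->numProcs);
    assert(a->procnos[index] == procno);

    memmove(&a->procnos[index], &a->procnos[index + 1],
            (a->numProcs - index - 1) * sizeof(*a->procnos));
    memmove(&a->xids[index], &a->xids[index + 1],
            (a->numProcs - index - 1) * sizeof(*a->xids));
    a->numProcs--;

    a->procs[procno].xid = 0;
    a->procs[procno].pgxactoff = -1;
    for (i = index; i < a->numProcs; i++)
        a->procs[a->procnos[i]].pgxactoff = i;
}

/* Scan only the dense array; the large DemoProc structs are never touched. */
TransactionId
demo_oldest_xid(const DemoProcArray *a)
{
    TransactionId result = 0;
    int         i;

    for (i = 0; i < a->numProcs; i++)
    {
        TransactionId xid = a->xids[i];

        if (xid == 0)
            continue;           /* backend has no xid assigned */
        if (result == 0 || xid < result)
            result = xid;       /* simplification: ignores xid wraparound */
    }
    return result;
}
```

A caller would zero-initialize a `DemoProcArray`, register backends with `demo_add()`, and call `demo_oldest_xid()` to walk the mirror. The shifting makes add and remove more expensive, which seems an acceptable trade when scans vastly outnumber connects and disconnects.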

v12-0005-snapshot-scalability-Move-PGXACT-vacuumFlags-to-.patch (text/x-diff; charset=us-ascii)
From 8efd6af3207c233223fdf55805128c3441795618 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v12 5/7] snapshot scalability: Move PGXACT->vacuumFlags to
 ProcGlobal->vacuumFlags.

Similar to the previous commit this increases the chance that data
frequently needed by GetSnapshotData() stays in L2 cache. As we now
take care to not unnecessarily write to ProcGlobal->vacuumFlags, there
should be very few modifications to the ProcGlobal->vacuumFlags array.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h                | 12 ++++-
 src/backend/access/transam/twophase.c     |  2 +-
 src/backend/commands/analyze.c            | 10 ++--
 src/backend/commands/vacuum.c             |  5 +-
 src/backend/postmaster/autovacuum.c       |  6 +--
 src/backend/replication/logical/logical.c |  3 +-
 src/backend/replication/slot.c            |  3 +-
 src/backend/storage/ipc/procarray.c       | 66 ++++++++++++++---------
 src/backend/storage/lmgr/deadlock.c       |  4 +-
 src/backend/storage/lmgr/proc.c           | 16 +++---
 10 files changed, 79 insertions(+), 48 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index b828cecd185..ffb775939ed 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -41,7 +41,7 @@ struct XidCache
 };
 
 /*
- * Flags for PGXACT->vacuumFlags
+ * Flags for ProcGlobal->vacuumFlags[]
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -168,6 +168,9 @@ struct PGPROC
 
 	bool		delayChkpt;		/* true if this proc delays checkpoint start */
 
+	uint8		vacuumFlags;    /* this backend's vacuum flags, see PROC_*
+								 * above. mirrored in
+								 * ProcGlobal->vacuumFlags[pgxactoff] */
 	/*
 	 * Info to allow us to wait for synchronous replication, if needed.
 	 * waitLSN is InvalidXLogRecPtr if not waiting; set only by user backend.
@@ -245,7 +248,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
 	uint8		nxids;
@@ -315,6 +317,12 @@ typedef struct PROC_HDR
 	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
 	TransactionId *xids;
 
+	/*
+	 * Array mirroring PGPROC.vacuumFlags for each PGPROC currently in the
+	 * procarray.
+	 */
+	uint8	   *vacuumFlags;
+
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d073eb07d23..3371ebd8896 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -466,7 +466,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
-	pgxact->vacuumFlags = 0;
+	proc->vacuumFlags = 0;
 	proc->pid = 0;
 	proc->backendId = InvalidBackendId;
 	proc->databaseId = databaseid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 34b71b6c1c5..2c1b956b76b 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -250,7 +250,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 	 * OK, let's do it.  First let other backends know I'm in ANALYZE.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags |= PROC_IN_ANALYZE;
+	MyProc->vacuumFlags |= PROC_IN_ANALYZE;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 	LWLockRelease(ProcArrayLock);
 	pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE,
 								  RelationGetRelid(onerel));
@@ -281,11 +282,12 @@ analyze_rel(Oid relid, RangeVar *relation,
 	pgstat_progress_end_command();
 
 	/*
-	 * Reset my PGXACT flag.  Note: we need this here, and not in vacuum_rel,
-	 * because the vacuum flag is cleared by the end-of-xact code.
+	 * Reset the vacuumFlags we set earlier.  Note: we need this here, not in
+	 * vacuum_rel, because the vacuum flag is cleared by the end-of-xact code.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags &= ~PROC_IN_ANALYZE;
+	MyProc->vacuumFlags &= ~PROC_IN_ANALYZE;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 648e12c78d8..aba13c31d1b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1728,9 +1728,10 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
+		MyProc->vacuumFlags |= PROC_IN_VACUUM;
 		if (params->is_wraparound)
-			MyPgXact->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+			MyProc->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 		LWLockRelease(ProcArrayLock);
 	}
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ac97e28be19..c6ec657a936 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2493,7 +2493,7 @@ do_autovacuum(void)
 						   tab->at_datname, tab->at_nspname, tab->at_relname);
 			EmitErrorReport();
 
-			/* this resets the PGXACT flags too */
+			/* this resets ProcGlobal->vacuumFlags[i] too */
 			AbortOutOfAnyTransaction();
 			FlushErrorState();
 			MemoryContextResetAndDeleteChildren(PortalContext);
@@ -2509,7 +2509,7 @@ do_autovacuum(void)
 
 		did_vacuum = true;
 
-		/* the PGXACT flags are reset at the next end of transaction */
+		/* ProcGlobal->vacuumFlags[i] are reset at the next end of xact */
 
 		/* be tidy */
 deleted:
@@ -2686,7 +2686,7 @@ perform_work_item(AutoVacuumWorkItem *workitem)
 				   cur_datname, cur_nspname, cur_relname);
 		EmitErrorReport();
 
-		/* this resets the PGXACT flags too */
+		/* this resets ProcGlobal->vacuumFlags[i] too */
 		AbortOutOfAnyTransaction();
 		FlushErrorState();
 		MemoryContextResetAndDeleteChildren(PortalContext);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0e..b416562ee2a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -163,7 +163,8 @@ StartupDecodingContext(List *output_plugin_options,
 	if (!IsTransactionOrTransactionBlock())
 	{
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyPgXact->vacuumFlags |= PROC_IN_LOGICAL_DECODING;
+		MyProc->vacuumFlags |= PROC_IN_LOGICAL_DECODING;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 		LWLockRelease(ProcArrayLock);
 	}
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 57bbb6288c6..ca46256f9d0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -520,7 +520,8 @@ ReplicationSlotRelease(void)
 
 	/* might not have been set when we've been a plain slot */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags &= ~PROC_IN_LOGICAL_DECODING;
+	MyProc->vacuumFlags &= ~PROC_IN_LOGICAL_DECODING;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index eeccb2eac7f..dc46b98f5fd 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -475,9 +475,12 @@ ProcArrayAdd(PGPROC *proc)
 			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
 	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+	memmove(&ProcGlobal->vacuumFlags[index + 1], &ProcGlobal->vacuumFlags[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->vacuumFlags));
 
 	arrayP->pgprocnos[index] = proc->pgprocno;
 	ProcGlobal->xids[index] = proc->xid;
+	ProcGlobal->vacuumFlags[index] = proc->vacuumFlags;
 
 	arrayP->numProcs++;
 
@@ -538,6 +541,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	}
 
 	Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
+	ProcGlobal->vacuumFlags[proc->pgxactoff] = 0;
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
@@ -548,6 +552,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
 			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+			memmove(&ProcGlobal->vacuumFlags[index], &ProcGlobal->vacuumFlags[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->vacuumFlags));
 
 			arrayP->pgprocnos[arrayP->numProcs - 1] = -1;	/* for debugging */
 			arrayP->numProcs--;
@@ -626,14 +632,24 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		Assert(!TransactionIdIsValid(proc->xid));
 
 		proc->lxid = InvalidLocalTransactionId;
-		/* must be cleared with xid/xmin: */
-		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
 		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false;	/* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
 		Assert(pgxact->nxids == 0);
 		Assert(pgxact->overflowed == false);
+
+		/* must be cleared with xid/xmin: */
+		/* avoid unnecessarily dirtying shared cachelines */
+		if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
+		{
+			Assert(!LWLockHeldByMe(ProcArrayLock));
+			LWLockAcquire(ProcArrayLock, LW_SHARED);
+			Assert(proc->vacuumFlags == ProcGlobal->vacuumFlags[proc->pgxactoff]);
+			proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+			ProcGlobal->vacuumFlags[proc->pgxactoff] = proc->vacuumFlags;
+			LWLockRelease(ProcArrayLock);
+		}
 	}
 }
 
@@ -654,12 +670,18 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
 	proc->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	/* must be cleared with xid/xmin: */
-	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
 	proc->xmin = InvalidTransactionId;
 	proc->delayChkpt = false;	/* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
+	/* must be cleared with xid/xmin: */
+	/* avoid unnecessarily dirtying shared cachelines */
+	if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
+	{
+		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+		ProcGlobal->vacuumFlags[proc->pgxactoff] = proc->vacuumFlags;
+	}
+
 	/* Clear the subtransaction-XID cache too while holding the lock */
 	pgxact->nxids = 0;
 	pgxact->overflowed = false;
@@ -819,9 +841,8 @@ ProcArrayClearTransaction(PGPROC *proc)
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
-	/* redundant, but just in case */
-	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-	proc->delayChkpt = false;
+	Assert(!(proc->vacuumFlags & PROC_VACUUM_STATE_MASK));
+	Assert(!proc->delayChkpt);
 
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
@@ -1621,7 +1642,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 		TransactionId xid;
 		TransactionId xmin;
 
@@ -1638,10 +1659,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 */
 		xmin = TransactionIdOlder(xmin, xid);
 
-		/* if neither is set, this proc doesn't influence the horizon */
-		if (!TransactionIdIsValid(xmin))
-			continue;
-
 		/*
 		 * Don't ignore any procs when determining which transactions might be
 		 * considered running.  While slots should ensure logical decoding
@@ -1656,7 +1673,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * removed, as long as pg_subtrans is not truncated) or doing logical
 		 * decoding (which manages xmin separately, check below).
 		 */
-		if (pgxact->vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
+		if (vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
 			continue;
 
 		/* shared tables need to take backends in all database into account */
@@ -1997,6 +2014,7 @@ GetSnapshotData(Snapshot snapshot)
 		size_t		numProcs = arrayP->numProcs;
 		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
+		uint8	   *allVacuumFlags = ProcGlobal->vacuumFlags;
 
 		/*
 		 * First collect set of pgxactoff/xids that need to be included in the
@@ -2006,8 +2024,6 @@ GetSnapshotData(Snapshot snapshot)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
 			TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
-			int			pgprocno;
-			PGXACT	   *pgxact;
 			uint8		vacuumFlags;
 
 			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
@@ -2043,14 +2059,11 @@ GetSnapshotData(Snapshot snapshot)
 			if (!NormalTransactionIdPrecedes(xid, xmax))
 				continue;
 
-			pgprocno = pgprocnos[pgxactoff];
-			pgxact = &allPgXact[pgprocno];
-			vacuumFlags = pgxact->vacuumFlags;
-
 			/*
 			 * Skip over backends doing logical decoding which manages xmin
 			 * separately (check below) and ones running LAZY VACUUM.
 			 */
+			vacuumFlags = allVacuumFlags[pgxactoff];
 			if (vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
@@ -2077,6 +2090,9 @@ GetSnapshotData(Snapshot snapshot)
 			 */
 			if (!suboverflowed)
 			{
+				int			pgprocno = pgprocnos[pgxactoff];
+				PGXACT	   *pgxact = &allPgXact[pgprocno];
+
 				if (pgxact->overflowed)
 					suboverflowed = true;
 				else
@@ -2295,11 +2311,11 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 		TransactionId xid;
 
 		/* Ignore procs running LAZY VACUUM */
-		if (pgxact->vacuumFlags & PROC_IN_VACUUM)
+		if (vacuumFlags & PROC_IN_VACUUM)
 			continue;
 
 		/* We are only interested in the specific virtual transaction. */
@@ -2989,12 +3005,12 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 
 		if (proc == MyProc)
 			continue;
 
-		if (excludeVacuum & pgxact->vacuumFlags)
+		if (excludeVacuum & vacuumFlags)
 			continue;
 
 		if (allDbs || proc->databaseId == MyDatabaseId)
@@ -3409,7 +3425,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
 			PGPROC	   *proc = &allProcs[pgprocno];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
+			uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 
 			if (proc->databaseId != databaseId)
 				continue;
@@ -3423,7 +3439,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 			else
 			{
 				(*nbackends)++;
-				if ((pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				if ((vacuumFlags & PROC_IS_AUTOVACUUM) &&
 					nautovacs < MAXAUTOVACPIDS)
 					autovac_pids[nautovacs++] = proc->pid;
 			}
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index beedc7947db..e1246b8a4da 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -544,7 +544,6 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 {
 	PGPROC	   *proc;
 	LOCK	   *lock = checkProc->waitLock;
-	PGXACT	   *pgxact;
 	PROCLOCK   *proclock;
 	SHM_QUEUE  *procLocks;
 	LockMethod	lockMethodTable;
@@ -582,7 +581,6 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 		PGPROC	   *leader;
 
 		proc = proclock->tag.myProc;
-		pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
 		leader = proc->lockGroupLeader == NULL ? proc : proc->lockGroupLeader;
 
 		/* A proc never blocks itself or any other lock group member */
@@ -630,7 +628,7 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 					 * ProcArrayLock.
 					 */
 					if (checkProc == MyProc &&
-						pgxact->vacuumFlags & PROC_IS_AUTOVACUUM)
+						proc->vacuumFlags & PROC_IS_AUTOVACUUM)
 						blocking_autovacuum_proc = proc;
 
 					/* We're done looking at this proclock */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 7fad49544ce..f6113b2d243 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -114,6 +114,7 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
 
 	return size;
 }
@@ -223,6 +224,8 @@ InitProcGlobal(void)
 	ProcGlobal->xids =
 		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->vacuumFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
+	MemSet(ProcGlobal->vacuumFlags, 0, TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 
 	for (i = 0; i < TotalProcs; i++)
 	{
@@ -405,10 +408,10 @@ InitProcess(void)
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
 	MyProc->delayChkpt = false;
-	MyPgXact->vacuumFlags = 0;
+	MyProc->vacuumFlags = 0;
 	/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
 	if (IsAutoVacuumWorkerProcess())
-		MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
+		MyProc->vacuumFlags |= PROC_IS_AUTOVACUUM;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
 	MyProc->waitLock = NULL;
@@ -587,7 +590,7 @@ InitAuxiliaryProcess(void)
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
 	MyProc->delayChkpt = false;
-	MyPgXact->vacuumFlags = 0;
+	MyProc->vacuumFlags = 0;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
 	MyProc->waitLock = NULL;
@@ -1323,7 +1326,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		if (deadlock_state == DS_BLOCKED_BY_AUTOVACUUM && allow_autovacuum_cancel)
 		{
 			PGPROC	   *autovac = GetBlockingAutoVacuumPgproc();
-			PGXACT	   *autovac_pgxact = &ProcGlobal->allPgXact[autovac->pgprocno];
+			uint8		vacuumFlags;
 
 			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
@@ -1331,8 +1334,9 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 * Only do it if the worker is not working to protect against Xid
 			 * wraparound.
 			 */
-			if ((autovac_pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
-				!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
+			vacuumFlags = ProcGlobal->vacuumFlags[autovac->pgxactoff];
+			if ((vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				!(vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
 			{
 				int			pid = autovac->pid;
 				StringInfoData locktagbuf;
-- 
2.25.0.114.g5b0ca878e0
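
The "avoid unnecessarily dirtying shared cachelines" pattern from the patch above can be sketched in isolation. This is an illustrative sketch only: the `Demo*` and `demo_*` names and the flag values are hypothetical, and locking (ProcArrayLock in the patch) is deliberately left out. It assumes each backend keeps its own copy of the flags and mirrors them into a shared dense array, storing to the shared array only when the value really changes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define DEMO_IS_AUTOVACUUM  0x01    /* persists across transactions */
#define DEMO_IN_VACUUM      0x02    /* per-transaction state */
#define DEMO_IN_ANALYZE     0x04    /* per-transaction state */
#define DEMO_STATE_MASK     (DEMO_IN_VACUUM | DEMO_IN_ANALYZE)

typedef struct DemoBackend
{
    uint8_t     vacuumFlags;    /* backend's own copy of the flags */
    int         pgxactoff;      /* offset into demo_shared_flags[] */
} DemoBackend;

/* Stand-in for the shared dense array that every snapshot scan reads. */
uint8_t     demo_shared_flags[4];

/* Counts stores to the shared array, to make the saving observable. */
int         demo_shared_writes;

/* Set a flag locally, then mirror the new value into the shared array. */
void
demo_set_flag(DemoBackend *b, uint8_t flag)
{
    b->vacuumFlags |= flag;
    demo_shared_flags[b->pgxactoff] = b->vacuumFlags;
    demo_shared_writes++;
}

/*
 * End-of-transaction cleanup.  The common case has no per-transaction flag
 * set, so we test the local copy first and skip the shared store entirely.
 */
void
demo_end_transaction(DemoBackend *b)
{
    if (b->vacuumFlags & DEMO_STATE_MASK)
    {
        b->vacuumFlags &= (uint8_t) ~DEMO_STATE_MASK;
        demo_shared_flags[b->pgxactoff] = b->vacuumFlags;
        demo_shared_writes++;
    }
}
```

With this shape, an ordinary transaction that never sets a vacuum-related flag ends without writing to the shared array at all, so the cache line holding it can stay in a shared state across CPUs.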

#59Ranier Vilela
ranier.vf@gmail.com
In reply to: Andres Freund (#58)
Re: Improving connection scalability: GetSnapshotData()

Latest Postgres
Windows 64 bits
msvc 2019 64 bits

Patches applied v12-0001 to v12-0007:

C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres\pg_visibility.vcxproj]
C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pgstattuple.vcxproj]
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]

regards,
Ranier Vilela

#60Andres Freund
andres@anarazel.de
In reply to: Ranier Vilela (#59)
Re: Improving connection scalability: GetSnapshotData()

On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:

Latest Postgres
Windows 64 bits
msvc 2019 64 bits

Patches applied v12-0001 to v12-0007:

C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres\pg_visibility.vcxproj]
C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pgstattuple.vcxproj]
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]

I don't know what that's about - there's no call to GetOldestXmin() in
pgstatapprox and pg_visibility after patch 0002? And similarly, the
PROCARRAY_* references are also removed in the same patch?

Greetings,

Andres Freund

#61Ranier Vilela
ranier.vf@gmail.com
In reply to: Andres Freund (#60)
Re: Improving connection scalability: GetSnapshotData()

On Fri, Jul 24, 2020 at 14:16, Andres Freund <andres@anarazel.de> wrote:

I don't know what that's about - there's no call to GetOldestXmin() in
pgstatapprox and pg_visibility after patch 0002? And similarly, the
PROCARRAY_* references are also removed in the same patch?

Maybe they need to be removed from these places, no?
C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
File pgstattuple\pgstatapprox.c:
OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
File pg_visibility\pg_visibility.c:
OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
* deadlocks, because surely GetOldestXmin() should never take
RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

regards,
Ranier Vilela

#62Andres Freund
andres@anarazel.de
In reply to: Ranier Vilela (#61)
Re: Improving connection scalability: GetSnapshotData()

On 2020-07-24 18:15:15 -0300, Ranier Vilela wrote:

Maybe they need to be removed from these places, no?
C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
File pgstattuple\pgstatapprox.c:
OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
File pg_visibility\pg_visibility.c:
OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
* deadlocks, because surely GetOldestXmin() should never take
RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

The 0002 patch changed those files:

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 68d580ed1e0..37206c50a21 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;

-	if (all_visible)
-	{
-		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
-	}
-
 	rel = relation_open(relid, AccessShareLock);
 
 	/* Only some relkinds have a visibility map */
 	check_relation_relkind(rel);

+	if (all_visible)
+		OldestXmin = GetOldestNonRemovableTransactionId(rel);
+
 	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/*
@@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * From a concurrency point of view, it sort of sucks to
 				 * retake ProcArrayLock here while we're holding the buffer
 				 * exclusively locked, but it should be safe against
-				 * deadlocks, because surely GetOldestXmin() should never take
-				 * a buffer lock. And this shouldn't happen often, so it's
-				 * worth being careful so as to avoid false positives.
+				 * deadlocks, because surely GetOldestNonRemovableTransactionId()
+				 * should never take a buffer lock. And this shouldn't happen
+				 * often, so it's worth being careful so as to avoid false
+				 * positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+				RecomputedOldestXmin = GetOldestNonRemovableTransactionId(rel);

 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index dbc0fa11f61..3a99333d443 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
-	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(rel);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);

 	nblocks = RelationGetNumberOfBlocks(rel);

Greetings,

Andres Freund

#63Ranier Vilela
ranier.vf@gmail.com
In reply to: Andres Freund (#62)
Re: Improving connection scalability: GetSnapshotData()

On Fri, Jul 24, 2020 at 21:00, Andres Freund <andres@anarazel.de> wrote:

On 2020-07-24 18:15:15 -0300, Ranier Vilela wrote:

On Fri, Jul 24, 2020 at 14:16, Andres Freund <andres@anarazel.de> wrote:

On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:

Latest Postgres
Windows 64 bits
msvc 2019 64 bits

Patches applied v12-0001 to v12-0007:

C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres\pg_visibility.vcxproj]
C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pgstattuple.vcxproj]
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]

I don't know what that's about - there's no call to GetOldestXmin() in
pgstatapprox and pg_visibility after patch 0002? And similarly, the
PROCARRAY_* references are also removed in the same patch?

Maybe they need to be removed from these places, no?
C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
File pgstattuple\pgstatapprox.c:
OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
File pg_visibility\pg_visibility.c:
OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
* deadlocks, because surely GetOldestXmin() should never take
RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

The 0002 patch changed those files:

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 68d580ed1e0..37206c50a21 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;

-	if (all_visible)
-	{
-		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
-	}
-
 	rel = relation_open(relid, AccessShareLock);

 	/* Only some relkinds have a visibility map */
 	check_relation_relkind(rel);

+	if (all_visible)
+		OldestXmin = GetOldestNonRemovableTransactionId(rel);
+
 	nblocks = RelationGetNumberOfBlocks(rel);
 	/*
@@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * From a concurrency point of view, it sort of sucks to
 				 * retake ProcArrayLock here while we're holding the buffer
 				 * exclusively locked, but it should be safe against
-				 * deadlocks, because surely GetOldestXmin() should never take
-				 * a buffer lock. And this shouldn't happen often, so it's
-				 * worth being careful so as to avoid false positives.
+				 * deadlocks, because surely GetOldestNonRemovableTransactionId()
+				 * should never take a buffer lock. And this shouldn't happen
+				 * often, so it's worth being careful so as to avoid false
+				 * positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+				RecomputedOldestXmin = GetOldestNonRemovableTransactionId(rel);

 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index dbc0fa11f61..3a99333d443 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
-	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(rel);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);

 	nblocks = RelationGetNumberOfBlocks(rel);

Obviously, the v12-0002-snapshot-scalability-Don-t-compute-global-horizo.patch
patch needs to be rebased.
https://github.com/postgres/postgres/blob/master/contrib/pg_visibility/pg_visibility.c

1:
if (all_visible)
{
/* Don't pass rel; that will fail in recovery. */
OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
}
This is on line 566 in the current version of git, while the patch expects it
on line 563.

2:
* deadlocks, because surely GetOldestXmin() should never take
* a buffer lock. And this shouldn't happen often, so it's
* worth being careful so as to avoid false positives.
*/
This is currently on line 682, while the patch expects it on line 679.

regards,
Ranier Vilela

#64Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#58)
Re: Improving connection scalability: GetSnapshotData()

On Fri, Jul 24, 2020 at 1:11 PM Andres Freund <andres@anarazel.de> wrote:

On 2020-07-15 21:33:06 -0400, Alvaro Herrera wrote:

On 2020-Jul-15, Andres Freund wrote:

It could make sense to split the conversion of
VariableCacheData->latestCompletedXid to FullTransactionId out from 0001
into its own commit. Not sure...

+1, the commit is large enough and that change can be had in advance.

I've done that in the attached.

+     * pair with the memory barrier below.  We do however accept xid to be <=
+     * to next_xid, instead of just <, as xid could be from the procarray,
+     * before we see the updated nextFullXid value.

Tricky. Right, that makes sense. I like the range assertion.

+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)

I'm struggling to find a better word for this than "relative".

+    return FullTransactionIdFromU64(U64FromFullTransactionId(rel)
+                                    + (int32) (xid - rel_xid));

I like your branch-free code for this.
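
For readers following along, the branch-free conversion quoted above can be tried in isolation. What follows is only a minimal standalone sketch, not the patch itself: full_xid_relative_to() is a hypothetical stand-in for FullXidViaRelative(), plain uint64_t/uint32_t replace FullTransactionId/TransactionId, and it assumes the 32-bit xid lies within 2^31 of the reference value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for FullXidViaRelative(): widen a 32-bit xid to 64
 * bits using a 64-bit reference 'rel' that is assumed to be within 2^31 of
 * the xid. Plain integers replace the FullTransactionId struct.
 */
static uint64_t
full_xid_relative_to(uint64_t rel, uint32_t xid)
{
	uint32_t	rel_xid = (uint32_t) rel;	/* low 32 bits of the reference */
	uint64_t	result;

	/*
	 * The unsigned subtraction wraps modulo 2^32. Reinterpreting it as
	 * int32 gives a signed distance, which is sign-extended and added to
	 * the reference. No explicit branch on the epoch is needed. (The
	 * uint32 -> int32 conversion relies on two's complement behaviour, as
	 * provided by the usual compilers.)
	 */
	result = rel + (uint64_t) (int64_t) (int32_t) (xid - rel_xid);

	/* the low 32 bits of the result must be the xid we started with */
	assert((uint32_t) result == xid);

	return result;
}
```

With a reference just past an epoch boundary (epoch 1, xid 5), an xid of 0xFFFFFFFE resolves to the previous epoch, while an xid of 10 stays in the same epoch as the reference.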

I wonder if somebody has an opinion on renaming latestCompletedXid to
latestCompletedFullXid. That's the pattern we already had (cf
nextFullXid), but it also leads to pretty long lines and quite a few
comment etc changes.

I'm somewhat inclined to remove the "Full" out of the variable, and to
also do that for nextFullXid. I feel like including it in the variable
name is basically a poor copy of the (also not great) C type system. If
we hadn't made FullTransactionId a struct I'd see it differently (and
thus incompatible with TransactionId), but we have ...

Yeah, I'm OK with dropping the "Full". I've found it rather clumsy too.

#65Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#64)
Re: Improving connection scalability: GetSnapshotData()

On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:

+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)

I'm struggling to find a better word for this than "relative".

The best I've got is "anchor" xid. It is an xid that is known to
limit nextFullXid's range while the receiving function runs.

#66Daniel Gustafsson
daniel@yesql.se
In reply to: Andres Freund (#58)
Re: Improving connection scalability: GetSnapshotData()

On 24 Jul 2020, at 03:11, Andres Freund <andres@anarazel.de> wrote:

I've done that in the attached.

As this is actively being reviewed but time is running short, I'm moving this
to the next CF.

cheers ./daniel

#67Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#65)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-07-29 19:20:04 +1200, Thomas Munro wrote:

On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:

+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)

I'm struggling to find a better word for this than "relative".

The best I've got is "anchor" xid. It is an xid that is known to
limit nextFullXid's range while the receiving function runs.

Thinking about it, I think that relative is a good descriptor. It's just
that 'via' is weird. How about: FullXidRelativeTo?

Greetings,

Andres Freund

#68Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#67)
Re: Improving connection scalability: GetSnapshotData()

On Wed, Aug 12, 2020 at 12:19 PM Andres Freund <andres@anarazel.de> wrote:

On 2020-07-29 19:20:04 +1200, Thomas Munro wrote:

On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:

+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)

I'm struggling to find a better word for this than "relative".

The best I've got is "anchor" xid. It is an xid that is known to
limit nextFullXid's range while the receiving function runs.

Thinking about it, I think that relative is a good descriptor. It's just
that 'via' is weird. How about: FullXidRelativeTo?

WFM.

#69Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#68)
6 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-08-12 12:24:52 +1200, Thomas Munro wrote:

On Wed, Aug 12, 2020 at 12:19 PM Andres Freund <andres@anarazel.de> wrote:

On 2020-07-29 19:20:04 +1200, Thomas Munro wrote:

On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:

+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)

I'm struggling to find a better word for this than "relative".

The best I've got is "anchor" xid. It is an xid that is known to
limit nextFullXid's range while the receiving function runs.

Thinking about it, I think that relative is a good descriptor. It's just
that 'via' is weird. How about: FullXidRelativeTo?

WFM.

Cool, pushed.

Attached is the rebased remainder of the series. Unless somebody
protests, I plan to push 0001 after a bit more comment polishing and
wait a buildfarm cycle, then push 0002-0005 and wait again, and finally
push 0006.

There are further optimizations possible, particularly after 0002 and
after 0006, but those seem better done later.

Greetings,

Andres Freund

Attachments:

v13-0001-snapshot-scalability-Don-t-compute-global-horizo.patch (text/x-diff; charset=us-ascii)
From c5ee4e016599d635f65a22f76c2069510f98ee47 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v13 1/6] snapshot scalability: Don't compute global horizons
 when building snapshots.

To make GetSnapshotData() more scalable, it cannot look at each proc's
xmin (see Discussion link below). Due to the frequency at which xmins are
updated, that just does not scale.

Without accessing xmins GetSnapshotData() cannot calculate accurate thresholds
as it has so far. But we don't really have to: The horizons don't actually
change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaced with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >=
   definitely_needed are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed.
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated
by GetSnapshotData()), it is likely that further tests can benefit from an
earlier computation of accurate horizons.
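
The two-threshold test described above can be illustrated with a small standalone sketch. It is only a simplified model under stated assumptions: VisBounds, vis_recompute() and vis_is_removable() are hypothetical names, 64-bit integers replace real XIDs (so there is no wraparound), the accurate horizon is passed in as a parameter instead of being computed from the procarray, and the rate limiting of recomputations is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for GlobalVisState. */
typedef struct VisBounds
{
	uint64_t	maybe_needed;		/* xids below this are removable */
	uint64_t	definitely_needed;	/* xids at or above this are needed */
	int			recomputations;		/* how often the slow path ran */
} VisBounds;

/*
 * Hypothetical slow path. 'accurate_horizon' stands in for the result of
 * the expensive scan over all backends. Bounds only ever move forward.
 */
static void
vis_recompute(VisBounds *b, uint64_t accurate_horizon)
{
	b->recomputations++;

	if (accurate_horizon > b->maybe_needed)
		b->maybe_needed = accurate_horizon;
	if (b->definitely_needed < b->maybe_needed)
		b->definitely_needed = b->maybe_needed;

	assert(b->maybe_needed <= b->definitely_needed);
}

/* Decide whether rows deleted by 'xid' may be removed. */
static bool
vis_is_removable(VisBounds *b, uint64_t xid, uint64_t accurate_horizon)
{
	/* cheap answers first: clearly below or clearly above the window */
	if (xid < b->maybe_needed)
		return true;
	if (xid >= b->definitely_needed)
		return false;

	/* in between: tighten the bounds, then test against the lower one */
	vis_recompute(b, accurate_horizon);
	return xid < b->maybe_needed;
}
```

Because the tightened bounds are kept, a later test for an XID below the new maybe_needed is answered on the cheap path without another recomputation.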

To avoid regressing performance when old_snapshot_threshold is set (as
that requires an accurate horizon to be computed),
heap_page_prune_opt() doesn't unconditionally call
TransactionIdLimitedForOldSnapshots() anymore. Both the computation of
the limited horizon and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) are now only done when necessary to
remove tuples.

Subsequent commits will take further advantage of the fact that
GetSnapshotData() will not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the
tests currently are not meaningful, and it seems best to address them
separately.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/access/ginblock.h               |   4 +-
 src/include/access/heapam.h                 |  10 +-
 src/include/access/transam.h                |  79 +-
 src/include/storage/bufpage.h               |   6 -
 src/include/storage/proc.h                  |   8 -
 src/include/storage/procarray.h             |  32 +-
 src/include/utils/snapmgr.h                 |  37 +-
 src/include/utils/snapshot.h                |   6 +
 src/backend/access/gin/ginvacuum.c          |  26 +
 src/backend/access/gist/gistutil.c          |   8 +-
 src/backend/access/gist/gistxlog.c          |  10 +-
 src/backend/access/heap/heapam.c            |  15 +-
 src/backend/access/heap/heapam_handler.c    |  24 +-
 src/backend/access/heap/heapam_visibility.c |  79 +-
 src/backend/access/heap/pruneheap.c         | 207 ++++-
 src/backend/access/heap/vacuumlazy.c        |  24 +-
 src/backend/access/index/indexam.c          |   3 +-
 src/backend/access/nbtree/README            |  10 +-
 src/backend/access/nbtree/nbtpage.c         |   4 +-
 src/backend/access/nbtree/nbtree.c          |  28 +-
 src/backend/access/nbtree/nbtxlog.c         |  10 +-
 src/backend/access/spgist/spgvacuum.c       |   6 +-
 src/backend/access/transam/README           |  78 +-
 src/backend/access/transam/xlog.c           |   4 +-
 src/backend/commands/analyze.c              |   2 +-
 src/backend/commands/vacuum.c               |  41 +-
 src/backend/postmaster/autovacuum.c         |   4 +
 src/backend/replication/logical/launcher.c  |   4 +
 src/backend/replication/walreceiver.c       |  17 +-
 src/backend/replication/walsender.c         |  15 +-
 src/backend/storage/ipc/procarray.c         | 902 ++++++++++++++++----
 src/backend/utils/adt/selfuncs.c            |  20 +-
 src/backend/utils/init/postinit.c           |   4 +
 src/backend/utils/time/snapmgr.c            | 258 +++---
 contrib/amcheck/verify_nbtree.c             |   8 +-
 contrib/pg_visibility/pg_visibility.c       |  18 +-
 contrib/pgstattuple/pgstatapprox.c          |   2 +-
 src/tools/pgindent/typedefs.list            |   2 +
 38 files changed, 1449 insertions(+), 566 deletions(-)

diff --git a/src/include/access/ginblock.h b/src/include/access/ginblock.h
index 3f64fd572e3..fe66a95226b 100644
--- a/src/include/access/ginblock.h
+++ b/src/include/access/ginblock.h
@@ -12,6 +12,7 @@
 
 #include "access/transam.h"
 #include "storage/block.h"
+#include "storage/bufpage.h"
 #include "storage/itemptr.h"
 #include "storage/off.h"
 
@@ -134,8 +135,7 @@ typedef struct GinMetaPageData
  */
 #define GinPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
 #define GinPageSetDeleteXid(page, xid) ( ((PageHeader) (page))->pd_prune_xid = xid)
-#define GinPageIsRecyclable(page) ( PageIsNew(page) || (GinPageIsDeleted(page) \
-	&& TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalXmin)))
+extern bool GinPageIsRecyclable(Page page);
 
 /*
  * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b31de389106..ebb79428d16 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -172,9 +172,12 @@ extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
 														 int nitems);
 
 /* in heap/pruneheap.c */
+struct GlobalVisState;
 extern void heap_page_prune_opt(Relation relation, Buffer buffer);
 extern int	heap_page_prune(Relation relation, Buffer buffer,
-							TransactionId OldestXmin,
+							struct GlobalVisState *vistest,
+							TransactionId limited_oldest_xmin,
+							TimestampTz limited_oldest_ts,
 							bool report_stats, TransactionId *latestRemovedXid);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
@@ -195,11 +198,14 @@ extern TM_Result HeapTupleSatisfiesUpdate(HeapTuple stup, CommandId curcid,
 										  Buffer buffer);
 extern HTSV_Result HeapTupleSatisfiesVacuum(HeapTuple stup, TransactionId OldestXmin,
 											Buffer buffer);
+extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple stup, Buffer buffer,
+												   TransactionId *dead_after);
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 								 uint16 infomask, TransactionId xid);
 extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
-extern bool HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin);
+extern bool HeapTupleIsSurelyDead(struct GlobalVisState *vistest,
+								  HeapTuple htup);
 
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 8db326ad1b5..b32044153b0 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -95,15 +95,6 @@ FullTransactionIdFromU64(uint64 value)
 			(dest) = FirstNormalTransactionId; \
 	} while(0)
 
-/* advance a FullTransactionId variable, stepping over special XIDs */
-static inline void
-FullTransactionIdAdvance(FullTransactionId *dest)
-{
-	dest->value++;
-	while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
-		dest->value++;
-}
-
 /*
  * Retreat a FullTransactionId variable, stepping over xids that would appear
  * to be special only when viewed as 32bit XIDs.
@@ -129,6 +120,23 @@ FullTransactionIdRetreat(FullTransactionId *dest)
 		dest->value--;
 }
 
+/*
+ * Advance a FullTransactionId variable, stepping over xids that would appear
+ * to be special only when viewed as 32bit XIDs.
+ */
+static inline void
+FullTransactionIdAdvance(FullTransactionId *dest)
+{
+	dest->value++;
+
+	/* see FullTransactionIdRetreat() */
+	if (FullTransactionIdPrecedes(*dest, FirstNormalFullTransactionId))
+		return;
+
+	while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
+		dest->value++;
+}
+
 /* back up a transaction ID variable, handling wraparound correctly */
 #define TransactionIdRetreat(dest)	\
 	do { \
@@ -293,6 +301,59 @@ ReadNewTransactionId(void)
 	return XidFromFullTransactionId(ReadNextFullTransactionId());
 }
 
+/* return transaction ID backed up by amount, handling wraparound correctly */
+static inline TransactionId
+TransactionIdRetreatedBy(TransactionId xid, uint32 amount)
+{
+	xid -= amount;
+
+	while (xid < FirstNormalTransactionId)
+		xid--;
+
+	return xid;
+}
+
+/* return the older of the two IDs */
+static inline TransactionId
+TransactionIdOlder(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdPrecedes(a, b))
+		return a;
+	return b;
+}
+
+/* return the older of the two IDs, assuming they're both normal */
+static inline TransactionId
+NormalTransactionIdOlder(TransactionId a, TransactionId b)
+{
+	Assert(TransactionIdIsNormal(a));
+	Assert(TransactionIdIsNormal(b));
+	if (NormalTransactionIdPrecedes(a, b))
+		return a;
+	return b;
+}
+
+/* return the newer of the two IDs */
+static inline FullTransactionId
+FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
+{
+	if (!FullTransactionIdIsValid(a))
+		return b;
+
+	if (!FullTransactionIdIsValid(b))
+		return a;
+
+	if (FullTransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 #endif							/* FRONTEND */
 
 #endif							/* TRANSAM_H */
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 3f88683a059..51b8f994ac0 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -389,12 +389,6 @@ PageValidateSpecialPointer(Page page)
 #define PageClearAllVisible(page) \
 	(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
 
-#define PageIsPrunable(page, oldestxmin) \
-( \
-	AssertMacro(TransactionIdIsNormal(oldestxmin)), \
-	TransactionIdIsValid(((PageHeader) (page))->pd_prune_xid) && \
-	TransactionIdPrecedes(((PageHeader) (page))->pd_prune_xid, oldestxmin) \
-)
 #define PageSetPrunable(page, xid) \
 do { \
 	Assert(TransactionIdIsNormal(xid)); \
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5ceb2494bae..52ff43cabaa 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -42,20 +42,12 @@ struct XidCache
 
 /*
  * Flags for PGXACT->vacuumFlags
- *
- * Note: If you modify these flags, you need to modify PROCARRAY_XXX flags
- * in src/include/storage/procarray.h.
- *
- * PROC_RESERVED may later be assigned for use in vacuumFlags, but its value is
- * used for PROCARRAY_SLOTS_XMIN in procarray.h, so GetOldestXmin won't be able
- * to match and ignore processes with this flag set.
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
-#define		PROC_RESERVED				0x20	/* reserved for procarray */
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 01040d76e12..ea8a876ca45 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -20,34 +20,6 @@
 #include "utils/snapshot.h"
 
 
-/*
- * These are to implement PROCARRAY_FLAGS_XXX
- *
- * Note: These flags are cloned from PROC_XXX flags in src/include/storage/proc.h
- * to avoid forcing to include proc.h when including procarray.h. So if you modify
- * PROC_XXX flags, you need to modify these flags.
- */
-#define		PROCARRAY_VACUUM_FLAG			0x02	/* currently running lazy
-													 * vacuum */
-#define		PROCARRAY_LOGICAL_DECODING_FLAG 0x10	/* currently doing logical
-													 * decoding outside xact */
-
-#define		PROCARRAY_SLOTS_XMIN			0x20	/* replication slot xmin,
-													 * catalog_xmin */
-/*
- * Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
- * PGXACT->vacuumFlags. Other flags are used for different purposes and
- * have no corresponding PROC flag equivalent.
- */
-#define		PROCARRAY_PROC_FLAGS_MASK	(PROCARRAY_VACUUM_FLAG | \
-										 PROCARRAY_LOGICAL_DECODING_FLAG)
-
-/* Use the following flags as an input "flags" to GetOldestXmin function */
-/* Consider all backends except for logical decoding ones which manage xmin separately */
-#define		PROCARRAY_FLAGS_DEFAULT			PROCARRAY_LOGICAL_DECODING_FLAG
-/* Ignore vacuum backends */
-#define		PROCARRAY_FLAGS_VACUUM			PROCARRAY_FLAGS_DEFAULT | PROCARRAY_VACUUM_FLAG
-
 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
 extern void ProcArrayAdd(PGPROC *proc);
@@ -81,9 +53,11 @@ extern RunningTransactions GetRunningTransactionData(void);
 
 extern bool TransactionIdIsInProgress(TransactionId xid);
 extern bool TransactionIdIsActive(TransactionId xid);
-extern TransactionId GetOldestXmin(Relation rel, int flags);
+extern TransactionId GetOldestNonRemovableTransactionId(Relation rel);
+extern TransactionId GetOldestTransactionIdConsideredRunning(void);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
+extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);
 
 extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
 extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index ffb4ba3adfb..b6b403e2931 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -52,13 +52,12 @@ extern Size SnapMgrShmemSize(void);
 extern void SnapMgrInit(void);
 extern TimestampTz GetSnapshotCurrentTimestamp(void);
 extern TimestampTz GetOldSnapshotThresholdTimestamp(void);
+extern void SnapshotTooOldMagicForTest(void);
 
 extern bool FirstSnapshotSet;
 
 extern PGDLLIMPORT TransactionId TransactionXmin;
 extern PGDLLIMPORT TransactionId RecentXmin;
-extern PGDLLIMPORT TransactionId RecentGlobalXmin;
-extern PGDLLIMPORT TransactionId RecentGlobalDataXmin;
 
 /* Variables representing various special snapshot semantics */
 extern PGDLLIMPORT SnapshotData SnapshotSelfData;
@@ -78,11 +77,12 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 
 /*
  * Similarly, some initialization is required for a NonVacuumable snapshot.
- * The caller must supply the xmin horizon to use (e.g., RecentGlobalXmin).
+ * The caller must supply the visibility cutoff state to use (c.f.
+ * GlobalVisTestFor()).
  */
-#define InitNonVacuumableSnapshot(snapshotdata, xmin_horizon)  \
+#define InitNonVacuumableSnapshot(snapshotdata, vistestp)  \
 	((snapshotdata).snapshot_type = SNAPSHOT_NON_VACUUMABLE, \
-	 (snapshotdata).xmin = (xmin_horizon))
+	 (snapshotdata).vistest = (vistestp))
 
 /*
  * Similarly, some initialization is required for SnapshotToast.  We need
@@ -98,6 +98,11 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 	((snapshot)->snapshot_type == SNAPSHOT_MVCC || \
 	 (snapshot)->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
 
+static inline bool
+OldSnapshotThresholdActive(void)
+{
+	return old_snapshot_threshold >= 0;
+}
 
 extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
@@ -121,8 +126,6 @@ extern void UnregisterSnapshot(Snapshot snapshot);
 extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
 extern void UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner);
 
-extern FullTransactionId GetFullRecentGlobalXmin(void);
-
 extern void AtSubCommit_Snapshot(int level);
 extern void AtSubAbort_Snapshot(int level);
 extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
@@ -131,13 +134,29 @@ extern void ImportSnapshot(const char *idstr);
 extern bool XactHasExportedSnapshots(void);
 extern void DeleteAllExportedSnapshotFiles(void);
 extern bool ThereAreNoPriorRegisteredSnapshots(void);
-extern TransactionId TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-														 Relation relation);
+extern bool TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
+												Relation relation,
+												TransactionId *limit_xid,
+												TimestampTz *limit_ts);
+extern void SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit);
 extern void MaintainOldSnapshotTimeMapping(TimestampTz whenTaken,
 										   TransactionId xmin);
 
 extern char *ExportSnapshot(Snapshot snapshot);
 
+/*
+ * These live in procarray.c because they're intimately linked to the
+ * procarray contents, but thematically they better fit into snapmgr.h.
+ */
+typedef struct GlobalVisState GlobalVisState;
+extern GlobalVisState *GlobalVisTestFor(Relation rel);
+extern bool GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid);
+extern bool GlobalVisTestIsRemovableFullXid(GlobalVisState *state, FullTransactionId fxid);
+extern FullTransactionId GlobalVisTestNonRemovableFullHorizon(GlobalVisState *state);
+extern TransactionId GlobalVisTestNonRemovableHorizon(GlobalVisState *state);
+extern bool GlobalVisCheckRemovableXid(Relation rel, TransactionId xid);
+extern bool GlobalVisIsRemovableFullXid(Relation rel, FullTransactionId fxid);
+
 /*
  * Utility functions for implementing visibility routines in table AMs.
  */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 4796edb63aa..35b1f05bea6 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -192,6 +192,12 @@ typedef struct SnapshotData
 	 */
 	uint32		speculativeToken;
 
+	/*
+	 * For SNAPSHOT_NON_VACUUMABLE (and hopefully more in the future) this is
+	 * used to determine whether a row could be vacuumed.
+	 */
+	struct GlobalVisState *vistest;
+
 	/*
 	 * Book-keeping information, used by the snapshot manager
 	 */
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 8ae4fd95a7b..9cd6638df62 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -793,3 +793,29 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 	return stats;
 }
+
+/*
+ * Return whether Page can safely be recycled.
+ */
+bool
+GinPageIsRecyclable(Page page)
+{
+	TransactionId delete_xid;
+
+	if (PageIsNew(page))
+		return true;
+
+	if (!GinPageIsDeleted(page))
+		return false;
+
+	delete_xid = GinPageGetDeleteXid(page);
+
+	if (!TransactionIdIsValid(delete_xid))
+		return true;
+
+	/*
+	 * If no backend could still view delete_xid as running, all scans
+	 * concurrent with ginDeletePage() must have finished.
+	 */
+	return GlobalVisCheckRemovableXid(NULL, delete_xid);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 765329bbcd4..bfda7fbe3d5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -891,15 +891,13 @@ gistPageRecyclable(Page page)
 		 * As long as that can happen, we must keep the deleted page around as
 		 * a tombstone.
 		 *
-		 * Compare the deletion XID with RecentGlobalXmin. If deleteXid <
-		 * RecentGlobalXmin, then no scan that's still in progress could have
+		 * For that check if the deletion XID could still be visible to
+		 * anyone. If not, then no scan that's still in progress could have
 		 * seen its downlink, and we can recycle it.
 		 */
 		FullTransactionId deletexid_full = GistPageGetDeleteXid(page);
-		FullTransactionId recentxmin_full = GetFullRecentGlobalXmin();
 
-		if (FullTransactionIdPrecedes(deletexid_full, recentxmin_full))
-			return true;
+		return GlobalVisIsRemovableFullXid(NULL, deletexid_full);
 	}
 	return false;
 }
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 7b5d1e98b70..a63b05388c5 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -387,11 +387,11 @@ gistRedoPageReuse(XLogReaderState *record)
 	 * PAGE_REUSE records exist to provide a conflict point when we reuse
 	 * pages in the index via the FSM.  That's all they do though.
 	 *
-	 * latestRemovedXid was the page's deleteXid.  The deleteXid <
-	 * RecentGlobalXmin test in gistPageRecyclable() conceptually mirrors the
-	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
-	 * Consequently, one XID value achieves the same exclusion effect on
-	 * primary and standby.
+	 * latestRemovedXid was the page's deleteXid.  The
+	 * GlobalVisIsRemovableFullXid(deleteXid) test in gistPageRecyclable()
+	 * conceptually mirrors the pgxact->xmin > limitXmin test in
+	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
+	 * same exclusion effect on primary and standby.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 00169006fb1..0a89e741a15 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1517,6 +1517,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		at_chain_start;
 	bool		valid;
 	bool		skip;
+	GlobalVisState *vistest = NULL;
 
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
@@ -1527,7 +1528,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	at_chain_start = first_call;
 	skip = !first_call;
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
 	Assert(BufferGetBlockNumber(buffer) == blkno);
 
 	/* Scan through possible multiple members of HOT-chain */
@@ -1616,9 +1618,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		 * Note: if you change the criterion here for what is "dead", fix the
 		 * planner's get_actual_variable_range() function to match.
 		 */
-		if (all_dead && *all_dead &&
-			!HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
-			*all_dead = false;
+		if (all_dead && *all_dead)
+		{
+			if (!vistest)
+				vistest = GlobalVisTestFor(relation);
+
+			if (!HeapTupleIsSurelyDead(vistest, heapTuple))
+				*all_dead = false;
+		}
 
 		/*
 		 * Check to see if HOT chain continues past this tuple; if so fetch
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 267a6ee25a7..e3e41fb7516 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1203,7 +1203,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
-		OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
+		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
 
 	if (!scan)
 	{
@@ -1244,6 +1244,17 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	hscan = (HeapScanDesc) scan;
 
+	/*
+	 * Must have called GetOldestNonRemovableTransactionId() if using
+	 * SnapshotAny.  Shouldn't have for an MVCC snapshot. (It's especially
+	 * worth checking this for parallel builds, since ambuild routines that
+	 * support parallel builds must work these details out for themselves.)
+	 */
+	Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+		   !TransactionIdIsValid(OldestXmin));
+	Assert(snapshot == SnapshotAny || !anyvisible);
+
 	/* Publish number of blocks to scan */
 	if (progress)
 	{
@@ -1263,17 +1274,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 									 nblocks);
 	}
 
-	/*
-	 * Must call GetOldestXmin() with SnapshotAny.  Should never call
-	 * GetOldestXmin() with MVCC snapshot. (It's especially worth checking
-	 * this for parallel builds, since ambuild routines that support parallel
-	 * builds must work these details out for themselves.)
-	 */
-	Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
-	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
-		   !TransactionIdIsValid(OldestXmin));
-	Assert(snapshot == SnapshotAny || !anyvisible);
-
 	/* set our scan endpoints */
 	if (!allow_sync)
 		heap_setscanlimits(scan, start_blockno, numblocks);
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index c77128087cf..f117ee160a3 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1154,19 +1154,56 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
  *	we mainly want to know is if a tuple is potentially visible to *any*
  *	running transaction.  If so, it can't be removed yet by VACUUM.
  *
- * OldestXmin is a cutoff XID (obtained from GetOldestXmin()).  Tuples
- * deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
- * still be visible to some open transaction, so we can't remove them,
- * even if we see that the deleting transaction has committed.
+ * OldestXmin is a cutoff XID (obtained from
+ * GetOldestNonRemovableTransactionId()).  Tuples deleted by XIDs >=
+ * OldestXmin are deemed "recently dead"; they might still be visible to some
+ * open transaction, so we can't remove them, even if we see that the deleting
+ * transaction has committed.
  */
 HTSV_Result
 HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 						 Buffer buffer)
+{
+	TransactionId dead_after = InvalidTransactionId;
+	HTSV_Result res;
+
+	res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);
+
+	if (res == HEAPTUPLE_RECENTLY_DEAD)
+	{
+		Assert(TransactionIdIsValid(dead_after));
+
+		if (TransactionIdPrecedes(dead_after, OldestXmin))
+			res = HEAPTUPLE_DEAD;
+	}
+	else
+		Assert(!TransactionIdIsValid(dead_after));
+
+	return res;
+}
+
+/*
+ * Workhorse for HeapTupleSatisfiesVacuum and similar routines.
+ *
+ * In contrast to HeapTupleSatisfiesVacuum this routine, when encountering a
+ * tuple that could still be visible to some backend, stores the xid that
+ * needs to be compared with the horizon in *dead_after, and returns
+ * HEAPTUPLE_RECENTLY_DEAD. The caller then can perform the comparison with
+ * the horizon.  This is e.g. useful when comparing with different horizons.
+ *
+ * Note: HEAPTUPLE_DEAD can still be returned here, e.g. if the inserting
+ * transaction aborted.
+ */
+HTSV_Result
+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *dead_after)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
 	Assert(ItemPointerIsValid(&htup->t_self));
 	Assert(htup->t_tableOid != InvalidOid);
+	Assert(dead_after != NULL);
+
+	*dead_after = InvalidTransactionId;
 
 	/*
 	 * Has inserting transaction committed?
@@ -1323,17 +1360,15 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 		else if (TransactionIdDidCommit(xmax))
 		{
 			/*
-			 * The multixact might still be running due to lockers.  If the
-			 * updater is below the xid horizon, we have to return DEAD
-			 * regardless -- otherwise we could end up with a tuple where the
-			 * updater has to be removed due to the horizon, but is not pruned
-			 * away.  It's not a problem to prune that tuple, because any
-			 * remaining lockers will also be present in newer tuple versions.
+			 * The multixact might still be running due to lockers.  Need to
+			 * allow for pruning if below the xid horizon regardless --
+			 * otherwise we could end up with a tuple where the updater has to
+			 * be removed due to the horizon, but is not pruned away.  It's
+			 * not a problem to prune that tuple, because any remaining
+			 * lockers will also be present in newer tuple versions.
 			 */
-			if (!TransactionIdPrecedes(xmax, OldestXmin))
-				return HEAPTUPLE_RECENTLY_DEAD;
-
-			return HEAPTUPLE_DEAD;
+			*dead_after = xmax;
+			return HEAPTUPLE_RECENTLY_DEAD;
 		}
 		else if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple), false))
 		{
@@ -1372,14 +1407,11 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 	}
 
 	/*
-	 * Deleter committed, but perhaps it was recent enough that some open
-	 * transactions could still see the tuple.
+	 * Deleter committed, allow caller to check if it was recent enough that
+	 * some open transactions could still see the tuple.
 	 */
-	if (!TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin))
-		return HEAPTUPLE_RECENTLY_DEAD;
-
-	/* Otherwise, it's dead and removable */
-	return HEAPTUPLE_DEAD;
+	*dead_after = HeapTupleHeaderGetRawXmax(tuple);
+	return HEAPTUPLE_RECENTLY_DEAD;
 }
 
 
@@ -1418,7 +1450,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
  *	if the tuple is removable.
  */
 bool
-HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
+HeapTupleIsSurelyDead(GlobalVisState *vistest, HeapTuple htup)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
@@ -1459,7 +1491,8 @@ HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
 		return false;
 
 	/* Deleter committed, so tuple is dead if the XID is old enough. */
-	return TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin);
+	return GlobalVisTestIsRemovableXid(vistest,
+									   HeapTupleHeaderGetRawXmax(tuple));
 }
 
 /*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 256df4de105..00a3cb106aa 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -23,12 +23,30 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
+#include "utils/snapmgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
 /* Working data for heap_page_prune and subroutines */
 typedef struct
 {
+	Relation	rel;
+
+	/* tuple visibility test, initialized for the relation */
+	GlobalVisState *vistest;
+
+	/*
+	 * Thresholds set by TransactionIdLimitedForOldSnapshots() if they have
+	 * been computed (done on demand, and only if
+	 * OldSnapshotThresholdActive()). The first time a tuple is about to be
+	 * removed based on the limited horizon, old_snap_used is set to true, and
+	 * SetOldSnapshotThresholdTimestamp() is called. See
+	 * heap_prune_satisfies_vacuum().
+	 */
+	TimestampTz old_snap_ts;
+	TransactionId old_snap_xmin;
+	bool		old_snap_used;
+
 	TransactionId new_prune_xid;	/* new prune hint value for page */
 	TransactionId latestRemovedXid; /* latest xid to be removed by this prune */
 	int			nredirected;	/* numbers of entries in arrays below */
@@ -43,9 +61,8 @@ typedef struct
 } PruneState;
 
 /* Local functions */
-static int	heap_prune_chain(Relation relation, Buffer buffer,
+static int	heap_prune_chain(Buffer buffer,
 							 OffsetNumber rootoffnum,
-							 TransactionId OldestXmin,
 							 PruneState *prstate);
 static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid);
 static void heap_prune_record_redirect(PruneState *prstate,
@@ -65,16 +82,16 @@ static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
  * if there's not any use in pruning.
  *
  * Caller must have pin on the buffer, and must *not* have a lock on it.
- *
- * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
- * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
  */
 void
 heap_page_prune_opt(Relation relation, Buffer buffer)
 {
 	Page		page = BufferGetPage(buffer);
+	TransactionId prune_xid;
+	GlobalVisState *vistest;
+	TransactionId limited_xmin = InvalidTransactionId;
+	TimestampTz limited_ts = 0;
 	Size		minfree;
-	TransactionId OldestXmin;
 
 	/*
 	 * We can't write WAL in recovery mode, so there's no point trying to
@@ -85,37 +102,55 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		return;
 
 	/*
-	 * Use the appropriate xmin horizon for this relation. If it's a proper
-	 * catalog relation or a user defined, additional, catalog relation, we
-	 * need to use the horizon that includes slots, otherwise the data-only
-	 * horizon can be used. Note that the toast relation of user defined
-	 * relations are *not* considered catalog relations.
+	 * XXX: Magic to make old_snapshot_threshold tests appear "working". They
+	 * currently are broken, and discussion of what to do about them is
+	 * ongoing. See
+	 * https://www.postgresql.org/message-id/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de
+	 */
+	if (old_snapshot_threshold == 0)
+		SnapshotTooOldMagicForTest();
+
+	/*
+	 * First check whether there's any chance there's something to prune;
+	 * determining the appropriate horizon is a waste if there's no prune_xid
+	 * (i.e. no updates/deletes left potentially dead tuples around).
+	 */
+	prune_xid = ((PageHeader) page)->pd_prune_xid;
+	if (!TransactionIdIsValid(prune_xid))
+		return;
+
+	/*
+	 * Check whether prune_xid indicates that there may be dead rows that can
+	 * be cleaned up.
 	 *
-	 * It is OK to apply the old snapshot limit before acquiring the cleanup
+	 * It is OK to check the old snapshot limit before acquiring the cleanup
 	 * lock because the worst that can happen is that we are not quite as
 	 * aggressive about the cleanup (by however many transaction IDs are
 	 * consumed between this point and acquiring the lock).  This allows us to
 	 * save significant overhead in the case where the page is found not to be
 	 * prunable.
-	 */
-	if (IsCatalogRelation(relation) ||
-		RelationIsAccessibleInLogicalDecoding(relation))
-		OldestXmin = RecentGlobalXmin;
-	else
-		OldestXmin =
-			TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
-												relation);
-
-	Assert(TransactionIdIsValid(OldestXmin));
-
-	/*
-	 * Let's see if we really need pruning.
 	 *
-	 * Forget it if page is not hinted to contain something prunable that's
-	 * older than OldestXmin.
+	 * Even if old_snapshot_threshold is set, we first check whether the page
+	 * can be pruned without it, both because
+	 * TransactionIdLimitedForOldSnapshots() is not cheap and because not
+	 * relying on old_snapshot_threshold unnecessarily avoids causing
+	 * conflicts.
 	 */
-	if (!PageIsPrunable(page, OldestXmin))
-		return;
+	vistest = GlobalVisTestFor(relation);
+
+	if (!GlobalVisTestIsRemovableXid(vistest, prune_xid))
+	{
+		if (!OldSnapshotThresholdActive())
+			return;
+
+		if (!TransactionIdLimitedForOldSnapshots(GlobalVisTestNonRemovableHorizon(vistest),
+												 relation,
+												 &limited_xmin, &limited_ts))
+			return;
+
+		if (!TransactionIdPrecedes(prune_xid, limited_xmin))
+			return;
+	}
 
 	/*
 	 * We prune when a previous UPDATE failed to find enough space on the page
@@ -151,7 +186,9 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 															 * needed */
 
 			/* OK to prune */
-			(void) heap_page_prune(relation, buffer, OldestXmin, true, &ignore);
+			(void) heap_page_prune(relation, buffer, vistest,
+								   limited_xmin, limited_ts,
+								   true, &ignore);
 		}
 
 		/* And release buffer lock */
@@ -165,8 +202,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * Caller must have pin and buffer cleanup lock on the page.
  *
- * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
- * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ * vistest is used to distinguish whether tuples are DEAD or RECENTLY_DEAD
+ * (see heap_prune_satisfies_vacuum and
+ * HeapTupleSatisfiesVacuum). old_snap_xmin / old_snap_ts need to
+ * either have been set by TransactionIdLimitedForOldSnapshots, or
+ * be InvalidTransactionId/0 respectively.
  *
  * If report_stats is true then we send the number of reclaimed heap-only
  * tuples to pgstats.  (This must be false during vacuum, since vacuum will
@@ -177,7 +217,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * latestRemovedXid.
  */
 int
-heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
+heap_page_prune(Relation relation, Buffer buffer,
+				GlobalVisState *vistest,
+				TransactionId old_snap_xmin,
+				TimestampTz old_snap_ts,
 				bool report_stats, TransactionId *latestRemovedXid)
 {
 	int			ndeleted = 0;
@@ -198,6 +241,11 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 	 * initialize the rest of our working state.
 	 */
 	prstate.new_prune_xid = InvalidTransactionId;
+	prstate.rel = relation;
+	prstate.vistest = vistest;
+	prstate.old_snap_xmin = old_snap_xmin;
+	prstate.old_snap_ts = old_snap_ts;
+	prstate.old_snap_used = false;
 	prstate.latestRemovedXid = *latestRemovedXid;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
@@ -220,9 +268,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 			continue;
 
 		/* Process this item or chain of items */
-		ndeleted += heap_prune_chain(relation, buffer, offnum,
-									 OldestXmin,
-									 &prstate);
+		ndeleted += heap_prune_chain(buffer, offnum, &prstate);
 	}
 
 	/* Any error while applying the changes is critical */
@@ -323,6 +369,85 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 }
 
 
+/*
+ * Perform visibility checks for heap pruning.
+ *
+ * This is more complicated than just using GlobalVisTestIsRemovableXid()
+ * because of old_snapshot_threshold. We only want to increase the threshold
+ * that triggers errors for old snapshots when we actually decide to remove a
+ * row based on the limited horizon.
+ *
+ * Due to its cost we also only want to call
+ * TransactionIdLimitedForOldSnapshots() if necessary, i.e. we might not have
+ * done so in heap_page_prune_opt() if pd_prune_xid was old enough. But we
+ * still want to be able to remove rows that are too new to be removed
+ * according to prstate->vistest, but that can be removed based on
+ * old_snapshot_threshold. So we call TransactionIdLimitedForOldSnapshots() on
+ * demand in here, if appropriate.
+ */
+static HTSV_Result
+heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
+{
+	HTSV_Result res;
+	TransactionId dead_after;
+
+	res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);
+
+	if (res != HEAPTUPLE_RECENTLY_DEAD)
+		return res;
+
+	/*
+	 * If we are already relying on the limited xmin, there is no need to
+	 * delay doing so anymore.
+	 */
+	if (prstate->old_snap_used)
+	{
+		Assert(TransactionIdIsValid(prstate->old_snap_xmin));
+
+		if (TransactionIdPrecedes(dead_after, prstate->old_snap_xmin))
+			res = HEAPTUPLE_DEAD;
+		return res;
+	}
+
+	/*
+	 * First check if GlobalVisTestIsRemovableXid() is sufficient to find the
+	 * row dead. If not, and old_snapshot_threshold is enabled, try to use the
+	 * lowered horizon.
+	 */
+	if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
+		res = HEAPTUPLE_DEAD;
+	else if (OldSnapshotThresholdActive())
+	{
+		/* haven't determined limited horizon yet, request it */
+		if (!TransactionIdIsValid(prstate->old_snap_xmin))
+		{
+			TransactionId horizon =
+			GlobalVisTestNonRemovableHorizon(prstate->vistest);
+
+			TransactionIdLimitedForOldSnapshots(horizon, prstate->rel,
+												&prstate->old_snap_xmin,
+												&prstate->old_snap_ts);
+		}
+
+		if (TransactionIdIsValid(prstate->old_snap_xmin) &&
+			TransactionIdPrecedes(dead_after, prstate->old_snap_xmin))
+		{
+			/*
+			 * About to remove row based on snapshot_too_old. Need to raise
+			 * the threshold so problematic accesses would error.
+			 */
+			Assert(!prstate->old_snap_used);
+			SetOldSnapshotThresholdTimestamp(prstate->old_snap_ts,
+											 prstate->old_snap_xmin);
+			prstate->old_snap_used = true;
+			res = HEAPTUPLE_DEAD;
+		}
+	}
+
+	return res;
+}
+
+
 /*
  * Prune specified line pointer or a HOT chain originating at line pointer.
  *
@@ -349,9 +474,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  * Returns the number of tuples (to be) deleted from the page.
  */
 static int
-heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
-				 TransactionId OldestXmin,
-				 PruneState *prstate)
+heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 {
 	int			ndeleted = 0;
 	Page		dp = (Page) BufferGetPage(buffer);
@@ -366,7 +489,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 				i;
 	HeapTupleData tup;
 
-	tup.t_tableOid = RelationGetRelid(relation);
+	tup.t_tableOid = RelationGetRelid(prstate->rel);
 
 	rootlp = PageGetItemId(dp, rootoffnum);
 
@@ -401,7 +524,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 			 * either here or while following a chain below.  Whichever path
 			 * gets there first will mark the tuple unused.
 			 */
-			if (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer)
+			if (heap_prune_satisfies_vacuum(prstate, &tup, buffer)
 				== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
 			{
 				heap_prune_record_unused(prstate, rootoffnum);
@@ -485,7 +608,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 		 */
 		tupdead = recent_dead = false;
 
-		switch (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer))
+		switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer))
 		{
 			case HEAPTUPLE_DEAD:
 				tupdead = true;
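The on-demand use of the limited horizon in heap_prune_satisfies_vacuum()
can be modelled in isolation roughly as follows. This is a simplified
sketch under stated assumptions: SketchPruneState, threshold_advance and the
sketch_* functions are hypothetical names, the "expensive" computation is
replaced by a fixed offset that always succeeds, and raising the
snapshot-too-old threshold is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t SketchXid;

typedef struct
{
	SketchXid	global_horizon;		/* xids older than this are removable */
	bool		threshold_active;	/* models OldSnapshotThresholdActive() */
	SketchXid	threshold_advance;	/* assumed gain from the limited horizon */
	SketchXid	limited_xmin;		/* 0 until computed on demand */
	bool		limited_used;		/* a removal relied on limited_xmin */
	int			limit_computations; /* how often the costly step ran */
} SketchPruneState;

/* Wraparound-aware "a is older than b"; meaningful for normal xids only. */
bool
sketch_xid_precedes(SketchXid a, SketchXid b)
{
	return (int32_t) (a - b) < 0;
}

/* Hypothetical stand-in for the costly limited-horizon computation. */
void
sketch_compute_limited_xmin(SketchPruneState *st)
{
	st->limited_xmin = st->global_horizon + st->threshold_advance;
	st->limit_computations++;
}

bool
sketch_is_removable(SketchPruneState *st, SketchXid dead_after)
{
	/* Already relying on the limited horizon: keep using it. */
	if (st->limited_used)
		return sketch_xid_precedes(dead_after, st->limited_xmin);

	/* Cheap check against the regular horizon first. */
	if (sketch_xid_precedes(dead_after, st->global_horizon))
		return true;

	if (!st->threshold_active)
		return false;

	/* Compute the limited horizon at most once, and only when needed. */
	if (st->limited_xmin == 0)
		sketch_compute_limited_xmin(st);

	if (st->limited_xmin != 0 &&
		sketch_xid_precedes(dead_after, st->limited_xmin))
	{
		/* The real code raises the snapshot-too-old threshold here. */
		st->limited_used = true;
		return true;
	}

	return false;
}
```

In this model a page whose dead rows are all older than the regular horizon
never pays for the limited-horizon computation and never sets limited_used,
which corresponds to the stated goal of not relying on
old_snapshot_threshold unnecessarily.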
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bbc4598f75..44e2224dd55 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -788,6 +788,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	GlobalVisState *vistest;
 
 	pg_rusage_init(&ru0);
 
@@ -816,6 +817,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->nonempty_pages = 0;
 	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
+	vistest = GlobalVisTestFor(onerel);
+
 	/*
 	 * Initialize state for a parallel vacuum.  As of now, only one worker can
 	 * be used for an index, so we invoke parallelism only if there are at
@@ -1239,7 +1242,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 *
 		 * We count tuples removed by the pruning step as removed by VACUUM.
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
+		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
+										 InvalidTransactionId, 0, false,
 										 &vacrelstats->latestRemovedXid);
 
 		/*
@@ -1596,14 +1600,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		}
 
 		/*
-		 * It's possible for the value returned by GetOldestXmin() to move
-		 * backwards, so it's not wrong for us to see tuples that appear to
-		 * not be visible to everyone yet, while PD_ALL_VISIBLE is already
-		 * set. The real safe xmin value never moves backwards, but
-		 * GetOldestXmin() is conservative and sometimes returns a value
-		 * that's unnecessarily small, so if we see that contradiction it just
-		 * means that the tuples that we think are not visible to everyone yet
-		 * actually are, and the PD_ALL_VISIBLE flag is correct.
+		 * It's possible for the value returned by
+		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+		 * wrong for us to see tuples that appear to not be visible to
+		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
+		 * xmin value never moves backwards, but
+		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
+		 * returns a value that's unnecessarily small, so if we see that
+		 * contradiction it just means that the tuples that we think are not
+		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
+		 * is correct.
 		 *
 		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
 		 * set, however.
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 6b9750c244a..3fb8688f8f4 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -519,7 +519,8 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	SCAN_CHECKS;
 	CHECK_SCAN_PROCEDURE(amgettuple);
 
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(TransactionIdIsValid(RecentXmin));
 
 	/*
 	 * The AM's amgettuple proc finds the next index entry matching the scan
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index abce31a5a96..781a8f1932d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -342,9 +342,9 @@ snapshots and registered snapshots as of the deletion are gone; which is
 overly strong, but is simple to implement within Postgres.  When marked
 dead, a deleted page is labeled with the next-transaction counter value.
 VACUUM can reclaim the page for re-use when this transaction number is
-older than RecentGlobalXmin.  As collateral damage, this implementation
-also waits for running XIDs with no snapshots and for snapshots taken
-until the next transaction to allocate an XID commits.
+guaranteed to be "visible to everyone".  As collateral damage, this
+implementation also waits for running XIDs with no snapshots and for
+snapshots taken until the next transaction to allocate an XID commits.
 
 Reclaiming a page doesn't actually change its state on disk --- we simply
 record it in the shared-memory free space map, from which it will be
@@ -411,8 +411,8 @@ page and also the correct place to hold the current value. We can avoid
 the cost of walking down the tree in such common cases.
 
 The optimization works on the assumption that there can only be one
-non-ignorable leaf rightmost page, and so even a RecentGlobalXmin style
-interlock isn't required.  We cannot fail to detect that our hint was
+non-ignorable leaf rightmost page, and so not even a visible-to-everyone
+style interlock is required.  We cannot fail to detect that our hint was
 invalidated, because there can only be one such page in the B-Tree at
 any time. It's possible that the page will be deleted and recycled
 without a backend's cached page also being detected as invalidated, but
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index d5db9aaa3a1..74be3807bb7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1097,7 +1097,7 @@ _bt_page_recyclable(Page page)
 	 */
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	if (P_ISDELETED(opaque) &&
-		TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
+		GlobalVisCheckRemovableXid(NULL, opaque->btpo.xact))
 		return true;
 	return false;
 }
@@ -2318,7 +2318,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * updated links to the target, ReadNewTransactionId() suffices as an
 	 * upper bound.  Any scan having retained a now-stale link is advertising
 	 * in its PGXACT an xmin less than or equal to the value we read here.  It
-	 * will continue to do so, holding back RecentGlobalXmin, for the duration
+	 * will continue to do so, holding back the xmin horizon, for the duration
 	 * of that scan.
 	 */
 	page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 49a8a9708e3..8fa6ac7296b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -808,6 +808,12 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
+	/*
+	 * XXX: If IndexVacuumInfo contained the heap relation, we could be more
+	 * aggressive about vacuuming non-catalog relations by passing the table
+	 * to GlobalVisCheckRemovableXid().
+	 */
+
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
@@ -817,13 +823,12 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		result = true;
 	}
 	else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
-			 TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
-								   RecentGlobalXmin))
+			 GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
 	{
 		/*
 		 * If any oldest btpo.xact from a previously deleted page in the index
-		 * is older than RecentGlobalXmin, then at least one deleted page can
-		 * be recycled -- don't skip cleanup.
+		 * is visible to everyone, then at least one deleted page can be
+		 * recycled -- don't skip cleanup.
 		 */
 		result = true;
 	}
@@ -1276,14 +1281,13 @@ backtrack:
 				 * own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
-				 * before it finishes could have a RecentGlobalXmin with a
-				 * later xid than the VACUUM's OldestXmin cutoff.  These
-				 * backends might happen to opportunistically mark some index
-				 * tuples LP_DEAD before we reach them, even though they may
-				 * be after our cutoff.  We don't try to kill these "extra"
-				 * index tuples in _bt_delitems_vacuum().  This keep things
-				 * simple, and allows us to always avoid generating our own
-				 * conflicts.
+				 * before it finishes could have a visibility cutoff with a
+				 * later xid than VACUUM's OldestXmin cutoff.  These backends
+				 * might happen to opportunistically mark some index tuples
+				 * LP_DEAD before we reach them, even though they may be after
+				 * our cutoff.  We don't try to kill these "extra" index
+				 * tuples in _bt_delitems_vacuum().  This keeps things simple,
+				 * and allows us to always avoid generating our own conflicts.
 				 */
 				Assert(!BTreeTupleIsPivot(itup));
 				if (!BTreeTupleIsPosting(itup))
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dbec58d5249..bda9be23489 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -948,11 +948,11 @@ btree_xlog_reuse_page(XLogReaderState *record)
 	 * Btree reuse_page records exist to provide a conflict point when we
 	 * reuse pages in the index via the FSM.  That's all they do though.
 	 *
-	 * latestRemovedXid was the page's btpo.xact.  The btpo.xact <
-	 * RecentGlobalXmin test in _bt_page_recyclable() conceptually mirrors the
-	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
-	 * Consequently, one XID value achieves the same exclusion effect on
-	 * primary and standby.
+	 * latestRemovedXid was the page's btpo.xact.  The
+	 * GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
+	 * mirrors the pgxact->xmin > limitXmin test in
+	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
+	 * same exclusion effect on primary and standby.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index bd98707f3c0..e1c58933f97 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -501,10 +501,14 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 	OffsetNumber itemToPlaceholder[MaxIndexTuplesPerPage];
 	OffsetNumber itemnos[MaxIndexTuplesPerPage];
 	spgxlogVacuumRedirect xlrec;
+	GlobalVisState *vistest;
 
 	xlrec.nToPlaceholder = 0;
 	xlrec.newestRedirectXid = InvalidTransactionId;
 
+	/* XXX: providing heap relation would allow more pruning */
+	vistest = GlobalVisTestFor(NULL);
+
 	START_CRIT_SECTION();
 
 	/*
@@ -521,7 +525,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
 		dt = (SpGistDeadTuple) PageGetItem(page, PageGetItemId(page, i));
 
 		if (dt->tupstate == SPGIST_REDIRECT &&
-			TransactionIdPrecedes(dt->xid, RecentGlobalXmin))
+			GlobalVisTestIsRemovableXid(vistest, dt->xid))
 		{
 			dt->tupstate = SPGIST_PLACEHOLDER;
 			Assert(opaque->nRedirection > 0);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eb9aac5fd39..fffe0783295 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -293,42 +293,50 @@ once, rather than assume they can read it multiple times and get the same
 answer each time.  (Use volatile-qualified pointers when doing this, to
 ensure that the C compiler does exactly what you tell it to.)
 
-Another important activity that uses the shared ProcArray is GetOldestXmin,
-which must determine a lower bound for the oldest xmin of any active MVCC
-snapshot, system-wide.  Each individual backend advertises the smallest
-xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
-live snapshots (eg, if it's between transactions or hasn't yet set a
-snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
-valid xmin fields.  It does this with only shared lock on ProcArrayLock,
-which means there is a potential race condition against other backends
-doing GetSnapshotData concurrently: we must be certain that a concurrent
-backend that is about to set its xmin does not compute an xmin less than
-what GetOldestXmin returns.  We ensure that by including all the active
-XIDs into the MIN() calculation, along with the valid xmins.  The rule that
-transactions can't exit without taking exclusive ProcArrayLock ensures that
-concurrent holders of shared ProcArrayLock will compute the same minimum of
-currently-active XIDs: no xact, in particular not the oldest, can exit
-while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
-active XID will be the same as that of any concurrent GetSnapshotData, and
-so it can't produce an overestimate.  If there is no active transaction at
-all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
-for the xmin that might be computed by concurrent or later GetSnapshotData
-calls.  (We know that no XID less than this could be about to appear in
-the ProcArray, because of the XidGenLock interlock discussed above.)
+Another important activity that uses the shared ProcArray is
+ComputeXidHorizons, which must determine a lower bound for the oldest xmin
+of any active MVCC snapshot, system-wide.  Each individual backend
+advertises the smallest xmin of its own snapshots in MyPgXact->xmin, or zero
+if it currently has no live snapshots (eg, if it's between transactions or
+hasn't yet set a snapshot for a new transaction).  ComputeXidHorizons takes
+the MIN() of the valid xmin fields.  It does this with only shared lock on
+ProcArrayLock, which means there is a potential race condition against other
+backends doing GetSnapshotData concurrently: we must be certain that a
+concurrent backend that is about to set its xmin does not compute an xmin
+less than what ComputeXidHorizons determines.  We ensure that by including
+all the active XIDs into the MIN() calculation, along with the valid xmins.
+The rule that transactions can't exit without taking exclusive ProcArrayLock
+ensures that concurrent holders of shared ProcArrayLock will compute the
+same minimum of currently-active XIDs: no xact, in particular not the
+oldest, can exit while we hold shared ProcArrayLock.  So
+ComputeXidHorizons's view of the minimum active XID will be the same as that
+of any concurrent GetSnapshotData, and so it can't produce an overestimate.
+If there is no active transaction at all, ComputeXidHorizons uses
+latestCompletedXid + 1, which is a lower bound for the xmin that might
+be computed by concurrent or later GetSnapshotData calls.  (We know that no
+XID less than this could be about to appear in the ProcArray, because of the
+XidGenLock interlock discussed above.)
 
-GetSnapshotData also performs an oldest-xmin calculation (which had better
-match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
-for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
-too expensive.  Note that while it is certain that two concurrent
-executions of GetSnapshotData will compute the same xmin for their own
-snapshots, as argued above, it is not certain that they will arrive at the
-same estimate of RecentGlobalXmin.  This is because we allow XID-less
-transactions to clear their MyPgXact->xmin asynchronously (without taking
-ProcArrayLock), so one execution might see what had been the oldest xmin,
-and another not.  This is OK since RecentGlobalXmin need only be a valid
-lower bound.  As noted above, we are already assuming that fetch/store
-of the xid fields is atomic, so assuming it for xmin as well is no extra
-risk.
+As GetSnapshotData is performance critical, it does not perform an accurate
+oldest-xmin calculation (it used to, until v13). The contents of a snapshot
+only depend on the xids of other backends, not their xmin. As a backend's xmin
+changes much more often than its xid, having GetSnapshotData look at xmins
+can lead to a lot of unnecessary cacheline ping-pong.  Instead
+GetSnapshotData updates approximate thresholds (one that guarantees that all
+deleted rows older than it can be removed, another determining that deleted
+rows newer than it can not be removed). GlobalVisTest* uses those thresholds
+to make invisibility decisions, falling back to ComputeXidHorizons if
+necessary.
+
+Note that while it is certain that two concurrent executions of
+GetSnapshotData will compute the same xmin for their own snapshots, there is
+no such guarantee for the horizons computed by ComputeXidHorizons.  This is
+because we allow XID-less transactions to clear their MyPgXact->xmin
+asynchronously (without taking ProcArrayLock), so one execution might see
+what had been the oldest xmin, and another not.  This is OK since the
+thresholds need only be a valid lower bound.  As noted above, we are already
+assuming that fetch/store of the xid fields is atomic, so assuming it for
+xmin as well is no extra risk.
 
 
 pg_xact and pg_subtrans
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f72faee82c..09c01ed4ae4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9096,7 +9096,7 @@ CreateCheckPoint(int flags)
 	 * StartupSUBTRANS hasn't been called yet.
 	 */
 	if (!RecoveryInProgress())
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
@@ -9456,7 +9456,7 @@ CreateRestartPoint(int flags)
 	 * this because StartupSUBTRANS hasn't been called yet.
 	 */
 	if (EnableHotStandby)
-		TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+		TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
 
 	/* Real work is done, but log and update before releasing lock. */
 	LogCheckpointEnd(true);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e0fa73ba790..8af12b5c6b2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1045,7 +1045,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	totalblocks = RelationGetNumberOfBlocks(onerel);
 
 	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
 
 	/* Prepare for sampling block numbers */
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, random());
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 576c7e63e99..22228f5684f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -955,8 +955,25 @@ vacuum_set_xid_limits(Relation rel,
 	 * working on a particular table at any time, and that each vacuum is
 	 * always an independent transaction.
 	 */
-	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+	*oldestXmin = GetOldestNonRemovableTransactionId(rel);
+
+	if (OldSnapshotThresholdActive())
+	{
+		TransactionId limit_xmin;
+		TimestampTz limit_ts;
+
+		if (TransactionIdLimitedForOldSnapshots(*oldestXmin, rel, &limit_xmin, &limit_ts))
+		{
+			/*
+			 * TODO: We should only set the threshold if we are pruning on the
+			 * basis of the increased limits. Not as crucial here as it is for
+			 * opportunistic pruning (which often happens at a much higher
+			 * frequency), but would still be a significant improvement.
+			 */
+			SetOldSnapshotThresholdTimestamp(limit_ts, limit_xmin);
+			*oldestXmin = limit_xmin;
+		}
+	}
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
@@ -1345,12 +1362,13 @@ vac_update_datfrozenxid(void)
 	bool		dirty = false;
 
 	/*
-	 * Initialize the "min" calculation with GetOldestXmin, which is a
-	 * reasonable approximation to the minimum relfrozenxid for not-yet-
-	 * committed pg_class entries for new tables; see AddNewRelationTuple().
-	 * So we cannot produce a wrong minimum by starting with this.
+	 * Initialize the "min" calculation with
+	 * GetOldestNonRemovableTransactionId(), which is a reasonable
+	 * approximation to the minimum relfrozenxid for not-yet-committed
+	 * pg_class entries for new tables; see AddNewRelationTuple().  So we
+	 * cannot produce a wrong minimum by starting with this.
 	 */
-	newFrozenXid = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+	newFrozenXid = GetOldestNonRemovableTransactionId(NULL);
 
 	/*
 	 * Similarly, initialize the MultiXact "min" with the value that would be
@@ -1681,8 +1699,9 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	StartTransactionCommand();
 
 	/*
-	 * Functions in indexes may want a snapshot set.  Also, setting a snapshot
-	 * ensures that RecentGlobalXmin is kept truly recent.
+	 * Need to acquire a snapshot to prevent pg_subtrans from being truncated,
+	 * cutoff xids in local memory wrapping around, and to have updated xmin
+	 * horizons.
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 
@@ -1705,8 +1724,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 *
 		 * Note: these flags remain set until CommitTransaction or
 		 * AbortTransaction.  We don't want to clear them until we reset
-		 * MyPgXact->xid/xmin, else OldestXmin might appear to go backwards,
-		 * which is probably Not Good.
+		 * MyPgXact->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
+		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60e..ac97e28be19 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1877,6 +1877,10 @@ get_database_list(void)
 	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 	 * for anything that reads heap pages, because HOT may decide to prune
 	 * them even if the process doesn't attempt to modify any tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	StartTransactionCommand();
 	(void) GetTransactionSnapshot();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index ff985b9b24c..bdaf0312d63 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -122,6 +122,10 @@ get_subscription_list(void)
 	 * the secondary effect that it sets RecentGlobalXmin.  (This is critical
 	 * for anything that reads heap pages, because HOT may decide to prune
 	 * them even if the process doesn't attempt to modify any tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	StartTransactionCommand();
 	(void) GetTransactionSnapshot();
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d5a9b568a68..7c11e1ab44c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1181,22 +1181,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	 */
 	if (hot_standby_feedback)
 	{
-		TransactionId slot_xmin;
-
-		/*
-		 * Usually GetOldestXmin() would include both global replication slot
-		 * xmin and catalog_xmin in its calculations, but we want to derive
-		 * separate values for each of those. So we ask for an xmin that
-		 * excludes the catalog_xmin.
-		 */
-		xmin = GetOldestXmin(NULL,
-							 PROCARRAY_FLAGS_DEFAULT | PROCARRAY_SLOTS_XMIN);
-
-		ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);
-
-		if (TransactionIdIsValid(slot_xmin) &&
-			TransactionIdPrecedes(slot_xmin, xmin))
-			xmin = slot_xmin;
+		GetReplicationHorizons(&xmin, &catalog_xmin);
 	}
 	else
 	{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d13220c1400..460ca3f947f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2113,9 +2113,10 @@ ProcessStandbyHSFeedbackMessage(void)
 
 	/*
 	 * Set the WalSender's xmin equal to the standby's requested xmin, so that
-	 * the xmin will be taken into account by GetOldestXmin.  This will hold
-	 * back the removal of dead rows and thereby prevent the generation of
-	 * cleanup conflicts on the standby server.
+	 * the xmin will be taken into account by GetSnapshotData() /
+	 * ComputeXidHorizons().  This will hold back the removal of dead rows and
+	 * thereby prevent the generation of cleanup conflicts on the standby
+	 * server.
 	 *
 	 * There is a small window for a race condition here: although we just
 	 * checked that feedbackXmin precedes nextXid, the nextXid could have
@@ -2128,10 +2129,10 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * own xmin would prevent nextXid from advancing so far.
 	 *
 	 * We don't bother taking the ProcArrayLock here.  Setting the xmin field
-	 * is assumed atomic, and there's no real need to prevent a concurrent
-	 * GetOldestXmin.  (If we're moving our xmin forward, this is obviously
-	 * safe, and if we're moving it backwards, well, the data is at risk
-	 * already since a VACUUM could have just finished calling GetOldestXmin.)
+	 * is assumed atomic, and there's no real need to prevent concurrent
+	 * horizon determinations.  (If we're moving our xmin forward, this is
+	 * obviously safe, and if we're moving it backwards, well, the data is at
+	 * risk already since a VACUUM could already have determined the horizon.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
 	 * otherwise via the walsender's PGXACT entry. We can only track the
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 522518695ee..360e6e9da07 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -99,6 +99,142 @@ typedef struct ProcArrayStruct
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
+/*
+ * State for the GlobalVisTest* family of functions. Those functions can
+ * e.g. be used to decide if a deleted row can be removed without violating
+ * MVCC semantics: If the deleted row's xmax is not considered to be running
+ * by anyone, the row can be removed.
+ *
+ * To avoid slowing down GetSnapshotData(), we don't calculate a precise
+ * cutoff XID while building a snapshot (looking at the frequently changing
+ * xmins scales badly). Instead we compute two boundaries while building the
+ * snapshot:
+ *
+ * 1) definitely_needed, indicating that rows deleted by XIDs >=
+ *    definitely_needed are definitely still visible.
+ *
+ * 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
+ *    definitely be removed.
+ *
+ * When testing an XID that falls in between the two (i.e. XID >= maybe_needed
+ * && XID < definitely_needed), the boundaries can be recomputed (using
+ * ComputeXidHorizons()) to get a more accurate answer. This is cheaper than
+ * maintaining an accurate value all the time.
+ *
+ * As it is not cheap to compute accurate boundaries, we limit the number of
+ * times that happens in short succession. See GlobalVisTestShouldUpdate().
+ *
+ *
+ * There are three backend lifetime instances of this struct, optimized for
+ * different types of relations. As e.g. a normal user defined table in one
+ * database is inaccessible to backends connected to another database, a test
+ * specific to a relation can be more aggressive than a test for a shared
+ * relation.  Currently we track three different states:
+ *
+ * 1) GlobalVisSharedRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in any database, nor a
+ *    replication slot's xmin, nor a replication slot's catalog_xmin might
+ *    still consider XID as running.
+ *
+ * 2) GlobalVisCatalogRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in the current
+ *    database, nor a replication slot's xmin, nor a replication slot's
+ *    catalog_xmin might still consider XID as running.
+ *
+ *    I.e. the difference to GlobalVisSharedRels is that
+ *    snapshots in other databases are ignored.
+ *
+ * 3) GlobalVisDataRels, which only considers an XID's
+ *    effects visible-to-everyone if neither snapshots in the current
+ *    database, nor a replication slot's xmin consider XID as running.
+ *
+ *    I.e. the difference to GlobalVisCatalogRels is that
+ *    replication slot's catalog_xmin is not taken into account.
+ *
+ * GlobalVisTestFor(relation) returns the appropriate state
+ * for the relation.
+ *
+ * The boundaries are FullTransactionIds instead of TransactionIds to avoid
+ * wraparound dangers. Otherwise there would e.g. be no procarray state to
+ * prevent maybe_needed from becoming old enough after the GetSnapshotData()
+ * call.
+ *
+ * The typedef is in the header.
+ */
+struct GlobalVisState
+{
+	/* XIDs >= are considered running by some backend */
+	FullTransactionId definitely_needed;
+
+	/* XIDs < are not considered to be running by any backend */
+	FullTransactionId maybe_needed;
+};
+
+/*
+ * Result of ComputeXidHorizons().
+ */
+typedef struct ComputeXidHorizonsResult
+{
+	/*
+	 * The value of ShmemVariableCache->latestCompletedXid when
+	 * ComputeXidHorizons() held ProcArrayLock.
+	 */
+	FullTransactionId latest_completed;
+
+	/*
+	 * The same for procArray->replication_slot_xmin and
+	 * procArray->replication_slot_catalog_xmin.
+	 */
+	TransactionId slot_xmin;
+	TransactionId slot_catalog_xmin;
+
+	/*
+	 * Oldest xid that any backend might still consider running. This needs to
+	 * include processes running VACUUM, in contrast to the normal visibility
+	 * cutoffs, as vacuum needs to be able to perform pg_subtrans lookups when
+	 * determining visibility, but doesn't care about rows above its xmin
+	 * being removed.
+	 *
+	 * This likely should only be needed to determine whether pg_subtrans can
+	 * be truncated. It currently includes the effects of replication slots,
+	 * for historical reasons. But that could likely be changed.
+	 */
+	TransactionId oldest_considered_running;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in shared
+	 * tables.
+	 *
+	 * This includes the effects of replication slots. If that's not desired,
+	 * look at shared_oldest_nonremovable_raw.
+	 */
+	TransactionId shared_oldest_nonremovable;
+
+	/*
+	 * Oldest xid that may be necessary to retain in shared tables. This is
+	 * the same as shared_oldest_nonremovable, except that it is not affected by
+	 * replication slot's catalog_xmin.
+	 *
+	 * This is mainly useful to be able to send the catalog_xmin to upstream
+	 * streaming replication servers via hot_standby_feedback, so they can
+	 * apply the limit only when accessing catalog tables.
+	 */
+	TransactionId shared_oldest_nonremovable_raw;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in non-shared
+	 * catalog tables.
+	 */
+	TransactionId catalog_oldest_nonremovable;
+
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in normal user
+	 * defined tables.
+	 */
+	TransactionId data_oldest_nonremovable;
+} ComputeXidHorizonsResult;
+
+
 static ProcArrayStruct *procArray;
 
 static PGPROC *allProcs;
@@ -118,6 +254,22 @@ static TransactionId latestObservedXid = InvalidTransactionId;
  */
 static TransactionId standbySnapshotPendingXmin;
 
+/*
+ * State for visibility checks on different types of relations. See struct
+ * GlobalVisState for details. As shared, catalog, and user defined
+ * relations can have different horizons, one such state exists for each.
+ */
+static GlobalVisState GlobalVisSharedRels;
+static GlobalVisState GlobalVisCatalogRels;
+static GlobalVisState GlobalVisDataRels;
+
+/*
+ * This backend's RecentXmin at the last time the accurate xmin horizon was
+ * recomputed, or InvalidTransactionId if it has not. Used to limit how many
+ * times accurate horizons are recomputed. See GlobalVisTestShouldUpdate().
+ */
+static TransactionId ComputeXidHorizonsResultLastXmin;
+
 #ifdef XIDCACHE_DEBUG
 
 /* counters for XidCache measurement */
@@ -180,6 +332,7 @@ static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
 
 static inline FullTransactionId FullXidRelativeTo(FullTransactionId rel,
 												  TransactionId xid);
+static void GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons);
 
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
@@ -1302,159 +1455,191 @@ TransactionIdIsActive(TransactionId xid)
 
 
 /*
- * GetOldestXmin -- returns oldest transaction that was running
- *					when any current transaction was started.
+ * Determine XID horizons.
  *
- * If rel is NULL or a shared relation, all backends are considered, otherwise
- * only backends running in this database are considered.
+ * This is used by wrapper functions like GetOldestNonRemovableTransactionId()
+ * (for VACUUM), GetReplicationHorizons() (for hot_standby_feedback), etc as
+ * well as "internally" by GlobalVisUpdate() (see comment above struct
+ * GlobalVisState).
  *
- * The flags are used to ignore the backends in calculation when any of the
- * corresponding flags is set. Typically, if you want to ignore ones with
- * PROC_IN_VACUUM flag, you can use PROCARRAY_FLAGS_VACUUM.
+ * See the definition of ComputeXidHorizonsResult for the various computed
+ * horizons.
  *
- * PROCARRAY_SLOTS_XMIN causes GetOldestXmin to ignore the xmin and
- * catalog_xmin of any replication slots that exist in the system when
- * calculating the oldest xmin.
+ * For VACUUM separate horizons (used to decide which deleted tuples must
+ * be preserved) are computed for shared and non-shared tables.  For shared
+ * relations backends in all databases must be considered, but for non-shared
+ * relations that's not required, since only backends in my own database could
+ * ever see the tuples in them. Also, we can ignore concurrently running lazy
+ * VACUUMs because (a) they must be working on other tables, and (b) they
+ * don't need to do snapshot-based lookups.
  *
- * This is used by VACUUM to decide which deleted tuples must be preserved in
- * the passed in table. For shared relations backends in all databases must be
- * considered, but for non-shared relations that's not required, since only
- * backends in my own database could ever see the tuples in them. Also, we can
- * ignore concurrently running lazy VACUUMs because (a) they must be working
- * on other tables, and (b) they don't need to do snapshot-based lookups.
- *
- * This is also used to determine where to truncate pg_subtrans.  For that
- * backends in all databases have to be considered, so rel = NULL has to be
- * passed in.
+ * This also computes a horizon used to truncate pg_subtrans. For that
+ * backends in all databases have to be considered, and concurrently running
+ * lazy VACUUMs cannot be ignored, as they still may perform pg_subtrans
+ * accesses.
  *
  * Note: we include all currently running xids in the set of considered xids.
  * This ensures that if a just-started xact has not yet set its snapshot,
  * when it does set the snapshot it cannot set xmin less than what we compute.
  * See notes in src/backend/access/transam/README.
  *
- * Note: despite the above, it's possible for the calculated value to move
- * backwards on repeated calls. The calculated value is conservative, so that
- * anything older is definitely not considered as running by anyone anymore,
- * but the exact value calculated depends on a number of things. For example,
- * if rel = NULL and there are no transactions running in the current
- * database, GetOldestXmin() returns latestCompletedXid. If a transaction
+ * Note: despite the above, it's possible for the calculated values to move
+ * backwards on repeated calls. The calculated values are conservative, so
+ * that anything older is definitely not considered as running by anyone
+ * anymore, but the exact values calculated depend on a number of things. For
+ * example, if there are no transactions running in the current database, the
+ * horizon for normal tables will be latestCompletedXid. If a transaction
  * begins after that, its xmin will include in-progress transactions in other
  * databases that started earlier, so another call will return a lower value.
  * Nonetheless it is safe to vacuum a table in the current database with the
  * first result.  There are also replication-related effects: a walsender
  * process can set its xmin based on transactions that are no longer running
  * on the primary but are still being replayed on the standby, thus possibly
- * making the GetOldestXmin reading go backwards.  In this case there is a
- * possibility that we lose data that the standby would like to have, but
- * unless the standby uses a replication slot to make its xmin persistent
- * there is little we can do about that --- data is only protected if the
- * walsender runs continuously while queries are executed on the standby.
- * (The Hot Standby code deals with such cases by failing standby queries
- * that needed to access already-removed data, so there's no integrity bug.)
- * The return value is also adjusted with vacuum_defer_cleanup_age, so
- * increasing that setting on the fly is another easy way to make
- * GetOldestXmin() move backwards, with no consequences for data integrity.
+ * making the values go backwards.  In this case there is a possibility that
+ * we lose data that the standby would like to have, but unless the standby
+ * uses a replication slot to make its xmin persistent there is little we can
+ * do about that --- data is only protected if the walsender runs continuously
+ * while queries are executed on the standby.  (The Hot Standby code deals
+ * with such cases by failing standby queries that needed to access
+ * already-removed data, so there's no integrity bug.)  The computed values
+ * are also adjusted with vacuum_defer_cleanup_age, so increasing that setting
+ * on the fly is another easy way to make horizons move backwards, with no
+ * consequences for data integrity.
+ *
+ * Note: the approximate horizons (see definition of GlobalVisState) are
+ * updated by the computations done here. That's currently required for
+ * correctness and a small optimization. Without doing so it's possible that
+ * heap vacuum's call to heap_page_prune() uses a more conservative horizon
+ * than later when deciding which tuples can be removed - which the code
+ * doesn't expect (breaking HOT).
  */
-TransactionId
-GetOldestXmin(Relation rel, int flags)
+static void
+ComputeXidHorizons(ComputeXidHorizonsResult *h)
 {
 	ProcArrayStruct *arrayP = procArray;
-	TransactionId result;
-	int			index;
-	bool		allDbs;
+	TransactionId kaxmin;
+	bool		in_recovery = RecoveryInProgress();
 
-	TransactionId replication_slot_xmin = InvalidTransactionId;
-	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
-
-	/*
-	 * If we're not computing a relation specific limit, or if a shared
-	 * relation has been passed in, backends in all databases have to be
-	 * considered.
-	 */
-	allDbs = rel == NULL || rel->rd_rel->relisshared;
-
-	/* Cannot look for individual databases during recovery */
-	Assert(allDbs || !RecoveryInProgress());
+	/* inferred after ProcArrayLock is released */
+	h->catalog_oldest_nonremovable = InvalidTransactionId;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
+	h->latest_completed = ShmemVariableCache->latestCompletedXid;
+
 	/*
 	 * We initialize the MIN() calculation with latestCompletedXid + 1. This
 	 * is a lower bound for the XIDs that might appear in the ProcArray later,
 	 * and so protects us against overestimating the result due to future
 	 * additions.
 	 */
-	result = XidFromFullTransactionId(ShmemVariableCache->latestCompletedXid);
-	TransactionIdAdvance(result);
-	Assert(TransactionIdIsNormal(result));
+	{
+		TransactionId initial;
 
-	for (index = 0; index < arrayP->numProcs; index++)
+		initial = XidFromFullTransactionId(h->latest_completed);
+		Assert(TransactionIdIsValid(initial));
+		TransactionIdAdvance(initial);
+
+		h->oldest_considered_running = initial;
+		h->shared_oldest_nonremovable = initial;
+		h->data_oldest_nonremovable = initial;
+	}
+
+	/*
+	 * Fetch slot horizons while ProcArrayLock is held - the
+	 * LWLockAcquire/LWLockRelease are a barrier, ensuring this happens inside
+	 * the lock.
+	 */
+	h->slot_xmin = procArray->replication_slot_xmin;
+	h->slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+
+	for (int index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
 		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		TransactionId xid;
+		TransactionId xmin;
 
-		if (pgxact->vacuumFlags & (flags & PROCARRAY_PROC_FLAGS_MASK))
+		/* Fetch xid just once - see GetNewTransactionId */
+		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+
+		/*
+		 * Consider both the transaction's Xmin, and its Xid.
+		 *
+		 * We must check both because a transaction might have an Xmin but not
+		 * (yet) an Xid; conversely, if it has an Xid, that could determine
+		 * some not-yet-set Xmin.
+		 */
+		xmin = TransactionIdOlder(xmin, xid);
+
+		/* if neither is set, this proc doesn't influence the horizon */
+		if (!TransactionIdIsValid(xmin))
 			continue;
 
-		if (allDbs ||
+		/*
+		 * Don't ignore any procs when determining which transactions might be
+		 * considered running.  While slots should ensure logical decoding
+		 * backends are protected even without this check, it can't hurt to
+		 * include them here as well.
+		 */
+		h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running, xmin);
+
+		/*
+		 * Skip over backends either vacuuming (which is ok with rows being
+		 * removed, as long as pg_subtrans is not truncated) or doing logical
+		 * decoding (which manages xmin separately, check below).
+		 */
+		if (pgxact->vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
+			continue;
+
+		/* shared tables need to take backends in all databases into account */
+		h->shared_oldest_nonremovable =
+			TransactionIdOlder(h->shared_oldest_nonremovable, xmin);
+
+		/*
+		 * Normally queries in other databases are ignored for anything but
+		 * the shared horizon. But in recovery we cannot compute an accurate
+		 * per-database horizon as all xids are managed via the
+		 * KnownAssignedXids machinery.
+		 */
+		if (in_recovery ||
 			proc->databaseId == MyDatabaseId ||
 			proc->databaseId == 0)	/* always include WalSender */
 		{
-			/* Fetch xid just once - see GetNewTransactionId */
-			TransactionId xid = UINT32_ACCESS_ONCE(pgxact->xid);
-
-			/* First consider the transaction's own Xid, if any */
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, result))
-				result = xid;
-
-			/*
-			 * Also consider the transaction's Xmin, if set.
-			 *
-			 * We must check both Xid and Xmin because a transaction might
-			 * have an Xmin but not (yet) an Xid; conversely, if it has an
-			 * Xid, that could determine some not-yet-set Xmin.
-			 */
-			xid = UINT32_ACCESS_ONCE(pgxact->xmin);
-			if (TransactionIdIsNormal(xid) &&
-				TransactionIdPrecedes(xid, result))
-				result = xid;
+			h->data_oldest_nonremovable =
+				TransactionIdOlder(h->data_oldest_nonremovable, xmin);
 		}
 	}
 
 	/*
-	 * Fetch into local variable while ProcArrayLock is held - the
-	 * LWLockRelease below is a barrier, ensuring this happens inside the
-	 * lock.
+	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
+	 * after lock is released.
 	 */
-	replication_slot_xmin = procArray->replication_slot_xmin;
-	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+	if (in_recovery)
+		kaxmin = KnownAssignedXidsGetOldestXmin();
 
-	if (RecoveryInProgress())
+	/*
+	 * No other information needed, so release the lock immediately. The rest
+	 * of the computations can be done without a lock.
+	 */
+	LWLockRelease(ProcArrayLock);
+
+	if (in_recovery)
 	{
-		/*
-		 * Check to see whether KnownAssignedXids contains an xid value older
-		 * than the main procarray.
-		 */
-		TransactionId kaxmin = KnownAssignedXidsGetOldestXmin();
-
-		LWLockRelease(ProcArrayLock);
-
-		if (TransactionIdIsNormal(kaxmin) &&
-			TransactionIdPrecedes(kaxmin, result))
-			result = kaxmin;
+		h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running, kaxmin);
+		h->shared_oldest_nonremovable =
+			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
+		h->data_oldest_nonremovable =
+			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
 	}
 	else
 	{
 		/*
-		 * No other information needed, so release the lock immediately.
-		 */
-		LWLockRelease(ProcArrayLock);
-
-		/*
-		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age,
-		 * being careful not to generate a "permanent" XID.
+		 * Compute the cutoff XID by subtracting vacuum_defer_cleanup_age.
 		 *
 		 * vacuum_defer_cleanup_age provides some additional "slop" for the
 		 * benefit of hot standby queries on standby servers.  This is quick
@@ -1466,34 +1651,146 @@ GetOldestXmin(Relation rel, int flags)
 		 * in varsup.c.  Also note that we intentionally don't apply
 		 * vacuum_defer_cleanup_age on standby servers.
 		 */
-		result -= vacuum_defer_cleanup_age;
-		if (!TransactionIdIsNormal(result))
-			result = FirstNormalTransactionId;
+		h->oldest_considered_running =
+			TransactionIdRetreatedBy(h->oldest_considered_running,
+									 vacuum_defer_cleanup_age);
+		h->shared_oldest_nonremovable =
+			TransactionIdRetreatedBy(h->shared_oldest_nonremovable,
+									 vacuum_defer_cleanup_age);
+		h->data_oldest_nonremovable =
+			TransactionIdRetreatedBy(h->data_oldest_nonremovable,
+									 vacuum_defer_cleanup_age);
 	}
 
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
 	 */
-	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
-		TransactionIdIsValid(replication_slot_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_xmin, result))
-		result = replication_slot_xmin;
+	h->shared_oldest_nonremovable =
+		TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
+	h->data_oldest_nonremovable =
+		TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
 
 	/*
-	 * After locks have been released and vacuum_defer_cleanup_age has been
-	 * applied, check whether we need to back up further to make logical
-	 * decoding possible. We need to do so if we're computing the global limit
-	 * (rel = NULL) or if the passed relation is a catalog relation of some
-	 * kind.
+	 * The only difference between catalog / data horizons is that the slot's
+	 * catalog xmin is applied to the catalog one (so catalogs can be accessed
+	 * for logical decoding). Initialize with data horizon, and then back up
+	 * further if necessary. Have to back up the shared horizon as well, since
+	 * that also can contain catalogs.
 	 */
-	if (!(flags & PROCARRAY_SLOTS_XMIN) &&
-		(rel == NULL ||
-		 RelationIsAccessibleInLogicalDecoding(rel)) &&
-		TransactionIdIsValid(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
-		result = replication_slot_catalog_xmin;
+	h->shared_oldest_nonremovable_raw = h->shared_oldest_nonremovable;
+	h->shared_oldest_nonremovable =
+		TransactionIdOlder(h->shared_oldest_nonremovable,
+						   h->slot_catalog_xmin);
+	h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+	h->catalog_oldest_nonremovable =
+		TransactionIdOlder(h->catalog_oldest_nonremovable,
+						   h->slot_catalog_xmin);
 
-	return result;
+	/*
+	 * It's possible that slots / vacuum_defer_cleanup_age backed up the
+	 * horizons further than oldest_considered_running. Fix.
+	 */
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->shared_oldest_nonremovable);
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->catalog_oldest_nonremovable);
+	h->oldest_considered_running =
+		TransactionIdOlder(h->oldest_considered_running,
+						   h->data_oldest_nonremovable);
+
+	/*
+	 * shared horizons have to be at least as old as the oldest visible in
+	 * current db
+	 */
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->catalog_oldest_nonremovable));
+
+	/*
+	 * Horizons need to ensure that pg_subtrans access is still possible for
+	 * the relevant backends.
+	 */
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->shared_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->catalog_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->data_oldest_nonremovable));
+	Assert(!TransactionIdIsValid(h->slot_xmin) ||
+		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->slot_xmin));
+	Assert(!TransactionIdIsValid(h->slot_catalog_xmin) ||
+		   TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->slot_catalog_xmin));
+
+	/* update approximate horizons with the computed horizons */
+	GlobalVisUpdateApply(h);
+}
+
+/*
+ * Return the oldest XID for which deleted tuples must be preserved in the
+ * passed table.
+ *
+ * If rel is not NULL the horizon may be considerably more recent than
+ * otherwise (i.e. more tuples will be removable). In the NULL case a horizon
+ * that is correct (but not optimal) for all relations will be returned.
+ *
+ * This is used by VACUUM to decide which deleted tuples must be preserved in
+ * the passed in table.
+ */
+TransactionId
+GetOldestNonRemovableTransactionId(Relation rel)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	/* select horizon appropriate for relation */
+	if (rel == NULL || rel->rd_rel->relisshared)
+		return horizons.shared_oldest_nonremovable;
+	else if (RelationIsAccessibleInLogicalDecoding(rel))
+		return horizons.catalog_oldest_nonremovable;
+	else
+		return horizons.data_oldest_nonremovable;
+}
+
+/*
+ * Return the oldest transaction id any currently running backend might still
+ * consider running. This should not be used for visibility / pruning
+ * determinations (see GetOldestNonRemovableTransactionId()), but for
+ * decisions like up to where pg_subtrans can be truncated.
+ */
+TransactionId
+GetOldestTransactionIdConsideredRunning(void)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	return horizons.oldest_considered_running;
+}
+
+/*
+ * Return the visibility horizons for a hot standby feedback message.
+ */
+void
+GetReplicationHorizons(TransactionId *xmin, TransactionId *catalog_xmin)
+{
+	ComputeXidHorizonsResult horizons;
+
+	ComputeXidHorizons(&horizons);
+
+	/*
+	 * Don't want to use shared_oldest_nonremovable here, as that contains the
+	 * effect of replication slots' catalog_xmin. We want to send a separate
+	 * feedback for the catalog horizon, so the primary can remove data table
+	 * contents more aggressively.
+	 */
+	*xmin = horizons.shared_oldest_nonremovable_raw;
+	*catalog_xmin = horizons.slot_catalog_xmin;
 }
 
 /*
@@ -1544,12 +1841,10 @@ GetMaxSnapshotSubxidCount(void)
  *			current transaction (this is the same as MyPgXact->xmin).
  *		RecentXmin: the xmin computed for the most recent snapshot.  XIDs
  *			older than this are known not running any more.
- *		RecentGlobalXmin: the global xmin (oldest TransactionXmin across all
- *			running transactions, except those running LAZY VACUUM).  This is
- *			the same computation done by
- *			GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM).
- *		RecentGlobalDataXmin: the global xmin for non-catalog tables
- *			>= RecentGlobalXmin
+ *
+ * And try to advance the bounds of GlobalVisSharedRels,
+ * GlobalVisCatalogRels, GlobalVisDataRels for
+ * the benefit of GlobalVis*.
  *
  * Note: this function should probably not be called with an argument that's
  * not statically allocated (see xip allocation below).
@@ -1560,12 +1855,12 @@ GetSnapshotData(Snapshot snapshot)
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId xmin;
 	TransactionId xmax;
-	TransactionId globalxmin;
 	int			index;
 	int			count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
 	FullTransactionId latest_completed;
+	TransactionId oldestxid;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1610,13 +1905,15 @@ GetSnapshotData(Snapshot snapshot)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	latest_completed = ShmemVariableCache->latestCompletedXid;
+	oldestxid = ShmemVariableCache->oldestXid;
+
 	/* xmax is always latestCompletedXid + 1 */
 	xmax = XidFromFullTransactionId(latest_completed);
 	TransactionIdAdvance(xmax);
 	Assert(TransactionIdIsNormal(xmax));
 
 	/* initialize xmin calculation with xmax */
-	globalxmin = xmin = xmax;
+	xmin = xmax;
 
 	snapshot->takenDuringRecovery = RecoveryInProgress();
 
@@ -1645,12 +1942,6 @@ GetSnapshotData(Snapshot snapshot)
 				(PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
-			/* Update globalxmin to be the smallest valid xmin */
-			xid = UINT32_ACCESS_ONCE(pgxact->xmin);
-			if (TransactionIdIsNormal(xid) &&
-				NormalTransactionIdPrecedes(xid, globalxmin))
-				globalxmin = xid;
-
 			/* Fetch xid just once - see GetNewTransactionId */
 			xid = UINT32_ACCESS_ONCE(pgxact->xid);
 
@@ -1766,34 +2057,78 @@ GetSnapshotData(Snapshot snapshot)
 
 	LWLockRelease(ProcArrayLock);
 
-	/*
-	 * Update globalxmin to include actual process xids.  This is a slightly
-	 * different way of computing it than GetOldestXmin uses, but should give
-	 * the same result.
-	 */
-	if (TransactionIdPrecedes(xmin, globalxmin))
-		globalxmin = xmin;
+	/* maintain state for GlobalVis* */
+	{
+		TransactionId def_vis_xid;
+		TransactionId def_vis_xid_data;
+		FullTransactionId def_vis_fxid;
+		FullTransactionId def_vis_fxid_data;
+		FullTransactionId oldestfxid;
 
-	/* Update global variables too */
-	RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
-	if (!TransactionIdIsNormal(RecentGlobalXmin))
-		RecentGlobalXmin = FirstNormalTransactionId;
+		/*
+		 * Converting oldestXid is only safe when xid horizon cannot advance,
+		 * i.e. holding locks. While we don't hold the lock anymore, all the
+		 * necessary data has been gathered with lock held.
+		 */
+		oldestfxid = FullXidRelativeTo(latest_completed, oldestxid);
 
-	/* Check whether there's a replication slot requiring an older xmin. */
-	if (TransactionIdIsValid(replication_slot_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
-		RecentGlobalXmin = replication_slot_xmin;
+		/* apply vacuum_defer_cleanup_age */
+		def_vis_xid_data =
+			TransactionIdRetreatedBy(xmin, vacuum_defer_cleanup_age);
 
-	/* Non-catalog tables can be vacuumed if older than this xid */
-	RecentGlobalDataXmin = RecentGlobalXmin;
+		/* Check whether there's a replication slot requiring an older xmin. */
+		def_vis_xid_data =
+			TransactionIdOlder(def_vis_xid_data, replication_slot_xmin);
 
-	/*
-	 * Check whether there's a replication slot requiring an older catalog
-	 * xmin.
-	 */
-	if (TransactionIdIsNormal(replication_slot_catalog_xmin) &&
-		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, RecentGlobalXmin))
-		RecentGlobalXmin = replication_slot_catalog_xmin;
+		/*
+		 * Rows in non-shared, non-catalog tables possibly could be vacuumed
+		 * if older than this xid.
+		 */
+		def_vis_xid = def_vis_xid_data;
+
+		/*
+		 * Check whether there's a replication slot requiring an older catalog
+		 * xmin.
+		 */
+		def_vis_xid =
+			TransactionIdOlder(replication_slot_catalog_xmin, def_vis_xid);
+
+		def_vis_fxid = FullXidRelativeTo(latest_completed, def_vis_xid);
+		def_vis_fxid_data = FullXidRelativeTo(latest_completed, def_vis_xid_data);
+
+		/*
+		 * Check if we can increase upper bound. As a previous
+		 * GlobalVisUpdate() might have computed more aggressive values, don't
+		 * overwrite them if so.
+		 */
+		GlobalVisSharedRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid,
+								   GlobalVisSharedRels.definitely_needed);
+		GlobalVisCatalogRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid,
+								   GlobalVisCatalogRels.definitely_needed);
+		GlobalVisDataRels.definitely_needed =
+			FullTransactionIdNewer(def_vis_fxid_data,
+								   GlobalVisDataRels.definitely_needed);
+
+		/*
+		 * Check if we know that we can initialize or increase the lower
+		 * bound. Currently the only cheap way to do so is to use
+		 * ShmemVariableCache->oldestXid as input.
+		 *
+		 * We should definitely be able to do better. We could e.g. put a
+		 * global lower bound value into ShmemVariableCache.
+		 */
+		GlobalVisSharedRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisSharedRels.maybe_needed,
+								   oldestfxid);
+		GlobalVisCatalogRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
+								   oldestfxid);
+		GlobalVisDataRels.maybe_needed =
+			FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
+								   oldestfxid);
+	}
 
 	RecentXmin = xmin;
 
@@ -3291,6 +3626,255 @@ DisplayXidCache(void)
 }
 #endif							/* XIDCACHE_DEBUG */
 
+/*
+ * If rel != NULL, return test state appropriate for relation, otherwise
+ * return state usable for all relations.  The latter may consider XIDs as
+ * not-yet-visible-to-everyone that a state for a specific relation would
+ * already consider visible-to-everyone.
+ *
+ * This needs to be called while a snapshot is active or registered, otherwise
+ * there are wraparound and other dangers.
+ *
+ * See comment for GlobalVisState for details.
+ */
+GlobalVisState *
+GlobalVisTestFor(Relation rel)
+{
+	bool		need_shared;
+	bool		need_catalog;
+	GlobalVisState *state;
+
+	/* XXX: we should assert that a snapshot is pushed or registered */
+	Assert(RecentXmin);
+
+	if (!rel)
+		need_shared = need_catalog = true;
+	else
+	{
+		/*
+		 * Other kinds currently don't contain xids, nor always the necessary
+		 * logical decoding markers.
+		 */
+		Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+			   rel->rd_rel->relkind == RELKIND_MATVIEW ||
+			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+
+		need_shared = rel->rd_rel->relisshared || RecoveryInProgress();
+		need_catalog = IsCatalogRelation(rel) || RelationIsAccessibleInLogicalDecoding(rel);
+	}
+
+	if (need_shared)
+		state = &GlobalVisSharedRels;
+	else if (need_catalog)
+		state = &GlobalVisCatalogRels;
+	else
+		state = &GlobalVisDataRels;
+
+	Assert(FullTransactionIdIsValid(state->definitely_needed) &&
+		   FullTransactionIdIsValid(state->maybe_needed));
+
+	return state;
+}
+
+/*
+ * Return true if it's worth updating the accurate maybe_needed boundary.
+ *
+ * As it is somewhat expensive to determine xmin horizons, we don't want to
+ * repeatedly do so when there is a low likelihood of it being beneficial.
+ *
+ * The current heuristic is that we update only if RecentXmin has changed
+ * since the last update. If the oldest currently running transaction has not
+ * finished, it is unlikely that recomputing the horizon would be useful.
+ */
+static bool
+GlobalVisTestShouldUpdate(GlobalVisState *state)
+{
+	/* hasn't been updated yet */
+	if (!TransactionIdIsValid(ComputeXidHorizonsResultLastXmin))
+		return true;
+
+	/*
+	 * If the maybe_needed/definitely_needed boundaries are the same, it's
+	 * unlikely to be beneficial to refresh boundaries.
+	 */
+	if (FullTransactionIdFollowsOrEquals(state->maybe_needed,
+										 state->definitely_needed))
+		return false;
+
+	/* does the last snapshot built have a different xmin? */
+	return RecentXmin != ComputeXidHorizonsResultLastXmin;
+}
+
+static void
+GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
+{
+	GlobalVisSharedRels.maybe_needed =
+		FullXidRelativeTo(horizons->latest_completed,
+						   horizons->shared_oldest_nonremovable);
+	GlobalVisCatalogRels.maybe_needed =
+		FullXidRelativeTo(horizons->latest_completed,
+						   horizons->catalog_oldest_nonremovable);
+	GlobalVisDataRels.maybe_needed =
+		FullXidRelativeTo(horizons->latest_completed,
+						   horizons->data_oldest_nonremovable);
+
+	/*
+	 * In longer running transactions it's possible that transactions we
+	 * previously needed to treat as running aren't around anymore. So update
+	 * definitely_needed to not be earlier than maybe_needed.
+	 */
+	GlobalVisSharedRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisSharedRels.maybe_needed,
+							   GlobalVisSharedRels.definitely_needed);
+	GlobalVisCatalogRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
+							   GlobalVisCatalogRels.definitely_needed);
+	GlobalVisDataRels.definitely_needed =
+		FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
+							   GlobalVisDataRels.definitely_needed);
+
+	ComputeXidHorizonsResultLastXmin = RecentXmin;
+}
+
+/*
+ * Update boundaries in GlobalVis{Shared,Catalog,Data}Rels
+ * using ComputeXidHorizons().
+ */
+static void
+GlobalVisUpdate(void)
+{
+	ComputeXidHorizonsResult horizons;
+
+	/* updates the horizons as a side-effect */
+	ComputeXidHorizons(&horizons);
+}
+
+/*
+ * Return true if no snapshot still considers fxid to be running.
+ *
+ * The state passed needs to have been initialized for the relation fxid is
+ * from (NULL is also OK), otherwise the result may not be correct.
+ *
+ * See comment for GlobalVisState for details.
+ */
+bool
+GlobalVisTestIsRemovableFullXid(GlobalVisState *state,
+								FullTransactionId fxid)
+{
+	/*
+	 * If fxid is older than maybe_needed bound, it definitely is visible to
+	 * everyone.
+	 */
+	if (FullTransactionIdPrecedes(fxid, state->maybe_needed))
+		return true;
+
+	/*
+	 * If fxid is >= definitely_needed bound, it is very likely to still be
+	 * considered running.
+	 */
+	if (FullTransactionIdFollowsOrEquals(fxid, state->definitely_needed))
+		return false;
+
+	/*
+	 * fxid is between maybe_needed and definitely_needed, i.e. there might or
+	 * might not exist a snapshot considering fxid running. If it makes sense,
+	 * update boundaries and recheck.
+	 */
+	if (GlobalVisTestShouldUpdate(state))
+	{
+		GlobalVisUpdate();
+
+		Assert(FullTransactionIdPrecedes(fxid, state->definitely_needed));
+
+		return FullTransactionIdPrecedes(fxid, state->maybe_needed);
+	}
+	else
+		return false;
+}
+
+/*
+ * Wrapper around GlobalVisTestIsRemovableFullXid() for 32bit xids.
+ *
+ * It is crucial that this only gets called for xids from a source that
+ * protects against xid wraparounds (e.g. from a table and thus protected by
+ * relfrozenxid).
+ */
+bool
+GlobalVisTestIsRemovableXid(GlobalVisState *state, TransactionId xid)
+{
+	FullTransactionId fxid;
+
+	/*
+	 * Convert 32 bit argument to FullTransactionId. We can do so safely
+	 * because we know the xid has to, at the very least, be between
+	 * [oldestXid, nextFullXid), i.e. within 2 billion of xid. To avoid taking
+	 * a lock to determine either, we can just compare with
+	 * state->definitely_needed, which was based on those values at the time
+	 * the current snapshot was built.
+	 */
+	fxid = FullXidRelativeTo(state->definitely_needed, xid);
+
+	return GlobalVisTestIsRemovableFullXid(state, fxid);
+}
+
+/*
+ * Return FullTransactionId below which all transactions are not considered
+ * running anymore.
+ *
+ * Note: This is less efficient than testing with
+ * GlobalVisTestIsRemovableFullXid as it likely requires building an accurate
+ * cutoff, even in the case all the XIDs compared with the cutoff are outside
+ * [maybe_needed, definitely_needed).
+ */
+FullTransactionId
+GlobalVisTestNonRemovableFullHorizon(GlobalVisState *state)
+{
+	/* acquire accurate horizon if not already done */
+	if (GlobalVisTestShouldUpdate(state))
+		GlobalVisUpdate();
+
+	return state->maybe_needed;
+}
+
+/* Convenience wrapper around GlobalVisTestNonRemovableFullHorizon */
+TransactionId
+GlobalVisTestNonRemovableHorizon(GlobalVisState *state)
+{
+	FullTransactionId cutoff;
+
+	cutoff = GlobalVisTestNonRemovableFullHorizon(state);
+
+	return XidFromFullTransactionId(cutoff);
+}
+
+/*
+ * Convenience wrapper around GlobalVisTestFor() and
+ * GlobalVisTestIsRemovableFullXid(), see their comments.
+ */
+bool
+GlobalVisIsRemovableFullXid(Relation rel, FullTransactionId fxid)
+{
+	GlobalVisState *state;
+
+	state = GlobalVisTestFor(rel);
+
+	return GlobalVisTestIsRemovableFullXid(state, fxid);
+}
+
+/*
+ * Convenience wrapper around GlobalVisTestFor() and
+ * GlobalVisTestIsRemovableXid(), see their comments.
+ */
+bool
+GlobalVisCheckRemovableXid(Relation rel, TransactionId xid)
+{
+	GlobalVisState *state;
+
+	state = GlobalVisTestFor(rel);
+
+	return GlobalVisTestIsRemovableXid(state, xid);
+}
+
 /*
  * Convert a 32 bit transaction id into 64 bit transaction id, by assuming it
  * is within MaxTransactionId / 2 of XidFromFullTransactionId(rel).
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 53d974125fd..00c7afc66fc 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5786,14 +5786,15 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * recent); that case motivates not using SnapshotAny here.
 	 *
 	 * A crucial point here is that SnapshotNonVacuumable, with
-	 * RecentGlobalXmin as horizon, yields the inverse of the condition that
-	 * the indexscan will use to decide that index entries are killable (see
-	 * heap_hot_search_buffer()).  Therefore, if the snapshot rejects a tuple
-	 * (or more precisely, all tuples of a HOT chain) and we have to continue
-	 * scanning past it, we know that the indexscan will mark that index entry
-	 * killed.  That means that the next get_actual_variable_endpoint() call
-	 * will not have to re-consider that index entry.  In this way we avoid
-	 * repetitive work when this function is used a lot during planning.
+	 * GlobalVisTestFor(heapRel) as horizon, yields the inverse of the
+	 * condition that the indexscan will use to decide that index entries are
+	 * killable (see heap_hot_search_buffer()).  Therefore, if the snapshot
+	 * rejects a tuple (or more precisely, all tuples of a HOT chain) and we
+	 * have to continue scanning past it, we know that the indexscan will mark
+	 * that index entry killed.  That means that the next
+	 * get_actual_variable_endpoint() call will not have to re-consider that
+	 * index entry.  In this way we avoid repetitive work when this function
+	 * is used a lot during planning.
 	 *
 	 * But using SnapshotNonVacuumable creates a hazard of its own.  In a
 	 * recently-created index, some index entries may point at "broken" HOT
@@ -5805,7 +5806,8 @@ get_actual_variable_endpoint(Relation heapRel,
 	 * or could even be NULL.  We avoid this hazard because we take the data
 	 * from the index entry not the heap.
 	 */
-	InitNonVacuumableSnapshot(SnapshotNonVacuumable, RecentGlobalXmin);
+	InitNonVacuumableSnapshot(SnapshotNonVacuumable,
+							  GlobalVisTestFor(heapRel));
 
 	index_scan = index_beginscan(heapRel, indexRel,
 								 &SnapshotNonVacuumable,
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d5..893be2f3ddb 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -722,6 +722,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	 * is critical for anything that reads heap pages, because HOT may decide
 	 * to prune them even if the process doesn't attempt to modify any
 	 * tuples.)
+	 *
+	 * FIXME: This comment is inaccurate / the code buggy. A snapshot that is
+	 * not pushed/active does not reliably prevent HOT pruning (->xmin could
+	 * e.g. be cleared when cache invalidations are processed).
 	 */
 	if (!bootstrap)
 	{
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6b6c8571e23..76578868cf9 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -157,16 +157,9 @@ static Snapshot HistoricSnapshot = NULL;
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
- *
- * RecentGlobalXmin and RecentGlobalDataXmin are initialized to
- * InvalidTransactionId, to ensure that no one tries to use a stale
- * value. Readers should ensure that it has been set to something else
- * before using it.
  */
 TransactionId TransactionXmin = FirstNormalTransactionId;
 TransactionId RecentXmin = FirstNormalTransactionId;
-TransactionId RecentGlobalXmin = InvalidTransactionId;
-TransactionId RecentGlobalDataXmin = InvalidTransactionId;
 
 /* (table, ctid) => (cmin, cmax) mapping during timetravel */
 static HTAB *tuplecid_data = NULL;
@@ -581,9 +574,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	 * Even though we are not going to use the snapshot it computes, we must
 	 * call GetSnapshotData, for two reasons: (1) to be sure that
 	 * CurrentSnapshotData's XID arrays have been allocated, and (2) to update
-	 * RecentXmin and RecentGlobalXmin.  (We could alternatively include those
-	 * two variables in exported snapshot files, but it seems better to have
-	 * snapshot importers compute reasonably up-to-date values for them.)
+	 * the state for GlobalVis*.
 	 */
 	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
 
@@ -956,36 +947,6 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 		return 0;
 }
 
-/*
- * Get current RecentGlobalXmin value, as a FullTransactionId.
- */
-FullTransactionId
-GetFullRecentGlobalXmin(void)
-{
-	FullTransactionId nextxid_full;
-	uint32		nextxid_epoch;
-	TransactionId nextxid_xid;
-	uint32		epoch;
-
-	Assert(TransactionIdIsNormal(RecentGlobalXmin));
-
-	/*
-	 * Compute the epoch from the next XID's epoch. This relies on the fact
-	 * that RecentGlobalXmin must be within the 2 billion XID horizon from the
-	 * next XID.
-	 */
-	nextxid_full = ReadNextFullTransactionId();
-	nextxid_epoch = EpochFromFullTransactionId(nextxid_full);
-	nextxid_xid = XidFromFullTransactionId(nextxid_full);
-
-	if (RecentGlobalXmin > nextxid_xid)
-		epoch = nextxid_epoch - 1;
-	else
-		epoch = nextxid_epoch;
-
-	return FullTransactionIdFromEpochAndXid(epoch, RecentGlobalXmin);
-}
-
 /*
  * SnapshotResetXmin
  *
@@ -1753,106 +1714,157 @@ GetOldSnapshotThresholdTimestamp(void)
 	return threshold_timestamp;
 }
 
-static void
+void
 SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 {
 	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
+	Assert(oldSnapshotControl->threshold_timestamp <= ts);
+	Assert(TransactionIdPrecedesOrEquals(oldSnapshotControl->threshold_xid, xlimit));
 	oldSnapshotControl->threshold_timestamp = ts;
 	oldSnapshotControl->threshold_xid = xlimit;
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+/*
+ * XXX: Magic to make old_snapshot_threshold tests appear "working". They
+ * currently are broken, and discussion of what to do about them is
+ * ongoing. See
+ * https://www.postgresql.org/message-id/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de
+ */
+void
+SnapshotTooOldMagicForTest(void)
+{
+	TimestampTz ts = GetSnapshotCurrentTimestamp();
+
+	Assert(old_snapshot_threshold == 0);
+
+	ts -= 5 * USECS_PER_SEC;
+
+	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
+	oldSnapshotControl->threshold_timestamp = ts;
+	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
+}
+
+/*
+ * If there is a valid mapping for the timestamp, set *xlimitp to
+ * that. Returns whether there is such a mapping.
+ */
+static bool
+GetOldSnapshotFromTimeMapping(TimestampTz ts, TransactionId *xlimitp)
+{
+	bool in_mapping = false;
+
+	Assert(ts == AlignTimestampToMinuteBoundary(ts));
+
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+	if (oldSnapshotControl->count_used > 0
+		&& ts >= oldSnapshotControl->head_timestamp)
+	{
+		int			offset;
+
+		offset = ((ts - oldSnapshotControl->head_timestamp)
+				  / USECS_PER_MINUTE);
+		if (offset > oldSnapshotControl->count_used - 1)
+			offset = oldSnapshotControl->count_used - 1;
+		offset = (oldSnapshotControl->head_offset + offset)
+			% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+		*xlimitp = oldSnapshotControl->xid_by_minute[offset];
+
+		in_mapping = true;
+	}
+
+	LWLockRelease(OldSnapshotTimeMapLock);
+
+	return in_mapping;
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
- * Apply old snapshot limit, if any.  This is intended to be called for page
- * pruning and table vacuuming, to allow old_snapshot_threshold to override
- * the normal global xmin value.  Actual testing for snapshot too old will be
- * based on whether a snapshot timestamp is prior to the threshold timestamp
- * set in this function.
+ * Apply old snapshot limit.  This is intended to be called for page pruning
+ * and table vacuuming, to allow old_snapshot_threshold to override the normal
+ * global xmin value.  Actual testing for snapshot too old will be based on
+ * whether a snapshot timestamp is prior to the threshold timestamp set in
+ * this function.
+ *
+ * If the limited horizon allows a cleanup action that otherwise would not be
+ * possible, SetOldSnapshotThresholdTimestamp(*limit_ts, *limit_xid) needs to
+ * be called before that cleanup action.
  */
-TransactionId
+bool
 TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-									Relation relation)
+									Relation relation,
+									TransactionId *limit_xid,
+									TimestampTz *limit_ts)
 {
-	if (TransactionIdIsNormal(recentXmin)
-		&& old_snapshot_threshold >= 0
-		&& RelationAllowsEarlyPruning(relation))
+	TimestampTz ts;
+	TransactionId xlimit = recentXmin;
+	TransactionId latest_xmin;
+	TimestampTz next_map_update_ts;
+	TimestampTz threshold_timestamp;
+	TransactionId threshold_xid;
+
+	Assert(TransactionIdIsNormal(recentXmin));
+	Assert(OldSnapshotThresholdActive());
+	Assert(limit_ts != NULL && limit_xid != NULL);
+
+	if (!RelationAllowsEarlyPruning(relation))
+		return false;
+
+	ts = GetSnapshotCurrentTimestamp();
+
+	SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
+	latest_xmin = oldSnapshotControl->latest_xmin;
+	next_map_update_ts = oldSnapshotControl->next_map_update;
+	SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
+
+	/*
+	 * Zero threshold always overrides to latest xmin, if valid.  Without
+	 * some heuristic it will find its own snapshot too old on, for
+	 * example, a simple UPDATE -- which would make it useless for most
+	 * testing, but there is no principled way to ensure that it doesn't
+	 * fail in this way.  Use a five-second delay to try to get useful
+	 * testing behavior, but this may need adjustment.
+	 */
+	if (old_snapshot_threshold == 0)
 	{
-		TimestampTz ts = GetSnapshotCurrentTimestamp();
-		TransactionId xlimit = recentXmin;
-		TransactionId latest_xmin;
-		TimestampTz update_ts;
-		bool		same_ts_as_threshold = false;
-
-		SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
-		latest_xmin = oldSnapshotControl->latest_xmin;
-		update_ts = oldSnapshotControl->next_map_update;
-		SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
-
-		/*
-		 * Zero threshold always overrides to latest xmin, if valid.  Without
-		 * some heuristic it will find its own snapshot too old on, for
-		 * example, a simple UPDATE -- which would make it useless for most
-		 * testing, but there is no principled way to ensure that it doesn't
-		 * fail in this way.  Use a five-second delay to try to get useful
-		 * testing behavior, but this may need adjustment.
-		 */
-		if (old_snapshot_threshold == 0)
-		{
-			if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
-				&& TransactionIdFollows(latest_xmin, xlimit))
-				xlimit = latest_xmin;
-
-			ts -= 5 * USECS_PER_SEC;
-			SetOldSnapshotThresholdTimestamp(ts, xlimit);
-
-			return xlimit;
-		}
+		if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
+			&& TransactionIdFollows(latest_xmin, xlimit))
+			xlimit = latest_xmin;
 
+		ts -= 5 * USECS_PER_SEC;
+	}
+	else
+	{
 		ts = AlignTimestampToMinuteBoundary(ts)
 			- (old_snapshot_threshold * USECS_PER_MINUTE);
 
 		/* Check for fast exit without LW locking. */
 		SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
-		if (ts == oldSnapshotControl->threshold_timestamp)
-		{
-			xlimit = oldSnapshotControl->threshold_xid;
-			same_ts_as_threshold = true;
-		}
+		threshold_timestamp = oldSnapshotControl->threshold_timestamp;
+		threshold_xid = oldSnapshotControl->threshold_xid;
 		SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 
-		if (!same_ts_as_threshold)
+		if (ts == threshold_timestamp)
+		{
+			/*
+			 * Current timestamp is in same bucket as the last limit that
+			 * was applied. Reuse.
+			 */
+			xlimit = threshold_xid;
+		}
+		else if (ts == next_map_update_ts)
+		{
+			/*
+			 * FIXME: This branch is super iffy - but that should probably be
+			 * fixed separately.
+			 */
+			xlimit = latest_xmin;
+		}
+		else if (GetOldSnapshotFromTimeMapping(ts, &xlimit))
 		{
-			if (ts == update_ts)
-			{
-				xlimit = latest_xmin;
-				if (NormalTransactionIdFollows(xlimit, recentXmin))
-					SetOldSnapshotThresholdTimestamp(ts, xlimit);
-			}
-			else
-			{
-				LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
-
-				if (oldSnapshotControl->count_used > 0
-					&& ts >= oldSnapshotControl->head_timestamp)
-				{
-					int			offset;
-
-					offset = ((ts - oldSnapshotControl->head_timestamp)
-							  / USECS_PER_MINUTE);
-					if (offset > oldSnapshotControl->count_used - 1)
-						offset = oldSnapshotControl->count_used - 1;
-					offset = (oldSnapshotControl->head_offset + offset)
-						% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
-					xlimit = oldSnapshotControl->xid_by_minute[offset];
-
-					if (NormalTransactionIdFollows(xlimit, recentXmin))
-						SetOldSnapshotThresholdTimestamp(ts, xlimit);
-				}
-
-				LWLockRelease(OldSnapshotTimeMapLock);
-			}
 		}
 
 		/*
@@ -1867,12 +1879,18 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 		if (TransactionIdIsNormal(latest_xmin)
 			&& TransactionIdPrecedes(latest_xmin, xlimit))
 			xlimit = latest_xmin;
-
-		if (NormalTransactionIdFollows(xlimit, recentXmin))
-			return xlimit;
 	}
 
-	return recentXmin;
+	if (TransactionIdIsValid(xlimit) &&
+		TransactionIdFollowsOrEquals(xlimit, recentXmin))
+	{
+		*limit_ts = ts;
+		*limit_xid = xlimit;
+
+		return true;
+	}
+
+	return false;
 }
 
 /*
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 635ece73b35..5f3de3c0b7f 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -434,10 +434,10 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 			 RelationGetRelationName(rel));
 
 	/*
-	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
-	 * RecentGlobalXmin/B-Tree page deletion.
+	 * This assertion matches the one in index_getnext_tid().  See page
+	 * recycling/"visible to everyone" notes in nbtree README.
 	 */
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
+	Assert(TransactionIdIsValid(RecentXmin));
 
 	/*
 	 * Initialize state for entire verification operation
@@ -1581,7 +1581,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * does not occur until no possible index scan could land on the page.
 	 * Index scans can follow links with nothing more than their snapshot as
 	 * an interlock and be sure of at least that much.  (See page
-	 * recycling/RecentGlobalXmin notes in nbtree README.)
+	 * recycling/"visible to everyone" notes in nbtree README.)
 	 *
 	 * Furthermore, it's okay if we follow a rightlink and find a half-dead or
 	 * dead (ignorable) page one or more times.  There will either be a
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index e731161734a..e8cdea7e283 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
 
-	if (all_visible)
-	{
-		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
-	}
-
 	rel = relation_open(relid, AccessShareLock);
 
 	/* Only some relkinds have a visibility map */
 	check_relation_relkind(rel);
 
+	if (all_visible)
+		OldestXmin = GetOldestNonRemovableTransactionId(rel);
+
 	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/*
@@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * From a concurrency point of view, it sort of sucks to
 				 * retake ProcArrayLock here while we're holding the buffer
 				 * exclusively locked, but it should be safe against
-				 * deadlocks, because surely GetOldestXmin() should never take
-				 * a buffer lock. And this shouldn't happen often, so it's
-				 * worth being careful so as to avoid false positives.
+				 * deadlocks, because surely GetOldestNonRemovableTransactionId()
+				 * should never take a buffer lock. And this shouldn't happen
+				 * often, so it's worth being careful so as to avoid false
+				 * positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+				RecomputedOldestXmin = GetOldestNonRemovableTransactionId(rel);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index dbc0fa11f61..3a99333d443 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
 
-	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(rel);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7eaaad1e140..b4948ac675f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -395,6 +395,7 @@ CompositeTypeStmt
 CompoundAffixFlag
 CompressionAlgorithm
 CompressorState
+ComputeXidHorizonsResult
 ConditionVariable
 ConditionalStack
 ConfigData
@@ -930,6 +931,7 @@ GistSplitVector
 GistTsVectorOptions
 GistVacState
 GlobalTransaction
+GlobalVisState
 GrantRoleStmt
 GrantStmt
 GrantTargetType
-- 
2.25.0.114.g5b0ca878e0

v13-0002-snapshot-scalability-Move-PGXACT-xmin-back-to-PG.patch (text/x-diff; charset=us-ascii)
From 50ce3971f3ae590b5a7dfdf3eccdccadfd2d96c2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v13 2/6] snapshot scalability: Move PGXACT->xmin back to
 PGPROC.

Now that xmin isn't needed for GetSnapshotData() anymore, it leads to
unnecessary cacheline ping-pong to have it in PGXACT as it is updated
more frequently than the other PGXACT members.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h                  | 10 +++---
 src/backend/access/gist/gistxlog.c          |  2 +-
 src/backend/access/nbtree/nbtpage.c         |  2 +-
 src/backend/access/transam/README           |  2 +-
 src/backend/access/transam/twophase.c       |  2 +-
 src/backend/commands/indexcmds.c            |  2 +-
 src/backend/replication/logical/snapbuild.c |  6 ++--
 src/backend/replication/walsender.c         | 10 +++---
 src/backend/storage/ipc/procarray.c         | 36 +++++++++------------
 src/backend/storage/ipc/sinvaladt.c         |  2 +-
 src/backend/storage/lmgr/proc.c             |  4 +--
 src/backend/utils/time/snapmgr.c            | 28 ++++++++--------
 12 files changed, 51 insertions(+), 55 deletions(-)
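
A rough way to see why a frequently written xmin hurts while it sits in
PGXACT is to count how many per-backend entries share one cache line. The
sketch below is only an illustration: the struct is a hypothetical stand-in
that approximates the pre-patch PGXACT layout, and the 64-byte line size is
an assumption about the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; real hardware may differ. */
#define SKETCH_CACHELINE_SIZE 64

/*
 * Hypothetical stand-in approximating PGXACT while it still carries xmin.
 * Not the real definition, only the field sizes matter here.
 */
typedef struct SketchXactWithXmin
{
	uint32_t	xid;
	uint32_t	xmin;			/* written several times per transaction */
	uint8_t		vacuumFlags;
	uint8_t		overflowed;
	uint8_t		nxids;
} SketchXactWithXmin;

/* Number of whole entries of the given size that fit in one cache line. */
size_t
sketch_entries_per_cacheline(size_t entry_size)
{
	assert(entry_size > 0);
	return SKETCH_CACHELINE_SIZE / entry_size;
}
```

On common ABIs the struct above is typically 12 bytes, so roughly five
backends' entries land in the same line. Each xmin update by one of those
backends then forces every CPU scanning the neighbouring entries to refetch
the line, which is the ping-pong this patch avoids by keeping xmin in PGPROC.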

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 52ff43cabaa..5e4b028a5f9 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -101,6 +101,11 @@ struct PGPROC
 
 	Latch		procLatch;		/* generic latch for process */
 
+	TransactionId xmin;			/* minimal running XID as it was when we were
+								 * starting our xact, excluding LAZY VACUUM:
+								 * vacuum must not remove tuples deleted by
+								 * xid >= xmin ! */
+
 	LocalTransactionId lxid;	/* local id of top-level transaction currently
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
@@ -223,11 +228,6 @@ typedef struct PGXACT
 								 * executed by this proc, if running and XID
 								 * is assigned; else InvalidTransactionId */
 
-	TransactionId xmin;			/* minimal running XID as it was when we were
-								 * starting our xact, excluding LAZY VACUUM:
-								 * vacuum must not remove tuples deleted by
-								 * xid >= xmin ! */
-
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index a63b05388c5..dcd28f678b3 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -389,7 +389,7 @@ gistRedoPageReuse(XLogReaderState *record)
 	 *
 	 * latestRemovedXid was the page's deleteXid.  The
 	 * GlobalVisIsRemovableFullXid(deleteXid) test in gistPageRecyclable()
-	 * conceptually mirrors the pgxact->xmin > limitXmin test in
+	 * conceptually mirrors the PGPROC->xmin > limitXmin test in
 	 * GetConflictingVirtualXIDs().  Consequently, one XID value achieves the
 	 * same exclusion effect on primary and standby.
 	 */
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 74be3807bb7..7f392480ac0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2317,7 +2317,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * we're in VACUUM and would not otherwise have an XID.  Having already
 	 * updated links to the target, ReadNewTransactionId() suffices as an
 	 * upper bound.  Any scan having retained a now-stale link is advertising
-	 * in its PGXACT an xmin less than or equal to the value we read here.  It
+	 * in its PGPROC an xmin less than or equal to the value we read here.  It
 	 * will continue to do so, holding back the xmin horizon, for the duration
 	 * of that scan.
 	 */
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index fffe0783295..c15b5540a09 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -331,7 +331,7 @@ necessary.
 Note that while it is certain that two concurrent executions of
 GetSnapshotData will compute the same xmin for their own snapshots, there is
 no such guarantee for the horizons computed by ComputeXidHorizons.  This is
-because we allow XID-less transactions to clear their MyPgXact->xmin
+because we allow XID-less transactions to clear their MyProc->xmin
 asynchronously (without taking ProcArrayLock), so one execution might see
 what had been the oldest xmin, and another not.  This is OK since the
 thresholds need only be a valid lower bound.  As noted above, we are already
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 31f135f5ced..eb5f4680a3d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -464,7 +464,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
 	proc->lxid = (LocalTransactionId) xid;
 	pgxact->xid = xid;
-	pgxact->xmin = InvalidTransactionId;
+	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
 	proc->pid = 0;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7819266a630..254dbcdce52 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1535,7 +1535,7 @@ DefineIndex(Oid relationId,
 	StartTransactionCommand();
 
 	/* We should now definitely not be advertising any xmin. */
-	Assert(MyPgXact->xmin == InvalidTransactionId);
+	Assert(MyProc->xmin == InvalidTransactionId);
 
 	/*
 	 * The index is now valid in the sense that it contains all currently
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 3089f0d5ddc..e9701ea7221 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -553,8 +553,8 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 		elog(ERROR, "cannot build an initial slot snapshot, not all transactions are monitored anymore");
 
 	/* so we don't overwrite the existing value */
-	if (TransactionIdIsValid(MyPgXact->xmin))
-		elog(ERROR, "cannot build an initial slot snapshot when MyPgXact->xmin already is valid");
+	if (TransactionIdIsValid(MyProc->xmin))
+		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
 
 	snap = SnapBuildBuildSnapshot(builder);
 
@@ -575,7 +575,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	}
 #endif
 
-	MyPgXact->xmin = snap->xmin;
+	MyProc->xmin = snap->xmin;
 
 	/* allocate in transaction context */
 	newxip = (TransactionId *)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 460ca3f947f..3f756b470af 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1964,7 +1964,7 @@ PhysicalReplicationSlotNewXmin(TransactionId feedbackXmin, TransactionId feedbac
 	ReplicationSlot *slot = MyReplicationSlot;
 
 	SpinLockAcquire(&slot->mutex);
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 
 	/*
 	 * For physical replication we don't need the interlock provided by xmin
@@ -2093,7 +2093,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	if (!TransactionIdIsNormal(feedbackXmin)
 		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
-		MyPgXact->xmin = InvalidTransactionId;
+		MyProc->xmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
 			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
@@ -2135,7 +2135,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	 * risk already since a VACUUM could already have determined the horizon.)
 	 *
 	 * If we're using a replication slot we reserve the xmin via that,
-	 * otherwise via the walsender's PGXACT entry. We can only track the
+	 * otherwise via the walsender's PGPROC entry. We can only track the
 	 * catalog xmin separately when using a slot, so we store the least of the
 	 * two provided when not using a slot.
 	 *
@@ -2148,9 +2148,9 @@ ProcessStandbyHSFeedbackMessage(void)
 	{
 		if (TransactionIdIsNormal(feedbackCatalogXmin)
 			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
-			MyPgXact->xmin = feedbackCatalogXmin;
+			MyProc->xmin = feedbackCatalogXmin;
 		else
-			MyPgXact->xmin = feedbackXmin;
+			MyProc->xmin = feedbackXmin;
 	}
 }
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 360e6e9da07..a016816ae86 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -587,9 +587,9 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
 		proc->lxid = InvalidLocalTransactionId;
-		pgxact->xmin = InvalidTransactionId;
 		/* must be cleared with xid/xmin: */
 		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false;	/* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
@@ -609,9 +609,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 {
 	pgxact->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	pgxact->xmin = InvalidTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+	proc->xmin = InvalidTransactionId;
 	proc->delayChkpt = false;	/* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
@@ -763,7 +763,7 @@ ProcArrayClearTransaction(PGPROC *proc)
 	 */
 	pgxact->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	pgxact->xmin = InvalidTransactionId;
+	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
 	/* redundant, but just in case */
@@ -1563,7 +1563,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 
 		/* Fetch xid just once - see GetNewTransactionId */
 		xid = UINT32_ACCESS_ONCE(pgxact->xid);
-		xmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*
 		 * Consider both the transaction's Xmin, and its Xid.
@@ -1838,7 +1838,7 @@ GetMaxSnapshotSubxidCount(void)
  *
  * We also update the following backend-global variables:
  *		TransactionXmin: the oldest xmin of any snapshot in use in the
- *			current transaction (this is the same as MyPgXact->xmin).
+ *			current transaction (this is the same as MyProc->xmin).
  *		RecentXmin: the xmin computed for the most recent snapshot.  XIDs
  *			older than this are known not running any more.
  *
@@ -1900,7 +1900,7 @@ GetSnapshotData(Snapshot snapshot)
 
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
-	 * going to set MyPgXact->xmin.
+	 * going to set MyProc->xmin.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -2052,8 +2052,8 @@ GetSnapshotData(Snapshot snapshot)
 	replication_slot_xmin = procArray->replication_slot_xmin;
 	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
-	if (!TransactionIdIsValid(MyPgXact->xmin))
-		MyPgXact->xmin = TransactionXmin = xmin;
+	if (!TransactionIdIsValid(MyProc->xmin))
+		MyProc->xmin = TransactionXmin = xmin;
 
 	LWLockRelease(ProcArrayLock);
 
@@ -2173,7 +2173,7 @@ GetSnapshotData(Snapshot snapshot)
 }
 
 /*
- * ProcArrayInstallImportedXmin -- install imported xmin into MyPgXact->xmin
+ * ProcArrayInstallImportedXmin -- install imported xmin into MyProc->xmin
  *
  * This is called when installing a snapshot imported from another
  * transaction.  To ensure that OldestXmin doesn't go backwards, we must
@@ -2226,7 +2226,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 		/*
 		 * Likewise, let's just make real sure its xmin does cover us.
 		 */
-		xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+		xid = UINT32_ACCESS_ONCE(proc->xmin);
 		if (!TransactionIdIsNormal(xid) ||
 			!TransactionIdPrecedesOrEquals(xid, xmin))
 			continue;
@@ -2237,7 +2237,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 		 * GetSnapshotData first, we'll be overwriting a valid xmin here, so
 		 * we don't check that.)
 		 */
-		MyPgXact->xmin = TransactionXmin = xmin;
+		MyProc->xmin = TransactionXmin = xmin;
 
 		result = true;
 		break;
@@ -2249,7 +2249,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 }
 
 /*
- * ProcArrayInstallRestoredXmin -- install restored xmin into MyPgXact->xmin
+ * ProcArrayInstallRestoredXmin -- install restored xmin into MyProc->xmin
  *
  * This is like ProcArrayInstallImportedXmin, but we have a pointer to the
  * PGPROC of the transaction from which we imported the snapshot, rather than
@@ -2262,7 +2262,6 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 {
 	bool		result = false;
 	TransactionId xid;
-	PGXACT	   *pgxact;
 
 	Assert(TransactionIdIsNormal(xmin));
 	Assert(proc != NULL);
@@ -2270,20 +2269,18 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 	/* Get lock so source xact can't end while we're doing this */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	pgxact = &allPgXact[proc->pgprocno];
-
 	/*
 	 * Be certain that the referenced PGPROC has an advertised xmin which is
 	 * no later than the one we're installing, so that the system-wide xmin
 	 * can't go backwards.  Also, make sure it's running in the same database,
 	 * so that the per-database xmin cannot go backwards.
 	 */
-	xid = UINT32_ACCESS_ONCE(pgxact->xmin);
+	xid = UINT32_ACCESS_ONCE(proc->xmin);
 	if (proc->databaseId == MyDatabaseId &&
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		MyPgXact->xmin = TransactionXmin = xmin;
+		MyProc->xmin = TransactionXmin = xmin;
 		result = true;
 	}
 
@@ -2909,7 +2906,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 		if (allDbs || proc->databaseId == MyDatabaseId)
 		{
 			/* Fetch xmin just once - might change on us */
-			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 			if (excludeXmin0 && !TransactionIdIsValid(pxmin))
 				continue;
@@ -2995,7 +2992,6 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
 		/* Exclude prepared transactions */
 		if (proc->pid == 0)
@@ -3005,7 +3001,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 			proc->databaseId == dbOid)
 		{
 			/* Fetch xmin just once - can't change on us, but good coding */
-			TransactionId pxmin = UINT32_ACCESS_ONCE(pgxact->xmin);
+			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 			/*
 			 * We ignore an invalid pxmin because this means that backend has
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index e5c115b92f2..ad048bc85fa 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -420,7 +420,7 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 			PGXACT	   *xact = &ProcGlobal->allPgXact[proc->pgprocno];
 
 			*xid = xact->xid;
-			*xmin = xact->xmin;
+			*xmin = proc->xmin;
 		}
 	}
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e57fcd25388..de346cd87fc 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -388,7 +388,7 @@ InitProcess(void)
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyPgXact->xid = InvalidTransactionId;
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
 	MyProc->backendId = InvalidBackendId;
@@ -572,7 +572,7 @@ InitAuxiliaryProcess(void)
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyPgXact->xid = InvalidTransactionId;
-	MyPgXact->xmin = InvalidTransactionId;
+	MyProc->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
 	MyProc->roleId = InvalidOid;
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 76578868cf9..689a3b6a597 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -27,11 +27,11 @@
  * their lifetime is managed separately (as they live longer than one xact.c
  * transaction).
  *
- * These arrangements let us reset MyPgXact->xmin when there are no snapshots
+ * These arrangements let us reset MyProc->xmin when there are no snapshots
  * referenced by this transaction, and advance it when the one with oldest
  * Xmin is no longer referenced.  For simplicity however, only registered
  * snapshots not active snapshots participate in tracking which one is oldest;
- * we don't try to change MyPgXact->xmin except when the active-snapshot
+ * we don't try to change MyProc->xmin except when the active-snapshot
  * stack is empty.
  *
  *
@@ -187,7 +187,7 @@ static ActiveSnapshotElt *OldestActiveSnapshot = NULL;
 
 /*
  * Currently registered Snapshots.  Ordered in a heap by xmin, so that we can
- * quickly find the one with lowest xmin, to advance our MyPgXact->xmin.
+ * quickly find the one with lowest xmin, to advance our MyProc->xmin.
  */
 static int	xmin_cmp(const pairingheap_node *a, const pairingheap_node *b,
 					 void *arg);
@@ -475,7 +475,7 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 
 		/*
 		 * Make sure the catalog snapshot will be accounted for in decisions
-		 * about advancing PGXACT->xmin.  We could apply RegisterSnapshot, but
+		 * about advancing PGPROC->xmin.  We could apply RegisterSnapshot, but
 		 * that would result in making a physical copy, which is overkill; and
 		 * it would also create a dependency on some resource owner, which we
 		 * do not want for reasons explained at the head of this file. Instead
@@ -596,7 +596,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	/* NB: curcid should NOT be copied, it's a local matter */
 
 	/*
-	 * Now we have to fix what GetSnapshotData did with MyPgXact->xmin and
+	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
 	 * causing the global xmin to go backwards, we have to test that the
 	 * source transaction is still running, and that has to be done
@@ -950,13 +950,13 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 /*
  * SnapshotResetXmin
  *
- * If there are no more snapshots, we can reset our PGXACT->xmin to InvalidXid.
+ * If there are no more snapshots, we can reset our PGPROC->xmin to InvalidXid.
  * Note we can do this without locking because we assume that storing an Xid
  * is atomic.
  *
  * Even if there are some remaining snapshots, we may be able to advance our
- * PGXACT->xmin to some degree.  This typically happens when a portal is
- * dropped.  For efficiency, we only consider recomputing PGXACT->xmin when
+ * PGPROC->xmin to some degree.  This typically happens when a portal is
+ * dropped.  For efficiency, we only consider recomputing PGPROC->xmin when
  * the active snapshot stack is empty; this allows us not to need to track
  * which active snapshot is oldest.
  *
@@ -977,15 +977,15 @@ SnapshotResetXmin(void)
 
 	if (pairingheap_is_empty(&RegisteredSnapshots))
 	{
-		MyPgXact->xmin = InvalidTransactionId;
+		MyProc->xmin = InvalidTransactionId;
 		return;
 	}
 
 	minSnapshot = pairingheap_container(SnapshotData, ph_node,
 										pairingheap_first(&RegisteredSnapshots));
 
-	if (TransactionIdPrecedes(MyPgXact->xmin, minSnapshot->xmin))
-		MyPgXact->xmin = minSnapshot->xmin;
+	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
+		MyProc->xmin = minSnapshot->xmin;
 }
 
 /*
@@ -1132,13 +1132,13 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 
 	/*
 	 * During normal commit processing, we call ProcArrayEndTransaction() to
-	 * reset the MyPgXact->xmin. That call happens prior to the call to
+	 * reset the MyProc->xmin. That call happens prior to the call to
 	 * AtEOXact_Snapshot(), so we need not touch xmin here at all.
 	 */
 	if (resetXmin)
 		SnapshotResetXmin();
 
-	Assert(resetXmin || MyPgXact->xmin == 0);
+	Assert(resetXmin || MyProc->xmin == 0);
 }
 
 
@@ -1830,7 +1830,7 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 	 */
 	if (old_snapshot_threshold == 0)
 	{
-		if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
+		if (TransactionIdPrecedes(latest_xmin, MyProc->xmin)
 			&& TransactionIdFollows(latest_xmin, xlimit))
 			xlimit = latest_xmin;
 
-- 
2.25.0.114.g5b0ca878e0

v13-0003-snapshot-scalability-Introduce-dense-array-of-in.patch (text/x-diff; charset=us-ascii)
From d79f200a7f2686cb8e1dc7d3c53d3906435e9e78 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v13 3/6] snapshot scalability: Introduce dense array of
 in-progress xids.

The new array contains the xids for all connected backends / in-use
PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT
arrays which can have unused entries interspersed).

This improves performance because GetSnapshotData() always needs to
scan the xids of all live procarray entries and now there's no need to
go through the procArray->pgprocnos indirection anymore.

As the set of running top-level xids changes rarely, compared to the
number of snapshots taken, this substantially increases the likelihood
of most data required for a snapshot being in L2 cache.  In
read-mostly workloads scanning the xids[] array will be sufficient to
build a snapshot, as most backends will not have an xid assigned.

To keep the xid array dense, ProcArrayRemove() needs to move the entries
behind the to-be-removed proc's entry further up in the array. Obviously
moving array entries cannot happen while a backend sets its xid, i.e.
locking needs to prevent array entries from being moved while a backend
modifies its xid.

To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot
spot already - ProcArrayAdd() / ProcArrayRemove() now need to hold
XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray
entry is not a very frequent operation, even taking 2PC into account.

Due to the above, the dense array entries can only be read or modified
while holding ProcArrayLock and/or XidGenLock. This prevents a
concurrent ProcArrayRemove() from shifting the dense array while it is
accessed concurrently.

While the new dense array is very good when needing to look at all
xids, it is less suitable when accessing a single backend's xid. In
particular it would be problematic to have to acquire a lock to access
a backend's own xid. Therefore a backend's xid is not just stored in
the dense array, but also in PGPROC. This also allows a backend to
only access the shared xid value when the backend has acquired an
xid.

The infrastructure added in this commit will be used for the remaining
PGXACT fields in subsequent commits. They are kept separate to make
review easier.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h                  |  79 +++++-
 src/backend/access/heap/heapam_visibility.c |   8 +-
 src/backend/access/transam/README           |  33 +--
 src/backend/access/transam/clog.c           |   8 +-
 src/backend/access/transam/twophase.c       |  31 +--
 src/backend/access/transam/varsup.c         |  20 +-
 src/backend/commands/vacuum.c               |   2 +-
 src/backend/storage/ipc/procarray.c         | 282 +++++++++++++-------
 src/backend/storage/ipc/sinvaladt.c         |   4 +-
 src/backend/storage/lmgr/lock.c             |   3 +-
 src/backend/storage/lmgr/proc.c             |  26 +-
 11 files changed, 335 insertions(+), 161 deletions(-)
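
The mirroring scheme described in the commit message can be sketched in
isolation as follows. This is a simplified illustration, not the patch's
code: all sketch_* names, SketchProc, and the fixed MAX_PROCS bound are
hypothetical, new entries are simply appended (the real ProcArrayAdd() may
place them differently), and locking is omitted entirely. In the patch,
adding or removing an entry requires ProcArrayLock and XidGenLock.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define MAX_PROCS 8				/* hypothetical fixed bound for the sketch */

/* Hypothetical stand-in for PGPROC: only the fields relevant here. */
typedef struct SketchProc
{
	TransactionId xid;			/* own copy, mirrored in dense_xids[] */
	int			pgxactoff;		/* index into dense_xids[], -1 if unused */
} SketchProc;

SketchProc *dense_procs[MAX_PROCS];		/* owner of each dense slot */
TransactionId dense_xids[MAX_PROCS];	/* densely packed xids */
int			num_procs = 0;

/* Append a proc to the dense arrays (real code would hold both locks). */
void
sketch_add(SketchProc *proc)
{
	assert(num_procs < MAX_PROCS);
	proc->xid = InvalidTransactionId;
	proc->pgxactoff = num_procs;
	dense_procs[num_procs] = proc;
	dense_xids[num_procs] = InvalidTransactionId;
	num_procs++;
}

/* Remove a proc and shift later entries up so the array stays dense. */
void
sketch_remove(SketchProc *proc)
{
	int			i;

	assert(proc->pgxactoff >= 0 && proc->pgxactoff < num_procs);
	for (i = proc->pgxactoff; i < num_procs - 1; i++)
	{
		dense_procs[i] = dense_procs[i + 1];
		dense_xids[i] = dense_xids[i + 1];
		dense_procs[i]->pgxactoff = i;	/* keep the back-pointer in sync */
	}
	num_procs--;
	proc->pgxactoff = -1;
	proc->xid = InvalidTransactionId;
}

/* Assign an xid: both copies must be updated together. */
void
sketch_set_xid(SketchProc *proc, TransactionId xid)
{
	proc->xid = xid;
	dense_xids[proc->pgxactoff] = xid;
}

/* Snapshot-style scan: touches only the dense array, no indirection. */
int
sketch_collect_running(TransactionId *out)
{
	int			n = 0;
	int			i;

	for (i = 0; i < num_procs; i++)
	{
		if (dense_xids[i] != InvalidTransactionId)
			out[n++] = dense_xids[i];
	}
	return n;
}
```

The scan in sketch_collect_running() reads one contiguous array, which is
the access pattern GetSnapshotData() benefits from. The cost is visible in
sketch_remove(): every shifted entry gets a new pgxactoff, which is why a
backend must not index the dense array without a lock that excludes
concurrent removal.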

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5e4b028a5f9..146bca84bd6 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -89,6 +89,17 @@ typedef enum
  * distinguished from a real one at need by the fact that it has pid == 0.
  * The semaphore and lock-activity fields in a prepared-xact PGPROC are unused,
  * but its myProcLocks[] lists are valid.
+ *
+ * Mirrored fields:
+ *
+ * Some fields in PGPROC (see "mirrored in ..." comment) are mirrored into an
+ * element of more densely packed ProcGlobal arrays. These arrays are indexed
+ * by PGPROC->pgxactoff. Both copies need to be maintained coherently.
+ *
+ * NB: The pgxactoff indexed value can *never* be accessed without holding
+ * locks.
+ *
+ * See PROC_HDR for details.
  */
 struct PGPROC
 {
@@ -101,6 +112,12 @@ struct PGPROC
 
 	Latch		procLatch;		/* generic latch for process */
 
+
+	TransactionId xid;			/* id of top-level transaction currently being
+								 * executed by this proc, if running and XID
+								 * is assigned; else InvalidTransactionId.
+								 * mirrored in ProcGlobal->xids[pgxactoff] */
+
 	TransactionId xmin;			/* minimal running XID as it was when we were
 								 * starting our xact, excluding LAZY VACUUM:
 								 * vacuum must not remove tuples deleted by
@@ -110,6 +127,9 @@ struct PGPROC
 								 * being executed by this proc, if running;
 								 * else InvalidLocalTransactionId */
 	int			pid;			/* Backend's process ID; 0 if prepared xact */
+
+	int			pgxactoff;		/* offset into various ProcGlobal->arrays
+								 * with data mirrored from this PGPROC */
 	int			pgprocno;
 
 	/* These fields are zero while a backend is still starting up: */
@@ -224,10 +244,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	TransactionId xid;			/* id of top-level transaction currently being
-								 * executed by this proc, if running and XID
-								 * is assigned; else InvalidTransactionId */
-
 	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
@@ -236,6 +252,57 @@ typedef struct PGXACT
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
+ *
+ * Adding/Removing an entry into the procarray requires holding *both*
+ * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
+ * needed because the dense arrays (see below) are accessed from
+ * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
+ * further contention by both using the same lock. Adding/Removing a procarray
+ * entry is much less frequent.
+ *
+ * Some fields in PGPROC are mirrored into more densely packed arrays (like
+ * xids), with one entry for each backend. These arrays only contain entries
+ * for PGPROCs that have been added to the shared array with ProcArrayAdd()
+ * (in contrast to PGPROC array which has unused PGPROCs interspersed).
+ *
+ * The dense arrays are indexed by PGPROC->pgxactoff. Any concurrent
+ * ProcArrayAdd() / ProcArrayRemove() can cause the pgxactoff of a procarray
+ * member to change.  Therefore it is only safe to use PGPROC->pgxactoff to
+ * access the dense array while holding either ProcArrayLock or XidGenLock.
+ *
+ * As long as a PGPROC is in the procarray, the mirrored values need to be
+ * maintained in both places in a coherent manner.
+ *
+ * The denser separate arrays are beneficial for three main reasons: First, to
+ * allow for as tight loops accessing the data as possible. Second, to prevent
+ * updates of frequently changing data (e.g. xmin) from invalidating
+ * cachelines also containing less frequently changing data (e.g. xid,
+ * vacuumFlags). Third, to condense frequently accessed data into as few
+ * cachelines as possible.
+ *
+ * There are two main reasons to have the data mirrored between these dense
+ * arrays and PGPROC. First, as explained above, a PGPROC's array entries can
+ * only be accessed with either ProcArrayLock or XidGenLock held, whereas the
+ * PGPROC entries do not require that (obviously there may still be locking
+ * requirements around the individual field, separate from the concerns
+ * here). That is particularly important for a backend to efficiently check
+ * its own values, which it often can safely do without locking.  Second, the
+ * PGPROC fields help avoid unnecessary accesses and modifications to the
+ * dense arrays. A backend's own PGPROC is more likely to be in a local cache,
+ * whereas the cachelines for the dense array will be modified by other
+ * backends (often removing it from the cache for other cores/sockets). At
+ * commit/abort time a check of the PGPROC value can avoid accessing/dirtying
+ * the corresponding array value.
+ *
+ * Basically it makes sense to access the PGPROC variable when checking a
+ * single backend's data, especially when already looking at the PGPROC for
+ * other reasons.  It makes sense to look at the "dense" arrays if we
+ * need to look at many / most entries, because we then benefit from the
+ * reduced indirection and better cross-process cache-ability.
+ *
+ * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), the data
+ * in the dense arrays is initialized from the PGPROC while it already holds
+ * ProcArrayLock.
  */
 typedef struct PROC_HDR
 {
@@ -243,6 +310,10 @@ typedef struct PROC_HDR
 	PGPROC	   *allProcs;
 	/* Array of PGXACT structures (not including dummies for prepared txns) */
 	PGXACT	   *allPgXact;
+
+	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
+	TransactionId *xids;
+
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index f117ee160a3..6dec0c8311b 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -11,12 +11,12 @@
  * shared buffer content lock on the buffer containing the tuple.
  *
  * NOTE: When using a non-MVCC snapshot, we must check
- * TransactionIdIsInProgress (which looks in the PGXACT array)
+ * TransactionIdIsInProgress (which looks in the PGPROC array)
  * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
  * pg_xact).  Otherwise we have a race condition: we might decide that a
  * just-committed transaction crashed, because none of the tests succeed.
  * xact.c is careful to record commit/abort in pg_xact before it unsets
- * MyPgXact->xid in the PGXACT array.  That fixes that problem, but it
+ * MyProc->xid in the PGPROC array.  That fixes that problem, but it
  * also means there is a window where TransactionIdIsInProgress and
  * TransactionIdDidCommit will both return true.  If we check only
  * TransactionIdDidCommit, we could consider a tuple committed when a
@@ -956,7 +956,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
  * coding where we tried to set the hint bits as soon as possible, we instead
  * did TransactionIdIsInProgress in each call --- to no avail, as long as the
  * inserting/deleting transaction was still running --- which was more cycles
- * and more contention on the PGXACT array.
+ * and more contention on ProcArrayLock.
  */
 static bool
 HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
@@ -1445,7 +1445,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
  *	HeapTupleSatisfiesMVCC) and, therefore, any hint bits that can be set
  *	should already be set.  We assume that if no hint bits are set, the xmin
  *	or xmax transaction is still running.  This is therefore faster than
- *	HeapTupleSatisfiesVacuum, because we don't consult PGXACT nor CLOG.
+ *	HeapTupleSatisfiesVacuum, because we consult neither procarray nor CLOG.
  *	It's okay to return false when in doubt, but we must return true only
  *	if the tuple is removable.
  */
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index c15b5540a09..2d7979011ce 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -251,10 +251,10 @@ enforce, and it assists with some other issues as explained below.)  The
 implementation of this is that GetSnapshotData takes the ProcArrayLock in
 shared mode (so that multiple backends can take snapshots in parallel),
 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
-while clearing MyPgXact->xid at transaction end (either commit or abort).
-(To reduce context switching, when multiple transactions commit nearly
-simultaneously, we have one backend take ProcArrayLock and clear the XIDs
-of multiple processes at once.)
+while clearing the ProcGlobal->xids[] entry at transaction end (either
+commit or abort). (To reduce context switching, when multiple transactions
+commit nearly simultaneously, we have one backend take ProcArrayLock and
+clear the XIDs of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
 latestCompletedXid variable.  This allows GetSnapshotData to use
@@ -278,12 +278,13 @@ present in the ProcArray, or not running anymore.  (This guarantee doesn't
 apply to subtransaction XIDs, because of the possibility that there's not
 room for them in the subxid array; instead we guarantee that they are
 present or the overflow flag is set.)  If a backend released XidGenLock
-before storing its XID into MyPgXact, then it would be possible for another
-backend to allocate and commit a later XID, causing latestCompletedXid to
-pass the first backend's XID, before that value became visible in the
-ProcArray.  That would break GetOldestXmin, as discussed below.
+before storing its XID into ProcGlobal->xids[], then it would be possible
+for another backend to allocate and commit a later XID, causing
+latestCompletedXid to pass the first backend's XID, before that value
+became visible in the ProcArray.  That would break ComputeXidHorizons,
+as discussed below.
 
-We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
+We allow GetNewTransactionId to store the XID into ProcGlobal->xids[] (or the
 subxid array) without taking ProcArrayLock.  This was once necessary to
 avoid deadlock; while that is no longer the case, it's still beneficial for
 performance.  We are thereby relying on fetch/store of an XID to be atomic,
@@ -382,13 +383,13 @@ Top-level transactions do not have a parent, so they leave their pg_subtrans
 entries set to the default value of zero (InvalidTransactionId).
 
 pg_subtrans is used to check whether the transaction in question is still
-running --- the main Xid of a transaction is recorded in the PGXACT struct,
-but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
-in shared memory, so we have to store them on disk.  Note, however, that for
-each transaction we keep a "cache" of Xids that are known to be part of the
-transaction tree, so we can skip looking at pg_subtrans unless we know the
-cache has been overflowed.  See storage/ipc/procarray.c for the gory details.
-
+running --- the main Xid of a transaction is recorded in ProcGlobal->xids[],
+with a copy in PGPROC->xid, but since we allow arbitrary nesting of
+subtransactions, we can't fit all Xids in shared memory, so we have to store
+them on disk.  Note, however, that for each transaction we keep a "cache" of
+Xids that are known to be part of the transaction tree, so we can skip looking
+at pg_subtrans unless we know the cache has been overflowed.  See
+storage/ipc/procarray.c for the gory details.
 slru.c is the supporting mechanism for both pg_xact and pg_subtrans.  It
 implements the LRU policy for in-memory buffer pages.  The high-level routines
 for pg_xact are implemented in transam.c, while the low-level functions are in
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index dd2f4d5bc7e..a4599e96610 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -285,15 +285,15 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 	 * updates for multiple backends so that the number of times XactSLRULock
 	 * needs to be acquired is reduced.
 	 *
-	 * For this optimization to be safe, the XID in MyPgXact and the subxids
-	 * in MyProc must be the same as the ones for which we're setting the
-	 * status.  Check that this is the case.
+	 * For this optimization to be safe, the XID and subxids in MyProc must be
+	 * the same as the ones for which we're setting the status.  Check that
+	 * this is the case.
 	 *
 	 * For this optimization to be efficient, we shouldn't have too many
 	 * sub-XIDs and all of the XIDs for which we're adjusting clog should be
 	 * on the same page.  Check those conditions, too.
 	 */
-	if (all_xact_same_page && xid == MyPgXact->xid &&
+	if (all_xact_same_page && xid == MyProc->xid &&
 		nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT &&
 		nsubxids == MyPgXact->nxids &&
 		memcmp(subxids, MyProc->subxids.xids,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index eb5f4680a3d..a0398bf3a3e 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -351,7 +351,7 @@ AtAbort_Twophase(void)
 
 /*
  * This is called after we have finished transferring state to the prepared
- * PGXACT entry.
+ * PGPROC entry.
  */
 void
 PostPrepare_Twophase(void)
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->waitStatus = PROC_WAIT_STATUS_OK;
 	/* We set up the gxact's VXID as InvalidBackendId/XID */
 	proc->lxid = (LocalTransactionId) xid;
-	pgxact->xid = xid;
+	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
 	pgxact->vacuumFlags = 0;
@@ -768,7 +768,6 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 	{
 		GlobalTransaction gxact = &status->array[status->currIdx++];
 		PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 		Datum		values[5];
 		bool		nulls[5];
 		HeapTuple	tuple;
@@ -783,7 +782,7 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
 		MemSet(values, 0, sizeof(values));
 		MemSet(nulls, 0, sizeof(nulls));
 
-		values[0] = TransactionIdGetDatum(pgxact->xid);
+		values[0] = TransactionIdGetDatum(proc->xid);
 		values[1] = CStringGetTextDatum(gxact->gid);
 		values[2] = TimestampTzGetDatum(gxact->prepared_at);
 		values[3] = ObjectIdGetDatum(gxact->owner);
@@ -829,9 +828,8 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
-		if (pgxact->xid == xid)
+		if (gxact->xid == xid)
 		{
 			result = gxact;
 			break;
@@ -987,8 +985,7 @@ void
 StartPrepare(GlobalTransaction gxact)
 {
 	PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
-	TransactionId xid = pgxact->xid;
+	TransactionId xid = gxact->xid;
 	TwoPhaseFileHeader hdr;
 	TransactionId *children;
 	RelFileNode *commitrels;
@@ -1140,15 +1137,15 @@ EndPrepare(GlobalTransaction gxact)
 
 	/*
 	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyPgXact as not running our XID (which it will do immediately after
+	 * MyProc as not running our XID (which it will do immediately after
 	 * this function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
-	 * prepared XID.  This must happen before we clear the XID from MyPgXact,
-	 * else there is a window where the XID is not running according to
-	 * TransactionIdIsInProgress, and onlookers would be entitled to assume
-	 * the xact crashed.  Instead we have a window where the same XID appears
-	 * twice in ProcArray, which is OK.
+	 * prepared XID.  This must happen before we clear the XID from MyProc /
+	 * ProcGlobal->xids[], else there is a window where the XID is not running
+	 * according to TransactionIdIsInProgress, and onlookers would be entitled
+	 * to assume the xact crashed.  Instead we have a window where the same
+	 * XID appears twice in ProcArray, which is OK.
 	 */
 	MarkAsPrepared(gxact, false);
 
@@ -1404,7 +1401,6 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 {
 	GlobalTransaction gxact;
 	PGPROC	   *proc;
-	PGXACT	   *pgxact;
 	TransactionId xid;
 	char	   *buf;
 	char	   *bufptr;
@@ -1423,8 +1419,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 */
 	gxact = LockGXact(gid, GetUserId());
 	proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
-	xid = pgxact->xid;
+	xid = gxact->xid;
 
 	/*
 	 * Read and validate 2PC state data. State data will typically be stored
@@ -1726,7 +1721,7 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
 		/*
-		 * Note that we are using gxact not pgxact so this works in recovery
+		 * Note that we are using gxact not PGPROC so this works in recovery
 		 * also
 		 */
 		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 2ef0f4991ca..4c91b343ecd 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -38,7 +38,8 @@ VariableCache ShmemVariableCache = NULL;
  * Allocate the next FullTransactionId for a new transaction or
  * subtransaction.
  *
- * The new XID is also stored into MyPgXact before returning.
+ * The new XID is also stored into MyProc->xid/ProcGlobal->xids[] before
+ * returning.
  *
  * Note: when this is called, we are actually already inside a valid
  * transaction, since XIDs are now not allocated until the transaction
@@ -65,7 +66,8 @@ GetNewTransactionId(bool isSubXact)
 	if (IsBootstrapProcessingMode())
 	{
 		Assert(!isSubXact);
-		MyPgXact->xid = BootstrapTransactionId;
+		MyProc->xid = BootstrapTransactionId;
+		ProcGlobal->xids[MyProc->pgxactoff] = BootstrapTransactionId;
 		return FullTransactionIdFromEpochAndXid(0, BootstrapTransactionId);
 	}
 
@@ -190,10 +192,10 @@ GetNewTransactionId(bool isSubXact)
 	 * latestCompletedXid is present in the ProcArray, which is essential for
 	 * correct OldestXmin tracking; see src/backend/access/transam/README.
 	 *
-	 * Note that readers of PGXACT xid fields should be careful to fetch the
-	 * value only once, rather than assume they can read a value multiple
-	 * times and get the same answer each time.  Note we are assuming that
-	 * TransactionId and int fetch/store are atomic.
+	 * Note that readers of ProcGlobal->xids/PGPROC->xid should be careful
+	 * to fetch the value for each proc only once, rather than assume they can
+	 * read a value multiple times and get the same answer each time.  Note we
+	 * are assuming that TransactionId and int fetch/store are atomic.
 	 *
 	 * The same comments apply to the subxact xid count and overflow fields.
 	 *
@@ -219,7 +221,11 @@ GetNewTransactionId(bool isSubXact)
 	 * answer later on when someone does have a reason to inquire.)
 	 */
 	if (!isSubXact)
-		MyPgXact->xid = xid;	/* LWLockRelease acts as barrier */
+	{
+		/* LWLockRelease acts as barrier */
+		MyProc->xid = xid;
+		ProcGlobal->xids[MyProc->pgxactoff] = xid;
+	}
 	else
 	{
 		int			nxids = MyPgXact->nxids;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 22228f5684f..648e12c78d8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1724,7 +1724,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 *
 		 * Note: these flags remain set until CommitTransaction or
 		 * AbortTransaction.  We don't want to clear them until we reset
-		 * MyPgXact->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
+		 * MyProc->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
 		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index a016816ae86..e5617ddfb41 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -9,8 +9,9 @@
  * one is as a means of determining the set of currently running transactions.
  *
  * Because of various subtle race conditions it is critical that a backend
- * hold the correct locks while setting or clearing its MyPgXact->xid field.
- * See notes in src/backend/access/transam/README.
+ * hold the correct locks while setting or clearing its xid (in
+ * ProcGlobal->xids[]/MyProc->xid).  See notes in
+ * src/backend/access/transam/README.
  *
  * The process arrays now also include structures representing prepared
  * transactions.  The xid and subxids fields of these are valid, as are the
@@ -436,7 +437,9 @@ ProcArrayAdd(PGPROC *proc)
 	ProcArrayStruct *arrayP = procArray;
 	int			index;
 
+	/* See ProcGlobal comment explaining why both locks are held */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
 	if (arrayP->numProcs >= arrayP->maxProcs)
 	{
@@ -445,7 +448,6 @@ ProcArrayAdd(PGPROC *proc)
 		 * fixed supply of PGPROC structs too, and so we should have failed
 		 * earlier.)
 		 */
-		LWLockRelease(ProcArrayLock);
 		ereport(FATAL,
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
@@ -471,10 +473,25 @@ ProcArrayAdd(PGPROC *proc)
 	}
 
 	memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index],
-			(arrayP->numProcs - index) * sizeof(int));
+			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
+	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+
 	arrayP->pgprocnos[index] = proc->pgprocno;
+	ProcGlobal->xids[index] = proc->xid;
+
 	arrayP->numProcs++;
 
+	for (; index < arrayP->numProcs; index++)
+	{
+		allProcs[arrayP->pgprocnos[index]].pgxactoff = index;
+	}
+
+	/*
+	 * Release in reversed acquisition order, to reduce frequency of having to
+	 * wait for XidGenLock while holding ProcArrayLock.
+	 */
+	LWLockRelease(XidGenLock);
 	LWLockRelease(ProcArrayLock);
 }
 
@@ -500,36 +517,59 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		DisplayXidCache();
 #endif
 
+	/* See ProcGlobal comment explaining why both locks are held */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
+
+	Assert(ProcGlobal->allProcs[arrayP->pgprocnos[proc->pgxactoff]].pgxactoff == proc->pgxactoff);
 
 	if (TransactionIdIsValid(latestXid))
 	{
-		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
 
 		/* Advance global latestCompletedXid while holding the lock */
 		MaintainLatestCompletedXid(latestXid);
+
+		ProcGlobal->xids[proc->pgxactoff] = InvalidTransactionId;
 	}
 	else
 	{
 		/* Shouldn't be trying to remove a live transaction here */
-		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
 	}
 
+	Assert(!TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff]));
+
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		if (arrayP->pgprocnos[index] == proc->pgprocno)
 		{
 			/* Keep the PGPROC array sorted. See notes above */
 			memmove(&arrayP->pgprocnos[index], &arrayP->pgprocnos[index + 1],
-					(arrayP->numProcs - index - 1) * sizeof(int));
+					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
+			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+
 			arrayP->pgprocnos[arrayP->numProcs - 1] = -1;	/* for debugging */
 			arrayP->numProcs--;
+
+			for (; index < arrayP->numProcs; index++)
+			{
+				allProcs[arrayP->pgprocnos[index]].pgxactoff--;
+			}
+
+			/*
+			 * Release in reversed acquisition order, to reduce frequency of
+			 * having to wait for XidGenLock while holding ProcArrayLock.
+			 */
+			LWLockRelease(XidGenLock);
 			LWLockRelease(ProcArrayLock);
 			return;
 		}
 	}
 
 	/* Oops */
+	LWLockRelease(XidGenLock);
 	LWLockRelease(ProcArrayLock);
 
 	elog(LOG, "failed to find proc %p in ProcArray", proc);
@@ -562,7 +602,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * else is taking a snapshot.  See discussion in
 		 * src/backend/access/transam/README.
 		 */
-		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(TransactionIdIsValid(proc->xid));
 
 		/*
 		 * If we can immediately acquire ProcArrayLock, we clear our own XID
@@ -584,7 +624,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * anyone else's calculation of a snapshot.  We might change their
 		 * estimate of global xmin, but that's OK.
 		 */
-		Assert(!TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+		Assert(!TransactionIdIsValid(proc->xid));
 
 		proc->lxid = InvalidLocalTransactionId;
 		/* must be cleared with xid/xmin: */
@@ -607,7 +647,13 @@ static inline void
 ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 								TransactionId latestXid)
 {
-	pgxact->xid = InvalidTransactionId;
+	size_t		pgxactoff = proc->pgxactoff;
+
+	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
+	Assert(ProcGlobal->xids[pgxactoff] == proc->xid);
+
+	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
+	proc->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
 	/* must be cleared with xid/xmin: */
 	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
@@ -643,7 +689,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	uint32		wakeidx;
 
 	/* We should definitely have an XID to clear. */
-	Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+	Assert(TransactionIdIsValid(proc->xid));
 
 	/* Add ourselves to the list of processes needing a group XID clear. */
 	proc->procArrayGroupMember = true;
@@ -748,20 +794,28 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
  * This is used after successfully preparing a 2-phase transaction.  We are
  * not actually reporting the transaction's XID as no longer running --- it
  * will still appear as running because the 2PC's gxact is in the ProcArray
- * too.  We just have to clear out our own PGXACT.
+ * too.  We just have to clear out our own PGPROC.
  */
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
 	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
+	size_t		pgxactoff;
 
 	/*
-	 * We can skip locking ProcArrayLock here, because this action does not
-	 * actually change anyone's view of the set of running XIDs: our entry is
-	 * duplicate with the gxact that has already been inserted into the
-	 * ProcArray.
+	 * We can skip locking ProcArrayLock exclusively here, because this action
+	 * does not actually change anyone's view of the set of running XIDs: our
+	 * entry is duplicate with the gxact that has already been inserted into
+	 * the ProcArray. But we need it in shared mode so that proc->pgxactoff
+	 * stays the same.
 	 */
-	pgxact->xid = InvalidTransactionId;
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+
+	pgxactoff = proc->pgxactoff;
+
+	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
+	proc->xid = InvalidTransactionId;
+
 	proc->lxid = InvalidLocalTransactionId;
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
@@ -773,6 +827,8 @@ ProcArrayClearTransaction(PGPROC *proc)
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
 	pgxact->overflowed = false;
+
+	LWLockRelease(ProcArrayLock);
 }
 
 /*
@@ -1167,7 +1223,7 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * there are four possibilities for finding a running transaction:
  *
  * 1. The given Xid is a main transaction Id.  We will find this out cheaply
- * by looking at the PGXACT struct for each backend.
+ * by looking at ProcGlobal->xids.
  *
  * 2. The given Xid is one of the cached subxact Xids in the PGPROC array.
  * We can find this out cheaply too.
@@ -1176,26 +1232,28 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
  * if the Xid is running on the primary.
  *
  * 4. Search the SubTrans tree to find the Xid's topmost parent, and then see
- * if that is running according to PGXACT or KnownAssignedXids.  This is the
- * slowest way, but sadly it has to be done always if the others failed,
- * unless we see that the cached subxact sets are complete (none have
+ * if that is running according to ProcGlobal->xids[] or KnownAssignedXids.
+ * This is the slowest way, but sadly it has to be done always if the others
+ * failed, unless we see that the cached subxact sets are complete (none have
  * overflowed).
  *
  * ProcArrayLock has to be held while we do 1, 2, 3.  If we save the top Xids
  * while doing 1 and 3, we can release the ProcArrayLock while we do 4.
  * This buys back some concurrency (and we can't retrieve the main Xids from
- * PGXACT again anyway; see GetNewTransactionId).
+ * ProcGlobal->xids[] again anyway; see GetNewTransactionId).
  */
 bool
 TransactionIdIsInProgress(TransactionId xid)
 {
 	static TransactionId *xids = NULL;
+	static TransactionId *other_xids;
 	int			nxids = 0;
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId topxid;
 	TransactionId latestCompletedXid;
-	int			i,
-				j;
+	int			mypgxactoff;
+	size_t		numProcs;
+	int			j;
 
 	/*
 	 * Don't bother checking a transaction older than RecentXmin; it could not
@@ -1250,6 +1308,8 @@ TransactionIdIsInProgress(TransactionId xid)
 					 errmsg("out of memory")));
 	}
 
+	other_xids = ProcGlobal->xids;
+
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/*
@@ -1266,20 +1326,22 @@ TransactionIdIsInProgress(TransactionId xid)
 	}
 
 	/* No shortcuts, gotta grovel through the array */
-	for (i = 0; i < arrayP->numProcs; i++)
+	mypgxactoff = MyProc->pgxactoff;
+	numProcs = arrayP->numProcs;
+	for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 	{
-		int			pgprocno = arrayP->pgprocnos[i];
-		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		int			pgprocno;
+		PGXACT	   *pgxact;
+		PGPROC	   *proc;
 		TransactionId pxid;
 		int			pxids;
 
-		/* Ignore my own proc --- dealt with it above */
-		if (proc == MyProc)
+		/* Ignore ourselves --- dealt with it above */
+		if (pgxactoff == mypgxactoff)
 			continue;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = UINT32_ACCESS_ONCE(pgxact->xid);
+		pxid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -1304,8 +1366,12 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
+		pgprocno = arrayP->pgprocnos[pgxactoff];
+		pgxact = &allPgXact[pgprocno];
 		pxids = pgxact->nxids;
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
+		pgprocno = arrayP->pgprocnos[pgxactoff];
+		proc = &allProcs[pgprocno];
 		for (j = pxids - 1; j >= 0; j--)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
@@ -1336,7 +1402,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	 */
 	if (RecoveryInProgress())
 	{
-		/* none of the PGXACT entries should have XIDs in hot standby mode */
+		/* none of the PGPROC entries should have XIDs in hot standby mode */
 		Assert(nxids == 0);
 
 		if (KnownAssignedXidExists(xid))
@@ -1391,7 +1457,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	Assert(TransactionIdIsValid(topxid));
 	if (!TransactionIdEquals(topxid, xid))
 	{
-		for (i = 0; i < nxids; i++)
+		for (int i = 0; i < nxids; i++)
 		{
 			if (TransactionIdEquals(xids[i], topxid))
 				return true;
@@ -1414,6 +1480,7 @@ TransactionIdIsActive(TransactionId xid)
 {
 	bool		result = false;
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	int			i;
 
 	/*
@@ -1429,11 +1496,10 @@ TransactionIdIsActive(TransactionId xid)
 	{
 		int			pgprocno = arrayP->pgprocnos[i];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId pxid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		pxid = UINT32_ACCESS_ONCE(pgxact->xid);
+		pxid = UINT32_ACCESS_ONCE(other_xids[i]);
 
 		if (!TransactionIdIsValid(pxid))
 			continue;
@@ -1519,6 +1585,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId kaxmin;
 	bool		in_recovery = RecoveryInProgress();
+	TransactionId *other_xids = ProcGlobal->xids;
 
 	/* inferred after ProcArrayLock is released */
 	h->catalog_oldest_nonremovable = InvalidTransactionId;
@@ -1562,7 +1629,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		TransactionId xmin;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*
@@ -1853,14 +1920,17 @@ Snapshot
 GetSnapshotData(Snapshot snapshot)
 {
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	TransactionId xmin;
 	TransactionId xmax;
-	int			index;
-	int			count = 0;
+	size_t		count = 0;
 	int			subcount = 0;
 	bool		suboverflowed = false;
 	FullTransactionId latest_completed;
 	TransactionId oldestxid;
+	int			mypgxactoff;
+	TransactionId myxid;
+
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1905,6 +1975,10 @@ GetSnapshotData(Snapshot snapshot)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	latest_completed = ShmemVariableCache->latestCompletedXid;
+	mypgxactoff = MyProc->pgxactoff;
+	myxid = other_xids[mypgxactoff];
+	Assert(myxid == MyProc->xid);
+
 	oldestxid = ShmemVariableCache->oldestXid;
 
 	/* xmax is always latestCompletedXid + 1 */
@@ -1915,57 +1989,79 @@ GetSnapshotData(Snapshot snapshot)
 	/* initialize xmin calculation with xmax */
 	xmin = xmax;
 
+	/* take own xid into account, saves a check inside the loop */
+	if (TransactionIdIsNormal(myxid) && NormalTransactionIdPrecedes(myxid, xmin))
+		xmin = myxid;
+
 	snapshot->takenDuringRecovery = RecoveryInProgress();
 
 	if (!snapshot->takenDuringRecovery)
 	{
+		size_t		numProcs = arrayP->numProcs;
+		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
-		int			numProcs;
 
 		/*
-		 * Spin over procArray checking xid, xmin, and subxids.  The goal is
-		 * to gather all active xids, find the lowest xmin, and try to record
-		 * subxids.
+		 * First collect set of pgxactoff/xids that need to be included in the
+		 * snapshot.
 		 */
-		numProcs = arrayP->numProcs;
-		for (index = 0; index < numProcs; index++)
+		for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 		{
-			int			pgprocno = pgprocnos[index];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			TransactionId xid;
+			/* Fetch xid just once - see GetNewTransactionId */
+			TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
+			int			pgprocno;
+			PGXACT	   *pgxact;
+			uint8		vacuumFlags;
+
+			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+
+			/*
+			 * If the transaction has no XID assigned, we can skip it; it
+			 * won't have sub-XIDs either.
+			 */
+			if (likely(xid == InvalidTransactionId))
+				continue;
+
+			/*
+			 * We don't include our own XIDs (if any) in the snapshot. It
+			 * needs to be included in the xmin computation, but we did so
+			 * outside the loop.
+			 */
+			if (pgxactoff == mypgxactoff)
+				continue;
+
+			/*
+			 * The only way we are able to get here with a non-normal xid
+			 * is during bootstrap - with this backend using
+			 * BootstrapTransactionId. But the above test should filter
+			 * that out.
+			 */
+			Assert(TransactionIdIsNormal(xid));
+
+			/*
+			 * If the XID is >= xmax, we can skip it; such transactions will
+			 * be treated as running anyway (and any sub-XIDs will also be >=
+			 * xmax).
+			 */
+			if (!NormalTransactionIdPrecedes(xid, xmax))
+				continue;
+
+			pgprocno = pgprocnos[pgxactoff];
+			pgxact = &allPgXact[pgprocno];
+			vacuumFlags = pgxact->vacuumFlags;
 
 			/*
 			 * Skip over backends doing logical decoding which manages xmin
 			 * separately (check below) and ones running LAZY VACUUM.
 			 */
-			if (pgxact->vacuumFlags &
-				(PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
+			if (vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
-			/* Fetch xid just once - see GetNewTransactionId */
-			xid = UINT32_ACCESS_ONCE(pgxact->xid);
-
-			/*
-			 * If the transaction has no XID assigned, we can skip it; it
-			 * won't have sub-XIDs either.  If the XID is >= xmax, we can also
-			 * skip it; such transactions will be treated as running anyway
-			 * (and any sub-XIDs will also be >= xmax).
-			 */
-			if (!TransactionIdIsNormal(xid)
-				|| !NormalTransactionIdPrecedes(xid, xmax))
-				continue;
-
-			/*
-			 * We don't include our own XIDs (if any) in the snapshot, but we
-			 * must include them in xmin.
-			 */
 			if (NormalTransactionIdPrecedes(xid, xmin))
 				xmin = xid;
-			if (pgxact == MyPgXact)
-				continue;
 
 			/* Add XID to snapshot. */
-			snapshot->xip[count++] = xid;
+			xip[count++] = xid;
 
 			/*
 			 * Save subtransaction XIDs if possible (if we've already
@@ -1988,9 +2084,9 @@ GetSnapshotData(Snapshot snapshot)
 					suboverflowed = true;
 				else
 				{
-					int			nxids = pgxact->nxids;
+					int			nsubxids = pgxact->nxids;
 
-					if (nxids > 0)
+					if (nsubxids > 0)
 					{
 						PGPROC	   *proc = &allProcs[pgprocno];
 
@@ -1998,8 +2094,8 @@ GetSnapshotData(Snapshot snapshot)
 
 						memcpy(snapshot->subxip + subcount,
 							   (void *) proc->subxids.xids,
-							   nxids * sizeof(TransactionId));
-						subcount += nxids;
+							   nsubxids * sizeof(TransactionId));
+						subcount += nsubxids;
 					}
 				}
 			}
@@ -2131,6 +2227,7 @@ GetSnapshotData(Snapshot snapshot)
 	}
 
 	RecentXmin = xmin;
+	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
 
 	snapshot->xmin = xmin;
 	snapshot->xmax = xmax;
@@ -2293,7 +2390,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * GetRunningTransactionData -- returns information about running transactions.
  *
  * Similar to GetSnapshotData but returns more information. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes and
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes and
  * prepared transactions.
  *
  * We acquire XidGenLock and ProcArrayLock, but the caller is responsible for
@@ -2308,7 +2405,7 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
  * This is never executed during recovery so there is no need to look at
  * KnownAssignedXids.
  *
- * Dummy PGXACTs from prepared transaction are included, meaning that this
+ * Dummy PGPROCs from prepared transaction are included, meaning that this
  * may return entries with duplicated TransactionId values coming from
  * transaction finishing to prepare.  Nothing is done about duplicated
  * entries here to not hold on ProcArrayLock more than necessary.
@@ -2327,6 +2424,7 @@ GetRunningTransactionData(void)
 	static RunningTransactionsData CurrentRunningXactsData;
 
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	RunningTransactions CurrentRunningXacts = &CurrentRunningXactsData;
 	TransactionId latestCompletedXid;
 	TransactionId oldestRunningXid;
@@ -2387,7 +2485,7 @@ GetRunningTransactionData(void)
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 		/*
 		 * We don't need to store transactions that don't have a TransactionId
@@ -2484,7 +2582,7 @@ GetRunningTransactionData(void)
  * GetOldestActiveTransactionId()
  *
  * Similar to GetSnapshotData but returns just oldestActiveXid. We include
- * all PGXACTs with an assigned TransactionId, even VACUUM processes.
+ * all PGPROCs with an assigned TransactionId, even VACUUM processes.
  * We look at all databases, though there is no need to include WALSender
  * since this has no effect on hot standby conflicts.
  *
@@ -2499,6 +2597,7 @@ TransactionId
 GetOldestActiveTransactionId(void)
 {
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	TransactionId oldestRunningXid;
 	int			index;
 
@@ -2521,12 +2620,10 @@ GetOldestActiveTransactionId(void)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		int			pgprocno = arrayP->pgprocnos[index];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(pgxact->xid);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 		if (!TransactionIdIsNormal(xid))
 			continue;
@@ -2604,8 +2701,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 * If we're not in recovery, we walk over the procarray and collect the
 	 * lowest xid. Since we're called with ProcArrayLock held and have
 	 * acquired XidGenLock, no entries can vanish concurrently, since
-	 * PGXACT->xid is only set with XidGenLock held and only cleared with
-	 * ProcArrayLock held.
+	 * ProcGlobal->xids[i] is only set with XidGenLock held and only cleared
+	 * with ProcArrayLock held.
 	 *
 	 * In recovery we can't lower the safe value besides what we've computed
 	 * above, so we'll have to wait a bit longer there. We unfortunately can
@@ -2614,17 +2711,17 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
 	 */
 	if (!recovery_in_progress)
 	{
+		TransactionId *other_xids = ProcGlobal->xids;
+
 		/*
-		 * Spin over procArray collecting all min(PGXACT->xid)
+		 * Spin over procArray collecting min(ProcGlobal->xids[i])
 		 */
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
-			int			pgprocno = arrayP->pgprocnos[index];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
 			TransactionId xid;
 
 			/* Fetch xid just once - see GetNewTransactionId */
-			xid = UINT32_ACCESS_ONCE(pgxact->xid);
+			xid = UINT32_ACCESS_ONCE(other_xids[index]);
 
 			if (!TransactionIdIsNormal(xid))
 				continue;
@@ -2812,6 +2909,7 @@ BackendXidGetPid(TransactionId xid)
 {
 	int			result = 0;
 	ProcArrayStruct *arrayP = procArray;
+	TransactionId *other_xids = ProcGlobal->xids;
 	int			index;
 
 	if (xid == InvalidTransactionId)	/* never match invalid xid */
@@ -2823,9 +2921,8 @@ BackendXidGetPid(TransactionId xid)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
-		if (pgxact->xid == xid)
+		if (other_xids[index] == xid)
 		{
 			result = proc->pid;
 			break;
@@ -3105,7 +3202,6 @@ MinimumActiveBackends(int min)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 
 		/*
 		 * Since we're not holding a lock, need to be prepared to deal with
@@ -3122,7 +3218,7 @@ MinimumActiveBackends(int min)
 			continue;			/* do not count deleted entries */
 		if (proc == MyProc)
 			continue;			/* do not count myself */
-		if (pgxact->xid == InvalidTransactionId)
+		if (proc->xid == InvalidTransactionId)
 			continue;			/* do not count if no XID assigned */
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
@@ -3548,8 +3644,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 *
 	 * Note that we do not have to be careful about memory ordering of our own
 	 * reads wrt. GetNewTransactionId() here - only this process can modify
-	 * relevant fields of MyProc/MyPgXact.  But we do have to be careful about
-	 * our own writes being well ordered.
+	 * relevant fields of MyProc/ProcGlobal->xids[].  But we do have to be
+	 * careful about our own writes being well ordered.
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
@@ -3907,7 +4003,7 @@ FullXidRelativeTo(FullTransactionId rel, TransactionId xid)
  * In Hot Standby mode, we maintain a list of transactions that are (or were)
  * running on the primary at the current point in WAL.  These XIDs must be
  * treated as running by standby transactions, even though they are not in
- * the standby server's PGXACT array.
+ * the standby server's PGPROC array.
  *
  * We record all XIDs that we know have been assigned.  That includes all the
  * XIDs seen in WAL records, plus all unobserved XIDs that we can deduce have
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index ad048bc85fa..a9477ccb4a3 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -417,9 +417,7 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 
 		if (proc != NULL)
 		{
-			PGXACT	   *xact = &ProcGlobal->allPgXact[proc->pgprocno];
-
-			*xid = xact->xid;
+			*xid = proc->xid;
 			*xmin = proc->xmin;
 		}
 	}
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 95989ce79bd..d86566f4554 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3974,9 +3974,8 @@ GetRunningTransactionLocks(int *nlocks)
 			proclock->tag.myLock->tag.locktag_type == LOCKTAG_RELATION)
 		{
 			PGPROC	   *proc = proclock->tag.myProc;
-			PGXACT	   *pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
 			LOCK	   *lock = proclock->tag.myLock;
-			TransactionId xid = pgxact->xid;
+			TransactionId xid = proc->xid;
 
 			/*
 			 * Don't record locks for transactions if we know they have
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index de346cd87fc..7fad49544ce 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -102,21 +102,18 @@ Size
 ProcGlobalShmemSize(void)
 {
 	Size		size = 0;
+	Size		TotalProcs =
+		add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
 
 	/* ProcGlobal */
 	size = add_size(size, sizeof(PROC_HDR));
-	/* MyProcs, including autovacuum workers and launcher */
-	size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
-	/* AuxiliaryProcs */
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
-	/* Prepared xacts */
-	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
-	/* ProcStructLock */
+	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, sizeof(slock_t));
 
 	size = add_size(size, mul_size(MaxBackends, sizeof(PGXACT)));
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
 
 	return size;
 }
@@ -216,6 +213,17 @@ InitProcGlobal(void)
 	MemSet(pgxacts, 0, TotalProcs * sizeof(PGXACT));
 	ProcGlobal->allPgXact = pgxacts;
 
+	/*
+	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
+	 * PROC_HDR.
+	 *
+	 * XXX: It might make sense to increase padding for these arrays, given
+	 * how hotly they are accessed.
+	 */
+	ProcGlobal->xids =
+		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+
 	for (i = 0; i < TotalProcs; i++)
 	{
 		/* Common initialization for all PGPROCs, regardless of type. */
@@ -387,7 +395,7 @@ InitProcess(void)
 	MyProc->lxid = InvalidLocalTransactionId;
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
-	MyPgXact->xid = InvalidTransactionId;
+	MyProc->xid = InvalidTransactionId;
 	MyProc->xmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	/* backendId, databaseId and roleId will be filled in later */
@@ -571,7 +579,7 @@ InitAuxiliaryProcess(void)
 	MyProc->lxid = InvalidLocalTransactionId;
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
-	MyPgXact->xid = InvalidTransactionId;
+	MyProc->xid = InvalidTransactionId;
 	MyProc->xmin = InvalidTransactionId;
 	MyProc->backendId = InvalidBackendId;
 	MyProc->databaseId = InvalidOid;
-- 
2.25.0.114.g5b0ca878e0
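
For readers who want to see the core idea of the patch above in isolation, here is a minimal standalone sketch. It is not PostgreSQL code: the names (ToyXid, toy_collect_xids, TOY_INVALID_XID) are made up for illustration, plain integer comparison stands in for the wraparound-aware NormalTransactionIdPrecedes(), and no locking or subtransaction handling is shown. The only point is that the scan walks one dense array of xids, so the common "no xid assigned" case is decided without touching any per-backend struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;			/* hypothetical stand-in for TransactionId */
#define TOY_INVALID_XID ((ToyXid) 0)

/*
 * Scan a dense array of xids (one slot per backend in the array) and
 * collect those that belong in a snapshot.
 *
 * Assumptions of this sketch: xids never wrap around, the caller has
 * already folded its own xid into *xmin, and the caller holds whatever
 * lock protects the array.
 *
 * Returns the number of xids stored in out[]; *xmin is lowered to the
 * smallest xid that was collected.
 */
size_t
toy_collect_xids(const ToyXid *xids, size_t nprocs, size_t my_off,
				 ToyXid xmax, ToyXid *out, ToyXid *xmin)
{
	size_t		count = 0;

	for (size_t off = 0; off < nprocs; off++)
	{
		ToyXid		xid = xids[off];

		if (xid == TOY_INVALID_XID)	/* common case: no xid assigned */
			continue;
		if (off == my_off)			/* own xid was handled by the caller */
			continue;
		if (xid >= xmax)			/* treated as running anyway */
			continue;

		if (xid < *xmin)
			*xmin = xid;
		out[count++] = xid;
	}

	return count;
}
```

With six slots holding {0, 105, 0, 99, 250, 101}, my_off = 5 and xmax = 200, the function stores 105 and 99 in out[] and lowers *xmin to 99; the other four slots are skipped after reading just one array element each.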

v13-0004-snapshot-scalability-Move-PGXACT-vacuumFlags-to-.patch (text/x-diff; charset=us-ascii)
From 59a1af7d77dd63deac48270f36e5419244e38c80 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v13 4/6] snapshot scalability: Move PGXACT->vacuumFlags to
 ProcGlobal->vacuumFlags.

Similar to the previous commit, this increases the chance that data
frequently needed by GetSnapshotData() stays in L2 cache. As we now
take care not to write to ProcGlobal->vacuumFlags unnecessarily, there
should be very few modifications to the ProcGlobal->vacuumFlags array.

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h                | 12 ++++-
 src/backend/access/transam/twophase.c     |  2 +-
 src/backend/commands/vacuum.c             |  5 +-
 src/backend/postmaster/autovacuum.c       |  6 +--
 src/backend/replication/logical/logical.c |  3 +-
 src/backend/replication/slot.c            |  3 +-
 src/backend/storage/ipc/procarray.c       | 66 ++++++++++++++---------
 src/backend/storage/lmgr/deadlock.c       |  4 +-
 src/backend/storage/lmgr/proc.c           | 16 +++---
 9 files changed, 73 insertions(+), 44 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 146bca84bd6..ea95cf92402 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -41,7 +41,7 @@ struct XidCache
 };
 
 /*
- * Flags for PGXACT->vacuumFlags
+ * Flags for ProcGlobal->vacuumFlags[]
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
@@ -167,6 +167,9 @@ struct PGPROC
 
 	bool		delayChkpt;		/* true if this proc delays checkpoint start */
 
+	uint8		vacuumFlags;    /* this backend's vacuum flags, see PROC_*
+								 * above. mirrored in
+								 * ProcGlobal->vacuumFlags[pgxactoff] */
 	/*
 	 * Info to allow us to wait for synchronous replication, if needed.
 	 * waitLSN is InvalidXLogRecPtr if not waiting; set only by user backend.
@@ -244,7 +247,6 @@ extern PGDLLIMPORT struct PGXACT *MyPgXact;
  */
 typedef struct PGXACT
 {
-	uint8		vacuumFlags;	/* vacuum-related flags, see above */
 	bool		overflowed;
 
 	uint8		nxids;
@@ -314,6 +316,12 @@ typedef struct PROC_HDR
 	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
 	TransactionId *xids;
 
+	/*
+	 * Array mirroring PGPROC.vacuumFlags for each PGPROC currently in the
+	 * procarray.
+	 */
+	uint8	   *vacuumFlags;
+
 	/* Length of allProcs array */
 	uint32		allProcCount;
 	/* Head of list of free PGPROC structures */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index a0398bf3a3e..744b8a7f393 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -466,7 +466,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
 	proc->delayChkpt = false;
-	pgxact->vacuumFlags = 0;
+	proc->vacuumFlags = 0;
 	proc->pid = 0;
 	proc->backendId = InvalidBackendId;
 	proc->databaseId = databaseid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 648e12c78d8..aba13c31d1b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1728,9 +1728,10 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 * might appear to go backwards, which is probably Not Good.
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
+		MyProc->vacuumFlags |= PROC_IN_VACUUM;
 		if (params->is_wraparound)
-			MyPgXact->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+			MyProc->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 		LWLockRelease(ProcArrayLock);
 	}
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ac97e28be19..c6ec657a936 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2493,7 +2493,7 @@ do_autovacuum(void)
 						   tab->at_datname, tab->at_nspname, tab->at_relname);
 			EmitErrorReport();
 
-			/* this resets the PGXACT flags too */
+			/* this resets ProcGlobal->vacuumFlags[i] too */
 			AbortOutOfAnyTransaction();
 			FlushErrorState();
 			MemoryContextResetAndDeleteChildren(PortalContext);
@@ -2509,7 +2509,7 @@ do_autovacuum(void)
 
 		did_vacuum = true;
 
-		/* the PGXACT flags are reset at the next end of transaction */
+		/* ProcGlobal->vacuumFlags[i] are reset at the next end of xact */
 
 		/* be tidy */
 deleted:
@@ -2686,7 +2686,7 @@ perform_work_item(AutoVacuumWorkItem *workitem)
 				   cur_datname, cur_nspname, cur_relname);
 		EmitErrorReport();
 
-		/* this resets the PGXACT flags too */
+		/* this resets ProcGlobal->vacuumFlags[i] too */
 		AbortOutOfAnyTransaction();
 		FlushErrorState();
 		MemoryContextResetAndDeleteChildren(PortalContext);
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 57c5b513ccf..0f6af952f93 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -181,7 +181,8 @@ StartupDecodingContext(List *output_plugin_options,
 	if (!IsTransactionOrTransactionBlock())
 	{
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-		MyPgXact->vacuumFlags |= PROC_IN_LOGICAL_DECODING;
+		MyProc->vacuumFlags |= PROC_IN_LOGICAL_DECODING;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 		LWLockRelease(ProcArrayLock);
 	}
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 3dc01b6df22..42c78eabd4e 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -520,7 +520,8 @@ ReplicationSlotRelease(void)
 
 	/* might not have been set when we've been a plain slot */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyPgXact->vacuumFlags &= ~PROC_IN_LOGICAL_DECODING;
+	MyProc->vacuumFlags &= ~PROC_IN_LOGICAL_DECODING;
+	ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e5617ddfb41..e77d4e44fb8 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -476,9 +476,12 @@ ProcArrayAdd(PGPROC *proc)
 			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
 	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+	memmove(&ProcGlobal->vacuumFlags[index + 1], &ProcGlobal->vacuumFlags[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->vacuumFlags));
 
 	arrayP->pgprocnos[index] = proc->pgprocno;
 	ProcGlobal->xids[index] = proc->xid;
+	ProcGlobal->vacuumFlags[index] = proc->vacuumFlags;
 
 	arrayP->numProcs++;
 
@@ -539,6 +542,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	}
 
 	Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff] == 0));
+	ProcGlobal->vacuumFlags[proc->pgxactoff] = 0;
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
@@ -549,6 +553,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
 			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+			memmove(&ProcGlobal->vacuumFlags[index], &ProcGlobal->vacuumFlags[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->vacuumFlags));
 
 			arrayP->pgprocnos[arrayP->numProcs - 1] = -1;	/* for debugging */
 			arrayP->numProcs--;
@@ -627,14 +633,24 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		Assert(!TransactionIdIsValid(proc->xid));
 
 		proc->lxid = InvalidLocalTransactionId;
-		/* must be cleared with xid/xmin: */
-		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
 		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false;	/* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
 		Assert(pgxact->nxids == 0);
 		Assert(pgxact->overflowed == false);
+
+		/* must be cleared with xid/xmin: */
+		/* avoid unnecessarily dirtying shared cachelines */
+		if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
+		{
+			Assert(!LWLockHeldByMe(ProcArrayLock));
+			LWLockAcquire(ProcArrayLock, LW_SHARED);
+			Assert(proc->vacuumFlags == ProcGlobal->vacuumFlags[proc->pgxactoff]);
+			proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+			ProcGlobal->vacuumFlags[proc->pgxactoff] = proc->vacuumFlags;
+			LWLockRelease(ProcArrayLock);
+		}
 	}
 }
 
@@ -655,12 +671,18 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
 	proc->xid = InvalidTransactionId;
 	proc->lxid = InvalidLocalTransactionId;
-	/* must be cleared with xid/xmin: */
-	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
 	proc->xmin = InvalidTransactionId;
 	proc->delayChkpt = false;	/* be sure this is cleared in abort */
 	proc->recoveryConflictPending = false;
 
+	/* must be cleared with xid/xmin: */
+	/* avoid unnecessarily dirtying shared cachelines */
+	if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
+	{
+		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
+		ProcGlobal->vacuumFlags[proc->pgxactoff] = proc->vacuumFlags;
+	}
+
 	/* Clear the subtransaction-XID cache too while holding the lock */
 	pgxact->nxids = 0;
 	pgxact->overflowed = false;
@@ -820,9 +842,8 @@ ProcArrayClearTransaction(PGPROC *proc)
 	proc->xmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
-	/* redundant, but just in case */
-	pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-	proc->delayChkpt = false;
+	Assert(!(proc->vacuumFlags & PROC_VACUUM_STATE_MASK));
+	Assert(!proc->delayChkpt);
 
 	/* Clear the subtransaction-XID cache too */
 	pgxact->nxids = 0;
@@ -1624,7 +1645,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 		TransactionId xid;
 		TransactionId xmin;
 
@@ -1641,10 +1662,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 */
 		xmin = TransactionIdOlder(xmin, xid);
 
-		/* if neither is set, this proc doesn't influence the horizon */
-		if (!TransactionIdIsValid(xmin))
-			continue;
-
 		/*
 		 * Don't ignore any procs when determining which transactions might be
 		 * considered running.  While slots should ensure logical decoding
@@ -1659,7 +1676,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * removed, as long as pg_subtrans is not truncated) or doing logical
 		 * decoding (which manages xmin separately, check below).
 		 */
-		if (pgxact->vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
+		if (vacuumFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
 			continue;
 
 		/* shared tables need to take backends in all database into account */
@@ -2000,6 +2017,7 @@ GetSnapshotData(Snapshot snapshot)
 		size_t		numProcs = arrayP->numProcs;
 		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
+		uint8	   *allVacuumFlags = ProcGlobal->vacuumFlags;
 
 		/*
 		 * First collect set of pgxactoff/xids that need to be included in the
@@ -2009,8 +2027,6 @@ GetSnapshotData(Snapshot snapshot)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
 			TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
-			int			pgprocno;
-			PGXACT	   *pgxact;
 			uint8		vacuumFlags;
 
 			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
@@ -2046,14 +2062,11 @@ GetSnapshotData(Snapshot snapshot)
 			if (!NormalTransactionIdPrecedes(xid, xmax))
 				continue;
 
-			pgprocno = pgprocnos[pgxactoff];
-			pgxact = &allPgXact[pgprocno];
-			vacuumFlags = pgxact->vacuumFlags;
-
 			/*
 			 * Skip over backends doing logical decoding which manages xmin
 			 * separately (check below) and ones running LAZY VACUUM.
 			 */
+			vacuumFlags = allVacuumFlags[pgxactoff];
 			if (vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
 				continue;
 
@@ -2080,6 +2093,9 @@ GetSnapshotData(Snapshot snapshot)
 			 */
 			if (!suboverflowed)
 			{
+				int			pgprocno = pgprocnos[pgxactoff];
+				PGXACT	   *pgxact = &allPgXact[pgprocno];
+
 				if (pgxact->overflowed)
 					suboverflowed = true;
 				else
@@ -2298,11 +2314,11 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		int			vacuumFlags = ProcGlobal->vacuumFlags[index];
 		TransactionId xid;
 
 		/* Ignore procs running LAZY VACUUM */
-		if (pgxact->vacuumFlags & PROC_IN_VACUUM)
+		if (vacuumFlags & PROC_IN_VACUUM)
 			continue;
 
 		/* We are only interested in the specific virtual transaction. */
@@ -2992,12 +3008,12 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
 		PGPROC	   *proc = &allProcs[pgprocno];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
+		uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 
 		if (proc == MyProc)
 			continue;
 
-		if (excludeVacuum & pgxact->vacuumFlags)
+		if (excludeVacuum & vacuumFlags)
 			continue;
 
 		if (allDbs || proc->databaseId == MyDatabaseId)
@@ -3412,7 +3428,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
 			PGPROC	   *proc = &allProcs[pgprocno];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
+			uint8		vacuumFlags = ProcGlobal->vacuumFlags[index];
 
 			if (proc->databaseId != databaseId)
 				continue;
@@ -3426,7 +3442,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 			else
 			{
 				(*nbackends)++;
-				if ((pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				if ((vacuumFlags & PROC_IS_AUTOVACUUM) &&
 					nautovacs < MAXAUTOVACPIDS)
 					autovac_pids[nautovacs++] = proc->pid;
 			}
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index beedc7947db..e1246b8a4da 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -544,7 +544,6 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 {
 	PGPROC	   *proc;
 	LOCK	   *lock = checkProc->waitLock;
-	PGXACT	   *pgxact;
 	PROCLOCK   *proclock;
 	SHM_QUEUE  *procLocks;
 	LockMethod	lockMethodTable;
@@ -582,7 +581,6 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 		PGPROC	   *leader;
 
 		proc = proclock->tag.myProc;
-		pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
 		leader = proc->lockGroupLeader == NULL ? proc : proc->lockGroupLeader;
 
 		/* A proc never blocks itself or any other lock group member */
@@ -630,7 +628,7 @@ FindLockCycleRecurseMember(PGPROC *checkProc,
 					 * ProcArrayLock.
 					 */
 					if (checkProc == MyProc &&
-						pgxact->vacuumFlags & PROC_IS_AUTOVACUUM)
+						proc->vacuumFlags & PROC_IS_AUTOVACUUM)
 						blocking_autovacuum_proc = proc;
 
 					/* We're done looking at this proclock */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 7fad49544ce..f6113b2d243 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -114,6 +114,7 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
 	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
 
 	return size;
 }
@@ -223,6 +224,8 @@ InitProcGlobal(void)
 	ProcGlobal->xids =
 		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->vacuumFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
+	MemSet(ProcGlobal->vacuumFlags, 0, TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 
 	for (i = 0; i < TotalProcs; i++)
 	{
@@ -405,10 +408,10 @@ InitProcess(void)
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
 	MyProc->delayChkpt = false;
-	MyPgXact->vacuumFlags = 0;
+	MyProc->vacuumFlags = 0;
 	/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
 	if (IsAutoVacuumWorkerProcess())
-		MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
+		MyProc->vacuumFlags |= PROC_IS_AUTOVACUUM;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
 	MyProc->waitLock = NULL;
@@ -587,7 +590,7 @@ InitAuxiliaryProcess(void)
 	MyProc->tempNamespaceId = InvalidOid;
 	MyProc->isBackgroundWorker = IsBackgroundWorker;
 	MyProc->delayChkpt = false;
-	MyPgXact->vacuumFlags = 0;
+	MyProc->vacuumFlags = 0;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
 	MyProc->waitLock = NULL;
@@ -1323,7 +1326,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		if (deadlock_state == DS_BLOCKED_BY_AUTOVACUUM && allow_autovacuum_cancel)
 		{
 			PGPROC	   *autovac = GetBlockingAutoVacuumPgproc();
-			PGXACT	   *autovac_pgxact = &ProcGlobal->allPgXact[autovac->pgprocno];
+			uint8		vacuumFlags;
 
 			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
@@ -1331,8 +1334,9 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 * Only do it if the worker is not working to protect against Xid
 			 * wraparound.
 			 */
-			if ((autovac_pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
-				!(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
+			vacuumFlags = ProcGlobal->vacuumFlags[autovac->pgxactoff];
+			if ((vacuumFlags & PROC_IS_AUTOVACUUM) &&
+				!(vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
 			{
 				int			pid = autovac->pid;
 				StringInfoData locktagbuf;
-- 
2.25.0.114.g5b0ca878e0

v13-0005-snapshot-scalability-Move-subxact-info-to-ProcGl.patch (text/x-diff; charset=us-ascii)
From fc8c7db296df03bcd527fa0089b2ad6bf8641864 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v13 5/6] snapshot scalability: Move subxact info to
 ProcGlobal, remove PGXACT.

Similar to the previous changes, this increases the chance that data
frequently needed by GetSnapshotData() stays in L2 cache. In many
workloads subtransactions are very rare, and this makes the check for
that considerably cheaper.

As this removes the last member of PGXACT, there is no need to keep it
around anymore.
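
A minimal sketch of the layout this commit message describes, not the
patch's actual code: the small count/overflowed pair is kept both in the
per-backend struct and in a dense array, so a scan over all backends only
touches the dense array. All names below (Proc, Global, add_subxid,
count_subxids, MAX_PROCS) are hypothetical, the dense offset is assumed to
equal the proc index (the real code goes through pgprocnos[] and
pgxactoff), and locking and memory barriers are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CACHED_SUBXIDS 64	/* stands in for PGPROC_MAX_CACHED_SUBXIDS */
#define MAX_PROCS 8				/* hypothetical, tiny for illustration */

typedef struct SubxidStatus
{
	uint8_t		count;			/* number of cached subxids */
	bool		overflowed;		/* has the cache overflowed? */
} SubxidStatus;

typedef struct Proc
{
	uint32_t	xid;
	SubxidStatus subxidStatus;	/* backend-local copy */
	uint32_t	subxids[MAX_CACHED_SUBXIDS];
	char		other_state[512];	/* stands in for the rest of PGPROC */
} Proc;

typedef struct Global
{
	Proc		procs[MAX_PROCS];
	SubxidStatus subxidStates[MAX_PROCS];	/* dense mirror */
	size_t		numProcs;
} Global;

/* Record a new subxid, keeping the dense mirror in sync with the proc. */
static void
add_subxid(Global *g, size_t off, uint32_t subxid)
{
	Proc	   *proc = &g->procs[off];
	SubxidStatus *dense = &g->subxidStates[off];

	assert(dense->count == proc->subxidStatus.count);
	assert(dense->overflowed == proc->subxidStatus.overflowed);

	if (proc->subxidStatus.count < MAX_CACHED_SUBXIDS)
	{
		proc->subxids[proc->subxidStatus.count] = subxid;
		proc->subxidStatus.count = dense->count =
			(uint8_t) (proc->subxidStatus.count + 1);
	}
	else
		proc->subxidStatus.overflowed = dense->overflowed = true;
}

/* Scan only the dense array; the large Proc structs are never touched. */
static size_t
count_subxids(const Global *g, bool *any_overflowed)
{
	size_t		total = 0;

	*any_overflowed = false;
	for (size_t off = 0; off < g->numProcs; off++)
	{
		if (g->subxidStates[off].overflowed)
			*any_overflowed = true;
		total += g->subxidStates[off].count;
	}
	return total;
}
```

In this sketch each mirror entry is two bytes, so the common "no
subtransactions anywhere" check reads a handful of cache lines instead of
one or more per backend.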

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/storage/proc.h            |  34 ++++---
 src/backend/access/transam/clog.c     |   7 +-
 src/backend/access/transam/twophase.c |  17 ++--
 src/backend/access/transam/varsup.c   |  15 ++-
 src/backend/storage/ipc/procarray.c   | 128 ++++++++++++++------------
 src/backend/storage/lmgr/proc.c       |  24 +----
 src/tools/pgindent/typedefs.list      |   1 -
 7 files changed, 113 insertions(+), 113 deletions(-)

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ea95cf92402..43aa234709e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -35,6 +35,14 @@
  */
 #define PGPROC_MAX_CACHED_SUBXIDS 64	/* XXX guessed-at value */
 
+typedef struct XidCacheStatus
+{
+	/* number of cached subxids, never more than PGPROC_MAX_CACHED_SUBXIDS */
+	uint8	count;
+	/* has PGPROC->subxids overflowed */
+	bool	overflowed;
+} XidCacheStatus;
+
 struct XidCache
 {
 	TransactionId xids[PGPROC_MAX_CACHED_SUBXIDS];
@@ -187,6 +195,8 @@ struct PGPROC
 	 */
 	SHM_QUEUE	myProcLocks[NUM_LOCK_PARTITIONS];
 
+	XidCacheStatus subxidStatus; /* mirrored with
+								  * ProcGlobal->subxidStates[i] */
 	struct XidCache subxids;	/* cache for subtransaction XIDs */
 
 	/* Support for group XID clearing. */
@@ -235,22 +245,6 @@ struct PGPROC
 
 
 extern PGDLLIMPORT PGPROC *MyProc;
-extern PGDLLIMPORT struct PGXACT *MyPgXact;
-
-/*
- * Prior to PostgreSQL 9.2, the fields below were stored as part of the
- * PGPROC.  However, benchmarking revealed that packing these particular
- * members into a separate array as tightly as possible sped up GetSnapshotData
- * considerably on systems with many CPU cores, by reducing the number of
- * cache lines needing to be fetched.  Thus, think very carefully before adding
- * anything else here.
- */
-typedef struct PGXACT
-{
-	bool		overflowed;
-
-	uint8		nxids;
-} PGXACT;
 
 /*
  * There is one ProcGlobal struct for the whole database cluster.
@@ -310,12 +304,16 @@ typedef struct PROC_HDR
 {
 	/* Array of PGPROC structures (not including dummies for prepared txns) */
 	PGPROC	   *allProcs;
-	/* Array of PGXACT structures (not including dummies for prepared txns) */
-	PGXACT	   *allPgXact;
 
 	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
 	TransactionId *xids;
 
+	/*
+	 * Array mirroring PGPROC.subxidStatus for each PGPROC currently in the
+	 * procarray.
+	 */
+	XidCacheStatus *subxidStates;
+
 	/*
 	 * Array mirroring PGPROC.vacuumFlags for each PGPROC currently in the
 	 * procarray.
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index a4599e96610..65aa8841f7c 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -295,7 +295,7 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 	 */
 	if (all_xact_same_page && xid == MyProc->xid &&
 		nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT &&
-		nsubxids == MyPgXact->nxids &&
+		nsubxids == MyProc->subxidStatus.count &&
 		memcmp(subxids, MyProc->subxids.xids,
 			   nsubxids * sizeof(TransactionId)) == 0)
 	{
@@ -510,16 +510,15 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 	while (nextidx != INVALID_PGPROCNO)
 	{
 		PGPROC	   *proc = &ProcGlobal->allProcs[nextidx];
-		PGXACT	   *pgxact = &ProcGlobal->allPgXact[nextidx];
 
 		/*
 		 * Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
 		 * should not use group XID status update mechanism.
 		 */
-		Assert(pgxact->nxids <= THRESHOLD_SUBTRANS_CLOG_OPT);
+		Assert(proc->subxidStatus.count <= THRESHOLD_SUBTRANS_CLOG_OPT);
 
 		TransactionIdSetPageStatusInternal(proc->clogGroupMemberXid,
-										   pgxact->nxids,
+										   proc->subxidStatus.count,
 										   proc->subxids.xids,
 										   proc->clogGroupMemberXidStatus,
 										   proc->clogGroupMemberLsn,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 744b8a7f393..76465ad2c8b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -21,9 +21,9 @@
  *		GIDs and aborts the transaction if there already is a global
  *		transaction in prepared state with the same GID.
  *
- *		A global transaction (gxact) also has dummy PGXACT and PGPROC; this is
- *		what keeps the XID considered running by TransactionIdIsInProgress.
- *		It is also convenient as a PGPROC to hook the gxact's locks to.
+ *		A global transaction (gxact) also has a dummy PGPROC; this is what keeps
+ *		the XID considered running by TransactionIdIsInProgress.  It is also
+ *		convenient as a PGPROC to hook the gxact's locks to.
  *
  *		Information to recover prepared transactions in case of crash is
  *		now stored in WAL for the common case. In some cases there will be
@@ -447,14 +447,12 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 					TimestampTz prepared_at, Oid owner, Oid databaseid)
 {
 	PGPROC	   *proc;
-	PGXACT	   *pgxact;
 	int			i;
 
 	Assert(LWLockHeldByMeInMode(TwoPhaseStateLock, LW_EXCLUSIVE));
 
 	Assert(gxact != NULL);
 	proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
 	/* Initialize the PGPROC entry */
 	MemSet(proc, 0, sizeof(PGPROC));
@@ -480,8 +478,8 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
 		SHMQueueInit(&(proc->myProcLocks[i]));
 	/* subxid data must be filled later by GXactLoadSubxactData */
-	pgxact->overflowed = false;
-	pgxact->nxids = 0;
+	proc->subxidStatus.count = 0;
+	proc->subxidStatus.overflowed = false;
 
 	gxact->prepared_at = prepared_at;
 	gxact->xid = xid;
@@ -510,19 +508,18 @@ GXactLoadSubxactData(GlobalTransaction gxact, int nsubxacts,
 					 TransactionId *children)
 {
 	PGPROC	   *proc = &ProcGlobal->allProcs[gxact->pgprocno];
-	PGXACT	   *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
 
 	/* We need no extra lock since the GXACT isn't valid yet */
 	if (nsubxacts > PGPROC_MAX_CACHED_SUBXIDS)
 	{
-		pgxact->overflowed = true;
+		proc->subxidStatus.overflowed = true;
 		nsubxacts = PGPROC_MAX_CACHED_SUBXIDS;
 	}
 	if (nsubxacts > 0)
 	{
 		memcpy(proc->subxids.xids, children,
 			   nsubxacts * sizeof(TransactionId));
-		pgxact->nxids = nsubxacts;
+		proc->subxidStatus.count = nsubxacts;
 	}
 }
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 4c91b343ecd..2d2b05be36c 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -222,22 +222,31 @@ GetNewTransactionId(bool isSubXact)
 	 */
 	if (!isSubXact)
 	{
+		Assert(ProcGlobal->subxidStates[MyProc->pgxactoff].count == 0);
+		Assert(!ProcGlobal->subxidStates[MyProc->pgxactoff].overflowed);
+		Assert(MyProc->subxidStatus.count == 0);
+		Assert(!MyProc->subxidStatus.overflowed);
+
 		/* LWLockRelease acts as barrier */
 		MyProc->xid = xid;
 		ProcGlobal->xids[MyProc->pgxactoff] = xid;
 	}
 	else
 	{
-		int			nxids = MyPgXact->nxids;
+		XidCacheStatus *substat = &ProcGlobal->subxidStates[MyProc->pgxactoff];
+		int			nxids = MyProc->subxidStatus.count;
+
+		Assert(substat->count == MyProc->subxidStatus.count);
+		Assert(substat->overflowed == MyProc->subxidStatus.overflowed);
 
 		if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
 		{
 			MyProc->subxids.xids[nxids] = xid;
 			pg_write_barrier();
-			MyPgXact->nxids = nxids + 1;
+			MyProc->subxidStatus.count = substat->count = nxids + 1;
 		}
 		else
-			MyPgXact->overflowed = true;
+			MyProc->subxidStatus.overflowed = substat->overflowed = true;
 	}
 
 	LWLockRelease(XidGenLock);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e77d4e44fb8..8e8049d9715 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4,9 +4,10 @@
  *	  POSTGRES process array code.
  *
  *
- * This module maintains arrays of the PGPROC and PGXACT structures for all
- * active backends.  Although there are several uses for this, the principal
- * one is as a means of determining the set of currently running transactions.
+ * This module maintains arrays of PGPROC substructures, as well as associated
+ * arrays in ProcGlobal, for all active backends.  Although there are several
+ * uses for this, the principal one is as a means of determining the set of
+ * currently running transactions.
  *
  * Because of various subtle race conditions it is critical that a backend
  * hold the correct locks while setting or clearing its xid (in
@@ -85,7 +86,7 @@ typedef struct ProcArrayStruct
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
 	 * prevent overflow; or InvalidTransactionId if none.  We track this for
-	 * similar reasons to tracking overflowing cached subxids in PGXACT
+	 * similar reasons to tracking overflowing cached subxids in PGPROC
 	 * entries.  Must hold exclusive ProcArrayLock to change this, and shared
 	 * lock to read it.
 	 */
@@ -96,7 +97,7 @@ typedef struct ProcArrayStruct
 	/* oldest catalog xmin of any replication slot */
 	TransactionId replication_slot_catalog_xmin;
 
-	/* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
+	/* indexes into allProcs[], has PROCARRAY_MAXPROCS entries */
 	int			pgprocnos[FLEXIBLE_ARRAY_MEMBER];
 } ProcArrayStruct;
 
@@ -239,7 +240,6 @@ typedef struct ComputeXidHorizonsResult
 static ProcArrayStruct *procArray;
 
 static PGPROC *allProcs;
-static PGXACT *allPgXact;
 
 /*
  * Bookkeeping for tracking emulated transactions in recovery
@@ -325,8 +325,7 @@ static int	KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
 static TransactionId KnownAssignedXidsGetOldestXmin(void);
 static void KnownAssignedXidsDisplay(int trace_level);
 static void KnownAssignedXidsReset(void);
-static inline void ProcArrayEndTransactionInternal(PGPROC *proc,
-												   PGXACT *pgxact, TransactionId latestXid);
+static inline void ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid);
 static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
 static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
@@ -411,7 +410,6 @@ CreateSharedProcArray(void)
 	}
 
 	allProcs = ProcGlobal->allProcs;
-	allPgXact = ProcGlobal->allPgXact;
 
 	/* Create or attach to the KnownAssignedXids arrays too, if needed */
 	if (EnableHotStandby)
@@ -476,11 +474,14 @@ ProcArrayAdd(PGPROC *proc)
 			(arrayP->numProcs - index) * sizeof(*arrayP->pgprocnos));
 	memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->xids));
+	memmove(&ProcGlobal->subxidStates[index + 1], &ProcGlobal->subxidStates[index],
+			(arrayP->numProcs - index) * sizeof(*ProcGlobal->subxidStates));
 	memmove(&ProcGlobal->vacuumFlags[index + 1], &ProcGlobal->vacuumFlags[index],
 			(arrayP->numProcs - index) * sizeof(*ProcGlobal->vacuumFlags));
 
 	arrayP->pgprocnos[index] = proc->pgprocno;
 	ProcGlobal->xids[index] = proc->xid;
+	ProcGlobal->subxidStates[index] = proc->subxidStatus;
 	ProcGlobal->vacuumFlags[index] = proc->vacuumFlags;
 
 	arrayP->numProcs++;
@@ -534,6 +535,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		MaintainLatestCompletedXid(latestXid);
 
 		ProcGlobal->xids[proc->pgxactoff] = 0;
+		ProcGlobal->subxidStates[proc->pgxactoff].overflowed = false;
+		ProcGlobal->subxidStates[proc->pgxactoff].count = 0;
 	}
 	else
 	{
@@ -542,6 +545,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	}
 
 	Assert(TransactionIdIsValid(ProcGlobal->xids[proc->pgxactoff] == 0));
+	Assert(ProcGlobal->subxidStates[proc->pgxactoff].count == 0);
+	Assert(!ProcGlobal->subxidStates[proc->pgxactoff].overflowed);
 	ProcGlobal->vacuumFlags[proc->pgxactoff] = 0;
 
 	for (index = 0; index < arrayP->numProcs; index++)
@@ -553,6 +558,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 					(arrayP->numProcs - index - 1) * sizeof(*arrayP->pgprocnos));
 			memmove(&ProcGlobal->xids[index], &ProcGlobal->xids[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->xids));
+			memmove(&ProcGlobal->subxidStates[index], &ProcGlobal->subxidStates[index + 1],
+					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->subxidStates));
 			memmove(&ProcGlobal->vacuumFlags[index], &ProcGlobal->vacuumFlags[index + 1],
 					(arrayP->numProcs - index - 1) * sizeof(*ProcGlobal->vacuumFlags));
 
@@ -598,8 +605,6 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 {
-	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
-
 	if (TransactionIdIsValid(latestXid))
 	{
 		/*
@@ -617,7 +622,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 */
 		if (LWLockConditionalAcquire(ProcArrayLock, LW_EXCLUSIVE))
 		{
-			ProcArrayEndTransactionInternal(proc, pgxact, latestXid);
+			ProcArrayEndTransactionInternal(proc, latestXid);
 			LWLockRelease(ProcArrayLock);
 		}
 		else
@@ -631,15 +636,14 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 * estimate of global xmin, but that's OK.
 		 */
 		Assert(!TransactionIdIsValid(proc->xid));
+		Assert(proc->subxidStatus.count == 0);
+		Assert(!proc->subxidStatus.overflowed);
 
 		proc->lxid = InvalidLocalTransactionId;
 		proc->xmin = InvalidTransactionId;
 		proc->delayChkpt = false;	/* be sure this is cleared in abort */
 		proc->recoveryConflictPending = false;
 
-		Assert(pgxact->nxids == 0);
-		Assert(pgxact->overflowed == false);
-
 		/* must be cleared with xid/xmin: */
 		/* avoid unnecessarily dirtying shared cachelines */
 		if (proc->vacuumFlags & PROC_VACUUM_STATE_MASK)
@@ -660,8 +664,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
  * We don't do any locking here; caller must handle that.
  */
 static inline void
-ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
-								TransactionId latestXid)
+ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
@@ -684,8 +687,15 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
 	}
 
 	/* Clear the subtransaction-XID cache too while holding the lock */
-	pgxact->nxids = 0;
-	pgxact->overflowed = false;
+	Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatus.count &&
+		   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatus.overflowed);
+	if (proc->subxidStatus.count > 0 || proc->subxidStatus.overflowed)
+	{
+		ProcGlobal->subxidStates[pgxactoff].count = 0;
+		ProcGlobal->subxidStates[pgxactoff].overflowed = false;
+		proc->subxidStatus.count = 0;
+		proc->subxidStatus.overflowed = false;
+	}
 
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
@@ -775,9 +785,8 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	while (nextidx != INVALID_PGPROCNO)
 	{
 		PGPROC	   *proc = &allProcs[nextidx];
-		PGXACT	   *pgxact = &allPgXact[nextidx];
 
-		ProcArrayEndTransactionInternal(proc, pgxact, proc->procArrayGroupMemberXid);
+		ProcArrayEndTransactionInternal(proc, proc->procArrayGroupMemberXid);
 
 		/* Move to next proc in list. */
 		nextidx = pg_atomic_read_u32(&proc->procArrayGroupNext);
@@ -821,7 +830,6 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 void
 ProcArrayClearTransaction(PGPROC *proc)
 {
-	PGXACT	   *pgxact = &allPgXact[proc->pgprocno];
 	size_t		pgxactoff;
 
 	/*
@@ -846,8 +854,15 @@ ProcArrayClearTransaction(PGPROC *proc)
 	Assert(!proc->delayChkpt);
 
 	/* Clear the subtransaction-XID cache too */
-	pgxact->nxids = 0;
-	pgxact->overflowed = false;
+	Assert(ProcGlobal->subxidStates[pgxactoff].count == proc->subxidStatus.count &&
+		   ProcGlobal->subxidStates[pgxactoff].overflowed == proc->subxidStatus.overflowed);
+	if (proc->subxidStatus.count > 0 || proc->subxidStatus.overflowed)
+	{
+		ProcGlobal->subxidStates[pgxactoff].count = 0;
+		ProcGlobal->subxidStates[pgxactoff].overflowed = false;
+		proc->subxidStatus.count = 0;
+		proc->subxidStatus.overflowed = false;
+	}
 
 	LWLockRelease(ProcArrayLock);
 }
@@ -1268,6 +1283,7 @@ TransactionIdIsInProgress(TransactionId xid)
 {
 	static TransactionId *xids = NULL;
 	static TransactionId *other_xids;
+	XidCacheStatus *other_subxidstates;
 	int			nxids = 0;
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId topxid;
@@ -1330,6 +1346,7 @@ TransactionIdIsInProgress(TransactionId xid)
 	}
 
 	other_xids = ProcGlobal->xids;
+	other_subxidstates = ProcGlobal->subxidStates;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -1352,7 +1369,6 @@ TransactionIdIsInProgress(TransactionId xid)
 	for (size_t pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
 	{
 		int			pgprocno;
-		PGXACT	   *pgxact;
 		PGPROC	   *proc;
 		TransactionId pxid;
 		int			pxids;
@@ -1387,9 +1403,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		/*
 		 * Step 2: check the cached child-Xids arrays
 		 */
-		pgprocno = arrayP->pgprocnos[pgxactoff];
-		pgxact = &allPgXact[pgprocno];
-		pxids = pgxact->nxids;
+		pxids = other_subxidstates[pgxactoff].count;
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
 		pgprocno = arrayP->pgprocnos[pgxactoff];
 		proc = &allProcs[pgprocno];
@@ -1413,7 +1427,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		 * we hold ProcArrayLock.  So we can't miss an Xid that we need to
 		 * worry about.)
 		 */
-		if (pgxact->overflowed)
+		if (other_subxidstates[pgxactoff].overflowed)
 			xids[nxids++] = pxid;
 	}
 
@@ -2017,6 +2031,7 @@ GetSnapshotData(Snapshot snapshot)
 		size_t		numProcs = arrayP->numProcs;
 		TransactionId *xip = snapshot->xip;
 		int		   *pgprocnos = arrayP->pgprocnos;
+		XidCacheStatus *subxidStates = ProcGlobal->subxidStates;
 		uint8	   *allVacuumFlags = ProcGlobal->vacuumFlags;
 
 		/*
@@ -2093,17 +2108,16 @@ GetSnapshotData(Snapshot snapshot)
 			 */
 			if (!suboverflowed)
 			{
-				int			pgprocno = pgprocnos[pgxactoff];
-				PGXACT	   *pgxact = &allPgXact[pgprocno];
 
-				if (pgxact->overflowed)
+				if (subxidStates[pgxactoff].overflowed)
 					suboverflowed = true;
 				else
 				{
-					int			nsubxids = pgxact->nxids;
+					int			nsubxids = subxidStates[pgxactoff].count;
 
 					if (nsubxids > 0)
 					{
+						int			pgprocno = pgprocnos[pgxactoff];
 						PGPROC	   *proc = &allProcs[pgprocno];
 
 						pg_read_barrier();	/* pairs with GetNewTransactionId */
@@ -2496,8 +2510,6 @@ GetRunningTransactionData(void)
 	 */
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		int			pgprocno = arrayP->pgprocnos[index];
-		PGXACT	   *pgxact = &allPgXact[pgprocno];
 		TransactionId xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -2518,7 +2530,7 @@ GetRunningTransactionData(void)
 		if (TransactionIdPrecedes(xid, oldestRunningXid))
 			oldestRunningXid = xid;
 
-		if (pgxact->overflowed)
+		if (ProcGlobal->subxidStates[index].overflowed)
 			suboverflowed = true;
 
 		/*
@@ -2538,27 +2550,28 @@ GetRunningTransactionData(void)
 	 */
 	if (!suboverflowed)
 	{
+		XidCacheStatus *other_subxidstates = ProcGlobal->subxidStates;
+
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
 			PGPROC	   *proc = &allProcs[pgprocno];
-			PGXACT	   *pgxact = &allPgXact[pgprocno];
-			int			nxids;
+			int			nsubxids;
 
 			/*
 			 * Save subtransaction XIDs. Other backends can't add or remove
 			 * entries while we're holding XidGenLock.
 			 */
-			nxids = pgxact->nxids;
-			if (nxids > 0)
+			nsubxids = other_subxidstates[index].count;
+			if (nsubxids > 0)
 			{
 				/* barrier not really required, as XidGenLock is held, but ... */
 				pg_read_barrier();	/* pairs with GetNewTransactionId */
 
 				memcpy(&xids[count], (void *) proc->subxids.xids,
-					   nxids * sizeof(TransactionId));
-				count += nxids;
-				subcount += nxids;
+					   nsubxids * sizeof(TransactionId));
+				count += nsubxids;
+				subcount += nsubxids;
 
 				/*
 				 * Top-level XID of a transaction is always less than any of
@@ -3625,14 +3638,6 @@ ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
 	LWLockRelease(ProcArrayLock);
 }
 
-
-#define XidCacheRemove(i) \
-	do { \
-		MyProc->subxids.xids[i] = MyProc->subxids.xids[MyPgXact->nxids - 1]; \
-		pg_write_barrier(); \
-		MyPgXact->nxids--; \
-	} while (0)
-
 /*
  * XidCacheRemoveRunningXids
  *
@@ -3648,6 +3653,7 @@ XidCacheRemoveRunningXids(TransactionId xid,
 {
 	int			i,
 				j;
+	XidCacheStatus *mysubxidstat;
 
 	Assert(TransactionIdIsValid(xid));
 
@@ -3665,6 +3671,8 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
+	mysubxidstat = &ProcGlobal->subxidStates[MyProc->pgxactoff];
+
 	/*
 	 * Under normal circumstances xid and xids[] will be in increasing order,
 	 * as will be the entries in subxids.  Scan backwards to avoid O(N^2)
@@ -3674,11 +3682,14 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	{
 		TransactionId anxid = xids[i];
 
-		for (j = MyPgXact->nxids - 1; j >= 0; j--)
+		for (j = MyProc->subxidStatus.count - 1; j >= 0; j--)
 		{
 			if (TransactionIdEquals(MyProc->subxids.xids[j], anxid))
 			{
-				XidCacheRemove(j);
+				MyProc->subxids.xids[j] = MyProc->subxids.xids[MyProc->subxidStatus.count - 1];
+				pg_write_barrier();
+				mysubxidstat->count--;
+				MyProc->subxidStatus.count--;
 				break;
 			}
 		}
@@ -3690,20 +3701,23 @@ XidCacheRemoveRunningXids(TransactionId xid,
 		 * error during AbortSubTransaction.  So instead of Assert, emit a
 		 * debug warning.
 		 */
-		if (j < 0 && !MyPgXact->overflowed)
+		if (j < 0 && !MyProc->subxidStatus.overflowed)
 			elog(WARNING, "did not find subXID %u in MyProc", anxid);
 	}
 
-	for (j = MyPgXact->nxids - 1; j >= 0; j--)
+	for (j = MyProc->subxidStatus.count - 1; j >= 0; j--)
 	{
 		if (TransactionIdEquals(MyProc->subxids.xids[j], xid))
 		{
-			XidCacheRemove(j);
+			MyProc->subxids.xids[j] = MyProc->subxids.xids[MyProc->subxidStatus.count - 1];
+			pg_write_barrier();
+			mysubxidstat->count--;
+			MyProc->subxidStatus.count--;
 			break;
 		}
 	}
 	/* Ordinarily we should have found it, unless the cache has overflowed */
-	if (j < 0 && !MyPgXact->overflowed)
+	if (j < 0 && !MyProc->subxidStatus.overflowed)
 		elog(WARNING, "did not find subXID %u in MyProc", xid);
 
 	/* Also advance global latestCompletedXid while holding the lock */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f6113b2d243..aa9fbd80545 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -63,9 +63,8 @@ int			LockTimeout = 0;
 int			IdleInTransactionSessionTimeout = 0;
 bool		log_lock_waits = false;
 
-/* Pointer to this process's PGPROC and PGXACT structs, if any */
+/* Pointer to this process's PGPROC struct, if any */
 PGPROC	   *MyProc = NULL;
-PGXACT	   *MyPgXact = NULL;
 
 /*
  * This spinlock protects the freelist of recycled PGPROC structures.
@@ -110,10 +109,8 @@ ProcGlobalShmemSize(void)
 	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, sizeof(slock_t));
 
-	size = add_size(size, mul_size(MaxBackends, sizeof(PGXACT)));
-	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
-	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
+	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->vacuumFlags)));
 
 	return size;
@@ -161,7 +158,6 @@ void
 InitProcGlobal(void)
 {
 	PGPROC	   *procs;
-	PGXACT	   *pgxacts;
 	int			i,
 				j;
 	bool		found;
@@ -202,18 +198,6 @@ InitProcGlobal(void)
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
 	ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
 
-	/*
-	 * Also allocate a separate array of PGXACT structures.  This is separate
-	 * from the main PGPROC array so that the most heavily accessed data is
-	 * stored contiguously in memory in as few cache lines as possible. This
-	 * provides significant performance benefits, especially on a
-	 * multiprocessor system.  There is one PGXACT structure for every PGPROC
-	 * structure.
-	 */
-	pgxacts = (PGXACT *) ShmemAlloc(TotalProcs * sizeof(PGXACT));
-	MemSet(pgxacts, 0, TotalProcs * sizeof(PGXACT));
-	ProcGlobal->allPgXact = pgxacts;
-
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
@@ -224,6 +208,8 @@ InitProcGlobal(void)
 	ProcGlobal->xids =
 		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
 	ProcGlobal->vacuumFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 	MemSet(ProcGlobal->vacuumFlags, 0, TotalProcs * sizeof(*ProcGlobal->vacuumFlags));
 
@@ -372,7 +358,6 @@ InitProcess(void)
 				(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 				 errmsg("sorry, too many clients already")));
 	}
-	MyPgXact = &ProcGlobal->allPgXact[MyProc->pgprocno];
 
 	/*
 	 * Cross-check that the PGPROC is of the type we expect; if this were not
@@ -569,7 +554,6 @@ InitAuxiliaryProcess(void)
 	((volatile PGPROC *) auxproc)->pid = MyProcPid;
 
 	MyProc = auxproc;
-	MyPgXact = &ProcGlobal->allPgXact[auxproc->pgprocno];
 
 	SpinLockRelease(ProcStructLock);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b4948ac675f..3d990463ce9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1536,7 +1536,6 @@ PGSetenvStatusType
 PGShmemHeader
 PGTransactionStatusType
 PGVerbosity
-PGXACT
 PG_Locale_Strategy
 PG_Lock_Status
 PG_init_t
-- 
2.25.0.114.g5b0ca878e0
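
The XidCacheRemoveRunningXids() hunk in the patch above open-codes the old
XidCacheRemove macro: the matching entry is overwritten with the last
entry, and then both copies of the count are decremented. A rough,
single-threaded sketch of that pattern follows. XidCacheSketch,
mirror_count and cache_remove are made-up names, and the write barrier the
real code issues is only marked by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct XidCacheSketch
{
	uint32_t	xids[64];
	uint8_t		count;			/* backend-local count */
	uint8_t		mirror_count;	/* stands in for the dense ProcGlobal copy */
} XidCacheSketch;

/* Remove xid from the cache; returns false if it was not found. */
static bool
cache_remove(XidCacheSketch *c, uint32_t xid)
{
	assert(c->count == c->mirror_count);

	/* Scan backwards, as recently added subxids are the likeliest match. */
	for (int j = (int) c->count - 1; j >= 0; j--)
	{
		if (c->xids[j] == xid)
		{
			/* Fill the hole with the last entry, then shrink the array. */
			c->xids[j] = c->xids[c->count - 1];
			/* the real code issues pg_write_barrier() here */
			c->mirror_count--;
			c->count--;
			return true;
		}
	}
	return false;
}
```

The removal does not preserve order, which is fine as long as readers only
care about membership in the cache.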

v13-0006-snapshot-scalability-cache-snapshots-using-a-xac.patch (text/x-diff; charset=us-ascii)
From 5e202f7005878b242687495c57f21a34d9dc3aea Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 15 Jul 2020 15:35:07 -0700
Subject: [PATCH v13 6/6] snapshot scalability: cache snapshots using a xact
 completion counter.

Previous commits made it faster/more scalable to compute snapshots. But not
building a snapshot is still faster. Now that GetSnapshotData() does not
maintain RecentGlobal* anymore, that is actually not too hard:

This commit introduces xactCompletionCount, which tracks the number of
top-level transactions with xids (i.e. which may have modified the database)
that completed in some form since the start of the server.

We can avoid rebuilding the snapshot's contents whenever the current
xactCompletionCount is the same as it was when the snapshot was
originally built.  Currently this check happens while holding
ProcArrayLock. While it's likely possible to perform the check before
acquiring ProcArrayLock, it's too complicated for now.
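
To illustrate the idea, here is a small sketch under simplifying
assumptions (no locking, a single process). SharedState, Snap,
shared_init, xact_completed, get_snapshot and the rebuilds field are
hypothetical names. Only xactCompletionCount and snapXactCompletionCount
correspond to fields added by the patch. The counter starts at 1 and
snapshots start at 0, presumably so that a never-built snapshot cannot
match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct SharedState
{
	/* bumped whenever a top-level transaction with an xid completes */
	uint64_t	xactCompletionCount;
} SharedState;

typedef struct Snap
{
	uint64_t	snapXactCompletionCount;	/* 0 means "never built" */
	unsigned	rebuilds;		/* illustration only: rebuild counter */
} Snap;

static void
shared_init(SharedState *s)
{
	s->xactCompletionCount = 1;
}

static void
xact_completed(SharedState *s)
{
	s->xactCompletionCount++;
}

/* Returns true if the existing snapshot contents could be reused. */
static bool
get_snapshot(const SharedState *s, Snap *snap)
{
	assert(s->xactCompletionCount >= 1);

	if (snap->snapXactCompletionCount == s->xactCompletionCount)
		return true;			/* nothing completed since last build */

	/* The expensive scan over all backends would happen here. */
	snap->rebuilds++;
	snap->snapXactCompletionCount = s->xactCompletionCount;
	return false;
}
```

In the patch the comparison is made while holding ProcArrayLock. The
sketch leaves the lock out and only shows the counter comparison.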

Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
---
 src/include/access/transam.h                |   9 ++
 src/include/utils/snapshot.h                |   7 ++
 src/backend/replication/logical/snapbuild.c |   1 +
 src/backend/storage/ipc/procarray.c         | 125 ++++++++++++++++----
 src/backend/utils/time/snapmgr.c            |   4 +
 5 files changed, 126 insertions(+), 20 deletions(-)

diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index b32044153b0..2f1f144db4d 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -231,6 +231,15 @@ typedef struct VariableCacheData
 	FullTransactionId latestCompletedXid;	/* newest full XID that has
 											 * committed or aborted */
 
+	/*
+	 * Number of top-level transactions with xids (i.e. which may have
+	 * modified the database) that completed in some form since the start of
+	 * the server. This currently is solely used to check whether
+	 * GetSnapshotData() needs to recompute the contents of the snapshot, or
+	 * not. There are likely other users of this.  Always at least 1.
+	 */
+	uint64 xactCompletionCount;
+
 	/*
 	 * These fields are protected by XactTruncationLock
 	 */
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 35b1f05bea6..dea072e5edf 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -207,6 +207,13 @@ typedef struct SnapshotData
 
 	TimestampTz whenTaken;		/* timestamp when snapshot was taken */
 	XLogRecPtr	lsn;			/* position in the WAL stream when taken */
+
+	/*
+	 * The transaction completion count at the time GetSnapshotData() built
+	 * this snapshot. Allows avoiding re-computing static snapshots when no
+	 * transactions completed since the last GetSnapshotData().
+	 */
+	uint64		snapXactCompletionCount;
 } SnapshotData;
 
 #endif							/* SNAPSHOT_H */
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e9701ea7221..9d5d68f3fa7 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -524,6 +524,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->curcid = FirstCommandId;
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
+	snapshot->snapXactCompletionCount = 0;
 
 	return snapshot;
 }
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8e8049d9715..ac62343c1bb 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -407,6 +407,7 @@ CreateSharedProcArray(void)
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+		ShmemVariableCache->xactCompletionCount = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
@@ -534,6 +535,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		/* Advance global latestCompletedXid while holding the lock */
 		MaintainLatestCompletedXid(latestXid);
 
+		/* Same with xactCompletionCount  */
+		ShmemVariableCache->xactCompletionCount++;
+
 		ProcGlobal->xids[proc->pgxactoff] = 0;
 		ProcGlobal->subxidStates[proc->pgxactoff].overflowed = false;
 		ProcGlobal->subxidStates[proc->pgxactoff].count = 0;
@@ -668,6 +672,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
+	Assert(LWLockHeldByMe(ProcArrayLock));
 	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
 	Assert(ProcGlobal->xids[pgxactoff] == proc->xid);
 
@@ -699,6 +704,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
+
+	/* Same with xactCompletionCount  */
+	ShmemVariableCache->xactCompletionCount++;
 }
 
 /*
@@ -1913,6 +1921,93 @@ GetMaxSnapshotSubxidCount(void)
 	return TOTAL_MAX_CACHED_SUBXIDS;
 }
 
+/*
+ * Initialize old_snapshot_threshold specific parts of a newly built snapshot.
+ */
+static void
+GetSnapshotDataInitOldSnapshot(Snapshot snapshot)
+{
+	if (!OldSnapshotThresholdActive())
+	{
+		/*
+		 * If not using "snapshot too old" feature, fill related fields with
+		 * dummy values that don't require any locking.
+		 */
+		snapshot->lsn = InvalidXLogRecPtr;
+		snapshot->whenTaken = 0;
+	}
+	else
+	{
+		/*
+		 * Capture the current time and WAL stream location in case this
+		 * snapshot becomes old enough to need to fall back on the special
+		 * "old snapshot" logic.
+		 */
+		snapshot->lsn = GetXLogInsertRecPtr();
+		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
+		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, snapshot->xmin);
+	}
+}
+
+/*
+ * Helper function for GetSnapshotData() that checks if the bulk of the
+ * visibility information in the snapshot is still valid. If so, it updates
+ * the fields that need to change and returns true. Otherwise it returns
+ * false.
+ *
+ * This very likely can be evolved to not need ProcArrayLock held (at very
+ * least in the case we already hold a snapshot), but that's for another day.
+ */
+static bool
+GetSnapshotDataReuse(Snapshot snapshot)
+{
+	uint64 curXactCompletionCount;
+
+	Assert(LWLockHeldByMe(ProcArrayLock));
+
+	if (unlikely(snapshot->snapXactCompletionCount == 0))
+		return false;
+
+	curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
+	if (curXactCompletionCount != snapshot->snapXactCompletionCount)
+		return false;
+
+	/*
+	 * If the current xactCompletionCount is still the same as it was at the
+	 * time the snapshot was built, we can be sure that rebuilding the
+	 * contents of the snapshot the hard way would result in the same snapshot
+	 * contents:
+	 *
+	 * As explained in transam/README, the set of xids considered running by
+	 * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
+	 * contents only depend on transactions with xids and xactCompletionCount
+	 * is incremented whenever a transaction with an xid finishes (while
+	 * holding ProcArrayLock exclusively). Thus the xactCompletionCount check
+	 * ensures we would detect if the snapshot would have changed.
+	 *
+	 * As the snapshot contents are the same as they were before, it is safe
+	 * to re-enter the snapshot's xmin into the PGPROC array. None of the rows
+	 * visible under the snapshot could already have been removed (that'd
+	 * require the set of running transactions to change) and it fulfills the
+	 * requirement that concurrent GetSnapshotData() calls yield the same
+	 * xmin.
+	 */
+	if (!TransactionIdIsValid(MyProc->xmin))
+		MyProc->xmin = TransactionXmin = snapshot->xmin;
+
+	RecentXmin = snapshot->xmin;
+	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
+
+	snapshot->curcid = GetCurrentCommandId(false);
+	snapshot->active_count = 0;
+	snapshot->regd_count = 0;
+	snapshot->copied = false;
+
+	GetSnapshotDataInitOldSnapshot(snapshot);
+
+	return true;
+}
+
 /*
  * GetSnapshotData -- returns information about running transactions.
  *
@@ -1961,6 +2056,7 @@ GetSnapshotData(Snapshot snapshot)
 	TransactionId oldestxid;
 	int			mypgxactoff;
 	TransactionId myxid;
+	uint64		curXactCompletionCount;
 
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
@@ -2005,12 +2101,19 @@ GetSnapshotData(Snapshot snapshot)
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
+	if (GetSnapshotDataReuse(snapshot))
+	{
+		LWLockRelease(ProcArrayLock);
+		return snapshot;
+	}
+
 	latest_completed = ShmemVariableCache->latestCompletedXid;
 	mypgxactoff = MyProc->pgxactoff;
 	myxid = other_xids[mypgxactoff];
 	Assert(myxid == MyProc->xid);
 
 	oldestxid = ShmemVariableCache->oldestXid;
+	curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
 
 	/* xmax is always latestCompletedXid + 1 */
 	xmax = XidFromFullTransactionId(latest_completed);
@@ -2264,6 +2367,7 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->xcnt = count;
 	snapshot->subxcnt = subcount;
 	snapshot->suboverflowed = suboverflowed;
+	snapshot->snapXactCompletionCount = curXactCompletionCount;
 
 	snapshot->curcid = GetCurrentCommandId(false);
 
@@ -2275,26 +2379,7 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
 
-	if (old_snapshot_threshold < 0)
-	{
-		/*
-		 * If not using "snapshot too old" feature, fill related fields with
-		 * dummy values that don't require any locking.
-		 */
-		snapshot->lsn = InvalidXLogRecPtr;
-		snapshot->whenTaken = 0;
-	}
-	else
-	{
-		/*
-		 * Capture the current time and WAL stream location in case this
-		 * snapshot becomes old enough to need to fall back on the special
-		 * "old snapshot" logic.
-		 */
-		snapshot->lsn = GetXLogInsertRecPtr();
-		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
-		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
-	}
+	GetSnapshotDataInitOldSnapshot(snapshot);
 
 	return snapshot;
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 689a3b6a597..09ea03c2063 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -595,6 +595,8 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
 	/* NB: curcid should NOT be copied, it's a local matter */
 
+	CurrentSnapshot->snapXactCompletionCount = 0;
+
 	/*
 	 * Now we have to fix what GetSnapshotData did with MyProc->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
@@ -670,6 +672,7 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->snapXactCompletionCount = 0;
 
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
@@ -2207,6 +2210,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->snapXactCompletionCount = 0;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
-- 
2.25.0.114.g5b0ca878e0
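
For readers following along, the reuse check in the patch above boils down to a generation counter: every completion of a transaction with an xid bumps a shared counter, and a snapshot remembers the counter value it was built at. What follows is only a minimal standalone sketch of that idea, not the patch itself. ToyShared, ToySnapshot and the toy_* functions are made-up names, the single oldest_running_xid field stands in for the real snapshot contents, and locking is left out entirely (the real code does all of this with ProcArrayLock held).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared state; not PostgreSQL's structs. */
typedef struct ToyShared
{
	uint64_t	completion_count;	/* starts at 1; 0 is reserved for "invalid" */
	uint32_t	oldest_running_xid; /* stand-in for the expensive-to-compute state */
} ToyShared;

typedef struct ToySnapshot
{
	uint32_t	xmin;
	uint64_t	built_at_count;		/* 0 forces a rebuild */
	unsigned	rebuilds;			/* how often the slow path ran */
} ToySnapshot;

static void
toy_shared_init(ToyShared *shared, uint32_t oldest_running_xid)
{
	shared->completion_count = 1;
	shared->oldest_running_xid = oldest_running_xid;
}

/* A transaction that had an xid finished: the visible state changed. */
static void
toy_transaction_completed(ToyShared *shared, uint32_t new_oldest_running_xid)
{
	shared->oldest_running_xid = new_oldest_running_xid;
	shared->completion_count++;
}

/* Returns true if the cached contents were reused, false if rebuilt. */
static bool
toy_get_snapshot(const ToyShared *shared, ToySnapshot *snap)
{
	if (snap->built_at_count != 0 &&
		snap->built_at_count == shared->completion_count)
		return true;			/* nothing completed since it was built */

	/* Slow path: recompute and remember the counter value we saw. */
	snap->xmin = shared->oldest_running_xid;
	snap->built_at_count = shared->completion_count;
	snap->rebuilds++;
	return false;
}

/* Copies and imported snapshots must never take the fast path. */
static void
toy_invalidate_snapshot(ToySnapshot *snap)
{
	snap->built_at_count = 0;
}
```

Calling toy_get_snapshot() twice with no completion in between takes the fast path the second time; after toy_transaction_completed() or toy_invalidate_snapshot() the next call rebuilds. Resetting built_at_count to 0 corresponds to what the patch does with snapXactCompletionCount for copied, restored and imported snapshots.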

#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#69)
Re: Improving connection scalability: GetSnapshotData()

We have two essentially identical buildfarm failures since these patches
went in:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly&dt=2020-08-15%2011%3A27%3A32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2020-08-15%2003%3A09%3A14

They're both in the same place in the freeze-the-dead isolation test:

TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", File: "heapam.c", Line: 6051)
0x9613eb <ExceptionalCondition+0x5b> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x52d586 <heap_prepare_freeze_tuple+0x926> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x53bc7e <heap_vacuum_rel+0x100e> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x6949bb <vacuum_rel+0x25b> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x694532 <vacuum+0x602> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x693d1c <ExecVacuum+0x37c> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x8324b3
...
2020-08-14 22:16:41.783 CDT [78410:4] LOG: server process (PID 80395) was terminated by signal 6: Abort trap
2020-08-14 22:16:41.783 CDT [78410:5] DETAIL: Failed process was running: VACUUM FREEZE tab_freeze;

peripatus has successes since this failure, so it's not fully reproducible
on that machine. I'm suspicious of a timing problem in computing vacuum's
cutoff_xid.

(I'm also wondering why the failing check is an Assert rather than a real
test-and-elog. Assert doesn't seem like an appropriate way to check for
plausible data corruption cases.)
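
As background on the failing check: xids are 32-bit counters that wrap around, so "precedes" is a circular comparison rather than a plain `<`. The sketch below shows that comparison, plus a check that reports the problem at run time instead of relying on an assertion, which is the kind of test-and-report alternative raised above. All names (ToyXid, toy_xid_precedes, toy_check_member_xid) are hypothetical, the message text is invented, and the comparison only mirrors my understanding of how PostgreSQL compares normal xids; it is not the heapam.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t ToyXid;

/* Values below this are special (invalid, bootstrap, frozen), not normal xids. */
#define TOY_FIRST_NORMAL_XID ((ToyXid) 3)

/* Circular "a is older than b" for wrapping 32-bit xids. */
static bool
toy_xid_precedes(ToyXid a, ToyXid b)
{
	/* Special xids are compared as plain unsigned numbers. */
	if (a < TOY_FIRST_NORMAL_XID || b < TOY_FIRST_NORMAL_XID)
		return a < b;

	/* Normal xids: the sign of the 32-bit difference decides. */
	return (int32_t) (a - b) < 0;
}

/*
 * Run-time variant of the check: it still fires in builds where assertions
 * are compiled out, and lets the caller decide how to fail.
 */
static bool
toy_check_member_xid(ToyXid member_xid, ToyXid cutoff_xid)
{
	if (toy_xid_precedes(member_xid, cutoff_xid))
	{
		fprintf(stderr, "found member xid %u from before cutoff %u\n",
				(unsigned) member_xid, (unsigned) cutoff_xid);
		return false;
	}
	return true;
}
```

The point of the second function is only that an unexpected xid ends up as a reportable error rather than an abort of the whole server process.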

regards, tom lane

#71Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#70)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-08-15 11:10:51 -0400, Tom Lane wrote:

We have two essentially identical buildfarm failures since these patches
went in:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly&dt=2020-08-15%2011%3A27%3A32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2020-08-15%2003%3A09%3A14

They're both in the same place in the freeze-the-dead isolation test:

TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", File: "heapam.c", Line: 6051)
0x9613eb <ExceptionalCondition+0x5b> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x52d586 <heap_prepare_freeze_tuple+0x926> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x53bc7e <heap_vacuum_rel+0x100e> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x6949bb <vacuum_rel+0x25b> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x694532 <vacuum+0x602> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x693d1c <ExecVacuum+0x37c> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x8324b3
...
2020-08-14 22:16:41.783 CDT [78410:4] LOG: server process (PID 80395) was terminated by signal 6: Abort trap
2020-08-14 22:16:41.783 CDT [78410:5] DETAIL: Failed process was running: VACUUM FREEZE tab_freeze;

peripatus has successes since this failure, so it's not fully reproducible
on that machine. I'm suspicious of a timing problem in computing vacuum's
cutoff_xid.

Hm, maybe it's something around what I observed in
/messages/by-id/20200723181018.neey2jd3u7rfrfrn@alap3.anarazel.de

I.e. that somehow we end up with hot pruning and freezing coming to a
different determination, and trying to freeze a hot tuple.

I'll try to add a few additional asserts here, and burn some CPU on tests
trying to trigger the issue.

I gotta escape the heat in the house for a few hours though (no AC
here), so I'll not look at the results till later this afternoon, unless
it triggers soon.

(I'm also wondering why the failing check is an Assert rather than a real
test-and-elog. Assert doesn't seem like an appropriate way to check for
plausible data corruption cases.)

Robert, and to a lesser degree you, gave me quite a bit of grief over
converting nearby asserts to elogs. I agree it'd be better if it were
an elog, but ...

Greetings,

Andres Freund

#72Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#71)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-08-15 09:42:00 -0700, Andres Freund wrote:

On 2020-08-15 11:10:51 -0400, Tom Lane wrote:

We have two essentially identical buildfarm failures since these patches
went in:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly&dt=2020-08-15%2011%3A27%3A32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2020-08-15%2003%3A09%3A14

They're both in the same place in the freeze-the-dead isolation test:

TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", File: "heapam.c", Line: 6051)
0x9613eb <ExceptionalCondition+0x5b> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x52d586 <heap_prepare_freeze_tuple+0x926> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x53bc7e <heap_vacuum_rel+0x100e> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x6949bb <vacuum_rel+0x25b> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x694532 <vacuum+0x602> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x693d1c <ExecVacuum+0x37c> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x8324b3
...
2020-08-14 22:16:41.783 CDT [78410:4] LOG: server process (PID 80395) was terminated by signal 6: Abort trap
2020-08-14 22:16:41.783 CDT [78410:5] DETAIL: Failed process was running: VACUUM FREEZE tab_freeze;

peripatus has successes since this failure, so it's not fully reproducible
on that machine. I'm suspicious of a timing problem in computing vacuum's
cutoff_xid.

Hm, maybe it's something around what I observed in
/messages/by-id/20200723181018.neey2jd3u7rfrfrn@alap3.anarazel.de

I.e. that somehow we end up with hot pruning and freezing coming to a
different determination, and trying to freeze a hot tuple.

I'll try to add a few additional asserts here, and burn some cpu tests
trying to trigger the issue.

I gotta escape the heat in the house for a few hours though (no AC
here), so I'll not look at the results till later this afternoon, unless
it triggers soon.

690 successful runs later, it didn't trigger for me :(. Seems pretty
clear that there's another variable than pure chance, otherwise it seems
like that number of runs should have hit the issue, given the number of
bf hits vs bf runs.

My current plan is to push a bit of additional instrumentation to
help narrow down the issue. We can afterwards decide what of that we'd
like to keep longer term, and what not.

Greetings,

Andres Freund

#73Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#72)
Re: Improving connection scalability: GetSnapshotData()

Andres Freund <andres@anarazel.de> writes:

690 successful runs later, it didn't trigger for me :(. Seems pretty
clear that there's another variable than pure chance, otherwise it seems
like that number of runs should have hit the issue, given the number of
bf hits vs bf runs.

It seems entirely likely that there's a timing component in this, for
instance autovacuum coming along at just the right time. It's not too
surprising that some machines would be more prone to show that than
others. (Note peripatus is FreeBSD, which we've already learned has
significantly different kernel scheduler behavior than Linux.)

My current plan is to push a bit of additional instrumentation to
help narrow down the issue.

+1

regards, tom lane

#74Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#73)
Re: Improving connection scalability: GetSnapshotData()

On 2020-08-16 14:30:24 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

690 successful runs later, it didn't trigger for me :(. Seems pretty
clear that there's another variable than pure chance, otherwise it seems
like that number of runs should have hit the issue, given the number of
bf hits vs bf runs.

It seems entirely likely that there's a timing component in this, for
instance autovacuum coming along at just the right time. It's not too
surprising that some machines would be more prone to show that than
others. (Note peripatus is FreeBSD, which we've already learned has
significantly different kernel scheduler behavior than Linux.)

Yea. Interestingly there was a reproduction on linux since the initial
reports you'd dug up:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=butterflyfish&dt=2020-08-15%2019%3A54%3A53

but that's likely a virtualized environment, so I guess the host
scheduler behaviour could play a similar role.

I'll run a few iterations with rr's chaos mode too, which tries to
randomize scheduling decisions...

I noticed that it's quite hard to actually hit the hot tuple path I
mentioned earlier on my machine. Would probably be good to have a test
hitting it more reliably. But I'm not immediately seeing how we could
force the necessary serialization.

Greetings,

Andres Freund

#75Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#73)
1 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

I wrote:

It seems entirely likely that there's a timing component in this, for
instance autovacuum coming along at just the right time.

D'oh. The attached seems to make it 100% reproducible.

regards, tom lane

Attachments:

add-delay-before-vacuum.patch (text/x-diff; charset=us-ascii)
diff --git a/src/test/isolation/specs/freeze-the-dead.spec b/src/test/isolation/specs/freeze-the-dead.spec
index 915bf15b92..4100d9fc6f 100644
--- a/src/test/isolation/specs/freeze-the-dead.spec
+++ b/src/test/isolation/specs/freeze-the-dead.spec
@@ -32,6 +32,7 @@ session "s2"
 step "s2_begin"		{ BEGIN; }
 step "s2_key_share"	{ SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE; }
 step "s2_commit"	{ COMMIT; }
+step "s2_wait"		{ select pg_sleep(60); }
 step "s2_vacuum"	{ VACUUM FREEZE tab_freeze; }
 
 session "s3"
@@ -49,6 +50,7 @@ permutation "s1_begin" "s2_begin" "s3_begin" # start transactions
    "s1_update" "s2_key_share" "s3_key_share" # have xmax be a multi with an updater, updater being oldest xid
    "s1_update" # create additional row version that has multis
    "s1_commit" "s2_commit" # commit both updater and share locker
+   "s2_wait"
    "s2_vacuum" # due to bug in freezing logic, we used to *not* prune updated row, and then froze it
    "s1_selectone" # if hot chain is broken, the row can't be found via index scan
    "s3_commit" # commit remaining open xact
#76Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#75)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-08-16 16:17:23 -0400, Tom Lane wrote:

I wrote:

It seems entirely likely that there's a timing component in this, for
instance autovacuum coming along at just the right time.

D'oh. The attached seems to make it 100% reproducible.

Great! Interestingly, it didn't work as the first item on the schedule,
where I had duplicated it to out of impatience. I guess there might
be some need for concurrent autovacuum activity or something like that.

I now luckily have a rr trace of the problem, so I hope I can narrow it
down to the original problem fairly quickly.

Thanks,

Andres

#77Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#76)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-08-16 13:31:53 -0700, Andres Freund wrote:

I now luckily have a rr trace of the problem, so I hope I can narrow it
down to the original problem fairly quickly.

Gna, I think I see the problem. In at least one place I wrongly
accessed the 'dense' array of in-progress xids using the 'pgprocno',
instead of directly using the [0...procArray->numProcs) index.

Working on a fix, together with some improved asserts.

Greetings,

Andres Freund

#78Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#77)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-08-16 13:52:58 -0700, Andres Freund wrote:

On 2020-08-16 13:31:53 -0700, Andres Freund wrote:

I now luckily have a rr trace of the problem, so I hope I can narrow it
down to the original problem fairly quickly.

Gna, I think I see the problem. In at least one place I wrongly
accessed the 'dense' array of in-progress xids using the 'pgprocno',
instead of directly using the [0...procArray->numProcs) index.

Working on a fix, together with some improved asserts.

diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 8262abd42e6..96e4a878576 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -1663,7 +1663,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
         TransactionId xmin;
         /* Fetch xid just once - see GetNewTransactionId */
-        xid = UINT32_ACCESS_ONCE(other_xids[pgprocno]);
+        xid = UINT32_ACCESS_ONCE(other_xids[index]);
         xmin = UINT32_ACCESS_ONCE(proc->xmin);

/*

indeed fixes the issue based on a number of iterations of your modified
test, and fixes a clear bug.
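
To make the bug class concrete: the per-backend xids live in a "dense" array whose entries are packed at positions [0, numProcs), while pgprocno identifies a backend's slot in the full proc array. The sketch below uses hypothetical names (ToyProcArray, toy_add, toy_oldest_xid, toy_oldest_xid_buggy), ignores xid wraparound and locking, and is only meant to show how indexing the dense array with a pgprocno can silently read the wrong slot.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PROCS 8

/* Hypothetical miniature of the two index spaces. */
typedef struct ToyProcArray
{
	int			num_procs;					/* entries in use in the dense arrays */
	int			pgprocnos[TOY_MAX_PROCS];	/* dense index -> slot in the full proc array */
	uint32_t	dense_xids[TOY_MAX_PROCS];	/* indexed by dense index; 0 = no xid */
} ToyProcArray;

/* Register a backend living in proc slot 'pgprocno' with the given xid. */
static void
toy_add(ToyProcArray *arr, int pgprocno, uint32_t xid)
{
	assert(arr->num_procs < TOY_MAX_PROCS);
	assert(pgprocno >= 0 && pgprocno < TOY_MAX_PROCS);
	arr->pgprocnos[arr->num_procs] = pgprocno;
	arr->dense_xids[arr->num_procs] = xid;
	arr->num_procs++;
}

/* Correct: the dense array is indexed by the loop position. */
static uint32_t
toy_oldest_xid(const ToyProcArray *arr)
{
	uint32_t	oldest = 0;

	for (int index = 0; index < arr->num_procs; index++)
	{
		uint32_t	xid = arr->dense_xids[index];

		if (xid != 0 && (oldest == 0 || xid < oldest))
			oldest = xid;
	}
	return oldest;
}

/* Buggy: the dense array is indexed by pgprocno, the wrong index space. */
static uint32_t
toy_oldest_xid_buggy(const ToyProcArray *arr)
{
	uint32_t	oldest = 0;

	for (int index = 0; index < arr->num_procs; index++)
	{
		uint32_t	xid = arr->dense_xids[arr->pgprocnos[index]];

		if (xid != 0 && (oldest == 0 || xid < oldest))
			oldest = xid;
	}
	return oldest;
}
```

With two backends registered in proc slots 5 and 2, the dense entries sit at positions 0 and 1. The correct loop finds the older xid; the buggy one reads positions 5 and 2, sees no xid at all, and any horizon derived from it would ignore both running transactions.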

WRT better asserts: We could make ProcArrayRemove() and InitProcGlobal()
initialize currently unused procArray->pgprocnos,
procGlobal->{xids,subxidStates,vacuumFlags} to invalid values and/or
declare them as uninitialized using the valgrind helpers.

For the first, one issue is that there's no obviously good candidate for
an uninitialized xid. We could use something like FrozenTransactionId,
which may never be in the procarray. But it's not exactly pretty.

Opinions?

Greetings,

Andres Freund

#79Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#78)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-08-16 14:11:46 -0700, Andres Freund wrote:

On 2020-08-16 13:52:58 -0700, Andres Freund wrote:

On 2020-08-16 13:31:53 -0700, Andres Freund wrote:
Gna, I think I see the problem. In at least one place I wrongly
accessed the 'dense' array of in-progress xids using the 'pgprocno',
instead of directly using the [0...procArray->numProcs) index.

Working on a fix, together with some improved asserts.

diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 8262abd42e6..96e4a878576 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -1663,7 +1663,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
TransactionId xmin;
/* Fetch xid just once - see GetNewTransactionId */
-        xid = UINT32_ACCESS_ONCE(other_xids[pgprocno]);
+        xid = UINT32_ACCESS_ONCE(other_xids[index]);
xmin = UINT32_ACCESS_ONCE(proc->xmin);

/*

indeed fixes the issue based on a number of iterations of your modified
test, and fixes a clear bug.

Pushed that much.

WRT better asserts: We could make ProcArrayRemove() and InitProcGlobal()
initialize currently unused procArray->pgprocnos,
procGlobal->{xids,subxidStates,vacuumFlags} to invalid values and/or
declare them as uninitialized using the valgrind helpers.

For the first, one issue is that there's no obviously good candidate for
an uninitialized xid. We could use something like FrozenTransactionId,
which may never be in the procarray. But it's not exactly pretty.

Opinions?

So we get some buildfarm results while thinking about this.

Greetings,

Andres Freund

#80Peter Geoghegan
In reply to: Andres Freund (#78)
Re: Improving connection scalability: GetSnapshotData()

On Sun, Aug 16, 2020 at 2:11 PM Andres Freund <andres@anarazel.de> wrote:

For the first, one issue is that there's no obviously good candidate for
an uninitialized xid. We could use something like FrozenTransactionId,
which may never be in the procarray. But it's not exactly pretty.

Maybe it would make sense to mark the fields as inaccessible or
undefined to Valgrind. That has advantages and disadvantages that are
obvious.

If that isn't enough, it might not hurt to do this on top of whatever
becomes the primary solution. An undefined value has the advantage of
"spreading" when the value gets copied around.

--
Peter Geoghegan

#81Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#78)
Re: Improving connection scalability: GetSnapshotData()

Andres Freund <andres@anarazel.de> writes:

For the first, one issue is that there's no obviously good candidate for
an uninitialized xid. We could use something like FrozenTransactionId,
which may never be in the procarray. But it's not exactly pretty.

Huh? What's wrong with using InvalidTransactionId?

regards, tom lane

#82Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#81)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-08-16 17:28:46 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

For the first, one issue is that there's no obviously good candidate for
an uninitialized xid. We could use something like FrozenTransactionId,
which may never be in the procarray. But it's not exactly pretty.

Huh? What's wrong with using InvalidTransactionId?

It's a normal value for a backend when it doesn't have an xid assigned.
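
A small sketch of the trade-off being discussed, under made-up names (ToyDenseXids and the toy_* functions): since 0 is what a backend without an assigned xid legitimately stores, poisoning unused slots needs some other value. Here the value 2 (FrozenTransactionId in PostgreSQL) is used as the poison, which is an assumption for illustration and not something the thread settled on. Removal simply shifts later entries down to keep the array dense, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_INVALID_XID ((uint32_t) 0)	/* legitimate: backend has no xid yet */
#define TOY_POISON_XID	((uint32_t) 2)	/* assumed never to be a running xid */
#define TOY_MAX_PROCS	4

typedef struct ToyDenseXids
{
	int			num_procs;				/* slots [0, num_procs) are in use */
	uint32_t	xids[TOY_MAX_PROCS];
} ToyDenseXids;

/* Mark every slot as unused by writing the poison value. */
static void
toy_init(ToyDenseXids *d)
{
	d->num_procs = 0;
	for (int i = 0; i < TOY_MAX_PROCS; i++)
		d->xids[i] = TOY_POISON_XID;
}

/* Append a backend; 0 is accepted because "no xid" is a normal state. */
static int
toy_add(ToyDenseXids *d, uint32_t xid)
{
	assert(d->num_procs < TOY_MAX_PROCS);
	assert(xid != TOY_POISON_XID);
	d->xids[d->num_procs] = xid;
	return d->num_procs++;
}

/* Remove one entry, keep the array dense, and re-poison the freed tail slot. */
static void
toy_remove(ToyDenseXids *d, int index)
{
	assert(index >= 0 && index < d->num_procs);
	for (int i = index; i < d->num_procs - 1; i++)
		d->xids[i] = d->xids[i + 1];
	d->num_procs--;
	d->xids[d->num_procs] = TOY_POISON_XID;
}

static bool
toy_slot_is_poisoned(const ToyDenseXids *d, int index)
{
	return d->xids[index] == TOY_POISON_XID;
}

/* Reads trip an assertion if they land on an unused slot. */
static uint32_t
toy_read(const ToyDenseXids *d, int index)
{
	assert(index >= 0 && index < d->num_procs);
	assert(d->xids[index] != TOY_POISON_XID);
	return d->xids[index];
}
```

A read through a stale or wrong index then hits an assertion instead of quietly returning 0, which would be indistinguishable from a backend that simply has no xid.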

Greetings,

Andres Freund

#83Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#80)
Re: Improving connection scalability: GetSnapshotData()

On 2020-Aug-16, Peter Geoghegan wrote:

On Sun, Aug 16, 2020 at 2:11 PM Andres Freund <andres@anarazel.de> wrote:

For the first, one issue is that there's no obviously good candidate for
an uninitialized xid. We could use something like FrozenTransactionId,
which may never be in the procarray. But it's not exactly pretty.

Maybe it would make sense to mark the fields as inaccessible or
undefined to Valgrind. That has advantages and disadvantages that are
obvious.

... and perhaps making Valgrind complain about it is sufficient.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#84Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#79)
Re: Improving connection scalability: GetSnapshotData()

On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:

So we get some buildfarm results while thinking about this.

Andres, there is an entry in the CF for this thread:
https://commitfest.postgresql.org/29/2500/

A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
Now that PGXACT is done, how much work is remaining here?
--
Michael

#85Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Michael Paquier (#84)
Re: Improving connection scalability: GetSnapshotData()

On 03.09.2020 11:18, Michael Paquier wrote:

On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:

So we get some buildfarm results while thinking about this.

Andres, there is an entry in the CF for this thread:
https://commitfest.postgresql.org/29/2500/

A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
Now that PGXACT is done, how much work is remaining here?
--
Michael

Andres,
First of all, many thanks for this work.
Improving Postgres connection scalability is very important.

The reported results look very impressive.
But I tried to reproduce them and didn't observe similar behavior.
So I am wondering what the difference could be and what I am doing wrong.

I have tried two different systems.
The first one is an IBM Power2 server with 384 cores and 8TB of RAM.
I ran the same read-only pgbench test as you. I do not think that the size of the database matters, so I used scale 100 -
it seems to be enough to avoid frequent buffer conflicts.
Then I ran the same scripts as you:

for ((n=100; n < 1000; n+=100)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -M prepared -S -n postgres ; done
for ((n=1000; n <= 5000; n+=1000)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -M prepared -S -n postgres ; done

I have compared current master with the version of Postgres prior to your commits with scalability improvements: a9a4a7ad56

For every connection count the older version shows slightly better results, for example for 500 clients: 475k TPS vs. 450k TPS for current master.

This is quite an exotic server and I do not currently have access to it.
So I have repeated the experiments on an Intel server.
It has 160 cores (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz) and 256GB of RAM.

The same database, the same script, results are the following:

Clients   old/inc   old/exl   new/inc   new/exl
   1000   1105750   1163292   1206105   1212701
   2000   1050933   1124688   1149706   1164942
   3000   1063667   1195158   1118087   1144216
   4000   1040065   1290432   1107348   1163906
   5000    943813   1258643   1103790   1160251

I have shown the results including and excluding connection establishment separately,
because in the new version there is almost no difference between them,
but for the old version the gap between them is noticeable.

Configuration file has the following differences with default postgres config:

max_connections = 10000 # (change requires restart)
shared_buffers = 8GB # min 128kB

These results contradict yours and make me ask the following questions:

1. Why is performance in your case almost twice as high (2 million vs. 1 million)?
The hardware in my case seems to be at least no worse than yours...
Maybe there are some other improvements in the version you tested which are not yet committed to master?

2. You wrote: "This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized (2 sockets of 18 cores/36 threads)"

According to Intel's specification, the Intel® Xeon® Platinum 8168 processor has 24 cores:
https://ark.intel.com/content/www/us/en/ark/products/120504/intel-xeon-platinum-8168-processor-33m-cache-2-70-ghz.html

And in your graph we can see an almost linear speedup up to 40 connections.

But the most suspicious word for me is "virtualized". What is the actual hardware and how is it virtualized?

Do you have any idea why in my case the master version (with your commits) behaves almost the same as the non-patched version?
Below is yet another table showing scalability from 10 to 100 connections and combining your results (first two columns) and my results (last two columns):

Clients   old master   pgxact-split-cache   current master   revision 9a4a7ad56
     10       367883               375682           358984               347067
     20       748000               810964           668631               630304
     30       999231              1288276           920255               848244
     40       991672              1573310          1100745               970717
     50      1017561              1715762          1193928              1008755
     60       993943              1789698          1255629               917788
     70       971379              1819477          1277634               873022
     80       966276              1842248          1266523               830197
     90       901175              1847823          1255260               736550
    100       803175              1865795          1241143               736756

Maybe it is because of the more complex architecture of my server?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#86Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#84)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-03 17:18:29 +0900, Michael Paquier wrote:

On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:

So we get some buildfarm results while thinking about this.

Andres, there is an entry in the CF for this thread:
https://commitfest.postgresql.org/29/2500/

A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
Now that PGXACT is done, how much work is remaining here?

I think it's best to close the entry. There are substantial further wins
possible; in particular, not acquiring ProcArrayLock in GetSnapshotData()
when the cache is valid improves performance substantially. But it's
non-trivial enough that it's probably best dealt with in a separate
patch / CF entry.

Closed.

Greetings,

Andres Freund

#87Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#85)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-04 18:24:12 +0300, Konstantin Knizhnik wrote:

Reported results looks very impressive.
But I tried to reproduce them and didn't observed similar behavior.
So I am wondering what can be the difference and what I am doing wrong.

That is odd - I did reproduce it on quite a few systems by now.

Configuration file has the following differences with default postgres config:

max_connections = 10000 # (change requires restart)
shared_buffers = 8GB # min 128kB

I also used huge_pages=on / configured them on the OS level. Otherwise
TLB misses will be a significant factor.

Does it change if you initialize the test database using
PGOPTIONS='-c vacuum_freeze_min_age=0' pgbench -i -s 100
or run a manual VACUUM FREEZE; after initialization?

I have tried two different systems.
The first one is an IBM Power2 server with 384 cores and 8TB of RAM.
I ran the same read-only pgbench test as you. I do not think that the size of the database matters, so I used scale 100 -
it seems to be enough to avoid frequent buffer conflicts.
Then I ran the same scripts as you:

for ((n=100; n < 1000; n+=100)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -M prepared -S -n postgres ; done
for ((n=1000; n <= 5000; n+=1000)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -M prepared -S -n postgres ; done

I have compared current master with the version of Postgres prior to your commits with scalability improvements: a9a4a7ad56

Hm, it'd probably be good to compare commits closer to the changes, to
avoid other changes showing up.

Hm - did you verify if all the connections were actually established?
Particularly without the patch applied? With an unmodified pgbench, I
sometimes saw better numbers, but only because only half the connections
were able to be established, due to ProcArrayLock contention.

See /messages/by-id/20200227180100.zyvjwzcpiokfsqm2@alap3.anarazel.de

There also is the issue that pgbench numbers for inclusive/exclusive are
just about meaningless right now:
/messages/by-id/20200227202636.qaf7o6qcajsudoor@alap3.anarazel.de
(reminds me, need to get that fixed)

One more thing worth investigating is whether your results change
significantly when you start the server using
numactl --interleave=all <start_server_cmdline>.
Especially on larger systems the results otherwise can vary a lot from
run-to-run, because the placement of shared buffers matters a lot.

So I have repeated the experiments on an Intel server.
It has 160 cores (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz) and 256GB of RAM.

The same database, the same script; the results are the following:

Clients old/inc old/exl new/inc new/exl
1000 1105750 1163292 1206105 1212701
2000 1050933 1124688 1149706 1164942
3000 1063667 1195158 1118087 1144216
4000 1040065 1290432 1107348 1163906
5000 943813 1258643 1103790 1160251

I have shown the results including/excluding connection establishment separately,
because in the new version there is almost no difference between them,
but for the old version the gap between them is noticeable.

Configuration file has the following differences with default postgres config:

max_connections = 10000 # (change requires restart)
shared_buffers = 8GB # min 128kB

These results contradict yours and make me ask the following questions:

1. Why is performance in your case almost two times higher (2 million vs 1)?
The hardware in my case seems to be at least no worse than yours...
Maybe there are some other improvements in the version you tested which are not yet committed to master?

No, no uncommitted changes, except for the pgbench stuff mentioned
above. However, I found that the kernel version matters a fair bit; it's
pretty easy to run into kernel scalability issues in a workload that is
this heavily scheduler dependent.

Did you connect via tcp or unix socket? Was pgbench running on the same
machine? It was locally via unix socket for me (but it's also observable
via two machines, just with lower overall throughput).

Did you run a profile to see where the bottleneck is?

There's a separate benchmark that I found to be quite revealing that's
far less dependent on scheduler behaviour. Run two pgbench instances:

1) With a very simple script '\sleep 1s' or such, and many connections
(e.g. 100,1000,5000). That's to simulate connections that are
currently idle.
2) With a normal pgbench read only script, and low client counts.

Before the changes 2) shows a very sharp decline in performance when the
count in 1) increases. Afterwards it's pretty much linear.

I think this benchmark actually is much more real world oriented - due
to latency and client side overheads it's very normal to have a large
fraction of connections idle in read mostly OLTP workloads.

Here's the result on my workstation (2x Xeon Gold 5215 CPUs), testing
1f42d35a1d6144a23602b2c0bc7f97f3046cf890 against
07f32fcd23ac81898ed47f88beb569c631a2f223 which are the commits pre/post
connection scalability changes.

I used fairly short pgbench runs (15s), and the numbers are the best of
three runs. I also had emacs and mutt open - some noise to be
expected. But I also gotta work ;)

| Idle Connections | Active Connections | TPS pre | TPS post |
|-----------------:|-------------------:|--------:|---------:|
| 0 | 1 | 33599 | 33406 |
| 100 | 1 | 31088 | 33279 |
| 1000 | 1 | 29377 | 33434 |
| 2500 | 1 | 27050 | 33149 |
| 5000 | 1 | 21895 | 33903 |
| 10000 | 1 | 16034 | 33140 |
| 0 | 48 | 1042005 | 1125104 |
| 100 | 48 | 986731 | 1103584 |
| 1000 | 48 | 854230 | 1119043 |
| 2500 | 48 | 716624 | 1119353 |
| 5000 | 48 | 553657 | 1119476 |
| 10000 | 48 | 369845 | 1115740 |
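
One way to read the table is to ask how much of the zero-idle throughput survives as idle connections are added. The snippet below is a small sketch that does this arithmetic on the figures copied from the table; the names `RESULTS` and `retained` are made up for the example.

```python
# (idle connections, active connections) -> (TPS pre, TPS post),
# copied from the table above.
RESULTS = {
    (0, 1): (33599, 33406),
    (100, 1): (31088, 33279),
    (1000, 1): (29377, 33434),
    (2500, 1): (27050, 33149),
    (5000, 1): (21895, 33903),
    (10000, 1): (16034, 33140),
    (0, 48): (1042005, 1125104),
    (100, 48): (986731, 1103584),
    (1000, 48): (854230, 1119043),
    (2500, 48): (716624, 1119353),
    (5000, 48): (553657, 1119476),
    (10000, 48): (369845, 1115740),
}


def retained(idle, active):
    """Fraction of the zero-idle TPS still reached with `idle` idle connections."""
    base_pre, base_post = RESULTS[(0, active)]
    pre, post = RESULTS[(idle, active)]
    return pre / base_pre, post / base_post


for active in (1, 48):
    pre, post = retained(10000, active)
    print(f"{active:>2} active, 10000 idle: pre keeps {pre:.1%}, post keeps {post:.1%}")
```

With 10000 idle connections the pre-change build keeps roughly 48% (1 active) and 35% (48 active) of its zero-idle throughput, while the post-change build stays at about 99% in both cases.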

And a second version of this, where the idle connections are just less
busy, using the following script:
\sleep 100ms
SELECT 1;

| Mostly Idle Connections | Active Connections | TPS pre | TPS post |
|------------------------:|-------------------:|--------:|---------------:|
| 0 | 1 | 33837 | 34095.891429 |
| 100 | 1 | 30622 | 31166.767491 |
| 1000 | 1 | 25523 | 28829.313249 |
| 2500 | 1 | 19260 | 24978.878822 |
| 5000 | 1 | 11171 | 24208.146408 |
| 10000 | 1 | 6702 | 29577.517084 |
| 0 | 48 | 1022721 | 1133153.772338 |
| 100 | 48 | 980705 | 1034235.255883 |
| 1000 | 48 | 824668 | 1115965.638395 |
| 2500 | 48 | 698510 | 1073280.930789 |
| 5000 | 48 | 478535 | 1041931.158287 |
| 10000 | 48 | 276042 | 953567.038634 |

It's probably worth calling out that in the second test run here the
run-to-run variability is huge. Presumably that's because it's very scheduler
dependent how much CPU time the "active" backends and the "active" pgbench get
at higher "mostly idle" connection counts.

2. You wrote: This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized (2 sockets of 18 cores/36 threads)

According to the Intel specification, the Intel(R) Xeon(R) Platinum 8168 processor has 24 cores:
https://ark.intel.com/content/www/us/en/ark/products/120504/intel-xeon-platinum-8168-processor-33m-cache-2-70-ghz.html

And in your graph we can see an almost linear increase in speed up to 40 connections.

But the most suspicious word for me is "virtualized". What is the actual hardware, and how is it virtualized?

That was on an azure Fs72v2. I think that's hyperv virtualized, with all
the "lost" cores dedicated to the hypervisor. But I did reproduce the
speedups on my unvirtualized workstation (2x Xeon Gold 5215 CPUs) -
the ceiling is lower, obviously.

Maybe it is because of the more complex architecture of my server?

Think we'll need profiles to know...

Greetings,

Andres Freund

#88Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#87)
2 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

On 2020-09-04 11:53:04 -0700, Andres Freund wrote:

There's a separate benchmark that I found to be quite revealing that's
far less dependent on scheduler behaviour. Run two pgbench instances:

1) With a very simple script '\sleep 1s' or such, and many connections
(e.g. 100,1000,5000). That's to simulate connections that are
currently idle.
2) With a normal pgbench read only script, and low client counts.

Before the changes 2) shows a very sharp decline in performance when the
count in 1) increases. Afterwards it's pretty much linear.

I think this benchmark actually is much more real world oriented - due
to latency and client side overheads it's very normal to have a large
fraction of connections idle in read mostly OLTP workloads.

Here's the result on my workstation (2x Xeon Gold 5215 CPUs), testing
1f42d35a1d6144a23602b2c0bc7f97f3046cf890 against
07f32fcd23ac81898ed47f88beb569c631a2f223 which are the commits pre/post
connection scalability changes.

I used fairly short pgbench runs (15s), and the numbers are the best of
three runs. I also had emacs and mutt open - some noise to be
expected. But I also gotta work ;)

| Idle Connections | Active Connections | TPS pre | TPS post |
|-----------------:|-------------------:|--------:|---------:|
| 0 | 1 | 33599 | 33406 |
| 100 | 1 | 31088 | 33279 |
| 1000 | 1 | 29377 | 33434 |
| 2500 | 1 | 27050 | 33149 |
| 5000 | 1 | 21895 | 33903 |
| 10000 | 1 | 16034 | 33140 |
| 0 | 48 | 1042005 | 1125104 |
| 100 | 48 | 986731 | 1103584 |
| 1000 | 48 | 854230 | 1119043 |
| 2500 | 48 | 716624 | 1119353 |
| 5000 | 48 | 553657 | 1119476 |
| 10000 | 48 | 369845 | 1115740 |

Attached in graph form.

Greetings,

Andres Freund

Attachments:

performance-impact-of-idle-connections-1active-prepost.png (image/png)
performance-impact-of-idle-connections-48active-prepost.png (image/png)

#89Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#86)
Re: Improving connection scalability: GetSnapshotData()

On Fri, Sep 04, 2020 at 10:35:19AM -0700, Andres Freund wrote:

I think it's best to close the entry. There are substantial further wins
possible; in particular, not acquiring ProcArrayLock in GetSnapshotData()
when the cache is valid improves performance substantially. But it's
non-trivial enough that it's probably best dealt with in a separate
patch / CF entry.

Cool, thanks for updating the CF entry.
--
Michael

#90Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#87)
Re: Improving connection scalability: GetSnapshotData()

On 04.09.2020 21:53, Andres Freund wrote:

I also used huge_pages=on / configured them on the OS level. Otherwise
TLB misses will be a significant factor.

As far as I understand there should not be any TLB misses, because
the size of the shared buffers (8Mb) is several orders of magnitude smaller
than the available physical memory.

Does it change if you initialize the test database using
PGOPTIONS='-c vacuum_freeze_min_age=0' pgbench -i -s 100
or run a manual VACUUM FREEZE; after initialization?

I tried it, but didn't see any improvement.

Hm, it'd probably be good to compare commits closer to the changes, to
avoid other changes showing up.

Hm - did you verify if all the connections were actually established?
Particularly without the patch applied? With an unmodified pgbench, I
sometimes saw better numbers, but only because only half the connections
were able to be established, due to ProcArrayLock contention.

Yes, that really happens quite often on the IBM Power2 server (a peculiarity of
its atomics implementation).
I even had to patch pgbench, adding a one-second delay after a
connection has been established, to make it possible for all clients to
connect.
But on the Intel server I didn't see unconnected clients. And in any case,
it happens only for a large number of connections (> 1000).
But the best performance was achieved at about 100 connections, and I still
cannot reach 2M TPS as in your case.

Did you connect via tcp or unix socket? Was pgbench running on the same
machine? It was locally via unix socket for me (but it's also observable
via two machines, just with lower overall throughput).

Pgbench was launched at the same machine and connected through unix sockets.

Did you run a profile to see where the bottleneck is?

Sorry I do not have root privileges at this server and so can not use perf.

There's a separate benchmark that I found to be quite revealing that's
far less dependent on scheduler behaviour. Run two pgbench instances:

1) With a very simple script '\sleep 1s' or such, and many connections
(e.g. 100,1000,5000). That's to simulate connections that are
currently idle.
2) With a normal pgbench read only script, and low client counts.

Before the changes 2) shows a very sharp decline in performance when the
count in 1) increases. Afterwards it's pretty much linear.

I think this benchmark actually is much more real world oriented - due
to latency and client side overheads it's very normal to have a large
fraction of connections idle in read mostly OLTP workloads.

Here's the result on my workstation (2x Xeon Gold 5215 CPUs), testing
1f42d35a1d6144a23602b2c0bc7f97f3046cf890 against
07f32fcd23ac81898ed47f88beb569c631a2f223 which are the commits pre/post
connection scalability changes.

I used fairly short pgbench runs (15s), and the numbers are the best of
three runs. I also had emacs and mutt open - some noise to be
expected. But I also gotta work ;)

| Idle Connections | Active Connections | TPS pre | TPS post |
|-----------------:|-------------------:|--------:|---------:|
| 0 | 1 | 33599 | 33406 |
| 100 | 1 | 31088 | 33279 |
| 1000 | 1 | 29377 | 33434 |
| 2500 | 1 | 27050 | 33149 |
| 5000 | 1 | 21895 | 33903 |
| 10000 | 1 | 16034 | 33140 |
| 0 | 48 | 1042005 | 1125104 |
| 100 | 48 | 986731 | 1103584 |
| 1000 | 48 | 854230 | 1119043 |
| 2500 | 48 | 716624 | 1119353 |
| 5000 | 48 | 553657 | 1119476 |
| 10000 | 48 | 369845 | 1115740 |

Yes, there is also noticeable difference in my case

| Idle Connections | Active Connections | TPS pre | TPS post |
|-----------------:|-------------------:|--------:|---------:|
| 5000 | 48 | 758914 | 1184085 |

Think we'll need profiles to know...

I will try to obtain sudo permissions and do profiling.

#91Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#87)
Re: Improving connection scalability: GetSnapshotData()

On 04.09.2020 21:53, Andres Freund wrote:

Maybe it is because of the more complex architecture of my server?

Think we'll need profiles to know...

This is "perf top" of pgbench -c 100 -j 100 -M prepared -S

  12.16%  postgres                           [.] PinBuffer
  11.92%  postgres                           [.] LWLockAttemptLock
   6.46%  postgres                           [.] UnpinBuffer.constprop.11
   6.03%  postgres                           [.] LWLockRelease
   3.14%  postgres                           [.] BufferGetBlockNumber
   3.04%  postgres                           [.] ReadBuffer_common
   2.13%  [kernel]                           [k] _raw_spin_lock_irqsave
   2.11%  [kernel]                           [k] switch_mm_irqs_off
   1.95%  postgres                           [.] _bt_compare

Looks like most of the time is spent in buffer locks.
And which pgbench database scale factor did you use?

#92Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#90)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-05 16:58:31 +0300, Konstantin Knizhnik wrote:

On 04.09.2020 21:53, Andres Freund wrote:

I also used huge_pages=on / configured them on the OS level. Otherwise
TLB misses will be a significant factor.

As far as I understand there should not be any TLB misses, because the size of
the shared buffers (8Mb) is several orders of magnitude smaller than the available
physical memory.

I assume you didn't mean 8MB but 8GB? If so, that's way large enough to
be bigger than the TLB, particularly across processes (IIRC there's no
optimization to keep shared mappings de-duplicated between processes
from the view of the TLB).
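
To put rough numbers on this, the sketch below counts how many pages an 8GB mapping spans. It assumes 4 kB base pages and 2 MB huge pages (typical for x86-64 Linux) and uses an illustrative figure of 1536 second-level TLB entries; all three values are assumptions for the sake of the arithmetic, not measurements from either machine in this thread.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

ASSUMED_TLB_ENTRIES = 1536  # illustrative only; the real value is CPU specific


def pages_needed(mapping_bytes, page_bytes):
    """Number of pages (and thus distinct TLB entries) a mapping spans."""
    return -(-mapping_bytes // page_bytes)  # ceiling division


def tlb_reach(entries, page_bytes):
    """Amount of memory that can be covered by the TLB at once."""
    return entries * page_bytes


shared_buffers = 8 * GIB
for label, page in (("4 kB pages", 4 * KIB), ("2 MB huge pages", 2 * MIB)):
    pages = pages_needed(shared_buffers, page)
    reach_mib = tlb_reach(ASSUMED_TLB_ENTRIES, page) // MIB
    print(f"{label}: {pages} pages, assumed TLB reach {reach_mib} MiB")
```

Under these assumptions, shared_buffers spans about two million small pages but only 4096 huge pages, which is why huge pages can reduce TLB misses a lot even when physical memory is plentiful.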

Yes, there is also noticeable difference in my case

| Idle Connections | Active Connections | TPS pre | TPS post |
|-----------------:|-------------------:|--------:|---------:|
| 5000 | 48 | 758914 | 1184085 |

Sounds like you're somehow hitting another bottleneck around 1.2M TPS

Greetings,

Andres Freund

#93Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#91)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-06 14:05:35 +0300, Konstantin Knizhnik wrote:

On 04.09.2020 21:53, Andres Freund wrote:

Maybe it is because of the more complex architecture of my server?

Think we'll need profiles to know...

This is "perf top" of pgbench -c 100 -j 100 -M prepared -S

  12.16%  postgres                           [.] PinBuffer
  11.92%  postgres                           [.] LWLockAttemptLock
   6.46%  postgres                           [.] UnpinBuffer.constprop.11
   6.03%  postgres                           [.] LWLockRelease
   3.14%  postgres                           [.] BufferGetBlockNumber
   3.04%  postgres                           [.] ReadBuffer_common
   2.13%  [kernel]                           [k] _raw_spin_lock_irqsave
   2.11%  [kernel]                           [k] switch_mm_irqs_off
   1.95%  postgres                           [.] _bt_compare

Looks like most of the time is spent in buffer locks.

Hm, that is interesting / odd. If you record a profile with call graphs
(e.g. --call-graph dwarf), where are all the LWLockAttemptLock calls
coming from?

I assume the machine you're talking about is an 8 socket machine?

What if you:
a) start postgres and pgbench with numactl --interleave=all
b) start postgres with numactl --interleave=0,1 --cpunodebind=0,1 --membind=0,1
in case you have 4 sockets, or 0,1,2,3 in case you have 8 sockets?

And which pgbench database scale factor did you use?

200

Another thing you could try is to run 2-4 pgbench instances in different
databases.

Greetings,

Andres Freund

#94Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#93)
1 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

On 06.09.2020 21:56, Andres Freund wrote:

Hm, that is interesting / odd. If you record a profile with call graphs
(e.g. --call-graph dwarf), where are all the LWLockAttemptLock calls
coming from?

Attached.

I assume the machine you're talking about is an 8 socket machine?

What if you:
a) start postgres and pgbench with numactl --interleave=all
b) start postgres with numactl --interleave=0,1 --cpunodebind=0,1 --membind=0,1
in case you have 4 sockets, or 0,1,2,3 in case you have 8 sockets?

TPS for -c 100

--interleave=all
1168910
--interleave=0,1
1232557
--interleave=0,1,2,3
1254271
--cpunodebind=0,1,2,3 --membind=0,1,2,3
1237237
--cpunodebind=0,1 --membind=0,1
1420211
--cpunodebind=0 --membind=0
1101203
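
For easier comparison, these figures can be normalized against the --interleave=all run. The snippet below is just that arithmetic on the numbers listed above; the dictionary and function names are invented for the example.

```python
# numactl options -> TPS for -c 100, copied from the list above.
TPS = {
    "--interleave=all": 1168910,
    "--interleave=0,1": 1232557,
    "--interleave=0,1,2,3": 1254271,
    "--cpunodebind=0,1,2,3 --membind=0,1,2,3": 1237237,
    "--cpunodebind=0,1 --membind=0,1": 1420211,
    "--cpunodebind=0 --membind=0": 1101203,
}


def relative_to(baseline_key):
    """Each configuration's TPS as a ratio of the chosen baseline."""
    base = TPS[baseline_key]
    return {key: value / base for key, value in TPS.items()}


ratios = relative_to("--interleave=all")
for key, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{ratio:6.3f}  {key}")
```

Binding to two nodes comes out about 21% ahead of interleaving across all nodes, while binding to a single node is about 6% behind it.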

And which pgbench database scale factor did you use?

200

Another thing you could try is to run 2-4 pgbench instances in different
databases.

I tried to reinitialize database with scale 200 but there was no
significant improvement in performance.

Attachments:

pgbench.svg (image/svg+xml)

#95Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#92)
Re: Improving connection scalability: GetSnapshotData()

On 06.09.2020 21:52, Andres Freund wrote:

Hi,

On 2020-09-05 16:58:31 +0300, Konstantin Knizhnik wrote:

On 04.09.2020 21:53, Andres Freund wrote:

I also used huge_pages=on / configured them on the OS level. Otherwise
TLB misses will be a significant factor.

As far as I understand there should not be any TLB misses, because the size of
the shared buffers (8Mb) is several orders of magnitude smaller than the available
physical memory.

I assume you didn't mean 8MB but 8GB? If so, that's way large enough to
be bigger than the TLB, particularly across processes (IIRC there's no
optimization to keep shared mappings de-duplicated between processes
from the view of the TLB).

Sorry, 8GB of course.
I tried huge pages, but it had almost no effect.

#96Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#94)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On Mon, Sep 7, 2020, at 07:20, Konstantin Knizhnik wrote:

And which pgbench database scale factor did you use?

200

Another thing you could try is to run 2-4 pgbench instances in different
databases.

I tried to reinitialize database with scale 200 but there was no
significant improvement in performance.

If you're replying to the last bit I am quoting, I was talking about having four databases with separate pgbench tables etc., to see how much of it is procarray contention and how much is contention on common buffers etc.

Attachments:
* pgbench.svg

What numactl was used for this one?

#97Ian Barwick
ian.barwick@2ndquadrant.com
In reply to: Michael Paquier (#84)
Re: Improving connection scalability: GetSnapshotData()

On 2020/09/03 17:18, Michael Paquier wrote:

On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:

So we get some buildfarm results while thinking about this.

Andres, there is an entry in the CF for this thread:
https://commitfest.postgresql.org/29/2500/

A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.

I haven't seen it mentioned here, so apologies if I've overlooked
something, but as of 623a9ba queries on standbys seem somewhat
broken.

Specifically, I maintain some code which does something like this:

- connects to a standby
- checks a particular row does not exist on the standby
- connects to the primary
- writes a row in the primary
- polls the standby (using the same connection as above)
to verify the row arrives on the standby

As of recent HEAD it never sees the row arrive on the standby, even
though it is verifiably there.

I've traced this back to 623a9ba, which relies on "xactCompletionCount"
being incremented to determine whether the snapshot can be reused,
but that never happens on a standby, meaning this test in
GetSnapshotDataReuse():

    if (curXactCompletionCount != snapshot->snapXactCompletionCount)
        return false;

will never return false, and the snapshot's xmin/xmax never get advanced.
Which means the session on the standby is not able to see rows on the
standby added after the session was started.

It's simple enough to prevent that being an issue by just never calling
GetSnapshotDataReuse() if the snapshot was taken during recovery
(though obviously that means any performance benefits won't be available
on standbys).

I wonder if it's possible to increment "xactCompletionCount"
during replay along these lines:

     *** a/src/backend/access/transam/xact.c
     --- b/src/backend/access/transam/xact.c
     *************** xact_redo_commit(xl_xact_parsed_commit *
     *** 5915,5920 ****
     --- 5915,5924 ----
              */
             if (XactCompletionApplyFeedback(parsed->xinfo))
                     XLogRequestWalReceiverReply();
     +
     +       LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
     +       ShmemVariableCache->xactCompletionCount++;
     +       LWLockRelease(ProcArrayLock);
       }

which seems to work (though quite possibly I've overlooked something I don't
know that I don't know about and it will all break horribly somewhere,
etc. etc.).
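
The failure mode described above can be illustrated with a toy model. This is a deliberately simplified sketch and not PostgreSQL code: a "snapshot" here is just the set of committed xids, whereas real snapshots track xmin/xmax and in-progress xids. The class and function names are invented; only the idea of comparing a completion counter before reusing a cached snapshot is taken from the message.

```python
class Node:
    """Toy shared state: committed xids plus a completion counter."""

    def __init__(self, bump_on_replay):
        self.committed = set()
        self.xact_completion_count = 0
        self.bump_on_replay = bump_on_replay

    def replay_commit(self, xid):
        """Apply a commit record during replay."""
        self.committed.add(xid)
        if self.bump_on_replay:
            # The proposed fix: also advance the counter during replay.
            self.xact_completion_count += 1


class Session:
    """Toy backend that caches its snapshot between statements."""

    def __init__(self, node):
        self.node = node
        self.snapshot = None
        self.snap_count = None

    def get_snapshot(self):
        # Rebuild only when the counter has moved since the cached snapshot
        # was taken; otherwise reuse it.
        if (self.snapshot is None
                or self.node.xact_completion_count != self.snap_count):
            self.snapshot = frozenset(self.node.committed)
            self.snap_count = self.node.xact_completion_count
        return self.snapshot

    def sees(self, xid):
        return xid in self.get_snapshot()


def row_arrives(bump_on_replay):
    """Poll on one connection: does a replayed commit ever become visible?"""
    standby = Node(bump_on_replay)
    session = Session(standby)
    assert not session.sees(100)  # row absent; snapshot is now cached
    standby.replay_commit(100)    # commit from the primary gets replayed
    return session.sees(100)


print("counter untouched by replay:", row_arrives(False))
print("counter bumped by replay:   ", row_arrives(True))
```

In the first case the cached snapshot is reused forever and the row never appears; in the second the counter change forces a fresh snapshot. A separate connection per query, as with individual psql invocations, would start without a cached snapshot and so would not show the problem in this model either.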

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#98Andres Freund
andres@anarazel.de
In reply to: Ian Barwick (#97)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-08 13:03:01 +0900, Ian Barwick wrote:

On 2020/09/03 17:18, Michael Paquier wrote:

On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:

So we get some buildfarm results while thinking about this.

Andres, there is an entry in the CF for this thread:
https://commitfest.postgresql.org/29/2500/

A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.

I haven't seen it mentioned here, so apologies if I've overlooked
something, but as of 623a9ba queries on standbys seem somewhat
broken.

Specifically, I maintain some code which does something like this:

- connects to a standby
- checks a particular row does not exist on the standby
- connects to the primary
- writes a row in the primary
- polls the standby (using the same connection as above)
to verify the row arrives on the standby

As of recent HEAD it never sees the row arrive on the standby, even
though it is verifiably there.

Ugh, that's not good.

I've traced this back to 623a9ba, which relies on "xactCompletionCount"
being incremented to determine whether the snapshot can be reused,
but that never happens on a standby, meaning this test in
GetSnapshotDataReuse():

    if (curXactCompletionCount != snapshot->snapXactCompletionCount)
        return false;

will never return false, and the snapshot's xmin/xmax never get advanced.
Which means the session on the standby is not able to see rows on the
standby added after the session was started.

It's simple enough to prevent that being an issue by just never calling
GetSnapshotDataReuse() if the snapshot was taken during recovery
(though obviously that means any performance benefits won't be available
on standbys).

Yea, that doesn't sound great. Nor is the additional branch welcome.

I wonder if it's possible to increment "xactCompletionCount"
during replay along these lines:

     *** a/src/backend/access/transam/xact.c
     --- b/src/backend/access/transam/xact.c
     *************** xact_redo_commit(xl_xact_parsed_commit *
     *** 5915,5920 ****
     --- 5915,5924 ----
              */
             if (XactCompletionApplyFeedback(parsed->xinfo))
                     XLogRequestWalReceiverReply();
     +
     +       LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
     +       ShmemVariableCache->xactCompletionCount++;
     +       LWLockRelease(ProcArrayLock);
       }

which seems to work (though quite possibly I've overlooked something I don't
know that I don't know about and it will all break horribly somewhere,
etc. etc.).

We'd also need the same in a few more places. Probably worth looking at
the list where we increment it on the primary (particularly we need to
also increment it for aborts, and 2pc commit/aborts).

At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

Do you want to try to write a patch?

Greetings,

Andres Freund

#99Ian Barwick
ian.barwick@2ndquadrant.com
In reply to: Andres Freund (#98)
Re: Improving connection scalability: GetSnapshotData()

On 2020/09/08 13:11, Andres Freund wrote:

Hi,

On 2020-09-08 13:03:01 +0900, Ian Barwick wrote:

(...)

I wonder if it's possible to increment "xactCompletionCount"
during replay along these lines:

     *** a/src/backend/access/transam/xact.c
     --- b/src/backend/access/transam/xact.c
     *************** xact_redo_commit(xl_xact_parsed_commit *
     *** 5915,5920 ****
     --- 5915,5924 ----
              */
             if (XactCompletionApplyFeedback(parsed->xinfo))
                     XLogRequestWalReceiverReply();
     +
     +       LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
     +       ShmemVariableCache->xactCompletionCount++;
     +       LWLockRelease(ProcArrayLock);
       }

which seems to work (though quite possibly I've overlooked something I don't
know that I don't know about and it will all break horribly somewhere,
etc. etc.).

We'd also need the same in a few more places. Probably worth looking at
the list where we increment it on the primary (particularly we need to
also increment it for aborts, and 2pc commit/aborts).

Yup.

At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

Do you want to try to write a patch?

Sure, I'll give it a go as I have some time right now.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#100Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#98)
Re: Improving connection scalability: GetSnapshotData()

On Tue, Sep 8, 2020 at 4:11 PM Andres Freund <andres@anarazel.de> wrote:

At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

I prototyped a TAP test patch that could maybe do the sort of thing
you need, in patch 0006 over at [1]. Later versions of that patch set
dropped it, because I figured out how to use the isolation tester
instead, but I guess you can't do that for a standby test (at least
not until someone teaches the isolation tester to support multi-node
schedules, something that would be extremely useful...). Example:

+# start an interactive session that we can use to interleave statements
+my $session = PsqlSession->new($node, "postgres");
+$session->send("\\set PROMPT1 ''\n", 2);
+$session->send("\\set PROMPT2 ''\n", 1);
...
+# our snapshot is not too old yet, so we can still use it
+@lines = $session->send("select * from t order by i limit 1;\n", 2);
+shift @lines;
+$result = shift @lines;
+is($result, "1");
...
+# our snapshot is too old!  the thing it wants to see has been removed
+@lines = $session->send("select * from t order by i limit 1;\n", 2);
+shift @lines;
+$result = shift @lines;
+is($result, "ERROR:  snapshot too old");

[1]: /messages/by-id/CA+hUKG+FkUuDv-bcBns=Z_O-V9QGW0nWZNHOkEPxHZWjegRXvw@mail.gmail.com

#101Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#96)
Re: Improving connection scalability: GetSnapshotData()

On 07.09.2020 23:45, Andres Freund wrote:

Hi,

On Mon, Sep 7, 2020, at 07:20, Konstantin Knizhnik wrote:

And which pgbench database scale factor did you use?

200

Another thing you could try is to run 2-4 pgbench instances in different
databases.

I tried to reinitialize database with scale 200 but there was no
significant improvement in performance.

If you're replying to the last bit I am quoting, I was talking about having four databases with separate pgbench tables etc., to see how much of it is procarray contention and how much is contention on common buffers etc.

Sorry, I was testing the hypothesis that the difference in performance between my
case and yours can be explained by the size of the table, which can
influence shared buffer contention.
This is why I used the same scale as you, but there is no difference
compared with scale 100.

And Postgres performance in this test is definitely limited by lock
contention (most likely shared buffer locks, rather than procarray locks).
If I create two instances of postgres, both with a pgbench -s 200 database,
and run two pgbenches with 100 connections each, then
each instance shows the same ~1 million TPS (1186483) as when launched
standalone. And the total TPS is 2.3 million.

Attachments:
* pgbench.svg

What numactl was used for this one?

None. I have not used numactl in this case.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#102Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#100)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-08 16:44:17 +1200, Thomas Munro wrote:

On Tue, Sep 8, 2020 at 4:11 PM Andres Freund <andres@anarazel.de> wrote:

At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

I prototyped a TAP test patch that could maybe do the sort of thing
you need, in patch 0006 over at [1]. Later versions of that patch set
dropped it, because I figured out how to use the isolation tester
instead, but I guess you can't do that for a standby test (at least
not until someone teaches the isolation tester to support multi-node
schedules, something that would be extremely useful...).

Unfortunately a proper multi-node isolationtester test is basically
equivalent to building a global lock graph. I think, at least? Including
a need to be able to correlate connections with their locks between the
nodes.

But for something like the bug at hand it'd probably be sufficient to just
"hack" something with dblink. In session 1) insert a row on the primary
using dblink, return the LSN, wait for the LSN to have replicated and
finally in session 2) check for row visibility.
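
The "wait for the LSN to have replicated" step could look roughly like the sketch below. It is not the dblink approach itself but a client-side stand-in for the waiting part: `get_replay_lsn` is a hypothetical callable that would return the text result of `SELECT pg_last_wal_replay_lsn()` on the standby, and the target would come from `SELECT pg_current_wal_lsn()` on the primary after the insert. The function names, timeout and polling interval are assumptions.

```python
import time


def parse_lsn(text):
    """Convert a pg_lsn string such as '0/3000060' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wait_for_replay(get_replay_lsn, target_lsn, timeout=10.0, interval=0.1):
    """Poll until the standby has replayed at least target_lsn.

    get_replay_lsn: hypothetical callable returning the standby's replay
    position as text, or None if nothing has been replayed yet.
    Returns True once the target is reached, False on timeout.
    """
    target = parse_lsn(target_lsn)
    deadline = time.monotonic() + timeout
    while True:
        current = get_replay_lsn()
        if current is not None and parse_lsn(current) >= target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

LSNs are compared numerically rather than as strings because the two hex halves are not zero padded, so plain string comparison would order '0/A000000' after '0/10000000'. Only after this wait returns True would the long-lived standby session re-run its visibility check.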

Greetings,

Andres Freund

#103Andres Freund
andres@anarazel.de
In reply to: Michail Nikolaev (#54)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-06-07 11:24:50 +0300, Michail Nikolaev wrote:

Hello, hackers.
Andres, nice work!

Sorry for the off-topic.

Some of my work [1] related to the support of index hint bits on
standby is highly interfering with this patch.
Is it safe to consider it committed and start rebasing on top of the patches?

Sorry, I missed this email. Since they're now committed, yes, it is safe
;)

Greetings,

Andres Freund

#104Ian Barwick
ian.barwick@2ndquadrant.com
In reply to: Andres Freund (#102)
1 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

On 2020/09/09 2:53, Andres Freund wrote:

Hi,

On 2020-09-08 16:44:17 +1200, Thomas Munro wrote:

On Tue, Sep 8, 2020 at 4:11 PM Andres Freund <andres@anarazel.de> wrote:

At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

I prototyped a TAP test patch that could maybe do the sort of thing
you need, in patch 0006 over at [1]. Later versions of that patch set
dropped it, because I figured out how to use the isolation tester
instead, but I guess you can't do that for a standby test (at least
not until someone teaches the isolation tester to support multi-node
schedules, something that would be extremely useful...).

Unfortunately a proper multi-node isolationtester test is basically
equivalent to building a global lock graph. I think, at least? Including
a need to be able to correlate connections with their locks between the
nodes.

But for something like the bug at hand it'd probably be sufficient to just
"hack" something with dblink. In session 1) insert a row on the primary
using dblink, return the LSN, wait for the LSN to have replicated and
finally in session 2) check for row visibility.

The attached seems to do the trick.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

standby-row-visibility-test.v1.patch (text/x-patch; charset=UTF-8)
commit b31d587d71f75115b02dd1bf6230a56722c67832
Author: Ian Barwick <ian@2ndquadrant.com>
Date:   Wed Sep 9 14:37:40 2020 +0900

    test for standby row visibility

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index fa8e031526..2d9a9701fc 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-------------------------------------------------------------------------
 
-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/dblink
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
new file mode 100644
index 0000000000..5f591d131e
--- /dev/null
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -0,0 +1,84 @@
+# Checks that a standby session can see all expected rows
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+	'CREATE EXTENSION dblink');
+
+
+# Add an arbitrary table
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.foo (id INT)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+sleep(5);
+# Check row visibility in an existing standby session
+
+my ($res, $stdout, $stderr) = $node_standby->psql(
+    'postgres',
+    sprintf(
+        <<'EO_SQL',
+DO $$
+  DECLARE
+    primary_lsn pg_lsn;
+    insert_xmin xid;
+    standby_rec RECORD;
+  BEGIN
+    SELECT INTO primary_lsn, insert_xmin
+           t1.primary_lsn, t1.xmin
+           FROM dblink(
+              'host=%s port=%i dbname=postgres',
+              'INSERT INTO public.foo VALUES (1) RETURNING pg_catalog.pg_current_wal_lsn(), xmin'
+           ) AS t1(primary_lsn pg_lsn, xmin xid);
+
+    LOOP
+      EXIT WHEN pg_catalog.pg_last_wal_replay_lsn() > primary_lsn;
+    END LOOP;
+
+    SELECT INTO standby_rec
+           id
+      FROM public.foo
+     WHERE id = 1 AND xmin = insert_xmin;
+
+    IF FOUND
+      THEN
+        RAISE NOTICE 'row found';
+      ELSE
+        RAISE NOTICE 'row not found';
+    END IF;
+
+  END;
+$$;
+EO_SQL
+        $node_primary->host,
+        $node_primary->port,
+    ),
+);
+
+
+like (
+    $stderr,
+    qr/row found/,
+    'check that inserted row is visible on the standby',
+);
+
+$node_primary->stop;
+$node_standby->stop;
#105Ian Barwick
ian.barwick@2ndquadrant.com
In reply to: Ian Barwick (#99)
1 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

On 2020/09/08 13:23, Ian Barwick wrote:

On 2020/09/08 13:11, Andres Freund wrote:

Hi,

On 2020-09-08 13:03:01 +0900, Ian Barwick wrote:

(...)

I wonder if it's possible to increment "xactCompletionCount"
during replay along these lines:

     *** a/src/backend/access/transam/xact.c
     --- b/src/backend/access/transam/xact.c
     *************** xact_redo_commit(xl_xact_parsed_commit *
     *** 5915,5920 ****
     --- 5915,5924 ----
              */
             if (XactCompletionApplyFeedback(parsed->xinfo))
                     XLogRequestWalReceiverReply();
     +
     +       LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
     +       ShmemVariableCache->xactCompletionCount++;
     +       LWLockRelease(ProcArrayLock);
       }

which seems to work (though quite possibly I've overlooked something I don't
know that I don't know about and it will all break horribly somewhere,
etc. etc.).

We'd also need the same in a few more places. Probably worth looking at
the list where we increment it on the primary (particularly we need to
also increment it for aborts, and 2pc commit/aborts).

Yup.

At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...
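
To make the "no cached snapshot around" point concrete, here is a deliberately simplified toy model, not the real procarray.c code. It assumes that a session may reuse its cached snapshot as long as a shared completion counter is unchanged; all type and function names below are made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the shared state; names are illustrative only. */
typedef struct SharedState
{
	uint32_t	latest_completed_xid;	/* newest xid known to have finished */
	uint64_t	completion_count;	/* meant to advance whenever an xact ends */
} SharedState;

/* Snapshot cached by one long-lived session. */
typedef struct Snapshot
{
	bool		valid;
	uint32_t	xmax;			/* xids >= xmax are treated as invisible */
	uint64_t	built_at_count;		/* completion_count when this was built */
} Snapshot;

/* Reuse the cached snapshot if the counter says nothing finished since. */
static const Snapshot *
get_snapshot(const SharedState *shared, Snapshot *cache)
{
	if (cache->valid && cache->built_at_count == shared->completion_count)
		return cache;

	cache->xmax = shared->latest_completed_xid + 1;
	cache->built_at_count = shared->completion_count;
	cache->valid = true;
	return cache;
}

/*
 * Model of replaying a commit or abort record.  bump_counter = false models
 * the bug: the xid horizon advances, but cached snapshots are never told.
 */
static void
replay_completion(SharedState *shared, uint32_t xid, bool bump_counter)
{
	if (xid > shared->latest_completed_xid)
		shared->latest_completed_xid = xid;
	if (bump_counter)
		shared->completion_count++;
}
```

In this model a brand-new session (the one-psql-per-query pattern) always builds a fresh snapshot and so never notices the missing increment, while a session that took a snapshot before the replay keeps returning the stale one.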

Do you want to try to write a patch?

Sure, I'll give it a go as I have some time right now.

Attached, though bear in mind I'm not very familiar with parts of this,
particularly 2PC stuff, so consider it educated guesswork.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

snapshot-cache-standby-fix.v1.patch (text/x-patch; charset=UTF-8)
commit 544e2b1661413fe08e3083f03063c12c0d7cf3aa
Author: Ian Barwick <ian@2ndquadrant.com>
Date:   Tue Sep 8 12:24:14 2020 +0900

    Fix snapshot caching on standbys
    
    Addresses issue introduced in 623a9ba.

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..227d03bbce 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3320,6 +3320,14 @@ multixact_redo(XLogReaderState *record)
 	}
 	else
 		elog(PANIC, "multixact_redo: unknown op code %u", info);
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 Datum
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb1..04ca858918 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5915,6 +5915,14 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	 */
 	if (XactCompletionApplyFeedback(parsed->xinfo))
 		XLogRequestWalReceiverReply();
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 /*
@@ -5978,6 +5986,14 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 void
#106Andres Freund
andres@anarazel.de
In reply to: Ian Barwick (#105)
1 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-09 17:02:58 +0900, Ian Barwick wrote:

Attached, though bear in mind I'm not very familiar with parts of this,
particularly 2PC stuff, so consider it educated guesswork.

Thanks for this, and the test case!

Your commit fixes the issues, but not quite correctly. Multixacts
shouldn't matter, so we don't need to do anything there. And for the
increases, I think they should be inside the already existing
ProcArrayLock acquisition, as in the attached.
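
The locking point can be sketched in isolation. This is a toy model using a pthread mutex in place of ProcArrayLock, with made-up names; it only illustrates updating both values in one critical section, so that a reader holding the lock never sees the xid horizon advanced without the counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Toy model; field names are illustrative, not the real shared-memory layout. */
typedef struct ProcArrayModel
{
	pthread_mutex_t lock;
	uint32_t	latest_completed_xid;
	uint64_t	completion_count;
} ProcArrayModel;

static void
init_model(ProcArrayModel *arr)
{
	pthread_mutex_init(&arr->lock, NULL);
	arr->latest_completed_xid = 0;
	arr->completion_count = 0;
}

/*
 * Model of expiring a finished transaction during recovery: the horizon and
 * the counter are advanced under a single lock acquisition, rather than
 * taking the lock a second time just for the counter.
 */
static void
expire_xid(ProcArrayModel *arr, uint32_t xid)
{
	pthread_mutex_lock(&arr->lock);
	if (xid > arr->latest_completed_xid)
		arr->latest_completed_xid = xid;
	arr->completion_count++;
	pthread_mutex_unlock(&arr->lock);
}

/* Read both values as one consistent pair. */
static void
read_state(ProcArrayModel *arr, uint32_t *xid, uint64_t *count)
{
	pthread_mutex_lock(&arr->lock);
	*xid = arr->latest_completed_xid;
	*count = arr->completion_count;
	pthread_mutex_unlock(&arr->lock);
}
```

Besides saving a lock round trip per replayed record, this keeps the two fields from being observed in a half-updated state.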

I've also included a quite heavily revised version of your test. Instead
of using dblink I went for having a long-running psql that I feed over
stdin. The main reason for not liking the previous version is that it
seems fragile, with the sleep and everything. I expanded it to cover
2PC as well.

The test probably needs a bit of cleanup, wrapping some of the
redundancy around the pump_until calls.
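
For readers unfamiliar with the pattern, the idea behind pump_until (wait for expected output, but give up on a deadline or when the child goes away, instead of sleeping for a fixed time) can be sketched in C with poll(). This is an analogy to the Perl helper under POSIX assumptions, not a translation of it; read_until() is a hypothetical name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Read from fd into buf until `needle` shows up.  Returns false if the
 * overall deadline passes, the writer closes its end ("program died"),
 * the buffer fills up, or an error occurs.
 */
static bool
read_until(int fd, char *buf, size_t bufsize, const char *needle,
		   int timeout_ms)
{
	size_t		used = 0;
	struct timespec start;
	struct timespec now;

	buf[0] = '\0';
	clock_gettime(CLOCK_MONOTONIC, &start);

	for (;;)
	{
		struct pollfd pfd;
		long		elapsed_ms;
		ssize_t		nread;

		if (strstr(buf, needle) != NULL)
			return true;
		if (used + 1 >= bufsize)
			return false;		/* no room left for more output */

		clock_gettime(CLOCK_MONOTONIC, &now);
		elapsed_ms = (now.tv_sec - start.tv_sec) * 1000L +
			(now.tv_nsec - start.tv_nsec) / 1000000L;
		if (elapsed_ms >= timeout_ms)
			return false;		/* deadline passed */

		pfd.fd = fd;
		pfd.events = POLLIN;
		pfd.revents = 0;
		if (poll(&pfd, 1, (int) (timeout_ms - elapsed_ms)) <= 0)
			return false;		/* timed out, or poll() failed */

		nread = read(fd, buf + used, bufsize - used - 1);
		if (nread <= 0)
			return false;		/* EOF or read error */
		used += (size_t) nread;
		buf[used] = '\0';
	}
}
```

The caller gets a clear pass/fail signal and the captured output for diagnostics, which is the property that makes this less fragile than a fixed sleep.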

I think the approach of having a long running psql session is really
useful, and probably would speed up some tests. Does anybody have a good
idea for how best, and without undue effort, to integrate this into
PostgresNode.pm? I don't really have a great idea, so I think I'd leave
it with a local helper in the new test?

Regards,

Andres

Attachments:

v2-0001-WIP-fix-and-test-snapshot-behaviour-on-standby.patch (text/x-diff; charset=us-ascii)
From a637b65fc53b208857e0d3d17141d8ed3609036f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 14 Sep 2020 16:08:25 -0700
Subject: [PATCH v2] WIP: fix and test snapshot behaviour on standby.

Reported-By: Ian Barwick <ian.barwick@2ndquadrant.com>
Author: Andres Freund <andres@anarazel.de>
Author: Ian Barwick <ian.barwick@2ndquadrant.com>
Discussion: https://postgr.es/m/61291ffe-d611-f889-68b5-c298da9fb18f@2ndquadrant.com
---
 src/backend/storage/ipc/procarray.c       |   3 +
 src/test/recovery/t/021_row_visibility.pl | 227 ++++++++++++++++++++++
 2 files changed, 230 insertions(+)
 create mode 100644 src/test/recovery/t/021_row_visibility.pl

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 802b119c490..fffa5f7a93e 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4280,6 +4280,9 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
+	/* ... and xactCompletionCount */
+	ShmemVariableCache->xactCompletionCount++;
+
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
new file mode 100644
index 00000000000..08713fa2686
--- /dev/null
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -0,0 +1,227 @@
+# Checks that a standby session can see all expected rows
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 10;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_primary->start;
+
+# Initialize with empty test table
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.test_visibility (data text not null)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(5);
+
+
+# One psql to primary for all queries. That allows to check
+# uncommitted changes being replicated and such.
+my ($psql_primary_stdin, $psql_primary_stdout, $psql_primary_stderr) = ('', '', '');
+my $psql_primary = IPC::Run::start(
+	[
+		'psql', '-X', '-qAe', '-f', '-', '-d',
+		$node_primary->connstr('postgres')
+	],
+	'<',
+	\$psql_primary_stdin,
+	'>',
+	\$psql_primary_stdout,
+	'2>',
+	\$psql_primary_stderr,
+	$psql_timeout);
+
+# One psql to standby for all queries. That allows to reuse the same
+# session for multiple queries, which is important to detect some
+# types of errors.
+my ($psql_standby_stdin, $psql_standby_stdout, $psql_standby_stderr) = ('', '', '');
+my $psql_standby = IPC::Run::start(
+	[
+		'psql', '-X', '-qAe', '-f', '-', '-d',
+		$node_standby->connstr('postgres')
+	],
+	'<',
+	\$psql_standby_stdin,
+	'>',
+	\$psql_standby_stdout,
+	'2>',
+	\$psql_standby_stderr,
+	$psql_timeout);
+
+#
+# 1. Check initial data is the same
+#
+$psql_standby_stdin .= q[
+SELECT * FROM test_visibility ORDER BY data;
+  ];
+ok(pump_until($psql_standby, \$psql_standby_stdout, qr/0 rows/m),
+   'data not visible');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+
+#
+# 2. Check if an INSERT is replayed and visible
+#
+$node_primary->psql('postgres', "INSERT INTO test_visibility VALUES ('first insert')");
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('insert'));
+
+$psql_standby_stdin .= q[
+SELECT * FROM test_visibility ORDER BY data;
+  ];
+ok(pump_until($psql_standby, \$psql_standby_stdout, qr/first insert/m),
+   'insert visible');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+
+#
+# 3. Verify that uncommitted changes aren't visible.
+#
+$psql_primary_stdin .= q[
+BEGIN;
+UPDATE test_visibility SET data = 'first update' RETURNING data;
+  ];
+ok(pump_until($psql_primary, \$psql_primary_stdout, qr/first update/m),
+   'UPDATE');
+
+# ensure WAL flush
+$node_primary->psql('postgres', "SELECT txid_current();");
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('insert'));
+
+$psql_standby_stdin .= q[
+SELECT * FROM test_visibility ORDER BY data;
+  ];
+ok(pump_until($psql_standby, \$psql_standby_stdout, qr/first insert/m),
+   'uncommitted update invisible');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+#
+# 4. That a commit turns 3. visible
+#
+$psql_primary_stdin .= q[
+COMMIT;
+  ];
+ok(pump_until($psql_primary, \$psql_primary_stdout, qr/first update/m),
+   'COMMIT');
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('insert'));
+
+$psql_standby_stdin .= q[
+SELECT * FROM test_visibility ORDER BY data;
+  ];
+ok(pump_until($psql_standby, \$psql_standby_stdout, qr/first update/m),
+   'committed update visible');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+#
+# 5. Check that changes in prepared xacts is invisible
+#
+$psql_primary_stdin .= q[
+DELETE from test_visibility;
+BEGIN;
+INSERT INTO test_visibility VALUES('inserted in prepared will_commit');
+PREPARE TRANSACTION 'will_commit';
+  ];
+ok(pump_until($psql_primary, \$psql_primary_stdout, qr/first update/m),
+   'prepared will_commit');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+$psql_primary_stdin .= q[
+BEGIN;
+INSERT INTO test_visibility VALUES('inserted in prepared will_abort');
+PREPARE TRANSACTION 'will_abort';
+  ];
+ok(pump_until($psql_primary, \$psql_primary_stdout, qr/PREPARE/m),
+   'prepared will_abort');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+# ensure WAL flush
+$node_primary->psql('postgres', "SELECT txid_current();");
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('insert'));
+
+$psql_standby_stdin .= q[
+SELECT * FROM test_visibility ORDER BY data;
+  ];
+ok(pump_until($psql_standby, \$psql_standby_stdout, qr/0 rows/m),
+   'uncommitted prepared invisible');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+# For some variation, finish prepared xacts via separate connections
+$node_primary->safe_psql('postgres',
+	"COMMIT PREPARED 'will_commit';");
+$node_primary->safe_psql('postgres',
+	"ROLLBACK PREPARED 'will_abort';");
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('insert'));
+
+$psql_standby_stdin .= q[
+SELECT * FROM test_visibility ORDER BY data;
+  ];
+ok(pump_until($psql_standby, \$psql_standby_stdout, qr/will_commit.*\n.*1 row/m),
+   'finished prepared visible');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+$node_primary->stop;
+$node_standby->stop;
+
+
+# Pump until string is matched, or timeout occurs
+sub pump_until
+{
+	my ($proc, $stream, $untl) = @_;
+	$proc->pump_nb();
+	while (1)
+	{
+		last if $$stream =~ /$untl/;
+		if ($psql_timeout->is_expired)
+		{
+			diag("aborting wait: program timed out");
+			diag("stream contents: >>", $$stream, "<<");
+			diag("pattern searched for: ", $untl);
+
+			return 0;
+		}
+		if (not $proc->pumpable())
+		{
+			diag("aborting wait: program died");
+			diag("stream contents: >>", $$stream, "<<");
+			diag("pattern searched for: ", $untl);
+
+			return 0;
+		}
+		$proc->pump();
+	}
+	return 1;
+
+}
-- 
2.25.0.114.g5b0ca878e0

#107Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#106)
Re: Improving connection scalability: GetSnapshotData()

Andres Freund <andres@anarazel.de> writes:

I think the approach of having a long running psql session is really
useful, and probably would speed up some tests. Does anybody have a good
idea for how best, and without undue effort, to integrate this into
PostgresNode.pm? I don't really have a great idea, so I think I'd leave
it with a local helper in the new test?

You could use the interactive_psql infrastructure that already exists
for psql/t/010_tab_completion.pl. That does rely on IO::Pty, but
I think I'd prefer to accept that dependency for such tests over rolling
our own IPC::Run, which is more or less what you've done here.

regards, tom lane

#108Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#107)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-14 20:14:48 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I think the approach of having a long running psql session is really
useful, and probably would speed up some tests. Does anybody have a good
idea for how best, and without undue effort, to integrate this into
PostgresNode.pm? I don't really have a great idea, so I think I'd leave
it with a local helper in the new test?

You could use the interactive_psql infrastructure that already exists
for psql/t/010_tab_completion.pl. That does rely on IO::Pty, but
I think I'd prefer to accept that dependency for such tests over rolling
our own IPC::Run, which is more or less what you've done here.

My test uses IPC::Run - although I'm indirectly 'use'ing, which I guess
isn't pretty. Just as 013_crash_restart.pl already did (even before
psql/t/010_tab_completion.pl). I am mostly wondering whether we could
avoid copying the utility functions into multiple test files...

Does IO::Pty work on windows? Given that currently the test doesn't use
a pty and that there's no benefit I can see in requiring one, I'm a bit
hesitant to go there?

Greetings,

Andres Freund

#109Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#108)
Re: Improving connection scalability: GetSnapshotData()

On Mon, Sep 14, 2020 at 05:42:51PM -0700, Andres Freund wrote:

My test uses IPC::Run - although I'm indirectly 'use'ing, which I guess
isn't pretty. Just as 013_crash_restart.pl already did (even before
psql/t/010_tab_completion.pl). I am mostly wondering whether we could
avoid copying the utility functions into multiple test files...

Does IO::Pty work on windows? Given that currently the test doesn't use
a pty and that there's no benefit I can see in requiring one, I'm a bit
hesitant to go there?

Per https://metacpan.org/pod/IO::Tty:
"Windows is now supported, but ONLY under the Cygwin environment, see
http://sources.redhat.com/cygwin/."

So I would suggest making this a soft dependency (as Tom is
hinting?), and not worry about Windows specifically. It is not like
what we are dealing with here is specific to Windows anyway, so you
would have already sufficient coverage. I would not mind if any
refactoring is done later, once we know that the proposed test is
stable in the buildfarm as we would get a better image of what part of
the facility overlaps across multiple tests.
--
Michael

#110Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#109)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-15 11:56:24 +0900, Michael Paquier wrote:

On Mon, Sep 14, 2020 at 05:42:51PM -0700, Andres Freund wrote:

My test uses IPC::Run - although I'm indirectly 'use'ing, which I guess
isn't pretty. Just as 013_crash_restart.pl already did (even before
psql/t/010_tab_completion.pl). I am mostly wondering whether we could
avoid copying the utility functions into multiple test files...

Does IO::Pty work on windows? Given that currently the test doesn't use
a pty and that there's no benefit I can see in requiring one, I'm a bit
hesitant to go there?

Per https://metacpan.org/pod/IO::Tty:
"Windows is now supported, but ONLY under the Cygwin environment, see
http://sources.redhat.com/cygwin/."

So I would suggest making this a soft dependency (as Tom is
hinting?), and not worry about Windows specifically. It is not like
what we are dealing with here is specific to Windows anyway, so you
would have already sufficient coverage. I would not mind if any
refactoring is done later, once we know that the proposed test is
stable in the buildfarm as we would get a better image of what part of
the facility overlaps across multiple tests.

I'm confused - the test as posted should work on windows, and we already
do this in an existing test (src/test/recovery/t/013_crash_restart.pl). What's
the point in adding a platform-specific dependency here?

Greetings,

Andres Freund

#111Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#106)
1 attachment(s)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-09-14 16:17:18 -0700, Andres Freund wrote:

I've also included a quite heavily revised version of your test. Instead
of using dblink I went for having a long-running psql that I feed over
stdin. The main reason for not liking the previous version is that it
seems fragile, with the sleep and everything. I expanded it to cover
2PC as well.

The test probably needs a bit of cleanup, wrapping some of the
redundancy around the pump_until calls.

I think the approach of having a long running psql session is really
useful, and probably would speed up some tests. Does anybody have a good
idea for how best, and without undue effort, to integrate this into
PostgresNode.pm? I don't really have a great idea, so I think I'd leave
it with a local helper in the new test?

Attached is an updated version of the test (better utility function,
stricter regexes, bailing out instead of failing just the current test when
psql times out). I'm leaving it in this test for now, but it's fairly
easy to use this way, in my opinion, so it may be worth moving to
PostgresNode at some point.
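
On the "stricter regexes" point: the earlier version of the test waited for qr/0 rows/m, while the updated one waits for qr/^\(0 rows\)$/m. The difference can be demonstrated with POSIX regular expressions, where REG_NEWLINE gives ^ and $ a per-line meaning much like Perl's /m. The sketch below is only an illustration; matches() is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if the POSIX extended regex `pattern` matches anywhere in
 * `text`.  With REG_NEWLINE, ^ and $ also match at line boundaries.
 */
static bool
matches(const char *pattern, const char *text)
{
	regex_t		re;
	bool		found;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NEWLINE | REG_NOSUB) != 0)
		return false;			/* treat an invalid pattern as "no match" */

	found = (regexec(&re, text, 0, NULL, 0) == 0);
	regfree(&re);
	return found;
}
```

The loose pattern "0 rows" also matches a footer such as "(10 rows)", so a test could pass while seeing the wrong result; the anchored pattern "^\(0 rows\)$" accepts only the exact footer line.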

Greetings,

Andres Freund

Attachments:

v3-0001-Fix-and-test-snapshot-behaviour-on-standby.patch (text/x-diff; charset=us-ascii)
From 2ca4fa9de369aeba0d6386ec7d749cb366259728 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 14 Sep 2020 16:08:25 -0700
Subject: [PATCH v3] Fix and test snapshot behaviour on standby.

I (Andres) broke this in 623a9ba79bb, because I didn't think about the
way snapshots are built on standbys sufficiently. Unfortunately our
existing tests did not catch this, as they are all just querying with
psql (therefore ending up with fresh snapshots).

The fix is trivial, we just need to increment the completion counter
in ExpireTreeKnownAssignedTransactionIds(), which is the equivalent of
ProcArrayEndTransaction() during recovery.

This commit also adds a new test doing some basic testing of the
correctness of snapshots built on standbys. To avoid the
aforementioned issue of one-shot psql's not exercising the snapshot
caching, the test uses long-lived psqls, similar to
013_crash_restart.pl. It'd be good to extend the test further.

Reported-By: Ian Barwick <ian.barwick@2ndquadrant.com>
Author: Andres Freund <andres@anarazel.de>
Author: Ian Barwick <ian.barwick@2ndquadrant.com>
Discussion: https://postgr.es/m/61291ffe-d611-f889-68b5-c298da9fb18f@2ndquadrant.com
---
 src/backend/storage/ipc/procarray.c       |   3 +
 src/test/recovery/t/021_row_visibility.pl | 192 ++++++++++++++++++++++
 2 files changed, 195 insertions(+)
 create mode 100644 src/test/recovery/t/021_row_visibility.pl

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 5aaeb6e2b55..07c5eeb7495 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4280,6 +4280,9 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
+	/* ... and xactCompletionCount */
+	ShmemVariableCache->xactCompletionCount++;
+
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
new file mode 100644
index 00000000000..95516b05d01
--- /dev/null
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -0,0 +1,192 @@
+# Checks that snapshots on standbys behave in a minimally reasonable
+# way.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 10;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_primary->start;
+
+# Initialize with empty test table
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.test_visibility (data text not null)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(30);
+
+# One psql to primary and standby each, for all queries. That allows
+# to check uncommitted changes being replicated and such.
+my %psql_primary = (stdin => '', stdout => '', stderr => '');
+$psql_primary{run} =
+  IPC::Run::start(
+	  ['psql', '-XA', '-f', '-', '-d', $node_primary->connstr('postgres')],
+	  '<', \$psql_primary{stdin},
+	  '>', \$psql_primary{stdout},
+	  '2>', \$psql_primary{stderr},
+	  $psql_timeout);
+
+my %psql_standby = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby{run} =
+  IPC::Run::start(
+	  ['psql', '-XA', '-f', '-', '-d', $node_standby->connstr('postgres')],
+	  '<', \$psql_standby{stdin},
+	  '>', \$psql_standby{stdout},
+	  '2>', \$psql_standby{stderr},
+	  $psql_timeout);
+
+#
+# 1. Check initial data is the same
+#
+ok(send_query_and_wait(\%psql_standby,
+					   q/SELECT * FROM test_visibility ORDER BY data;/,
+					   qr/^\(0 rows\)$/m),
+   'data not visible');
+
+#
+# 2. Check if an INSERT is replayed and visible
+#
+$node_primary->psql('postgres', "INSERT INTO test_visibility VALUES ('first insert')");
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('insert'));
+
+ok(send_query_and_wait(\%psql_standby,
+					   q[SELECT * FROM test_visibility ORDER BY data;],
+					   qr/first insert.*\n\(1 row\)/m),
+  'insert visible');
+
+#
+# 3. Verify that uncommitted changes aren't visible.
+#
+ok(send_query_and_wait(\%psql_primary,
+					   q[
+BEGIN;
+UPDATE test_visibility SET data = 'first update' RETURNING data;
+					   ],
+					   qr/^UPDATE 1$/m),
+   'UPDATE');
+
+$node_primary->psql('postgres', "SELECT txid_current();"); # ensure WAL flush
+$node_primary->wait_for_catchup($node_standby, 'replay',
+								$node_primary->lsn('insert'));
+
+ok(send_query_and_wait(\%psql_standby,
+					   q[SELECT * FROM test_visibility ORDER BY data;],
+					   qr/first insert.*\n\(1 row\)/m),
+   'uncommitted update invisible');
+
+#
+# 4. That a commit turns 3. visible
+#
+ok(send_query_and_wait(\%psql_primary,
+					   q[COMMIT;],
+					   qr/^COMMIT$/m),
+   'COMMIT');
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('insert'));
+
+ok(send_query_and_wait(\%psql_standby,
+					   q[SELECT * FROM test_visibility ORDER BY data;],
+					   qr/first update\n\(1 row\)$/m),
+   'committed update visible');
+
+#
+# 5. Check that changes in prepared xacts is invisible
+#
+ok(send_query_and_wait(\%psql_primary, q[
+DELETE from test_visibility; -- delete old data, so we start with clean slate
+BEGIN;
+INSERT INTO test_visibility VALUES('inserted in prepared will_commit');
+PREPARE TRANSACTION 'will_commit';],
+					   qr/^PREPARE TRANSACTION$/m),
+   'prepared will_commit');
+
+ok(send_query_and_wait(\%psql_primary, q[
+BEGIN;
+INSERT INTO test_visibility VALUES('inserted in prepared will_abort');
+PREPARE TRANSACTION 'will_abort';
+					   ],
+					   qr/^PREPARE TRANSACTION$/m),
+   'prepared will_abort');
+
+$node_primary->wait_for_catchup($node_standby, 'replay',
+								$node_primary->lsn('insert'));
+
+ok(send_query_and_wait(\%psql_standby,
+					   q[SELECT * FROM test_visibility ORDER BY data;],
+					   qr/^\(0 rows\)$/m),
+   'uncommitted prepared invisible');
+
+# For some variation, finish prepared xacts via separate connections
+$node_primary->safe_psql('postgres',
+	"COMMIT PREPARED 'will_commit';");
+$node_primary->safe_psql('postgres',
+	"ROLLBACK PREPARED 'will_abort';");
+$node_primary->wait_for_catchup($node_standby, 'replay',
+	$node_primary->lsn('insert'));
+
+ok(send_query_and_wait(\%psql_standby,
+					   q[SELECT * FROM test_visibility ORDER BY data;],
+					   qr/will_commit.*\n\(1 row\)$/m),
+   'finished prepared visible');
+
+$node_primary->stop;
+$node_standby->stop;
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+	my ($psql, $query, $untl) = @_;
+	my $ret;
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1)
+	{
+		last if $$psql{stdout} =~ /$untl/;
+
+		if ($psql_timeout->is_expired)
+		{
+			BAIL_OUT("aborting wait: program timed out\n".
+					 "stream contents: >>$$psql{stdout}<<\n".
+					 "pattern searched for: $untl\n");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable())
+		{
+			BAIL_OUT("aborting wait: program died\n".
+					 "stream contents: >>$$psql{stdout}<<\n".
+					 "pattern searched for: $untl\n");
+			return 0;
+		}
+		$$psql{run}->pump();
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
-- 
2.25.0.114.g5b0ca878e0

#112Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#106)
Re: Improving connection scalability: GetSnapshotData()

On Tue, 15 Sep 2020 at 07:17, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-09-09 17:02:58 +0900, Ian Barwick wrote:

Attached, though bear in mind I'm not very familiar with parts of this,
particularly 2PC stuff, so consider it educated guesswork.

Thanks for this, and the test case!

Your commit fixes the issues, but not quite correctly. Multixacts
shouldn't matter, so we don't need to do anything there. And for the
increases, I think they should be inside the already existing
ProcArrayLock acquisition, as in the attached.

I've also included a quite heavily revised version of your test. Instead
of using dblink I went for having a long-running psql that I feed over
stdin. The main reason for not liking the previous version is that it
seems fragile, with the sleep and everything. I expanded it to cover
2PC as well.

The test probably needs a bit of cleanup, wrapping some of the
redundancy around the pump_until calls.

I think the approach of having a long running psql session is really
useful, and probably would speed up some tests. Does anybody have a good
idea for how best, and without undue effort, to integrate this into
PostgresNode.pm? I don't really have a great idea, so I think I'd leave
it with a local helper in the new test?

2ndQ has some infra for that and various other TAP enhancements that
I'd like to try to upstream. I'll ask what I can share and how.

--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise

#113Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#111)
Re: Improving connection scalability: GetSnapshotData()

Hi Ian, Andrew, All,

On 2020-09-30 15:43:17 -0700, Andres Freund wrote:

Attached is an updated version of the test (better utility function,
stricter regexes, bailing out instead of failing just the current test when
psql times out). I'm leaving it in this test for now, but it's fairly
easy to use this way, in my opinion, so it may be worth moving to
PostgresNode at some point.

I pushed this yesterday. Ian, thanks again for finding this and helping
with fixing & testing.

Unfortunately currently some buildfarm animals don't like the test for
reasons I don't quite understand. Looks like it's all windows + msys
animals that run the tap tests. On jacana and fairywren the new test
fails, but with a somewhat confusing problem:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-10-01%2015%3A32%3A34
Bail out! aborting wait: program timed out
# stream contents: >>data
# (0 rows)
# <<
# pattern searched for: (?m-xis:^\\(0 rows\\)$)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-10-01%2014%3A12%3A13
Bail out! aborting wait: program timed out
stream contents: >>data
(0 rows)
<<
pattern searched for: (?^m:^\\(0 rows\\)$)

I don't know what the -xis indicates on jacana, and why it's not present
on fairywren. Nor do I know why the pattern doesn't match in the first
place, sure looks like it should?

Andrew, do you have an insight into how mingw's regex match differs
from native windows and proper unixoid systems? I guess it's somewhere
around line endings or such?

Jacana successfully deals with 013_crash_restart.pl, which does use the
same mechanism as the new 021_row_visibility.pl - I think the only real
difference is that I used ^ and $ in the regexes in the latter...

Greetings,

Andres Freund

#114Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Andres Freund (#113)
Re: Improving connection scalability: GetSnapshotData()

On 10/1/20 2:26 PM, Andres Freund wrote:

Hi Ian, Andrew, All,

On 2020-09-30 15:43:17 -0700, Andres Freund wrote:

Attached is an updated version of the test (better utility function,
stricter regexes, bailing out instead of failing just the current test when
psql times out). I'm leaving it in this test for now, but it's fairly
easy to use this way, in my opinion, so it may be worth moving to
PostgresNode at some point.

I pushed this yesterday. Ian, thanks again for finding this and helping
with fixing & testing.

Unfortunately currently some buildfarm animals don't like the test for
reasons I don't quite understand. Looks like it's all windows + msys
animals that run the tap tests. On jacana and fairywren the new test
fails, but with a somewhat confusing problem:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-10-01%2015%3A32%3A34
Bail out! aborting wait: program timed out
# stream contents: >>data
# (0 rows)
# <<
# pattern searched for: (?m-xis:^\\(0 rows\\)$)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-10-01%2014%3A12%3A13
Bail out! aborting wait: program timed out
stream contents: >>data
(0 rows)
<<
pattern searched for: (?^m:^\\(0 rows\\)$)

I don't know what the -xis indicates on jacana, and why it's not present
on fairywren. Nor do I know why the pattern doesn't match in the first
place, sure looks like it should?

Andrew, do you have an insight into how mingw's regex match differs
from native windows and proper unixoid systems? I guess it's somewhere
around line endings or such?

Jacana successfully deals with 013_crash_restart.pl, which does use the
same mechanism as the new 021_row_visibility.pl - I think the only real
difference is that I used ^ and $ in the regexes in the latter...

My strong suspicion is that we're getting unwanted CRs. Note the
presence of numerous instances of this in PostgresNode.pm:

$stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

So you probably want something along those lines at the top of the loop
in send_query_and_wait:

$$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

possibly also for stderr, just to make it more futureproof, and at the
top of the file:

use Config;

Do you want me to test that first?
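
To make the suspected failure concrete, here is a small standalone sketch (not part of the committed test) of how a stray CR defeats the anchored pattern. The two sample buffers are invented to resemble the psql output in the buildfarm logs; only the regex is taken from the test.

```perl
use strict;
use warnings;

# The pattern used by the test: with /m, "$" matches at the end of the
# string or just before a "\n", but not before a "\r".
my $pattern = qr/^\(0 rows\)$/m;

# Hypothetical psql output as seen with LF and with CRLF line endings.
my $lf_output   = "data\n(0 rows)\n";
my $crlf_output = "data\r\n(0 rows)\r\n";

my $lf_matches   = ($lf_output   =~ $pattern) ? 1 : 0;
my $crlf_matches = ($crlf_output =~ $pattern) ? 1 : 0;

# Normalizing CRLF to LF, as PostgresNode.pm does on msys, restores the match.
(my $normalized = $crlf_output) =~ s/\r\n/\n/g;
my $normalized_matches = ($normalized =~ $pattern) ? 1 : 0;

print "LF: $lf_matches, CRLF: $crlf_matches, normalized: $normalized_matches\n";
```

This would also explain why 013_crash_restart.pl is unaffected if its patterns are not anchored with $, although that is an inference rather than something checked here.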

The difference between the canonical way perl states the regex is due to
perl version differences. It shouldn't matter.
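
As a quick illustration of that last point: the (?^m:...) form was introduced in perl 5.14, while older perls print (?m-xis:...). The sketch below assumes perl 5.14 or later (so that both spellings compile) and uses an invented sample buffer; it only shows that the two spellings behave the same.

```perl
use strict;
use warnings;

my $sample = "data\n(0 rows)\n";

# Older perls stringify qr/.../m as (?m-xis:...), newer ones (5.14+)
# as (?^m:...). Written out by hand, both spell the same regex.
my $old_style = qr/(?m-xis:^\(0 rows\)$)/;
my $new_style = qr/(?^m:^\(0 rows\)$)/;

my $old_ok = ($sample =~ $old_style) ? 1 : 0;
my $new_ok = ($sample =~ $new_style) ? 1 : 0;

# What this particular perl prints for the pattern used in the test.
my $stringified = "" . qr/^\(0 rows\)$/m;
print "old: $old_ok, new: $new_ok, this perl says: $stringified\n";
```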

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#115Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#114)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote:

My strong suspicion is that we're getting unwanted CRs. Note the
presence of numerous instances of this in PostgresNode.pm:

$stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

So you probably want something along those lines at the top of the loop
in send_query_and_wait:

$$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

Yikes, that's ugly :(.

I assume it's not, as the comment says
# Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
# if we're using native Perl, but not if we're using MSys Perl. So do it
# by hand in the latter case, here and elsewhere.
that IPC::Run converts things, but that native windows perl uses
https://perldoc.perl.org/perlrun#PERLIO
a PERLIO that includes :crlf, whereas msys probably doesn't?

Any chance you could run something like
perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
on both native and msys perl?

possibly also for stderr, just to make it more futureproof, and at the
top of the file:

use Config;

Do you want me to test that first?

That'd be awesome.

The difference between the canonical way perl states the regex is due to
perl version differences. It shouldn't matter.

Thanks!

Greetings,

Andres Freund

#116Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Andres Freund (#115)
Re: Improving connection scalability: GetSnapshotData()

On 10/1/20 4:22 PM, Andres Freund wrote:

Hi,

On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote:

My strong suspicion is that we're getting unwanted CRs. Note the
presence of numerous instances of this in PostgresNode.pm:

$stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

So you probably want something along those lines at the top of the loop
in send_query_and_wait:

$$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

Yikes, that's ugly :(.

I assume it's not, as the comment says
# Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
# if we're using native Perl, but not if we're using MSys Perl. So do it
# by hand in the latter case, here and elsewhere.
that IPC::Run converts things, but that native windows perl uses
https://perldoc.perl.org/perlrun#PERLIO
a PERLIO that includes :crlf, whereas msys probably doesn't?

Any chance you could run something like
perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
on both native and msys perl?

msys (jacana): stdio

native: unixcrlf

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#117Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#116)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-10-01 16:44:03 -0400, Andrew Dunstan wrote:

I assume it's not, as the comment says
# Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
# if we're using native Perl, but not if we're using MSys Perl. So do it
# by hand in the latter case, here and elsewhere.
that IPC::Run converts things, but that native windows perl uses
https://perldoc.perl.org/perlrun#PERLIO
a PERLIO that includes :crlf, whereas msys probably doesn't?

Any chance you could run something like
perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
on both native and msys perl?

msys (jacana): stdio

native: unixcrlf

Interesting. That suggests we could get around needing the if msys
branches in several places by setting PERLIO to unixcrlf somewhere
centrally when using msys.
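
For reference, a standalone sketch of what a :crlf layer does on input. It reads a temporary file rather than psql output, and the file contents and variable names are invented for illustration. Note that print() joins its arguments with nothing by default, so the "unixcrlf" reported above is presumably the two layers unix and crlf.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small file with explicit CRLF line endings (raw bytes, so no
# layer can alter them on the way out).
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':raw');
print {$out} "data\r\n(0 rows)\r\n";
close($out) or die "close failed: $!";

# Read it back with the :crlf layer pushed on top of the default layers.
open(my $in, '<:crlf', $path) or die "cannot open $path: $!";
my @layers = PerlIO::get_layers($in);
my $content = do { local $/; <$in> };
close($in);

my $has_crlf_layer = (grep { $_ eq 'crlf' } @layers) ? 1 : 0;
my $matches = ($content =~ /^\(0 rows\)$/m) ? 1 : 0;

print "layers: @layers\n";
print "crlf layer: $has_crlf_layer, pattern matches: $matches\n";
```

With the layer in place the anchored pattern matches without any explicit s/\r\n/\n/g. Whether the same holds for the pipes IPC::Run sets up under msys is not something this sketch demonstrates.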

Greetings,

Andres Freund

#118Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Andres Freund (#115)
Re: Improving connection scalability: GetSnapshotData()

On 10/1/20 4:22 PM, Andres Freund wrote:

Hi,

On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote:

My strong suspicion is that we're getting unwanted CRs. Note the
presence of numerous instances of this in PostgresNode.pm:

$stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

So you probably want something along those lines at the top of the loop
in send_query_and_wait:

$$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

Yikes, that's ugly :(.

I assume it's not, as the comment says
# Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
# if we're using native Perl, but not if we're using MSys Perl. So do it
# by hand in the latter case, here and elsewhere.
that IPC::Run converts things, but that native windows perl uses
https://perldoc.perl.org/perlrun#PERLIO
a PERLIO that includes :crlf, whereas msys probably doesn't?

Any chance you could run something like
perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
on both native and msys perl?

possibly also for stderr, just to make it more futureproof, and at the
top of the file:

use Config;

Do you want me to test that first?

That'd be awesome.

The change I suggested makes jacana happy.
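
A rough sketch of how that normalization behaves inside a wait loop. This is not the committed code: wait_for_pattern is a hypothetical stand-in for send_query_and_wait that consumes a list of chunks instead of pumping psql through IPC::Run, and the $normalize flag stands in for the $Config{osname} eq 'msys' check.

```perl
use strict;
use warnings;

# Hypothetical stand-in for send_query_and_wait's wait loop: instead of
# pumping a real psql via IPC::Run, output arrives from a list of chunks.
sub wait_for_pattern
{
	my ($chunks, $until, $normalize) = @_;
	my $stdout = '';

	foreach my $chunk (@$chunks)
	{
		$stdout .= $chunk;

		# Normalize the whole accumulated buffer, so a CRLF pair split
		# across two chunks is still collapsed once both halves arrived.
		$stdout =~ s/\r\n/\n/g if $normalize;

		return 1 if $stdout =~ /$until/;
	}
	return 0;    # ran out of output without a match
}

# CRLF output, deliberately split between "\r" and "\n".
my @chunks = ("data\r\n(0 rows)\r", "\n");
my $until  = qr/^\(0 rows\)$/m;

my $without = wait_for_pattern(\@chunks, $until, 0);
my $with    = wait_for_pattern(\@chunks, $until, 1);
print "without normalization: $without, with normalization: $with\n";
```

Applying the substitution to the accumulated buffer on every iteration, rather than to each chunk separately, is what keeps a "\r" at the end of one read and a "\n" at the start of the next from slipping through.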

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#119Ian Barwick
ian.barwick@2ndquadrant.com
In reply to: Andres Freund (#113)
Re: Improving connection scalability: GetSnapshotData()

On 2020/10/02 3:26, Andres Freund wrote:

Hi Ian, Andrew, All,

On 2020-09-30 15:43:17 -0700, Andres Freund wrote:

Attached is an updated version of the test (better utility function,
stricter regexes, bailing out instead of failing just the current test when
psql times out). I'm leaving it in this test for now, but it's fairly
easy to use this way, in my opinion, so it may be worth moving to
PostgresNode at some point.

I pushed this yesterday. Ian, thanks again for finding this and helping
with fixing & testing.

Thanks! Apologies for not getting back to your earlier responses,
have been distracted by Various Other Things.

The tests I run which originally triggered the issue now run just fine.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#120Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#118)
Re: Improving connection scalability: GetSnapshotData()

Hi,

On 2020-10-01 19:21:14 -0400, Andrew Dunstan wrote:

On 10/1/20 4:22 PM, Andres Freund wrote:

# Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
# if we're using native Perl, but not if we're using MSys Perl. So do it
# by hand in the latter case, here and elsewhere.
that IPC::Run converts things, but that native windows perl uses
https://perldoc.perl.org/perlrun#PERLIO
a PERLIO that includes :crlf, whereas msys probably doesn't?

Any chance you could run something like
perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
on both native and msys perl?

possibly also for stderr, just to make it more futureproof, and at the
top of the file:

use Config;

Do you want me to test that first?

That'd be awesome.

The change I suggested makes jacana happy.

Thanks, pushed. Hopefully that fixes the mingw animals.

I wonder if we instead should do something like

# Have mingw perl treat CRLF the same way as perl on native windows does
ifeq ($(build_os),mingw32)
PROVE="PERLIO=unixcrlf $(PROVE)"
endif

in Makefile.global.in? Then we could remove these regexes from all the
various places?

Greetings,

Andres Freund

#121Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#120)
Re: Improving connection scalability: GetSnapshotData()

On 10/5/20 10:33 PM, Andres Freund wrote:

Hi,

On 2020-10-01 19:21:14 -0400, Andrew Dunstan wrote:

On 10/1/20 4:22 PM, Andres Freund wrote:

# Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
# if we're using native Perl, but not if we're using MSys Perl. So do it
# by hand in the latter case, here and elsewhere.
that IPC::Run converts things, but that native windows perl uses
https://perldoc.perl.org/perlrun#PERLIO
a PERLIO that includes :crlf, whereas msys probably doesn't?

Any chance you could run something like
perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
on both native and msys perl?

possibly also for stderr, just to make it more futureproof, and at the
top of the file:

use Config;

Do you want me to test that first?

That'd be awesome.

The change I suggested makes jacana happy.

Thanks, pushed. Hopefully that fixes the mingw animals.

I don't think we're out of the woods yet. This test is also having bad
effects on bowerbird, which is an MSVC animal. It's hanging completely :-(

Digging some more.

cheers

andrew

--
Andrew Dunstan
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#122AJG
ayden@gera.co.nz
In reply to: Andres Freund (#1)
Re: Improving connection scalability: GetSnapshotData()

Hi,

I would greatly appreciate it if you could reply to the following questions as
time allows.

I have seen previous discussion/patches on a built-in connection pooler. How
does this scalability improvement, particularly the idle connection improvements
etc, affect the need for that built-in pooler, if any?

Same general question about an external connection pooler in
production: is it, for example, still required to route to different servers
but no longer needed for the pooling part?

Many Thanks!

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

#123Noname
luis.roberto@siscobra.com.br
In reply to: AJG (#122)
Re: Improving connection scalability: GetSnapshotData()

----- Original message -----

From: "AJG" <ayden@gera.co.nz>
To: "Pg Hackers" <pgsql-hackers@postgresql.org>
Sent: Saturday, February 27, 2021 14:40:58
Subject: Re: Improving connection scalability: GetSnapshotData()

Hi,

I would greatly appreciate it if you could reply to the following questions as
time allows.

I have seen previous discussion/patches on a built-in connection pooler. How
does this scalability improvement, particularly the idle connection improvements
etc, affect the need for that built-in pooler, if any?

Same general question about an external connection pooler in
production: is it, for example, still required to route to different servers
but no longer needed for the pooling part?

Many Thanks!

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

As I understand it, the improvements made to GetSnapshotData() mean that a higher connection count does not incur as much of a performance penalty as before.
I am not sure it solves the connection establishment side of things, but I may be wrong.

Luis R. Weck

#124Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: AJG (#122)
Re: Improving connection scalability: GetSnapshotData()

On 27.02.2021 20:40, AJG wrote:

Hi,

I would greatly appreciate it if you could reply to the following questions as
time allows.

I have seen previous discussion/patches on a built-in connection pooler. How
does this scalability improvement, particularly the idle connection improvements
etc, affect the need for that built-in pooler, if any?

Same general question about an external connection pooler in
production: is it, for example, still required to route to different servers
but no longer needed for the pooling part?

Many Thanks!

A connection pooler is still needed.
The patch for GetSnapshotData() mostly improves the scalability of read-only
queries and reduces contention on the procarray lock.
But read-write workloads cause contention on many other resources:
relation extension locks, buffer locks, tuple locks and so on.

If you run pgbench on a NUMA machine with hundreds of cores, you will
still observe significant performance degradation as the number of
connections increases.
And this degradation will be dramatic if you use a non-uniform
distribution of keys, for example a Zipfian distribution.

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html