optimizing repeated MVCC snapshots
On Tue, Jan 3, 2012 at 2:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Another thought is that it should always be safe to reuse an old
>> snapshot if no transactions have committed or aborted since it was
>> taken.
>
> Yeah, that might work better. And it'd be a win for all MVCC snaps,
> not just the ones coming from promoted SnapshotNow ...
Here's a rough patch for that. Some benchmark results from a 32-core
Itanium server are included below. They look pretty good. Out of the
dozen configurations I tested, all but one came out ahead with the
patch. The loser was 80-clients on permanent tables, but I'm not too
worried about that since 80 clients on unlogged tables came out ahead.
This is not quite in a committable state yet, since it presumes atomic
64-bit reads and writes, and those aren't actually atomic everywhere.
What I'm thinking we can do is: on platforms where 8-byte reads and
writes are known to be atomic, we do as the patch does currently. On
any platform where we don't know that to be the case, we can move the
test I added to GetSnapshotData inside the lock, which should still be
a small win at low concurrencies. At high concurrencies it's a bit
iffy, because making GetSnapshotData's critical section shorter might
lead to lots of people trying to manipulate the ProcArrayLock spinlock
in very rapid succession. Even if that turns out to be an issue, I'm
inclined to believe that anyone who has enough concurrency for that to
matter probably also has atomic 8-byte reads and writes, and so the
most that will be needed is an update to our notion of which platforms
have that capability. If that turns out to be wrong, the other
obvious alternative is to not do the test at all unless it can be done
unlocked.
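
To make that concrete, here's an untested sketch of how the test in
GetSnapshotData could be conditionalized, using the ATOMIC_64BIT_OPS
macro proposed below:

    #ifdef ATOMIC_64BIT_OPS
        /* Unlocked fast path: requires atomic 8-byte reads. */
        pg_read_barrier();
        if (atomic_read_uint64(procArray->xendcount) == snapshot->xendcount)
            return snapshot;
    #endif

        LWLockAcquire(ProcArrayLock, LW_SHARED);

    #ifndef ATOMIC_64BIT_OPS
        /* Fallback: the same test, but done while holding the lock. */
        if (procArray->xendcount == snapshot->xendcount)
        {
            LWLockRelease(ProcArrayLock);
            return snapshot;
        }
    #endif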
To support the above, I'm inclined to add a new file
src/include/atomic.h which optionally defines a macro called
ATOMIC_64BIT_OPS and macros atomic_read_uint64(r) and
atomic_write_uint64(l, r). That way we can eventually support (a)
architectures where 64-bit operations aren't atomic at all, (b)
architectures where ordinary 64-bit operations are atomic
(atomic_read_uint64(r) -> r, and atomic_write_uint64(l, r) -> l = r),
and (c) architectures (like 32-bit x86) where ordinary 64-bit
operations aren't atomic but special instructions (cmpxchg8b) can be
used to get that behavior.
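
An untested sketch of what that header might look like -- the platform
tests shown here are illustrative placeholders rather than worked-out
configure logic, and uint64 is the usual typedef from c.h:

    #ifndef ATOMIC_H
    #define ATOMIC_H

    #if defined(__x86_64__) || defined(__ia64__)
    /* Case (b): aligned 8-byte loads and stores are atomic by themselves. */
    #define ATOMIC_64BIT_OPS
    #define atomic_read_uint64(r)      (r)
    #define atomic_write_uint64(l, r)  ((l) = (r))
    #elif defined(__i386__) && defined(__GNUC__)
    /*
     * Case (c): plain 8-byte accesses aren't atomic, but GCC's __sync
     * builtins on 8-byte operands use lock cmpxchg8b (i586 and later),
     * which gives us the same effect.
     */
    #define ATOMIC_64BIT_OPS
    #define atomic_read_uint64(r)      __sync_val_compare_and_swap(&(r), 0, 0)
    #define atomic_write_uint64(l, r)  atomic_write_uint64_impl(&(l), (r))

    static inline void
    atomic_write_uint64_impl(volatile uint64 *loc, uint64 val)
    {
        uint64      cur = *loc;

        /* Loop until our compare-and-swap installs val atomically. */
        while (!__sync_bool_compare_and_swap(loc, cur, val))
            cur = *loc;
    }
    #endif
    /* Case (a): ATOMIC_64BIT_OPS stays undefined; callers must take the lock. */

    #endif   /* ATOMIC_H */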
m = master, s = with patch. scale factor 100, median of three
5-minute test runs. shared_buffers=8GB, checkpoint_segments=300,
checkpoint_timeout=30min, effective_cache_size=340GB,
wal_buffers=16MB, wal_writer_delay=20ms, listen_addresses='*',
synchronous_commit=off. binary modified with chatr +pd L +pi L and
run with rtsched -s SCHED_NOAGE -p 178.
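
Illustrative pgbench invocations for this kind of run (inferred from the
description above, not copied from the actual test script):

    pgbench -i -s 100 bench
    pgbench -T 300 -c 32 -j 32 bench   # vary -c/-j per client count;
                                       # take the median of three runs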
Permanent Tables
================
m01 tps = 912.865209 (including connections establishing)
s01 tps = 916.848536 (including connections establishing)
m08 tps = 6256.429549 (including connections establishing)
s08 tps = 6364.214425 (including connections establishing)
m16 tps = 10795.373683 (including connections establishing)
s16 tps = 11038.233359 (including connections establishing)
m24 tps = 13710.400042 (including connections establishing)
s24 tps = 13836.823580 (including connections establishing)
m32 tps = 14574.758456 (including connections establishing)
s32 tps = 15125.196227 (including connections establishing)
m80 tps = 12014.498814 (including connections establishing)
s80 tps = 11825.302643 (including connections establishing)
Unlogged Tables
===============
m01 tps = 942.950926 (including connections establishing)
s01 tps = 953.618574 (including connections establishing)
m08 tps = 6492.238255 (including connections establishing)
s08 tps = 6537.197731 (including connections establishing)
m16 tps = 11363.708861 (including connections establishing)
s16 tps = 11561.193527 (including connections establishing)
m24 tps = 14656.659546 (including connections establishing)
s24 tps = 14977.226426 (including connections establishing)
m32 tps = 16310.814143 (including connections establishing)
s32 tps = 16644.921538 (including connections establishing)
m80 tps = 13422.438927 (including connections establishing)
s80 tps = 13780.256723 (including connections establishing)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: optimize-repeated-snapshots.patch
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 6ea0a28..cd0b39c 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,6 +51,7 @@
 #include "access/xact.h"
 #include "access/twophase.h"
 #include "miscadmin.h"
+#include "storage/barrier.h"
 #include "storage/procarray.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
@@ -81,6 +82,9 @@ typedef struct ProcArrayStruct
 	 */
 	TransactionId lastOverflowedXid;
 
+	/* Incremented on each transaction commit/abort. */
+	uint64		xendcount;
+
 	/*
 	 * We declare pgprocnos[] as 1 entry because C wants a fixed-size array, but
 	 * actually it is maxProcs entries long.
@@ -223,7 +227,13 @@ CreateSharedProcArray(void)
 	{
 		/*
 		 * We're the first - initialize.
+		 *
+		 * Note: We need to initialize procArray->xendcount to something other
+		 * than 0, because we're going to later compare it against the
+		 * xendcount of a snapshot to see if anything's changed; and 0 in the
+		 * snapshot means it's as-yet uninitialized.
 		 */
+		procArray->xendcount = 1;
 		procArray->numProcs = 0;
 		procArray->maxProcs = PROCARRAY_MAXPROCS;
 		procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
@@ -410,6 +420,8 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 									  latestXid))
 			ShmemVariableCache->latestCompletedXid = latestXid;
 
+		procArray->xendcount++;
+
 		LWLockRelease(ProcArrayLock);
 	}
 	else
@@ -1245,6 +1257,27 @@ GetSnapshotData(Snapshot snapshot)
 	Assert(snapshot != NULL);
 
 	/*
+	 * If no transactions have committed or aborted since the last time this
+	 * function was called on the passed-in snapshot, we can return without
+	 * doing anything.
+	 *
+	 * Memory ordering effects: It's possible that the procArray->xendcount
+	 * could be fetched by the CPU prior to entering this function, in which
+	 * case we might see an "old" value that matches instead of a "new" value
+	 * that doesn't. But that's not much different than if this function had
+	 * been called slightly sooner in the first place. Just to be on the safe
+	 * side, include a read barrier, so that this fetch won't be done before
+	 * all prior fetches have been completed.
+	 *
+	 * XXX: This is theoretically unsafe if 64-bit reads and writes from shared
+	 * memory aren't atomic, but in practice you'd have to be incredibly
+	 * unlucky to have a problem.
+	 */
+	pg_read_barrier();
+	if (procArray->xendcount == snapshot->xendcount)
+		return snapshot;
+
+	/*
 	 * Allocating space for maxProcs xids is usually overkill; numProcs would
 	 * be sufficient. But it seems better to do the malloc while not holding
 	 * the lock, so we can't look at numProcs. Likewise, we allocate much
@@ -1283,6 +1316,7 @@ GetSnapshotData(Snapshot snapshot)
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	/* xmax is always latestCompletedXid + 1 */
+	snapshot->xendcount = arrayP->xendcount;
 	xmax = ShmemVariableCache->latestCompletedXid;
 	Assert(TransactionIdIsNormal(xmax));
 	TransactionIdAdvance(xmax);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 93c02fa..d6a3a68 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -44,12 +44,13 @@ typedef struct SnapshotData
 	 * is stored as an optimization to avoid needing to search the XID arrays
 	 * for most tuples.
 	 */
+	uint64		xendcount;		/* when did we take this snapshot? */
 	TransactionId xmin;			/* all XID < xmin are visible to me */
 	TransactionId xmax;			/* all XID >= xmax are invisible to me */
 	uint32		xcnt;			/* # of xact ids in xip[] */
-	TransactionId *xip;			/* array of xact IDs in progress */
 	/* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */
 	int32		subxcnt;		/* # of xact ids in subxip[] */
+	TransactionId *xip;			/* array of xact IDs in progress */
 	TransactionId *subxip;		/* array of subxact IDs in progress */
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
* Robert Haas:

> and (c) architectures (like 32-bit x86) where ordinary 64-bit
> operations aren't atomic but special instructions (cmpxchg8b) can be
> used to get that behavior.

FILD and FIST are atomic, too, and are supported by more
micro-architectures.
--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
On Thu, Jan 5, 2012 at 9:01 AM, Florian Weimer <fweimer@bfk.de> wrote:
> * Robert Haas:
>> and (c) architectures (like 32-bit x86) where ordinary 64-bit
>> operations aren't atomic but special instructions (cmpxchg8b) can be
>> used to get that behavior.
>
> FILD and FIST are atomic, too, and are supported by more
> micro-architectures.
Yeah, I think you (or someone) mentioned that before. If someone wants
to write the code, I'm game...
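
For reference, an untested sketch of that approach with GCC inline
assembly on 32-bit x86 -- fildll/fistpll move 8 bytes through the x87
stack in a single aligned load/store, and every int64 value round-trips
exactly through the 80-bit extended format:

    #include <stdint.h>

    /* Atomically read an aligned 64-bit value using the x87 FPU. */
    static inline uint64_t
    x87_read_uint64(volatile uint64_t *loc)
    {
        uint64_t    val;

        __asm__ __volatile__("fildll %1\n\t"
                             "fistpll %0"
                             : "=m" (val)
                             : "m" (*loc));
        return val;
    }

    /* Atomically write an aligned 64-bit value using the x87 FPU. */
    static inline void
    x87_write_uint64(volatile uint64_t *loc, uint64_t val)
    {
        __asm__ __volatile__("fildll %1\n\t"
                             "fistpll %0"
                             : "=m" (*loc)
                             : "m" (val));
    }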
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company