Improving connection scalability (src/backend/storage/ipc/procarray.c)
Hi hackers,
Inspired by Andres's work, I put some effort into trying to improve
GetSnapshotData().
I think that I got something.
As is well known, GetSnapshotData() is a critical function for Postgres
performance, as other benchmarks have already demonstrated.
So any gain, no matter how small, makes a difference.
So far, no regression has been observed.
Windows 10 64 bits (msvc 2019 64 bits)
pgbench -M prepared -c $conns -S -n -U postgres
conns       tps head      tps patched
    1    2918.004085    3027.550711
   10   12262.415696   12876.641772
   50   13656.724571   14877.410140
   80   14338.202348   15244.192915
pgbench can't run with more than 80 connections on this machine.
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns       tps head      tps patched
    1    2918.004085    3190.810466
   10   12262.415696   17199.862401
   50   13656.724571   18278.194114
   80   14338.202348   17955.336101
   90   16597.510373   18269.660184
  100   17706.775793   18349.650150
  200   16877.067441   17881.250615
  300   16942.260775   17181.441752
  400   16794.514911   17124.533892
  500   16598.502151   17181.244953
  600   16717.935001   16961.130742
  700   16651.204834   16959.172005
  800   16467.546583   16834.591719
  900   16588.241149   16693.902459
 1000   16564.985265   16936.952195
I was surprised that in my tests, with more than 100 connections,
the tps drops, even on head.
I don't have access to a powerful machine to really test with a high
workload.
But with connections below 100, it seems to me that there are obvious gains.
Patch attached.
regards,
Ranier Vilela
Attachments:
001-improve-scability-procarray.patch (application/octet-stream)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cd58c5faf0..f96caf61d7 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1371,10 +1371,10 @@ TransactionIdIsInProgress(TransactionId xid)
static TransactionId *xids = NULL;
static TransactionId *other_xids;
XidCacheStatus *other_subxidstates;
- int nxids = 0;
ProcArrayStruct *arrayP = procArray;
TransactionId topxid;
TransactionId latestCompletedXid;
+ int nxids = 0;
int mypgxactoff;
int numProcs;
int j;
@@ -1455,9 +1455,9 @@ TransactionIdIsInProgress(TransactionId xid)
numProcs = arrayP->numProcs;
for (int pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
{
- int pgprocno;
PGPROC *proc;
TransactionId pxid;
+ int pgprocno;
int pxids;
/* Ignore ourselves --- dealt with it above */
@@ -1600,10 +1600,11 @@ TransactionIdIsInProgress(TransactionId xid)
bool
TransactionIdIsActive(TransactionId xid)
{
- bool result = false;
ProcArrayStruct *arrayP = procArray;
TransactionId *other_xids = ProcGlobal->xids;
+ int numProcs;
int i;
+ bool result = false;
/*
* Don't bother checking a transaction older than RecentXmin; it could not
@@ -1614,11 +1615,12 @@ TransactionIdIsActive(TransactionId xid)
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (i = 0; i < arrayP->numProcs; i++)
+ numProcs = arrayP->numProcs;
+ for (i = 0; i < numProcs; i++)
{
- int pgprocno = arrayP->pgprocnos[i];
- PGPROC *proc = &allProcs[pgprocno];
+ PGPROC *proc;
TransactionId pxid;
+ int pgprocno;
/* Fetch xid just once - see GetNewTransactionId */
pxid = UINT32_ACCESS_ONCE(other_xids[i]);
@@ -1626,6 +1628,8 @@ TransactionIdIsActive(TransactionId xid)
if (!TransactionIdIsValid(pxid))
continue;
+ pgprocno = arrayP->pgprocnos[i];
+ proc = &allProcs[pgprocno];
if (proc->pid == 0)
continue; /* ignore prepared transactions */
@@ -1711,9 +1715,11 @@ static void
ComputeXidHorizons(ComputeXidHorizonsResult *h)
{
ProcArrayStruct *arrayP = procArray;
+ TransactionId *other_xids = ProcGlobal->xids;
+ uint8 *allStatusFlags = ProcGlobal->statusFlags;
TransactionId kaxmin;
+ int numProcs;
bool in_recovery = RecoveryInProgress();
- TransactionId *other_xids = ProcGlobal->xids;
LWLockAcquire(ProcArrayLock, LW_SHARED);
@@ -1762,14 +1768,14 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
*/
h->slot_xmin = procArray->replication_slot_xmin;
h->slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
-
- for (int index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (int index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
- int8 statusFlags = ProcGlobal->statusFlags[index];
TransactionId xid;
TransactionId xmin;
+ int8 statusFlags;
/* Fetch xid just once - see GetNewTransactionId */
xid = UINT32_ACCESS_ONCE(other_xids[index]);
@@ -1802,6 +1808,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
* removed, as long as pg_subtrans is not truncated) or doing logical
* decoding (which manages xmin separately, check below).
*/
+ statusFlags = allStatusFlags[index];
if (statusFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
continue;
@@ -2221,24 +2228,22 @@ GetSnapshotDataReuse(Snapshot snapshot)
Snapshot
GetSnapshotData(Snapshot snapshot)
{
- ProcArrayStruct *arrayP = procArray;
- TransactionId *other_xids = ProcGlobal->xids;
+ TransactionId *other_xids;
TransactionId xmin;
TransactionId xmax;
- int count = 0;
- int subcount = 0;
- bool suboverflowed = false;
- FullTransactionId latest_completed;
TransactionId oldestxid;
- int mypgxactoff;
TransactionId myxid;
+ TransactionId replication_slot_xmin;
+ TransactionId replication_slot_catalog_xmin;
+ FullTransactionId latest_completed;
uint64 curXactCompletionCount;
-
- TransactionId replication_slot_xmin = InvalidTransactionId;
- TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
+ int count;
+ int subcount;
+ int mypgxactoff;
+ bool suboverflowed;
Assert(snapshot != NULL);
-
+
/*
* Allocating space for maxProcs xids is usually overkill; numProcs would
* be sufficient. But it seems better to do the malloc while not holding
@@ -2250,7 +2255,12 @@ GetSnapshotData(Snapshot snapshot)
* xip arrays if any. (This relies on the fact that all callers pass
* static SnapshotData structs.)
*/
- if (snapshot->xip == NULL)
+ if (snapshot->xip != NULL)
+ {
+ if (GetSnapshotDataReuse(snapshot))
+ return snapshot;
+ }
+ else
{
/*
* First call for this snapshot. Snapshot is same size whether or not
@@ -2271,19 +2281,18 @@ GetSnapshotData(Snapshot snapshot)
errmsg("out of memory")));
}
+ count = 0;
+ subcount = 0;
+ suboverflowed = false;
+
/*
* It is sufficient to get shared lock on ProcArrayLock, even if we are
* going to set MyProc->xmin.
*/
LWLockAcquire(ProcArrayLock, LW_SHARED);
- if (GetSnapshotDataReuse(snapshot))
- {
- LWLockRelease(ProcArrayLock);
- return snapshot;
- }
-
latest_completed = ShmemVariableCache->latestCompletedXid;
+ other_xids = ProcGlobal->xids;
mypgxactoff = MyProc->pgxactoff;
myxid = other_xids[mypgxactoff];
Assert(myxid == MyProc->xid);
@@ -2307,11 +2316,10 @@ GetSnapshotData(Snapshot snapshot)
if (!snapshot->takenDuringRecovery)
{
- int numProcs = arrayP->numProcs;
TransactionId *xip = snapshot->xip;
- int *pgprocnos = arrayP->pgprocnos;
- XidCacheStatus *subxidStates = ProcGlobal->subxidStates;
uint8 *allStatusFlags = ProcGlobal->statusFlags;
+ ProcArrayStruct *arrayP = procArray;
+ int numProcs = arrayP->numProcs;
/*
* First collect set of pgxactoff/xids that need to be included in the
@@ -2348,14 +2356,6 @@ GetSnapshotData(Snapshot snapshot)
*/
Assert(TransactionIdIsNormal(xid));
- /*
- * If the XID is >= xmax, we can skip it; such transactions will
- * be treated as running anyway (and any sub-XIDs will also be >=
- * xmax).
- */
- if (!NormalTransactionIdPrecedes(xid, xmax))
- continue;
-
/*
* Skip over backends doing logical decoding which manages xmin
* separately (check below) and ones running LAZY VACUUM.
@@ -2364,6 +2364,14 @@ GetSnapshotData(Snapshot snapshot)
if (statusFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
continue;
+ /*
+ * If the XID is >= xmax, we can skip it; such transactions will
+ * be treated as running anyway (and any sub-XIDs will also be >=
+ * xmax).
+ */
+ if (!NormalTransactionIdPrecedes(xid, xmax))
+ continue;
+
if (NormalTransactionIdPrecedes(xid, xmin))
xmin = xid;
@@ -2387,6 +2395,7 @@ GetSnapshotData(Snapshot snapshot)
*/
if (!suboverflowed)
{
+ XidCacheStatus *subxidStates = ProcGlobal->subxidStates;
if (subxidStates[pgxactoff].overflowed)
suboverflowed = true;
@@ -2396,7 +2405,7 @@ GetSnapshotData(Snapshot snapshot)
if (nsubxids > 0)
{
- int pgprocno = pgprocnos[pgxactoff];
+ int pgprocno = arrayP->pgprocnos[pgxactoff];
PGPROC *proc = &allProcs[pgprocno];
pg_read_barrier(); /* pairs with GetNewTransactionId */
@@ -2585,9 +2594,10 @@ bool
ProcArrayInstallImportedXmin(TransactionId xmin,
VirtualTransactionId *sourcevxid)
{
- bool result = false;
ProcArrayStruct *arrayP = procArray;
+ int numProcs;
int index;
+ bool result = false;
Assert(TransactionIdIsNormal(xmin));
if (!sourcevxid)
@@ -2596,18 +2606,21 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
/* Get lock so source xact can't end while we're doing this */
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
- int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
- int statusFlags = ProcGlobal->statusFlags[index];
+ PGPROC *proc;
TransactionId xid;
+ int pgprocno;
+ int statusFlags = ProcGlobal->statusFlags[index];
/* Ignore procs running LAZY VACUUM */
if (statusFlags & PROC_IN_VACUUM)
continue;
/* We are only interested in the specific virtual transaction. */
+ pgprocno = arrayP->pgprocnos[index];
+ proc = &allProcs[pgprocno];
if (proc->backendId != sourcevxid->backendId)
continue;
if (proc->lxid != sourcevxid->localTransactionId)
@@ -2663,8 +2676,8 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
bool
ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
{
- bool result = false;
TransactionId xid;
+ bool result = false;
Assert(TransactionIdIsNormal(xmin));
Assert(proc != NULL);
@@ -2745,6 +2758,7 @@ GetRunningTransactionData(void)
TransactionId latestCompletedXid;
TransactionId oldestRunningXid;
TransactionId *xids;
+ int numProcs;
int index;
int count;
int subcount;
@@ -2794,7 +2808,8 @@ GetRunningTransactionData(void)
/*
* Spin over procArray collecting all xids
*/
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
TransactionId xid;
@@ -2838,10 +2853,8 @@ GetRunningTransactionData(void)
{
XidCacheStatus *other_subxidstates = ProcGlobal->subxidStates;
- for (index = 0; index < arrayP->numProcs; index++)
+ for (index = 0; index < numProcs; index++)
{
- int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
int nsubxids;
/*
@@ -2851,6 +2864,9 @@ GetRunningTransactionData(void)
nsubxids = other_subxidstates[index].count;
if (nsubxids > 0)
{
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+
/* barrier not really required, as XidGenLock is held, but ... */
pg_read_barrier(); /* pairs with GetNewTransactionId */
@@ -2914,6 +2930,7 @@ GetOldestActiveTransactionId(void)
ProcArrayStruct *arrayP = procArray;
TransactionId *other_xids = ProcGlobal->xids;
TransactionId oldestRunningXid;
+ int numProcs;
int index;
Assert(!RecoveryInProgress());
@@ -2933,7 +2950,8 @@ GetOldestActiveTransactionId(void)
* Spin over procArray collecting all xids and subxids.
*/
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
TransactionId xid;
@@ -2978,7 +2996,6 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
{
ProcArrayStruct *arrayP = procArray;
TransactionId oldestSafeXid;
- int index;
bool recovery_in_progress = RecoveryInProgress();
Assert(LWLockHeldByMe(ProcArrayLock));
@@ -3001,16 +3018,16 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
* slot's general xmin horizon, but the catalog horizon is only usable
* when only catalog data is going to be looked at.
*/
- if (TransactionIdIsValid(procArray->replication_slot_xmin) &&
- TransactionIdPrecedes(procArray->replication_slot_xmin,
+ if (TransactionIdIsValid(arrayP->replication_slot_xmin) &&
+ TransactionIdPrecedes(arrayP->replication_slot_xmin,
oldestSafeXid))
- oldestSafeXid = procArray->replication_slot_xmin;
+ oldestSafeXid = arrayP->replication_slot_xmin;
if (catalogOnly &&
- TransactionIdIsValid(procArray->replication_slot_catalog_xmin) &&
- TransactionIdPrecedes(procArray->replication_slot_catalog_xmin,
+ TransactionIdIsValid(arrayP->replication_slot_catalog_xmin) &&
+ TransactionIdPrecedes(arrayP->replication_slot_catalog_xmin,
oldestSafeXid))
- oldestSafeXid = procArray->replication_slot_catalog_xmin;
+ oldestSafeXid = arrayP->replication_slot_catalog_xmin;
/*
* If we're not in recovery, we walk over the procarray and collect the
@@ -3027,11 +3044,14 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
if (!recovery_in_progress)
{
TransactionId *other_xids = ProcGlobal->xids;
+ int numProcs;
+ int index;
/*
* Spin over procArray collecting min(ProcGlobal->xids[i])
*/
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
TransactionId xid;
@@ -3076,6 +3096,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
{
VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
+ int numProcs;
int count = 0;
int index;
@@ -3086,8 +3107,8 @@ GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
LWLockAcquire(ProcArrayLock, LW_SHARED);
-
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
@@ -3120,15 +3141,16 @@ GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
bool
HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
{
- bool result = false;
ProcArrayStruct *arrayP = procArray;
+ int numProcs;
int index;
Assert(type != 0);
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
@@ -3145,18 +3167,16 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
{
if (VirtualTransactionIdEquals(vxid, vxids[i]))
{
- result = true;
- break;
+ LWLockRelease(ProcArrayLock);
+ return true;
}
}
- if (result)
- break;
}
}
LWLockRelease(ProcArrayLock);
- return result;
+ return false;
}
/*
@@ -3192,25 +3212,25 @@ BackendPidGetProc(int pid)
PGPROC *
BackendPidGetProcWithLock(int pid)
{
- PGPROC *result = NULL;
- ProcArrayStruct *arrayP = procArray;
+ ProcArrayStruct *arrayP;
+ int numProcs;
int index;
if (pid == 0) /* never match dummy PGPROCs */
return NULL;
- for (index = 0; index < arrayP->numProcs; index++)
+ arrayP = procArray;
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
- PGPROC *proc = &allProcs[arrayP->pgprocnos[index]];
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
if (proc->pid == pid)
- {
- result = proc;
- break;
- }
+ return proc;
}
- return result;
+ return NULL;
}
/*
@@ -3229,23 +3249,29 @@ BackendPidGetProcWithLock(int pid)
int
BackendXidGetPid(TransactionId xid)
{
- int result = 0;
- ProcArrayStruct *arrayP = procArray;
- TransactionId *other_xids = ProcGlobal->xids;
+ ProcArrayStruct *arrayP;
+ TransactionId *other_xids;
+ int numProcs;
int index;
+ int result;
if (xid == InvalidTransactionId) /* never match invalid xid */
return 0;
+ arrayP = procArray;
+ other_xids = ProcGlobal->xids;
+ result = 0;
+
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
- int pgprocno = arrayP->pgprocnos[index];
- PGPROC *proc = &allProcs[pgprocno];
-
if (other_xids[index] == xid)
{
+ int pgprocno = arrayP->pgprocnos[index];
+ PGPROC *proc = &allProcs[pgprocno];
+
result = proc->pid;
break;
}
@@ -3301,6 +3327,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
{
VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
+ int numProcs;
int count = 0;
int index;
@@ -3310,15 +3337,17 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
- uint8 statusFlags = ProcGlobal->statusFlags[index];
+ uint8 statusFlags;
if (proc == MyProc)
continue;
+ statusFlags = ProcGlobal->statusFlags[index];
if (excludeVacuum & statusFlags)
continue;
@@ -3387,6 +3416,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
{
static VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
+ int numProcs;
int count = 0;
int index;
@@ -3407,7 +3437,8 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
@@ -3467,12 +3498,14 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
bool conflictPending)
{
ProcArrayStruct *arrayP = procArray;
+ int numProcs;
int index;
pid_t pid = 0;
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
@@ -3514,8 +3547,9 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
bool
MinimumActiveBackends(int min)
{
- ProcArrayStruct *arrayP = procArray;
- int count = 0;
+ ProcArrayStruct *arrayP;
+ int numProcs;
+ int count;
int index;
/* Quick short-circuit if no minimum is specified */
@@ -3527,7 +3561,10 @@ MinimumActiveBackends(int min)
* bogus, but since we are only testing fields for zero or nonzero, it
* should be OK. The result is only used for heuristic purposes anyway...
*/
- for (index = 0; index < arrayP->numProcs; index++)
+ count = 0;
+ arrayP = procArray;
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
@@ -3568,12 +3605,14 @@ int
CountDBBackends(Oid databaseid)
{
ProcArrayStruct *arrayP = procArray;
+ int numProcs;
int count = 0;
int index;
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
@@ -3598,12 +3637,14 @@ int
CountDBConnections(Oid databaseid)
{
ProcArrayStruct *arrayP = procArray;
+ int numProcs;
int count = 0;
int index;
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
@@ -3629,12 +3670,14 @@ void
CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
{
ProcArrayStruct *arrayP = procArray;
+ int numProcs;
int index;
/* tell all backends to die */
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
@@ -3669,12 +3712,14 @@ int
CountUserBackends(Oid roleid)
{
ProcArrayStruct *arrayP = procArray;
+ int numProcs;
int count = 0;
int index;
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
@@ -3728,8 +3773,9 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
for (tries = 0; tries < 50; tries++)
{
int nautovacs = 0;
- bool found = false;
+ int numProcs;
int index;
+ bool found = false;
CHECK_FOR_INTERRUPTS();
@@ -3737,11 +3783,11 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (index = 0; index < arrayP->numProcs; index++)
+ numProcs = arrayP->numProcs;
+ for (index = 0; index < numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
PGPROC *proc = &allProcs[pgprocno];
- uint8 statusFlags = ProcGlobal->statusFlags[index];
if (proc->databaseId != databaseId)
continue;
@@ -3754,6 +3800,8 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
(*nprepared)++;
else
{
+ uint8 statusFlags = ProcGlobal->statusFlags[index];
+
(*nbackends)++;
if ((statusFlags & PROC_IS_AUTOVACUUM) &&
nautovacs < MAXAUTOVACPIDS)
@@ -3798,12 +3846,14 @@ TerminateOtherDBBackends(Oid databaseId)
{
ProcArrayStruct *arrayP = procArray;
List *pids = NIL;
+ int numProcs;
int nprepared = 0;
int i;
LWLockAcquire(ProcArrayLock, LW_SHARED);
- for (i = 0; i < procArray->numProcs; i++)
+ numProcs = arrayP->numProcs;
+ for (i = 0; i < numProcs; i++)
{
int pgprocno = arrayP->pgprocnos[i];
PGPROC *proc = &allProcs[pgprocno];
@@ -3951,9 +4001,9 @@ XidCacheRemoveRunningXids(TransactionId xid,
int nxids, const TransactionId *xids,
TransactionId latestXid)
{
+ XidCacheStatus *mysubxidstat;
int i,
j;
- XidCacheStatus *mysubxidstat;
Assert(TransactionIdIsValid(xid));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5bc2a15160..37259a4c31 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1800,8 +1800,6 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
TransactionId xlimit = recentXmin;
TransactionId latest_xmin;
TimestampTz next_map_update_ts;
- TransactionId threshold_timestamp;
- TransactionId threshold_xid;
Assert(TransactionIdIsNormal(recentXmin));
Assert(OldSnapshotThresholdActive());
@@ -1839,6 +1837,9 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
}
else
{
+ TransactionId threshold_timestamp;
+ TransactionId threshold_xid;
+
ts = AlignTimestampToMinuteBoundary(ts)
- (old_snapshot_threshold * USECS_PER_MINUTE);
@@ -1901,8 +1902,6 @@ void
MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
{
TimestampTz ts;
- TransactionId latest_xmin;
- TimestampTz update_ts;
bool map_update_required = false;
/* Never call this function when old snapshot checking is disabled. */
@@ -1915,14 +1914,12 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
* a new value when we have crossed a bucket boundary.
*/
SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
- latest_xmin = oldSnapshotControl->latest_xmin;
- update_ts = oldSnapshotControl->next_map_update;
- if (ts > update_ts)
+ if (ts > oldSnapshotControl->next_map_update)
{
oldSnapshotControl->next_map_update = ts;
map_update_required = true;
}
- if (TransactionIdFollows(xmin, latest_xmin))
+ if (TransactionIdFollows(xmin, oldSnapshotControl->latest_xmin))
oldSnapshotControl->latest_xmin = xmin;
SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
@@ -2284,8 +2281,6 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
- uint32 i;
-
/*
* Make a quick range check to eliminate most XIDs without looking at the
* xip arrays. Note that this is OK even if we convert a subxact XID to
@@ -2307,6 +2302,8 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
*/
if (!snapshot->takenDuringRecovery)
{
+ uint32 i;
+
/*
* If the snapshot contains full subxact data, the fastest way to
* check things is just to compare the given XID against both subxact
On Tue, May 24, 2022 at 11:28 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
I think that I got something.
You might have something, but it's pretty hard to tell based on
looking at this patch. Whatever relevant changes it has are mixed with
a bunch of changes that are probably not relevant. For example, it's
hard to believe that moving "uint32 i" to an inner scope in
XidInMVCCSnapshot() is causing a performance gain, because an
optimizing compiler should figure that out anyway.
An even bigger issue is that it's not sufficient to just demonstrate
that the patch improves performance. It's also necessary to make an
argument as to why it is safe and correct, and "I tried it out and
nothing seemed to break" does not qualify as an argument. I'd guess
that most or maybe all of the performance gain that you've observed
here is attributable to changing GetSnapshotData() to call
GetSnapshotDataReuse() without first acquiring ProcArrayLock. That
doesn't seem like a completely hopeless idea, because the comments for
GetSnapshotDataReuse() say this:
* This very likely can be evolved to not need ProcArrayLock held (at very
* least in the case we already hold a snapshot), but that's for another day.
However, those comments seem to imply that it might not be safe in all
cases, and that changes might be needed someplace in order to make it
safe, but you haven't updated these comments, or changed the function
in any way, so it's not really clear how or whether whatever problems
Andres was worried about have been handled.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, May 24, 2022 at 1:06 PM Robert Haas <robertmhaas@gmail.com>
wrote:
On Tue, May 24, 2022 at 11:28 AM Ranier Vilela <ranier.vf@gmail.com>
wrote:
I think that I got something.
You might have something, but it's pretty hard to tell based on
looking at this patch. Whatever relevant changes it has are mixed with
a bunch of changes that are probably not relevant. For example, it's
hard to believe that moving "uint32 i" to an inner scope in
XidInMVCCSnapshot() is causing a performance gain, because an
optimizing compiler should figure that out anyway.
I believe that even these small changes are helpful and worthwhile.
They improve code readability and help the compiler generate better code,
especially with older compilers.
An even bigger issue is that it's not sufficient to just demonstrate
that the patch improves performance. It's also necessary to make an
argument as to why it is safe and correct, and "I tried it out and
nothing seemed to break" does not qualify as an argument.
Ok, certainly my argument so far has not been convincing enough.
I'd guess that most or maybe all of the performance gain that you've
observed
here is attributable to changing GetSnapshotData() to call
GetSnapshotDataReuse() without first acquiring ProcArrayLock.
It certainly helps, but I trust that's not the only reason, in all the
tests I did, there was an improvement in performance, even before using
this feature.
If you look closely at GetSnapshotData() you will see that
GetSnapshotDataReuse is called for all snapshots, even the new ones, which
is unnecessary.
Another example NormalTransactionIdPrecedes is more expensive than testing
statusFlags.
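To show concretely what I mean by the ordering, below is a minimal standalone
model of the two loop-body orderings under discussion. Everything in it
(toy_proc_entry, TOY_IN_VACUUM, scan_xmax_first, scan_flags_first) is invented
for illustration and only mimics the shape of the filtering loop in
GetSnapshotData(); it is not PostgreSQL code, and the real checks are
NormalTransactionIdPrecedes(xid, xmax) and the statusFlags test.

/*
 * Minimal standalone model (not PostgreSQL code) of the two loop-body
 * orderings: filter on xid-vs-xmax first, or read the flags array first.
 */
#include <stdint.h>
#include <stdio.h>

#define TOY_IN_VACUUM           0x01
#define TOY_IN_LOGICAL_DECODING 0x02

typedef struct toy_proc_entry
{
	uint32_t	xid;		/* stands in for ProcGlobal->xids[i]; 0 = invalid */
} toy_proc_entry;

/* Ordering as on head: cheap, highly selective xid filter first. */
static int
scan_xmax_first(const toy_proc_entry *procs, const uint8_t *flags,
				int nprocs, uint32_t xmax)
{
	int			included = 0;

	for (int i = 0; i < nprocs; i++)
	{
		uint32_t	xid = procs[i].xid;

		if (xid == 0)
			continue;		/* no xid assigned */
		if (!(xid < xmax))
			continue;		/* treated as running anyway */
		if (flags[i] & (TOY_IN_VACUUM | TOY_IN_LOGICAL_DECODING))
			continue;		/* flags only read when still needed */
		included++;
	}
	return included;
}

/* Ordering as in the attached patch: flags read before the xid filter. */
static int
scan_flags_first(const toy_proc_entry *procs, const uint8_t *flags,
				 int nprocs, uint32_t xmax)
{
	int			included = 0;

	for (int i = 0; i < nprocs; i++)
	{
		uint32_t	xid = procs[i].xid;

		if (xid == 0)
			continue;
		if (flags[i] & (TOY_IN_VACUUM | TOY_IN_LOGICAL_DECODING))
			continue;		/* flags read for every entry with an xid */
		if (!(xid < xmax))
			continue;
		included++;
	}
	return included;
}

int
main(void)
{
	toy_proc_entry procs[4] = {{.xid = 0}, {.xid = 90}, {.xid = 120}, {.xid = 95}};
	uint8_t		flags[4] = {0, TOY_IN_VACUUM, 0, 0};
	uint32_t	xmax = 100;

	printf("xmax first:  %d included\n", scan_xmax_first(procs, flags, 4, xmax));
	printf("flags first: %d included\n", scan_flags_first(procs, flags, 4, xmax));
	return 0;
}

Both orderings compute the same result; the only difference is whether the
flags array is read before or after the cheap xid-vs-xmax filter, which is the
reordering the attached patch makes.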
That
doesn't seem like a completely hopeless idea, because the comments for
GetSnapshotDataReuse() say this:
* This very likely can be evolved to not need ProcArrayLock held (at very
* least in the case we already hold a snapshot), but that's for another
day.
However, those comments seem to imply that it might not be safe in all
cases, and that changes might be needed someplace in order to make it
safe, but you haven't updated these comments, or changed the function
in any way, so it's not really clear how or whether whatever problems
Andres was worried about have been handled.
I think it's worth trying and testing to see if everything goes well,
and then applying whatever comment updates are needed in the final patch.
regards,
Ranier Vilela
Hi,
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I
can get 40k in a single connection in an optimized build. If you're testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
What pgbench scale is this and are you using an optimized build?
Greetings,
Andres Freund
Hi,
On 2022-05-24 13:23:43 -0300, Ranier Vilela wrote:
It certainly helps, but I trust that's not the only reason, in all the
tests I did, there was an improvement in performance, even before using
this feature.
If you look closely at GetSnapshotData() you will see that
GetSnapshotDataReuse is called for all snapshots, even the new ones, which
is unnecessary.
That only happens a handful of times as snapshots are persistently
allocated. Doing an extra GetSnapshotDataReuse() in those cases doesn't matter
for performance. If anything this increases the number of jumps for the common
case.
It'd be a huge win to avoid needing ProcArrayLock when reusing a snapshot, but
it's not at all easy to guarantee that it's correct / see how to make it
correct. I'm fairly sure it can be made correct, but ...
Another example NormalTransactionIdPrecedes is more expensive than testing
statusFlags.
That may be true when you count instructions, but isn't at all true when you
take into account that the cachelines containing status flags are hotly
contended.
Also, the likelihood of filtering out a proc due to
NormalTransactionIdPrecedes(xid, xmax) is *vastly* higher than due to the
statusFlags check. There may be a lot of procs failing that test, but
typically there will be far fewer backends in vacuum or logical decoding.
Greetings,
Andres Freund
On Wed, May 25, 2022 at 12:46 AM Andres Freund <andres@anarazel.de>
wrote:
Hi Andres, thank you for taking a look.
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers resident
run, I
can get 40k in a single connection in an optimized build. If you're
testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
Thanks for sharing these hits.
Yes, their 17-18k tps are disappointing.
What pgbench scale is this and are you using an optimized build?
Yes this optimized build.
CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
-Wno-format-truncation -Wno-stringop-truncation -O2'
from config.log
pgbench was initialized with:
pgbench -i -p 5432 -d postgres
pgbench -M prepared -c 100 -j 100 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
The shared_buffers is default:
shared_buffers = 128MB
Intel® Core™ i5-8250U CPU Quad Core
RAM 8GB
SSD 256 GB
Can you share the pgbench configuration and shared_buffers used in
this benchmark?
/messages/by-id/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
regards,
Ranier Vilela
On Wed, May 25, 2022 at 12:56 AM Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-05-24 13:23:43 -0300, Ranier Vilela wrote:
It certainly helps, but I trust that's not the only reason, in all the
tests I did, there was an improvement in performance, even before using
this feature.
If you look closely at GetSnapshotData() you will see that
GetSnapshotDataReuse is called for all snapshots, even the new ones, which
is unnecessary.
That only happens a handful of times as snapshots are persistently
allocated.
Yes, but now this does not happen with new snapshots.
Doing an extra GetSnapshotDataReuse() in those cases doesn't matter
for performance. If anything this increases the number of jumps for the
common
case.
IMHO, with GetSnapshotData(), any gain makes a difference.
It'd be a huge win to avoid needing ProcArrayLock when reusing a snapshot,
but
it's not at all easy to guarantee that it's correct / see how to make it
correct. I'm fairly sure it can be made correct, but ...
I believe it's worth the effort to make sure everything goes well and use
this feature.
Another example NormalTransactionIdPrecedes is more expensive than
testing
statusFlags.
That may be true when you count instructions, but isn't at all true when
you
take into account that the cachelines containing status flags are hotly
contended.
Also, the likelihood of filtering out a proc due to
NormalTransactionIdPrecedes(xid, xmax) is *vastly* higher than the due to
the
statusFlags check. There may be a lot of procs failing that test, but
typically there will be far fewer backends in vacuum or logical decoding.
I believe that keeping the instructions in the cache together works better
than having the status flags test in the middle.
But I will test this to be sure.
regards,
Ranier Vilela
On 5/25/22 11:07, Ranier Vilela wrote:
On Wed, May 25, 2022 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi Andres, thank you for taking a look.
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers
resident run, I
can get 40k in a single connection in an optimized build. If you're
testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless
you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
Thanks for sharing these hits.
Yes, their 17-18k tps are disappointing.
What pgbench scale is this and are you using an optimized build?
Yes this optimized build.
CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
-Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -Wno-format-truncation
-Wno-stringop-truncation -O2'
from config.log
That can still be assert-enabled build. We need to see configure flags.
pgbench was initialized with:
pgbench -i -p 5432 -d postgres
pgbench -M prepared -c 100 -j 100 -S -n -U postgres
You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
The shared_buffers is default:
shared_buffers = 128MB
Intel® Core™ i5-8250U CPU Quad Core
RAM 8GB
SSD 256 GB
Well, quick results on my laptop (i7-9750H, so not that different from
what you have):
1 = 18908.080126
2 = 32943.953182
3 = 42316.079028
4 = 46700.087645
So something is likely wrong in your setup.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 25, 2022 at 7:13 AM Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/25/22 11:07, Ranier Vilela wrote:
On Wed, May 25, 2022 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi Andres, thank you for taking a look.
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers
resident run, I
can get 40k in a single connection in an optimized build. If you're
testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless
you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
Thanks for sharing these hits.
Yes, their 17-18k tps are disappointing.
What pgbench scale is this and are you using an optimized build?
Yes this optimized build.
CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
-Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -Wno-format-truncation
-Wno-stringop-truncation -O2'
from config.log
That can still be assert-enabled build. We need to see configure flags.
./configure
Attached the config.log (compressed)
pgbench was initialized with:
pgbench -i -p 5432 -d postgres
pgbench -M prepared -c 100 -j 100 -S -n -U postgres
You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
Ok, I will try with -T 60.
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
The shared_buffers is default:
shared_buffers = 128MB
Intel® Core™ i5-8250U CPU Quad Core
RAM 8GB
SSD 256 GB
Well, quick results on my laptop (i7-9750H, so not that different from
what you have):
1 = 18908.080126
2 = 32943.953182
3 = 42316.079028
4 = 46700.087645
So something is likely wrong in your setup.
select version();
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 15beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
9.4.0-1ubuntu1~20.04.1'
--with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2
--prefix=/usr --with-gcc-major-version-only --program-suffix=-9
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-plugin --enable-default-pie
--with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto
--enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
regards,
Ranier Vilela
Attachments:
On Wed, May 25, 2022 at 8:26 AM Ranier Vilela <ranier.vf@gmail.com>
wrote:
On Wed, May 25, 2022 at 7:13 AM Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/25/22 11:07, Ranier Vilela wrote:
On Wed, May 25, 2022 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi Andres, thank you for taking a look.
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers
resident run, I
can get 40k in a single connection in an optimized build. If you're
testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless
you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
Thanks for sharing these hits.
Yes, their 17-18k tps are disappointing.
What pgbench scale is this and are you using an optimized build?
Yes this optimized build.
CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
-Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -Wno-format-truncation
-Wno-stringop-truncation -O2'
from config.log
That can still be assert-enabled build. We need to see configure flags.
./configure
Attached the config.log (compressed)
pgbench was initialized with:
pgbench -i -p 5432 -d postgres
pgbench -M prepared -c 100 -j 100 -S -n -U postgres
You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
Ok, I will try with -T 60.
Here are the results with -T 60:
Linux Ubuntu 64 bits
shared_buffers = 128MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns       tps head      tps patched
    1   17126.326108   17792.414234
   10   82068.123383   82468.334836
   50   73808.731404   74678.839428
   80   73290.191713   73116.553986
   90   67558.483043   68384.906949
  100   65960.982801   66997.793777
  200   62216.011998   62870.243385
  300   62924.225658   62796.157548
  400   62278.099704   63129.555135
  500   63257.930870   62188.825044
  600   61479.890611   61517.913967
  700   61139.354053   61327.898847
  800   60833.663791   61517.913967
  900   61305.129642   61248.336593
 1000   60990.918719   61041.670996
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
number of transactions per client: 10
conns       tps head      tps patched
    1    2918.004085    3211.303789
   10   12262.415696   15540.015540
   50   13656.724571   16701.182444
   80   14338.202348   16628.559551
   90   16597.510373   16835.016835
  100   17706.775793   16607.433487
  200   16877.067441   16426.969799
  300   16942.260775   16319.780662
  400   16794.514911   16155.023607
  500   16598.502151   16051.106724
  600   16717.935001   16007.171213
  700   16651.204834   16004.353184
  800   16467.546583   16834.591719
  900   16588.241149   16693.902459
 1000   16564.985265   16936.952195
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns       tps head      tps patched
    1   17174.265804   17792.414234
   10   82365.634750   82468.334836
   50   74593.714180   74678.839428
   80   69219.756038   73116.553986
   90   67419.574189   68384.906949
  100   66613.771701   66997.793777
  200   61739.784830   62870.243385
  300   62109.691298   62796.157548
  400   61630.822446   63129.555135
  500   61711.019964   62755.190389
  600   60620.010181   61517.913967
  700   60303.317736   61688.044232
  800   60451.113573   61076.666572
  900   60017.327157   61256.290037
 1000   60088.823434   60986.799312
regards,
Ranier Vilela
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here are the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
Linux Ubuntu 64 bits
shared_buffers = 128MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
BTW it's generally a good idea to do multiple runs and then use the
average and/or median. Results from a single run may be quite noisy.
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
number of transactions per client: 10
conns tps head tps patched
1 2918.004085 3211.303789
10 12262.415696 15540.015540
50 13656.724571 16701.182444
80 14338.202348 16628.559551
90 16597.510373 16835.016835
100 17706.775793 16607.433487
200 16877.067441 16426.969799
300 16942.260775 16319.780662
400 16794.514911 16155.023607
500 16598.502151 16051.106724
600 16717.935001 16007.171213
700 16651.204834 16004.353184
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
I think we've agreed these results are useless.
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17174.265804 17792.414234
10 82365.634750 82468.334836
50 74593.714180 74678.839428
80 69219.756038 73116.553986
90 67419.574189 68384.906949
100 66613.771701 66997.793777
200 61739.784830 62870.243385
300 62109.691298 62796.157548
400 61630.822446 63129.555135
500 61711.019964 62755.190389
600 60620.010181 61517.913967
700 60303.317736 61688.044232
800 60451.113573 61076.666572
900 60017.327157 61256.290037
1000 60088.823434 60986.799312
I have no idea why shared buffers 2GB would be interesting. The proposed
change was related to procarray, not shared buffers. And scale 1 is
~15MB of data, so it fits into 128MB just fine.
Also, the first ~10 results for "patched" case match results for 128MB
shared buffers. That seems very unlikely to happen by chance, so this
seems rather suspicious.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 26, 2022 at 10:30 PM Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here are the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but the expectation has diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in tests.
Performance degrades with a greater number of connections.
GetSnapshotData() isn't a bottleneck?
Linux Ubuntu 64 bits
shared_buffers = 128MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations.
However, I think they should be considered micro-optimizations in inner
loops,
because everything in procarray.c is a hot path.
The first objective, I believe, was achieved, with no performance
regression.
I agree, the gains are small, by the tests done.
But, IMHO, this is a good way, small gains turn into big gains in the end,
when applied to all code.
Consider GetSnapshotData()
1. Most of the time the snapshot is not null, so:
if (snapshot == NULL) will fail most of the time.
With the patch:
    if (snapshot->xip != NULL)
    {
        if (GetSnapshotDataReuse(snapshot))
            return snapshot;
    }
Most of the time the test is true and GetSnapshotDataReuse is not called
for new
snapshots.
count, subcount and suboverflowed, will not be initialized, for all
snapshots.
2. If the snapshot is taken during recovery
The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
Only if it is not suboverflowed.
Skipping all InvalidTransactionId entries, mypgxactoff, backends doing logical
decoding,
and XIDs that are >= xmax.
3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
There's an agreement that this would be fine, for now.
Consider ComputeXidHorizons()
1. ProcGlobal->statusFlags is touched before the lock.
2. allStatusFlags[index] is not touched for all numProcs.
All changes were made with the aim of avoiding or postponing unnecessary
work (a standalone sketch of the ComputeXidHorizons() pattern follows below).
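To make the ComputeXidHorizons() items above concrete, here is a minimal
standalone model of that pattern: the loop bound and the flags base pointer
are read once before the loop, and the per-entry flags value is read only
after the cheaper xid test has passed. The toy_array/toy_global types and the
values in main() are invented for illustration; this is not PostgreSQL code.

/*
 * Minimal standalone model (not PostgreSQL code) of hoisting the loop bound
 * and the flags base pointer, and deferring the per-entry flags read.
 */
#include <stdint.h>
#include <stdio.h>

#define TOY_SKIP_FLAGS 0x03		/* stands in for the vacuum/decoding flags */

typedef struct toy_array
{
	int			numProcs;
	uint32_t	xids[8];
} toy_array;

typedef struct toy_global
{
	uint8_t		statusFlags[8];
} toy_global;

static uint32_t
compute_min_xid(const toy_array *arrayP, const toy_global *globalP)
{
	const uint8_t *allStatusFlags = globalP->statusFlags;	/* hoisted pointer */
	int			numProcs = arrayP->numProcs;	/* hoisted loop bound */
	uint32_t	min_xid = UINT32_MAX;

	for (int index = 0; index < numProcs; index++)
	{
		uint32_t	xid = arrayP->xids[index];
		uint8_t		statusFlags;

		if (xid == 0)
			continue;			/* nothing to look at: flags never read */

		statusFlags = allStatusFlags[index];	/* deferred per-entry read */
		if (statusFlags & TOY_SKIP_FLAGS)
			continue;

		if (xid < min_xid)
			min_xid = xid;
	}
	return min_xid;
}

int
main(void)
{
	toy_array	a = {.numProcs = 4, .xids = {0, 130, 110, 125}};
	toy_global	g = {.statusFlags = {0, 0, TOY_SKIP_FLAGS, 0}};

	printf("min xid = %u\n", (unsigned) compute_min_xid(&a, &g));	/* prints 125 */
	return 0;
}

The result is identical either way; the only intended difference is when
allStatusFlags[index] is read.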
BTW it's generally a good idea to do multiple runs and then use the
average and/or median. Results from a single run may be quite noisy.
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
number of transactions per client: 10
conns tps head tps patched
1 2918.004085 3211.303789
10 12262.415696 15540.015540
50 13656.724571 16701.182444
80 14338.202348 16628.559551
90 16597.510373 16835.016835
100 17706.775793 16607.433487
200 16877.067441 16426.969799
300 16942.260775 16319.780662
400 16794.514911 16155.023607
500 16598.502151 16051.106724
600 16717.935001 16007.171213
700 16651.204834 16004.353184
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
I think we've agreed these results are useless.
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17174.265804 17792.414234
10 82365.634750 82468.334836
50 74593.714180 74678.839428
80 69219.756038 73116.553986
90 67419.574189 68384.906949
100 66613.771701 66997.793777
200 61739.784830 62870.243385
300 62109.691298 62796.157548
400 61630.822446 63129.555135
500 61711.019964 62755.190389
600 60620.010181 61517.913967
700 60303.317736 61688.044232
800 60451.113573 61076.666572
900 60017.327157 61256.290037
1000 60088.823434 60986.799312I have no idea why shared buffers 2GB would be interesting. The proposed
change was related to procarray, not shared buffers. And scale 1 is
~15MB of data, so it fits into 128MB just fine.
I thought about doing this benchmark, in the most common usage situation
(25% of RAM).
Also, the first ~10 results for "patched" case match results for 128MB
shared buffers. That seems very unlikely to happen by chance, so this
seems rather suspicious.
Probably a copy-and-paste mistake.
I redid this test, for patched:
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns       tps head      tps patched
    1   17174.265804   17524.482668
   10   82365.634750   81840.537713
   50   74593.714180   74806.729434
   80   69219.756038   73116.553986
   90   67419.574189   69130.749209
  100   66613.771701   67478.234595
  200   61739.784830   63094.202413
  300   62109.691298   62984.501251
  400   61630.822446   63243.232816
  500   61711.019964   62827.977636
  600   60620.010181   62838.051693
  700   60303.317736   61594.629618
  800   60451.113573   61208.629058
  900   60017.327157   61171.001256
 1000   60088.823434   60558.067810
regards,
Ranier Vilela
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower than one would
They don't look all that sane to me - isn't that way lower than one would
expect? Restricting both client and server to the same four cores, a
thermically challenged older laptop I have around I get 150k tps at both 10
and 100 clients.
Either way, I'd not expect to see any GetSnapshotData() scalability effects to
show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just not enough
concurrency.
The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come up with a
workload that shows it being a meaningful part of a profile. Unless it is,
performance differences are going to just be due to various forms of noise.
Greetings,
Andres Freund
Hi,
On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
On Thu, May 26, 2022 at 22:30, Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but the expectation has diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in tests.
Performance degrades with a greater number of connections.
Your system has four CPUs. Once they're all busy, adding more connections
won't improve performance. It'll just add more and more context switching,
cache misses, and make the OS scheduler do more work.
GetSnapshotData() isn't a bottleneck?
I'd be surprised if it showed up in a profile on your machine with that
workload in any sort of meaningful way. The snapshot reuse logic will always
work - because there are no writes - and thus the only work that needs to be
done is to acquire the ProcArrayLock briefly. And because there is only a
small number of cores, contention on the cacheline for that isn't a problem.
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations. However, I think they should be
considered micro-optimizations in inner loops, because all of procarray.c is
a hot path.
As explained earlier, I don't agree that they optimize anything - you're
making some of the scalability behaviour *worse*, if it's changed at all.
The first objective, I believe, was achieved, with no performance
regression.
I agree the gains are small, based on the tests done.
There are no gains.
But, IMHO, this is a good way, small gains turn into big gains in the end,
when applied to all code.
Consider GetSnapShotData()
1. Most of the time the snapshot is not null, so:
if (snapshot->xip == NULL) will fail most of the time.
With the patch:
if (snapshot->xip != NULL)
{
if (GetSnapshotDataReuse(snapshot))
return snapshot;
}
Most of the time the test is true, and GetSnapshotDataReuse is not called
for new snapshots.
count, subcount and suboverflowed will not be initialized for all
snapshots.
But that's irrelevant. There's only a few "new" snapshots in the life of a
connection. You're optimizing something irrelevant.
2. If the snapshot is taken during recovery
The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
That code isn't reached when in recovery?
3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
There's an agreement that this would be fine, for now.
There's no such agreement at all. It's not correct.
Consider ComputeXidHorizons()
1. ProcGlobal->statusFlags is touched before the lock.
Hard to believe that'd have a measurable effect.
2. allStatusFlags[index] is not touched for all numProcs.
I'd be surprised if the compiler couldn't defer that load on its own.
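A toy, compilable illustration of the kind of deferral being discussed here
(simplified stand-in types, not the real ComputeXidHorizons()); as noted just
above, an optimizing compiler will typically do this on its own:

#include <stdint.h>
#include <stdio.h>

typedef unsigned int TransactionId;

/*
 * Read the shared element count once into a local, and load statusFlags[i]
 * only for entries that survive the cheaper xid test.
 */
static int
count_flagged(const TransactionId *xids, const uint8_t *statusFlags,
              int numProcs, uint8_t mask)
{
    int nprocs = numProcs;          /* hoisted loop bound */
    int count = 0;

    for (int i = 0; i < nprocs; i++)
    {
        if (xids[i] == 0)           /* cheap early-out first */
            continue;

        if (statusFlags[i] & mask)  /* deferred per-element load */
            count++;
    }
    return count;
}

int
main(void)
{
    TransactionId xids[] = {0, 731, 0, 732, 733};
    uint8_t flags[] = {0, 1, 1, 0, 1};

    printf("%d\n", count_flagged(xids, flags, 5, 1));   /* prints 2 */
    return 0;
}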
Greetings,
Andres Freund
On Fri, May 27, 2022 at 18:08, Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower than one would
expect?
Yes, quite disappointing.
Restricting both client and server to the same four cores, a
thermically challenged older laptop I have around I get 150k tps at both 10
and 100 clients.
And you can share the benchmark details? Hardware, postgres and pgbench,
please?
Either way, I'd not expect to see any GetSnapshotData() scalability
effects to
show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just not
enough
concurrency.
Does this mean that our customers will not see any connection scalability
improvements with PG15, using the simplest hardware?
The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come up with a
workload that shows it being a meaningful part of a profile. Unless it is,
performance differences are going to just be due to various forms of noise.
Actually in the profiles I got with perf, GetSnapShotData() didn't show up.
regards,
Ranier Vilela
On Fri, May 27, 2022 at 18:22, Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
On Thu, May 26, 2022 at 22:30, Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but the expectation has diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in tests.
Performance degrades with a greater number of connections.
Your system has four CPUs. Once they're all busy, adding more connections
won't improve performance. It'll just add more and more context switching,
cache misses, and make the OS scheduler do more work.
conns tps head
10 82365.634750
50 74593.714180
80 69219.756038
90 67419.574189
100 66613.771701
Yes it is quite disappointing that with 100 connections, tps loses to 10
connections.
GetSnapshotData() isn't a bottleneck?
I'd be surprised if it showed up in a profile on your machine with that
workload in any sort of meaningful way. The snapshot reuse logic will always
work - because there are no writes - and thus the only work that needs to be
done is to acquire the ProcArrayLock briefly. And because there is only a
small number of cores, contention on the cacheline for that isn't a problem.
Thanks for sharing this.
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations. However, I think they should be
considered micro-optimizations in inner loops, because all of procarray.c is
a hot path.
As explained earlier, I don't agree that they optimize anything - you're
making some of the scalability behaviour *worse*, if it's changed at all.
The first objective, I believe, was achieved, with no performance
regression.
I agree the gains are small, based on the tests done.
There are no gains.
IMHO, I must disagree.
But, IMHO, this is a good way, small gains turn into big gains in the end,
when applied to all code.
Consider GetSnapShotData()
1. Most of the time the snapshot is not null, so:
if (snapshot->xip == NULL) will fail most of the time.
With the patch:
if (snapshot->xip != NULL)
{
if (GetSnapshotDataReuse(snapshot))
return snapshot;
}
Most of the time the test is true, and GetSnapshotDataReuse is not called
for new snapshots.
count, subcount and suboverflowed will not be initialized for all
snapshots.
But that's irrelevant. There's only a few "new" snapshots in the life of a
connection. You're optimizing something irrelevant.
IMHO, when GetSnapShotData() is the bottleneck, all is relevant.
2. If the snapshot is taken during recovery
The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
That code isn't reached when in recovery?
Currently it is reached *even* when not in recovery.
With the patch, it is reached *only* when in recovery.
3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
There's an agreement that this would be fine, for now.
There's no such agreement at all. It's not correct.
Ok, but there is a chance it will work correctly.
Consider ComputeXidHorizons()
1. ProcGlobal->statusFlags is touched before the lock.
Hard to believe that'd have a measurable effect.
IMHO, anything you take out of the lock is a benefit.
2. allStatusFlags[index] is not touched for all numProcs.
I'd be surprised if the compiler couldn't defer that load on its own.
Better be sure of that, no?
regards,
Ranier Vilela
On 5/28/22 02:15, Ranier Vilela wrote:
On Fri, May 27, 2022 at 18:08, Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower than one
would expect?
Yes, quite disappointing.
Restricting both client and server to the same four cores, a
thermically challenged older laptop I have around I get 150k tps at
both 10 and 100 clients.
And you can share the benchmark details? Hardware, postgres and pgbench,
please?
Either way, I'd not expect to see any GetSnapshotData() scalability
effects to show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's
just not enough concurrency.
Does this mean that our customers will not see any connection scalability
improvements with PG15, using the simplest hardware?
No. It means that on a 4-core machine GetSnapshotData() is unlikely to be
a bottleneck, because you'll hit various other bottlenecks way earlier.
I personally doubt it even makes sense to worry about scaling to this
many connections on such a tiny system too much.
The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come
up with a workload that shows it being a meaningful part of a profile.
Unless it is, performance differences are going to just be due to various
forms of noise.
Actually in the profiles I got with perf, GetSnapShotData() didn't show up.
But that's exactly the point Andres is trying to make - if you don't see
GetSnapshotData() in the perf profile, why do you think optimizing it
will have any meaningful impact on throughput?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/28/22 02:36, Ranier Vilela wrote:
On Fri, May 27, 2022 at 18:22, Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
On Thu, May 26, 2022 at 22:30, Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but the expectation has diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in tests.
Performance degrades with a greater number of connections.
Your system has four CPUs. Once they're all busy, adding more
connections
won't improve performance. It'll just add more and more context
switching,
cache misses, and make the OS scheduler do more work.
conns tps head
10 82365.634750
50 74593.714180
80 69219.756038
90 67419.574189
100 66613.771701
Yes it is quite disappointing that with 100 connections, tps loses to 10
connections.
IMO that's entirely expected on a system with only 4 cores. Increasing
the number of connections inevitably means more overhead (you have to
track/manage more stuff). And at some point the backends start competing
for L2/L3 caches, context switches are not free either, etc. So once you
cross ~2-3x the number of cores, you should expect this.
This behavior is natural/inherent, it's unlikely to go away, and it's
one of the reasons why we recommend not to use too many connections. If
you try to maximize throughput, just don't do that. Or just use a machine
with more cores.
GetSnapshotData() isn't a bottleneck?
I'd be surprised if it showed up in a profile on your machine with that
workload in any sort of meaningful way. The snapshot reuse logic will always
work - because there are no writes - and thus the only work that needs to be
done is to acquire the ProcArrayLock briefly. And because there is only a
small number of cores, contention on the cacheline for that isn't a problem.
Thanks for sharing this.
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations. However, I think they should be
considered micro-optimizations in inner loops, because all of procarray.c is
a hot path.
As explained earlier, I don't agree that they optimize anything - you're
making some of the scalability behaviour *worse*, if it's changed at all.
The first objective, I believe, was achieved, with no performance
regression.
I agree the gains are small, based on the tests done.
There are no gains.
IMHO, I must disagree.
You don't have to, really. What you should do is show results
demonstrating the claimed gains, and so far you have not done that.
I don't want to be rude, but so far you've shown results from a
benchmark testing fork(), due to only running 10 transactions per
client, and then results from a single run for each client count (which
doesn't really show any gains either, and is so noisy).
As mentioned GetSnapshotData() is not even in perf profile, so why would
the patch even make a difference?
You've also claimed it helps generating better code on older compilers,
but you've never supported that with any evidence.
Maybe there is an improvement - show us. Do a benchmark with more runs,
to average-out the noise. Calculate VAR/STDEV to show how variable the
results are. Use that to compare results and decide if there is an
improvement. Also, keep in mind binary layout matters [1].
[1]: https://www.youtube.com/watch?v=r-TLSBdHe1A
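As one concrete, hypothetical way to follow that advice (not part of any patch
or script in this thread), a small C filter that reads one tps value per line
and prints the count, mean, standard deviation, and median across repeated
runs of the same pgbench command:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int
cmp_double(const void *a, const void *b)
{
    double da = *(const double *) a;
    double db = *(const double *) b;

    return (da > db) - (da < db);
}

int
main(void)
{
    double  v[1024];
    double  sum = 0.0;
    int     n = 0;

    /* one tps value per input line */
    while (n < 1024 && scanf("%lf", &v[n]) == 1)
    {
        sum += v[n];
        n++;
    }
    if (n == 0)
        return 1;

    double  mean = sum / n;
    double  ss = 0.0;

    for (int i = 0; i < n; i++)
        ss += (v[i] - mean) * (v[i] - mean);

    double  stddev = (n > 1) ? sqrt(ss / (n - 1)) : 0.0;

    qsort(v, n, sizeof(double), cmp_double);

    double  median = (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;

    printf("n=%d mean=%.2f stddev=%.2f median=%.2f\n", n, mean, stddev, median);
    return 0;
}

Comparing per-client-count medians, together with the stddev, is what
separates a real gain from run-to-run noise, and is essentially how the
results reported later in this thread were summarized.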
But, IMHO, this is a good way, small gains turn into big gains in
the end,
when applied to all code.
Consider GetSnapShotData()
1. Most of the time the snapshot is not null, so:
if (snapshot->xip == NULL) will fail most of the time.
With the patch:
if (snapshot->xip != NULL)
{
if (GetSnapshotDataReuse(snapshot))
return snapshot;
}
Most of the time the test is true, and GetSnapshotDataReuse is not called
for new snapshots.
count, subcount and suboverflowed will not be initialized for all
snapshots.
But that's irrelevant. There's only a few "new" snapshots in the life of a
connection. You're optimizing something irrelevant.
IMHO, when GetSnapShotData() is the bottleneck, all is relevant.
Maybe. Show us the difference.
2. If the snapshot is taken during recovery
The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
That code isn't reached when in recovery?
Currently it is reached *even* when not in recovery.
With the patch, it is reached *only* when in recovery.
3. Calling GetSnapshotDataReuse() without first acquiring
ProcArrayLock.
There's an agreement that this would be fine, for now.
There's no such agreement at all. It's not correct.
Ok, but there is a chance it will work correctly.
Either it's correct or not. Chance of being correct does not count.
Consider ComputeXidHorizons()
1. ProcGlobal->statusFlags is touched before the lock.
Hard to believe that'd have a measurable effect.
IMHO, anything you take out of the lock is a benefit.
Maybe. Show us the difference.
2. allStatusFlags[index] is not touched for all numProcs.
I'd be surprised if the compiler couldn't defer that load on its own.
Better be sure of that, no?
We rely on compilers doing this in about a million other places.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, May 28, 2022 at 09:00, Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/28/22 02:15, Ranier Vilela wrote:
On Fri, May 27, 2022 at 18:08, Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower than one
would expect?
Yes, quite disappointing.
Restricting both client and server to the same four cores, a
thermically challenged older laptop I have around I get 150k tps at
both 10 and 100 clients.
And you can share the benchmark details? Hardware, postgres and pgbench,
please?
Either way, I'd not expect to see any GetSnapshotData() scalability
effects to show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's
just not enough concurrency.
Does this mean that our customers will not see any connection scalability
improvements with PG15, using the simplest hardware?
No. It means that on a 4-core machine GetSnapshotData() is unlikely to be
a bottleneck, because you'll hit various other bottlenecks way earlier.
I personally doubt it even makes sense to worry about scaling to this
many connections on such a tiny system too much.
The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come
up with a workload that shows it being a meaningful part of a profile.
Unless it is, performance differences are going to just be due to various
forms of noise.
Actually in the profiles I got with perf, GetSnapShotData() didn't show
up.
But that's exactly the point Andres is trying to make - if you don't see
GetSnapshotData() in the perf profile, why do you think optimizing it
will have any meaningful impact on throughput?
You see, I've seen in several places that GetSnapShotData() is the
bottleneck in scaling connections.
One of them, if I remember correctly, was at an IBM in Russia.
Another statement occurs in [1][2][3].
Just because I don't have enough hardware to force GetSnapShotData()
doesn't mean optimizing it won't make a difference.
And even on my modest hardware, we've seen gains, small but consistent.
So IMHO everyone will benefit, including the small servers.
regards,
Ranier Vilela
[1]: https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462
[2]: /messages/by-id/5198715A.6070808@vmware.com
[3]: https://it-events.com/system/attachments/files/000/001/098/original/PostgreSQL_%D0%BC%D0%B0%D1%81%D1%88%D1%82%D0%B0%D0%B1%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5.pdf?1448975472
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 28 May 2022, at 16:12, Ranier Vilela <ranier.vf@gmail.com> wrote:
Just because I don't have enough hardware to force GetSnapShotData() doesn't mean optimizing it won't make a difference.
Quoting Andres from upthread:
"To improve something like GetSnapshotData() you first have to come up with
a workload that shows it being a meaningful part of a profile. Unless it
is, performance differences are going to just be due to various forms of
noise."
If you think this is a worthwhile improvement, you need to figure out a way to
reliably test it in order to prove it.
--
Daniel Gustafsson https://vmware.com/
On 5/28/22 16:12, Ranier Vilela wrote:
On Sat, May 28, 2022 at 09:00, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 5/28/22 02:15, Ranier Vilela wrote:
On Fri, May 27, 2022 at 18:08, Andres Freund
<andres@anarazel.de> wrote:
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
> On 5/27/22 02:11, Ranier Vilela wrote:
> > ./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
> >
> > pgbench (15beta1)
> > transaction type: <builtin: select only>
> > scaling factor: 1
> > query mode: prepared
> > number of clients: 100
> > number of threads: 100
> > maximum number of tries: 1
> > duration: 60 s
> >
> > conns tps head tps patched
> >
> > 1 17126.326108 17792.414234
> > 10 82068.123383 82468.334836
> > 50 73808.731404 74678.839428
> > 80 73290.191713 73116.553986
> > 90 67558.483043 68384.906949
> > 100 65960.982801 66997.793777
> > 200 62216.011998 62870.243385
> > 300 62924.225658 62796.157548
> > 400 62278.099704 63129.555135
> > 500 63257.930870 62188.825044
> > 600 61479.890611 61517.913967
> > 700 61139.354053 61327.898847
> > 800 60833.663791 61517.913967
> > 900 61305.129642 61248.336593
> > 1000 60990.918719 61041.670996
> >
>
> These results look much saner, but IMHO it also does not show any clear
> benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower
than one would expect?
Yes, quite disappointing.
Restricting both client and server to the same four cores, a
thermically challenged older laptop I have around I get 150k tps at
both 10 and 100 clients.
And you can share the benchmark details? Hardware, postgres and
pgbench, please?
Either way, I'd not expect to see any GetSnapshotData() scalability
effects to show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's
just not enough concurrency.
Does this mean that our customers will not see any connection scalability
improvements with PG15, using the simplest hardware?
No. It means that on a 4-core machine GetSnapshotData() is unlikely to be
a bottleneck, because you'll hit various other bottlenecks way earlier.
I personally doubt it even makes sense to worry about scaling to this
many connections on such a tiny system too much.
The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come
up with a workload that shows it being a meaningful part of a profile.
Unless it is, performance differences are going to just be due to various
forms of noise.
Actually in the profiles I got with perf, GetSnapShotData() didn't
show up.
But that's exactly the point Andres is trying to make - if you don't see
GetSnapshotData() in the perf profile, why do you think optimizing it
will have any meaningful impact on throughput?
You see, I've seen in several places that GetSnapShotData() is the
bottleneck in scaling connections.
One of them, if I remember correctly, was at an IBM in Russia.
Another statement occurs in [1][2][3]
No one is claiming GetSnapshotData() can't be a bottleneck on systems
with many cores. That's certainly possible, which is why e.g. Andres
spent a lot of time optimizing for that case.
But that's what we're arguing about. You're trying to convince us that
your patch will improve things, and you're supporting that by numbers
from a machine that is unlikely to be hitting this bottleneck.
Just because I don't have enough hardware to force GetSnapShotData()
doesn't mean optimizing it won't make a difference.
Well, the question is if it actually optimizes things. Maybe it does,
which would be great, but people in this thread (including me) seem to
be fairly skeptical about that claim, because the results are frankly
entirely unconvincing.
I doubt we'll just accept changes in such sensitive places without
results from a relevant machine. Maybe if there was a clear agreement
it's a win, but that's not the case here.
And even on my modest hardware, we've seen gains, small but consistent.
So IMHO everyone will benefit, including the small servers.
No, we haven't seen any convincing gains. I've tried to explain multiple
times that the results you've shared are not showing any clear
improvement, due to only having one run for each client count (which
means there's a lot of noise), impact of binary layout in different
builds, etc. You've ignored all of that, so instead of repeating myself,
I did a simple benchmark on my two machines:
1) i5-2500k / 4 cores and 8GB RAM (so similar to what you have)
2) 2x e5-2620v3 / 16/32 cores, 64GB RAM (so somewhat bigger)
and I tested 1, 2, 5, 10, 50, 100, ...., 1000 clients using the same
benchmark as you (pgbench -S -M prepared ... ). I did 10 runs
for each client count, to calculate the median, which evens out the noise.
And for fun I tried this with gcc 9.3, 10.3 and 11.2. The script and
results from both machines are attached.
The results from xeon and gcc 11.2 look like this:
clients master patched diff
---------------------------------------
1 46460 44936 97%
2 87486 84746 97%
5 199102 192169 97%
10 344458 339403 99%
20 515257 512513 99%
30 528675 525467 99%
40 592761 594384 100%
50 694635 706193 102%
100 643950 655238 102%
200 690133 696815 101%
300 670403 677818 101%
400 678573 681387 100%
500 665349 678722 102%
600 666028 670915 101%
700 662316 662511 100%
800 647922 654745 101%
900 650274 654698 101%
1000 644482 649332 101%
Please, explain to me how this shows consistent measurable improvement?
The standard deviation is roughly 1.5% on average, and the difference is
well within that range. Even if there was a tiny improvement for the
high client counts, no one sane will run with that many clients, because
the throughput peaks at ~50 clients. So even if you gain 1% with 500
clients, it's still less than with 50 clients. If anything, this shows
regression for lower client counts.
FWIW this entirely ignores the question whether this benchmark even hits the
bottleneck this patch aims to improve. Also, there's the question of
correctness, and I'd bet Andres is right that getting a snapshot without
holding ProcArrayLock is busted.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, May 28, 2022 at 09:35, Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/28/22 02:36, Ranier Vilela wrote:
On Fri, May 27, 2022 at 18:22, Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
On Thu, May 26, 2022 at 22:30, Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but the expectation has diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in tests.
Performance degrades with a greater number of connections.
Your system has four CPUs. Once they're all busy, adding more
connections
won't improve performance. It'll just add more and more context
switching,
cache misses, and make the OS scheduler do more work.
conns tps head
10 82365.634750
50 74593.714180
80 69219.756038
90 67419.574189
100 66613.771701
Yes it is quite disappointing that with 100 connections, tps loses to 10
connections.
IMO that's entirely expected on a system with only 4 cores. Increasing
the number of connections inevitably means more overhead (you have to
track/manage more stuff). And at some point the backends start competing
for L2/L3 caches, context switches are not free either, etc. So once you
cross ~2-3x the number of cores, you should expect this.
This behavior is natural/inherent, it's unlikely to go away, and it's
one of the reasons why we recommend not to use too many connections. If
you try to maximize throughput, just don't do that. Or just use a machine
with more cores.
GetSnapshotData() isn't a bottleneck?
I'd be surprised if it showed up in a profile on your machine with that
workload in any sort of meaningful way. The snapshot reuse logic will always
work - because there are no writes - and thus the only work that needs to be
done is to acquire the ProcArrayLock briefly. And because there is only a
small number of cores, contention on the cacheline for that isn't a problem.
Thanks for sharing this.
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations. However, I think they should be
considered micro-optimizations in inner loops, because all of procarray.c is
a hot path.
As explained earlier, I don't agree that they optimize anything - you're
making some of the scalability behaviour *worse*, if it's changed at all.
The first objective, I believe, was achieved, with no performance
regression.
I agree the gains are small, based on the tests done.
There are no gains.
IMHO, I must disagree.
You don't have to, really. What you should do is show results
demonstrating the claimed gains, and so far you have not done that.
I don't want to be rude, but so far you've shown results from a
benchmark testing fork(), due to only running 10 transactions per
client, and then results from a single run for each client count (which
doesn't really show any gains either, and is so noisy).
As mentioned GetSnapshotData() is not even in perf profile, so why would
the patch even make a difference?
You've also claimed it helps generating better code on older compilers,
but you've never supported that with any evidence.
Maybe there is an improvement - show us. Do a benchmark with more runs,
to average-out the noise. Calculate VAR/STDEV to show how variable the
results are. Use that to compare results and decide if there is an
improvement. Also, keep in mind binary layout matters [1].
I redid the benchmark with a better machine:
Intel i7-10510U
RAM 8GB
SSD 512GB
Linux Ubuntu 64 bits
All files are attached, including the raw data of the results.
I did the calculations as requested.
But a quick average of the 10 benchmark runs came out roughly 10,000 tps higher.
Not bad for a simple patch made entirely of micro-optimizations.
Results attached.
regards,
Ranier Vilela
Attachments:
procarray_bench.tar.xzapplication/x-xz; name=procarray_bench.tar.xzDownload