pgbench: option delaying queries till connections establishment?

Started by Andres Freundalmost 6 years ago54 messages

andres@anarazel.de

almost 6 years ago

Hi,

I am trying to run a few benchmarks measuring the effects of patch to
make GetSnapshotData() faster in the face of larger numbers of
established connections.

Before the patch connection establishment often is very slow due to
contention. The first few connections are fast, but after that it takes
increasingly long. The first few connections constantly hold
ProcArrayLock in shared mode, which then makes it hard for new
connections to acquire it exclusively (I'm addressing that to a
significant degree in the patch FWIW).

But for a fair comparison of the runtime effects I'd like to only
compare the throughput for when connections are actually usable,
otherwise I end up benchmarking few vs many connections, which is not
useful. And because I'd like to run the numbers for a lot of different
numbers of connections etc, I can't just make each run several hour
longs to make the initial minutes not matter much.

Therefore I'd like to make pgbench wait till it has established all
connections, before they run queries.

Does anybody else see this as being useful?

If so, should this be done unconditionally? A new option? Included in an
existing one somehow?

Greetings,

Andres Freund

andres@anarazel.de

almost 6 years ago

In reply to: Andres Freund (#1)

Re: pgbench: option delaying queries till connections establishment?

Hi,

On 2020-02-27 10:01:00 -0800, Andres Freund wrote:

If so, should this be done unconditionally? A new option? Included in an
existing one somehow?

FWIW, leaving windows, error handling, and other annoyances aside, this
can be implemented fairly simply. See below.

As an example of the difference:

Before:
andres@awork3:~/build/postgres/dev-optimize/vpath$ ./src/bin/pgbench/pgbench -M prepared -c 5000 -j 100 -T 100 -P1 -S
starting vacuum...end.
progress: 100.4 s, 515307.4 tps, lat 1.374 ms stddev 7.739
transaction type: <builtin: select only>
scaling factor: 30
query mode: prepared
number of clients: 5000
number of threads: 100
duration: 100 s
number of transactions actually processed: 51728348
latency average = 1.374 ms
latency stddev = 7.739 ms
tps = 513802.541226 (including connections establishing)
tps = 521342.427158 (excluding connections establishing)

Note that there's no progress report until the end. That's because the
main thread didn't get a connection until the other threads were done.

After:

pgbench -M prepared -c 5000 -j 100 -T 100 -P1 -S
starting vacuum...end.
progress: 1.5 s, 9943.5 tps, lat 4.795 ms stddev 14.822
progress: 2.0 s, 380312.6 tps, lat 1.728 ms stddev 15.461
progress: 3.0 s, 478811.1 tps, lat 2.052 ms stddev 31.687
progress: 4.0 s, 470804.6 tps, lat 1.941 ms stddev 24.661

I think this also shows that "including/excluding connections
establishing" as well as some of the other stats reported pretty
bogus. In the 'before' case a substantial numer of the connections had
not yet been established until the end of the test run!

diff --git i/src/bin/pgbench/pgbench.c w/src/bin/pgbench/pgbench.c
index 1159757acb0..1a82c6a290e 100644
--- i/src/bin/pgbench/pgbench.c
+++ w/src/bin/pgbench/pgbench.c
@@ -310,6 +310,8 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;

+pthread_barrier_t conn_barrier;
+
 /*
  * Connection state machine states.
  */
@@ -6110,6 +6112,8 @@ main(int argc, char **argv)

     /* start threads */
 #ifdef ENABLE_THREAD_SAFETY
+    pthread_barrier_init(&conn_barrier, NULL, nthreads);
+
     for (i = 0; i < nthreads; i++)
     {
         TState     *thread = &threads[i];
@@ -6265,6 +6269,8 @@ threadRun(void *arg)
     INSTR_TIME_SET_CURRENT(thread->conn_time);
     INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);

+    pthread_barrier_wait(&conn_barrier);
+
     /* explicitly initialize the state machines */
     for (i = 0; i < nstate; i++)
     {

Greetings,

Andres Freund

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 6 years ago

In reply to: Andres Freund (#2)

Re: pgbench: option delaying queries till connections establishment?

At Thu, 27 Feb 2020 10:51:29 -0800, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2020-02-27 10:01:00 -0800, Andres Freund wrote:

If so, should this be done unconditionally? A new option? Included in an
existing one somehow?

FWIW, leaving windows, error handling, and other annoyances aside, this
can be implemented fairly simply. See below.

As an example of the difference:

Before:
andres@awork3:~/build/postgres/dev-optimize/vpath$ ./src/bin/pgbench/pgbench -M prepared -c 5000 -j 100 -T 100 -P1 -S
starting vacuum...end.
progress: 100.4 s, 515307.4 tps, lat 1.374 ms stddev 7.739
transaction type: <builtin: select only>
scaling factor: 30
query mode: prepared
number of clients: 5000
number of threads: 100
duration: 100 s
number of transactions actually processed: 51728348
latency average = 1.374 ms
latency stddev = 7.739 ms
tps = 513802.541226 (including connections establishing)
tps = 521342.427158 (excluding connections establishing)

Note that there's no progress report until the end. That's because the
main thread didn't get a connection until the other threads were done.

After:

pgbench -M prepared -c 5000 -j 100 -T 100 -P1 -S
starting vacuum...end.
progress: 1.5 s, 9943.5 tps, lat 4.795 ms stddev 14.822
progress: 2.0 s, 380312.6 tps, lat 1.728 ms stddev 15.461
progress: 3.0 s, 478811.1 tps, lat 2.052 ms stddev 31.687
progress: 4.0 s, 470804.6 tps, lat 1.941 ms stddev 24.661

I think this also shows that "including/excluding connections
establishing" as well as some of the other stats reported pretty
bogus. In the 'before' case a substantial numer of the connections had
not yet been established until the end of the test run!

I see it useful. In most cases we don't care connection time of
pgbench. Especially in the mentioned case the result is just bogus. I
think the reason for "including/excluding connection establishing" is
not that people wants to see how long connection took to establish but
that how long the substantial work took. If each client did run with
continuously re-establishing new connections the connection time would
be useful, but actually all the connections are established at once at
the beginning.

So FWIW I prefer that the barrier is applied by default (that is, it
can be disabled) and the progress time starts at the time all clients
has been established.

starting vacuum...end.

+ time to established 5000 connections: 1323ms
! progress: 1.0 s, 330000.5 tps, lat 2.795 ms stddev 14.822
! progress: 2.0 s, 380312.6 tps, lat 1.728 ms stddev 15.461
! progress: 3.0 s, 478811.1 tps, lat 2.052 ms stddev 31.687
! progress: 4.0 s, 470804.6 tps, lat 1.941 ms stddev 24.661

transaction type: <builtin: select only>
scaling factor: 30
query mode: prepared
number of clients: 5000
number of threads: 100
duration: 100 s
number of transactions actually processed: 51728348
latency average = 1.374 ms
latency stddev = 7.739 ms
tps = 513802.541226 (including connections establishing)
tps = 521342.427158 (excluding connections establishing)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

coelho@cri.ensmp.fr

almost 6 years ago

In reply to: Andres Freund (#1)

Re: pgbench: option delaying queries till connections establishment?

Hello Andres,

Therefore I'd like to make pgbench wait till it has established all
connections, before they run queries.

Does anybody else see this as being useful?

Yes, I think that having this behavior available would make sense.

If so, should this be done unconditionally?

Dunno. I should think about it. I'd say probably.

Pgbench is more or less designed to run a long hopefully steady-state
benchmark, so that the initial connection setup is always negligeable. Not
complying with this hypothesis quite often leads to weird results.

A new option?

Maybe, if not unconditional.

If there is an unconditional barrier, the excluding/including connection
stuff does not make a lot of sense when not under -C, if it did make any
sense before…

Included in an existing one somehow?

Which one would you suggest?

Adding a synchronization barrier should be simple enough, I thought about
it in the past.

However, I'd still be wary that it is no silver bullet: if you start a lot
of threads compared to the number of available cores, pgbench would
basically overload the system, and you would experience a lot of waiting
time which reflects that the client code has not got enough cpu time.
Basically you would be testing the OS process/thread management
performance.

On my 4-core laptop, with a do-nothing script (\set i 0):

sh> pgbench -T 10 -f nope.sql -P 1 -j 10 -c 10
latency average = 0.000 ms
latency stddev = 0.049 ms
tps = 21048841.630291 (including connections establishing)
tps = 21075139.938887 (excluding connections establishing)

sh> pgbench -T 10 -f nope.sql -P 1 -j 100 -c 100
latency average = 0.002 ms
latency stddev = 0.470 ms
tps = 23846777.241370 (including connections establishing)
tps = 24132252.146257 (excluding connections establishing)

Throughput is slightly better, latency average and variance explode
because each thread is given stretches of cpu time to advance, then wait
for the next round of cpu time.

--
Fabien.

coelho@cri.ensmp.fr

almost 6 years ago

In reply to: Kyotaro Horiguchi (#3)

Re: pgbench: option delaying queries till connections establishment?

Hello Kyotaro-san,

I think this also shows that "including/excluding connections
establishing" as well as some of the other stats reported pretty
bogus. In the 'before' case a substantial numer of the connections had
not yet been established until the end of the test run!

I see it useful. In most cases we don't care connection time of
pgbench. Especially in the mentioned case the result is just bogus. I
think the reason for "including/excluding connection establishing" is
not that people wants to see how long connection took to establish but
that how long the substantial work took. If each client did run with
continuously re-establishing new connections the connection time would
be useful, but actually all the connections are established at once at
the beginning.

So FWIW I prefer that the barrier is applied by default

Yep.

(that is, it can be disabled)

On reflection, I'm not sure I see a use case for not running the barrier
if it is available.

and the progress time starts at the time all clients has been
established.

Yep, the start time should be set after the connection barrier, and
possibly before a start barrier to ensure that no transaction has started
before the start time: although performance measures are expected to be
slightly false because of how they are measured (measuring takes time),
from a benchmarking perspective the displayed result should be <= the
actual performance.

Now, again, if long benchmarks are run, which for a db should more or less
always be the case, this should not matter much.

starting vacuum...end.

+ time to established 5000 connections: 1323ms

Yep, maybe showing the initial connection time separately.

--
Fabien.

andres@anarazel.de

almost 6 years ago

In reply to: Fabien COELHO (#4)

Re: pgbench: option delaying queries till connections establishment?

Hi,

On 2020-02-29 15:29:19 +0100, Fabien COELHO wrote:

Pgbench is more or less designed to run a long hopefully steady-state
benchmark, so that the initial connection setup is always negligeable. Not
complying with this hypothesis quite often leads to weird results.

I don't think this is a good starting point. Sure, a longer run will
yield more precise results, and one needs more than just an
instantaneous measurement. But in a lot of cases we want to use pgbench
to measure a lot of different variations, making it infeasible for each
run to be all that long.

Of course whether that's feasible depends on the workload (e.g. readonly
runs can be shorter than read/write runs).

Also note that in the case that made me look at this, you'd have to run
the test for *weeks* to drown out the performance difference that's
solely caused by difference in how long individual connects are
established. Partially because the "excluding connection establishing"
number is entirely broken, but also because fewer connections having
been established changes the performance so much.

I think we should also consider making pgbench actually use non-blocking
connection establishment. It seems pretty weird that that's the one
libpq operation where we don't? In particular for -C, with -c > -j,
that makes the results pretty meaningless.

Adding a synchronization barrier should be simple enough, I thought about it
in the past.

However, I'd still be wary that it is no silver bullet: if you start a lot
of threads compared to the number of available cores, pgbench would
basically overload the system, and you would experience a lot of waiting
time which reflects that the client code has not got enough cpu time.
Basically you would be testing the OS process/thread management performance.

Sure, that's possible. But I don't see what that has to do with the
barrier?

Also, most scripts actually have client/server interaction...

Greetings,

Andres Freund

coelho@cri.ensmp.fr

almost 6 years ago

In reply to: Andres Freund (#2)

1 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

Hello Andres,

FWIW, leaving windows, error handling, and other annoyances aside, this
can be implemented fairly simply. See below.

Attached an attempt at improving things.

I've put 2 barriers: one so that all threads are up, one when all
connections are setup and the bench is ready to go.

I've done a blind attempt at implementing the barrier stuff on windows.

I've changed the performance calculations depending on -C or not. Ramp-up
effects are smoothed.

I've merged all time-related stuff (time_t, instr_time, int64) to use a
unique type (pg_time_usec_t) and set of functions/macros, which simplifies
the code somehow.

I've tried to do some variable renaming to distinguish timestamps and
intervals.

This is work in progress.

--
Fabien.

Attachments:

pgbench-barrier-2.patchtext/x-diff; name=pgbench-barrier-2.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b8864c6ae5..a16d9d49e1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -113,18 +113,29 @@ typedef struct socket_set
  */
 
 #ifdef WIN32
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+
 /* Use native win32 threads on Windows */
 typedef struct win32_pthread *pthread_t;
 typedef int pthread_attr_t;
+typedef SYNCHRONIZATION_BARRIER pthread_barrier_t;
 
 static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
 static int	pthread_join(pthread_t th, void **thread_return);
+
+static int	pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);
+static int	pthread_barrier_wait(pthread_barrier_t *barrier);
+static int	pthread_barrier_destroy(pthread_barrier_t *barrier);
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
 #include <pthread.h>
 #else
 /* No threads implementation, use none (-j 1) */
 #define pthread_t void *
+#define pthread_barrier_t void *
+#define pthread_barrier_init(a, b, c) /* ignore */
+#define pthread_barrier_wait(a) /* ignore */
+#define pthread_barrier_destroy(a) /* ignore */
 #endif
 
 
@@ -291,7 +302,7 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
+	pg_time_usec_t	start_time;		/* interval start time, for aggregates */
 	int64		cnt;			/* number of transactions, including skipped */
 	int64		skipped;		/* number of transactions skipped under --rate
 								 * and --latency-limit */
@@ -310,6 +321,11 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Thread synchronization barriers */
+static pthread_barrier_t
+	start_barrier,		/* all threads are started */
+	bench_barrier;		/* benchmarking ready to start */
+
 /*
  * Connection state machine states.
  */
@@ -417,10 +433,10 @@ typedef struct
 	bool		vars_sorted;	/* are variables sorted by name? */
 
 	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction (usec) */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd (usec) */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times (usec) */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies (usec) */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -452,10 +468,13 @@ typedef struct
 	FILE	   *logfile;		/* where to log, or NULL */
 
 	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
+	pg_time_usec_t	start_time;		/* thread start (creation) time */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start_time; 	/* thread is really benchmarking */
+	pg_time_usec_t	conn_delay;		/* cumulated connection and deconnection delays (usec) */
+
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -594,10 +613,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void append_fillfactor(char *opts, int len);
 static void addScript(ParsedScript script);
@@ -1102,9 +1121,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2857,7 +2876,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2867,7 +2885,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2902,29 +2920,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				PG_TIME_SET_CURRENT_LAZY(now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_get_usec();
+
+					thread->conn_delay += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2932,7 +2951,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -2968,12 +2987,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					PG_TIME_SET_CURRENT_LAZY(now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3006,9 +3022,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				PG_TIME_SET_CURRENT_LAZY(now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3031,7 +3047,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					PG_TIME_SET_CURRENT_LAZY(now);
 					st->stmt_begin = now;
 				}
 
@@ -3194,8 +3210,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				PG_TIME_SET_CURRENT_LAZY(now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3215,13 +3231,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					PG_TIME_SET_CURRENT_LAZY(now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3247,7 +3262,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3285,7 +3300,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3327,8 +3342,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		PG_TIME_SET_CURRENT_LAZY(*now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3431,7 +3446,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3441,14 +3456,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_get_usec();
 
 	Assert(use_log);
 
@@ -3468,13 +3484,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3502,17 +3517,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3526,7 +3537,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3536,11 +3547,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		PG_TIME_SET_CURRENT_LAZY(*now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3803,10 +3814,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3856,7 +3864,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_get_usec();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3881,11 +3889,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_get_usec() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3895,11 +3900,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_get_usec() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4095,10 +4097,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_get_usec();
 
 		switch (*step)
 		{
@@ -4140,12 +4140,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_get_usec() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5064,12 +5059,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5116,16 +5111,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5165,20 +5151,16 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_delay,		/* benchmarking time */
+			 pg_time_usec_t conn_total_delay,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_delay,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
-
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
+	double 		bench_delay = PG_TIME_GET_DOUBLE(total_delay);
+	double		tps = ntx / bench_delay;
 
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
@@ -5210,8 +5192,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5224,7 +5205,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_delay * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5239,8 +5220,16 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_delay / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_delay);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5261,7 +5250,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_delay);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5309,10 +5298,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_get_usec();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5415,9 +5401,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+				start_time,				/* start up time */
+				bench_start_time = 0,	/* first recorded benchmarking time */
+				conn_total_delay;		/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6100,25 +6087,28 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_get_usec();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
+	pthread_barrier_init(&start_barrier, NULL, nthreads);
+	pthread_barrier_init(&bench_barrier, NULL, nthreads);
+
 	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
+		thread->start_time = pg_time_get_usec();
 
 		/* compute when to stop */
 		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
+			end_time = thread->start_time + (int64) 1000000 * duration;
 
 		/* the first thread (i = 0) is executed by main thread */
 		if (i > 0)
@@ -6137,17 +6127,17 @@ main(int argc, char **argv)
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	threads[0].start_time = pg_time_get_usec();
 	/* compute when to stop */
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
+		end_time = threads[0].start_time + (int64) 1000000 * duration;
 	threads[0].thread = INVALID_THREAD;
 #endif							/* ENABLE_THREAD_SAFETY */
 
 	/* wait for threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_delay = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
@@ -6173,8 +6163,14 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_delay += thread->conn_delay;
+
+		/* first recorded benchmarking start time */
+		if (bench_start_time == 0 || thread->bench_start_time < bench_start_time)
+			bench_start_time = thread->bench_start_time;
 	}
+
+	/* FIXME: should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
@@ -6187,9 +6183,11 @@ main(int argc, char **argv)
 	 * easily exceed steady state performance.  This is one of the many ways
 	 * short runs convey deceptive performance figures.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_get_usec() - bench_start_time, conn_total_delay,
+				 bench_start_time - start_time, latency_late);
+
+	pthread_barrier_destroy(&start_barrier);
+	pthread_barrier_destroy(&bench_barrier);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6202,34 +6200,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6250,32 +6230,55 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+	{
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+	}
+
+	/* READY */
+	pthread_barrier_wait(&start_barrier);
+
+	thread_start = pg_time_get_usec();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
+
+		/* compute connection delay */
+		thread->conn_delay = pg_time_get_usec() - thread->started_time;
 	}
-
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
-
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_delay = 0;
 	}
 
+	/* GO */
+	pthread_barrier_wait(&bench_barrier);
+
+	start = pg_time_get_usec();
+	thread->bench_start_time = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6284,27 +6287,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				PG_TIME_SET_CURRENT_LAZY(now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6339,19 +6336,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			PG_TIME_SET_CURRENT_LAZY(now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6400,7 +6390,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6438,11 +6428,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_get_usec();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6460,17 +6447,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_get_usec();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_delay += pg_time_get_usec() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
@@ -6783,4 +6770,26 @@ pthread_join(pthread_t th, void **thread_return)
 	return 0;
 }
 
+static int
+pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads)
+{
+	/* no spinning: threads are not expected to arrive at the barrier together */
+	bool ok = InitializeSynchronizationBarrier(barrier, nthreads, 0);
+	return 0;
+}
+
+static int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool last = EnterSynchronizationBarrier(barrier, SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY);
+	return last ? PTHREAD_BARRIER_SERIAL_THREAD : 0;
+}
+
+static int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	(void) DeleteSynchronizationBarrier(barrier);
+	return 0;
+}
+
 #endif							/* WIN32 */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..df58604121 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,27 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_get_usec(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+#define PG_TIME_SET_CURRENT_LAZY(t)		\
+	if ((t) == 0) 						\
+		(t) = pg_time_get_usec()
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */

andres@anarazel.de

almost 6 years ago

In reply to: Fabien COELHO (#7)

Re: pgbench: option delaying queries till connections establishment?

Hi,

On 2020-03-01 22:16:06 +0100, Fabien COELHO wrote:

Hello Andres,

FWIW, leaving windows, error handling, and other annoyances aside, this
can be implemented fairly simply. See below.

Attached an attempt at improving things.

Awesome!

I've put 2 barriers: one so that all threads are up, one when all
connections are setup and the bench is ready to go.

I'd done similarly locally.

Slight aside: Have you ever looked at moving pgbench to non-blocking
connection establishment? It seems weird to use non-blocking everywhere
but connection establishment.

I've done a blind attempt at implementing the barrier stuff on windows.

Neat.

I've changed the performance calculations depending on -C or not. Ramp-up
effects are smoothed.

I've merged all time-related stuff (time_t, instr_time, int64) to use a
unique type (pg_time_usec_t) and set of functions/macros, which simplifies
the code somehow.

Hm. I'm not convinced it's a good idea for pgbench to do its own thing
here.

#ifdef WIN32
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+
/* Use native win32 threads on Windows */
typedef struct win32_pthread *pthread_t;
typedef int pthread_attr_t;
+typedef SYNCHRONIZATION_BARRIER pthread_barrier_t;

static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int	pthread_join(pthread_t th, void **thread_return);
+
+static int	pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);
+static int	pthread_barrier_wait(pthread_barrier_t *barrier);
+static int	pthread_barrier_destroy(pthread_barrier_t *barrier);

How about using 'struct unknown_type *unused' instead of "unused"?
Because the void *unused will accept everything...

+/* Thread synchronization barriers */
+static pthread_barrier_t
+	start_barrier,		/* all threads are started */
+	bench_barrier;		/* benchmarking ready to start */
+

We don't really need two barriers here. The way that pthread barriers
are defined is that they 'reset' after all the threads have arrived. You
can argue we still want that, but ...

@@ -5165,20 +5151,16 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
/* print out results */
static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,

Given that we're changing the output (for the better) of pgbench again,
I wonder if we should add the pgbench version to the benchmark
output. Otherwise it seems easy to end up e.g. seeing a performance
difference between pg12 and pg14, where all that's actually happening is
a different output because each run used the respective pgbench version.

+			 pg_time_usec_t total_delay,		/* benchmarking time */
+			 pg_time_usec_t conn_total_delay,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_delay,	/* !is_connect */
+			 int64 latency_late)

I'm not a fan of naming these 'delay'. To me that doesn't sounds like
it's about the time the total benchmark has taken.

@@ -5239,8 +5220,16 @@ printResults(StatsData *total, instr_time total_time,
0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
}

-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_delay / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_delay);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}

Keeping these separate makes sense to me, they're just so wildly
different.

+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_get_usec(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}

For me the function name sounds like you're getting the usec out of a
pg_time. Not that it's getting a new timestamp.

+#define PG_TIME_SET_CURRENT_LAZY(t)		\
+	if ((t) == 0) 						\
+		(t) = pg_time_get_usec()
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
#endif							/* INSTR_TIME_H */

I'd make it an inline function instead of this.

Greetings,

Andres Freund

coelho@cri.ensmp.fr

almost 6 years ago

In reply to: Andres Freund (#8)

Re: pgbench: option delaying queries till connections establishment?

Hello Andres,

Slight aside: Have you ever looked at moving pgbench to non-blocking
connection establishment? It seems weird to use non-blocking everywhere
but connection establishment.

Nope. If there is some interest, why not. The reason for not doing it is
that the typical use case is just to connect once at the beginning so that
connections do not matter anyway. Now with -C it makes sense.

I've changed the performance calculations depending on -C or not. Ramp-up
effects are smoothed.

I've merged all time-related stuff (time_t, instr_time, int64) to use a
unique type (pg_time_usec_t) and set of functions/macros, which simplifies
the code somehow.

Hm. I'm not convinced it's a good idea for pgbench to do its own thing
here.

Having 3 time types (in fact, 4, double is used as well for some
calculations) in just one file to deal with time does not help much to
understand the code, and there is quite a few line to translate from one
to the other.

+static int pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);

How about using 'struct unknown_type *unused' instead of "unused"?
Because the void *unused will accept everything...

Never encountered this pattern. It does not seem to be used anywhere in pg
sources. I'd be afraid that some compilers would complain. I can try
anyway.

+/* Thread synchronization barriers */
+static pthread_barrier_t
+	start_barrier,		/* all threads are started */
+	bench_barrier;		/* benchmarking ready to start */
+
We don't really need two barriers here. The way that pthread barriers
are defined is that they 'reset' after all the threads have arrived. You
can argue we still want that, but ...

Yes, one barrier could be reused.

/* print out results */
static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
Given that we're changing the output (for the better) of pgbench again,
I wonder if we should add the pgbench version to the benchmark
output.

Dunno. Maybe.

Otherwise it seems easy to end up e.g. seeing a performance
difference between pg12 and pg14, where all that's actually happening is
a different output because each run used the respective pgbench version.

Yep.

+			 pg_time_usec_t total_delay,		/* benchmarking time */
+			 pg_time_usec_t conn_total_delay,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_delay,	/* !is_connect */
+			 int64 latency_late)
I'm not a fan of naming these 'delay'. To me that doesn't sounds like
it's about the time the total benchmark has taken.

Hmmm… I'd like to differentiate variable names which contain timestamp
versus those which contain intervals, given that it is the same underlying
type. That said, I'm not very happy with "delay" either.

What would you suggest?

+pg_time_get_usec(void)

For me the function name sounds like you're getting the usec out of a
pg_time. Not that it's getting a new timestamp.

Ok, I'll think of something else, possibly "pg_now"? "pg_time_now"?

+#define PG_TIME_SET_CURRENT_LAZY(t)		\
+	if ((t) == 0) 						\
+		(t) = pg_time_get_usec()
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
#endif							/* INSTR_TIME_H */

I'd make it an inline function instead of this.

I did it that way because it was already done with defines on instr_time,
but I'm fine with inline.

I'll try to look at it over the week-end.

--
Fabien.

coelho@cri.ensmp.fr

almost 6 years ago

In reply to: Andres Freund (#8)

1 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

Hello Andres,

I've changed the performance calculations depending on -C or not. Ramp-up
effects are smoothed.

I've merged all time-related stuff (time_t, instr_time, int64) to use a
unique type (pg_time_usec_t) and set of functions/macros, which simplifies
the code somehow.

Hm. I'm not convinced it's a good idea for pgbench to do its own thing
here.

Given the unjustifiable heterogeneousness it induces and the simpler code
after the move, I think it is much better. Pgbench cloc is smaller after
barrier are added (4655 to 4650) thanks to that and a few other code
simplifications. Removing all INSTR_TIME_* costly macros is a relief in
itself…

+static int pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);

How about using 'struct unknown_type *unused' instead of "unused"?

Haven't done it because I found no other instances in pg, and anyway this
code is only used once locally and NULL is passed.

+static pthread_barrier_t
+	start_barrier,		/* all threads are started */
+	bench_barrier;		/* benchmarking ready to start */

We don't really need two barriers here.

Indeed. Down to one.

/* print out results */

Given that we're changing the output (for the better) of pgbench again,
I wonder if we should add the pgbench version to the benchmark
output.

Not sure about it, but done anyway.

+			 pg_time_usec_t total_delay,		/* benchmarking time */
+			 pg_time_usec_t conn_total_delay,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_delay,	/* !is_connect */
+			 int64 latency_late)
I'm not a fan of naming these 'delay'. To me that doesn't sounds like
it's about the time the total benchmark has taken.

I have used '_duration', and tried to clarify some field and variable
names depending on what data they actually hold.

+		printf("tps = %f (including reconnection times)\n", tps);
+		printf("tps = %f (without initial connection establishing)\n", tps);
Keeping these separate makes sense to me, they're just so wildly
different.

Yep. I've added a comment about that.

+static inline pg_time_usec_t
+pg_time_get_usec(void)

For me the function name sounds like you're getting the usec out of a
pg_time. Not that it's getting a new timestamp.

I've used "pg_time_now()".

+#define PG_TIME_SET_CURRENT_LAZY(t)		\
+	if ((t) == 0) 						\
+		(t) = pg_time_get_usec()
I'd make it an inline function instead of this.

Done "pg_time_now_lazy(&now)"

I have also simplified the code around thread creation & join because it
was a mess: thread 0 was run in the middle of the stat collection loop…

I have updated the doc with actual current output, but excluding the
version display which would have to be changed between releases.

--
Fabien.

Attachments:

pgbench-barrier-3.patchtext/x-diff; name=pgbench-barrier-3.patchDownload

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 4c48a58ed2..b77c1089d8 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -55,8 +55,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -1835,7 +1837,6 @@ END;
   <para>
    For the default script, the output will look similar to this:
 <screen>
-starting vacuum...end.
 transaction type: &lt;builtin: TPC-B (sort of)&gt;
 scaling factor: 1
 query mode: simple
@@ -1843,22 +1844,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b8864c6ae5..b8366f2c5c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -113,18 +113,29 @@ typedef struct socket_set
  */
 
 #ifdef WIN32
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+
 /* Use native win32 threads on Windows */
 typedef struct win32_pthread *pthread_t;
 typedef int pthread_attr_t;
+typedef SYNCHRONIZATION_BARRIER pthread_barrier_t;
 
 static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
 static int	pthread_join(pthread_t th, void **thread_return);
+
+static int	pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);
+static int	pthread_barrier_wait(pthread_barrier_t *barrier);
+static int	pthread_barrier_destroy(pthread_barrier_t *barrier);
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
 #include <pthread.h>
 #else
 /* No threads implementation, use none (-j 1) */
 #define pthread_t void *
+#define pthread_barrier_t void *
+#define pthread_barrier_init(a, b, c) /* ignore */
+#define pthread_barrier_wait(a) /* ignore */
+#define pthread_barrier_destroy(a) /* ignore */
 #endif
 
 
@@ -291,9 +302,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -310,6 +321,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static pthread_barrier_t barrier;
+
 /*
  * Connection state machine states.
  */
@@ -416,11 +430,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -449,13 +463,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -594,10 +611,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void append_fillfactor(char *opts, int len);
 static void addScript(ParsedScript script);
@@ -1102,9 +1119,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2857,7 +2874,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2867,7 +2883,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2902,29 +2918,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2932,7 +2949,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -2968,12 +2985,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3006,9 +3020,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3031,7 +3045,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3194,8 +3208,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3215,13 +3229,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3247,7 +3260,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3285,7 +3298,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3327,8 +3340,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3431,7 +3444,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3441,14 +3454,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3468,13 +3482,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3502,17 +3515,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3526,7 +3535,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3536,11 +3545,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3803,10 +3812,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3856,7 +3862,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3881,11 +3887,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3895,11 +3898,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4095,10 +4095,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4140,12 +4138,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5064,12 +5057,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5116,16 +5109,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5165,21 +5149,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5210,8 +5191,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5224,7 +5204,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5239,8 +5219,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5261,7 +5258,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5309,10 +5306,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5415,9 +5409,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6100,67 +6095,55 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
+	pthread_barrier_init(&barrier, NULL, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
+		int			err;
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
+		thread->create_time = pg_time_now();
+		err = pthread_create(&thread->thread, NULL, threadRun, thread);
 
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
-		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
-
-			if (err != 0 || thread->thread == INVALID_THREAD)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
-		}
-		else
+		if (err != 0 || thread->thread == INVALID_THREAD)
 		{
-			thread->thread = INVALID_THREAD;
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
+
+	/* run thread 0 directly */
 	threads[0].thread = INVALID_THREAD;
-#endif							/* ENABLE_THREAD_SAFETY */
+	(void) threadRun(&threads[0]);
 
-	/* wait for threads and accumulate results */
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			pthread_join(thread->thread, NULL);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6173,23 +6156,26 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
+
+	pthread_barrier_destroy(&barrier);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6202,34 +6188,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6250,32 +6218,55 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+	{
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+	}
+
+	/* READY */
+	pthread_barrier_wait(&barrier);
+
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
+
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
 	}
-
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
-
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	pthread_barrier_wait(&barrier);
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6284,27 +6275,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6339,19 +6324,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6400,7 +6378,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6438,11 +6416,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6460,17 +6435,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
@@ -6783,4 +6758,26 @@ pthread_join(pthread_t th, void **thread_return)
 	return 0;
 }
 
+static int
+pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads)
+{
+	/* no spinning: threads are not expected to arrive at the barrier together */
+	bool ok = InitializeSynchronizationBarrier(barrier, nthreads, 0);
+	return 0;
+}
+
+static int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool last = EnterSynchronizationBarrier(barrier, SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY);
+	return last ? PTHREAD_BARRIER_SERIAL_THREAD : 0;
+}
+
+static int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	(void) DeleteSynchronizationBarrier(barrier);
+	return 0;
+}
+
 #endif							/* WIN32 */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..e2e9567613 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,30 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */

coelho@cri.ensmp.fr

almost 6 years ago

In reply to: Andres Freund (#8)

1 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

Hallo Andres,

Slight aside: Have you ever looked at moving pgbench to non-blocking
connection establishment? It seems weird to use non-blocking everywhere
but connection establishment.

Attached an attempt at doing that, mostly done for fun. It seems to be a
little slower on a local socket.

What do you think?

Maybe it would be worth having it with an option?

--
Fabien.

Attachments:

pgbench-async-connect-1.patchtext/x-diff; name=pgbench-async-connect-1.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b8864c6ae5..83ac2235a5 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -102,8 +102,9 @@ typedef struct socket_set
 
 typedef struct socket_set
 {
-	int			maxfd;			/* largest FD currently set in fds */
-	fd_set		fds;
+	int			maxfd;			/* largest FD currently set in reads or writes */
+	fd_set		reads;
+	fd_set		writes;
 } socket_set;
 
 #endif							/* POLL_USING_SELECT */
@@ -318,15 +319,32 @@ typedef enum
 	/*
 	 * The client must first choose a script to execute.  Once chosen, it can
 	 * either be throttled (state CSTATE_PREPARE_THROTTLE under --rate), start
-	 * right away (state CSTATE_START_TX) or not start at all if the timer was
-	 * exceeded (state CSTATE_FINISHED).
+	 * right away (state CSTATE_START_TX or CSTATE_CONNECT on --connect) or
+	 * not start at all if the timer was exceeded (state CSTATE_FINISHED).
 	 */
 	CSTATE_CHOOSE_SCRIPT,
 
 	/*
-	 * CSTATE_START_TX performs start-of-transaction processing.  Establishes
-	 * a new connection for the transaction in --connect mode, records the
-	 * transaction start time, and proceed to the first command.
+	 * In CSTATE_PREPARE_THROTTLE state, we calculate when to begin the next
+	 * transaction, and advance to CSTATE_THROTTLE.  CSTATE_THROTTLE state
+	 * sleeps until that moment, then advances to CSTATE_CONNECT (-C) or
+	 * CSTATE_START_TX (not -C), or CSTATE_FINISHED if the next transaction
+	 * would start beyond the end of the run.
+	 */
+	CSTATE_PREPARE_THROTTLE,
+	CSTATE_THROTTLE,
+
+	/*
+	 * CSTATE_CONNECT Establishes a connection asynchronously before starting
+	 * a transaction, under -C. The state is then CSTATE_CONNECTING till the
+	 * connection is established, and then CSTATE_START_TX.
+	 */
+	CSTATE_CONNECT,
+	CSTATE_CONNECTING,
+
+	/*
+	 * CSTATE_START_TX performs start-of-transaction processing.
+	 * It records the transaction start time, and proceed to the first command.
 	 *
 	 * Note: once a script is started, it will either error or run till its
 	 * end, where it may be interrupted. It is not interrupted while running,
@@ -335,16 +353,6 @@ typedef enum
 	 */
 	CSTATE_START_TX,
 
-	/*
-	 * In CSTATE_PREPARE_THROTTLE state, we calculate when to begin the next
-	 * transaction, and advance to CSTATE_THROTTLE.  CSTATE_THROTTLE state
-	 * sleeps until that moment, then advances to CSTATE_START_TX, or
-	 * CSTATE_FINISHED if the next transaction would start beyond the end of
-	 * the run.
-	 */
-	CSTATE_PREPARE_THROTTLE,
-	CSTATE_THROTTLE,
-
 	/*
 	 * We loop through these states, to process each command in the script:
 	 *
@@ -401,6 +409,7 @@ typedef struct
 	int			id;				/* client No. */
 	ConnectionStateEnum state;	/* state machine's current state. */
 	ConditionalStack cstack;	/* enclosing conditionals state */
+	PostgresPollingStatusType	poll_state; /* async connection status */
 
 	/*
 	 * Separate randomness for each client. This is used for random functions
@@ -421,6 +430,7 @@ typedef struct
 	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
 	instr_time	txn_begin;		/* used for measuring schedule lag times */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	instr_time	conn_begin;		/* start of asynchronous connection under -C */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -607,9 +617,9 @@ static void setalarm(int seconds);
 static socket_set *alloc_socket_set(int count);
 static void free_socket_set(socket_set *sa);
 static void clear_socket_set(socket_set *sa);
-static void add_socket_to_set(socket_set *sa, int fd, int idx);
+static void add_socket_to_set(socket_set *sa, int fd, int idx, bool reading);
 static int	wait_on_socket_set(socket_set *sa, int64 usecs);
-static bool socket_has_input(socket_set *sa, int fd, int idx);
+static bool socket_is_ready(socket_set *sa, int fd, int idx);
 
 
 /* callback functions for our flex lexer */
@@ -1165,6 +1175,57 @@ tryExecuteStatement(PGconn *con, const char *sql)
 	PQclear(res);
 }
 
+#define PARAMS_ARRAY_SIZE	7
+
+/* set connection parameters */
+static void
+setPQconnectionParams(const char *keywords[PARAMS_ARRAY_SIZE],
+					  const char *values[PARAMS_ARRAY_SIZE],
+					  const char *password)
+{
+	keywords[0] = "host";
+	values[0] = pghost;
+	keywords[1] = "port";
+	values[1] = pgport;
+	keywords[2] = "user";
+	values[2] = login;
+	keywords[3] = "password";
+	values[3] = password;
+	keywords[4] = "dbname";
+	values[4] = dbName;
+	keywords[5] = "fallback_application_name";
+	values[5] = progname;
+	keywords[6] = NULL;
+	values[6] = NULL;
+}
+
+/* start a connection to the backend, asynchronously */
+static PGconn *
+doAsyncConnect(void)
+{
+	const char *keywords[PARAMS_ARRAY_SIZE];
+	const char *values[PARAMS_ARRAY_SIZE];
+	PGconn	   *conn;
+
+	setPQconnectionParams(keywords, values, NULL);
+	conn = PQconnectStartParams(keywords, values, true);
+
+	if (!conn)
+		pg_log_error("async connection to database \"%s\" failed on start", dbName);
+	else if (PQstatus(conn) == CONNECTION_BAD)
+	{
+		if (PQconnectionNeedsPassword(conn))
+			pg_log_error("async connection to database \"%s\" expecting a password", dbName);
+		else
+			pg_log_error("async connection to database \"%s\" failed", dbName);
+
+		PQfinish(conn);
+		conn = NULL;
+	}
+
+	return conn;
+}
+
 /* set up a connection to the backend */
 static PGconn *
 doConnect(void)
@@ -1180,26 +1241,10 @@ doConnect(void)
 	 */
 	do
 	{
-#define PARAMS_ARRAY_SIZE	7
-
 		const char *keywords[PARAMS_ARRAY_SIZE];
 		const char *values[PARAMS_ARRAY_SIZE];
 
-		keywords[0] = "host";
-		values[0] = pghost;
-		keywords[1] = "port";
-		values[1] = pgport;
-		keywords[2] = "user";
-		values[2] = login;
-		keywords[3] = "password";
-		values[3] = have_password ? password : NULL;
-		keywords[4] = "dbname";
-		values[4] = dbName;
-		keywords[5] = "fallback_application_name";
-		values[5] = progname;
-		keywords[6] = NULL;
-		values[6] = NULL;
-
+		setPQconnectionParams(keywords, values, have_password ? password : NULL);
 		new_pass = false;
 
 		conn = PQconnectdbParams(keywords, values, true);
@@ -2897,31 +2942,58 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * a new transaction, or to get throttled if that's requested.
 				 */
 				st->state = timer_exceeded ? CSTATE_FINISHED :
-					throttle_delay > 0 ? CSTATE_PREPARE_THROTTLE : CSTATE_START_TX;
+					throttle_delay > 0 ? CSTATE_PREPARE_THROTTLE :
+						is_connect ? CSTATE_CONNECT : CSTATE_START_TX;
 				break;
 
 				/* Start new transaction (script) */
-			case CSTATE_START_TX:
+			case CSTATE_CONNECT:
 
-				/* establish connection if needed, i.e. under --connect */
-				if (st->con == NULL)
+				/* establish async connection under --connect */
+				Assert(st->con == NULL);
+
+				INSTR_TIME_SET_CURRENT_LAZY(now);
+				st->conn_begin = now;
+
+				if ((st->con = doAsyncConnect()) == NULL)
 				{
-					instr_time	start;
+					pg_log_error("client %d aborted while starting async connection", st->id);
+					st->state = CSTATE_ABORTED;
+				}
+				else
+				{
+					st->state = CSTATE_CONNECTING;
+					/* once PQconnectPoll is called, either _READING or _WRITING */
+					st->poll_state = PGRES_POLLING_OK;
+				}
+				break;
+
+			case CSTATE_CONNECTING:
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
-					if ((st->con = doConnect()) == NULL)
-					{
-						pg_log_error("client %d aborted while establishing connection", st->id);
-						st->state = CSTATE_ABORTED;
-						break;
-					}
+				Assert(st->con != NULL);
+
+				st->poll_state = PQconnectPoll(st->con);
+
+				if (st->poll_state == PGRES_POLLING_FAILED)
+				{
+					pg_log_error("client %d aborted while establishing connection", st->id);
+					st->state = CSTATE_ABORTED;
+				}
+				else if (st->poll_state == PGRES_POLLING_OK)
+				{
+					/* record elapsed connection time */
 					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
-
+					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, st->conn_begin);
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
+					st->state = CSTATE_START_TX;
 				}
+				/* else PGRES_POLLING_{READING,WRITING}, i.e. waiting for something */
+				break;
+
+			case CSTATE_START_TX:
+
+				Assert(st->con != NULL && PQstatus(st->con) == CONNECTION_OK);
 
 				/* record transaction start time */
 				INSTR_TIME_SET_CURRENT_LAZY(now);
@@ -3012,7 +3084,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
-				st->state = timer_exceeded ? CSTATE_FINISHED : CSTATE_START_TX;
+				st->state = timer_exceeded ? CSTATE_FINISHED :
+					is_connect ? CSTATE_CONNECT : CSTATE_START_TX;
 				break;
 
 				/*
@@ -6308,13 +6381,16 @@ threadRun(void *arg)
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
-			else if (st->state == CSTATE_WAIT_RESULT)
+			else if (st->state == CSTATE_WAIT_RESULT ||
+					 /* CONNECTING && POLLING_OK only occur on connection start */
+					 (st->state == CSTATE_CONNECTING && st->poll_state != PGRES_POLLING_OK))
 			{
 				/*
 				 * waiting for result from server - nothing to do unless the
 				 * socket is readable
 				 */
 				int			sock = PQsocket(st->con);
+				bool		reading = st->state == CSTATE_WAIT_RESULT || st->poll_state == PGRES_POLLING_READING;
 
 				if (sock < 0)
 				{
@@ -6322,7 +6398,7 @@ threadRun(void *arg)
 					goto done;
 				}
 
-				add_socket_to_set(sockets, sock, nsocks++);
+				add_socket_to_set(sockets, sock, nsocks++, reading);
 			}
 			else if (st->state != CSTATE_ABORTED &&
 					 st->state != CSTATE_FINISHED)
@@ -6415,7 +6491,7 @@ threadRun(void *arg)
 					goto done;
 				}
 
-				if (!socket_has_input(sockets, sock, nsocks++))
+				if (!socket_is_ready(sockets, sock, nsocks++))
 					continue;
 			}
 			else if (st->state == CSTATE_FINISHED ||
@@ -6569,9 +6645,9 @@ setalarm(int seconds)
  *		to expire.  timeout is measured in microseconds; 0 means wait forever.
  *		Returns result code of underlying syscall (>=0 if OK, else see errno).
  *
- * socket_has_input: after waiting, call this to see if given socket has
- *		input.  fd and idx parameters should match some previous call to
- *		add_socket_to_set.
+ * socket_is_ready: after waiting, call this to see if given socket has
+ *		input or is ready for writing.  fd and idx parameters should match
+ * 		some previous call to add_socket_to_set.
  *
  * Note that wait_on_socket_set destructively modifies the state of the
  * socket set.  After checking for input, caller must apply clear_socket_set
@@ -6605,11 +6681,11 @@ clear_socket_set(socket_set *sa)
 }
 
 static void
-add_socket_to_set(socket_set *sa, int fd, int idx)
+add_socket_to_set(socket_set *sa, int fd, int idx, bool reading)
 {
 	Assert(idx < sa->maxfds && idx == sa->curfds);
 	sa->pollfds[idx].fd = fd;
-	sa->pollfds[idx].events = POLLIN;
+	sa->pollfds[idx].events = reading ? POLLIN : POLLOUT;
 	sa->pollfds[idx].revents = 0;
 	sa->curfds++;
 }
@@ -6632,7 +6708,7 @@ wait_on_socket_set(socket_set *sa, int64 usecs)
 }
 
 static bool
-socket_has_input(socket_set *sa, int fd, int idx)
+socket_is_ready(socket_set *sa, int fd, int idx)
 {
 	/*
 	 * In some cases, threadRun will apply clear_socket_set and then try to
@@ -6646,7 +6722,7 @@ socket_has_input(socket_set *sa, int fd, int idx)
 		return false;
 
 	Assert(idx < sa->curfds && sa->pollfds[idx].fd == fd);
-	return (sa->pollfds[idx].revents & POLLIN) != 0;
+	return (sa->pollfds[idx].revents & (POLLIN|POLLOUT)) != 0;
 }
 
 #endif							/* POLL_USING_PPOLL */
@@ -6668,12 +6744,13 @@ free_socket_set(socket_set *sa)
 static void
 clear_socket_set(socket_set *sa)
 {
-	FD_ZERO(&sa->fds);
+	FD_ZERO(&sa->reads);
+	FD_ZERO(&sa->writes);
 	sa->maxfd = -1;
 }
 
 static void
-add_socket_to_set(socket_set *sa, int fd, int idx)
+add_socket_to_set(socket_set *sa, int fd, int idx, bool reading)
 {
 	if (fd < 0 || fd >= FD_SETSIZE)
 	{
@@ -6684,7 +6761,7 @@ add_socket_to_set(socket_set *sa, int fd, int idx)
 		pg_log_fatal("too many client connections for select()");
 		exit(1);
 	}
-	FD_SET(fd, &sa->fds);
+	FD_SET(fd, reading ? &sa->reads : &sa->writes);
 	if (fd > sa->maxfd)
 		sa->maxfd = fd;
 }
@@ -6698,18 +6775,18 @@ wait_on_socket_set(socket_set *sa, int64 usecs)
 
 		timeout.tv_sec = usecs / 1000000;
 		timeout.tv_usec = usecs % 1000000;
-		return select(sa->maxfd + 1, &sa->fds, NULL, NULL, &timeout);
+		return select(sa->maxfd + 1, &sa->reads, &sa->writes, NULL, &timeout);
 	}
 	else
 	{
-		return select(sa->maxfd + 1, &sa->fds, NULL, NULL, NULL);
+		return select(sa->maxfd + 1, &sa->reads, &sa->writes, NULL, NULL);
 	}
 }
 
 static bool
-socket_has_input(socket_set *sa, int fd, int idx)
+socket_is_ready(socket_set *sa, int fd, int idx)
 {
-	return (FD_ISSET(fd, &sa->fds) != 0);
+	return FD_ISSET(fd, &sa->reads) != 0 || FD_ISSET(fd, &sa->writes) != 0;
 }
 
 #endif							/* POLL_USING_SELECT */

coelho@cri.ensmp.fr

over 5 years ago

In reply to: Fabien COELHO (#10)

2 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

Hello,

I've merged all time-related stuff (time_t, instr_time, int64) to use a
unique type (pg_time_usec_t) and set of functions/macros, which simplifies
the code somehow.

Hm. I'm not convinced it's a good idea for pgbench to do its own thing
here.

I really think that the refactoring part is a good thing because cloc and
cost is reduced (time arithmetic is an ugly pain with instr_time).

I have split the patch.

* First patch reworks time measurements in pgbench.

It creates a convenient pg_time_usec_t and use it everywhere, getting rid
of "instr_time_t". The code is somehow simplified wrt what time are taken
and what they mean.

Instead of displaying 2 tps at the end, which is basically insane, it
shows one tps for --connect, which includes reconnection times, and one
tps for the usual one connection at startup which simply ignores the
initial connection time.

This (mostly) refactoring reduces the cloc.

* Second patch adds a barrier before starting the bench

It applies on top of the previous one. The initial imbalance due to thread
creation times is smoothed.

I may add a --start-on option afterwards so that several pgbench (running
on distinct hosts) can be synchronized, which would be implemented as a
delay inserted by thread 0 before the barrier.

The windows implementation is more or less blind, if someone can confirm
that it works, it would be nice.

--
Fabien.

Attachments:

pgbench-usec-1.patchtext/x-diff; name=pgbench-usec-1.patchDownload

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index f5a51d3732..26b4c4f61c 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,8 +58,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -2228,7 +2230,6 @@ END;
   <para>
    For the default script, the output will look similar to this:
 <screen>
-starting vacuum...end.
 transaction type: &lt;builtin: TPC-B (sort of)&gt;
 scaling factor: 1
 query mode: simple
@@ -2236,22 +2237,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 08a5947a9e..46be67adaf 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -291,9 +291,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -416,11 +416,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -449,13 +449,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -597,10 +600,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void append_fillfactor(char *opts, int len);
 static void addScript(ParsedScript script);
@@ -1105,9 +1108,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2878,7 +2881,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2888,7 +2890,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2923,29 +2925,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2953,7 +2956,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -2989,12 +2992,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3027,9 +3027,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3052,7 +3052,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3217,8 +3217,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3238,13 +3238,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3270,7 +3269,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3308,7 +3307,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3350,8 +3349,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3454,7 +3453,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3464,14 +3463,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3491,13 +3491,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3525,17 +3524,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3549,7 +3544,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3559,11 +3554,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3826,10 +3821,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3879,7 +3871,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3904,11 +3896,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3918,11 +3907,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4118,10 +4104,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4163,12 +4147,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5090,12 +5069,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5142,16 +5121,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5191,21 +5161,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5236,8 +5203,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5250,7 +5216,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5265,8 +5231,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5287,7 +5270,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5335,10 +5318,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5441,9 +5421,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6126,67 +6107,53 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
+		int			err;
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
+		thread->create_time = pg_time_now();
+		err = pthread_create(&thread->thread, NULL, threadRun, thread);
 
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
-		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
-
-			if (err != 0 || thread->thread == INVALID_THREAD)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
-		}
-		else
+		if (err != 0 || thread->thread == INVALID_THREAD)
 		{
-			thread->thread = INVALID_THREAD;
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
+
+	/* run thread 0 directly */
 	threads[0].thread = INVALID_THREAD;
-#endif							/* ENABLE_THREAD_SAFETY */
+	(void) threadRun(&threads[0]);
 
-	/* wait for threads and accumulate results */
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			pthread_join(thread->thread, NULL);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6199,23 +6166,24 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6228,34 +6196,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6276,32 +6226,51 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+	{
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+	}
+
+	/* READY */
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
+
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
 	}
-
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
-
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6310,27 +6279,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6365,19 +6328,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6426,7 +6382,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6464,11 +6420,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6486,17 +6439,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..fcc0bdda28 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,32 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ *
+ * Type int64 is good enough for about 584500 years.
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */

pgbench-barrier-4.patchtext/x-diff; name=pgbench-barrier-4.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 46be67adaf..79a2a10dee 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -113,18 +113,29 @@ typedef struct socket_set
  */
 
 #ifdef WIN32
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+
 /* Use native win32 threads on Windows */
 typedef struct win32_pthread *pthread_t;
 typedef int pthread_attr_t;
+typedef SYNCHRONIZATION_BARRIER pthread_barrier_t;
 
 static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
 static int	pthread_join(pthread_t th, void **thread_return);
+
+static int	pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);
+static int	pthread_barrier_wait(pthread_barrier_t *barrier);
+static int	pthread_barrier_destroy(pthread_barrier_t *barrier);
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
 #include <pthread.h>
 #else
 /* No threads implementation, use none (-j 1) */
 #define pthread_t void *
+#define pthread_barrier_t void *
+#define pthread_barrier_init(a, b, c) /* ignore */
+#define pthread_barrier_wait(a) /* ignore */
+#define pthread_barrier_destroy(a) /* ignore */
 #endif
 
 
@@ -310,6 +321,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static pthread_barrier_t barrier;
+
 /*
  * Connection state machine states.
  */
@@ -453,8 +467,8 @@ typedef struct
 
 	/* per thread collected stats in microseconds */
 	pg_time_usec_t	create_time;	/* thread creation time */
-	pg_time_usec_t	started_time;	/* thread is running */
-	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
 	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
 	StatsData	stats;
@@ -6114,6 +6128,8 @@ main(int argc, char **argv)
 	if (duration > 0)
 		setalarm(duration);
 
+	pthread_barrier_init(&barrier, NULL, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
 	for (i = 1; i < nthreads; i++)
@@ -6185,6 +6201,8 @@ main(int argc, char **argv)
 	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
 				 bench_start - start_time, latency_late);
 
+	pthread_barrier_destroy(&barrier);
+
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
 
@@ -6233,6 +6251,8 @@ threadRun(void *arg)
 	}
 
 	/* READY */
+	pthread_barrier_wait(&barrier);
+
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
 	last_report = thread_start;
@@ -6257,6 +6277,8 @@ threadRun(void *arg)
 		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	pthread_barrier_wait(&barrier);
 
 	start = pg_time_now();
 	thread->bench_start = start;
@@ -6762,4 +6784,26 @@ pthread_join(pthread_t th, void **thread_return)
 	return 0;
 }
 
+static int
+pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads)
+{
+	/* no spinning: threads are not expected to arrive at the barrier together */
+	bool ok = InitializeSynchronizationBarrier(barrier, nthreads, 0);
+	return 0;
+}
+
+static int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool last = EnterSynchronizationBarrier(barrier, SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY);
+	return last ? PTHREAD_BARRIER_SERIAL_THREAD : 0;
+}
+
+static int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	(void) DeleteSynchronizationBarrier(barrier);
+	return 0;
+}
+
 #endif							/* WIN32 */

Daniel Gustafsson

daniel@yesql.se

over 5 years ago

In reply to: Fabien COELHO (#12)

Re: pgbench: option delaying queries till connections establishment?

On 17 May 2020, at 11:55, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

I have split the patch.

* First patch reworks time measurements in pgbench.

* Second patch adds a barrier before starting the bench

It applies on top of the previous one. The initial imbalance due to thread creation times is smoothed.

The usecs patch fails to apply to HEAD, can you please submit a rebased version
of this patchset. Also, when doing so, can you please rename the patches such
that sort alphabetically in the order in which they are intended to be applied.
The CFBot patch tester will otherwise try to apply them out of order which
cause errors.

cheers ./daniel

coelho@cri.ensmp.fr

over 5 years ago

In reply to: Daniel Gustafsson (#13)

2 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

* First patch reworks time measurements in pgbench.
* Second patch adds a barrier before starting the bench

It applies on top of the previous one. The initial imbalance due to
thread creation times is smoothed.

The usecs patch fails to apply to HEAD, can you please submit a rebased version
of this patchset. Also, when doing so, can you please rename the patches such
that sort alphabetically in the order in which they are intended to be applied.
The CFBot patch tester will otherwise try to apply them out of order which
cause errors.

Ok. Attached.

--
Fabien.

Attachments:

pgbench-A-usec-2.patchtext/x-diff; name=pgbench-A-usec-2.patchDownload

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 8e728dc094..3c5ef9c27e 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,8 +58,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -2226,7 +2228,6 @@ END;
   <para>
    For the default script, the output will look similar to this:
 <screen>
-starting vacuum...end.
 transaction type: &lt;builtin: TPC-B (sort of)&gt;
 scaling factor: 1
 query mode: simple
@@ -2234,22 +2235,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 08a5947a9e..46be67adaf 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -291,9 +291,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -416,11 +416,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -449,13 +449,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -597,10 +600,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void append_fillfactor(char *opts, int len);
 static void addScript(ParsedScript script);
@@ -1105,9 +1108,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2878,7 +2881,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2888,7 +2890,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2923,29 +2925,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2953,7 +2956,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -2989,12 +2992,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3027,9 +3027,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3052,7 +3052,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3217,8 +3217,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3238,13 +3238,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3270,7 +3269,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3308,7 +3307,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3350,8 +3349,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3454,7 +3453,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3464,14 +3463,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3491,13 +3491,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3525,17 +3524,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3549,7 +3544,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3559,11 +3554,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3826,10 +3821,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3879,7 +3871,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3904,11 +3896,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3918,11 +3907,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4118,10 +4104,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4163,12 +4147,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5090,12 +5069,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5142,16 +5121,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5191,21 +5161,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5236,8 +5203,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5250,7 +5216,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5265,8 +5231,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5287,7 +5270,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5335,10 +5318,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5441,9 +5421,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6126,67 +6107,53 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
+		int			err;
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
+		thread->create_time = pg_time_now();
+		err = pthread_create(&thread->thread, NULL, threadRun, thread);
 
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
-		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
-
-			if (err != 0 || thread->thread == INVALID_THREAD)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
-		}
-		else
+		if (err != 0 || thread->thread == INVALID_THREAD)
 		{
-			thread->thread = INVALID_THREAD;
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
+
+	/* run thread 0 directly */
 	threads[0].thread = INVALID_THREAD;
-#endif							/* ENABLE_THREAD_SAFETY */
+	(void) threadRun(&threads[0]);
 
-	/* wait for threads and accumulate results */
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			pthread_join(thread->thread, NULL);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6199,23 +6166,24 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6228,34 +6196,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6276,32 +6226,51 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+	{
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+	}
+
+	/* READY */
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
+
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
 	}
-
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
-
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6310,27 +6279,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6365,19 +6328,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6426,7 +6382,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6464,11 +6420,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6486,17 +6439,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..fcc0bdda28 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,32 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ *
+ * Type int64 is good enough for about 584500 years.
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */

pgbench-B-barrier-5.patchtext/x-diff; name=pgbench-B-barrier-5.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 46be67adaf..79a2a10dee 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -113,18 +113,29 @@ typedef struct socket_set
  */
 
 #ifdef WIN32
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+
 /* Use native win32 threads on Windows */
 typedef struct win32_pthread *pthread_t;
 typedef int pthread_attr_t;
+typedef SYNCHRONIZATION_BARRIER pthread_barrier_t;
 
 static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
 static int	pthread_join(pthread_t th, void **thread_return);
+
+static int	pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);
+static int	pthread_barrier_wait(pthread_barrier_t *barrier);
+static int	pthread_barrier_destroy(pthread_barrier_t *barrier);
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
 #include <pthread.h>
 #else
 /* No threads implementation, use none (-j 1) */
 #define pthread_t void *
+#define pthread_barrier_t void *
+#define pthread_barrier_init(a, b, c) /* ignore */
+#define pthread_barrier_wait(a) /* ignore */
+#define pthread_barrier_destroy(a) /* ignore */
 #endif
 
 
@@ -310,6 +321,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static pthread_barrier_t barrier;
+
 /*
  * Connection state machine states.
  */
@@ -453,8 +467,8 @@ typedef struct
 
 	/* per thread collected stats in microseconds */
 	pg_time_usec_t	create_time;	/* thread creation time */
-	pg_time_usec_t	started_time;	/* thread is running */
-	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
 	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
 	StatsData	stats;
@@ -6114,6 +6128,8 @@ main(int argc, char **argv)
 	if (duration > 0)
 		setalarm(duration);
 
+	pthread_barrier_init(&barrier, NULL, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
 	for (i = 1; i < nthreads; i++)
@@ -6185,6 +6201,8 @@ main(int argc, char **argv)
 	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
 				 bench_start - start_time, latency_late);
 
+	pthread_barrier_destroy(&barrier);
+
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
 
@@ -6233,6 +6251,8 @@ threadRun(void *arg)
 	}
 
 	/* READY */
+	pthread_barrier_wait(&barrier);
+
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
 	last_report = thread_start;
@@ -6257,6 +6277,8 @@ threadRun(void *arg)
 		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	pthread_barrier_wait(&barrier);
 
 	start = pg_time_now();
 	thread->bench_start = start;
@@ -6762,4 +6784,26 @@ pthread_join(pthread_t th, void **thread_return)
 	return 0;
 }
 
+static int
+pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads)
+{
+	/* no spinning: threads are not expected to arrive at the barrier together */
+	bool ok = InitializeSynchronizationBarrier(barrier, nthreads, 0);
+	return 0;
+}
+
+static int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool last = EnterSynchronizationBarrier(barrier, SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY);
+	return last ? PTHREAD_BARRIER_SERIAL_THREAD : 0;
+}
+
+static int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	(void) DeleteSynchronizationBarrier(barrier);
+	return 0;
+}
+
 #endif							/* WIN32 */

kuroda.hayato@fujitsu.com

kuroda.hayato@fujitsu.com

about 5 years ago

In reply to: Fabien COELHO (#14)

RE: pgbench: option delaying queries till connections establishment?

Dear Fabien, Andres

I think your idea is good, hence I put some comments as a reviewer.
I focus on only the linux code because I'm not familiar with the Windows system. Sorry.

[For patch A]

Please complete fixes for the documentation. At least the following sentence should be fixed:
```
The last two lines report the number of transactions per second, figured with and without counting the time to start database sessions.
```

-starting vacuum...end.

I think any other options should be disabled in the example, therefore please leave this line.

+       /* explicitly initialize the state machines */
+       for (int i = 0; i < nstate; i++)
+       {
+               state[i].state = CSTATE_CHOOSE_SCRIPT;
+       }

I'm not sure but I think braces should be removed in our coding conventions.

Other changes are being reviewed now.

[For patch B]

+ /* GO */
+ pthread_barrier_wait(&barrier);

The current implementation is too simple. If nthreads >= 2 and connection fails in the one thread,
the other one will wait forever.
Some special treatments are needed in the `done` code block and here.

[others]

(that is, it can be disabled)

On reflection, I'm not sure I see a use case for not running the barrier
if it is available.

If the start point changes and there is no way to disable this feature,
the backward compatibility will be completely violated.
It means that tps cannot be compared to older versions under the same conditions,
and It may hide performance-related issues.
I think it's not good.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

-----Original Message-----
From: Fabien COELHO <coelho@cri.ensmp.fr>
Sent: Saturday, July 4, 2020 3:34 PM
To: Daniel Gustafsson <daniel@yesql.se>
Cc: Andres Freund <andres@anarazel.de>; pgsql-hackers@postgresql.org
Subject: Re: pgbench: option delaying queries till connections establishment?

* First patch reworks time measurements in pgbench.
* Second patch adds a barrier before starting the bench

It applies on top of the previous one. The initial imbalance due to
thread creation times is smoothed.

The usecs patch fails to apply to HEAD, can you please submit a rebased version
of this patchset. Also, when doing so, can you please rename the patches such
that sort alphabetically in the order in which they are intended to be applied.
The CFBot patch tester will otherwise try to apply them out of order which
cause errors.

Ok. Attached.

--
Fabien.

kuroda.hayato@fujitsu.com

kuroda.hayato@fujitsu.com

about 5 years ago

In reply to: kuroda.hayato@fujitsu.com (#15)

1 attachment(s)

RE: pgbench: option delaying queries till connections establishment?

Dear Fabien;

The current implementation is too simple. If nthreads >= 2 and connection fails in the one thread,
the other one will wait forever.

I attached the very preliminary patch for solving the problem.
Even if threads fail to connect, the others can go through the barrier.
But I think this implementation is not good...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

temp_fix.patchapplication/octet-stream; name=temp_fix.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 94b06d7058..7a058bdae4 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6262,7 +6262,10 @@ threadRun(void *arg)
 		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
+			{
+				pthread_barrier_wait(&barrier);
 				goto done;
+			}
 		}
 
 		/* compute connection delay */

coelho@cri.ensmp.fr

about 5 years ago

In reply to: kuroda.hayato@fujitsu.com (#15)

2 attachment(s)

RE: pgbench: option delaying queries till connections establishment?

Hello,

Please complete fixes for the documentation. At least the following sentence should be fixed:
```
The last two lines report the number of transactions per second, figured with and without counting the time to start database sessions.
```

Indeed. I scanned the file but did not find other places that needed
updating.

-starting vacuum...end.

I think any other options should be disabled in the example, therefore please leave this line.

Yes.

+       for (int i = 0; i < nstate; i++)
+       {
+               state[i].state = CSTATE_CHOOSE_SCRIPT;
+       }
I'm not sure but I think braces should be removed in our coding conventions.

Not sure either. I'm not for having too many braces anyway, so I removed
them.

+ /* GO */
+ pthread_barrier_wait(&barrier);

The current implementation is too simple. If nthreads >= 2 and connection fails in the one thread,
the other one will wait forever.
Some special treatments are needed in the `done` code block and here.

Indeed. I took your next patch with an added explanation. I'm unclear
whether proceeding makes much sense though, that is some thread would run
the test and other would have aborted. Hmmm.

(that is, it can be disabled)

On reflection, I'm not sure I see a use case for not running the barrier
if it is available.

If the start point changes and there is no way to disable this feature,
the backward compatibility will be completely violated.
It means that tps cannot be compared to older versions under the same conditions,
and It may hide performance-related issues.
I think it's not good.

ISTM that there is another patch in the queue which needs barriers to
delay some initialization so as to fix a corner case bug, in which case
the behavior would be mandatory. The current submission could add an
option to skip the barrier synchronization, but I'm not enthousiastic to
add an option and remove it shortly later.

Moreover, the "compatibility" is with nothing which does not make much
sense. When testing with many threads and clients, the current
implementation make threads start when they are ready, which means that
you can have threads issuing transactions while others are not yet
connected or not even started, so that the actually measured performance
is quite awkward for a short bench. ISTM that allowing such a backward
compatible strange behavior does not serve pg users. If the user want the
old unreliable behavior, they can use a old pgbench, and obtain unreliable
figures.

For these two reasons, I'm inclined not to add an option to skip these
barriers, but this can be debatted further.

Attached 2 updated patches.

--
Fabien.

Attachments:

pgbench-A-usec-3.patchtext/x-diff; name=pgbench-A-usec-3.patchDownload

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 7180fedd65..f02721da0d 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,8 +58,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -68,8 +70,7 @@ tps = 85.296346 (excluding connections establishing)
   and number of transactions per client); these will be equal unless the run
   failed before completion.  (In <option>-T</option> mode, only the actual
   number of transactions is printed.)
-  The last two lines report the number of transactions per second,
-  figured with and without counting the time to start database sessions.
+  The last line reports the number of transactions per second.
  </para>
 
   <para>
@@ -2234,22 +2235,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3057665bbe..b8a3e28690 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -292,9 +292,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -417,11 +417,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -450,13 +450,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -598,10 +601,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
 static void *threadRun(void *arg);
@@ -1105,9 +1108,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2876,7 +2879,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2886,7 +2888,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2921,29 +2923,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2951,7 +2954,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -2987,12 +2990,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3025,9 +3025,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3050,7 +3050,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3215,8 +3215,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3236,13 +3236,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3268,7 +3267,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3306,7 +3305,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3348,8 +3347,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3452,7 +3451,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3462,14 +3461,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3489,13 +3489,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3523,17 +3522,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3547,7 +3542,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3557,11 +3552,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3812,10 +3807,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3867,7 +3859,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3892,11 +3884,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3906,11 +3895,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4115,10 +4101,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4160,12 +4144,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5087,12 +5066,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5139,16 +5118,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5188,21 +5158,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5233,8 +5200,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5247,7 +5213,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5262,8 +5228,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5284,7 +5267,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5332,10 +5315,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5438,9 +5418,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6123,67 +6104,53 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
+		int			err;
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
+		thread->create_time = pg_time_now();
+		err = pthread_create(&thread->thread, NULL, threadRun, thread);
 
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
-		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
-
-			if (err != 0 || thread->thread == INVALID_THREAD)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
-		}
-		else
+		if (err != 0 || thread->thread == INVALID_THREAD)
 		{
-			thread->thread = INVALID_THREAD;
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
+
+	/* run thread 0 directly */
 	threads[0].thread = INVALID_THREAD;
-#endif							/* ENABLE_THREAD_SAFETY */
+	(void) threadRun(&threads[0]);
 
-	/* wait for threads and accumulate results */
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			pthread_join(thread->thread, NULL);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6196,23 +6163,24 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6225,34 +6193,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6273,32 +6223,49 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+
+	/* READY */
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
+
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
 	}
-
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
-
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6307,27 +6274,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6362,19 +6323,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6423,7 +6377,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6461,11 +6415,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6483,17 +6434,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..fcc0bdda28 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,32 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ *
+ * Type int64 is good enough for about 584500 years.
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */

pgbench-B-barrier-6.patchtext/x-diff; name=pgbench-B-barrier-6.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b8a3e28690..28403a50fc 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -114,18 +114,29 @@ typedef struct socket_set
  */
 
 #ifdef WIN32
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+
 /* Use native win32 threads on Windows */
 typedef struct win32_pthread *pthread_t;
 typedef int pthread_attr_t;
+typedef SYNCHRONIZATION_BARRIER pthread_barrier_t;
 
 static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
 static int	pthread_join(pthread_t th, void **thread_return);
+
+static int	pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);
+static int	pthread_barrier_wait(pthread_barrier_t *barrier);
+static int	pthread_barrier_destroy(pthread_barrier_t *barrier);
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
 #include <pthread.h>
 #else
 /* No threads implementation, use none (-j 1) */
 #define pthread_t void *
+#define pthread_barrier_t void *
+#define pthread_barrier_init(a, b, c) /* ignore */
+#define pthread_barrier_wait(a) /* ignore */
+#define pthread_barrier_destroy(a) /* ignore */
 #endif
 
 
@@ -311,6 +322,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static pthread_barrier_t barrier;
+
 /*
  * Connection state machine states.
  */
@@ -454,8 +468,8 @@ typedef struct
 
 	/* per thread collected stats in microseconds */
 	pg_time_usec_t	create_time;	/* thread creation time */
-	pg_time_usec_t	started_time;	/* thread is running */
-	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
 	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
 	StatsData	stats;
@@ -6111,6 +6125,8 @@ main(int argc, char **argv)
 	if (duration > 0)
 		setalarm(duration);
 
+	pthread_barrier_init(&barrier, NULL, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
 	for (i = 1; i < nthreads; i++)
@@ -6182,6 +6198,8 @@ main(int argc, char **argv)
 	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
 				 bench_start - start_time, latency_late);
 
+	pthread_barrier_destroy(&barrier);
+
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
 
@@ -6228,6 +6246,8 @@ threadRun(void *arg)
 		state[i].state = CSTATE_CHOOSE_SCRIPT;
 
 	/* READY */
+	pthread_barrier_wait(&barrier);
+
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
 	last_report = thread_start;
@@ -6240,7 +6260,18 @@ threadRun(void *arg)
 		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
+			{
+				/*
+				 * On connection failure, we meet the barrier here in place of
+				 * GO before proceeding to the "done" path which will cleanup,
+				 * so as to avoid locking the process.
+				 *
+				 * It is unclear whether it is worth doing anything rather than
+				 * coldly exiting with an error message.
+				 */
+				pthread_barrier_wait(&barrier);
 				goto done;
+			}
 		}
 
 		/* compute connection delay */
@@ -6252,6 +6283,8 @@ threadRun(void *arg)
 		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	pthread_barrier_wait(&barrier);
 
 	start = pg_time_now();
 	thread->bench_start = start;
@@ -6757,4 +6790,26 @@ pthread_join(pthread_t th, void **thread_return)
 	return 0;
 }
 
+static int
+pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads)
+{
+	/* no spinning: threads are not expected to arrive at the barrier together */
+	bool ok = InitializeSynchronizationBarrier(barrier, nthreads, 0);
+	return 0;
+}
+
+static int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool last = EnterSynchronizationBarrier(barrier, SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY);
+	return last ? PTHREAD_BARRIER_SERIAL_THREAD : 0;
+}
+
+static int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	(void) DeleteSynchronizationBarrier(barrier);
+	return 0;
+}
+
 #endif							/* WIN32 */

kuroda.hayato@fujitsu.com

kuroda.hayato@fujitsu.com

about 5 years ago

In reply to: Fabien COELHO (#17)

RE: pgbench: option delaying queries till connections establishment?

Dear Fabien,

Indeed. I scanned the file but did not find other places that needed
updating.

Yes.

Not sure either. I'm not for having too many braces anyway, so I removed
them.

I checked your fixes and I think it's OK.
Finally, please move the doc fixes to patch B in order to separate patches
completely.

Indeed. I took your next patch with an added explanation. I'm unclear
whether proceeding makes much sense though, that is some thread would run
the test and other would have aborted. Hmmm.

Your comment looks good, thanks.
In the previous version pgbench starts benchmarking even if some connections fail.
And users can notice the connection failure by stderr output.
Hence the current specification may be enough.
If you agree, please remove the following lines:

```
+				 * It is unclear whether it is worth doing anything rather than
+				 * coldly exiting with an error message.
```

ISTM that there is another patch in the queue which needs barriers to
delay some initialization so as to fix a corner case bug, in which case
the behavior would be mandatory. The current submission could add an
option to skip the barrier synchronization, but I'm not enthousiastic to
add an option and remove it shortly later.

Could you tell me which patch you mention? Basically I agree what you say,
but I want to check it.

Hayato Kuroda
FUJITSU LIMITED

coelho@cri.ensmp.fr

about 5 years ago

In reply to: kuroda.hayato@fujitsu.com (#18)

RE: pgbench: option delaying queries till connections establishment?

Hello,

Indeed. I took your next patch with an added explanation. I'm unclear
whether proceeding makes much sense though, that is some thread would run
the test and other would have aborted. Hmmm.

Your comment looks good, thanks. In the previous version pgbench starts
benchmarking even if some connections fail. And users can notice the
connection failure by stderr output. Hence the current specification may
be enough.

Usually I run many pgbench through scripts, so I'm probably not there to
check a lone stderr failure at the beginning if performance figures are
actually reported.

If you agree, please remove the following lines:

```
+				 * It is unclear whether it is worth doing anything rather than
+				 * coldly exiting with an error message.
```

I can remove the line, but I strongly believe that reporting performance
figures if some client connection failed thus the bench could not run as
prescribed is a bad behavior. The good news is that it is probably quite
unlikely. So I'd prefer to keep it and possibly submit a patch to change
the behavior.

ISTM that there is another patch in the queue which needs barriers to
delay some initialization so as to fix a corner case bug, in which case
the behavior would be mandatory. The current submission could add an
option to skip the barrier synchronization, but I'm not enthousiastic to
add an option and remove it shortly later.

Could you tell me which patch you mention? Basically I agree what you say,
but I want to check it.

Should be this one: https://commitfest.postgresql.org/30/2624/,

--
Fabien.

kuroda.hayato@fujitsu.com

kuroda.hayato@fujitsu.com

about 5 years ago

In reply to: Fabien COELHO (#19)

RE: pgbench: option delaying queries till connections establishment?

Dear Fabien,

Usually I run many pgbench through scripts, so I'm probably not there to
check a lone stderr failure at the beginning if performance figures are
actually reported.

I can remove the line, but I strongly believe that reporting performance
figures if some client connection failed thus the bench could not run as
prescribed is a bad behavior. The good news is that it is probably quite
unlikely. So I'd prefer to keep it and possibly submit a patch to change
the behavior.

I agree such a situation is very bad, and I understood you have a plan to
submit patches for fix it. If so leaving lines as a TODO is OK.

Should be this one: https://commitfest.postgresql.org/30/2624/

This discussion is still on-going, but I can see that the starting time
may be delayed for looking up all pgbench-variables.
(I think the status of this thread might be wrong. it should be
'Needs review,' but now 'Waiting on Author.')

This patch is mostly good and can change a review status soon,
however, I think it should wait that related patch.
Please discuss how to fix it with Tom, and this will commit soon.

Hayato Kuroda
FUJITSU LIMITED

coelho@cri.ensmp.fr

about 5 years ago

In reply to: kuroda.hayato@fujitsu.com (#20)

RE: pgbench: option delaying queries till connections establishment?

Hello,

I can remove the line, but I strongly believe that reporting performance
figures if some client connection failed thus the bench could not run as
prescribed is a bad behavior. The good news is that it is probably quite
unlikely. So I'd prefer to keep it and possibly submit a patch to change
the behavior.

I agree such a situation is very bad, and I understood you have a plan to
submit patches for fix it. If so leaving lines as a TODO is OK.

Thanks.

Should be this one: https://commitfest.postgresql.org/30/2624/

This discussion is still on-going, but I can see that the starting time
may be delayed for looking up all pgbench-variables.

Yep, that's it.

(I think the status of this thread might be wrong. it should be
'Needs review,' but now 'Waiting on Author.')

I changed it to "Needs review".

This patch is mostly good and can change a review status soon,
however, I think it should wait that related patch.

Hmmm.

Please discuss how to fix it with Tom,

I would not have the presumption to pressure Tom's agenda in any way!

and this will commit soon.

and this will wait till its time comes. In the mean time, I think that you
should put the patch status as you see fit, independently of the other
patch: it seems unlikely that they would be committed together, and I'll
have to merge the remaining one anyway.

--
Fabien.

kuroda.hayato@fujitsu.com

kuroda.hayato@fujitsu.com

about 5 years ago

In reply to: Fabien COELHO (#21)

RE: pgbench: option delaying queries till connections establishment?

Dear Fabien,

and this will wait till its time comes. In the mean time, I think that you
should put the patch status as you see fit, independently of the other
patch: it seems unlikely that they would be committed together, and I'll
have to merge the remaining one anyway.

OK. I found the related thread[1]https://commitfest.postgresql.org/31/2827/, and I understood you will submit another patch
on the thread.

PostgreSQL Patch Tester says all regression tests are passed, and
I change the status to "Ready for committer."

[1]: https://commitfest.postgresql.org/31/2827/

Thank you for discussing with me.

Hayato Kuroda
FUJITSU LIMITED

-----Original Message-----
From: Fabien COELHO <coelho@cri.ensmp.fr>
Sent: Wednesday, November 11, 2020 9:24 PM
To: Kuroda, Hayato/黒田隼人 <kuroda.hayato@fujitsu.com>
Cc: Andres Freund <andres@anarazel.de>; Daniel Gustafsson <daniel@yesql.se>; pgsql-hackers@postgresql.org
Subject: RE: pgbench: option delaying queries till connections establishment?

Hello,

I can remove the line, but I strongly believe that reporting performance
figures if some client connection failed thus the bench could not run as
prescribed is a bad behavior. The good news is that it is probably quite
unlikely. So I'd prefer to keep it and possibly submit a patch to change
the behavior.

I agree such a situation is very bad, and I understood you have a plan to
submit patches for fix it. If so leaving lines as a TODO is OK.

Thanks.

Should be this one: https://commitfest.postgresql.org/30/2624/

This discussion is still on-going, but I can see that the starting time
may be delayed for looking up all pgbench-variables.

Yep, that's it.

(I think the status of this thread might be wrong. it should be
'Needs review,' but now 'Waiting on Author.')

I changed it to "Needs review".

This patch is mostly good and can change a review status soon,
however, I think it should wait that related patch.

Hmmm.

Please discuss how to fix it with Tom,

I would not have the presumption to pressure Tom's agenda in any way!

and this will commit soon.

and this will wait till its time comes. In the mean time, I think that you
should put the patch status as you see fit, independently of the other
patch: it seems unlikely that they would be committed together, and I'll
have to merge the remaining one anyway.

--
Fabien.

Marina Polyakova

m.polyakova@postgrespro.ru

about 5 years ago

In reply to: kuroda.hayato@fujitsu.com (#22)

Re: pgbench: option delaying queries till connections establishment?

Hello!

On 2020-11-13 08:44, kuroda.hayato@fujitsu.com wrote:

Dear Fabien,

and this will wait till its time comes. In the mean time, I think that
you
should put the patch status as you see fit, independently of the other
patch: it seems unlikely that they would be committed together, and
I'll
have to merge the remaining one anyway.

OK. I found the related thread[1], and I understood you will submit
another patch
on the thread.

PostgreSQL Patch Tester says all regression tests are passed, and
I change the status to "Ready for committer."

[1]: https://commitfest.postgresql.org/31/2827/

Thank you for discussing with me.

Hayato Kuroda
FUJITSU LIMITED

From the mentioned thread [2]/messages/by-id/e5a09b790db21356376b6e73673aa07c@postgrespro.ru:

While trying to test a patch that adds a synchronization barrier in
pgbench [1] on Windows,

Thanks for trying that, I do not have a windows setup for testing, and
the sync code I wrote for Windows is basically blind coding:-(

FYI:

1) It looks like pgbench will no longer support Windows XP due to the
function DeleteSynchronizationBarrier. From
https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-deletesynchronizationbarrier
:

Minimum supported client: Windows 8 [desktop apps only]
Minimum supported server: Windows Server 2012 [desktop apps only]

On Windows Server 2008 R2 (MSVC 2013) the 6-th version of the patch
[1] has compiled without (new) warnings, but when running pgbench I
got the following error:

The procedure entry point DeleteSynchronizationBarrier could not be
located in the dynamic link library KERNEL32.dll.

IMO, it looks like either old Windows systems should not call new
functions, or we should throw them a compilation error. (Update
MIN_WINNT to 0x0602 = Windows 8 in src/include/port/win32.h?) In the
second case it looks like the documentation should be updated too, see
doc/src/sgml/installation.sgml:

<para>
<productname>PostgreSQL</productname> can be expected to work on these
operating
systems: Linux (all recent distributions), Windows (XP and later),
FreeBSD, OpenBSD, NetBSD, macOS, AIX, HP/UX, and Solaris.
Other Unix-like systems may also work but are not currently
being tested. In most cases, all CPU architectures supported by
a given operating system will work. Look in
<xref linkend="installation-platform-notes"/> below to see if
there is information
specific to your operating system, particularly if using an older
system.
</para>

<...>

<para>
The native Windows port requires a 32 or 64-bit version of Windows
2000 or later. Earlier operating systems do
not have sufficient infrastructure (but Cygwin may be used on
those). MinGW, the Unix-like build tools, and MSYS, a collection
of Unix tools required to run shell scripts
like <command>configure</command>, can be downloaded
from <ulink url="http://www.mingw.org/"></ulink>. Neither is
required to run the resulting binaries; they are needed only for
creating the binaries.
</para>

[2]: /messages/by-id/e5a09b790db21356376b6e73673aa07c@postgrespro.ru
/messages/by-id/e5a09b790db21356376b6e73673aa07c@postgrespro.ru

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

coelho@cri.ensmp.fr

about 5 years ago

In reply to: Marina Polyakova (#23)

2 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

Hello Marina,

1) It looks like pgbench will no longer support Windows XP due to the
function DeleteSynchronizationBarrier. From
https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-deletesynchronizationbarrier

Minimum supported client: Windows 8 [desktop apps only]
Minimum supported server: Windows Server 2012 [desktop apps only]

Thanks for the test and precise analysis!

Sigh.

I do not think that putting such version requirements are worth it just
for the sake of pgbench.

In the attached version, I just comment out the call and add an
explanation about why it is commented out. If pg overall version
requirements are changed on windows, then it could be reinstated.

--
Fabien.

Attachments:

pgbench-A-usec-3.patchtext/x-diff; NAME=pgbench-A-usec-3.patchDownload

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 7180fedd65..f02721da0d 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,8 +58,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -68,8 +70,7 @@ tps = 85.296346 (excluding connections establishing)
   and number of transactions per client); these will be equal unless the run
   failed before completion.  (In <option>-T</option> mode, only the actual
   number of transactions is printed.)
-  The last two lines report the number of transactions per second,
-  figured with and without counting the time to start database sessions.
+  The last line reports the number of transactions per second.
  </para>
 
   <para>
@@ -2234,22 +2235,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3057665bbe..b8a3e28690 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -292,9 +292,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -417,11 +417,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -450,13 +450,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -598,10 +601,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
 static void *threadRun(void *arg);
@@ -1105,9 +1108,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2876,7 +2879,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2886,7 +2888,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2921,29 +2923,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2951,7 +2954,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -2987,12 +2990,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3025,9 +3025,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3050,7 +3050,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3215,8 +3215,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3236,13 +3236,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3268,7 +3267,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3306,7 +3305,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3348,8 +3347,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3452,7 +3451,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3462,14 +3461,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3489,13 +3489,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3523,17 +3522,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3547,7 +3542,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3557,11 +3552,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3812,10 +3807,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3867,7 +3859,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3892,11 +3884,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3906,11 +3895,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4115,10 +4101,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4160,12 +4144,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5087,12 +5066,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5139,16 +5118,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5188,21 +5158,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5233,8 +5200,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5247,7 +5213,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5262,8 +5228,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5284,7 +5267,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5332,10 +5315,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5438,9 +5418,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6123,67 +6104,53 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
+		int			err;
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
+		thread->create_time = pg_time_now();
+		err = pthread_create(&thread->thread, NULL, threadRun, thread);
 
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
-		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
-
-			if (err != 0 || thread->thread == INVALID_THREAD)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
-		}
-		else
+		if (err != 0 || thread->thread == INVALID_THREAD)
 		{
-			thread->thread = INVALID_THREAD;
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
+
+	/* run thread 0 directly */
 	threads[0].thread = INVALID_THREAD;
-#endif							/* ENABLE_THREAD_SAFETY */
+	(void) threadRun(&threads[0]);
 
-	/* wait for threads and accumulate results */
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			pthread_join(thread->thread, NULL);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6196,23 +6163,24 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6225,34 +6193,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6273,32 +6223,49 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+
+	/* READY */
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
+
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
 	}
-
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
-
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6307,27 +6274,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6362,19 +6323,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6423,7 +6377,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6461,11 +6415,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6483,17 +6434,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..fcc0bdda28 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,32 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ *
+ * Type int64 is good enough for about 584500 years.
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */

pgbench-B-barrier-7.patchtext/x-diff; NAME=pgbench-B-barrier-7.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b8a3e28690..f2c54e4762 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -114,18 +114,29 @@ typedef struct socket_set
  */
 
 #ifdef WIN32
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+
 /* Use native win32 threads on Windows */
 typedef struct win32_pthread *pthread_t;
 typedef int pthread_attr_t;
+typedef SYNCHRONIZATION_BARRIER pthread_barrier_t;
 
 static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
 static int	pthread_join(pthread_t th, void **thread_return);
+
+static int	pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);
+static int	pthread_barrier_wait(pthread_barrier_t *barrier);
+static int	pthread_barrier_destroy(pthread_barrier_t *barrier);
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
 #include <pthread.h>
 #else
 /* No threads implementation, use none (-j 1) */
 #define pthread_t void *
+#define pthread_barrier_t void *
+#define pthread_barrier_init(a, b, c) /* ignore */
+#define pthread_barrier_wait(a) /* ignore */
+#define pthread_barrier_destroy(a) /* ignore */
 #endif
 
 
@@ -311,6 +322,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static pthread_barrier_t barrier;
+
 /*
  * Connection state machine states.
  */
@@ -454,8 +468,8 @@ typedef struct
 
 	/* per thread collected stats in microseconds */
 	pg_time_usec_t	create_time;	/* thread creation time */
-	pg_time_usec_t	started_time;	/* thread is running */
-	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
 	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
 	StatsData	stats;
@@ -6111,6 +6125,8 @@ main(int argc, char **argv)
 	if (duration > 0)
 		setalarm(duration);
 
+	pthread_barrier_init(&barrier, NULL, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
 	for (i = 1; i < nthreads; i++)
@@ -6182,6 +6198,8 @@ main(int argc, char **argv)
 	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
 				 bench_start - start_time, latency_late);
 
+	pthread_barrier_destroy(&barrier);
+
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
 
@@ -6228,6 +6246,8 @@ threadRun(void *arg)
 		state[i].state = CSTATE_CHOOSE_SCRIPT;
 
 	/* READY */
+	pthread_barrier_wait(&barrier);
+
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
 	last_report = thread_start;
@@ -6240,7 +6260,18 @@ threadRun(void *arg)
 		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
+			{
+				/*
+				 * On connection failure, we meet the barrier here in place of
+				 * GO before proceeding to the "done" path which will cleanup,
+				 * so as to avoid locking the process.
+				 *
+				 * It is unclear whether it is worth doing anything rather than
+				 * coldly exiting with an error message.
+				 */
+				pthread_barrier_wait(&barrier);
 				goto done;
+			}
 		}
 
 		/* compute connection delay */
@@ -6252,6 +6283,8 @@ threadRun(void *arg)
 		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	pthread_barrier_wait(&barrier);
 
 	start = pg_time_now();
 	thread->bench_start = start;
@@ -6757,4 +6790,34 @@ pthread_join(pthread_t th, void **thread_return)
 	return 0;
 }
 
+static int
+pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads)
+{
+	/* no spinning: threads are not expected to arrive at the barrier together */
+	bool ok = InitializeSynchronizationBarrier(barrier, nthreads, 0);
+	return 0;
+}
+
+static int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool last = EnterSynchronizationBarrier(barrier, SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY);
+	return last ? PTHREAD_BARRIER_SERIAL_THREAD : 0;
+}
+
+static int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	/*
+	 * The following is coldly ignored because it requires Windows 8
+	 * or Windows Server 2012, which is a little too much.
+	 *
+	 * Also, there is a SYNCHRONIZATION_BARRIER_FLAGS_NO_DELETE flag
+	 * but it probably requires the same versions.
+	 *
+	 *  (void) DeleteSynchronizationBarrier(barrier);
+	 */
+	return 0;
+}
+
 #endif							/* WIN32 */

Alexander Korotkov

aekorotkov@gmail.com

about 5 years ago

In reply to: Andres Freund (#1)

Re: pgbench: option delaying queries till connections establishment?

Hi!

On Thu, Feb 27, 2020 at 9:01 PM Andres Freund <andres@anarazel.de> wrote:

I am trying to run a few benchmarks measuring the effects of patch to
make GetSnapshotData() faster in the face of larger numbers of
established connections.

Before the patch connection establishment often is very slow due to
contention. The first few connections are fast, but after that it takes
increasingly long. The first few connections constantly hold
ProcArrayLock in shared mode, which then makes it hard for new
connections to acquire it exclusively (I'm addressing that to a
significant degree in the patch FWIW).

Hmm... Let's see the big picture. You've recently committed a
patchset, which greatly improved the performance of GetSnapshotData().
And you're making further improvements in this direction. But you're
getting trouble in measuring the effect, because Postgres is still
stuck on ProcArrayLock. And in this thread you propose a workaround
for that implemented on the pgbench side. My very dumb idea is
following: should we finally give a chance to more fair lwlocks rather
than inventing workarounds?

As I remember, your major argument against more fair lwlocks was the
idea that we should fix lwlocks use-cases rather than lwlock mechanism
themselves. But can we expect that we fix all the lwlocks use-case in
any reasonable prospect? My guess is 'no'.

Links
1. /messages/by-id/CAPpHfdvJhO1qutziOp=dy8TO8Xb4L38BxgKG4RPa1up1Lzh_UQ@mail.gmail.com

------
Regards,
Alexander Korotkov

Marina Polyakova

m.polyakova@postgrespro.ru

about 5 years ago

In reply to: Alexander Korotkov (#25)

Re: pgbench: option delaying queries till connections establishment?

Hello!

On 2020-11-14 20:07, Alexander Korotkov wrote:

Hmm... Let's see the big picture. You've recently committed a
patchset, which greatly improved the performance of GetSnapshotData().
And you're making further improvements in this direction. But you're
getting trouble in measuring the effect, because Postgres is still
stuck on ProcArrayLock. And in this thread you propose a workaround
for that implemented on the pgbench side. My very dumb idea is
following: should we finally give a chance to more fair lwlocks rather
than inventing workarounds?

As I remember, your major argument against more fair lwlocks was the
idea that we should fix lwlocks use-cases rather than lwlock mechanism
themselves. But can we expect that we fix all the lwlocks use-case in
any reasonable prospect? My guess is 'no'.

Links
1.
/messages/by-id/CAPpHfdvJhO1qutziOp=dy8TO8Xb4L38BxgKG4RPa1up1Lzh_UQ@mail.gmail.com

Sorry I'm not familiar with the internal architecture of snapshots,
locks etc. in postgres, but I wanted to ask - what exact kind of patch
for fair lwlocks do you want to offer to the community? I applied the
6-th version of the patch for fair lwlocks from [1]/messages/by-id/CAPpHfduV3v3EG7K74-9htBZz_mpE993zGz-=2k5RNA3tqabUAA@mail.gmail.com to the old master
branch (see commit [2]https://github.com/postgres/postgres/commit/84d514887f9ca673ae688d00f8b544e70f1ab270), started many threads in pgbench (-M prepared -c
1000 -j 500 -T 10 -P1 -S) and I did not receive stable first progress
reports, which IIUC are one of the advantages of the discussed patch for
the pgbench (see [3]/messages/by-id/20200227185129.hikscyenomnlrord@alap3.anarazel.de)...

[1]: /messages/by-id/CAPpHfduV3v3EG7K74-9htBZz_mpE993zGz-=2k5RNA3tqabUAA@mail.gmail.com
/messages/by-id/CAPpHfduV3v3EG7K74-9htBZz_mpE993zGz-=2k5RNA3tqabUAA@mail.gmail.com
[2]: https://github.com/postgres/postgres/commit/84d514887f9ca673ae688d00f8b544e70f1ab270
https://github.com/postgres/postgres/commit/84d514887f9ca673ae688d00f8b544e70f1ab270
[3]: /messages/by-id/20200227185129.hikscyenomnlrord@alap3.anarazel.de
/messages/by-id/20200227185129.hikscyenomnlrord@alap3.anarazel.de

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

andres@anarazel.de

about 5 years ago

In reply to: Alexander Korotkov (#25)

Re: pgbench: option delaying queries till connections establishment?

Hi,

On 2020-11-14 20:07:38 +0300, Alexander Korotkov wrote:

Hmm... Let's see the big picture. You've recently committed a
patchset, which greatly improved the performance of GetSnapshotData().
And you're making further improvements in this direction. But you're
getting trouble in measuring the effect, because Postgres is still
stuck on ProcArrayLock.

No, the problem was that I couldn't measure the before/after behaviour
reliably, because not all connections actually ever get established
*before* the GetSnapshotData() scability patchset. Which made the
numbers pointless, because we'd often end up with e.g. 80 connections
doing work pre-patch, and 800 post-patch; which obviously measures very
different things.

I think the issue really is that, independent of PG lock contention,
it'll take a while to establish all connections, and that starting to
benchmark with only some connections established will create pretty
pointless numbers.

And in this thread you propose a workaround
for that implemented on the pgbench side. My very dumb idea is
following: should we finally give a chance to more fair lwlocks rather
than inventing workarounds?

Perhaps - I just don't think it's related to this thread. And how you're
going to address the overhead.

Greetings,

Andres Freund

andres@anarazel.de

about 5 years ago

In reply to: Marina Polyakova (#26)

Re: pgbench: option delaying queries till connections establishment?

Hi,

On 2020-11-17 00:09:34 +0300, Marina Polyakova wrote:

Sorry I'm not familiar with the internal architecture of snapshots, locks
etc. in postgres, but I wanted to ask - what exact kind of patch for fair
lwlocks do you want to offer to the community? I applied the 6-th version of
the patch for fair lwlocks from [1] to the old master branch (see commit
[2]), started many threads in pgbench (-M prepared -c 1000 -j 500 -T 10 -P1
-S) and I did not receive stable first progress reports, which IIUC are one
of the advantages of the discussed patch for the pgbench (see [3])...

Thanks for running some benchmarks.

I think it's quite unsurprising that you'd see skewed numbers initially,
even with fairer locks. Just by virtue of pgbench starting threads and
each thread immediately starting to perform work, you are bound to get
odd pretty meaningless initial numbers. Even without contention, and
when using fewer connections than there are CPUs. And especially when
starting a larger number of connections, because the main pgbench thread
will get fewer and fewer scheduler slices because of the threads running
benchmarks already.

Regards,

Andres

coelho@cri.ensmp.fr

about 5 years ago

In reply to: Andres Freund (#27)

Re: pgbench: option delaying queries till connections establishment?

I think the issue really is that, independent of PG lock contention,
it'll take a while to establish all connections, and that starting to
benchmark with only some connections established will create pretty
pointless numbers.

Yes. This is why I think that if we have some way to synchronize it should
always be used, i.e. not an option.

--
Fabien.

thomas.munro@gmail.com

about 5 years ago

In reply to: Fabien COELHO (#24)

Re: pgbench: option delaying queries till connections establishment?

On Sun, Nov 15, 2020 at 4:53 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:

In the attached version, I just comment out the call and add an
explanation about why it is commented out. If pg overall version
requirements are changed on windows, then it could be reinstated.

It looks like macOS doesn't have pthread barriers (via cfbot 2021, now
with more operating systems):

pgbench.c:326:8: error: unknown type name 'pthread_barrier_t'
static pthread_barrier_t barrier;
^
pgbench.c:6128:2: error: implicit declaration of function
'pthread_barrier_init' is invalid in C99
[-Werror,-Wimplicit-function-declaration]
pthread_barrier_init(&barrier, NULL, nthreads);
^

coelho@cri.ensmp.fr

about 5 years ago

In reply to: Thomas Munro (#30)

Re: pgbench: option delaying queries till connections establishment?

It looks like macOS doesn't have pthread barriers (via cfbot 2021, now
with more operating systems):

Indeed:-(

I'll look into that.

--
Fabien.

thomas.munro@gmail.com

about 5 years ago

In reply to: Fabien COELHO (#31)

3 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

On Sat, Jan 2, 2021 at 9:15 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:

It looks like macOS doesn't have pthread barriers (via cfbot 2021, now
with more operating systems):

Indeed:-(

I'll look into that.

Just for fun, the attached 0002 patch is a quick prototype of a
replacement for that stuff that seems to work OK on a Mac here. (I'm
not sure if the Windows part makes sense or works.)

Attachments:

0001-A.patchapplication/octet-stream; name=0001-A.patchDownload

From 05a850023b075b1c09478c695d6286a524ec0f7e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 2 Jan 2021 14:13:11 +1300
Subject: [PATCH 1/3] A

---
 doc/src/sgml/ref/pgbench.sgml        |  39 +--
 src/bin/pgbench/pgbench.c            | 415 ++++++++++++---------------
 src/include/portability/instr_time.h |  28 ++
 3 files changed, 231 insertions(+), 251 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index b03d0cc50f..8503cac585 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,8 +58,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -68,8 +70,7 @@ tps = 85.296346 (excluding connections establishing)
   and number of transactions per client); these will be equal unless the run
   failed before completion.  (In <option>-T</option> mode, only the actual
   number of transactions is printed.)
-  The last two lines report the number of transactions per second,
-  figured with and without counting the time to start database sessions.
+  The last line reports the number of transactions per second.
  </para>
 
   <para>
@@ -2244,22 +2245,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3057665bbe..b8a3e28690 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -292,9 +292,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -417,11 +417,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -450,13 +450,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -598,10 +601,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
 static void *threadRun(void *arg);
@@ -1105,9 +1108,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2876,7 +2879,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2886,7 +2888,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2921,29 +2923,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2951,7 +2954,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -2987,12 +2990,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3025,9 +3025,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3050,7 +3050,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3215,8 +3215,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3236,13 +3236,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3268,7 +3267,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3306,7 +3305,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3348,8 +3347,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3452,7 +3451,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3462,14 +3461,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3489,13 +3489,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3523,17 +3522,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3547,7 +3542,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3557,11 +3552,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3812,10 +3807,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3867,7 +3859,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3892,11 +3884,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3906,11 +3895,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4115,10 +4101,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4160,12 +4144,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5087,12 +5066,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5139,16 +5118,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5188,21 +5158,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5233,8 +5200,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5247,7 +5213,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5262,8 +5228,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5284,7 +5267,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5332,10 +5315,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5438,9 +5418,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6123,67 +6104,53 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
+		int			err;
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
-
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
-		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
+		thread->create_time = pg_time_now();
+		err = pthread_create(&thread->thread, NULL, threadRun, thread);
 
-			if (err != 0 || thread->thread == INVALID_THREAD)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
-		}
-		else
+		if (err != 0 || thread->thread == INVALID_THREAD)
 		{
-			thread->thread = INVALID_THREAD;
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
+
+	/* run thread 0 directly */
 	threads[0].thread = INVALID_THREAD;
-#endif							/* ENABLE_THREAD_SAFETY */
+	(void) threadRun(&threads[0]);
 
-	/* wait for threads and accumulate results */
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			pthread_join(thread->thread, NULL);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6196,23 +6163,24 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6225,34 +6193,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6273,32 +6223,49 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+
+	/* READY */
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
-	}
-
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
 
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
+	}
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6307,27 +6274,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6362,19 +6323,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6423,7 +6377,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6461,11 +6415,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6483,17 +6434,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327cc..fcc0bdda28 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,32 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ *
+ * Type int64 is good enough for about 584500 years.
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */
-- 
2.24.3 (Apple Git-128)

0002-Add-pthread_barrier_t-emulation.patchapplication/octet-stream; name=0002-Add-pthread_barrier_t-emulation.patchDownload

From 612e7c2a873a11103801c24ce2c8361afa1fa195 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 2 Jan 2021 15:05:06 +1300
Subject: [PATCH 2/3] Add pthread_barrier_t emulation.

Map pthread_barrier_t to native barriers on Windows.  Supply an
implementation based on other pthread primitives for macOS.

Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 configure                       | 69 +++++++++++++++++++++++++++++++++
 configure.ac                    |  2 +
 src/include/pg_config.h.in      |  3 ++
 src/include/port/pg_pthread.h   | 40 +++++++++++++++++++
 src/port/pthread_barrier_wait.c | 64 ++++++++++++++++++++++++++++++
 5 files changed, 178 insertions(+)
 create mode 100644 src/include/port/pg_pthread.h
 create mode 100644 src/port/pthread_barrier_wait.c

diff --git a/configure b/configure
index 07529825d1..5a749b66da 100755
--- a/configure
+++ b/configure
@@ -11673,6 +11673,62 @@ if test "$ac_res" != no; then :
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing pthread_barrier_wait" >&5
+$as_echo_n "checking for library containing pthread_barrier_wait... " >&6; }
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_func_search_save_LIBS=$LIBS
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pthread_barrier_wait ();
+int
+main ()
+{
+return pthread_barrier_wait ();
+  ;
+  return 0;
+}
+_ACEOF
+for ac_lib in '' pthread; do
+  if test -z "$ac_lib"; then
+    ac_res="none required"
+  else
+    ac_res=-l$ac_lib
+    LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
+  fi
+  if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_search_pthread_barrier_wait=$ac_res
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext
+  if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  break
+fi
+done
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+
+else
+  ac_cv_search_pthread_barrier_wait=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_pthread_barrier_wait" >&5
+$as_echo "$ac_cv_search_pthread_barrier_wait" >&6; }
+ac_res=$ac_cv_search_pthread_barrier_wait
+if test "$ac_res" != no; then :
+  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
 # Solaris:
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fdatasync" >&5
 $as_echo_n "checking for library containing fdatasync... " >&6; }
@@ -15845,6 +15901,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "pthread_barrier_wait" "ac_cv_func_pthread_barrier_wait"
+if test "x$ac_cv_func_pthread_barrier_wait" = xyes; then :
+  $as_echo "#define HAVE_PTHREAD_BARRIER_WAIT 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" pthread_barrier_wait.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS pthread_barrier_wait.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
 if test "x$ac_cv_func_pwrite" = xyes; then :
   $as_echo "#define HAVE_PWRITE 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index 7f855783f4..1d3cc1751c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1156,6 +1156,7 @@ AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
 AC_SEARCH_LIBS(shm_open, rt)
 AC_SEARCH_LIBS(shm_unlink, rt)
 AC_SEARCH_LIBS(clock_gettime, [rt posix4])
+AC_SEARCH_LIBS(pthread_barrier_wait, pthread)
 # Solaris:
 AC_SEARCH_LIBS(fdatasync, [rt posix4])
 # Required for thread_test.c on Solaris
@@ -1734,6 +1735,7 @@ AC_REPLACE_FUNCS(m4_normalize([
 	link
 	mkdtemp
 	pread
+	pthread_barrier_wait
 	pwrite
 	random
 	srandom
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index ddaa9e8e18..b9f828385c 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -421,6 +421,9 @@
 /* Define if you have POSIX threads libraries and header files. */
 #undef HAVE_PTHREAD
 
+/* Define to 1 if you have the `pthread_barrier_wait' function. */
+#undef HAVE_PTHREAD_BARRIER_WAIT
+
 /* Define to 1 if you have the `pthread_is_threaded_np' function. */
 #undef HAVE_PTHREAD_IS_THREADED_NP
 
diff --git a/src/include/port/pg_pthread.h b/src/include/port/pg_pthread.h
new file mode 100644
index 0000000000..21987429eb
--- /dev/null
+++ b/src/include/port/pg_pthread.h
@@ -0,0 +1,40 @@
+#ifndef PG_PTHREAD_H
+#define PG_PTHREAD_H
+
+#ifndef WIN32
+#include <pthread.h>
+#endif
+
+#ifndef HAVE_PTHREAD_BARRIER_WAIT
+
+#ifndef PTHREAD_BARRIER_SERIAL_THREAD
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+#endif
+
+#ifdef WIN32
+/* Forward to the appropriate Win32 primitives. */
+typedef SYNCHRONIZATION_BARRIER pthread_barrier_t;
+#else
+/*
+ * For systems that have no pthread barriers, but have the rest of the pthread
+ * stuff (that's at least macOS), we'll do our own simple barrier
+ * implementation.
+ */
+typedef struct pg_pthread_barrier
+{
+	bool		sense;			/* we only need a one bit phase */
+	int			party;			/* number of threads expected */
+	int			arrived;		/* number of threads that have arrived */
+	pthread_mutex_t mutex;
+	pthread_cond_t cv;
+} pthread_barrier_t;
+#endif
+
+extern int pthread_barrier_init(pthread_barrier_t *barrier, void *unused,
+								int party);
+extern int pthread_barrier_wait(pthread_barrier_t *barrier);
+extern int pthread_barrier_destroy(pthread_barrier_t *barrier);
+
+#endif
+
+#endif
diff --git a/src/port/pthread_barrier_wait.c b/src/port/pthread_barrier_wait.c
new file mode 100644
index 0000000000..6c8a0f29a9
--- /dev/null
+++ b/src/port/pthread_barrier_wait.c
@@ -0,0 +1,64 @@
+#include "postgres_fe.h"
+
+#include "port/pg_pthread.h"
+
+int
+pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int party)
+{
+#ifdef WIN32
+	InitializeSynchronizationBarrier(barrier, party, 0);
+#else
+	barrier->sense = false;
+	barrier->party = party;
+	barrier->arrived = 0;
+	pthread_cond_init(&barrier->cv, NULL);
+	pthread_mutex_init(&barrier->mutex, NULL);
+	return 0;
+#endif
+}
+
+int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+#ifdef WIN32
+	if (EnterSynchronizationBarrier(barrier,
+									SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY))
+		return PTHREAD_BARRIER_SERIAL_THREAD;
+#else
+	bool		initial_sense;
+
+	pthread_mutex_lock(&barrier->mutex);
+
+	/* We have arrived at the barrier. */
+	barrier->arrived++;
+	Assert(barrier->arrived <= barrier->party);
+
+	/* If we were the last to arrive, release the others and return. */
+	if (barrier->arrived == barrier->party)
+	{
+		barrier->arrived = 0;
+		barrier->sense = !barrier->sense;
+		pthread_mutex_unlock(&barrier->mutex);
+		pthread_cond_broadcast(&barrier->cv);
+
+		return PTHREAD_BARRIER_SERIAL_THREAD;
+	}
+
+	/* Wait for someone else to flip the sense. */
+	initial_sense = barrier->sense;
+	do
+	{
+		pthread_cond_wait(&barrier->cv, &barrier->mutex);
+	} while (barrier->sense == initial_sense);
+	pthread_mutex_unlock(&barrier->mutex);
+#endif
+
+	return 0;
+}
+
+int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	/* XXX DeleteSynchronizationBarrier() in Windows 8, Windows Server 2012? */
+	return 0;
+}
-- 
2.24.3 (Apple Git-128)

0003-B.patchapplication/octet-stream; name=0003-B.patchDownload

From 79efa2073b4345ae95b9336412ee328ab2695df1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 2 Jan 2021 14:13:28 +1300
Subject: [PATCH 3/3] B

---
 src/bin/pgbench/pgbench.c | 35 ++++++++++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b8a3e28690..d0970d15ba 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -120,15 +120,18 @@ typedef int pthread_attr_t;
 
 static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
 static int	pthread_join(pthread_t th, void **thread_return);
+#include "port/pg_pthread.h"
 #elif defined(ENABLE_THREAD_SAFETY)
-/* Use platform-dependent pthread capability */
-#include <pthread.h>
+#include "port/pg_pthread.h"
 #else
 /* No threads implementation, use none (-j 1) */
 #define pthread_t void *
+#define pthread_barrier_t void *
+#define pthread_barrier_init(a, b, c) /* ignore */
+#define pthread_barrier_wait(a) /* ignore */
+#define pthread_barrier_destroy(a) /* ignore */
 #endif
 
-
 /********************************************************************
  * some configurable parameters */
 
@@ -311,6 +314,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static pthread_barrier_t barrier;
+
 /*
  * Connection state machine states.
  */
@@ -454,8 +460,8 @@ typedef struct
 
 	/* per thread collected stats in microseconds */
 	pg_time_usec_t	create_time;	/* thread creation time */
-	pg_time_usec_t	started_time;	/* thread is running */
-	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
 	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
 	StatsData	stats;
@@ -6111,6 +6117,8 @@ main(int argc, char **argv)
 	if (duration > 0)
 		setalarm(duration);
 
+	pthread_barrier_init(&barrier, NULL, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
 	for (i = 1; i < nthreads; i++)
@@ -6182,6 +6190,8 @@ main(int argc, char **argv)
 	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
 				 bench_start - start_time, latency_late);
 
+	pthread_barrier_destroy(&barrier);
+
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
 
@@ -6228,6 +6238,8 @@ threadRun(void *arg)
 		state[i].state = CSTATE_CHOOSE_SCRIPT;
 
 	/* READY */
+	pthread_barrier_wait(&barrier);
+
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
 	last_report = thread_start;
@@ -6240,7 +6252,18 @@ threadRun(void *arg)
 		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
+			{
+				/*
+				 * On connection failure, we meet the barrier here in place of
+				 * GO before proceeding to the "done" path which will cleanup,
+				 * so as to avoid locking the process.
+				 *
+				 * It is unclear whether it is worth doing anything rather than
+				 * coldly exiting with an error message.
+				 */
+				pthread_barrier_wait(&barrier);
 				goto done;
+			}
 		}
 
 		/* compute connection delay */
@@ -6252,6 +6275,8 @@ threadRun(void *arg)
 		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	pthread_barrier_wait(&barrier);
 
 	start = pg_time_now();
 	thread->bench_start = start;
-- 
2.24.3 (Apple Git-128)

coelho@cri.ensmp.fr

about 5 years ago

In reply to: Thomas Munro (#32)

Re: pgbench: option delaying queries till connections establishment?

It looks like macOS doesn't have pthread barriers (via cfbot 2021, now
with more operating systems):

Indeed:-(

I'll look into that.

Just for fun, the attached 0002 patch is a quick prototype of a
replacement for that stuff that seems to work OK on a Mac here. (I'm
not sure if the Windows part makes sense or works.)

Thanks! That will definitely help because I do not have a Mac. I'll do
some cleanup.

--
Fabien.

thomas.munro@gmail.com

about 5 years ago

In reply to: Fabien COELHO (#33)

Re: pgbench: option delaying queries till connections establishment?

On Sun, Jan 3, 2021 at 9:49 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Just for fun, the attached 0002 patch is a quick prototype of a
replacement for that stuff that seems to work OK on a Mac here. (I'm
not sure if the Windows part makes sense or works.)

Thanks! That will definitely help because I do not have a Mac. I'll do
some cleanup.

I think the main things to clean up are:

1. pthread_barrier_init() should check for errors from
pthread_cond_init() and pthread_mutex_init(), and return -1.
2. pthread_barrier_destroy() should call pthread_cond_destroy() and
pthread_mutex_destroy().
3 . Decide if it's sane for the Windows-based emulation to be in here
too, or if it should stay in pgbench.c. Or alternatively, if we're
emulating pthread stuff on Windows, why not also put the other pthread
emulation stuff from pgbench.c into a "ports" file; that seems
premature and overkill for your project. I dunno.
4. cfbot shows that it's not building on Windows because
HAVE_PTHREAD_BARRIER_WAIT is missing from Solution.pm.

As far as I know, it's OK and common practice to ignore the return
code from eg pthread_mutex_lock() and pthread_cond_wait(), with
rationale along the lines that there'd have to be a programming error
for them to fail in simple cases.

Unfortunately, cfbot can only tell us that it's building OK on a Mac,
but doesn't actually run the pgbench code to reach this stuff. It's
not running check-world on that platform yet for the following asinine
reason:

connection to database failed: Unix-domain socket path
"/private/var/folders/3y/l0z1x3693dl_8n0qybp4dqwh0000gn/T/cirrus-ci-build/src/bin/pg_upgrade/.s.PGSQL.58080"
is too long (maximum 103 bytes)

thomas.munro@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#34)

3 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

On Sat, Jan 9, 2021 at 8:13 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Jan 3, 2021 at 9:49 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Just for fun, the attached 0002 patch is a quick prototype of a
replacement for that stuff that seems to work OK on a Mac here. (I'm
not sure if the Windows part makes sense or works.)

Thanks! That will definitely help because I do not have a Mac. I'll do
some cleanup.

I think the main things to clean up are:

I’m supposed to be on vacation this week, but someone left a shiny new
Arm Mac laptop near me, so here’s a cleaned up version.

1. pthread_barrier_init() should check for errors from
pthread_cond_init() and pthread_mutex_init(), and return -1.

Done.

2. pthread_barrier_destroy() should call pthread_cond_destroy() and
pthread_mutex_destroy().

Done.

3 . Decide if it's sane for the Windows-based emulation to be in here
too, or if it should stay in pgbench.c. Or alternatively, if we're
emulating pthread stuff on Windows, why not also put the other pthread
emulation stuff from pgbench.c into a "ports" file; that seems
premature and overkill for your project. I dunno.

I decided to solve only the macOS problem for now. So in this
version, the A and B patches are exactly as you had them in your v7,
except that B includes “port/pg_pthread.h” instead of <pthread.h>.

Maybe it’d make sense to move the Win32 pthread emulation stuff out of
pgbench.c into src/port too (the pre-existing stuff, and the new
barrier stuff you added), but that seems like a separate patch, one
that I’m not best placed to write, and it’s not clear to me that we’ll
want to be using pthread APIs as our main abstraction if/when thread
usage increases in the PG source tree anyway. Other opinions welcome.

4. cfbot shows that it's not building on Windows because
HAVE_PTHREAD_BARRIER_WAIT is missing from Solution.pm.

Fixed, I think.

Attachments:

v8-0001-A.patchapplication/octet-stream; name=v8-0001-A.patchDownload

From 382e4755a292e355cbac5746c43fdb5c02749402 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 2 Jan 2021 14:13:11 +1300
Subject: [PATCH v8 1/3] A

---
 doc/src/sgml/ref/pgbench.sgml        |  39 +--
 src/bin/pgbench/pgbench.c            | 415 ++++++++++++---------------
 src/include/portability/instr_time.h |  28 ++
 3 files changed, 231 insertions(+), 251 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index faa7c26b0a..171f824fcd 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,8 +58,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -68,8 +70,7 @@ tps = 85.296346 (excluding connections establishing)
   and number of transactions per client); these will be equal unless the run
   failed before completion.  (In <option>-T</option> mode, only the actual
   number of transactions is printed.)
-  The last two lines report the number of transactions per second,
-  figured with and without counting the time to start database sessions.
+  The last line reports the number of transactions per second.
  </para>
 
   <para>
@@ -2244,22 +2245,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f7da3e1f62..e639f4980d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -292,9 +292,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -417,11 +417,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -450,13 +450,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -598,10 +601,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
 static void *threadRun(void *arg);
@@ -1105,9 +1108,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2890,7 +2893,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2900,7 +2902,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2935,29 +2937,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2965,7 +2968,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -3001,12 +3004,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3039,9 +3039,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3064,7 +3064,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3229,8 +3229,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3250,13 +3250,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3282,7 +3281,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3320,7 +3319,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3362,8 +3361,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3466,7 +3465,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3476,14 +3475,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3503,13 +3503,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3537,17 +3536,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3561,7 +3556,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3571,11 +3566,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3826,10 +3821,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3881,7 +3873,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3906,11 +3898,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3920,11 +3909,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4129,10 +4115,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4174,12 +4158,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5101,12 +5080,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5153,16 +5132,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5202,21 +5172,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5247,8 +5214,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5261,7 +5227,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5276,8 +5242,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5298,7 +5281,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5346,10 +5329,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5452,9 +5432,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6137,67 +6118,53 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
+		int			err;
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
-
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
-		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
+		thread->create_time = pg_time_now();
+		err = pthread_create(&thread->thread, NULL, threadRun, thread);
 
-			if (err != 0 || thread->thread == INVALID_THREAD)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
-		}
-		else
+		if (err != 0 || thread->thread == INVALID_THREAD)
 		{
-			thread->thread = INVALID_THREAD;
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
+
+	/* run thread 0 directly */
 	threads[0].thread = INVALID_THREAD;
-#endif							/* ENABLE_THREAD_SAFETY */
+	(void) threadRun(&threads[0]);
 
-	/* wait for threads and accumulate results */
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			pthread_join(thread->thread, NULL);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6210,23 +6177,24 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6239,34 +6207,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6287,32 +6237,49 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+
+	/* READY */
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
-	}
-
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
 
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
+	}
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6321,27 +6288,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6376,19 +6337,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6437,7 +6391,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6475,11 +6429,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6497,17 +6448,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 39a4f0600e..faf806a441 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,32 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ *
+ * Type int64 is good enough for about 584500 years.
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */
-- 
2.24.3 (Apple Git-128)

v8-0002-Add-missing-pthread_barrier_t.patchapplication/octet-stream; name=v8-0002-Add-missing-pthread_barrier_t.patchDownload

From 5d7ad6481704b5005e795b4c63fcdbc714eba670 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 2 Jan 2021 15:05:06 +1300
Subject: [PATCH v8 2/3] Add missing pthread_barrier_t.

Supply a simple implementation of the missing pthread_barrier_t type and
functions, for macOS.

Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 configure                       | 69 +++++++++++++++++++++++++++++++++
 configure.ac                    |  2 +
 src/include/pg_config.h.in      |  3 ++
 src/include/port/pg_pthread.h   | 41 ++++++++++++++++++++
 src/port/pthread_barrier_wait.c | 66 +++++++++++++++++++++++++++++++
 src/tools/msvc/Solution.pm      |  1 +
 6 files changed, 182 insertions(+)
 create mode 100644 src/include/port/pg_pthread.h
 create mode 100644 src/port/pthread_barrier_wait.c

diff --git a/configure b/configure
index 8af4b99021..48af1add80 100755
--- a/configure
+++ b/configure
@@ -11673,6 +11673,62 @@ if test "$ac_res" != no; then :
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing pthread_barrier_wait" >&5
+$as_echo_n "checking for library containing pthread_barrier_wait... " >&6; }
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_func_search_save_LIBS=$LIBS
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pthread_barrier_wait ();
+int
+main ()
+{
+return pthread_barrier_wait ();
+  ;
+  return 0;
+}
+_ACEOF
+for ac_lib in '' pthread; do
+  if test -z "$ac_lib"; then
+    ac_res="none required"
+  else
+    ac_res=-l$ac_lib
+    LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
+  fi
+  if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_search_pthread_barrier_wait=$ac_res
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext
+  if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  break
+fi
+done
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+
+else
+  ac_cv_search_pthread_barrier_wait=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_pthread_barrier_wait" >&5
+$as_echo "$ac_cv_search_pthread_barrier_wait" >&6; }
+ac_res=$ac_cv_search_pthread_barrier_wait
+if test "$ac_res" != no; then :
+  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
 # Solaris:
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fdatasync" >&5
 $as_echo_n "checking for library containing fdatasync... " >&6; }
@@ -15858,6 +15914,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "pthread_barrier_wait" "ac_cv_func_pthread_barrier_wait"
+if test "x$ac_cv_func_pthread_barrier_wait" = xyes; then :
+  $as_echo "#define HAVE_PTHREAD_BARRIER_WAIT 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" pthread_barrier_wait.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS pthread_barrier_wait.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
 if test "x$ac_cv_func_pwrite" = xyes; then :
   $as_echo "#define HAVE_PWRITE 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index 868a94c9ba..7a73affc10 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1156,6 +1156,7 @@ AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
 AC_SEARCH_LIBS(shm_open, rt)
 AC_SEARCH_LIBS(shm_unlink, rt)
 AC_SEARCH_LIBS(clock_gettime, [rt posix4])
+AC_SEARCH_LIBS(pthread_barrier_wait, pthread)
 # Solaris:
 AC_SEARCH_LIBS(fdatasync, [rt posix4])
 # Required for thread_test.c on Solaris
@@ -1738,6 +1739,7 @@ AC_REPLACE_FUNCS(m4_normalize([
 	mkdtemp
 	pread
 	preadv
+	pthread_barrier_wait
 	pwrite
 	pwritev
 	random
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f4d9f3b408..0bc2efb2d8 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -424,6 +424,9 @@
 /* Define if you have POSIX threads libraries and header files. */
 #undef HAVE_PTHREAD
 
+/* Define to 1 if you have the `pthread_barrier_wait' function. */
+#undef HAVE_PTHREAD_BARRIER_WAIT
+
 /* Define to 1 if you have the `pthread_is_threaded_np' function. */
 #undef HAVE_PTHREAD_IS_THREADED_NP
 
diff --git a/src/include/port/pg_pthread.h b/src/include/port/pg_pthread.h
new file mode 100644
index 0000000000..5222cdce6e
--- /dev/null
+++ b/src/include/port/pg_pthread.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * Declarations for missing POSIX thread components.
+ *
+ *	  Currently this supplies an implementation of pthread_barrier_t for the
+ *	  benefit of macOS, which lacks it as of release 11.  These declarations
+ *	  are not in port.h, because that'd require <pthread.h> to be included by
+ *	  every translation unit.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PG_PTHREAD_H
+#define PG_PTHREAD_H
+
+#include <pthread.h>
+
+#ifndef HAVE_PTHREAD_BARRIER_WAIT
+
+#ifndef PTHREAD_BARRIER_SERIAL_THREAD
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+#endif
+
+typedef struct pg_pthread_barrier
+{
+	bool		sense;			/* we only need a one bit phase */
+	int			count;			/* number of threads expected */
+	int			arrived;		/* number of threads that have arrived */
+	pthread_mutex_t mutex;
+	pthread_cond_t cond;
+} pthread_barrier_t;
+
+extern int pthread_barrier_init(pthread_barrier_t *barrier,
+								const void *attr,
+								int count);
+extern int pthread_barrier_wait(pthread_barrier_t *barrier);
+extern int pthread_barrier_destroy(pthread_barrier_t *barrier);
+
+#endif
+
+#endif
diff --git a/src/port/pthread_barrier_wait.c b/src/port/pthread_barrier_wait.c
new file mode 100644
index 0000000000..08dacc3085
--- /dev/null
+++ b/src/port/pthread_barrier_wait.c
@@ -0,0 +1,66 @@
+#include "postgres_fe.h"
+
+#include "port/pg_pthread.h"
+
+int
+pthread_barrier_init(pthread_barrier_t *barrier, const void *attr, int count)
+{
+	barrier->sense = false;
+	barrier->count = count;
+	barrier->arrived = 0;
+	if (pthread_cond_init(&barrier->cond, NULL) < 0)
+		return -1;
+	if (pthread_mutex_init(&barrier->mutex, NULL) < 0)
+	{
+		int save_errno = errno;
+
+		pthread_cond_destroy(&barrier->cond);
+		errno = save_errno;
+
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool		initial_sense;
+
+	pthread_mutex_lock(&barrier->mutex);
+
+	/* We have arrived at the barrier. */
+	barrier->arrived++;
+	Assert(barrier->arrived <= barrier->count);
+
+	/* If we were the last to arrive, release the others and return. */
+	if (barrier->arrived == barrier->count)
+	{
+		barrier->arrived = 0;
+		barrier->sense = !barrier->sense;
+		pthread_mutex_unlock(&barrier->mutex);
+		pthread_cond_broadcast(&barrier->cond);
+
+		return PTHREAD_BARRIER_SERIAL_THREAD;
+	}
+
+	/* Wait for someone else to flip the sense. */
+	initial_sense = barrier->sense;
+	do
+	{
+		pthread_cond_wait(&barrier->cond, &barrier->mutex);
+	} while (barrier->sense == initial_sense);
+
+	pthread_mutex_unlock(&barrier->mutex);
+
+	return 0;
+}
+
+int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	pthread_cond_destroy(&barrier->cond);
+	pthread_mutex_destroy(&barrier->mutex);
+	return 0;
+}
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 59a42bea97..c34e616c39 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -333,6 +333,7 @@ sub GenerateFiles
 		HAVE_PSTAT                  => undef,
 		HAVE_PS_STRINGS             => undef,
 		HAVE_PTHREAD                => undef,
+		HAVE_PTHREAD_BARRIER_WAIT   => undef,
 		HAVE_PTHREAD_IS_THREADED_NP => undef,
 		HAVE_PTHREAD_PRIO_INHERIT   => undef,
 		HAVE_PWRITE                 => undef,
-- 
2.24.3 (Apple Git-128)

v8-0003-B.patchapplication/octet-stream; name=v8-0003-B.patchDownload

From 93e981ddd66af08b027de3916f65042d6e38c92a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 18 Jan 2021 10:07:31 +1300
Subject: [PATCH v8 3/3] B

---
 src/bin/pgbench/pgbench.c | 69 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 66 insertions(+), 3 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e639f4980d..6287e28229 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -114,18 +114,29 @@ typedef struct socket_set
  */
 
 #ifdef WIN32
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+
 /* Use native win32 threads on Windows */
 typedef struct win32_pthread *pthread_t;
 typedef int pthread_attr_t;
+typedef SYNCHRONIZATION_BARRIER pthread_barrier_t;
 
 static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
 static int	pthread_join(pthread_t th, void **thread_return);
+
+static int	pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);
+static int	pthread_barrier_wait(pthread_barrier_t *barrier);
+static int	pthread_barrier_destroy(pthread_barrier_t *barrier);
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
-#include <pthread.h>
+#include "port/pg_pthread.h"
 #else
 /* No threads implementation, use none (-j 1) */
 #define pthread_t void *
+#define pthread_barrier_t void *
+#define pthread_barrier_init(a, b, c) /* ignore */
+#define pthread_barrier_wait(a) /* ignore */
+#define pthread_barrier_destroy(a) /* ignore */
 #endif
 
 
@@ -311,6 +322,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static pthread_barrier_t barrier;
+
 /*
  * Connection state machine states.
  */
@@ -454,8 +468,8 @@ typedef struct
 
 	/* per thread collected stats in microseconds */
 	pg_time_usec_t	create_time;	/* thread creation time */
-	pg_time_usec_t	started_time;	/* thread is running */
-	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
 	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
 	StatsData	stats;
@@ -6125,6 +6139,8 @@ main(int argc, char **argv)
 	if (duration > 0)
 		setalarm(duration);
 
+	pthread_barrier_init(&barrier, NULL, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
 	for (i = 1; i < nthreads; i++)
@@ -6196,6 +6212,8 @@ main(int argc, char **argv)
 	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
 				 bench_start - start_time, latency_late);
 
+	pthread_barrier_destroy(&barrier);
+
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
 
@@ -6242,6 +6260,8 @@ threadRun(void *arg)
 		state[i].state = CSTATE_CHOOSE_SCRIPT;
 
 	/* READY */
+	pthread_barrier_wait(&barrier);
+
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
 	last_report = thread_start;
@@ -6254,7 +6274,18 @@ threadRun(void *arg)
 		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
+			{
+				/*
+				 * On connection failure, we meet the barrier here in place of
+				 * GO before proceeding to the "done" path which will cleanup,
+				 * so as to avoid locking the process.
+				 *
+				 * It is unclear whether it is worth doing anything rather than
+				 * coldly exiting with an error message.
+				 */
+				pthread_barrier_wait(&barrier);
 				goto done;
+			}
 		}
 
 		/* compute connection delay */
@@ -6266,6 +6297,8 @@ threadRun(void *arg)
 		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	pthread_barrier_wait(&barrier);
 
 	start = pg_time_now();
 	thread->bench_start = start;
@@ -6771,4 +6804,34 @@ pthread_join(pthread_t th, void **thread_return)
 	return 0;
 }
 
+static int
+pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads)
+{
+	/* no spinning: threads are not expected to arrive at the barrier together */
+	bool ok = InitializeSynchronizationBarrier(barrier, nthreads, 0);
+	return 0;
+}
+
+static int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool last = EnterSynchronizationBarrier(barrier, SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY);
+	return last ? PTHREAD_BARRIER_SERIAL_THREAD : 0;
+}
+
+static int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	/*
+	 * The following is coldly ignored because it requires Windows 8
+	 * or Windows Server 2012, which is a little too much.
+	 *
+	 * Also, there is a SYNCHRONIZATION_BARRIER_FLAGS_NO_DELETE flag
+	 * but it probably requires the same versions.
+	 *
+	 *  (void) DeleteSynchronizationBarrier(barrier);
+	 */
+	return 0;
+}
+
 #endif							/* WIN32 */
-- 
2.24.3 (Apple Git-128)

coelho@cri.ensmp.fr

almost 5 years ago

In reply to: Thomas Munro (#35)

3 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

Hello Thomas,

3 . Decide if it's sane for the Windows-based emulation to be in here
too, or if it should stay in pgbench.c. Or alternatively, if we're
emulating pthread stuff on Windows, why not also put the other pthread
emulation stuff from pgbench.c into a "ports" file; that seems
premature and overkill for your project. I dunno.

I decided to solve only the macOS problem for now. So in this
version, the A and B patches are exactly as you had them in your v7,
except that B includes “port/pg_pthread.h” instead of <pthread.h>.

Maybe it’d make sense to move the Win32 pthread emulation stuff out of
pgbench.c into src/port too (the pre-existing stuff, and the new
barrier stuff you added), but that seems like a separate patch, one
that I’m not best placed to write, and it’s not clear to me that we’ll
want to be using pthread APIs as our main abstraction if/when thread
usage increases in the PG source tree anyway. Other opinions welcome.

I think it would be much more consistent to move all the thread complement
stuff there directly: Currently (v8) the windows implementation is in
pgbench and the MacOS implementation in port, which is quite messy.

Attached is a patch set which does that. I cannot test it neither on
Windows nor on MacOS. Path 1 & 2 are really independent.

--
Fabien.

Attachments:

v9.0001.pgbench-barrier.patchtext/x-diff; name=v9.0001.pgbench-barrier.patchDownload

diff --git a/configure b/configure
index e202697bbf..54c5a7963f 100755
--- a/configure
+++ b/configure
@@ -11668,6 +11668,62 @@ if test "$ac_res" != no; then :
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing pthread_barrier_wait" >&5
+$as_echo_n "checking for library containing pthread_barrier_wait... " >&6; }
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_func_search_save_LIBS=$LIBS
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pthread_barrier_wait ();
+int
+main ()
+{
+return pthread_barrier_wait ();
+  ;
+  return 0;
+}
+_ACEOF
+for ac_lib in '' pthread; do
+  if test -z "$ac_lib"; then
+    ac_res="none required"
+  else
+    ac_res=-l$ac_lib
+    LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
+  fi
+  if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_search_pthread_barrier_wait=$ac_res
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext
+  if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  break
+fi
+done
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+
+else
+  ac_cv_search_pthread_barrier_wait=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_pthread_barrier_wait" >&5
+$as_echo "$ac_cv_search_pthread_barrier_wait" >&6; }
+ac_res=$ac_cv_search_pthread_barrier_wait
+if test "$ac_res" != no; then :
+  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
 # Solaris:
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fdatasync" >&5
 $as_echo_n "checking for library containing fdatasync... " >&6; }
@@ -15853,6 +15909,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "pthread_barrier_wait" "ac_cv_func_pthread_barrier_wait"
+if test "x$ac_cv_func_pthread_barrier_wait" = xyes; then :
+  $as_echo "#define HAVE_PTHREAD_BARRIER_WAIT 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" pthread_barrier_wait.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS pthread_barrier_wait.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
 if test "x$ac_cv_func_pwrite" = xyes; then :
   $as_echo "#define HAVE_PWRITE 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index a5ad072ee4..23ad861b27 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1152,6 +1152,7 @@ AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
 AC_SEARCH_LIBS(shm_open, rt)
 AC_SEARCH_LIBS(shm_unlink, rt)
 AC_SEARCH_LIBS(clock_gettime, [rt posix4])
+AC_SEARCH_LIBS(pthread_barrier_wait, pthread)
 # Solaris:
 AC_SEARCH_LIBS(fdatasync, [rt posix4])
 # Required for thread_test.c on Solaris
@@ -1734,6 +1735,7 @@ AC_REPLACE_FUNCS(m4_normalize([
 	mkdtemp
 	pread
 	preadv
+	pthread_barrier_wait
 	pwrite
 	pwritev
 	random
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index a4a3f40048..0ac0e186ea 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -66,6 +66,7 @@
 #include "libpq-fe.h"
 #include "pgbench.h"
 #include "portability/instr_time.h"
+#include "port/pg_pthread.h"
 
 #ifndef M_PI
 #define M_PI 3.14159265358979323846
@@ -109,26 +110,6 @@ typedef struct socket_set
 
 #endif							/* POLL_USING_SELECT */
 
-/*
- * Multi-platform pthread implementations
- */
-
-#ifdef WIN32
-/* Use native win32 threads on Windows */
-typedef struct win32_pthread *pthread_t;
-typedef int pthread_attr_t;
-
-static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
-static int	pthread_join(pthread_t th, void **thread_return);
-#elif defined(ENABLE_THREAD_SAFETY)
-/* Use platform-dependent pthread capability */
-#include <pthread.h>
-#else
-/* No threads implementation, use none (-j 1) */
-#define pthread_t void *
-#endif
-
-
 /********************************************************************
  * some configurable parameters */
 
@@ -6742,74 +6723,3 @@ socket_has_input(socket_set *sa, int fd, int idx)
 }
 
 #endif							/* POLL_USING_SELECT */
-
-
-/* partial pthread implementation for Windows */
-
-#ifdef WIN32
-
-typedef struct win32_pthread
-{
-	HANDLE		handle;
-	void	   *(*routine) (void *);
-	void	   *arg;
-	void	   *result;
-} win32_pthread;
-
-static unsigned __stdcall
-win32_pthread_run(void *arg)
-{
-	win32_pthread *th = (win32_pthread *) arg;
-
-	th->result = th->routine(th->arg);
-
-	return 0;
-}
-
-static int
-pthread_create(pthread_t *thread,
-			   pthread_attr_t *attr,
-			   void *(*start_routine) (void *),
-			   void *arg)
-{
-	int			save_errno;
-	win32_pthread *th;
-
-	th = (win32_pthread *) pg_malloc(sizeof(win32_pthread));
-	th->routine = start_routine;
-	th->arg = arg;
-	th->result = NULL;
-
-	th->handle = (HANDLE) _beginthreadex(NULL, 0, win32_pthread_run, th, 0, NULL);
-	if (th->handle == NULL)
-	{
-		save_errno = errno;
-		free(th);
-		return save_errno;
-	}
-
-	*thread = th;
-	return 0;
-}
-
-static int
-pthread_join(pthread_t th, void **thread_return)
-{
-	if (th == NULL || th->handle == NULL)
-		return errno = EINVAL;
-
-	if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
-	{
-		_dosmaperr(GetLastError());
-		return errno;
-	}
-
-	if (thread_return)
-		*thread_return = th->result;
-
-	CloseHandle(th->handle);
-	free(th);
-	return 0;
-}
-
-#endif							/* WIN32 */
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f4d9f3b408..0bc2efb2d8 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -424,6 +424,9 @@
 /* Define if you have POSIX threads libraries and header files. */
 #undef HAVE_PTHREAD
 
+/* Define to 1 if you have the `pthread_barrier_wait' function. */
+#undef HAVE_PTHREAD_BARRIER_WAIT
+
 /* Define to 1 if you have the `pthread_is_threaded_np' function. */
 #undef HAVE_PTHREAD_IS_THREADED_NP
 
diff --git a/src/include/port/pg_pthread.h b/src/include/port/pg_pthread.h
new file mode 100644
index 0000000000..90056b1242
--- /dev/null
+++ b/src/include/port/pg_pthread.h
@@ -0,0 +1,77 @@
+/*-------------------------------------------------------------------------
+ *
+ * Declarations for missing POSIX thread components.
+ *
+ *	  Currently this supplies an implementation of pthread_barrier_t for the
+ *	  benefit of macOS, which lacks it as of release 11.  These declarations
+ *	  are not in port.h, because that'd require <pthread.h> to be included by
+ *	  every translation unit.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PG_PTHREAD_H
+#define PG_PTHREAD_H
+
+/*
+ * Multi-platform pthread implementations
+ */
+#ifdef WIN32
+
+/* Use native win32 threads on Windows */
+
+typedef struct win32_pthread *pthread_t;
+typedef int pthread_attr_t;
+typedef SYNCHRONIZATION_BARRIER pthread_barrier_t;
+
+extern int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
+extern int	pthread_join(pthread_t th, void **thread_return);
+
+extern int	pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads);
+extern int	pthread_barrier_wait(pthread_barrier_t *barrier);
+extern int	pthread_barrier_destroy(pthread_barrier_t *barrier);
+
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+
+#elif defined(ENABLE_THREAD_SAFETY)
+
+/* Use platform-dependent pthread capability */
+#include <pthread.h>
+
+#ifndef PTHREAD_BARRIER_SERIAL_THREAD
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+#endif
+
+/* MacOS is missing pthread barriers */
+
+#if !defined(HAVE_PTHREAD_BARRIER_WAIT)
+
+typedef struct pg_pthread_barrier
+{
+	bool		sense;			/* we only need a one bit phase */
+	int			count;			/* number of threads expected */
+	int			arrived;		/* number of threads that have arrived */
+	pthread_mutex_t mutex;
+	pthread_cond_t cond;
+} pthread_barrier_t;
+
+extern int pthread_barrier_init(pthread_barrier_t *barrier,
+								const void *attr,
+								int count);
+extern int pthread_barrier_wait(pthread_barrier_t *barrier);
+extern int pthread_barrier_destroy(pthread_barrier_t *barrier);
+
+#endif /* HAVE_PTHREAD_BARRIER_WAIT */
+
+#else
+
+/* No threads implementation, use none */
+#define pthread_t void *
+#define pthread_barrier_t void *
+#define pthread_barrier_init(a, b, c) /* ignore */
+#define pthread_barrier_wait(a) /* ignore */
+#define pthread_barrier_destroy(a) /* ignore */
+
+#endif /* WIN32 / ENABLE_THREAD_SAFETY */
+
+#endif /* PG_PTHREAD_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index e41b005c4f..1a4b1f3a25 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -59,7 +59,8 @@ OBJS = \
 	snprintf.o \
 	strerror.o \
 	tar.o \
-	thread.o
+	thread.o \
+	pthread.o
 
 # libpgport.a, libpgport_shlib.a, and libpgport_srv.a contain the same files
 # foo.o, foo_shlib.o, and foo_srv.o are all built from foo.c
diff --git a/src/port/pthread.c b/src/port/pthread.c
new file mode 100644
index 0000000000..224ac23d26
--- /dev/null
+++ b/src/port/pthread.c
@@ -0,0 +1,174 @@
+#include "postgres_fe.h"
+
+#include "port/pg_pthread.h"
+
+/*
+ * partial pthread and pthread_barrier implementation for Windows
+ */
+
+#ifdef WIN32
+
+typedef struct win32_pthread
+{
+	HANDLE		handle;
+	void	   *(*routine) (void *);
+	void	   *arg;
+	void	   *result;
+} win32_pthread;
+
+unsigned __stdcall
+win32_pthread_run(void *arg)
+{
+	win32_pthread *th = (win32_pthread *) arg;
+
+	th->result = th->routine(th->arg);
+
+	return 0;
+}
+
+int
+pthread_create(pthread_t *thread,
+			   pthread_attr_t *attr,
+			   void *(*start_routine) (void *),
+			   void *arg)
+{
+	int			save_errno;
+	win32_pthread *th;
+
+	th = (win32_pthread *) pg_malloc(sizeof(win32_pthread));
+	th->routine = start_routine;
+	th->arg = arg;
+	th->result = NULL;
+
+	th->handle = (HANDLE) _beginthreadex(NULL, 0, win32_pthread_run, th, 0, NULL);
+	if (th->handle == NULL)
+	{
+		save_errno = errno;
+		free(th);
+		return save_errno;
+	}
+
+	*thread = th;
+	return 0;
+}
+
+int
+pthread_join(pthread_t th, void **thread_return)
+{
+	if (th == NULL || th->handle == NULL)
+		return errno = EINVAL;
+
+	if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
+	{
+		_dosmaperr(GetLastError());
+		return errno;
+	}
+
+	if (thread_return)
+		*thread_return = th->result;
+
+	CloseHandle(th->handle);
+	free(th);
+	return 0;
+}
+
+int
+pthread_barrier_init(pthread_barrier_t *barrier, void *unused, int nthreads)
+{
+	/* no spinning: threads are not expected to arrive at the barrier together */
+	bool ok = InitializeSynchronizationBarrier(barrier, nthreads, 0);
+	return 0;
+}
+
+int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool last = EnterSynchronizationBarrier(barrier, SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY);
+	return last ? PTHREAD_BARRIER_SERIAL_THREAD : 0;
+}
+
+int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	/*
+	 * The following is coldly ignored because it requires Windows 8
+	 * or Windows Server 2012, which is a little too much.
+	 *
+	 * Also, there is a SYNCHRONIZATION_BARRIER_FLAGS_NO_DELETE flag
+	 * but it probably requires the same versions.
+	 *
+	 *  (void) DeleteSynchronizationBarrier(barrier);
+	 */
+	return 0;
+}
+
+#elif !defined(HAVE_PTHREAD_BARRIER_WAIT)
+
+/*
+ * pthread barrier implementation for MacOS
+ */
+
+int
+pthread_barrier_init(pthread_barrier_t *barrier, const void *attr, int count)
+{
+	barrier->sense = false;
+	barrier->count = count;
+	barrier->arrived = 0;
+	if (pthread_cond_init(&barrier->cond, NULL) < 0)
+		return -1;
+	if (pthread_mutex_init(&barrier->mutex, NULL) < 0)
+	{
+		int save_errno = errno;
+
+		pthread_cond_destroy(&barrier->cond);
+		errno = save_errno;
+
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool		initial_sense;
+
+	pthread_mutex_lock(&barrier->mutex);
+
+	/* We have arrived at the barrier. */
+	barrier->arrived++;
+	Assert(barrier->arrived <= barrier->count);
+
+	/* If we were the last to arrive, release the others and return. */
+	if (barrier->arrived == barrier->count)
+	{
+		barrier->arrived = 0;
+		barrier->sense = !barrier->sense;
+		pthread_mutex_unlock(&barrier->mutex);
+		pthread_cond_broadcast(&barrier->cond);
+
+		return PTHREAD_BARRIER_SERIAL_THREAD;
+	}
+
+	/* Wait for someone else to flip the sense. */
+	initial_sense = barrier->sense;
+	do
+	{
+		pthread_cond_wait(&barrier->cond, &barrier->mutex);
+	} while (barrier->sense == initial_sense);
+
+	pthread_mutex_unlock(&barrier->mutex);
+
+	return 0;
+}
+
+int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	pthread_cond_destroy(&barrier->cond);
+	pthread_mutex_destroy(&barrier->mutex);
+	return 0;
+}
+
+#endif							/* HAVE_PTHREAD_BARRIER_WAIT / WIN32 */
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 90328db04e..cfcfe9e416 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -102,7 +102,7 @@ sub mkvcbuild
 	  pread.c preadv.c pwrite.c pwritev.c pg_bitutils.c
 	  pg_strong_random.c pgcheckdir.c pgmkdirp.c pgsleep.c pgstrcasecmp.c
 	  pqsignal.c mkdtemp.c qsort.c qsort_arg.c quotes.c system.c
-	  strerror.c tar.c thread.c
+	  strerror.c tar.c thread.c pthread.c
 	  win32env.c win32error.c win32security.c win32setlocale.c win32stat.c);
 
 	push(@pgportfiles, 'strtof.c') if ($vsVersion < '14.00');
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 2f28de0355..f091e6a863 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -333,6 +333,7 @@ sub GenerateFiles
 		HAVE_PSTAT                  => undef,
 		HAVE_PS_STRINGS             => undef,
 		HAVE_PTHREAD                => undef,
+		HAVE_PTHREAD_BARRIER_WAIT   => undef,
 		HAVE_PTHREAD_IS_THREADED_NP => undef,
 		HAVE_PTHREAD_PRIO_INHERIT   => undef,
 		HAVE_PWRITE                 => undef,

v9.0002.pgbench-barrier.patchtext/x-diff; name=v9.0002.pgbench-barrier.patchDownload

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index faa7c26b0a..171f824fcd 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,8 +58,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -68,8 +70,7 @@ tps = 85.296346 (excluding connections establishing)
   and number of transactions per client); these will be equal unless the run
   failed before completion.  (In <option>-T</option> mode, only the actual
   number of transactions is printed.)
-  The last two lines report the number of transactions per second,
-  figured with and without counting the time to start database sessions.
+  The last line reports the number of transactions per second.
  </para>
 
   <para>
@@ -2244,22 +2245,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 0ac0e186ea..00c48b5afe 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -273,9 +273,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -398,11 +398,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -431,13 +431,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -579,10 +582,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
 static void *threadRun(void *arg);
@@ -1086,9 +1089,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2870,7 +2873,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2880,7 +2882,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2915,29 +2917,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2945,7 +2948,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -2981,12 +2984,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3019,9 +3019,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3044,7 +3044,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3209,8 +3209,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3230,13 +3230,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3262,7 +3261,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3300,7 +3299,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3342,8 +3341,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3446,7 +3445,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3456,14 +3455,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3483,13 +3483,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3517,17 +3516,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3541,7 +3536,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3551,11 +3546,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3806,10 +3801,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3861,7 +3853,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3886,11 +3878,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3900,11 +3889,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4109,10 +4095,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4154,12 +4138,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5081,12 +5060,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5133,16 +5112,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5182,21 +5152,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5227,8 +5194,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5241,7 +5207,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5256,8 +5222,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5278,7 +5261,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5326,10 +5309,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5432,9 +5412,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6110,67 +6091,53 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
+		int			err;
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
+		thread->create_time = pg_time_now();
+		err = pthread_create(&thread->thread, NULL, threadRun, thread);
 
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
-		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
-
-			if (err != 0 || thread->thread == INVALID_THREAD)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
-		}
-		else
+		if (err != 0 || thread->thread == INVALID_THREAD)
 		{
-			thread->thread = INVALID_THREAD;
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
+
+	/* run thread 0 directly */
 	threads[0].thread = INVALID_THREAD;
-#endif							/* ENABLE_THREAD_SAFETY */
+	(void) threadRun(&threads[0]);
 
-	/* wait for threads and accumulate results */
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			pthread_join(thread->thread, NULL);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6183,23 +6150,24 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6212,34 +6180,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6260,32 +6210,49 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+
+	/* READY */
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
+
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
 	}
-
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
-
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6294,27 +6261,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6349,19 +6310,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6410,7 +6364,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6448,11 +6402,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6470,17 +6421,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 39a4f0600e..faf806a441 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,32 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ *
+ * Type int64 is good enough for about 584500 years.
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */

v9.0003.pgbench-barrier.patchtext/x-diff; name=v9.0003.pgbench-barrier.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 00c48b5afe..16a90a9fa3 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -292,6 +292,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static pthread_barrier_t barrier;
+
 /*
  * Connection state machine states.
  */
@@ -435,8 +438,8 @@ typedef struct
 
 	/* per thread collected stats in microseconds */
 	pg_time_usec_t	create_time;	/* thread creation time */
-	pg_time_usec_t	started_time;	/* thread is running */
-	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
 	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
 	StatsData	stats;
@@ -6098,6 +6101,8 @@ main(int argc, char **argv)
 	if (duration > 0)
 		setalarm(duration);
 
+	pthread_barrier_init(&barrier, NULL, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
 	for (i = 1; i < nthreads; i++)
@@ -6169,6 +6174,8 @@ main(int argc, char **argv)
 	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
 				 bench_start - start_time, latency_late);
 
+	pthread_barrier_destroy(&barrier);
+
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
 
@@ -6215,6 +6222,8 @@ threadRun(void *arg)
 		state[i].state = CSTATE_CHOOSE_SCRIPT;
 
 	/* READY */
+	pthread_barrier_wait(&barrier);
+
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
 	last_report = thread_start;
@@ -6227,7 +6236,18 @@ threadRun(void *arg)
 		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
+			{
+				/*
+				 * On connection failure, we meet the barrier here in place of
+				 * GO before proceeding to the "done" path which will cleanup,
+				 * so as to avoid locking the process.
+				 *
+				 * It is unclear whether it is worth doing anything rather than
+				 * coldly exiting with an error message.
+				 */
+				pthread_barrier_wait(&barrier);
 				goto done;
+			}
 		}
 
 		/* compute connection delay */
@@ -6239,6 +6259,8 @@ threadRun(void *arg)
 		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	pthread_barrier_wait(&barrier);
 
 	start = pg_time_now();
 	thread->bench_start = start;

thomas.munro@gmail.com

almost 5 years ago

In reply to: Fabien COELHO (#36)

Re: pgbench: option delaying queries till connections establishment?

On Sun, Jan 31, 2021 at 1:18 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:

I think it would be much more consistent to move all the thread complement
stuff there directly: Currently (v8) the windows implementation is in
pgbench and the MacOS implementation in port, which is quite messy.

Hmm. Well this is totally subjective, but here's how I see this after
thinking about it a bit more: macOS does actually have POSIX threads,
except for this tiny missing piece, so it's OK to write a toy
implementation that is reasonably conformant, and put it in there
using the usual AC_REPLACE_FUNCS machinery. It will go away when
Apple eventually adds a real one. Windows does not, and here we're
writing a very partial toy implementation that is far from conformant.
I think that's OK for pgbench's purposes, for now, but I'd prefer to
keep it inside pgbench.c. I think at some point in the (hopefully not
too distant) future, we'll start working on thread support for the
backend, and then I think we'll probably come up with our own
abstraction over Windows and POSIX threads, rather than trying to use
POSIX API wrappers from Windows, so I don't really want this stuff in
the port library. Does this make some kind of sense?

Attached is a patch set which does that. I cannot test it neither on
Windows nor on MacOS. Path 1 & 2 are really independent.

No worries. For some reason I have a lot of computers; I'll try to
get this (or rather a version with the Windows stuff moved back)
passing on all of them soon, with the aim of making it committable.

thomas.munro@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#37)

4 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

On Wed, Mar 3, 2021 at 6:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Jan 31, 2021 at 1:18 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:

I think it would be much more consistent to move all the thread complement
stuff there directly: Currently (v8) the windows implementation is in
pgbench and the MacOS implementation in port, which is quite messy.

Hmm. Well this is totally subjective, but here's how I see this after
thinking about it a bit more: macOS does actually have POSIX threads,
except for this tiny missing piece, so it's OK to write a toy
implementation that is reasonably conformant, and put it in there
using the usual AC_REPLACE_FUNCS machinery. It will go away when
Apple eventually adds a real one. Windows does not, and here we're
writing a very partial toy implementation that is far from conformant.
I think that's OK for pgbench's purposes, for now, but I'd prefer to
keep it inside pgbench.c. I think at some point in the (hopefully not
too distant) future, we'll start working on thread support for the
backend, and then I think we'll probably come up with our own
abstraction over Windows and POSIX threads, rather than trying to use
POSIX API wrappers from Windows, so I don't really want this stuff in
the port library. Does this make some kind of sense?

Here is an attempt to move things in that direction. It compiles
tests OK on Unix including macOS with and without
--disable-thread-safety, and it compiles on Windows (via CI) but I
don't yet know if it works there.

v10-0001-Add-missing-pthread_barrier_t.patch

Same as v8. Adds the missing pthread_barrier_t support for macOS
only. Based on the traditional configure symbol probe for now. It's
possible that we'll later decide to use declarations to be more
future-proof against Apple's API versioning strategy, but I don't have
one of those weird cross-version compiler setups to investigate that
(see complaints from James Hilliard about the way we deal with
pwrite()).

v10-0002-pgbench-Refactor-the-way-we-do-thread-portabilit.patch

New. Abandons the concept of doing a fake pthread API on Windows in
pgbench.c, in favour of a couple of tiny local macros that abstract
over POSIX, Windows and threadless builds. This results in less code,
and also fixes some minor problems I spotted in pre-existing code:
it's not OK to use (pthread_t) 0 as a special value, or to compare
pthread_t values with ==, or to assume that pthread APIs set errno.

v10-0003-pgbench-Improve-time-measurement-code.patch

Your original A patch, rebased over the above. I haven't reviewed
this one. It lacks a commit message.

v10-0004-pgbench-Synchronize-client-threads.patch

Adds in the barriers.

Attachments:

v10-0001-Add-missing-pthread_barrier_t.patchtext/x-patch; charset=US-ASCII; name=v10-0001-Add-missing-pthread_barrier_t.patchDownload

From 1459914c729e1157a932254bf7483f1ef7ac6408 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 2 Jan 2021 15:05:06 +1300
Subject: [PATCH v10 1/5] Add missing pthread_barrier_t.

Supply a simple implementation of the missing pthread_barrier_t type and
functions, for macOS.

Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 configure                       | 69 +++++++++++++++++++++++++++++++++
 configure.ac                    |  2 +
 src/include/pg_config.h.in      |  3 ++
 src/include/port/pg_pthread.h   | 41 ++++++++++++++++++++
 src/port/pthread_barrier_wait.c | 66 +++++++++++++++++++++++++++++++
 src/tools/msvc/Solution.pm      |  1 +
 6 files changed, 182 insertions(+)
 create mode 100644 src/include/port/pg_pthread.h
 create mode 100644 src/port/pthread_barrier_wait.c

diff --git a/configure b/configure
index ce9ea36999..fad817bb38 100755
--- a/configure
+++ b/configure
@@ -11635,6 +11635,62 @@ if test "$ac_res" != no; then :
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing pthread_barrier_wait" >&5
+$as_echo_n "checking for library containing pthread_barrier_wait... " >&6; }
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_func_search_save_LIBS=$LIBS
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pthread_barrier_wait ();
+int
+main ()
+{
+return pthread_barrier_wait ();
+  ;
+  return 0;
+}
+_ACEOF
+for ac_lib in '' pthread; do
+  if test -z "$ac_lib"; then
+    ac_res="none required"
+  else
+    ac_res=-l$ac_lib
+    LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
+  fi
+  if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_search_pthread_barrier_wait=$ac_res
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext
+  if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  break
+fi
+done
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+
+else
+  ac_cv_search_pthread_barrier_wait=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_pthread_barrier_wait" >&5
+$as_echo "$ac_cv_search_pthread_barrier_wait" >&6; }
+ac_res=$ac_cv_search_pthread_barrier_wait
+if test "$ac_res" != no; then :
+  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
 # Solaris:
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fdatasync" >&5
 $as_echo_n "checking for library containing fdatasync... " >&6; }
@@ -15883,6 +15939,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "pthread_barrier_wait" "ac_cv_func_pthread_barrier_wait"
+if test "x$ac_cv_func_pthread_barrier_wait" = xyes; then :
+  $as_echo "#define HAVE_PTHREAD_BARRIER_WAIT 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" pthread_barrier_wait.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS pthread_barrier_wait.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
 if test "x$ac_cv_func_pwrite" = xyes; then :
   $as_echo "#define HAVE_PWRITE 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index f54f65febe..0ed53571dd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1143,6 +1143,7 @@ AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
 AC_SEARCH_LIBS(shm_open, rt)
 AC_SEARCH_LIBS(shm_unlink, rt)
 AC_SEARCH_LIBS(clock_gettime, [rt posix4])
+AC_SEARCH_LIBS(pthread_barrier_wait, pthread)
 # Solaris:
 AC_SEARCH_LIBS(fdatasync, [rt posix4])
 # Required for thread_test.c on Solaris
@@ -1743,6 +1744,7 @@ AC_REPLACE_FUNCS(m4_normalize([
 	mkdtemp
 	pread
 	preadv
+	pthread_barrier_wait
 	pwrite
 	pwritev
 	random
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 04dc330119..7a7cc21d8d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -424,6 +424,9 @@
 /* Define if you have POSIX threads libraries and header files. */
 #undef HAVE_PTHREAD
 
+/* Define to 1 if you have the `pthread_barrier_wait' function. */
+#undef HAVE_PTHREAD_BARRIER_WAIT
+
 /* Define to 1 if you have the `pthread_is_threaded_np' function. */
 #undef HAVE_PTHREAD_IS_THREADED_NP
 
diff --git a/src/include/port/pg_pthread.h b/src/include/port/pg_pthread.h
new file mode 100644
index 0000000000..5222cdce6e
--- /dev/null
+++ b/src/include/port/pg_pthread.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * Declarations for missing POSIX thread components.
+ *
+ *	  Currently this supplies an implementation of pthread_barrier_t for the
+ *	  benefit of macOS, which lacks it as of release 11.  These declarations
+ *	  are not in port.h, because that'd require <pthread.h> to be included by
+ *	  every translation unit.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PG_PTHREAD_H
+#define PG_PTHREAD_H
+
+#include <pthread.h>
+
+#ifndef HAVE_PTHREAD_BARRIER_WAIT
+
+#ifndef PTHREAD_BARRIER_SERIAL_THREAD
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+#endif
+
+typedef struct pg_pthread_barrier
+{
+	bool		sense;			/* we only need a one bit phase */
+	int			count;			/* number of threads expected */
+	int			arrived;		/* number of threads that have arrived */
+	pthread_mutex_t mutex;
+	pthread_cond_t cond;
+} pthread_barrier_t;
+
+extern int pthread_barrier_init(pthread_barrier_t *barrier,
+								const void *attr,
+								int count);
+extern int pthread_barrier_wait(pthread_barrier_t *barrier);
+extern int pthread_barrier_destroy(pthread_barrier_t *barrier);
+
+#endif
+
+#endif
diff --git a/src/port/pthread_barrier_wait.c b/src/port/pthread_barrier_wait.c
new file mode 100644
index 0000000000..08dacc3085
--- /dev/null
+++ b/src/port/pthread_barrier_wait.c
@@ -0,0 +1,66 @@
+#include "postgres_fe.h"
+
+#include "port/pg_pthread.h"
+
+int
+pthread_barrier_init(pthread_barrier_t *barrier, const void *attr, int count)
+{
+	barrier->sense = false;
+	barrier->count = count;
+	barrier->arrived = 0;
+	if (pthread_cond_init(&barrier->cond, NULL) < 0)
+		return -1;
+	if (pthread_mutex_init(&barrier->mutex, NULL) < 0)
+	{
+		int save_errno = errno;
+
+		pthread_cond_destroy(&barrier->cond);
+		errno = save_errno;
+
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool		initial_sense;
+
+	pthread_mutex_lock(&barrier->mutex);
+
+	/* We have arrived at the barrier. */
+	barrier->arrived++;
+	Assert(barrier->arrived <= barrier->count);
+
+	/* If we were the last to arrive, release the others and return. */
+	if (barrier->arrived == barrier->count)
+	{
+		barrier->arrived = 0;
+		barrier->sense = !barrier->sense;
+		pthread_mutex_unlock(&barrier->mutex);
+		pthread_cond_broadcast(&barrier->cond);
+
+		return PTHREAD_BARRIER_SERIAL_THREAD;
+	}
+
+	/* Wait for someone else to flip the sense. */
+	initial_sense = barrier->sense;
+	do
+	{
+		pthread_cond_wait(&barrier->cond, &barrier->mutex);
+	} while (barrier->sense == initial_sense);
+
+	pthread_mutex_unlock(&barrier->mutex);
+
+	return 0;
+}
+
+int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	pthread_cond_destroy(&barrier->cond);
+	pthread_mutex_destroy(&barrier->mutex);
+	return 0;
+}
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 69b08591cc..a4f5cc4bdb 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -333,6 +333,7 @@ sub GenerateFiles
 		HAVE_PSTAT                  => undef,
 		HAVE_PS_STRINGS             => undef,
 		HAVE_PTHREAD                => undef,
+		HAVE_PTHREAD_BARRIER_WAIT   => undef,
 		HAVE_PTHREAD_IS_THREADED_NP => undef,
 		HAVE_PTHREAD_PRIO_INHERIT   => undef,
 		HAVE_PWRITE                 => undef,
-- 
2.30.1

v10-0002-pgbench-Refactor-the-way-we-do-thread-portabilit.patchtext/x-patch; charset=US-ASCII; name=v10-0002-pgbench-Refactor-the-way-we-do-thread-portabilit.patchDownload

From cd314bda033e6be3165420ff07485a5bb2935c8a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Mar 2021 00:42:13 +1300
Subject: [PATCH v10 2/5] pgbench: Refactor the way we do thread portability.

Instead of maintaining an incomplete emulation of POSIX threads for
Windows, let's use an extremely minimalist abstraction over both APIs
for now.  Small problems fixed: it's not OK to use (pthread_t) 0
as a special value, it's no OK to compare thread_t values with ==, and
we incorrectly assumed that pthread functions set errno.

Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 src/bin/pgbench/pgbench.c | 123 ++++++++++----------------------------
 1 file changed, 30 insertions(+), 93 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 31a4df45f5..1534a99ae1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -110,22 +110,37 @@ typedef struct socket_set
 #endif							/* POLL_USING_SELECT */
 
 /*
- * Multi-platform pthread implementations
+ * Multi-platform thread implementations
  */
 
 #ifdef WIN32
 /* Use native win32 threads on Windows */
-typedef struct win32_pthread *pthread_t;
-typedef int pthread_attr_t;
-
-static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
-static int	pthread_join(pthread_t th, void **thread_return);
+#include <windows.h>
+#define GETERRNO() (_dosmaperr(GetLastError()), errno)
+#define THREAD_T HANDLE
+#define THREAD_FUNC_RETURN_TYPE void
+#define THREAD_FUNC_RETURN return
+#define THREAD_CREATE(handle, function, arg) \
+	((*(handle) = _beginthread((function), 0, (arg))) != (HANDLE) -1 ? \
+	 0 : GETERRNO())
+#define THREAD_JOIN(handle) \
+	(WaitForSingleObject(handle, INFINITE) != WAIT_OBJECT_0 ? \
+	GETERRNO() : CloseHandle(handle) ? 0 : GETERRNO())
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
 #include <pthread.h>
+#define THREAD_T pthread_t
+#define THREAD_FUNC_RETURN_TYPE void *
+#define THREAD_FUNC_RETURN return NULL
+#define THREAD_CREATE(handle, function, arg) \
+	pthread_create((handle), NULL, (function), (arg))
+#define THREAD_JOIN(handle) \
+	pthread_join((handle), NULL)
 #else
 /* No threads implementation, use none (-j 1) */
-#define pthread_t void *
+#define THREAD_T void *
+#define THREAD_FUNC_RETURN_TYPE void *
+#define THREAD_FUNC_RETURN return NULL
 #endif
 
 
@@ -435,7 +450,7 @@ typedef struct
 typedef struct
 {
 	int			tid;			/* thread id */
-	pthread_t	thread;			/* thread handle */
+	THREAD_T	thread;			/* thread handle */
 	CState	   *state;			/* array of CState */
 	int			nstate;			/* length of state[] */
 
@@ -458,8 +473,6 @@ typedef struct
 	int64		latency_late;	/* executed but late transactions */
 } TState;
 
-#define INVALID_THREAD		((pthread_t) 0)
-
 /*
  * queries read from files
  */
@@ -603,7 +616,7 @@ static void doLog(TState *thread, CState *st,
 static void processXactStats(TState *thread, CState *st, instr_time *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
-static void *threadRun(void *arg);
+static THREAD_FUNC_RETURN_TYPE threadRun(void *arg);
 static void finishCon(CState *st);
 static void setalarm(int seconds);
 static socket_set *alloc_socket_set(int count);
@@ -6148,18 +6161,14 @@ main(int argc, char **argv)
 		/* the first thread (i = 0) is executed by main thread */
 		if (i > 0)
 		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
+			errno = THREAD_CREATE(&thread->thread, threadRun, thread);
 
-			if (err != 0 || thread->thread == INVALID_THREAD)
+			if (errno != 0)
 			{
 				pg_log_fatal("could not create thread: %m");
 				exit(1);
 			}
 		}
-		else
-		{
-			thread->thread = INVALID_THREAD;
-		}
 	}
 #else
 	INSTR_TIME_SET_CURRENT(threads[0].start_time);
@@ -6167,7 +6176,6 @@ main(int argc, char **argv)
 	if (duration > 0)
 		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
 			(int64) 1000000 * duration;
-	threads[0].thread = INVALID_THREAD;
 #endif							/* ENABLE_THREAD_SAFETY */
 
 	/* wait for threads and accumulate results */
@@ -6178,12 +6186,12 @@ main(int argc, char **argv)
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
+		if (i == 0)
 			/* actually run this thread directly in the main thread */
 			(void) threadRun(thread);
 		else
 			/* wait of other threads. should check that 0 is returned? */
-			pthread_join(thread->thread, NULL);
+			THREAD_JOIN(thread->thread);
 #else
 		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
@@ -6222,7 +6230,7 @@ main(int argc, char **argv)
 	return exit_code;
 }
 
-static void *
+static THREAD_FUNC_RETURN_TYPE
 threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
@@ -6507,7 +6515,7 @@ done:
 		thread->logfile = NULL;
 	}
 	free_socket_set(sockets);
-	return NULL;
+	THREAD_FUNC_RETURN;
 }
 
 static void
@@ -6738,74 +6746,3 @@ socket_has_input(socket_set *sa, int fd, int idx)
 }
 
 #endif							/* POLL_USING_SELECT */
-
-
-/* partial pthread implementation for Windows */
-
-#ifdef WIN32
-
-typedef struct win32_pthread
-{
-	HANDLE		handle;
-	void	   *(*routine) (void *);
-	void	   *arg;
-	void	   *result;
-} win32_pthread;
-
-static unsigned __stdcall
-win32_pthread_run(void *arg)
-{
-	win32_pthread *th = (win32_pthread *) arg;
-
-	th->result = th->routine(th->arg);
-
-	return 0;
-}
-
-static int
-pthread_create(pthread_t *thread,
-			   pthread_attr_t *attr,
-			   void *(*start_routine) (void *),
-			   void *arg)
-{
-	int			save_errno;
-	win32_pthread *th;
-
-	th = (win32_pthread *) pg_malloc(sizeof(win32_pthread));
-	th->routine = start_routine;
-	th->arg = arg;
-	th->result = NULL;
-
-	th->handle = (HANDLE) _beginthreadex(NULL, 0, win32_pthread_run, th, 0, NULL);
-	if (th->handle == NULL)
-	{
-		save_errno = errno;
-		free(th);
-		return save_errno;
-	}
-
-	*thread = th;
-	return 0;
-}
-
-static int
-pthread_join(pthread_t th, void **thread_return)
-{
-	if (th == NULL || th->handle == NULL)
-		return errno = EINVAL;
-
-	if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
-	{
-		_dosmaperr(GetLastError());
-		return errno;
-	}
-
-	if (thread_return)
-		*thread_return = th->result;
-
-	CloseHandle(th->handle);
-	free(th);
-	return 0;
-}
-
-#endif							/* WIN32 */
-- 
2.30.1

v10-0003-pgbench-Improve-time-measurement-code.patchtext/x-patch; charset=US-ASCII; name=v10-0003-pgbench-Improve-time-measurement-code.patchDownload

From 514bd197523f51511b11825092de395729c6f469 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Mar 2021 17:25:06 +1300
Subject: [PATCH v10 3/5] pgbench: Improve time measurement code.

XXX This needs a commit message.

Author: Andres Freund <andres@anarazel.de>
Author: Fabien COELHO <coelho@cri.ensmp.fr>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 doc/src/sgml/ref/pgbench.sgml        |  39 +--
 src/bin/pgbench/pgbench.c            | 410 ++++++++++++---------------
 src/include/portability/instr_time.h |  28 ++
 3 files changed, 230 insertions(+), 247 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 2ec0580a79..3c3699cd73 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,8 +58,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -68,8 +70,7 @@ tps = 85.296346 (excluding connections establishing)
   and number of transactions per client); these will be equal unless the run
   failed before completion.  (In <option>-T</option> mode, only the actual
   number of transactions is printed.)
-  The last two lines report the number of transactions per second,
-  figured with and without counting the time to start database sessions.
+  The last line reports the number of transactions per second.
  </para>
 
   <para>
@@ -2257,22 +2258,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1534a99ae1..e459dd7979 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -307,9 +307,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -432,11 +432,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -464,13 +464,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 /*
@@ -610,10 +613,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
 static THREAD_FUNC_RETURN_TYPE threadRun(void *arg);
@@ -1117,9 +1120,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2898,7 +2901,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2908,7 +2910,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2943,29 +2945,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2973,7 +2976,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -3009,12 +3012,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3047,9 +3047,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3072,7 +3072,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3237,8 +3237,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3258,13 +3258,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3290,7 +3289,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3328,7 +3327,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3370,8 +3369,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3474,7 +3473,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3484,14 +3483,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3511,13 +3511,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3545,17 +3544,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3569,7 +3564,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3579,11 +3574,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3834,10 +3829,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3889,7 +3881,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3914,11 +3906,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3928,11 +3917,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4137,10 +4123,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4182,12 +4166,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5109,12 +5088,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5161,16 +5140,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5210,21 +5180,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5255,8 +5222,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5269,7 +5235,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5284,8 +5250,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5306,7 +5289,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5354,10 +5337,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5460,9 +5440,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6138,62 +6119,51 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
-
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
+		thread->create_time = pg_time_now();
+		errno = THREAD_CREATE(&thread->thread, threadRun, thread);
 
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
+		if (errno != 0)
 		{
-			errno = THREAD_CREATE(&thread->thread, threadRun, thread);
-
-			if (errno != 0)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
-#endif							/* ENABLE_THREAD_SAFETY */
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
 
-	/* wait for threads and accumulate results */
+	/* run thread 0 directly */
+	(void) threadRun(&threads[0]);
+
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (i == 0)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			THREAD_JOIN(thread->thread);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6206,23 +6176,24 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6235,34 +6206,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6283,32 +6236,49 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+
+	/* READY */
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
-	}
 
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
-
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
+	}
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6317,27 +6287,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6372,19 +6336,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6433,7 +6390,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6471,11 +6428,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6493,17 +6447,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 39a4f0600e..faf806a441 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,32 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ *
+ * Type int64 is good enough for about 584500 years.
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */
-- 
2.30.1

v10-0004-pgbench-Synchronize-client-threads.patchtext/x-patch; charset=US-ASCII; name=v10-0004-pgbench-Synchronize-client-threads.patchDownload

From f21dc379681c6b5467fe00fb5ddcd438d49b6ec3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 18 Jan 2021 10:07:31 +1300
Subject: [PATCH v10 4/5] pgbench: Synchronize client threads.

Wait until all pgbench threads are connected before benchmarking begins.

Author: Andres Freund <andres@anarazel.de>
Author: Fabien COELHO <coelho@cri.ensmp.fr>
Reviewed-by: Marina Polyakova <m.polyakova@postgrespro.ru>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 src/bin/pgbench/pgbench.c | 45 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e459dd7979..babd635b7e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -126,9 +126,17 @@ typedef struct socket_set
 #define THREAD_JOIN(handle) \
 	(WaitForSingleObject(handle, INFINITE) != WAIT_OBJECT_0 ? \
 	GETERRNO() : CloseHandle(handle) ? 0 : GETERRNO())
+#define THREAD_BARRIER_T SYNCHRONIZATION_BARRIER
+#define THREAD_BARRIER_INIT(barrier, n) \
+	(InitializeSynchronizationBarrier((barrier), (n), 0) ? 0 : GETERRNO())
+#define THREAD_BARRIER_WAIT(barrier) \
+	(EnterSynchronizationBarrier((barrier), \
+								 SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY) ? \
+								 0 : GETERRNO())
+#define THREAD_BARRIER_DESTROY(barrier)
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
-#include <pthread.h>
+#include "port/pg_pthread.h"
 #define THREAD_T pthread_t
 #define THREAD_FUNC_RETURN_TYPE void *
 #define THREAD_FUNC_RETURN return NULL
@@ -136,11 +144,20 @@ typedef struct socket_set
 	pthread_create((handle), NULL, (function), (arg))
 #define THREAD_JOIN(handle) \
 	pthread_join((handle), NULL)
+#define THREAD_BARRIER_T pthread_barrier_t
+#define THREAD_BARRIER_INIT(barrier, n) \
+	pthread_barrier_init((barrier), NULL, (n))
+#define THREAD_BARRIER_WAIT(barrier) pthread_barrier_wait((barrier))
+#define THREAD_BARRIER_DESTROY(barrier) pthread_barrier_destroy((barrier))
 #else
 /* No threads implementation, use none (-j 1) */
 #define THREAD_T void *
 #define THREAD_FUNC_RETURN_TYPE void *
 #define THREAD_FUNC_RETURN return NULL
+#define THREAD_BARRIER_T void *
+#define THREAD_BARRIER_INIT(barrier, n)
+#define THREAD_BARRIER_WAIT(barrier)
+#define THREAD_BARRIER_DESTROY(barrier)
 #endif
 
 
@@ -326,6 +343,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static THREAD_BARRIER_T barrier;
+
 /*
  * Connection state machine states.
  */
@@ -468,8 +488,8 @@ typedef struct
 
 	/* per thread collected stats in microseconds */
 	pg_time_usec_t	create_time;	/* thread creation time */
-	pg_time_usec_t	started_time;	/* thread is running */
-	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
 	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
 	StatsData	stats;
@@ -6126,6 +6146,8 @@ main(int argc, char **argv)
 	if (duration > 0)
 		setalarm(duration);
 
+	THREAD_BARRIER_INIT(&barrier, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
 	for (i = 1; i < nthreads; i++)
@@ -6195,6 +6217,8 @@ main(int argc, char **argv)
 	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
 				 bench_start - start_time, latency_late);
 
+	THREAD_BARRIER_DESTROY(&barrier);
+
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
 
@@ -6241,6 +6265,8 @@ threadRun(void *arg)
 		state[i].state = CSTATE_CHOOSE_SCRIPT;
 
 	/* READY */
+	THREAD_BARRIER_WAIT(&barrier);
+
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
 	last_report = thread_start;
@@ -6253,7 +6279,18 @@ threadRun(void *arg)
 		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
+			{
+				/*
+				 * On connection failure, we meet the barrier here in place of
+				 * GO before proceeding to the "done" path which will cleanup,
+				 * so as to avoid locking the process.
+				 *
+				 * It is unclear whether it is worth doing anything rather than
+				 * coldly exiting with an error message.
+				 */
+				THREAD_BARRIER_WAIT(&barrier);
 				goto done;
+			}
 		}
 
 		/* compute connection delay */
@@ -6265,6 +6302,8 @@ threadRun(void *arg)
 		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	THREAD_BARRIER_WAIT(&barrier);
 
 	start = pg_time_now();
 	thread->bench_start = start;
-- 
2.30.1

thomas.munro@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#38)

4 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

On Thu, Mar 4, 2021 at 10:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:

v10-0002-pgbench-Refactor-the-way-we-do-thread-portabilit.patch

Here's a better version of that part. I don't yet know if it actually
works on Windows...

Attachments:

v11-0001-Add-missing-pthread_barrier_t.patchtext/x-patch; charset=US-ASCII; name=v11-0001-Add-missing-pthread_barrier_t.patchDownload

From 3aa63dfc086ab1f687ed668091a6bda8bf270fa7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 2 Jan 2021 15:05:06 +1300
Subject: [PATCH v11 1/6] Add missing pthread_barrier_t.

Supply a simple implementation of the missing pthread_barrier_t type and
functions, for macOS.

Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 configure                       | 69 +++++++++++++++++++++++++++++++++
 configure.ac                    |  2 +
 src/include/pg_config.h.in      |  3 ++
 src/include/port/pg_pthread.h   | 41 ++++++++++++++++++++
 src/port/pthread_barrier_wait.c | 66 +++++++++++++++++++++++++++++++
 src/tools/msvc/Solution.pm      |  1 +
 6 files changed, 182 insertions(+)
 create mode 100644 src/include/port/pg_pthread.h
 create mode 100644 src/port/pthread_barrier_wait.c

diff --git a/configure b/configure
index ce9ea36999..fad817bb38 100755
--- a/configure
+++ b/configure
@@ -11635,6 +11635,62 @@ if test "$ac_res" != no; then :
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing pthread_barrier_wait" >&5
+$as_echo_n "checking for library containing pthread_barrier_wait... " >&6; }
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_func_search_save_LIBS=$LIBS
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pthread_barrier_wait ();
+int
+main ()
+{
+return pthread_barrier_wait ();
+  ;
+  return 0;
+}
+_ACEOF
+for ac_lib in '' pthread; do
+  if test -z "$ac_lib"; then
+    ac_res="none required"
+  else
+    ac_res=-l$ac_lib
+    LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
+  fi
+  if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_search_pthread_barrier_wait=$ac_res
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext
+  if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  break
+fi
+done
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+
+else
+  ac_cv_search_pthread_barrier_wait=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_pthread_barrier_wait" >&5
+$as_echo "$ac_cv_search_pthread_barrier_wait" >&6; }
+ac_res=$ac_cv_search_pthread_barrier_wait
+if test "$ac_res" != no; then :
+  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
 # Solaris:
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fdatasync" >&5
 $as_echo_n "checking for library containing fdatasync... " >&6; }
@@ -15883,6 +15939,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "pthread_barrier_wait" "ac_cv_func_pthread_barrier_wait"
+if test "x$ac_cv_func_pthread_barrier_wait" = xyes; then :
+  $as_echo "#define HAVE_PTHREAD_BARRIER_WAIT 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" pthread_barrier_wait.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS pthread_barrier_wait.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
 if test "x$ac_cv_func_pwrite" = xyes; then :
   $as_echo "#define HAVE_PWRITE 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index f54f65febe..0ed53571dd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1143,6 +1143,7 @@ AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
 AC_SEARCH_LIBS(shm_open, rt)
 AC_SEARCH_LIBS(shm_unlink, rt)
 AC_SEARCH_LIBS(clock_gettime, [rt posix4])
+AC_SEARCH_LIBS(pthread_barrier_wait, pthread)
 # Solaris:
 AC_SEARCH_LIBS(fdatasync, [rt posix4])
 # Required for thread_test.c on Solaris
@@ -1743,6 +1744,7 @@ AC_REPLACE_FUNCS(m4_normalize([
 	mkdtemp
 	pread
 	preadv
+	pthread_barrier_wait
 	pwrite
 	pwritev
 	random
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 04dc330119..7a7cc21d8d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -424,6 +424,9 @@
 /* Define if you have POSIX threads libraries and header files. */
 #undef HAVE_PTHREAD
 
+/* Define to 1 if you have the `pthread_barrier_wait' function. */
+#undef HAVE_PTHREAD_BARRIER_WAIT
+
 /* Define to 1 if you have the `pthread_is_threaded_np' function. */
 #undef HAVE_PTHREAD_IS_THREADED_NP
 
diff --git a/src/include/port/pg_pthread.h b/src/include/port/pg_pthread.h
new file mode 100644
index 0000000000..5222cdce6e
--- /dev/null
+++ b/src/include/port/pg_pthread.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * Declarations for missing POSIX thread components.
+ *
+ *	  Currently this supplies an implementation of pthread_barrier_t for the
+ *	  benefit of macOS, which lacks it as of release 11.  These declarations
+ *	  are not in port.h, because that'd require <pthread.h> to be included by
+ *	  every translation unit.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PG_PTHREAD_H
+#define PG_PTHREAD_H
+
+#include <pthread.h>
+
+#ifndef HAVE_PTHREAD_BARRIER_WAIT
+
+#ifndef PTHREAD_BARRIER_SERIAL_THREAD
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+#endif
+
+typedef struct pg_pthread_barrier
+{
+	bool		sense;			/* we only need a one bit phase */
+	int			count;			/* number of threads expected */
+	int			arrived;		/* number of threads that have arrived */
+	pthread_mutex_t mutex;
+	pthread_cond_t cond;
+} pthread_barrier_t;
+
+extern int pthread_barrier_init(pthread_barrier_t *barrier,
+								const void *attr,
+								int count);
+extern int pthread_barrier_wait(pthread_barrier_t *barrier);
+extern int pthread_barrier_destroy(pthread_barrier_t *barrier);
+
+#endif
+
+#endif
diff --git a/src/port/pthread_barrier_wait.c b/src/port/pthread_barrier_wait.c
new file mode 100644
index 0000000000..08dacc3085
--- /dev/null
+++ b/src/port/pthread_barrier_wait.c
@@ -0,0 +1,66 @@
+#include "postgres_fe.h"
+
+#include "port/pg_pthread.h"
+
+int
+pthread_barrier_init(pthread_barrier_t *barrier, const void *attr, int count)
+{
+	barrier->sense = false;
+	barrier->count = count;
+	barrier->arrived = 0;
+	if (pthread_cond_init(&barrier->cond, NULL) < 0)
+		return -1;
+	if (pthread_mutex_init(&barrier->mutex, NULL) < 0)
+	{
+		int save_errno = errno;
+
+		pthread_cond_destroy(&barrier->cond);
+		errno = save_errno;
+
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool		initial_sense;
+
+	pthread_mutex_lock(&barrier->mutex);
+
+	/* We have arrived at the barrier. */
+	barrier->arrived++;
+	Assert(barrier->arrived <= barrier->count);
+
+	/* If we were the last to arrive, release the others and return. */
+	if (barrier->arrived == barrier->count)
+	{
+		barrier->arrived = 0;
+		barrier->sense = !barrier->sense;
+		pthread_mutex_unlock(&barrier->mutex);
+		pthread_cond_broadcast(&barrier->cond);
+
+		return PTHREAD_BARRIER_SERIAL_THREAD;
+	}
+
+	/* Wait for someone else to flip the sense. */
+	initial_sense = barrier->sense;
+	do
+	{
+		pthread_cond_wait(&barrier->cond, &barrier->mutex);
+	} while (barrier->sense == initial_sense);
+
+	pthread_mutex_unlock(&barrier->mutex);
+
+	return 0;
+}
+
+int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	pthread_cond_destroy(&barrier->cond);
+	pthread_mutex_destroy(&barrier->mutex);
+	return 0;
+}
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 69b08591cc..a4f5cc4bdb 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -333,6 +333,7 @@ sub GenerateFiles
 		HAVE_PSTAT                  => undef,
 		HAVE_PS_STRINGS             => undef,
 		HAVE_PTHREAD                => undef,
+		HAVE_PTHREAD_BARRIER_WAIT   => undef,
 		HAVE_PTHREAD_IS_THREADED_NP => undef,
 		HAVE_PTHREAD_PRIO_INHERIT   => undef,
 		HAVE_PWRITE                 => undef,
-- 
2.30.0

v11-0002-pgbench-Refactor-the-way-we-do-thread-portabilit.patchtext/x-patch; charset=US-ASCII; name=v11-0002-pgbench-Refactor-the-way-we-do-thread-portabilit.patchDownload

From 3b7941869aa3f0960aec9c367942ed13e66c071c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Mar 2021 00:42:13 +1300
Subject: [PATCH v11 2/6] pgbench: Refactor the way we do thread portability.

Instead of maintaining an incomplete emulation of POSIX threads for
Windows, let's use an extremely minimalist abstraction over both APIs
for now.  Small problems fixed: it's not OK to use (pthread_t) 0
as a special value, it's no OK to compare thread_t values with ==, and
we incorrectly assumed that pthread functions set errno.

Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 src/bin/pgbench/pgbench.c | 122 +++++++++-----------------------------
 1 file changed, 29 insertions(+), 93 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 31a4df45f5..44674d9179 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -110,22 +110,36 @@ typedef struct socket_set
 #endif							/* POLL_USING_SELECT */
 
 /*
- * Multi-platform pthread implementations
+ * Multi-platform thread implementations
  */
 
 #ifdef WIN32
 /* Use native win32 threads on Windows */
-typedef struct win32_pthread *pthread_t;
-typedef int pthread_attr_t;
-
-static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
-static int	pthread_join(pthread_t th, void **thread_return);
+#include <windows.h>
+#define GETERRNO() (_dosmaperr(GetLastError()), errno)
+#define THREAD_T HANDLE
+#define THREAD_FUNC_RETURN_TYPE unsigned
+#define THREAD_FUNC_RETURN return 0
+#define THREAD_CREATE(handle, function, arg) \
+	((*(handle) = (HANDLE) _beginthreadex(NULL, 0, (function), (arg), 0, NULL)) == 0 ? errno : 0)
+#define THREAD_JOIN(handle) \
+	(WaitForSingleObject(handle, INFINITE) != WAIT_OBJECT_0 ? \
+	GETERRNO() : CloseHandle(handle) ? 0 : GETERRNO())
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
 #include <pthread.h>
+#define THREAD_T pthread_t
+#define THREAD_FUNC_RETURN_TYPE void *
+#define THREAD_FUNC_RETURN return NULL
+#define THREAD_CREATE(handle, function, arg) \
+	pthread_create((handle), NULL, (function), (arg))
+#define THREAD_JOIN(handle) \
+	pthread_join((handle), NULL)
 #else
 /* No threads implementation, use none (-j 1) */
-#define pthread_t void *
+#define THREAD_T void *
+#define THREAD_FUNC_RETURN_TYPE void *
+#define THREAD_FUNC_RETURN return NULL
 #endif
 
 
@@ -435,7 +449,7 @@ typedef struct
 typedef struct
 {
 	int			tid;			/* thread id */
-	pthread_t	thread;			/* thread handle */
+	THREAD_T	thread;			/* thread handle */
 	CState	   *state;			/* array of CState */
 	int			nstate;			/* length of state[] */
 
@@ -458,8 +472,6 @@ typedef struct
 	int64		latency_late;	/* executed but late transactions */
 } TState;
 
-#define INVALID_THREAD		((pthread_t) 0)
-
 /*
  * queries read from files
  */
@@ -603,7 +615,7 @@ static void doLog(TState *thread, CState *st,
 static void processXactStats(TState *thread, CState *st, instr_time *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
-static void *threadRun(void *arg);
+static THREAD_FUNC_RETURN_TYPE threadRun(void *arg);
 static void finishCon(CState *st);
 static void setalarm(int seconds);
 static socket_set *alloc_socket_set(int count);
@@ -6148,18 +6160,14 @@ main(int argc, char **argv)
 		/* the first thread (i = 0) is executed by main thread */
 		if (i > 0)
 		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
+			errno = THREAD_CREATE(&thread->thread, threadRun, thread);
 
-			if (err != 0 || thread->thread == INVALID_THREAD)
+			if (errno != 0)
 			{
 				pg_log_fatal("could not create thread: %m");
 				exit(1);
 			}
 		}
-		else
-		{
-			thread->thread = INVALID_THREAD;
-		}
 	}
 #else
 	INSTR_TIME_SET_CURRENT(threads[0].start_time);
@@ -6167,7 +6175,6 @@ main(int argc, char **argv)
 	if (duration > 0)
 		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
 			(int64) 1000000 * duration;
-	threads[0].thread = INVALID_THREAD;
 #endif							/* ENABLE_THREAD_SAFETY */
 
 	/* wait for threads and accumulate results */
@@ -6178,12 +6185,12 @@ main(int argc, char **argv)
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
+		if (i == 0)
 			/* actually run this thread directly in the main thread */
 			(void) threadRun(thread);
 		else
 			/* wait of other threads. should check that 0 is returned? */
-			pthread_join(thread->thread, NULL);
+			THREAD_JOIN(thread->thread);
 #else
 		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
@@ -6222,7 +6229,7 @@ main(int argc, char **argv)
 	return exit_code;
 }
 
-static void *
+static THREAD_FUNC_RETURN_TYPE
 threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
@@ -6507,7 +6514,7 @@ done:
 		thread->logfile = NULL;
 	}
 	free_socket_set(sockets);
-	return NULL;
+	THREAD_FUNC_RETURN;
 }
 
 static void
@@ -6738,74 +6745,3 @@ socket_has_input(socket_set *sa, int fd, int idx)
 }
 
 #endif							/* POLL_USING_SELECT */
-
-
-/* partial pthread implementation for Windows */
-
-#ifdef WIN32
-
-typedef struct win32_pthread
-{
-	HANDLE		handle;
-	void	   *(*routine) (void *);
-	void	   *arg;
-	void	   *result;
-} win32_pthread;
-
-static unsigned __stdcall
-win32_pthread_run(void *arg)
-{
-	win32_pthread *th = (win32_pthread *) arg;
-
-	th->result = th->routine(th->arg);
-
-	return 0;
-}
-
-static int
-pthread_create(pthread_t *thread,
-			   pthread_attr_t *attr,
-			   void *(*start_routine) (void *),
-			   void *arg)
-{
-	int			save_errno;
-	win32_pthread *th;
-
-	th = (win32_pthread *) pg_malloc(sizeof(win32_pthread));
-	th->routine = start_routine;
-	th->arg = arg;
-	th->result = NULL;
-
-	th->handle = (HANDLE) _beginthreadex(NULL, 0, win32_pthread_run, th, 0, NULL);
-	if (th->handle == NULL)
-	{
-		save_errno = errno;
-		free(th);
-		return save_errno;
-	}
-
-	*thread = th;
-	return 0;
-}
-
-static int
-pthread_join(pthread_t th, void **thread_return)
-{
-	if (th == NULL || th->handle == NULL)
-		return errno = EINVAL;
-
-	if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
-	{
-		_dosmaperr(GetLastError());
-		return errno;
-	}
-
-	if (thread_return)
-		*thread_return = th->result;
-
-	CloseHandle(th->handle);
-	free(th);
-	return 0;
-}
-
-#endif							/* WIN32 */
-- 
2.30.0

v11-0003-pgbench-Improve-time-measurement-code.patchtext/x-patch; charset=US-ASCII; name=v11-0003-pgbench-Improve-time-measurement-code.patchDownload

From a157bc61997db0333e49ed9b6dcd42d6cb55169f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Mar 2021 17:25:06 +1300
Subject: [PATCH v11 3/6] pgbench: Improve time measurement code.

XXX This needs a commit message.

Author: Fabien COELHO <coelho@cri.ensmp.fr>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 doc/src/sgml/ref/pgbench.sgml        |  39 +--
 src/bin/pgbench/pgbench.c            | 410 ++++++++++++---------------
 src/include/portability/instr_time.h |  28 ++
 3 files changed, 230 insertions(+), 247 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 2ec0580a79..3c3699cd73 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,8 +58,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -68,8 +70,7 @@ tps = 85.296346 (excluding connections establishing)
   and number of transactions per client); these will be equal unless the run
   failed before completion.  (In <option>-T</option> mode, only the actual
   number of transactions is printed.)
-  The last two lines report the number of transactions per second,
-  figured with and without counting the time to start database sessions.
+  The last line reports the number of transactions per second.
  </para>
 
   <para>
@@ -2257,22 +2258,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 44674d9179..40f8ae02a1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -306,9 +306,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -431,11 +431,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -463,13 +463,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 /*
@@ -609,10 +612,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
 static THREAD_FUNC_RETURN_TYPE threadRun(void *arg);
@@ -1116,9 +1119,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2897,7 +2900,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2907,7 +2909,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2942,29 +2944,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2972,7 +2975,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -3008,12 +3011,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3046,9 +3046,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3071,7 +3071,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3236,8 +3236,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3257,13 +3257,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3289,7 +3288,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3327,7 +3326,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3369,8 +3368,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3473,7 +3472,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3483,14 +3482,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3510,13 +3510,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3544,17 +3543,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3568,7 +3563,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3578,11 +3573,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3833,10 +3828,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3888,7 +3880,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3913,11 +3905,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3927,11 +3916,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4136,10 +4122,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4181,12 +4165,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5108,12 +5087,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5160,16 +5139,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5209,21 +5179,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5254,8 +5221,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5268,7 +5234,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5283,8 +5249,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5305,7 +5288,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5353,10 +5336,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5459,9 +5439,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6137,62 +6118,51 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
-
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
+		thread->create_time = pg_time_now();
+		errno = THREAD_CREATE(&thread->thread, threadRun, thread);
 
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
+		if (errno != 0)
 		{
-			errno = THREAD_CREATE(&thread->thread, threadRun, thread);
-
-			if (errno != 0)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
-#endif							/* ENABLE_THREAD_SAFETY */
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
 
-	/* wait for threads and accumulate results */
+	/* run thread 0 directly */
+	(void) threadRun(&threads[0]);
+
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (i == 0)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			THREAD_JOIN(thread->thread);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6205,23 +6175,24 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6234,34 +6205,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6282,32 +6235,49 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+
+	/* READY */
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
-	}
 
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
-
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
+	}
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6316,27 +6286,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6371,19 +6335,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6432,7 +6389,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6470,11 +6427,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6492,17 +6446,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 39a4f0600e..faf806a441 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,32 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ *
+ * Type int64 is good enough for about 584500 years.
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */
-- 
2.30.0

v11-0004-pgbench-Synchronize-client-threads.patchtext/x-patch; charset=US-ASCII; name=v11-0004-pgbench-Synchronize-client-threads.patchDownload

From 8d3578ce82f25f0b6a1475cf7149d77e16515660 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 18 Jan 2021 10:07:31 +1300
Subject: [PATCH v11 4/6] pgbench: Synchronize client threads.

Wait until all pgbench threads are connected before benchmarking begins.

Author: Andres Freund <andres@anarazel.de>
Author: Fabien COELHO <coelho@cri.ensmp.fr>
Reviewed-by: Marina Polyakova <m.polyakova@postgrespro.ru>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 src/bin/pgbench/pgbench.c | 45 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 40f8ae02a1..a71182805f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -125,9 +125,17 @@ typedef struct socket_set
 #define THREAD_JOIN(handle) \
 	(WaitForSingleObject(handle, INFINITE) != WAIT_OBJECT_0 ? \
 	GETERRNO() : CloseHandle(handle) ? 0 : GETERRNO())
+#define THREAD_BARRIER_T SYNCHRONIZATION_BARRIER
+#define THREAD_BARRIER_INIT(barrier, n) \
+	(InitializeSynchronizationBarrier((barrier), (n), 0) ? 0 : GETERRNO())
+#define THREAD_BARRIER_WAIT(barrier) \
+	(EnterSynchronizationBarrier((barrier), \
+								 SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY) ? \
+								 0 : GETERRNO())
+#define THREAD_BARRIER_DESTROY(barrier)
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
-#include <pthread.h>
+#include "port/pg_pthread.h"
 #define THREAD_T pthread_t
 #define THREAD_FUNC_RETURN_TYPE void *
 #define THREAD_FUNC_RETURN return NULL
@@ -135,11 +143,20 @@ typedef struct socket_set
 	pthread_create((handle), NULL, (function), (arg))
 #define THREAD_JOIN(handle) \
 	pthread_join((handle), NULL)
+#define THREAD_BARRIER_T pthread_barrier_t
+#define THREAD_BARRIER_INIT(barrier, n) \
+	pthread_barrier_init((barrier), NULL, (n))
+#define THREAD_BARRIER_WAIT(barrier) pthread_barrier_wait((barrier))
+#define THREAD_BARRIER_DESTROY(barrier) pthread_barrier_destroy((barrier))
 #else
 /* No threads implementation, use none (-j 1) */
 #define THREAD_T void *
 #define THREAD_FUNC_RETURN_TYPE void *
 #define THREAD_FUNC_RETURN return NULL
+#define THREAD_BARRIER_T void *
+#define THREAD_BARRIER_INIT(barrier, n)
+#define THREAD_BARRIER_WAIT(barrier)
+#define THREAD_BARRIER_DESTROY(barrier)
 #endif
 
 
@@ -325,6 +342,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static THREAD_BARRIER_T barrier;
+
 /*
  * Connection state machine states.
  */
@@ -467,8 +487,8 @@ typedef struct
 
 	/* per thread collected stats in microseconds */
 	pg_time_usec_t	create_time;	/* thread creation time */
-	pg_time_usec_t	started_time;	/* thread is running */
-	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
 	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
 	StatsData	stats;
@@ -6125,6 +6145,8 @@ main(int argc, char **argv)
 	if (duration > 0)
 		setalarm(duration);
 
+	THREAD_BARRIER_INIT(&barrier, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
 	for (i = 1; i < nthreads; i++)
@@ -6194,6 +6216,8 @@ main(int argc, char **argv)
 	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
 				 bench_start - start_time, latency_late);
 
+	THREAD_BARRIER_DESTROY(&barrier);
+
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
 
@@ -6240,6 +6264,8 @@ threadRun(void *arg)
 		state[i].state = CSTATE_CHOOSE_SCRIPT;
 
 	/* READY */
+	THREAD_BARRIER_WAIT(&barrier);
+
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
 	last_report = thread_start;
@@ -6252,7 +6278,18 @@ threadRun(void *arg)
 		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
+			{
+				/*
+				 * On connection failure, we meet the barrier here in place of
+				 * GO before proceeding to the "done" path which will cleanup,
+				 * so as to avoid locking the process.
+				 *
+				 * It is unclear whether it is worth doing anything rather than
+				 * coldly exiting with an error message.
+				 */
+				THREAD_BARRIER_WAIT(&barrier);
 				goto done;
+			}
 		}
 
 		/* compute connection delay */
@@ -6264,6 +6301,8 @@ threadRun(void *arg)
 		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	THREAD_BARRIER_WAIT(&barrier);
 
 	start = pg_time_now();
 	thread->bench_start = start;
-- 
2.30.0

thomas.munro@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#39)

4 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

On Fri, Mar 5, 2021 at 6:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Mar 4, 2021 at 10:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:

v10-0002-pgbench-Refactor-the-way-we-do-thread-portabilit.patch

Here's a better version of that part. I don't yet know if it actually
works on Windows...

David Rowley kindly tested this for me on Windows and told me how to
fix one of the macros that had incorrect error checking on that OS.
So here's a new version. I'm planning to commit 0001 and 0002 soon,
if there are no objections. 0003 needs some more review.

Attachments:

v12-0001-Add-missing-pthread_barrier_t.patchtext/x-patch; charset=US-ASCII; name=v12-0001-Add-missing-pthread_barrier_t.patchDownload

From 1aa4ca844b90d76c08f806505bf5350674e701e4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 2 Jan 2021 15:05:06 +1300
Subject: [PATCH v12 1/4] Add missing pthread_barrier_t.

Supply a simple implementation of the missing pthread_barrier_t type and
functions, for macOS.

Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 configure                       | 69 +++++++++++++++++++++++++++++++++
 configure.ac                    |  2 +
 src/include/pg_config.h.in      |  3 ++
 src/include/port/pg_pthread.h   | 41 ++++++++++++++++++++
 src/port/pthread_barrier_wait.c | 66 +++++++++++++++++++++++++++++++
 src/tools/msvc/Solution.pm      |  1 +
 6 files changed, 182 insertions(+)
 create mode 100644 src/include/port/pg_pthread.h
 create mode 100644 src/port/pthread_barrier_wait.c

diff --git a/configure b/configure
index ce9ea36999..fad817bb38 100755
--- a/configure
+++ b/configure
@@ -11635,6 +11635,62 @@ if test "$ac_res" != no; then :
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing pthread_barrier_wait" >&5
+$as_echo_n "checking for library containing pthread_barrier_wait... " >&6; }
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_func_search_save_LIBS=$LIBS
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pthread_barrier_wait ();
+int
+main ()
+{
+return pthread_barrier_wait ();
+  ;
+  return 0;
+}
+_ACEOF
+for ac_lib in '' pthread; do
+  if test -z "$ac_lib"; then
+    ac_res="none required"
+  else
+    ac_res=-l$ac_lib
+    LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
+  fi
+  if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_search_pthread_barrier_wait=$ac_res
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext
+  if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  break
+fi
+done
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+
+else
+  ac_cv_search_pthread_barrier_wait=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_pthread_barrier_wait" >&5
+$as_echo "$ac_cv_search_pthread_barrier_wait" >&6; }
+ac_res=$ac_cv_search_pthread_barrier_wait
+if test "$ac_res" != no; then :
+  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
 # Solaris:
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fdatasync" >&5
 $as_echo_n "checking for library containing fdatasync... " >&6; }
@@ -15883,6 +15939,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "pthread_barrier_wait" "ac_cv_func_pthread_barrier_wait"
+if test "x$ac_cv_func_pthread_barrier_wait" = xyes; then :
+  $as_echo "#define HAVE_PTHREAD_BARRIER_WAIT 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" pthread_barrier_wait.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS pthread_barrier_wait.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
 if test "x$ac_cv_func_pwrite" = xyes; then :
   $as_echo "#define HAVE_PWRITE 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index f54f65febe..0ed53571dd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1143,6 +1143,7 @@ AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
 AC_SEARCH_LIBS(shm_open, rt)
 AC_SEARCH_LIBS(shm_unlink, rt)
 AC_SEARCH_LIBS(clock_gettime, [rt posix4])
+AC_SEARCH_LIBS(pthread_barrier_wait, pthread)
 # Solaris:
 AC_SEARCH_LIBS(fdatasync, [rt posix4])
 # Required for thread_test.c on Solaris
@@ -1743,6 +1744,7 @@ AC_REPLACE_FUNCS(m4_normalize([
 	mkdtemp
 	pread
 	preadv
+	pthread_barrier_wait
 	pwrite
 	pwritev
 	random
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 04dc330119..7a7cc21d8d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -424,6 +424,9 @@
 /* Define if you have POSIX threads libraries and header files. */
 #undef HAVE_PTHREAD
 
+/* Define to 1 if you have the `pthread_barrier_wait' function. */
+#undef HAVE_PTHREAD_BARRIER_WAIT
+
 /* Define to 1 if you have the `pthread_is_threaded_np' function. */
 #undef HAVE_PTHREAD_IS_THREADED_NP
 
diff --git a/src/include/port/pg_pthread.h b/src/include/port/pg_pthread.h
new file mode 100644
index 0000000000..5222cdce6e
--- /dev/null
+++ b/src/include/port/pg_pthread.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * Declarations for missing POSIX thread components.
+ *
+ *	  Currently this supplies an implementation of pthread_barrier_t for the
+ *	  benefit of macOS, which lacks it as of release 11.  These declarations
+ *	  are not in port.h, because that'd require <pthread.h> to be included by
+ *	  every translation unit.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PG_PTHREAD_H
+#define PG_PTHREAD_H
+
+#include <pthread.h>
+
+#ifndef HAVE_PTHREAD_BARRIER_WAIT
+
+#ifndef PTHREAD_BARRIER_SERIAL_THREAD
+#define PTHREAD_BARRIER_SERIAL_THREAD (-1)
+#endif
+
+typedef struct pg_pthread_barrier
+{
+	bool		sense;			/* we only need a one bit phase */
+	int			count;			/* number of threads expected */
+	int			arrived;		/* number of threads that have arrived */
+	pthread_mutex_t mutex;
+	pthread_cond_t cond;
+} pthread_barrier_t;
+
+extern int pthread_barrier_init(pthread_barrier_t *barrier,
+								const void *attr,
+								int count);
+extern int pthread_barrier_wait(pthread_barrier_t *barrier);
+extern int pthread_barrier_destroy(pthread_barrier_t *barrier);
+
+#endif
+
+#endif
diff --git a/src/port/pthread_barrier_wait.c b/src/port/pthread_barrier_wait.c
new file mode 100644
index 0000000000..08dacc3085
--- /dev/null
+++ b/src/port/pthread_barrier_wait.c
@@ -0,0 +1,66 @@
+#include "postgres_fe.h"
+
+#include "port/pg_pthread.h"
+
+int
+pthread_barrier_init(pthread_barrier_t *barrier, const void *attr, int count)
+{
+	barrier->sense = false;
+	barrier->count = count;
+	barrier->arrived = 0;
+	if (pthread_cond_init(&barrier->cond, NULL) < 0)
+		return -1;
+	if (pthread_mutex_init(&barrier->mutex, NULL) < 0)
+	{
+		int save_errno = errno;
+
+		pthread_cond_destroy(&barrier->cond);
+		errno = save_errno;
+
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+pthread_barrier_wait(pthread_barrier_t *barrier)
+{
+	bool		initial_sense;
+
+	pthread_mutex_lock(&barrier->mutex);
+
+	/* We have arrived at the barrier. */
+	barrier->arrived++;
+	Assert(barrier->arrived <= barrier->count);
+
+	/* If we were the last to arrive, release the others and return. */
+	if (barrier->arrived == barrier->count)
+	{
+		barrier->arrived = 0;
+		barrier->sense = !barrier->sense;
+		pthread_mutex_unlock(&barrier->mutex);
+		pthread_cond_broadcast(&barrier->cond);
+
+		return PTHREAD_BARRIER_SERIAL_THREAD;
+	}
+
+	/* Wait for someone else to flip the sense. */
+	initial_sense = barrier->sense;
+	do
+	{
+		pthread_cond_wait(&barrier->cond, &barrier->mutex);
+	} while (barrier->sense == initial_sense);
+
+	pthread_mutex_unlock(&barrier->mutex);
+
+	return 0;
+}
+
+int
+pthread_barrier_destroy(pthread_barrier_t *barrier)
+{
+	pthread_cond_destroy(&barrier->cond);
+	pthread_mutex_destroy(&barrier->mutex);
+	return 0;
+}
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 69b08591cc..a4f5cc4bdb 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -333,6 +333,7 @@ sub GenerateFiles
 		HAVE_PSTAT                  => undef,
 		HAVE_PS_STRINGS             => undef,
 		HAVE_PTHREAD                => undef,
+		HAVE_PTHREAD_BARRIER_WAIT   => undef,
 		HAVE_PTHREAD_IS_THREADED_NP => undef,
 		HAVE_PTHREAD_PRIO_INHERIT   => undef,
 		HAVE_PWRITE                 => undef,
-- 
2.30.0

v12-0002-pgbench-Refactor-the-way-we-do-thread-portabilit.patchtext/x-patch; charset=US-ASCII; name=v12-0002-pgbench-Refactor-the-way-we-do-thread-portabilit.patchDownload

From dd65821225a30344f8c1a979998068dbfe063958 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Mar 2021 00:42:13 +1300
Subject: [PATCH v12 2/4] pgbench: Refactor the way we do thread portability.

Instead of maintaining an incomplete emulation of POSIX threads for
Windows, let's use an extremely minimalist abstraction over both APIs
for now.  Small problems fixed: it's not OK to use (pthread_t) 0
as a special value, it's no OK to compare thread_t values with ==, and
we incorrectly assumed that pthread functions set errno.

Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 src/bin/pgbench/pgbench.c | 122 +++++++++-----------------------------
 1 file changed, 29 insertions(+), 93 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f1d98be2d2..968b72480c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -111,22 +111,36 @@ typedef struct socket_set
 #endif							/* POLL_USING_SELECT */
 
 /*
- * Multi-platform pthread implementations
+ * Multi-platform thread implementations
  */
 
 #ifdef WIN32
 /* Use native win32 threads on Windows */
-typedef struct win32_pthread *pthread_t;
-typedef int pthread_attr_t;
-
-static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
-static int	pthread_join(pthread_t th, void **thread_return);
+#include <windows.h>
+#define GETERRNO() (_dosmaperr(GetLastError()), errno)
+#define THREAD_T HANDLE
+#define THREAD_FUNC_RETURN_TYPE unsigned
+#define THREAD_FUNC_RETURN return 0
+#define THREAD_CREATE(handle, function, arg) \
+	((*(handle) = (HANDLE) _beginthreadex(NULL, 0, (function), (arg), 0, NULL)) == 0 ? errno : 0)
+#define THREAD_JOIN(handle) \
+	(WaitForSingleObject(handle, INFINITE) != WAIT_OBJECT_0 ? \
+	GETERRNO() : CloseHandle(handle) ? 0 : GETERRNO())
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
 #include <pthread.h>
+#define THREAD_T pthread_t
+#define THREAD_FUNC_RETURN_TYPE void *
+#define THREAD_FUNC_RETURN return NULL
+#define THREAD_CREATE(handle, function, arg) \
+	pthread_create((handle), NULL, (function), (arg))
+#define THREAD_JOIN(handle) \
+	pthread_join((handle), NULL)
 #else
 /* No threads implementation, use none (-j 1) */
-#define pthread_t void *
+#define THREAD_T void *
+#define THREAD_FUNC_RETURN_TYPE void *
+#define THREAD_FUNC_RETURN return NULL
 #endif
 
 
@@ -436,7 +450,7 @@ typedef struct
 typedef struct
 {
 	int			tid;			/* thread id */
-	pthread_t	thread;			/* thread handle */
+	THREAD_T	thread;			/* thread handle */
 	CState	   *state;			/* array of CState */
 	int			nstate;			/* length of state[] */
 
@@ -459,8 +473,6 @@ typedef struct
 	int64		latency_late;	/* executed but late transactions */
 } TState;
 
-#define INVALID_THREAD		((pthread_t) 0)
-
 /*
  * queries read from files
  */
@@ -604,7 +616,7 @@ static void doLog(TState *thread, CState *st,
 static void processXactStats(TState *thread, CState *st, instr_time *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
-static void *threadRun(void *arg);
+static THREAD_FUNC_RETURN_TYPE threadRun(void *arg);
 static void finishCon(CState *st);
 static void setalarm(int seconds);
 static socket_set *alloc_socket_set(int count);
@@ -6142,18 +6154,14 @@ main(int argc, char **argv)
 		/* the first thread (i = 0) is executed by main thread */
 		if (i > 0)
 		{
-			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);
+			errno = THREAD_CREATE(&thread->thread, threadRun, thread);
 
-			if (err != 0 || thread->thread == INVALID_THREAD)
+			if (errno != 0)
 			{
 				pg_log_fatal("could not create thread: %m");
 				exit(1);
 			}
 		}
-		else
-		{
-			thread->thread = INVALID_THREAD;
-		}
 	}
 #else
 	INSTR_TIME_SET_CURRENT(threads[0].start_time);
@@ -6161,7 +6169,6 @@ main(int argc, char **argv)
 	if (duration > 0)
 		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
 			(int64) 1000000 * duration;
-	threads[0].thread = INVALID_THREAD;
 #endif							/* ENABLE_THREAD_SAFETY */
 
 	/* wait for threads and accumulate results */
@@ -6172,12 +6179,12 @@ main(int argc, char **argv)
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (threads[i].thread == INVALID_THREAD)
+		if (i == 0)
 			/* actually run this thread directly in the main thread */
 			(void) threadRun(thread);
 		else
 			/* wait of other threads. should check that 0 is returned? */
-			pthread_join(thread->thread, NULL);
+			THREAD_JOIN(thread->thread);
 #else
 		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
@@ -6216,7 +6223,7 @@ main(int argc, char **argv)
 	return exit_code;
 }
 
-static void *
+static THREAD_FUNC_RETURN_TYPE
 threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
@@ -6501,7 +6508,7 @@ done:
 		thread->logfile = NULL;
 	}
 	free_socket_set(sockets);
-	return NULL;
+	THREAD_FUNC_RETURN;
 }
 
 static void
@@ -6732,74 +6739,3 @@ socket_has_input(socket_set *sa, int fd, int idx)
 }
 
 #endif							/* POLL_USING_SELECT */
-
-
-/* partial pthread implementation for Windows */
-
-#ifdef WIN32
-
-typedef struct win32_pthread
-{
-	HANDLE		handle;
-	void	   *(*routine) (void *);
-	void	   *arg;
-	void	   *result;
-} win32_pthread;
-
-static unsigned __stdcall
-win32_pthread_run(void *arg)
-{
-	win32_pthread *th = (win32_pthread *) arg;
-
-	th->result = th->routine(th->arg);
-
-	return 0;
-}
-
-static int
-pthread_create(pthread_t *thread,
-			   pthread_attr_t *attr,
-			   void *(*start_routine) (void *),
-			   void *arg)
-{
-	int			save_errno;
-	win32_pthread *th;
-
-	th = (win32_pthread *) pg_malloc(sizeof(win32_pthread));
-	th->routine = start_routine;
-	th->arg = arg;
-	th->result = NULL;
-
-	th->handle = (HANDLE) _beginthreadex(NULL, 0, win32_pthread_run, th, 0, NULL);
-	if (th->handle == NULL)
-	{
-		save_errno = errno;
-		free(th);
-		return save_errno;
-	}
-
-	*thread = th;
-	return 0;
-}
-
-static int
-pthread_join(pthread_t th, void **thread_return)
-{
-	if (th == NULL || th->handle == NULL)
-		return errno = EINVAL;
-
-	if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
-	{
-		_dosmaperr(GetLastError());
-		return errno;
-	}
-
-	if (thread_return)
-		*thread_return = th->result;
-
-	CloseHandle(th->handle);
-	free(th);
-	return 0;
-}
-
-#endif							/* WIN32 */
-- 
2.30.0

v12-0003-pgbench-Improve-time-measurement-code.patchtext/x-patch; charset=US-ASCII; name=v12-0003-pgbench-Improve-time-measurement-code.patchDownload

From 778645a33e87be203f806cab77175f2497e2c574 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Mar 2021 17:25:06 +1300
Subject: [PATCH v12 3/4] pgbench: Improve time measurement code.

XXX This needs a commit message.

Author: Fabien COELHO <coelho@cri.ensmp.fr>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 doc/src/sgml/ref/pgbench.sgml        |  39 +--
 src/bin/pgbench/pgbench.c            | 410 ++++++++++++---------------
 src/include/portability/instr_time.h |  28 ++
 3 files changed, 230 insertions(+), 247 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 2ec0580a79..3c3699cd73 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,8 +58,10 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-tps = 85.184871 (including connections establishing)
-tps = 85.296346 (excluding connections establishing)
+latency average = 11.013 ms
+latency stddev = 7.351 ms
+initial connection time = 45.758 ms
+tps = 896.967014 (without initial connection establishing)
 </screen>
 
   The first six lines report some of the most important parameter
@@ -68,8 +70,7 @@ tps = 85.296346 (excluding connections establishing)
   and number of transactions per client); these will be equal unless the run
   failed before completion.  (In <option>-T</option> mode, only the actual
   number of transactions is printed.)
-  The last two lines report the number of transactions per second,
-  figured with and without counting the time to start database sessions.
+  The last line reports the number of transactions per second.
  </para>
 
   <para>
@@ -2257,22 +2258,22 @@ number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 10000/10000
-latency average = 15.844 ms
-latency stddev = 2.715 ms
-tps = 618.764555 (including connections establishing)
-tps = 622.977698 (excluding connections establishing)
+latency average = 10.870 ms
+latency stddev = 7.341 ms
+initial connection time = 30.954 ms
+tps = 907.949122 (without initial connection establishing)
 statement latencies in milliseconds:
-        0.002  \set aid random(1, 100000 * :scale)
-        0.005  \set bid random(1, 1 * :scale)
-        0.002  \set tid random(1, 10 * :scale)
-        0.001  \set delta random(-5000, 5000)
-        0.326  BEGIN;
-        0.603  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-        0.454  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-        5.528  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
-        7.335  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
-        0.371  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
-        1.212  END;
+    0.001  \set aid random(1, 100000 * :scale)
+    0.001  \set bid random(1, 1 * :scale)
+    0.001  \set tid random(1, 10 * :scale)
+    0.000  \set delta random(-5000, 5000)
+    0.046  BEGIN;
+    0.151  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+    0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+    4.241  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+    5.245  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+    0.102  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+    0.974  END;
 </screen>
   </para>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 968b72480c..6be83459ea 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -307,9 +307,9 @@ typedef struct SimpleStats
  */
 typedef struct StatsData
 {
-	time_t		start_time;		/* interval start time, for aggregates */
-	int64		cnt;			/* number of transactions, including skipped */
-	int64		skipped;		/* number of transactions skipped under --rate
+	pg_time_usec_t	start_time;	/* interval start time, for aggregates */
+	int64			cnt;		/* number of transactions, including skipped */
+	int64			skipped;	/* number of transactions skipped under --rate
 								 * and --latency-limit */
 	SimpleStats latency;
 	SimpleStats lag;
@@ -432,11 +432,11 @@ typedef struct
 	int			nvariables;		/* number of variables */
 	bool		vars_sorted;	/* are variables sorted by name? */
 
-	/* various times about current transaction */
-	int64		txn_scheduled;	/* scheduled start time of transaction (usec) */
-	int64		sleep_until;	/* scheduled start time of next cmd (usec) */
-	instr_time	txn_begin;		/* used for measuring schedule lag times */
-	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	/* various times about current transaction in microseconds */
+	pg_time_usec_t	txn_scheduled;	/* scheduled start time of transaction */
+	pg_time_usec_t	sleep_until;	/* scheduled start time of next cmd */
+	pg_time_usec_t	txn_begin;		/* used for measuring schedule lag times */
+	pg_time_usec_t	stmt_begin;		/* used for measuring statement latencies */
 
 	bool		prepared[MAX_SCRIPTS];	/* whether client prepared the script */
 
@@ -464,13 +464,16 @@ typedef struct
 	RandomState ts_sample_rs;	/* random state for log sampling */
 
 	int64		throttle_trigger;	/* previous/next throttling (us) */
-	FILE	   *logfile;		/* where to log, or NULL */
+	FILE	   *logfile;			/* where to log, or NULL */
+
+	/* per thread collected stats in microseconds */
+	pg_time_usec_t	create_time;	/* thread creation time */
+	pg_time_usec_t	started_time;	/* thread is running */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
-	/* per thread collected stats */
-	instr_time	start_time;		/* thread start time */
-	instr_time	conn_time;
 	StatsData	stats;
-	int64		latency_late;	/* executed but late transactions */
+	int64		latency_late;	/* count executed but late transactions */
 } TState;
 
 /*
@@ -610,10 +613,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
 static void setDoubleValue(PgBenchValue *pv, double dval);
 static bool evaluateExpr(CState *st, PgBenchExpr *expr,
 						 PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
 static void doLog(TState *thread, CState *st,
 				  StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, instr_time *now,
+static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 							 bool skipped, StatsData *agg);
 static void addScript(ParsedScript script);
 static THREAD_FUNC_RETURN_TYPE threadRun(void *arg);
@@ -1117,9 +1120,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
  * the given value.
  */
 static void
-initStats(StatsData *sd, time_t start_time)
+initStats(StatsData *sd, pg_time_usec_t start)
 {
-	sd->start_time = start_time;
+	sd->start_time = start;
 	sd->cnt = 0;
 	sd->skipped = 0;
 	initSimpleStats(&sd->latency);
@@ -2898,7 +2901,6 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
 static void
 advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 {
-	instr_time	now;
 
 	/*
 	 * gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -2908,7 +2910,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	 * means "not set yet".  Reset "now" when we execute shell commands or
 	 * expressions, which might take a non-negligible amount of time, though.
 	 */
-	INSTR_TIME_SET_ZERO(now);
+	pg_time_usec_t	now = 0;
 
 	/*
 	 * Loop in the state machine, until we have to wait for a result from the
@@ -2943,29 +2945,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 				/* Start new transaction (script) */
 			case CSTATE_START_TX:
+				pg_time_now_lazy(&now);
 
 				/* establish connection if needed, i.e. under --connect */
 				if (st->con == NULL)
 				{
-					instr_time	start;
+					pg_time_usec_t	start = now;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					start = now;
 					if ((st->con = doConnect()) == NULL)
 					{
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
 					}
-					INSTR_TIME_SET_CURRENT(now);
-					INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+
+					/* reset now after connection */
+					now = pg_time_now();
+
+					thread->conn_duration += now - start;
 
 					/* Reset session-local state */
 					memset(st->prepared, 0, sizeof(st->prepared));
 				}
 
 				/* record transaction start time */
-				INSTR_TIME_SET_CURRENT_LAZY(now);
 				st->txn_begin = now;
 
 				/*
@@ -2973,7 +2976,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * scheduled start time.
 				 */
 				if (!throttle_delay)
-					st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
+					st->txn_scheduled = now;
 
 				/* Begin with the first command */
 				st->state = CSTATE_START_COMMAND;
@@ -3009,12 +3012,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 */
 				if (latency_limit)
 				{
-					int64		now_us;
+					pg_time_now_lazy(&now);
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
-					now_us = INSTR_TIME_GET_MICROSEC(now);
-
-					while (thread->throttle_trigger < now_us - latency_limit &&
+					while (thread->throttle_trigger < now - latency_limit &&
 						   (nxacts <= 0 || st->cnt < nxacts))
 					{
 						processXactStats(thread, st, &now, true, agg);
@@ -3047,9 +3047,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * Wait until it's time to start next transaction.
 				 */
 			case CSTATE_THROTTLE:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
+				pg_time_now_lazy(&now);
 
-				if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
+				if (now < st->txn_scheduled)
 					return;		/* still sleeping, nothing to do here */
 
 				/* done sleeping, but don't start transaction if we're done */
@@ -3072,7 +3072,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				/* record begin time of next command, and initiate it */
 				if (report_per_command)
 				{
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 					st->stmt_begin = now;
 				}
 
@@ -3237,8 +3237,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * instead of CSTATE_START_TX.
 				 */
 			case CSTATE_SLEEP:
-				INSTR_TIME_SET_CURRENT_LAZY(now);
-				if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
+				pg_time_now_lazy(&now);
+				if (now < st->sleep_until)
 					return;		/* still sleeping, nothing to do here */
 				/* Else done sleeping. */
 				st->state = CSTATE_END_COMMAND;
@@ -3258,13 +3258,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				{
 					Command    *command;
 
-					INSTR_TIME_SET_CURRENT_LAZY(now);
+					pg_time_now_lazy(&now);
 
 					command = sql_script[st->use_file].commands[st->command];
 					/* XXX could use a mutex here, but we choose not to */
 					addToSimpleStats(&command->stats,
-									 INSTR_TIME_GET_DOUBLE(now) -
-									 INSTR_TIME_GET_DOUBLE(st->stmt_begin));
+									 PG_TIME_GET_DOUBLE(now - st->stmt_begin));
 				}
 
 				/* Go ahead with next command, to be executed or skipped */
@@ -3290,7 +3289,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (is_connect)
 				{
 					finishCon(st);
-					INSTR_TIME_SET_ZERO(now);
+					now = 0;
 				}
 
 				if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3328,7 +3327,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  * take no time to execute.
  */
 static ConnectionStateEnum
-executeMetaCommand(CState *st, instr_time *now)
+executeMetaCommand(CState *st, pg_time_usec_t *now)
 {
 	Command    *command = sql_script[st->use_file].commands[st->command];
 	int			argc;
@@ -3370,8 +3369,8 @@ executeMetaCommand(CState *st, instr_time *now)
 			return CSTATE_ABORTED;
 		}
 
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
-		st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
+		pg_time_now_lazy(now);
+		st->sleep_until = (*now) + usec;
 		return CSTATE_SLEEP;
 	}
 	else if (command->meta == META_SET)
@@ -3474,7 +3473,7 @@ executeMetaCommand(CState *st, instr_time *now)
 	 * executing the expression or shell command might have taken a
 	 * non-negligible amount of time, so reset 'now'
 	 */
-	INSTR_TIME_SET_ZERO(*now);
+	*now = 0;
 
 	return CSTATE_END_COMMAND;
 }
@@ -3484,14 +3483,15 @@ executeMetaCommand(CState *st, instr_time *now)
  *
  * We print Unix-epoch timestamps in the log, so that entries can be
  * correlated against other logs.  On some platforms this could be obtained
- * from the instr_time reading the caller has, but rather than get entangled
- * with that, we just eat the cost of an extra syscall in all cases.
+ * from the caller, but rather than get entangled with that, we just eat
+ * the cost of an extra syscall in all cases.
  */
 static void
 doLog(TState *thread, CState *st,
 	  StatsData *agg, bool skipped, double latency, double lag)
 {
 	FILE	   *logfile = thread->logfile;
+	pg_time_usec_t	now = pg_time_now();
 
 	Assert(use_log);
 
@@ -3511,13 +3511,12 @@ doLog(TState *thread, CState *st,
 		 * any empty intervals in between (this may happen with very low tps,
 		 * e.g. --rate=0.1).
 		 */
-		time_t		now = time(NULL);
 
 		while (agg->start_time + agg_interval <= now)
 		{
 			/* print aggregated report to logfile */
-			fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
-					(long) agg->start_time,
+			fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+					agg->start_time,
 					agg->cnt,
 					agg->latency.sum,
 					agg->latency.sum2,
@@ -3545,17 +3544,13 @@ doLog(TState *thread, CState *st,
 	else
 	{
 		/* no, print raw transactions */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
 		if (skipped)
 			fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
-					st->id, st->cnt, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
 		else
 			fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
 					st->id, st->cnt, latency, st->use_file,
-					(long) tv.tv_sec, (long) tv.tv_usec);
+					now / 1000000, now % 1000000);
 		if (throttle_delay)
 			fprintf(logfile, " %.0f", lag);
 		fputc('\n', logfile);
@@ -3569,7 +3564,7 @@ doLog(TState *thread, CState *st,
  * Note that even skipped transactions are counted in the "cnt" fields.)
  */
 static void
-processXactStats(TState *thread, CState *st, instr_time *now,
+processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
 				 bool skipped, StatsData *agg)
 {
 	double		latency = 0.0,
@@ -3579,11 +3574,11 @@ processXactStats(TState *thread, CState *st, instr_time *now,
 
 	if (detailed && !skipped)
 	{
-		INSTR_TIME_SET_CURRENT_LAZY(*now);
+		pg_time_now_lazy(now);
 
 		/* compute latency & lag */
-		latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
-		lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
+		latency = (*now) - st->txn_scheduled;
+		lag = st->txn_begin - st->txn_scheduled;
 	}
 
 	if (thread_details)
@@ -3834,10 +3829,7 @@ initGenerateDataClientSide(PGconn *con)
 	int64		k;
 
 	/* used to track elapsed time and estimate of the remaining time */
-	instr_time	start,
-				diff;
-	double		elapsed_sec,
-				remaining_sec;
+	pg_time_usec_t	start;
 	int			log_interval = 1;
 
 	/* Stay on the same line if reporting to a terminal */
@@ -3889,7 +3881,7 @@ initGenerateDataClientSide(PGconn *con)
 	}
 	PQclear(res);
 
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 
 	for (k = 0; k < (int64) naccounts * scale; k++)
 	{
@@ -3914,11 +3906,8 @@ initGenerateDataClientSide(PGconn *con)
 		 */
 		if ((!use_quiet) && (j % 100000 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
 					j, (int64) naccounts * scale,
@@ -3928,11 +3917,8 @@ initGenerateDataClientSide(PGconn *con)
 		/* let's not call the timing for each row, but only each 100 rows */
 		else if (use_quiet && (j % 100 == 0))
 		{
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
-			remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+			double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+			double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
 
 			/* have we reached the next interval (or end)? */
 			if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4137,10 +4123,8 @@ runInitSteps(const char *initialize_steps)
 
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
-		instr_time	start;
 		char	   *op = NULL;
-
-		INSTR_TIME_SET_CURRENT(start);
+		pg_time_usec_t	start = pg_time_now();
 
 		switch (*step)
 		{
@@ -4182,12 +4166,7 @@ runInitSteps(const char *initialize_steps)
 
 		if (op != NULL)
 		{
-			instr_time	diff;
-			double		elapsed_sec;
-
-			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, start);
-			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			double		elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
 
 			if (!first)
 				appendPQExpBufferStr(&stats, ", ");
@@ -5109,12 +5088,12 @@ addScript(ParsedScript script)
  * progress report.  On exit, they are updated with the new stats.
  */
 static void
-printProgressReport(TState *threads, int64 test_start, int64 now,
+printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
 					StatsData *last, int64 *last_report)
 {
 	/* generate and show report */
-	int64		run = now - *last_report,
-				ntx;
+	pg_time_usec_t	run = now - *last_report;
+	int64		ntx;
 	double		tps,
 				total_run,
 				latency,
@@ -5161,16 +5140,7 @@ printProgressReport(TState *threads, int64 test_start, int64 now,
 
 	if (progress_timestamp)
 	{
-		/*
-		 * On some platforms the current system timestamp is available in
-		 * now_time, but rather than get entangled with that, we just eat the
-		 * cost of an extra syscall in all cases.
-		 */
-		struct timeval tv;
-
-		gettimeofday(&tv, NULL);
-		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
-				 (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+		snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
 	}
 	else
 	{
@@ -5210,21 +5180,18 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
 
 /* print out results */
 static void
-printResults(StatsData *total, instr_time total_time,
-			 instr_time conn_total_time, int64 latency_late)
+printResults(StatsData *total,
+			 pg_time_usec_t total_duration,			/* benchmarking time */
+			 pg_time_usec_t conn_total_duration,	/* is_connect */
+			 pg_time_usec_t conn_elapsed_duration,	/* !is_connect */
+			 int64 latency_late)
 {
-	double		time_include,
-				tps_include,
-				tps_exclude;
+	/* tps is about actually executed transactions during benchmarking */
 	int64		ntx = total->cnt - total->skipped;
+	double 		bench_duration = PG_TIME_GET_DOUBLE(total_duration);
+	double		tps = ntx / bench_duration;
 
-	time_include = INSTR_TIME_GET_DOUBLE(total_time);
-
-	/* tps is about actually executed transactions */
-	tps_include = ntx / time_include;
-	tps_exclude = ntx /
-		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
-
+	printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5255,8 +5222,7 @@ printResults(StatsData *total, instr_time total_time,
 
 	if (throttle_delay && latency_limit)
 		printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
-			   total->skipped,
-			   100.0 * total->skipped / total->cnt);
+			   total->skipped, 100.0 * total->skipped / total->cnt);
 
 	if (latency_limit)
 		printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5269,7 +5235,7 @@ printResults(StatsData *total, instr_time total_time,
 	{
 		/* no measurement, show average latency computed from run time */
 		printf("latency average = %.3f ms\n",
-			   1000.0 * time_include * nclients / total->cnt);
+			   0.001 * total_duration * nclients / total->cnt);
 	}
 
 	if (throttle_delay)
@@ -5284,8 +5250,25 @@ printResults(StatsData *total, instr_time total_time,
 			   0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
 	}
 
-	printf("tps = %f (including connections establishing)\n", tps_include);
-	printf("tps = %f (excluding connections establishing)\n", tps_exclude);
+	/*
+	 * Under -C/--connect, each transaction incurs a significant connection cost,
+	 * it would not make much sense to ignore it in tps, and it would not be tps
+	 * anyway.
+	 *
+	 * Otherwise connections are made just once at the beginning of the run
+	 * and should not impact performance but for very short run, so they are
+	 * (right)fully ignored in tps.
+	 */
+	if (is_connect)
+	{
+		printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
+		printf("tps = %f (including reconnection times)\n", tps);
+	}
+	else
+	{
+		printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
+		printf("tps = %f (without initial connection establishing)\n", tps);
+	}
 
 	/* Report per-script/command statistics */
 	if (per_script_stats || report_per_command)
@@ -5306,7 +5289,7 @@ printResults(StatsData *total, instr_time total_time,
 					   100.0 * sql_script[i].weight / total_weight,
 					   sstats->cnt,
 					   100.0 * sstats->cnt / total->cnt,
-					   (sstats->cnt - sstats->skipped) / time_include);
+					   (sstats->cnt - sstats->skipped) / bench_duration);
 
 				if (throttle_delay && latency_limit && sstats->cnt > 0)
 					printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5354,10 +5337,7 @@ set_random_seed(const char *seed)
 	if (seed == NULL || strcmp(seed, "time") == 0)
 	{
 		/* rely on current time */
-		instr_time	now;
-
-		INSTR_TIME_SET_CURRENT(now);
-		iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
+		iseed = pg_time_now();
 	}
 	else if (strcmp(seed, "rand") == 0)
 	{
@@ -5460,9 +5440,10 @@ main(int argc, char **argv)
 	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
-	instr_time	start_time;		/* start up time */
-	instr_time	total_time;
-	instr_time	conn_total_time;
+	pg_time_usec_t
+		start_time,				/* start up time */
+		bench_start = 0,		/* first recorded benchmarking time */
+		conn_total_duration;	/* cumulated connection time in threads */
 	int64		latency_late = 0;
 	StatsData	stats;
 	int			weight;
@@ -6131,62 +6112,51 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
+	/* get start up time for the whole computation */
+	start_time = pg_time_now();
 
 	/* set alarm if duration is specified. */
 	if (duration > 0)
 		setalarm(duration);
 
-	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
-	for (i = 0; i < nthreads; i++)
+	/* start all threads but thread 0 which is executed directly later */
+	for (i = 1; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
-
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
+		thread->create_time = pg_time_now();
+		errno = THREAD_CREATE(&thread->thread, threadRun, thread);
 
-		/* the first thread (i = 0) is executed by main thread */
-		if (i > 0)
+		if (errno != 0)
 		{
-			errno = THREAD_CREATE(&thread->thread, threadRun, thread);
-
-			if (errno != 0)
-			{
-				pg_log_fatal("could not create thread: %m");
-				exit(1);
-			}
+			pg_log_fatal("could not create thread: %m");
+			exit(1);
 		}
 	}
 #else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
+	Assert(nthreads == 1);
+#endif							/* ENABLE_THREAD_SAFETY */
+
 	/* compute when to stop */
+	threads[0].create_time = pg_time_now();
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
-#endif							/* ENABLE_THREAD_SAFETY */
+		end_time = threads[0].create_time + (int64) 1000000 * duration;
 
-	/* wait for threads and accumulate results */
+	/* run thread 0 directly */
+	(void) threadRun(&threads[0]);
+
+	/* wait for other threads and accumulate results */
 	initStats(&stats, 0);
-	INSTR_TIME_SET_ZERO(conn_total_time);
+	conn_total_duration = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 #ifdef ENABLE_THREAD_SAFETY
-		if (i == 0)
-			/* actually run this thread directly in the main thread */
-			(void) threadRun(thread);
-		else
-			/* wait of other threads. should check that 0 is returned? */
+		if (i > 0)
 			THREAD_JOIN(thread->thread);
-#else
-		(void) threadRun(thread);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
@@ -6199,23 +6169,24 @@ main(int argc, char **argv)
 		stats.cnt += thread->stats.cnt;
 		stats.skipped += thread->stats.skipped;
 		latency_late += thread->latency_late;
-		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
+		conn_total_duration += thread->conn_duration;
+
+		/* first recorded benchmarking start time */
+		if (bench_start == 0 || thread->bench_start < bench_start)
+			bench_start = thread->bench_start;
 	}
+
+	/* XXX should this be connection time? */
 	disconnect_all(state, nclients);
 
 	/*
-	 * XXX We compute results as though every client of every thread started
-	 * and finished at the same time.  That model can diverge noticeably from
-	 * reality for a short benchmark run involving relatively many threads.
-	 * The first thread may process notably many transactions before the last
-	 * thread begins.  Improving the model alone would bring limited benefit,
-	 * because performance during those periods of partial thread count can
-	 * easily exceed steady state performance.  This is one of the many ways
-	 * short runs convey deceptive performance figures.
+	 * Beware that performance of short benchmarks with many threads and possibly
+	 * long transactions can be deceptive because threads do not start and finish
+	 * at the exact same time. The total duration computed here encompasses all
+	 * transactions so that tps shown is somehow slightly underestimated.
 	 */
-	INSTR_TIME_SET_CURRENT(total_time);
-	INSTR_TIME_SUBTRACT(total_time, start_time);
-	printResults(&stats, total_time, conn_total_time, latency_late);
+	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
+				 bench_start - start_time, latency_late);
 
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
@@ -6228,34 +6199,16 @@ threadRun(void *arg)
 {
 	TState	   *thread = (TState *) arg;
 	CState	   *state = thread->state;
-	instr_time	start,
-				end;
+	pg_time_usec_t	start;
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
-	int			i;
-
-	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start,
+				last_report,
+				next_report;
 	StatsData	last,
 				aggs;
 
-	/*
-	 * Initialize throttling rate target for all of the thread's clients.  It
-	 * might be a little more accurate to reset thread->start_time here too.
-	 * The possible drift seems too small relative to typical throttle delay
-	 * times to worry about it.
-	 */
-	INSTR_TIME_SET_CURRENT(start);
-	thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
-
-	INSTR_TIME_SET_ZERO(thread->conn_time);
-
-	initStats(&aggs, time(NULL));
-	last = aggs;
-
 	/* open log file if requested */
 	if (use_log)
 	{
@@ -6276,32 +6229,49 @@ threadRun(void *arg)
 		}
 	}
 
+	/* explicitly initialize the state machines */
+	for (int i = 0; i < nstate; i++)
+		state[i].state = CSTATE_CHOOSE_SCRIPT;
+
+	/* READY */
+	thread_start = pg_time_now();
+	thread->started_time = thread_start;
+	last_report = thread_start;
+	next_report = last_report + (int64) 1000000 * progress;
+
+	/* STEADY */
 	if (!is_connect)
 	{
 		/* make connections to the database before starting */
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 				goto done;
 		}
-	}
 
-	/* time after thread and connections set up */
-	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
-
-	/* explicitly initialize the state machines */
-	for (i = 0; i < nstate; i++)
+		/* compute connection delay */
+		thread->conn_duration = pg_time_now() - thread->started_time;
+	}
+	else
 	{
-		state[i].state = CSTATE_CHOOSE_SCRIPT;
+		/* no connection delay to record */
+		thread->conn_duration = 0;
 	}
 
+
+	start = pg_time_now();
+	thread->bench_start = start;
+	thread->throttle_trigger = start;
+
+	initStats(&aggs, start);
+	last = aggs;
+
 	/* loop till all clients have terminated */
 	while (remains > 0)
 	{
 		int			nsocks;		/* number of sockets to be waited for */
-		int64		min_usec;
-		int64		now_usec = 0;	/* set this only if needed */
+		pg_time_usec_t	min_usec;
+		pg_time_usec_t	now = 0;	/* set this only if needed */
 
 		/*
 		 * identify which client sockets should be checked for input, and
@@ -6310,27 +6280,21 @@ threadRun(void *arg)
 		clear_socket_set(sockets);
 		nsocks = 0;
 		min_usec = PG_INT64_MAX;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
 			if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
 			{
 				/* a nap from the script, or under throttling */
-				int64		this_usec;
+				pg_time_usec_t		this_usec;
 
 				/* get current time if needed */
-				if (now_usec == 0)
-				{
-					instr_time	now;
-
-					INSTR_TIME_SET_CURRENT(now);
-					now_usec = INSTR_TIME_GET_MICROSEC(now);
-				}
+				pg_time_now_lazy(&now);
 
 				/* min_usec should be the minimum delay across all clients */
 				this_usec = (st->state == CSTATE_SLEEP ?
-							 st->sleep_until : st->txn_scheduled) - now_usec;
+							 st->sleep_until : st->txn_scheduled) - now;
 				if (min_usec > this_usec)
 					min_usec = this_usec;
 			}
@@ -6365,19 +6329,12 @@ threadRun(void *arg)
 		/* also wake up to print the next progress report on time */
 		if (progress && min_usec > 0 && thread->tid == 0)
 		{
-			/* get current time if needed */
-			if (now_usec == 0)
-			{
-				instr_time	now;
+			pg_time_now_lazy(&now);
 
-				INSTR_TIME_SET_CURRENT(now);
-				now_usec = INSTR_TIME_GET_MICROSEC(now);
-			}
-
-			if (now_usec >= next_report)
+			if (now >= next_report)
 				min_usec = 0;
-			else if ((next_report - now_usec) < min_usec)
-				min_usec = next_report - now_usec;
+			else if ((next_report - now) < min_usec)
+				min_usec = next_report - now;
 		}
 
 		/*
@@ -6426,7 +6383,7 @@ threadRun(void *arg)
 
 		/* ok, advance the state machine of each connection */
 		nsocks = 0;
-		for (i = 0; i < nstate; i++)
+		for (int i = 0; i < nstate; i++)
 		{
 			CState	   *st = &state[i];
 
@@ -6464,11 +6421,8 @@ threadRun(void *arg)
 		/* progress report is made by thread 0 for all threads */
 		if (progress && thread->tid == 0)
 		{
-			instr_time	now_time;
-			int64		now;
+			pg_time_usec_t	now = pg_time_now();
 
-			INSTR_TIME_SET_CURRENT(now_time);
-			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
 				/*
@@ -6486,17 +6440,17 @@ threadRun(void *arg)
 				 */
 				do
 				{
-					next_report += (int64) progress * 1000000;
+					next_report += (int64) 1000000 * progress;
 				} while (now >= next_report);
 			}
 		}
 	}
 
 done:
-	INSTR_TIME_SET_CURRENT(start);
+	start = pg_time_now();
 	disconnect_all(state, nstate);
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
+	thread->conn_duration += pg_time_now() - start;
+
 	if (thread->logfile)
 	{
 		if (agg_interval > 0)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 39a4f0600e..faf806a441 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -253,4 +253,32 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+/*
+ * Simpler convenient interface
+ *
+ * The instr_time type is expensive when dealing with time arithmetic.
+ * Define a type to hold microseconds on top of this, suitable for
+ * benchmarking performance measures, eg in "pgbench".
+ *
+ * Type int64 is good enough for about 584500 years.
+ */
+typedef int64 pg_time_usec_t;
+
+static inline pg_time_usec_t
+pg_time_now(void)
+{
+	instr_time now;
+
+	INSTR_TIME_SET_CURRENT(now);
+	return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+}
+
+static inline void
+pg_time_now_lazy(pg_time_usec_t *now)
+{
+	if ((*now) == 0)
+		(*now) = pg_time_now();
+}
+
+#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
 #endif							/* INSTR_TIME_H */
-- 
2.30.0

v12-0004-pgbench-Synchronize-client-threads.patchtext/x-patch; charset=US-ASCII; name=v12-0004-pgbench-Synchronize-client-threads.patchDownload

From d558c3efc88da4be55fe41138720a0c812a8dff4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 18 Jan 2021 10:07:31 +1300
Subject: [PATCH v12 4/4] pgbench: Synchronize client threads.

Wait until all pgbench threads are connected before benchmarking begins.

Author: Andres Freund <andres@anarazel.de>
Author: Fabien COELHO <coelho@cri.ensmp.fr>
Reviewed-by: Marina Polyakova <m.polyakova@postgrespro.ru>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de
---
 src/bin/pgbench/pgbench.c | 44 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 6be83459ea..5bd806b9b0 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -126,9 +126,16 @@ typedef struct socket_set
 #define THREAD_JOIN(handle) \
 	(WaitForSingleObject(handle, INFINITE) != WAIT_OBJECT_0 ? \
 	GETERRNO() : CloseHandle(handle) ? 0 : GETERRNO())
+#define THREAD_BARRIER_T SYNCHRONIZATION_BARRIER
+#define THREAD_BARRIER_INIT(barrier, n) \
+	(InitializeSynchronizationBarrier((barrier), (n), 0) ? 0 : GETERRNO())
+#define THREAD_BARRIER_WAIT(barrier) \
+	EnterSynchronizationBarrier((barrier), \
+								SYNCHRONIZATION_BARRIER_FLAGS_BLOCK_ONLY)
+#define THREAD_BARRIER_DESTROY(barrier)
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
-#include <pthread.h>
+#include "port/pg_pthread.h"
 #define THREAD_T pthread_t
 #define THREAD_FUNC_RETURN_TYPE void *
 #define THREAD_FUNC_RETURN return NULL
@@ -136,11 +143,20 @@ typedef struct socket_set
 	pthread_create((handle), NULL, (function), (arg))
 #define THREAD_JOIN(handle) \
 	pthread_join((handle), NULL)
+#define THREAD_BARRIER_T pthread_barrier_t
+#define THREAD_BARRIER_INIT(barrier, n) \
+	pthread_barrier_init((barrier), NULL, (n))
+#define THREAD_BARRIER_WAIT(barrier) pthread_barrier_wait((barrier))
+#define THREAD_BARRIER_DESTROY(barrier) pthread_barrier_destroy((barrier))
 #else
 /* No threads implementation, use none (-j 1) */
 #define THREAD_T void *
 #define THREAD_FUNC_RETURN_TYPE void *
 #define THREAD_FUNC_RETURN return NULL
+#define THREAD_BARRIER_T void *
+#define THREAD_BARRIER_INIT(barrier, n)
+#define THREAD_BARRIER_WAIT(barrier)
+#define THREAD_BARRIER_DESTROY(barrier)
 #endif
 
 
@@ -326,6 +342,9 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+/* Synchronization barrier for start and connection */
+static THREAD_BARRIER_T barrier;
+
 /*
  * Connection state machine states.
  */
@@ -468,8 +487,8 @@ typedef struct
 
 	/* per thread collected stats in microseconds */
 	pg_time_usec_t	create_time;	/* thread creation time */
-	pg_time_usec_t	started_time;	/* thread is running */
-	pg_time_usec_t	bench_start; 	/* thread is benchmarking */
+	pg_time_usec_t	started_time;	/* thread is running after start barrier */
+	pg_time_usec_t	bench_start; 	/* thread is benchmarking after connection barrier */
 	pg_time_usec_t	conn_duration;	/* cumulated connection and deconnection delays */
 
 	StatsData	stats;
@@ -6119,6 +6138,8 @@ main(int argc, char **argv)
 	if (duration > 0)
 		setalarm(duration);
 
+	THREAD_BARRIER_INIT(&barrier, nthreads);
+
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
 	for (i = 1; i < nthreads; i++)
@@ -6188,6 +6209,8 @@ main(int argc, char **argv)
 	printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
 				 bench_start - start_time, latency_late);
 
+	THREAD_BARRIER_DESTROY(&barrier);
+
 	if (exit_code != 0)
 		pg_log_fatal("Run was aborted; the above results are incomplete.");
 
@@ -6234,6 +6257,8 @@ threadRun(void *arg)
 		state[i].state = CSTATE_CHOOSE_SCRIPT;
 
 	/* READY */
+	THREAD_BARRIER_WAIT(&barrier);
+
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
 	last_report = thread_start;
@@ -6246,7 +6271,18 @@ threadRun(void *arg)
 		for (int i = 0; i < nstate; i++)
 		{
 			if ((state[i].con = doConnect()) == NULL)
+			{
+				/*
+				 * On connection failure, we meet the barrier here in place of
+				 * GO before proceeding to the "done" path which will cleanup,
+				 * so as to avoid locking the process.
+				 *
+				 * It is unclear whether it is worth doing anything rather than
+				 * coldly exiting with an error message.
+				 */
+				THREAD_BARRIER_WAIT(&barrier);
 				goto done;
+			}
 		}
 
 		/* compute connection delay */
@@ -6258,6 +6294,8 @@ threadRun(void *arg)
 		thread->conn_duration = 0;
 	}
 
+	/* GO */
+	THREAD_BARRIER_WAIT(&barrier);
 
 	start = pg_time_now();
 	thread->bench_start = start;
-- 
2.30.0

thomas.munro@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#40)

Re: pgbench: option delaying queries till connections establishment?

On Mon, Mar 8, 2021 at 3:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:

David Rowley kindly tested this for me on Windows and told me how to
fix one of the macros that had incorrect error checking on that OS.
So here's a new version. I'm planning to commit 0001 and 0002 soon,
if there are no objections. 0003 needs some more review.

I made a few mostly cosmetic changes, pgindented and pushed all these patches.

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Thomas Munro (#41)

Re: pgbench: option delaying queries till connections establishment?

Thomas Munro <thomas.munro@gmail.com> writes:

On Mon, Mar 8, 2021 at 3:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:

David Rowley kindly tested this for me on Windows and told me how to
fix one of the macros that had incorrect error checking on that OS.
So here's a new version. I'm planning to commit 0001 and 0002 soon,
if there are no objections. 0003 needs some more review.

I made a few mostly cosmetic changes, pgindented and pushed all these patches.

So, gaur is not too happy with this:

ccache gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -I../../src/port -DFRONTEND -I../../src/include -D_USE_CTYPE_MACROS -D_XOPEN_SOURCE_EXTENDED -I/usr/local/libxml2-2.6.23/include/libxml2 -I/usr/local/ssl-1.0.1e/include -c -o strlcat.o strlcat.c
pthread_barrier_wait.c: In function 'pthread_barrier_init':
pthread_barrier_wait.c:24:2: error: incompatible type for argument 2 of 'pthread_cond_init'
/usr/include/pthread.h:378:5: note: expected 'pthread_condattr_t' but argument is of type 'void *'
pthread_barrier_wait.c:26:2: error: incompatible type for argument 2 of 'pthread_mutex_init'
/usr/include/pthread.h:354:5: note: expected 'pthread_mutexattr_t' but argument is of type 'void *'
make[2]: *** [pthread_barrier_wait.o] Error 1

Checking the man pages, it seems that this ancient HPUX version
is using some pre-POSIX API spec in which pthread_cond_init takes a
pthread_condattr_t rather than a pointer to pthread_condattr_t.
Similarly for pthread_mutex_init.

While it's likely that we could work around that, it's my
opinion that we shouldn't have to, because gaur is building with
--disable-thread-safety. If that switch has any meaning at all,
it should be that we don't try to use thread infrastructure.
Was any thought given to being able to opt out of this patchset
to support that configure option?

regards, tom lane

thomas.munro@gmail.com

almost 5 years ago

In reply to: Tom Lane (#42)

Re: pgbench: option delaying queries till connections establishment?

On Sat, Mar 13, 2021 at 3:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Checking the man pages, it seems that this ancient HPUX version
is using some pre-POSIX API spec in which pthread_cond_init takes a
pthread_condattr_t rather than a pointer to pthread_condattr_t.
Similarly for pthread_mutex_init.

Wow.

While it's likely that we could work around that, it's my
opinion that we shouldn't have to, because gaur is building with
--disable-thread-safety. If that switch has any meaning at all,
it should be that we don't try to use thread infrastructure.
Was any thought given to being able to opt out of this patchset
to support that configure option?

Oops. The pgbench code was tested under --disable-thread-safety, but
it didn't occur to me that the AC_REPLACE_FUNCS search for
pthread_barrier_wait should also be conditional on that; I will now go
and try to figure out how to do that.

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Thomas Munro (#43)

Re: pgbench: option delaying queries till connections establishment?

Thomas Munro <thomas.munro@gmail.com> writes:

On Sat, Mar 13, 2021 at 3:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Was any thought given to being able to opt out of this patchset
to support that configure option?

Oops. The pgbench code was tested under --disable-thread-safety, but
it didn't occur to me that the AC_REPLACE_FUNCS search for
pthread_barrier_wait should also be conditional on that; I will now go
and try to figure out how to do that.

OK, cool. I don't think it's hard, just do

if test "$enable_thread_safety" = yes; then
AC_REPLACE_FUNCS(pthread_barrier_wait)
fi

Probably this check should be likewise conditional:

AC_SEARCH_LIBS(pthread_barrier_wait, pthread)

regards, tom lane

thomas.munro@gmail.com

almost 5 years ago

In reply to: Tom Lane (#44)

1 attachment(s)

Re: pgbench: option delaying queries till connections establishment?

On Sat, Mar 13, 2021 at 4:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

OK, cool. I don't think it's hard, just do

if test "$enable_thread_safety" = yes; then
AC_REPLACE_FUNCS(pthread_barrier_wait)
fi

Probably this check should be likewise conditional:

AC_SEARCH_LIBS(pthread_barrier_wait, pthread)

Thanks. This seems to work for me on a Mac. I confirmed with nm that
we don't define or reference any pthread_XXX symbols with
--disable-thread-safety, and we do otherwise, and the pgbench tests
pass either way.

Attachments:

0001-Fix-new-pthread-code-to-respect-disable-thread-safet.patchapplication/octet-stream; name=0001-Fix-new-pthread-code-to-respect-disable-thread-safet.patchDownload

From ca041d9b73e1a9192cfd99eef5bee3408d118c25 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 13 Mar 2021 16:23:07 +1300
Subject: [PATCH] Fix new pthread code to respect --disable-thread-safety.

Don't try to compile src/port/pthread_barrier_wait.c if we opted out of
threads at configure time.  Revealed by build farm member gaur, which
can't compile this code it because of problems with its pthread
implementation.  It shouldn't be trying to, because it's using
--disable-thread-safety.

Defect in commit 44bf3d50.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2568537.1615603606%40sss.pgh.pa.us
---
 configure    | 145 +++++++++++++++++++++++++++------------------------
 configure.ac |  10 +++-
 2 files changed, 84 insertions(+), 71 deletions(-)

diff --git a/configure b/configure
index fad817bb38..3fd4cecbeb 100755
--- a/configure
+++ b/configure
@@ -11635,62 +11635,6 @@ if test "$ac_res" != no; then :
 
 fi
 
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing pthread_barrier_wait" >&5
-$as_echo_n "checking for library containing pthread_barrier_wait... " >&6; }
-if ${ac_cv_search_pthread_barrier_wait+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  ac_func_search_save_LIBS=$LIBS
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-/* Override any GCC internal prototype to avoid an error.
-   Use char because int might match the return type of a GCC
-   builtin and then its argument prototype would still apply.  */
-#ifdef __cplusplus
-extern "C"
-#endif
-char pthread_barrier_wait ();
-int
-main ()
-{
-return pthread_barrier_wait ();
-  ;
-  return 0;
-}
-_ACEOF
-for ac_lib in '' pthread; do
-  if test -z "$ac_lib"; then
-    ac_res="none required"
-  else
-    ac_res=-l$ac_lib
-    LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
-  fi
-  if ac_fn_c_try_link "$LINENO"; then :
-  ac_cv_search_pthread_barrier_wait=$ac_res
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext
-  if ${ac_cv_search_pthread_barrier_wait+:} false; then :
-  break
-fi
-done
-if ${ac_cv_search_pthread_barrier_wait+:} false; then :
-
-else
-  ac_cv_search_pthread_barrier_wait=no
-fi
-rm conftest.$ac_ext
-LIBS=$ac_func_search_save_LIBS
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_pthread_barrier_wait" >&5
-$as_echo "$ac_cv_search_pthread_barrier_wait" >&6; }
-ac_res=$ac_cv_search_pthread_barrier_wait
-if test "$ac_res" != no; then :
-  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
-
-fi
-
 # Solaris:
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fdatasync" >&5
 $as_echo_n "checking for library containing fdatasync... " >&6; }
@@ -11978,6 +11922,65 @@ if test "$ac_res" != no; then :
 fi
 
 
+if test "$enable_thread_safety" = yes; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing pthread_barrier_wait" >&5
+$as_echo_n "checking for library containing pthread_barrier_wait... " >&6; }
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_func_search_save_LIBS=$LIBS
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pthread_barrier_wait ();
+int
+main ()
+{
+return pthread_barrier_wait ();
+  ;
+  return 0;
+}
+_ACEOF
+for ac_lib in '' pthread; do
+  if test -z "$ac_lib"; then
+    ac_res="none required"
+  else
+    ac_res=-l$ac_lib
+    LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
+  fi
+  if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_search_pthread_barrier_wait=$ac_res
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext
+  if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+  break
+fi
+done
+if ${ac_cv_search_pthread_barrier_wait+:} false; then :
+
+else
+  ac_cv_search_pthread_barrier_wait=no
+fi
+rm conftest.$ac_ext
+LIBS=$ac_func_search_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_pthread_barrier_wait" >&5
+$as_echo "$ac_cv_search_pthread_barrier_wait" >&6; }
+ac_res=$ac_cv_search_pthread_barrier_wait
+if test "$ac_res" != no; then :
+  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
+
+fi
+
+fi
+
 if test "$with_readline" = yes; then
 
 
@@ -15939,19 +15942,6 @@ esac
 
 fi
 
-ac_fn_c_check_func "$LINENO" "pthread_barrier_wait" "ac_cv_func_pthread_barrier_wait"
-if test "x$ac_cv_func_pthread_barrier_wait" = xyes; then :
-  $as_echo "#define HAVE_PTHREAD_BARRIER_WAIT 1" >>confdefs.h
-
-else
-  case " $LIBOBJS " in
-  *" pthread_barrier_wait.$ac_objext "* ) ;;
-  *) LIBOBJS="$LIBOBJS pthread_barrier_wait.$ac_objext"
- ;;
-esac
-
-fi
-
 ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
 if test "x$ac_cv_func_pwrite" = xyes; then :
   $as_echo "#define HAVE_PWRITE 1" >>confdefs.h
@@ -16058,6 +16048,23 @@ fi
 
 
 
+if test "$enable_thread_safety" = yes; then
+  ac_fn_c_check_func "$LINENO" "pthread_barrier_wait" "ac_cv_func_pthread_barrier_wait"
+if test "x$ac_cv_func_pthread_barrier_wait" = xyes; then :
+  $as_echo "#define HAVE_PTHREAD_BARRIER_WAIT 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" pthread_barrier_wait.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS pthread_barrier_wait.$ac_objext"
+ ;;
+esac
+
+fi
+
+
+fi
+
 if test "$PORTNAME" = "win32" -o "$PORTNAME" = "cygwin"; then
 	# Cygwin and (apparently, based on test results) Mingw both
 	# have a broken strtof(), so substitute the same replacement
diff --git a/configure.ac b/configure.ac
index 0ed53571dd..2f1585adc0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1143,7 +1143,6 @@ AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
 AC_SEARCH_LIBS(shm_open, rt)
 AC_SEARCH_LIBS(shm_unlink, rt)
 AC_SEARCH_LIBS(clock_gettime, [rt posix4])
-AC_SEARCH_LIBS(pthread_barrier_wait, pthread)
 # Solaris:
 AC_SEARCH_LIBS(fdatasync, [rt posix4])
 # Required for thread_test.c on Solaris
@@ -1156,6 +1155,10 @@ AC_SEARCH_LIBS(shmget, cygipc)
 # *BSD:
 AC_SEARCH_LIBS(backtrace_symbols, execinfo)
 
+if test "$enable_thread_safety" = yes; then
+  AC_SEARCH_LIBS(pthread_barrier_wait, pthread)
+fi
+
 if test "$with_readline" = yes; then
   PGAC_CHECK_READLINE
   if test x"$pgac_cv_check_readline" = x"no"; then
@@ -1744,7 +1747,6 @@ AC_REPLACE_FUNCS(m4_normalize([
 	mkdtemp
 	pread
 	preadv
-	pthread_barrier_wait
 	pwrite
 	pwritev
 	random
@@ -1755,6 +1757,10 @@ AC_REPLACE_FUNCS(m4_normalize([
 	strtof
 ]))
 
+if test "$enable_thread_safety" = yes; then
+  AC_REPLACE_FUNCS(pthread_barrier_wait)
+fi
+
 if test "$PORTNAME" = "win32" -o "$PORTNAME" = "cygwin"; then
 	# Cygwin and (apparently, based on test results) Mingw both
 	# have a broken strtof(), so substitute the same replacement
-- 
2.24.3 (Apple Git-128)

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Thomas Munro (#45)

Re: pgbench: option delaying queries till connections establishment?

Thomas Munro <thomas.munro@gmail.com> writes:

Thanks. This seems to work for me on a Mac. I confirmed with nm that
we don't define or reference any pthread_XXX symbols with
--disable-thread-safety, and we do otherwise, and the pgbench tests
pass either way.

Looks reasonable by eyeball. If you'd push it, I can launch
a gaur run right away.

regards, tom lane

thomas.munro@gmail.com

almost 5 years ago

In reply to: Tom Lane (#46)

Re: pgbench: option delaying queries till connections establishment?

On Sat, Mar 13, 2021 at 4:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Thanks. This seems to work for me on a Mac. I confirmed with nm that
we don't define or reference any pthread_XXX symbols with
--disable-thread-safety, and we do otherwise, and the pgbench tests
pass either way.

Looks reasonable by eyeball. If you'd push it, I can launch
a gaur run right away.

Done.

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Thomas Munro (#47)

Re: pgbench: option delaying queries till connections establishment?

Thomas Munro <thomas.munro@gmail.com> writes:

On Sat, Mar 13, 2021 at 4:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Looks reasonable by eyeball. If you'd push it, I can launch
a gaur run right away.

Done.

gaur's gotten through "make" and "make check" cleanly. Unfortunately
I expect it will fail at the pg_amcheck test before it reaches pgbench.
But for the moment it's reasonable to assume we're good here. Thanks!

regards, tom lane

coelho@cri.ensmp.fr

almost 5 years ago

In reply to: Thomas Munro (#41)

Re: pgbench: option delaying queries till connections establishment?

Hello Thomas,

David Rowley kindly tested this for me on Windows and told me how to
fix one of the macros that had incorrect error checking on that OS.
So here's a new version. I'm planning to commit 0001 and 0002 soon,
if there are no objections. 0003 needs some more review.

I made a few mostly cosmetic changes, pgindented and pushed all these patches.

Thanks a lot for pushing all that, and fixing issues raised by buildfarm
animals pretty unexpected and strange failures…

I must say that I'm not a big fan of the macro-based all-in-capitals API
for threads because it exposes some platform specific uglyness (eg
THREAD_FUNC_CC) and it does not look much like clean C code when used. I
liked the previous partial pthread implementation better, even if it was
not the real thing, obviously.

ISTM that with the current approach threads are always used on Windows,
i.e. pgbench does not comply to "ENABLE_THREAD_SAFETY" configuration on
that platform. Not sure whether this is an issue that need to be
addressed, though.

--
Fabien.

thomas.munro@gmail.com

almost 5 years ago

In reply to: Fabien COELHO (#49)

Re: pgbench: option delaying queries till connections establishment?

On Sat, Mar 13, 2021 at 9:08 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:

I must say that I'm not a big fan of the macro-based all-in-capitals API
for threads because it exposes some platform specific uglyness (eg
THREAD_FUNC_CC) and it does not look much like clean C code when used. I
liked the previous partial pthread implementation better, even if it was
not the real thing, obviously.

But we were using macros already, to support --disable-thread-safety
builds. I just changed them to upper case and dropped the 'p',
because I didn't like to pretend to do POSIX threads, but do it so
badly. Perhaps we should drop --disable-thread-safety soon, and
perhaps it is nearly time to create a good thread abstraction in clean
C code, for use in the server and here? Then we won't need any ugly
macros.

ISTM that with the current approach threads are always used on Windows,
i.e. pgbench does not comply to "ENABLE_THREAD_SAFETY" configuration on
that platform. Not sure whether this is an issue that need to be
addressed, though.

The idea of that option, as I understand it, is that in ancient times
there were Unix systems with no threads (that's of course the reason
PostgreSQL is the way it is). I don't think that was ever the case
for Windows NT, and we have no build option for that on Windows
AFAICS.

coelho@cri.ensmp.fr

almost 5 years ago

In reply to: Thomas Munro (#50)

Re: pgbench: option delaying queries till connections establishment?

Hello Thomas,

I must say that I'm not a big fan of the macro-based all-in-capitals API
for threads because it exposes some platform specific uglyness (eg
THREAD_FUNC_CC) and it does not look much like clean C code when used. I
liked the previous partial pthread implementation better, even if it was
not the real thing, obviously.

But we were using macros already, to support --disable-thread-safety
builds.

Yep, but the look and feel was still C code.

I just changed them to upper case and dropped the 'p', because I didn't
like to pretend to do POSIX threads, but do it so badly.

Hmmm. From the source code point of view it was just like actually using
posix threads, even if the underlying machinery was not quite that on some
systems. I value looking at "beautiful" and "standard" code if possible,
even if there is some cheating involved, compared to exposing macros. I
made some effort to remove the pretty ugly and inefficient INSTR_TIME
macros from pgbench, replaced with straightforward arithmetic and inlined
functions. Now some other macros just crept back in:-) Anyway, this is
just "les goûts et les couleurs" (just a matter of taste), as we say here.

Perhaps we should drop --disable-thread-safety soon, and perhaps it is
nearly time to create a good thread abstraction in clean C code, for use
in the server and here? Then we won't need any ugly macros.

+1.

ISTM that with the current approach threads are always used on Windows,
i.e. pgbench does not comply to "ENABLE_THREAD_SAFETY" configuration on
that platform. Not sure whether this is an issue that need to be
addressed, though.

The idea of that option, as I understand it, is that in ancient times
there were Unix systems with no threads (that's of course the reason
PostgreSQL is the way it is). I don't think that was ever the case
for Windows NT, and we have no build option for that on Windows
AFAICS.

Ok, fine with me.

--
Fabien.

davecramer@postgres.rocks

over 2 years ago

In reply to: Andres Freund (#1)

Re: pgbench: option delaying queries till connections establishment?

Dave Cramer
www.postgres.rocks

On Tue, 16 May 2023 at 07:27, Andres Freund <andres@anarazel.de> wrote:

Hi,

I am trying to run a few benchmarks measuring the effects of patch to
make GetSnapshotData() faster in the face of larger numbers of
established connections.

Before the patch connection establishment often is very slow due to
contention. The first few connections are fast, but after that it takes
increasingly long. The first few connections constantly hold
ProcArrayLock in shared mode, which then makes it hard for new
connections to acquire it exclusively (I'm addressing that to a
significant degree in the patch FWIW).

But for a fair comparison of the runtime effects I'd like to only
compare the throughput for when connections are actually usable,
otherwise I end up benchmarking few vs many connections, which is not
useful. And because I'd like to run the numbers for a lot of different
numbers of connections etc, I can't just make each run several hour
longs to make the initial minutes not matter much.

Therefore I'd like to make pgbench wait till it has established all
connections, before they run queries.

Does anybody else see this as being useful?

If so, should this be done unconditionally? A new option? Included in an
existing one somehow?

Greetings,

Andres Freund

I've recently run into something I am having difficulty understanding.

I am running pgbench with the following
pgbench -h localhost -c 100 -j 100 -t 2 -S -s 1000 pgbench -U pgbench
--protocol=simple

Without pgbouncer I get around 5k TPS
with pgbouncer I get around 15k TPS

Looking at the code connection initiation time should not be part of the
calculation so I' puzzled why pgbouncer is making such a dramatic
difference ?

Dave

davecramer@postgres.rocks

over 2 years ago

In reply to: Dave Cramer (#52)

Re: pgbench: option delaying queries till connections establishment?

I've recently run into something I am having difficulty understanding.

I am running pgbench with the following
pgbench -h localhost -c 100 -j 100 -t 2 -S -s 1000 pgbench -U pgbench
--protocol=simple

Without pgbouncer I get around 5k TPS
with pgbouncer I get around 15k TPS

Looking at the code connection initiation time should not be part of the
calculation so I' puzzled why pgbouncer is making such a dramatic
difference ?

Dave

Turns out that for this specific test, pg is faster with a pooler.

Dave Cramer, [May 16, 2023 at 9:49:24 AM]:

turns out having a connection pool helps. First run is without a pool,
second with

pgbench=# select mean_exec_time, stddev_exec_time, calls, total_exec_time,
min_exec_time, max_exec_time from pg_stat_statements where
queryid=-531095336438083412;

mean_exec_time | stddev_exec_time | calls | total_exec_time |
min_exec_time
| max_exec_time

--------------------+--------------------+-------+-------------------+----------------------+---------------

0.4672699999999998 | 2.2758508661446535 | 200 | 93.45399999999997 |
0.046616000000000005 | 17.434766

(1 row)

pgbench=# select pg_stat_statements_reset();

pg_stat_statements_reset

--------------------------

(1 row)

pgbench=# select mean_exec_time, stddev_exec_time, calls, total_exec_time,
min_exec_time, max_exec_time from pg_stat_statements where
queryid=-531095336438083412;

mean_exec_time | stddev_exec_time | calls | total_exec_time |
min_exec_time | max_exec_time

---------------------+----------------------+-------+--------------------+---------------+---------------

0.06640186499999999 | 0.021800404695481574 | 200 | 13.280373000000004 |
0.034006 | 0.226696
(1 row)

coelho@cri.ensmp.fr

over 2 years ago

In reply to: Dave Cramer (#53)

Re: pgbench: option delaying queries till connections establishment?

Hello Dave,

I am running pgbench with the following
pgbench -h localhost -c 100 -j 100 -t 2 -S -s 1000 pgbench -U pgbench
--protocol=simple

Without pgbouncer I get around 5k TPS
with pgbouncer I get around 15k TPS

Looking at the code connection initiation time should not be part of the
calculation so I' puzzled why pgbouncer is making such a dramatic
difference ?

Turns out that for this specific test, pg is faster with a pooler.

This does not tell "why".

Does the pooler prepares statements, whereas "simple" does not?

--
Fabien.