pgbench logging broken by time logic changes
Back on March 10, Thomas Munro committed and wrestled multiple reworks of
the pgbench code from Fabien and the crew. I'm looking forward to testing
the feature to synchronize startup now that I have a packaged beta.
Variations on that problem have bitten me so many times I added code last year
to my pgbench processing pipeline to just throw out the first and last 10%
of every data set.
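That trimming step is simple enough to sketch. This is a hypothetical illustration of the idea, not the actual pgbench-tools code (the function name and default are mine):

```python
def trim_run(samples, fraction=0.10):
    """Drop the first and last `fraction` of a run's samples, a crude guard
    against startup and shutdown artifacts skewing the statistics."""
    k = int(len(samples) * fraction)
    return samples[k:len(samples) - k]

# A 100-sample run keeps samples 10 through 89.
print(len(trim_run(list(range(100)))))  # 80
```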
Before I could get to startup timing, I noticed the pgbench logging output
was broken by commit 547f04e7 "Improve time logic":
/messages/by-id/E1lJqpF-00064e-C6@gemulon.postgresql.org
A lot of things are timed in pgbench now so I appreciate the idea. Y'all
started that whole big thread about sync on my birthday though and I didn't
follow the details of what that was reviewed against. For the logging use
case I suspect it's just broken everywhere. The two platforms I tested
were PGDG Ubuntu beta1 apt install and Mac git build. Example:
$ createdb pgbench
$ pgbench -i -s 1 pgbench
$ pgbench -S -T 1 -l pgbench
$ head pgbench_log.*
0 1 1730 0 1537380 70911
0 2 541 0 1537380 71474
The epoch time is the 5th column in the output, and this week it should
look like this:
0 1 1411 0 1623767029 732926
0 2 711 0 1623767029 733660
If you're not an epoch guru who recognizes what's wrong already, you might
grab https://github.com/gregs1104/pgbench-tools/ and party like it's 1970
to see it:
$ ~/repos/pgbench-tools/log-to-csv 1 local < pgbench_log* | head
1970-01-18 14:03:00.070911,0,1.73,1,local
1970-01-18 14:03:00.071474,0,0.541,1,local
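To make the symptom concrete, here is a small sketch (mine, not part of pgbench-tools) that decodes columns 5 and 6 of the per-transaction log. The broken value is a monotonic-clock reading, so it lands in January 1970; the expected value decodes to the actual wall-clock time of the run (UTC here, whereas log-to-csv above prints local time):

```python
from datetime import datetime, timedelta, timezone

def log_time(epoch_secs, usecs):
    """Decode the epoch-seconds and microseconds columns of a pgbench log line."""
    return (datetime.fromtimestamp(epoch_secs, tz=timezone.utc)
            + timedelta(microseconds=usecs))

print(log_time(1537380, 70911))      # 1970-01-18 19:03:00.070911+00:00
print(log_time(1623767029, 732926))  # 2021-06-15 14:23:49.732926+00:00
```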
I have a lot of community-oriented work backed up behind this right now, so
I'm gonna be really honest. This time rework commit in its current form
makes me uncomfortable at this point in the release schedule. The commit
has already fought through two rounds of platform-specific bug fixes. But
since the buildfarm doesn't test the logging feature, that whole process is
suspect.
My take on the PostgreSQL way to proceed: this bug exposes that pgbench
logging is a feature we finally need to design testing for. We need a new
buildfarm test and then a march through a full release phase to see how it
goes. Only then should we start messing with the time logic. Even if we
fixed the source today on both my test platforms, I'd still be nervous that
beta 2 could ship and more performance testing could fall over from this
modification. And that's cutting things a little close.
The fastest way to get me back to comfortable would be to unwind 547f04e7
and its associated fixes and take it back to review. I understand the
intent and value; I appreciate the work so far. The big industry
architecture shift from Intel to ARM has me worried about time overhead
again, the old code is wonky, and in the PG15 release cycle I already have
resources planned around this area.
# PG15 Plans
I didn't intend to roll back in after time away and go right to a revert
review. But I really don't want to start my public PG14 story by
documenting the reality that I had to use PG13's pgbench to generate my
examples. I can't fight much with this logging problem while also
doing my planned public performance testing of PG14. I already had to push
back a solid bit of Beta 1 PR from this week, some "community PG is great!"
promotional blogging.
Let me offer what I can commit to from Crunchy corporate. I'm about to
submit multiple pgbench feature changes to the open CF starting July, with
David Christiansen. We and the rest of Crunchy will happily help re-review
this time change idea and its logging issues, improve the testing, rejoin
the study of platform time call overhead, and bash the whole mess into
shape for PG15.
I personally am looking forward to it.
The commit made a functional change to the way connection time is
displayed; that I can take or leave as committed. I'm not sure it can be
decoupled from the rest of the changes. It did cause a small breaking
change in my pgbench output parsing, but that's just a trivial regex
adjustment, one that would fit in fine with my upcoming round of
submissions.
--
Greg Smith greg.smith@crunchydata.com
Director of Open Source Strategy, Crunchy Data
On Wed, 16 Jun 2021 10:49:36 -0400
Gregory Smith <gregsmithpgsql@gmail.com> wrote:
A lot of things are timed in pgbench now so I appreciate the idea. Y'all
started that whole big thread about sync on my birthday though and I didn't
follow the details of what that was reviewed against. For the logging use
case I suspect it's just broken everywhere. The two platforms I tested
were PGDG Ubuntu beta1 apt install and Mac git build. Example:
$ createdb pgbench
$ pgbench -i -s 1 pgbench
$ pgbench -S -T 1 -l pgbench
$ head pgbench_log.*
0 1 1730 0 1537380 70911
0 2 541 0 1537380 71474
The epoch time is the 5th column in the output, and this week it should
look like this:
0 1 1411 0 1623767029 732926
0 2 711 0 1623767029 733660
If you're not an epoch guru who recognizes what's wrong already, you might
grab https://github.com/gregs1104/pgbench-tools/ and party like it's 1970
to see it:
$ ~/repos/pgbench-tools/log-to-csv 1 local < pgbench_log* | head
1970-01-18 14:03:00.070911,0,1.73,1,local
1970-01-18 14:03:00.071474,0,0.541,1,local
After the commit, pgbench gets the current timestamp by calling
pg_time_now(). That function uses INSTR_TIME_SET_CURRENT, but this macro
can call clock_gettime(CLOCK_MONOTONIC[_RAW], ...) instead of gettimeofday()
or clock_gettime(CLOCK_REALTIME, ...). When CLOCK_MONOTONIC[_RAW] is used,
clock_gettime() doesn't return epoch time. So while INSTR_TIME_SET_CURRENT
is fine for calculating a duration, it should not have been used to get
the current timestamp.
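The distinction between the two clocks is easy to demonstrate on any Unix system; a quick sketch (Python used here only for brevity, assuming a platform that exposes both clocks):

```python
import time

# CLOCK_REALTIME counts seconds since the Unix epoch; CLOCK_MONOTONIC counts
# from an arbitrary origin (often boot time), which is why timestamps taken
# from it decode to January 1970 when treated as epoch time.
realtime = time.clock_gettime(time.CLOCK_REALTIME)
monotonic = time.clock_gettime(time.CLOCK_MONOTONIC)

print(realtime)   # a plausible epoch timestamp, well above 1.6e9
print(monotonic)  # far smaller: seconds since an arbitrary origin
```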
I think we can fix this issue by using gettimeofday() for logging, as it was
before this change. I attached a patch.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachment: pgbench-log-timestamp.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..2be0c62cab 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3778,7 +3778,6 @@ doLog(TState *thread, CState *st,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
- pg_time_usec_t now = pg_time_now();
Assert(use_log);
@@ -3793,6 +3792,8 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t now = pg_time_now();
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
@@ -3830,16 +3831,21 @@ doLog(TState *thread, CState *st,
}
else
{
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
INT64_FORMAT,
- st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
+ st->id, st->cnt, st->use_file,
+ (long) tv.tv_sec, (long) tv.tv_usec);
else
fprintf(logfile, "%d " INT64_FORMAT " %.0f %d " INT64_FORMAT " "
INT64_FORMAT,
st->id, st->cnt, latency, st->use_file,
- now / 1000000, now % 1000000);
+ (long) tv.tv_sec, (long) tv.tv_usec);
if (throttle_delay)
fprintf(logfile, " %.0f", lag);
fputc('\n', logfile);
@@ -5455,7 +5461,16 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ /*
+ * On some platforms the current system timestamp is available in
+ * now_time, but rather than get entangled with that, we just eat the
+ * cost of an extra syscall in all cases.
+ */
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+ snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
+ (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
}
else
{
Hello Greg,
I have a lot of community oriented work backed up behind this right now, so
I'm gonna be really honest. This time rework commit in its current form
makes me uncomfortable at this point in the release schedule. The commit
has already fought through two rounds of platform specific bug fixes. But
since the buildfarm doesn't test the logging feature, that whole process is
suspect.
Logging is going to be fixed.
My take on the PostgreSQL way to proceed: this bug exposes that pgbench
logging is a feature we finally need to design testing for.
Sure.
The key feedback for me is the usual one: what is not tested does
not work. Wow:-)
I'm unhappy because I already added TAP tests for time-sensitive features
(-T and others, maybe logging aggregates, I cannot remember), which have
been removed because they could fail under some circumstances (e.g. very,
very slow hosts), or required some special handling (a few lines of code)
in pgbench, and the net result of this is that there is not a single
test in place for some features :-(
There is no problem with proposing tests; the problem is getting them
accepted, and, once they are accepted, having them fixed rather than
removed at the first small issue, or their limitations accepted, because
testing time-sensitive features is not as simple as testing functional
features.
Note that currently there is not a single test of psql with autocommit
off or with "on error rollback". Last time I submitted TAP tests to raise psql
test coverage from 50% to 90%, they were rejected. I'll admit that I'm tired
of arguing that more testing is required, and I'm very happy if someone else
is ready to try again. Good luck! :-)
We need a new buildfarm test and then a march through a full release
phase to see how it goes.
Only then should we start messing with the time logic. Even if we fixed
the source today on both my test platforms, I'd still be nervous that
beta 2 could ship and more performance testing could fall over from this
modification. And that's cutting things a little close.
Well, the point of beta is to discover bugs not caught by reviews and dev
tests, fix them, and possibly add tests which would have caught them.
If you revert a whole feature on the first issue in a corner case and put it
back in the queue, then I do not see why the review and dev tests will be
much better on the next round, so it does not really help move things
forward.
IMHO, the pragmatic approach is to look at fixing first, and maybe revert
if the problems are deep. I'm not sure this is obviously the case here.
--
Fabien.
pg_time_now(). This uses INSTR_TIME_SET_CURRENT in it, but this macro
can call clock_gettime(CLOCK_MONOTONIC[_RAW], ) instead of gettimeofday
or clock_gettime(CLOCK_REALTIME, ). When CLOCK_MONOTONIC[_RAW] is used,
clock_gettime doesn't return epoch time. Therefore, we can use
INSTR_TIME_SET_CURRENT aiming to calculate a duration, but we should
not have used this to get the current timestamp.
I think we can fix this issue by using gettimeofday for logging as before
this was changed. I attached the patch.
I cannot say that I'm thrilled by having multiple tv stuff back in several
places. I can be okay with one, though. What about the attached? Does it
make sense?
--
Fabien.
Attachment: pgbench-epoch-1.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..5fe2a9286f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -672,11 +672,11 @@ static const PsqlScanCallbacks pgbench_callbacks = {
static inline pg_time_usec_t
pg_time_now(void)
{
- instr_time now;
+ struct timeval tv;
- INSTR_TIME_SET_CURRENT(now);
+ gettimeofday(&tv, NULL);
- return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
+ return tv.tv_sec * 1000000 + tv.tv_usec;
}
static inline void
On 6/16/21 2:59 PM, Fabien COELHO wrote:
Hello Greg,
I have a lot of community oriented work backed up behind this right
now, so
I'm gonna be really honest. This time rework commit in its current form
makes me uncomfortable at this point in the release schedule. The
commit
has already fought through two rounds of platform specific bug
fixes. But
since the buildfarm doesn't test the logging feature, that whole
process is
suspect.
Logging is going to be fixed.
My take on the PostgreSQL way to proceed: this bug exposes that pgbench
logging is a feature we finally need to design testing for.
Sure.
The key feedback for me is the usual one: what is not tested does not
work. Wow:-)
Agreed.
I'm unhappy because I already added tap tests for time-sensitive
features (-T and others, maybe logging aggregates, cannot remember),
which have been removed because they could fail under some
circumstances (e.g. very, very slow hosts), or required some
special handling (a few lines of code) in pgbench, and the net result
of this is there is not a single test in place for some features:-(
I'm not familiar with exactly what happened in this case, but tests need
to be resilient over a wide range of performance characteristics. One
way around this issue might be to have a way of detecting that it's on a
slow platform and if so either skipping tests (Test::More provides
plenty of support for this) or expecting different results.
There is no problem with proposing tests, the problem is that they are
accepted, or if they are accepted then that they are not removed at
the first small issue but rather fixed, or their limitations accepted,
because testing time-sensitive features is not as simple as testing
functional features.
Note that currently there is not a single test of psql with autocommit
off or with "on error rollback". Last time I submitted TAP tests to
raise psql test coverage from 50% to 90%, they were rejected. I'll admit
that I'm tired of arguing that more testing is required, and I'm very
happy if someone else is ready to try again. Good luck! :-)
:-(
We need a new buildfarm test and then a march through a full release
phase to see how it goes.
Only then should we start messing with the time logic. Even if we
fixed the source today on both my test platforms, I'd still be
nervous that beta 2 could ship and more performance testing could
fall over from this modification. And that's cutting things a little
close.
Well, the point of beta is to discover bugs not caught by reviews and dev
tests, fix them, and possibly add tests which would have caught them.
If you revert all features on the first issue in a corner case and put
it back to the queue, then I do not see why the review and dev tests
will be much better on the next round, so it does not really help
moving things forward.
IMHO, the pragmatic approach is to look at fixing first, and maybe
revert if the problems are deep. I'm not sure this is obviously the
case here.
It does look like the submitted fix basically reverts the changes w.r.t.
this timestamp logging.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi Greg,
On Thu, Jun 17, 2021 at 2:49 AM Gregory Smith <gregsmithpgsql@gmail.com> wrote:
Back on March 10 Thomas Munro committed and wrestled multiple reworks of the pgbench code from Fabien and the crew. The feature to synchronize startup I'm looking forward to testing now that I have a packaged beta. Variations on that problem have bit me so many times I added code last year to my pgbench processing pipeline to just throw out the first and last 10% of every data set.
Yeah, commit aeb57af8 is a nice improvement and was the main thing I
wanted to get into the tree for 14 in this area, because it was
measuring the wrong thing.
Before I could get to startup timing I noticed the pgbench logging output was broken via commit 547f04e7 "Improve time logic": /messages/by-id/E1lJqpF-00064e-C6@gemulon.postgresql.org
It does suck that we broke the logging and that it took 3 months for
anyone to notice and report it to the list. Seems like it should be
straightforward to fix, though, with fixes already proposed (though I
haven't studied them yet, will do).
I have a lot of community oriented work backed up behind this right now, so I'm gonna be really honest. This time rework commit in its current form makes me uncomfortable at this point in the release schedule. The commit has already fought through two rounds of platform specific bug fixes. But since the buildfarm doesn't test the logging feature, that whole process is suspect.
It's true that this work produced a few rounds of small portability
follow-ups: c427de42 (work around strange hacks elsewhere in the tree
for AIX), 68b34b23 (missing calling convention specifier on Windows),
and de91c3b9 (adjust pthread missing-function code for threadless
builds). These were problems that didn't show up on developer or CI
systems (including threadless and Windows), and IMHO are typical sorts
of problems you expect to have to work through when stuff hits the
build farm, especially when using new system interfaces. So I don't
think any of that, on its own, supports reverting anything here.
My take on the PostgreSQL way to proceed: this bug exposes that pgbench logging is a feature we finally need to design testing for. We need a new buildfarm test and then a march through a full release phase to see how it goes. Only then should we start messing with the time logic. Even if we fixed the source today on both my test platforms, I'd still be nervous that beta 2 could ship and more performance testing could fall over from this modification. And that's cutting things a little close.
The fastest way to get me back to comfortable would be to unwind 547f04e7 and its associated fixes and take it back to review. I understand the intent and value; I appreciate the work so far. The big industry architecture shift from Intel to ARM has me worried about time overhead again, the old code is wonky, and in the PG15 release cycle I already have resources planned around this area.
Let me study the proposed fixes on this and the other thread about
pgbench logging for a bit.
Glad to hear that you're working on this area. I guess you might be
researching stuff along the same sorts of lines as in the thread
"Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?" (though
that's about the executor). As I already expressed in that thread, if
the backend's instrumentation code is improved as proposed there,
we'll probably want to rip some of these pgbench changes out anyway
and go back to common instrumentation code.
For that reason, I'm not super attached to that new pg_time_usec_t
stuff at all, and wouldn't be sad if we reverted that piece. I am
moderately attached to the sync changes, though. pgbench 13 is
objectively producing incorrect results in that respect.
On Thu, Jun 17, 2021 at 12:36:10PM +1200, Thomas Munro wrote:
For that reason, I'm not super attached to that new pg_time_usec_t
stuff at all, and wouldn't be sad if we reverted that piece. I am
moderately attached to the sync changes, though. pgbench 13 is
objectively producing incorrect results in that respect.
There is another item in this area where pgbench uses incorrect maths
when aggregating the stats of transactions mid-run and at the end of a
thread, an issue caused by 547f04e, as this code path forgot to handle
the s <-> us conversion:
/messages/by-id/CAF7igB1r6wRfSCEAB-iZBKxkowWY6+dFF2jObSdd9+iVK+vHJg@mail.gmail.com
Wouldn't it be better to put all those fixes into the same bag? If
you drop the business with pg_time_usec_t, it looks like we don't
really need to do anything there.
--
Michael
On Wed, 16 Jun 2021 21:11:41 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:
pg_time_now(). This uses INSTR_TIME_SET_CURRENT in it, but this macro
can call clock_gettime(CLOCK_MONOTONIC[_RAW], ) instead of gettimeofday
or clock_gettime(CLOCK_REALTIME, ). When CLOCK_MONOTONIC[_RAW] is used,
clock_gettime doesn't return epoch time. Therefore, we can use
INSTR_TIME_SET_CURRENT aiming to calculate a duration, but we should
not have used this to get the current timestamp.
I think we can fix this issue by using gettimeofday for logging as before
this was changed. I attached the patch.
I cannot say that I'm thrilled by having multiple tv stuff back in several
places. I can be okay with one, though. What about the attached? Does it
make sense?
At first, I also thought of fixing pg_time_now() to use gettimeofday() instead
of INSTR_TIME_SET_CURRENT, but I noticed that INSTR_TIME_SET_CURRENT is
the proper way to measure a time interval. I mean, this macro uses
clock_gettime(CLOCK_MONOTONIC, ...) if available, which gives reliable interval
timing even in the face of changes to the system clock due to NTP.
Commit 547f04e7 changed all of INSTR_TIME_SET_CURRENT, gettimeofday(), and
time() to pg_time_now(), which calls INSTR_TIME_SET_CURRENT internally. So my
patch intended to revert these changes only for the gettimeofday() and time()
call sites, and to keep the changes that replaced INSTR_TIME_SET_CURRENT.
I attached the updated patch because I forgot to revert pg_time_now() to time(NULL).
Another idea for a fix is adding a 'use_epoch' flag to pg_time_now(), like below:
pg_time_usec_t
pg_time_now(bool use_epoch)
{
if (use_epoch)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec * 1000000 + tv.tv_usec;
}
else
{
instr_time now;
INSTR_TIME_SET_CURRENT(now);
return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
}
}
or making an additional function that returns epoch time.
By the way, there is another advantage of using clock_gettime(): it returns
precise results in nanoseconds. However, we cannot benefit from that because
pg_time_now() converts the value to uint64 using INSTR_TIME_GET_MICROSEC. So,
if we would like more precise statistics in pgbench, it might be better to use
the INSTR_TIME macros directly instead of pg_time_now() in other places.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachment: pgbench-log-timestamp-2.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..83e60a2ce6 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -334,7 +334,7 @@ typedef int64 pg_time_usec_t;
*/
typedef struct StatsData
{
- pg_time_usec_t start_time; /* interval start time, for aggregates */
+ time_t start_time; /* interval start time, for aggregates */
int64 cnt; /* number of transactions, including skipped */
int64 skipped; /* number of transactions skipped under --rate
* and --latency-limit */
@@ -1279,9 +1279,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
* the given value.
*/
static void
-initStats(StatsData *sd, pg_time_usec_t start)
+initStats(StatsData *sd, time_t start_time)
{
- sd->start_time = start;
+ sd->start_time = start_time;
sd->cnt = 0;
sd->skipped = 0;
initSimpleStats(&sd->latency);
@@ -3778,7 +3778,6 @@ doLog(TState *thread, CState *st,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
- pg_time_usec_t now = pg_time_now();
Assert(use_log);
@@ -3793,6 +3792,8 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ time_t now = time(NULL);
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
@@ -3830,16 +3831,21 @@ doLog(TState *thread, CState *st,
}
else
{
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
INT64_FORMAT,
- st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
+ st->id, st->cnt, st->use_file,
+ (long) tv.tv_sec, (long) tv.tv_usec);
else
fprintf(logfile, "%d " INT64_FORMAT " %.0f %d " INT64_FORMAT " "
INT64_FORMAT,
st->id, st->cnt, latency, st->use_file,
- now / 1000000, now % 1000000);
+ (long) tv.tv_sec, (long) tv.tv_usec);
if (throttle_delay)
fprintf(logfile, " %.0f", lag);
fputc('\n', logfile);
@@ -5455,7 +5461,16 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ /*
+ * On some platforms the current system timestamp is available in
+ * now_time, but rather than get entangled with that, we just eat the
+ * cost of an extra syscall in all cases.
+ */
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+ snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
+ (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
}
else
{
@@ -6601,7 +6616,7 @@ threadRun(void *arg)
thread->bench_start = start;
thread->throttle_trigger = start;
- initStats(&aggs, start);
+ initStats(&aggs, time(NULL));
last = aggs;
/* loop till all clients have terminated */
At Thu, 17 Jun 2021 12:23:42 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in
On Wed, 16 Jun 2021 21:11:41 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I cannot say that I'm thrilled by having multiple tv stuff back in several
places. I can be okay with one, though. What about the attached? Does it
make sense?
+1. The patch rounds sd->start_time down from µs to s, but that seems to
me a degradation.
At first, I also thought of fixing pg_time_now() to use gettimeofday() instead
of INSTR_TIME_SET_CURRENT, but I noticed that using INSTR_TIME_SET_CURRENT is
proper to measure time interval. I mean, this macro uses
clock_gettime(CLOCK_MONOTONIC, ...) if available, which gives reliable interval
timing even in the face of changes to the system clock due to NTP.
If I understand the OP correctly, the problem here is that the time values
in the pgbench log file are based on a bogus epoch. If that's the only issue
here, and if we just want to show the time based on the Unix epoch,
just recording the difference would work, as I sketched in the attached.
(Precisely speaking, the epoch would move if we set the system clock,
but I don't think that matters :p)
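The epoch_shift scheme in the attached patch can be sketched outside C as well; this is a hypothetical illustration of the idea (the names are mine, not from the patch):

```python
import time

# Sample both clocks once at startup; the difference converts any later
# monotonic reading into wall-clock microseconds since the Unix epoch.
EPOCH_SHIFT_US = int(time.time() * 1_000_000) - time.monotonic_ns() // 1000

def now_us():
    """Monotonic 'now' in microseconds, like pg_time_now()."""
    return time.monotonic_ns() // 1000

def log_columns(now):
    """Split a shifted reading into the log's seconds and microseconds columns."""
    wall = now + EPOCH_SHIFT_US
    return wall // 1_000_000, wall % 1_000_000

secs, usecs = log_columns(now_us())
print(secs, usecs)  # epoch seconds, close to int(time.time())
```

Because the shift is captured only once, a later clock adjustment moves the printed timestamps but not the measured intervals, which is the property being discussed.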
By the way, there is another advantage of using clock_gettime(): it returns
precise results in nanoseconds. However, we cannot benefit from that because
pg_time_now() converts the value to uint64 using INSTR_TIME_GET_MICROSEC. So,
if we would like more precise statistics in pgbench, it might be better to use
the INSTR_TIME macros directly instead of pg_time_now() in other places.
I'm not sure we have transactions lasting such a short time that
nanoseconds matter.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment: show_uepoch_time_in_logs.patch (text/x-patch)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..0ce9315a2b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -342,6 +342,8 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -3803,7 +3805,7 @@ doLog(TState *thread, CState *st,
{
/* print aggregated report to logfile */
fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
+ agg->start_time + epoch_shift,
agg->cnt,
agg->latency.sum,
agg->latency.sum2,
@@ -3834,12 +3836,12 @@ doLog(TState *thread, CState *st,
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
INT64_FORMAT,
- st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
+ st->id, st->cnt, st->use_file, (now + epoch_shift) / 1000000, (now + epoch_shift) % 1000000);
else
fprintf(logfile, "%d " INT64_FORMAT " %.0f %d " INT64_FORMAT " "
INT64_FORMAT,
st->id, st->cnt, latency, st->use_file,
- now / 1000000, now % 1000000);
+ (now + epoch_shift) / 1000000, (now + epoch_shift) % 1000000);
if (throttle_delay)
fprintf(logfile, " %.0f", lag);
fputc('\n', logfile);
@@ -5775,7 +5777,12 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+ gettimeofday(&tv, NULL);
+
+ epoch_shift = tv.tv_sec * 1000000 + tv.tv_usec - pg_time_now();
+
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
On Thu, 17 Jun 2021 14:17:56 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Thu, 17 Jun 2021 12:23:42 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in
On Wed, 16 Jun 2021 21:11:41 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I cannot say that I'm thrilled by having multiple tv stuff back in several
places. I can be okay with one, though. What about the attached? Does it
make sense?
+1 The patch rounds sd->start_time down from µs to s but it seems to
me a degradation.
I don't think that matters, because sd->start_time is used only for
log aggregation and aggregate-interval is specified in seconds.
At first, I also thought of fixing pg_time_now() to use gettimeofday() instead
of INSTR_TIME_SET_CURRENT, but I noticed that using INSTR_TIME_SET_CURRENT is
proper to measure time interval. I mean, this macro uses
lock_gettime(CLOCK_MONOTONIC, ) if avilable, which give reliable interval
timing even in the face of changes to the system clock due to NTP.
If I understand the OP correctly, the problem here is that the time values
in the pgbench log file are based on a bogus epoch. If that's the only issue
here, and if we just want to show the time based on the Unix epoch,
just recording the difference would work, as I sketched in the attached.
(Precisely speaking, the epoch would move if we set the system clock,
but I don't think that matters :p)
That makes sense. If the system clock is shifted due to NTP (for example),
it would not affect the measurement, although timestamps in the logs could be
shifted, because gettimeofday() is called only once.
If we fix it in this way, we should also fix printProgressReport():
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now + epoch_shift));
}
By the way, there is another advantage of using clock_gettime(): it returns
precise results in nanoseconds. However, we cannot benefit from that because
pg_time_now() converts the value to uint64 using INSTR_TIME_GET_MICROSEC. So,
if we would like more precise statistics in pgbench, it might be better to use
the INSTR_TIME macros directly instead of pg_time_now() in other places.
I'm not sure we have transactions lasting such a short time that
nanoseconds matter.
I thought it might affect the accuracy when statistics are accumulated
through a huge number of transactions, but I am fine with it if no one
cares.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Hello Yugo-san,
I think we can fix this issue by using gettimeofday for logging as before
this was changed. I attached the patch.
I cannot say that I'm thrilled by having multiple tv stuff back in several
places. I can be okay with one, though. What about the attached? Does it
make sense?
At first, I also thought of fixing pg_time_now() to use gettimeofday() instead
of INSTR_TIME_SET_CURRENT, but I noticed that using INSTR_TIME_SET_CURRENT is
proper to measure time interval. I mean, this macro uses
clock_gettime(CLOCK_MONOTONIC, ...) if available, which gives reliable interval
timing even in the face of changes to the system clock due to NTP.
Ok, I suspected there might be this kind of implication. Now, does it
matter that much if a few transactions are skewed by NTP from time to time?
skewed by NTP from time to time? Having to deal with different time
functions in the same file seems painful.
Commit 547f04e7 changed all of INSTR_TIME_SET_CURRENT, gettimeofday(), and
time() to pg_time_now(), which calls INSTR_TIME_SET_CURRENT internally. So my
patch intended to revert these changes only for the gettimeofday() and time()
call sites, and to keep the changes that replaced INSTR_TIME_SET_CURRENT.
Hmmm.
pg_time_now(bool use_epoch)
{
if (use_epoch)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec * 1000000 + tv.tv_usec;
}
else
{
instr_time now;
INSTR_TIME_SET_CURRENT(now);
return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
}
}
or making an additional function that returns epoch time.
Yes, but when to call which version? How to avoid confusion? After giving
it some thoughts, ISTM that the best short-term decision is just to have
epoch everywhere, i.e. having now() to rely on gettimeofday, because:
- at least one user is unhappy with not having epoch in the log file,
and indeed it makes sense to be unhappy about that if they want to
correlated logs. So I agree to undo that, or provide an option to undo
it.
- having different times with different origins at different points in
the code makes it really hard to understand and maintain, and if we
trade maintainability for precision it should really be worth it, and
I'm not sure that working around NTP adjustment is worth it right now.
In the not so short term, I'd say that the best approach would be use
relative time internally and just to offset these with a global epoch
start time when displaying a timestamp.
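The offsetting idea amounts to something like the following sketch: compute a shift between epoch time and the monotonic clock once at startup, and add it only when printing a timestamp (names follow the epoch_shift patch on this thread, but pg_time_now() is simplified here to a direct clock_gettime() call rather than pgbench's INSTR_TIME machinery):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

typedef int64_t pg_time_usec_t;

/* Monotonic "now" in microseconds, standing in for pgbench's pg_time_now(). */
static pg_time_usec_t
pg_time_now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (pg_time_usec_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/*
 * Computed once at startup: the difference between epoch time and the
 * monotonic clock.  Internal times stay monotonic, and displayed
 * timestamps become epoch-based by adding this shift.
 */
static pg_time_usec_t
compute_epoch_shift(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (pg_time_usec_t) tv.tv_sec * 1000000 + tv.tv_usec - pg_time_now();
}
```

With this scheme only the display sites change (e.g. `now + epoch_shift` in doLog), while all interval arithmetic keeps using the monotonic values.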
By the way, there is another advantage of using clock_gettime(): it
returns precise results in nanoseconds. However, we cannot benefit from it
because pg_time_now() converts the value to uint64 by using
INSTR_TIME_GET_MICROSEC. So, if we would like more precise statistics in
pgbench, it might be better to use INSTR_TIME_GET_MICROSEC instead of
pg_time_now in other places.
The INSTR_TIME macros are pretty ugly and inefficient, especially when
time arithmetic is involved, because they re-implement 64-bit operations
based on 32-bit ints. I really wanted to get rid of that as much as
possible. From a database benchmarking perspective ISTM that µs is the
right smallest unit, given that a transaction implies significant delays
such as network communications, parsing, and so on. So I do not think we
should ever need nanos.
In conclusion, ISTM that it is enough to simply change now to call
gettimeofday to fix the issue raised by Greg. This is patch v1 on the
thread.
--
Fabien.
Hello,
I cannot say that I'm thrilled by having multiple tv stuff back in several
places. I can be okay with one, though. What about the attached? Does it
make sense?
+1 The patch rounds down sd->start_time from ms to s, but it seems to
me a degradation.
Yes, please we should not use time.
At first, I also thought of fixing pg_time_now() to use gettimeofday() instead
of INSTR_TIME_SET_CURRENT, but I noticed that using INSTR_TIME_SET_CURRENT is
more appropriate for measuring time intervals. I mean, this macro uses
clock_gettime(CLOCK_MONOTONIC, ...) if available, which gives reliable interval
timing even in the face of changes to the system clock due to NTP.
If I understand the OP correctly, the problem here is that the time values
in the pgbench log file are based on a bogus epoch.
It is not "bogus", but it is not necessarily the epoch, depending on the
underlying function called behind the INSTR_TIME macros, and people are
entitled to expect epoch for log correlation.
If it's the only issue
here, and if we just want to show the time based on the Unix epoch,
just recording the difference would work, as I sketched in the
attached. (Precisely, the epoch would move if we set the system clock,
but I don't think that matters :p)
I do like the approach.
I'm hesitant to promote it for fixing the beta, but the code impact is
small enough, so I'd say yes. Maybe there is a similar issue with progress
which should probably use the same approach. I think that aligning the
implementations can wait for pg15.
The patch has whitespace issues. Attached is an updated version which fixes
that, adds comments, and simplifies the code a little bit.
I'm not sure we have transactions lasting such a short time that
nanoseconds matter.
Indeed.
--
Fabien.
Attachments:
pgbench-epoch-4.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..876c5a4547 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -342,6 +342,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying epoch timestamp, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -3803,7 +3809,7 @@ doLog(TState *thread, CState *st,
{
/* print aggregated report to logfile */
fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
+ agg->start_time + epoch_shift,
agg->cnt,
agg->latency.sum,
agg->latency.sum2,
@@ -3830,6 +3836,9 @@ doLog(TState *thread, CState *st,
}
else
{
+ /* switch to epoch for displaying the timestamp */
+ now += epoch_shift;
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -5775,6 +5784,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * 1000000 + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
Hello Thomas,
Before I could get to startup timing I noticed the pgbench logging
output was broken via commit 547f04e7 "Improve time logic":
/messages/by-id/E1lJqpF-00064e-C6@gemulon.postgresql.org
It does suck that we broke the logging and that it took 3 months for
anyone to notice and report it to the list.
Indeed. Well, it also demonstrates that betas are useful.
Seems like it should be straightforward to fix, though, with fixes
already proposed (though I haven't studied them yet, will do).
I think that fixing logging is simple enough, thus a revert is not
necessary.
I have a lot of community oriented work backed up behind this right
now, so I'm gonna be really honest. This time rework commit in its
current form makes me uncomfortable at this point in the release
schedule. The commit has already fought through two rounds of platform
specific bug fixes. But since the buildfarm doesn't test the logging
feature, that whole process is suspect.
It's true that this work produced a few rounds of small portability
follow-ups: c427de42 (work around strange hacks elsewhere in the tree
for AIX), 68b34b23 (missing calling convention specifier on Windows),
and de91c3b9 (adjust pthread missing-function code for threadless
builds). These were problems that didn't show up on developer or CI
systems (including threadless and Windows), and IMHO are typical sorts
of problems you expect to have to work through when stuff hits the
build farm, especially when using new system interfaces. So I don't
think any of that, on its own, supports reverting anything here.
Yep, the buildfarm is here to catch portability issues, and it does its
job :-) There is no doubt that logging has been broken because of the lack
of tests in this area, shame on us. I think it is easy to fix.
[...] For that reason, I'm not super attached to that new pg_time_usec_t
stuff at all, and wouldn't be sad if we reverted that piece.
Well, I was sooo happy to get rid of the ugly and inefficient INSTR_TIME
macros in pgbench… so anything looks better to me.
Note that Michaël is having a look at fixing pgbench logging issues.
--
Fabien.
Wouldn't it be better to put all those fixes into the same bag?
Attached.
--
Fabien.
Attachments:
pgbench-log-fix-6.patch (text/x-diff)
Wouldn't it be better to put all those fixes into the same bag?
Attached.
Even better if the patch is not empty.
--
Fabien.
Attachments:
pgbench-log-fix-6.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..3df92bdd2b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -342,6 +342,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying epoch timestamp, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -648,7 +654,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3765,16 +3771,47 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ agg->start_time + epoch_shift,
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
+ * Param tx tells whether the call actually represents a transaction,
+ * or if it is used to flush aggregated logs.
+ *
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transaction is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
@@ -3793,43 +3830,36 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (tx)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* switch to epoch */
+ now += epoch_shift;
+
+ /* !tx only used for aggregated data */
+ Assert(tx);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3889,7 +3919,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, true, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5775,6 +5805,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * 1000000 + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6794,8 +6829,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, false, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
On Thu, 17 Jun 2021 10:18:03 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Wouldn't it be better to put all those fixes into the same bag?
Attached.
Even better if the patch is not empty.
I found that you forgot to fix printProgressReport().
Also, according to the documentation, interval_start in Aggregated Logging
should be printed in seconds, not µs.
<para>
Here is some example output:
<screen>
1345828501 5601 1542744 483552416 61 2573
1345828503 7884 1979812 565806736 60 1479
1345828505 7208 1979422 567277552 59 1391
1345828507 7685 1980268 569784714 60 1398
1345828509 7073 1979779 573489941 236 1411
</screen></para>
If we obey the documentation and keep backward compatibility, we should fix
logAgg().
The attached patch includes these fixes.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachments:
pgbench-log-fix-7.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..08ecb07b9b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -342,6 +342,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying epoch timestamp, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -648,7 +654,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3765,16 +3771,47 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ (agg->start_time + epoch_shift) / 1000000,
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
+ * Param tx tells whether the call actually represents a transaction,
+ * or if it is used to flush aggregated logs.
+ *
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transaction is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
@@ -3793,43 +3830,36 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (tx)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* switch to epoch */
+ now += epoch_shift;
+
+ /* !tx only used for aggregated data */
+ Assert(tx);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3889,7 +3919,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, true, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5455,7 +5485,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5775,6 +5805,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * 1000000 + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6794,8 +6829,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, false, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
I found you forgot to fix printProgressReport().
Indeed.
Also, according to the documentation, interval_start in Aggregated Logging
should be printed in seconds, not µs.
Indeed. I'm unsure about what we should really want there, but for a beta
bug fix I agree that it must simply comply with the old documented behavior.
The attached patch includes these fixes.
Thanks. Works for me.
--
Fabien.
On Thu, Jun 17, 2021 at 7:24 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Seems like it should be straightforward to fix, though, with fixes
already proposed (though I haven't studied them yet, will do).
I think that fixing logging is simple enough, thus a revert is not
necessary.
I prepared a draft revert patch for discussion, just in case it comes
in handy. This reverts "pgbench: Improve time logic.", but "pgbench:
Synchronize client threads." remains (slightly rearranged).
I'm on the fence TBH, I can see that it's really small things and it
seems we have the patches, but it's late, late enough that
benchmarking gurus are showing up to benchmark with it for real, and
it's not great to be getting in the way of that with what is mostly
refactoring work, so I don't think it would be a bad thing if we
agreed to try again in 15. (A number of arguments for and against
moving pgbench out of the postgresql source tree and release cycle
occur to me, but I guess that's a topic for another thread.)
[...] For that reason, I'm not super attached to that new pg_time_usec_t
stuff at all, and wouldn't be sad if we reverted that piece.
Well, I was sooo happy to get rid of the ugly and inefficient INSTR_TIME
macros in pgbench… so anything looks better to me.
I don't love it either, in this code or the executor. (I know you
also don't like the THREAD_CREATE etc macros. I have something to
propose to improve that for 15....)
Note that Michaël is having a look at fixing pgbench logging issues.
Yeah I've been catching up with these threads.
Attachments:
0001-Revert-pgbench-Improve-time-logic.patch (text/x-patch)
From 07d2102ed7faee6c1ed7cbacf44b3c0d95d3e2d5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 17 Jun 2021 22:36:56 +1200
Subject: [PATCH] Revert "pgbench: Improve time logic."
Discussion: https://postgr.es/m/CAHLJuCW_8Vpcr0%3Dt6O_gozrg3wXXWXZXDioYJd3NhvKriqgpfQ%40mail.gmail.com
---
doc/src/sgml/ref/pgbench.sgml | 39 ++-
src/bin/pgbench/pgbench.c | 443 ++++++++++++++++---------------
src/tools/pgindent/typedefs.list | 1 -
3 files changed, 248 insertions(+), 235 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 0c60077e1f..eba6409148 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -58,10 +58,8 @@ number of clients: 10
number of threads: 1
number of transactions per client: 1000
number of transactions actually processed: 10000/10000
-latency average = 11.013 ms
-latency stddev = 7.351 ms
-initial connection time = 45.758 ms
-tps = 896.967014 (without initial connection time)
+tps = 85.184871 (including connections establishing)
+tps = 85.296346 (excluding connections establishing)
</screen>
The first six lines report some of the most important parameter
@@ -70,7 +68,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion. (In <option>-T</option> mode, only the actual
number of transactions is printed.)
- The last line reports the number of transactions per second.
+ The last two lines report the number of transactions per second,
+ figured with and without counting the time to start database sessions.
</para>
<para>
@@ -2339,22 +2338,22 @@ number of clients: 10
number of threads: 1
number of transactions per client: 1000
number of transactions actually processed: 10000/10000
-latency average = 10.870 ms
-latency stddev = 7.341 ms
-initial connection time = 30.954 ms
-tps = 907.949122 (without initial connection time)
+latency average = 15.844 ms
+latency stddev = 2.715 ms
+tps = 618.764555 (including connections establishing)
+tps = 622.977698 (excluding connections establishing)
statement latencies in milliseconds:
- 0.001 \set aid random(1, 100000 * :scale)
- 0.001 \set bid random(1, 1 * :scale)
- 0.001 \set tid random(1, 10 * :scale)
- 0.000 \set delta random(-5000, 5000)
- 0.046 BEGIN;
- 0.151 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
- 0.107 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
- 4.241 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
- 5.245 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
- 0.102 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
- 0.974 END;
+ 0.002 \set aid random(1, 100000 * :scale)
+ 0.005 \set bid random(1, 1 * :scale)
+ 0.002 \set tid random(1, 10 * :scale)
+ 0.001 \set delta random(-5000, 5000)
+ 0.326 BEGIN;
+ 0.603 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
+ 0.454 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
+ 5.528 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
+ 7.335 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
+ 0.371 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
+ 1.212 END;
</screen>
</para>
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..49a7cabf0f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -321,20 +321,13 @@ typedef struct SimpleStats
double sum2; /* sum of squared values */
} SimpleStats;
-/*
- * The instr_time type is expensive when dealing with time arithmetic. Define
- * a type to hold microseconds instead. Type int64 is good enough for about
- * 584500 years.
- */
-typedef int64 pg_time_usec_t;
-
/*
* Data structure to hold various statistics: per-thread and per-script stats
* are maintained and merged together.
*/
typedef struct StatsData
{
- pg_time_usec_t start_time; /* interval start time, for aggregates */
+ time_t start_time; /* interval start time, for aggregates */
int64 cnt; /* number of transactions, including skipped */
int64 skipped; /* number of transactions skipped under --rate
* and --latency-limit */
@@ -463,11 +456,11 @@ typedef struct
int nvariables; /* number of variables */
bool vars_sorted; /* are variables sorted by name? */
- /* various times about current transaction in microseconds */
- pg_time_usec_t txn_scheduled; /* scheduled start time of transaction */
- pg_time_usec_t sleep_until; /* scheduled start time of next cmd */
- pg_time_usec_t txn_begin; /* used for measuring schedule lag times */
- pg_time_usec_t stmt_begin; /* used for measuring statement latencies */
+ /* various times about current transaction */
+ int64 txn_scheduled; /* scheduled start time of transaction (usec) */
+ int64 sleep_until; /* scheduled start time of next cmd (usec) */
+ instr_time txn_begin; /* used for measuring schedule lag times */
+ instr_time stmt_begin; /* used for measuring statement latencies */
bool prepared[MAX_SCRIPTS]; /* whether client prepared the script */
@@ -497,15 +490,11 @@ typedef struct
int64 throttle_trigger; /* previous/next throttling (us) */
FILE *logfile; /* where to log, or NULL */
- /* per thread collected stats in microseconds */
- pg_time_usec_t create_time; /* thread creation time */
- pg_time_usec_t started_time; /* thread is running */
- pg_time_usec_t bench_start; /* thread is benchmarking */
- pg_time_usec_t conn_duration; /* cumulated connection and deconnection
- * delays */
-
+ /* per thread collected stats */
+ instr_time start_time; /* thread start time */
+ instr_time conn_time;
StatsData stats;
- int64 latency_late; /* count executed but late transactions */
+ int64 latency_late; /* executed but late transactions */
} TState;
/*
@@ -647,10 +636,10 @@ static void setIntValue(PgBenchValue *pv, int64 ival);
static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
-static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
+static ConnectionStateEnum executeMetaCommand(CState *st, instr_time *now);
static void doLog(TState *thread, CState *st,
StatsData *agg, bool skipped, double latency, double lag);
-static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
+static void processXactStats(TState *thread, CState *st, instr_time *now,
bool skipped, StatsData *agg);
static void addScript(ParsedScript script);
static THREAD_FUNC_RETURN_TYPE THREAD_FUNC_CC threadRun(void *arg);
@@ -669,24 +658,6 @@ static const PsqlScanCallbacks pgbench_callbacks = {
NULL, /* don't need get_variable functionality */
};
-static inline pg_time_usec_t
-pg_time_now(void)
-{
- instr_time now;
-
- INSTR_TIME_SET_CURRENT(now);
-
- return (pg_time_usec_t) INSTR_TIME_GET_MICROSEC(now);
-}
-
-static inline void
-pg_time_now_lazy(pg_time_usec_t *now)
-{
- if ((*now) == 0)
- (*now) = pg_time_now();
-}
-
-#define PG_TIME_GET_DOUBLE(t) (0.000001 * (t))
static void
usage(void)
@@ -1279,9 +1250,9 @@ mergeSimpleStats(SimpleStats *acc, SimpleStats *ss)
* the given value.
*/
static void
-initStats(StatsData *sd, pg_time_usec_t start)
+initStats(StatsData *sd, time_t start_time)
{
- sd->start_time = start;
+ sd->start_time = start_time;
sd->cnt = 0;
sd->skipped = 0;
initSimpleStats(&sd->latency);
@@ -3118,6 +3089,7 @@ evaluateSleep(CState *st, int argc, char **argv, int *usecs)
static void
advanceConnectionState(TState *thread, CState *st, StatsData *agg)
{
+ instr_time now;
/*
* gettimeofday() isn't free, so we get the current timestamp lazily the
@@ -3127,7 +3099,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
* means "not set yet". Reset "now" when we execute shell commands or
* expressions, which might take a non-negligible amount of time, though.
*/
- pg_time_usec_t now = 0;
+ INSTR_TIME_SET_ZERO(now);
/*
* Loop in the state machine, until we have to wait for a result from the
@@ -3162,30 +3134,29 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
/* Start new transaction (script) */
case CSTATE_START_TX:
- pg_time_now_lazy(&now);
/* establish connection if needed, i.e. under --connect */
if (st->con == NULL)
{
- pg_time_usec_t start = now;
+ instr_time start;
+ INSTR_TIME_SET_CURRENT_LAZY(now);
+ start = now;
if ((st->con = doConnect()) == NULL)
{
pg_log_error("client %d aborted while establishing connection", st->id);
st->state = CSTATE_ABORTED;
break;
}
-
- /* reset now after connection */
- now = pg_time_now();
-
- thread->conn_duration += now - start;
+ INSTR_TIME_SET_CURRENT(now);
+ INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
/* Reset session-local state */
memset(st->prepared, 0, sizeof(st->prepared));
}
/* record transaction start time */
+ INSTR_TIME_SET_CURRENT_LAZY(now);
st->txn_begin = now;
/*
@@ -3193,7 +3164,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
* scheduled start time.
*/
if (!throttle_delay)
- st->txn_scheduled = now;
+ st->txn_scheduled = INSTR_TIME_GET_MICROSEC(now);
/* Begin with the first command */
st->state = CSTATE_START_COMMAND;
@@ -3229,9 +3200,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
*/
if (latency_limit)
{
- pg_time_now_lazy(&now);
+ int64 now_us;
+
+ INSTR_TIME_SET_CURRENT_LAZY(now);
+ now_us = INSTR_TIME_GET_MICROSEC(now);
- while (thread->throttle_trigger < now - latency_limit &&
+ while (thread->throttle_trigger < now_us - latency_limit &&
(nxacts <= 0 || st->cnt < nxacts))
{
processXactStats(thread, st, &now, true, agg);
@@ -3264,9 +3238,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
* Wait until it's time to start next transaction.
*/
case CSTATE_THROTTLE:
- pg_time_now_lazy(&now);
+ INSTR_TIME_SET_CURRENT_LAZY(now);
- if (now < st->txn_scheduled)
+ if (INSTR_TIME_GET_MICROSEC(now) < st->txn_scheduled)
return; /* still sleeping, nothing to do here */
/* done sleeping, but don't start transaction if we're done */
@@ -3289,7 +3263,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
/* record begin time of next command, and initiate it */
if (report_per_command)
{
- pg_time_now_lazy(&now);
+ INSTR_TIME_SET_CURRENT_LAZY(now);
st->stmt_begin = now;
}
@@ -3485,8 +3459,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
* instead of CSTATE_START_TX.
*/
case CSTATE_SLEEP:
- pg_time_now_lazy(&now);
- if (now < st->sleep_until)
+ INSTR_TIME_SET_CURRENT_LAZY(now);
+ if (INSTR_TIME_GET_MICROSEC(now) < st->sleep_until)
return; /* still sleeping, nothing to do here */
/* Else done sleeping. */
st->state = CSTATE_END_COMMAND;
@@ -3506,12 +3480,13 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
{
Command *command;
- pg_time_now_lazy(&now);
+ INSTR_TIME_SET_CURRENT_LAZY(now);
command = sql_script[st->use_file].commands[st->command];
/* XXX could use a mutex here, but we choose not to */
addToSimpleStats(&command->stats,
- PG_TIME_GET_DOUBLE(now - st->stmt_begin));
+ INSTR_TIME_GET_DOUBLE(now) -
+ INSTR_TIME_GET_DOUBLE(st->stmt_begin));
}
/* Go ahead with next command, to be executed or skipped */
@@ -3537,7 +3512,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (is_connect)
{
finishCon(st);
- now = 0;
+ INSTR_TIME_SET_ZERO(now);
}
if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3575,7 +3550,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
* take no time to execute.
*/
static ConnectionStateEnum
-executeMetaCommand(CState *st, pg_time_usec_t *now)
+executeMetaCommand(CState *st, instr_time *now)
{
Command *command = sql_script[st->use_file].commands[st->command];
int argc;
@@ -3617,8 +3592,8 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_ABORTED;
}
- pg_time_now_lazy(now);
- st->sleep_until = (*now) + usec;
+ INSTR_TIME_SET_CURRENT_LAZY(*now);
+ st->sleep_until = INSTR_TIME_GET_MICROSEC(*now) + usec;
return CSTATE_SLEEP;
}
else if (command->meta == META_SET)
@@ -3760,7 +3735,7 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
* executing the expression or shell command might have taken a
* non-negligible amount of time, so reset 'now'
*/
- *now = 0;
+ INSTR_TIME_SET_ZERO(*now);
return CSTATE_END_COMMAND;
}
@@ -3770,15 +3745,14 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
*
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
- * from the caller, but rather than get entangled with that, we just eat
- * the cost of an extra syscall in all cases.
+ * from the instr_time reading the caller has, but rather than get entangled
+ * with that, we just eat the cost of an extra syscall in all cases.
*/
static void
doLog(TState *thread, CState *st,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
- pg_time_usec_t now = pg_time_now();
Assert(use_log);
@@ -3798,12 +3772,13 @@ doLog(TState *thread, CState *st,
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
+ time_t now = time(NULL);
while (agg->start_time + agg_interval <= now)
{
/* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
+ fprintf(logfile, "%ld " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ (long) agg->start_time,
agg->cnt,
agg->latency.sum,
agg->latency.sum2,
@@ -3831,15 +3806,17 @@ doLog(TState *thread, CState *st,
else
{
/* no, print raw transactions */
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
if (skipped)
- fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
- INT64_FORMAT,
- st->id, st->cnt, st->use_file, now / 1000000, now % 1000000);
+ fprintf(logfile, "%d " INT64_FORMAT " skipped %d %ld %ld",
+ st->id, st->cnt, st->use_file,
+ (long) tv.tv_sec, (long) tv.tv_usec);
else
- fprintf(logfile, "%d " INT64_FORMAT " %.0f %d " INT64_FORMAT " "
- INT64_FORMAT,
+ fprintf(logfile, "%d " INT64_FORMAT " %.0f %d %ld %ld",
st->id, st->cnt, latency, st->use_file,
- now / 1000000, now % 1000000);
+ (long) tv.tv_sec, (long) tv.tv_usec);
if (throttle_delay)
fprintf(logfile, " %.0f", lag);
fputc('\n', logfile);
@@ -3853,7 +3830,7 @@ doLog(TState *thread, CState *st,
* Note that even skipped transactions are counted in the "cnt" fields.)
*/
static void
-processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
+processXactStats(TState *thread, CState *st, instr_time *now,
bool skipped, StatsData *agg)
{
double latency = 0.0,
@@ -3863,11 +3840,11 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
if (detailed && !skipped)
{
- pg_time_now_lazy(now);
+ INSTR_TIME_SET_CURRENT_LAZY(*now);
/* compute latency & lag */
- latency = (*now) - st->txn_scheduled;
- lag = st->txn_begin - st->txn_scheduled;
+ latency = INSTR_TIME_GET_MICROSEC(*now) - st->txn_scheduled;
+ lag = INSTR_TIME_GET_MICROSEC(st->txn_begin) - st->txn_scheduled;
}
if (thread_details)
@@ -4118,7 +4095,10 @@ initGenerateDataClientSide(PGconn *con)
int64 k;
/* used to track elapsed time and estimate of the remaining time */
- pg_time_usec_t start;
+ instr_time start,
+ diff;
+ double elapsed_sec,
+ remaining_sec;
int log_interval = 1;
/* Stay on the same line if reporting to a terminal */
@@ -4170,7 +4150,7 @@ initGenerateDataClientSide(PGconn *con)
}
PQclear(res);
- start = pg_time_now();
+ INSTR_TIME_SET_CURRENT(start);
for (k = 0; k < (int64) naccounts * scale; k++)
{
@@ -4195,8 +4175,11 @@ initGenerateDataClientSide(PGconn *con)
*/
if ((!use_quiet) && (j % 100000 == 0))
{
- double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
- double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+ INSTR_TIME_SET_CURRENT(diff);
+ INSTR_TIME_SUBTRACT(diff, start);
+
+ elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+ remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
fprintf(stderr, INT64_FORMAT " of " INT64_FORMAT " tuples (%d%%) done (elapsed %.2f s, remaining %.2f s)%c",
j, (int64) naccounts * scale,
@@ -4206,8 +4189,11 @@ initGenerateDataClientSide(PGconn *con)
/* let's not call the timing for each row, but only each 100 rows */
else if (use_quiet && (j % 100 == 0))
{
- double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
- double remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
+ INSTR_TIME_SET_CURRENT(diff);
+ INSTR_TIME_SUBTRACT(diff, start);
+
+ elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+ remaining_sec = ((double) scale * naccounts - j) * elapsed_sec / j;
/* have we reached the next interval (or end)? */
if ((j == scale * naccounts) || (elapsed_sec >= log_interval * LOG_STEP_SECONDS))
@@ -4412,8 +4398,10 @@ runInitSteps(const char *initialize_steps)
for (step = initialize_steps; *step != '\0'; step++)
{
+ instr_time start;
char *op = NULL;
- pg_time_usec_t start = pg_time_now();
+
+ INSTR_TIME_SET_CURRENT(start);
switch (*step)
{
@@ -4455,7 +4443,12 @@ runInitSteps(const char *initialize_steps)
if (op != NULL)
{
- double elapsed_sec = PG_TIME_GET_DOUBLE(pg_time_now() - start);
+ instr_time diff;
+ double elapsed_sec;
+
+ INSTR_TIME_SET_CURRENT(diff);
+ INSTR_TIME_SUBTRACT(diff, start);
+ elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
if (!first)
appendPQExpBufferStr(&stats, ", ");
@@ -5403,12 +5396,12 @@ addScript(ParsedScript script)
* progress report. On exit, they are updated with the new stats.
*/
static void
-printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
+printProgressReport(TState *threads, int64 test_start, int64 now,
StatsData *last, int64 *last_report)
{
/* generate and show report */
- pg_time_usec_t run = now - *last_report;
- int64 ntx;
+ int64 run = now - *last_report,
+ ntx;
double tps,
total_run,
latency,
@@ -5455,7 +5448,16 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ /*
+ * On some platforms the current system timestamp is available in
+ * now_time, but rather than get entangled with that, we just eat the
+ * cost of an extra syscall in all cases.
+ */
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+ snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
+ (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
}
else
{
@@ -5495,18 +5497,21 @@ printSimpleStats(const char *prefix, SimpleStats *ss)
/* print out results */
static void
-printResults(StatsData *total,
- pg_time_usec_t total_duration, /* benchmarking time */
- pg_time_usec_t conn_total_duration, /* is_connect */
- pg_time_usec_t conn_elapsed_duration, /* !is_connect */
- int64 latency_late)
+printResults(StatsData *total, instr_time total_time,
+ instr_time conn_total_time, int64 latency_late)
{
- /* tps is about actually executed transactions during benchmarking */
+ double time_include,
+ tps_include,
+ tps_exclude;
int64 ntx = total->cnt - total->skipped;
- double bench_duration = PG_TIME_GET_DOUBLE(total_duration);
- double tps = ntx / bench_duration;
- printf("pgbench (PostgreSQL) %d.%d\n", PG_VERSION_NUM / 10000, PG_VERSION_NUM % 100);
+ time_include = INSTR_TIME_GET_DOUBLE(total_time);
+
+ /* tps is about actually executed transactions */
+ tps_include = ntx / time_include;
+ tps_exclude = ntx /
+ (time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
+
/* Report test parameters. */
printf("transaction type: %s\n",
num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -5537,7 +5542,8 @@ printResults(StatsData *total,
if (throttle_delay && latency_limit)
printf("number of transactions skipped: " INT64_FORMAT " (%.3f %%)\n",
- total->skipped, 100.0 * total->skipped / total->cnt);
+ total->skipped,
+ 100.0 * total->skipped / total->cnt);
if (latency_limit)
printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "/" INT64_FORMAT " (%.3f %%)\n",
@@ -5550,7 +5556,7 @@ printResults(StatsData *total,
{
/* no measurement, show average latency computed from run time */
printf("latency average = %.3f ms\n",
- 0.001 * total_duration * nclients / total->cnt);
+ 1000.0 * time_include * nclients / total->cnt);
}
if (throttle_delay)
@@ -5565,25 +5571,8 @@ printResults(StatsData *total,
0.001 * total->lag.sum / total->cnt, 0.001 * total->lag.max);
}
- /*
- * Under -C/--connect, each transaction incurs a significant connection
- * cost, it would not make much sense to ignore it in tps, and it would
- * not be tps anyway.
- *
- * Otherwise connections are made just once at the beginning of the run
- * and should not impact performance but for very short run, so they are
- * (right)fully ignored in tps.
- */
- if (is_connect)
- {
- printf("average connection time = %.3f ms\n", 0.001 * conn_total_duration / total->cnt);
- printf("tps = %f (including reconnection times)\n", tps);
- }
- else
- {
- printf("initial connection time = %.3f ms\n", 0.001 * conn_elapsed_duration);
- printf("tps = %f (without initial connection time)\n", tps);
- }
+ printf("tps = %f (including connections establishing)\n", tps_include);
+ printf("tps = %f (excluding connections establishing)\n", tps_exclude);
/* Report per-script/command statistics */
if (per_script_stats || report_per_command)
@@ -5604,7 +5593,7 @@ printResults(StatsData *total,
100.0 * sql_script[i].weight / total_weight,
sstats->cnt,
100.0 * sstats->cnt / total->cnt,
- (sstats->cnt - sstats->skipped) / bench_duration);
+ (sstats->cnt - sstats->skipped) / time_include);
if (throttle_delay && latency_limit && sstats->cnt > 0)
printf(" - number of transactions skipped: " INT64_FORMAT " (%.3f%%)\n",
@@ -5652,7 +5641,10 @@ set_random_seed(const char *seed)
if (seed == NULL || strcmp(seed, "time") == 0)
{
/* rely on current time */
- iseed = pg_time_now();
+ instr_time now;
+
+ INSTR_TIME_SET_CURRENT(now);
+ iseed = (uint64) INSTR_TIME_GET_MICROSEC(now);
}
else if (strcmp(seed, "rand") == 0)
{
@@ -5755,11 +5747,9 @@ main(int argc, char **argv)
CState *state; /* status of clients */
TState *threads; /* array of thread */
- pg_time_usec_t
- start_time, /* start up time */
- bench_start = 0, /* first recorded benchmarking time */
- conn_total_duration; /* cumulated connection time in
- * threads */
+ instr_time start_time; /* start up time */
+ instr_time total_time;
+ instr_time conn_total_time;
int64 latency_late = 0;
StatsData stats;
int weight;
@@ -6428,55 +6418,67 @@ main(int argc, char **argv)
/* all clients must be assigned to a thread */
Assert(nclients_dealt == nclients);
- /* get start up time for the whole computation */
- start_time = pg_time_now();
+ /* get start up time */
+ INSTR_TIME_SET_CURRENT(start_time);
/* set alarm if duration is specified. */
if (duration > 0)
setalarm(duration);
+ /* set up the barrier that we'll use to synchronize threads */
errno = THREAD_BARRIER_INIT(&barrier, nthreads);
if (errno != 0)
pg_log_fatal("could not initialize barrier: %m");
+ /* start threads */
#ifdef ENABLE_THREAD_SAFETY
- /* start all threads but thread 0 which is executed directly later */
- for (i = 1; i < nthreads; i++)
+ for (i = 0; i < nthreads; i++)
{
TState *thread = &threads[i];
- thread->create_time = pg_time_now();
- errno = THREAD_CREATE(&thread->thread, threadRun, thread);
+ INSTR_TIME_SET_CURRENT(thread->start_time);
+
+ /* compute when to stop */
+ if (duration > 0)
+ end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
+ (int64) 1000000 * duration;
- if (errno != 0)
+ /* the first thread (i = 0) is executed by main thread */
+ if (i > 0)
{
- pg_log_fatal("could not create thread: %m");
- exit(1);
+ errno = THREAD_CREATE(&thread->thread, threadRun, thread);
+
+ if (errno != 0)
+ {
+ pg_log_fatal("could not create thread: %m");
+ exit(1);
+ }
}
}
#else
- Assert(nthreads == 1);
-#endif /* ENABLE_THREAD_SAFETY */
-
+ INSTR_TIME_SET_CURRENT(threads[0].start_time);
/* compute when to stop */
- threads[0].create_time = pg_time_now();
if (duration > 0)
- end_time = threads[0].create_time + (int64) 1000000 * duration;
-
- /* run thread 0 directly */
- (void) threadRun(&threads[0]);
+ end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
+ (int64) 1000000 * duration;
+#endif /* ENABLE_THREAD_SAFETY */
- /* wait for other threads and accumulate results */
+ /* wait for threads and accumulate results */
initStats(&stats, 0);
- conn_total_duration = 0;
-
+ INSTR_TIME_SET_ZERO(conn_total_time);
for (i = 0; i < nthreads; i++)
{
TState *thread = &threads[i];
#ifdef ENABLE_THREAD_SAFETY
- if (i > 0)
+ if (i == 0)
+ /* actually run this thread directly in the main thread */
+ (void) threadRun(thread);
+ else
+ /* wait of other threads. should check that 0 is returned? */
THREAD_JOIN(thread->thread);
+#else
+ (void) threadRun(thread);
#endif /* ENABLE_THREAD_SAFETY */
for (int j = 0; j < thread->nstate; j++)
@@ -6489,25 +6491,23 @@ main(int argc, char **argv)
stats.cnt += thread->stats.cnt;
stats.skipped += thread->stats.skipped;
latency_late += thread->latency_late;
- conn_total_duration += thread->conn_duration;
-
- /* first recorded benchmarking start time */
- if (bench_start == 0 || thread->bench_start < bench_start)
- bench_start = thread->bench_start;
+ INSTR_TIME_ADD(conn_total_time, thread->conn_time);
}
-
- /* XXX should this be connection time? */
disconnect_all(state, nclients);
/*
- * Beware that performance of short benchmarks with many threads and
- * possibly long transactions can be deceptive because threads do not
- * start and finish at the exact same time. The total duration computed
- * here encompasses all transactions so that tps shown is somehow slightly
- * underestimated.
+ * XXX We compute results as though every client of every thread started
+ * and finished at the same time. That model can diverge noticeably from
+ * reality for a short benchmark run involving relatively many threads.
+ * The first thread may process notably many transactions before the last
+ * thread begins. Improving the model alone would bring limited benefit,
+ * because performance during those periods of partial thread count can
+ * easily exceed steady state performance. This is one of the many ways
+ * short runs convey deceptive performance figures.
*/
- printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
- bench_start - start_time, latency_late);
+ INSTR_TIME_SET_CURRENT(total_time);
+ INSTR_TIME_SUBTRACT(total_time, start_time);
+ printResults(&stats, total_time, conn_total_time, latency_late);
THREAD_BARRIER_DESTROY(&barrier);
@@ -6522,16 +6522,34 @@ threadRun(void *arg)
{
TState *thread = (TState *) arg;
CState *state = thread->state;
- pg_time_usec_t start;
+ instr_time start,
+ end;
int nstate = thread->nstate;
int remains = nstate; /* number of remaining clients */
socket_set *sockets = alloc_socket_set(nstate);
- int64 thread_start,
- last_report,
- next_report;
+ int i;
+
+ /* for reporting progress: */
+ int64 thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
+ int64 last_report = thread_start;
+ int64 next_report = last_report + (int64) progress * 1000000;
StatsData last,
aggs;
+ /*
+ * Initialize throttling rate target for all of the thread's clients. It
+ * might be a little more accurate to reset thread->start_time here too.
+ * The possible drift seems too small relative to typical throttle delay
+ * times to worry about it.
+ */
+ INSTR_TIME_SET_CURRENT(start);
+ thread->throttle_trigger = INSTR_TIME_GET_MICROSEC(start);
+
+ INSTR_TIME_SET_ZERO(thread->conn_time);
+
+ initStats(&aggs, time(NULL));
+ last = aggs;
+
/* open log file if requested */
if (use_log)
{
@@ -6552,23 +6570,10 @@ threadRun(void *arg)
}
}
- /* explicitly initialize the state machines */
- for (int i = 0; i < nstate; i++)
- state[i].state = CSTATE_CHOOSE_SCRIPT;
-
- /* READY */
- THREAD_BARRIER_WAIT(&barrier);
-
- thread_start = pg_time_now();
- thread->started_time = thread_start;
- last_report = thread_start;
- next_report = last_report + (int64) 1000000 * progress;
-
- /* STEADY */
if (!is_connect)
{
/* make connections to the database before starting */
- for (int i = 0; i < nstate; i++)
+ for (i = 0; i < nstate; i++)
{
if ((state[i].con = doConnect()) == NULL)
{
@@ -6584,32 +6589,26 @@ threadRun(void *arg)
goto done;
}
}
-
- /* compute connection delay */
- thread->conn_duration = pg_time_now() - thread->started_time;
- }
- else
- {
- /* no connection delay to record */
- thread->conn_duration = 0;
}
- /* GO */
- THREAD_BARRIER_WAIT(&barrier);
+ /* time after thread and connections set up */
+ INSTR_TIME_SET_CURRENT(thread->conn_time);
+ INSTR_TIME_SUBTRACT(thread->conn_time, thread->start_time);
- start = pg_time_now();
- thread->bench_start = start;
- thread->throttle_trigger = start;
+ THREAD_BARRIER_WAIT(&barrier);
- initStats(&aggs, start);
- last = aggs;
+ /* explicitly initialize the state machines */
+ for (i = 0; i < nstate; i++)
+ {
+ state[i].state = CSTATE_CHOOSE_SCRIPT;
+ }
/* loop till all clients have terminated */
while (remains > 0)
{
int nsocks; /* number of sockets to be waited for */
- pg_time_usec_t min_usec;
- pg_time_usec_t now = 0; /* set this only if needed */
+ int64 min_usec;
+ int64 now_usec = 0; /* set this only if needed */
/*
* identify which client sockets should be checked for input, and
@@ -6618,21 +6617,27 @@ threadRun(void *arg)
clear_socket_set(sockets);
nsocks = 0;
min_usec = PG_INT64_MAX;
- for (int i = 0; i < nstate; i++)
+ for (i = 0; i < nstate; i++)
{
CState *st = &state[i];
if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
{
/* a nap from the script, or under throttling */
- pg_time_usec_t this_usec;
+ int64 this_usec;
/* get current time if needed */
- pg_time_now_lazy(&now);
+ if (now_usec == 0)
+ {
+ instr_time now;
+
+ INSTR_TIME_SET_CURRENT(now);
+ now_usec = INSTR_TIME_GET_MICROSEC(now);
+ }
/* min_usec should be the minimum delay across all clients */
this_usec = (st->state == CSTATE_SLEEP ?
- st->sleep_until : st->txn_scheduled) - now;
+ st->sleep_until : st->txn_scheduled) - now_usec;
if (min_usec > this_usec)
min_usec = this_usec;
}
@@ -6667,12 +6672,19 @@ threadRun(void *arg)
/* also wake up to print the next progress report on time */
if (progress && min_usec > 0 && thread->tid == 0)
{
- pg_time_now_lazy(&now);
+ /* get current time if needed */
+ if (now_usec == 0)
+ {
+ instr_time now;
- if (now >= next_report)
+ INSTR_TIME_SET_CURRENT(now);
+ now_usec = INSTR_TIME_GET_MICROSEC(now);
+ }
+
+ if (now_usec >= next_report)
min_usec = 0;
- else if ((next_report - now) < min_usec)
- min_usec = next_report - now;
+ else if ((next_report - now_usec) < min_usec)
+ min_usec = next_report - now_usec;
}
/*
@@ -6721,7 +6733,7 @@ threadRun(void *arg)
/* ok, advance the state machine of each connection */
nsocks = 0;
- for (int i = 0; i < nstate; i++)
+ for (i = 0; i < nstate; i++)
{
CState *st = &state[i];
@@ -6759,8 +6771,11 @@ threadRun(void *arg)
/* progress report is made by thread 0 for all threads */
if (progress && thread->tid == 0)
{
- pg_time_usec_t now = pg_time_now();
+ instr_time now_time;
+ int64 now;
+ INSTR_TIME_SET_CURRENT(now_time);
+ now = INSTR_TIME_GET_MICROSEC(now_time);
if (now >= next_report)
{
/*
@@ -6778,17 +6793,17 @@ threadRun(void *arg)
*/
do
{
- next_report += (int64) 1000000 * progress;
+ next_report += (int64) progress * 1000000;
} while (now >= next_report);
}
}
}
done:
- start = pg_time_now();
+ INSTR_TIME_SET_CURRENT(start);
disconnect_all(state, nstate);
- thread->conn_duration += pg_time_now() - start;
-
+ INSTR_TIME_SET_CURRENT(end);
+ INSTR_TIME_ACCUM_DIFF(thread->conn_time, end, start);
if (thread->logfile)
{
if (agg_interval > 0)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb08319c..ce7963a674 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3347,7 +3347,6 @@ pg_sha512_ctx
pg_snapshot
pg_stack_base_t
pg_time_t
-pg_time_usec_t
pg_tz
pg_tz_cache
pg_tzenum
--
2.30.2
On 6/17/21 8:49 AM, Thomas Munro wrote:
On Thu, Jun 17, 2021 at 7:24 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Seems like it should be straightforward to fix, though, with fixes
already proposed (though I haven't studied them yet, will do).

I think that fixing logging is simple enough, thus a revert is not
necessary.

I prepared a draft revert patch for discussion, just in case it comes
in handy. This reverts "pgbench: Improve time logic.", but "pgbench:
Synchronize client threads." remains (slightly rearranged).

I'm on the fence TBH, I can see that it's really small things and it
seems we have the patches, but it's late, late enough that
benchmarking gurus are showing up to benchmark with it for real, and
it's not great to be getting in the way of that with what is mostly
refactoring work, so I don't think it would be a bad thing if we
agreed to try again in 15.
Is there an identified issue beyond the concrete example Greg gave of
the timestamps?
We are still fixing a few things with potentially far more impact than
anything in pgbench, so fixing this wouldn't bother me that much, as
long as we get it done for Beta2.
(A number of arguments for and against
moving pgbench out of the postgresql source tree and release cycle
occur to me, but I guess that's a topic for another thread.)
Indeed.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Jun 16, 2021 at 2:59 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I'm unhappy because I already added tap tests for time-sensitive features
(-T and others, maybe logging aggregates, cannot remember), which have
been removed because they could fail under some circonstances (eg very
very very very slow hosts), or required some special handling (a few lines
of code) in pgbench, and the net result of this is there is not a single
test in place for some features:-(
I understand your struggle and I hope I was clear about two things:
-I am excited by all the progress made in pgbench, and this problem is an
integration loose end rather than a developer failure at any level.
-Doing better in this messy area takes a team that goes from development to
release management, and I had no right to complain unless I brought
resources to improve in the specific areas of the process that I want to be
better.
I think the only thing you and I disagree on is that you see a "first issue
in a corner case" where I see a process failure that is absolutely vital
for me to improve. Since the reality is that I might be the best
positioned person to actually move said process forward in a meaningful
long-term way, I have every intention of applying pressure to the area
you're frustrated at. Crunchy has a whole parallel review team to the
community one now focused on what our corporate and government customers
need for software process control and procedure compliance. The primary
business problem I'm working on now is how to include performance review in
that mix.
I already know I need to re-engage with you over how I need min/max numbers
in the aggregate logging output to accomplish some valuable goals. When I
get around to that this summer, I'd really enjoy talking with you a bit,
video call or something, about really any community topic you're frustrated
with. I have a lot riding now on the productivity of the PostgreSQL hacker
community and I want everyone to succeed at the best goals.
There is no problem with proposing tests; the problem is getting them
accepted, or, once accepted, keeping them from being removed at the
first small issue rather than fixed, or their limitations accepted,
because testing time-sensitive features is not as simple as testing
functional features.
For 2020 Crunchy gave me a sort of sabbatical year to research community
oriented benchmarking topics. Having a self contained project in my home
turned out to be the perfect way to spend *that* wreck of a year.
I made significant progress toward the idea of having a performance farm
for PostgreSQL. On my laptop today is a 14GB database with 1s resolution
latency traces for 663 days of pgbench time running 4 workloads across a
small bare metal farm of various operating systems and hardware classes. I
can answer questions like "how long does a typical SSD take to execute an
INSERT commit?" across my farm with SQL. It's at the "works for me!" stage
of development, and I thought this was the right time in the development
cycle to start sharing improvement ideas from my work; thus the other
submissions in progress I alluded to.
The logging feature is in an intermediate spot where validating it requires
light custom tooling that compares its output against known variables like
the system time. It doesn't quite have a performance component to it.
Since this time logic detail is a well known portability minefield, I
thought demanding that particular test was a pretty easy sell.
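That light tooling really is only a few lines. As a sketch (a hypothetical
helper, not part of pgbench-tools), here is the kind of check that compares
the per-transaction log's epoch column against the system clock:

```python
import time

def check_epoch_column(line, tolerance=3600):
    """Return True if the 5th field of a pgbench per-transaction log
    line is a Unix epoch within `tolerance` seconds of the current time."""
    fields = line.split()
    epoch = int(fields[4])  # 5th column: whole seconds since the epoch
    return abs(time.time() - epoch) < tolerance

# The broken beta1 output lands in January 1970 and fails the check:
assert not check_epoch_column("0 1 1730 0 1537380 70911")
```

Running something like this over the first few log lines after every build
would have flagged the regression immediately.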
That you in particular are frustrated here makes perfect sense to me. I am
fresh and ready to carry this forward some distance, and I hope the outcome
makes you happy.

Is there an identified issue beyond the concrete example Greg gave of
the timestamps?
AFAICS, there is a patch which fixes all known issues linked to pgbench
logging. Whether there exists other issues is possible, but the "broken"
area was quite specific. There are also some TAP tests on pgbench which do
catch issues.
--
Fabien.
Hello Greg,
I think the only thing you and I disagree on is that you see a "first
issue in a corner case" where I see a process failure that is absolutely
vital for me to improve.
Hmmm. I agree that improvements are needed, but for me there are simply a
few missing (removed) tap tests which should/could have caught these
issues, which are AFAICS limited to the untested area.
Given the speed of the process and the energy and patience needed to move
things forward, reverting means that the patch is probably dead for at
least a year, possibly an eon, and that is too bad because IMHO it was an
improvement (my eyes are watering when I see INSTR_TIME macros), so I'd
prefer a fix rather than a revert if it is possible, which in this case I
think it could be.
Since the reality is that I might be the best positioned person
Good for you:-)
to actually move said process forward in a meaningful long-term way, I
have every intention of applying pressure to the area you're frustrated
at. Crunchy has a whole parallel review team to the community one now
focused on what our corporate and government customers need for software
process control and procedure compliance. The primary business problem
I'm working on now is how to include performance review in that mix.
This idea has been around for some time now. It is quite a task, and a
working and possibly extended pgbench is just one part of the overall
software, infrastructure and procedure needed to have that.
I already know I need to re-engage with you over how I need min/max numbers
in the aggregate logging output to accomplish some valuable goals.
I do try to review every patch submitted about pgbench. Feel free to fire!
When I get around to that this summer, I'd really enjoy talking with you
a bit, video call or something, about really any community topic you're
frustrated with.
"frustrated" may be a strong word. I'm somehow annoyed, and unlikely to
ever submit many tests improvements in the future.
There is no problem with proposing tests; the problem is getting them
accepted, or, once accepted, keeping them from being removed at the
first small issue rather than fixed, or their limitations accepted,
because testing time-sensitive features is not as simple as testing
functional features.

For 2020 Crunchy gave me a sort of sabbatical year to research community
oriented benchmarking topics. Having a self contained project in my home
turned out to be the perfect way to spend *that* wreck of a year.
Yep.
I made significant progress toward the idea of having a performance farm
for PostgreSQL. On my laptop today is a 14GB database with 1s resolution
latency traces for 663 days of pgbench time running 4 workloads across a
small bare metal farm of various operating systems and hardware classes.
Wow.
I can answer questions like "how long does a typical SSD take to execute
an INSERT commit?" across my farm with SQL.
So, what is the answer? :-)
It's at the "works for me!" stage of development, and I thought this was
the right time in the development cycle to start sharing improvement
ideas from my work; thus the other submissions in progress I alluded to.

The logging feature is in an intermediate spot where validating it requires
light custom tooling that compares its output against known variables like
the system time.
Sure.
It doesn't quite have a performance component to it.
Hmmm, if you log all transactions it can become the performance
bottleneck quite quickly:-)
Since this time logic detail is a well known portability minefield, I
thought demanding that particular test was a pretty easy sell.
The test I recalled was removed at ad51c6f. Ok, it would not have caught
the issue about timestamp (although it could have been improved to do so),
but it would have caught the trivial one about the catchup loop in
aggregate interval generating too many lines because of a forgotten
conversion to µs.
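In miniature, that kind of unit mismatch looks like this: a simplified
sketch of a catch-up loop (not pgbench's actual code) where the start time
is kept in seconds but "now" is mistakenly in microseconds, so the loop
emits roughly a million aggregate lines per elapsed second instead of one:

```python
def aggregate_lines(start_time, now, interval, max_lines=10_000_000):
    """Count how many aggregate-report lines the catch-up loop
    'while start + interval <= now' would emit, capped at max_lines."""
    lines = 0
    while start_time + interval <= now and lines < max_lines:
        lines += 1
        start_time += interval
    return lines

# Both values in seconds: one line per elapsed interval, as intended.
assert aggregate_lines(1000, 1005, 1) == 5

# Start in seconds, now mistakenly in microseconds: a flood of lines.
assert aggregate_lines(1000, 1005 * 1_000_000, 1) == 10_000_000
```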
--
Fabien.
Hello Thomas,
I prepared a draft revert patch for discussion, just in case it comes
in handy. This reverts "pgbench: Improve time logic.", but "pgbench:
Synchronize client threads." remains (slightly rearranged).
I had a quick look.
I had forgotten that this patch also fixed the long-running brain-damaged
tps computation that has been bothering me for years, so that one sane
performance figure is now reported instead of two not-clear-to-interpret
take-your-pick figures.
It would be a real loss if this user-facing fix is removed in the
process:-(
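For reference, the arithmetic difference between the two reports, as
illustrative Python rather than the real C (numbers are made up):

```python
def tps_old(ntx, total_time, conn_total_time, nclients):
    """Pre-rework style: two figures, including and excluding an
    averaged per-client connection setup time."""
    tps_include = ntx / total_time
    tps_exclude = ntx / (total_time - conn_total_time / nclients)
    return tps_include, tps_exclude

def tps_new(ntx, bench_duration):
    """Reworked style: one figure over the recorded benchmarking time,
    measured from the first thread's benchmark start."""
    return ntx / bench_duration

# 10,000 transactions, 10 s wall time, 8 s summed connection time, 4 clients:
inc, exc = tps_old(10_000, 10.0, 8.0, 4)
assert (inc, exc) == (1000.0, 1250.0)
# With the benchmark window measured directly, one unambiguous number:
assert tps_new(10_000, 8.0) == 1250.0
```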
--
Fabien.
On Fri, Jun 18, 2021 at 12:49:42AM +1200, Thomas Munro wrote:
I'm on the fence TBH, I can see that it's really small things and it
seems we have the patches, but it's late, late enough that
benchmarking gurus are showing up to benchmark with it for real, and
it's not great to be getting in the way of that with what is mostly
refactoring work, so I don't think it would be a bad thing if we
agreed to try again in 15. (A number of arguments for and against
moving pgbench out of the postgresql source tree and release cycle
occur to me, but I guess that's a topic for another thread.)
I may be missing something of course, but I don't see any strong
reason why we need to do a revert here if we have patches to discuss
first.
Note that Michaël is having a look at fixing pgbench logging issues.
Yeah I've been catching up with these threads.
Thomas, do you want me to look more at this issue? I don't feel
comfortable with the idea of doing anything if you are planning to
look at this thread and you are the owner here, so that should be your
call.
From what I can see, we have the same area getting patched with
patches across two threads, so it seems better to give up the other
thread and just focus on the discussion here, where v7 has been sent:
/messages/by-id/20210617175542.ad6b9b82926d8469e8520324@sraoss.co.jp
/messages/by-id/CAF7igB1r6wRfSCEAB-iZBKxkowWY6+dFF2jObSdd9+iVK+vHJg@mail.gmail.com
--
Michael
On Wed, Jun 16, 2021 at 03:13:30PM -0400, Andrew Dunstan wrote:
On 6/16/21 2:59 PM, Fabien COELHO wrote:
The key feedback for me is the usual one: what is not tested does not
work. Wow:-)

Agreed.
I'm unhappy because I already added tap tests for time-sensitive
features (-T and others, maybe logging aggregates, cannot remember),
which have been removed because they could fail under some
circumstances (e.g. very very very very slow hosts), or required some
special handling (a few lines of code) in pgbench, and the net result
of this is there is not a single test in place for some features:-(

I'm not familiar with exactly what happened in this case, but tests need
to be resilient over a wide range of performance characteristics. One
way around this issue might be to have a way of detecting that it's on a
slow platform and if so either skipping tests (Test::More provides
plenty of support for this) or expecting different results.
Detection would need the host to be consistently slow, like running under
Valgrind or a 20-year-old CPU. We also test on systems having highly-variable
performance due to other processes competing for the same hardware. I'd
perhaps add a "./configure --enable-realtime-tests" option that enables
affected tests. Testers should use the option whenever the execution
environment has sufficient reserved CPU.
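The "reserved CPU" idea above could equally be keyed off the environment rather than a configure switch. Below is a minimal C sketch of such a gate; note that PG_TEST_TIMING is an invented name used purely for illustration here, standing in for the proposed ./configure --enable-realtime-tests option, and is not an existing PostgreSQL knob:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Run timing-sensitive checks only when the tester has declared the
 * host fast and quiet enough.  PG_TEST_TIMING is a hypothetical
 * environment variable, not a real PostgreSQL setting.
 */
static bool
timing_tests_enabled(void)
{
	const char *v = getenv("PG_TEST_TIMING");

	return v != NULL && strcmp(v, "1") == 0;
}
```

A TAP test could then skip its timing assertions unless the tester opts in.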
On Fri, Jun 18, 2021 at 12:31 PM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Jun 18, 2021 at 12:49:42AM +1200, Thomas Munro wrote:
Yeah I've been catching up with these threads.
Thomas, do you want me to look more at this issue? I don't feel
comfortable with the idea of doing anything if you are planning to
look at this thread and you are the owner here, so that should be your
call.
From what I can see, we have the same area getting patched with
patches across two threads, so it seems better to give up the other
thread and just focus on the discussion here, where v7 has been sent:
/messages/by-id/20210617175542.ad6b9b82926d8469e8520324@sraoss.co.jp
/messages/by-id/CAF7igB1r6wRfSCEAB-iZBKxkowWY6+dFF2jObSdd9+iVK+vHJg@mail.gmail.com
Thanks for looking so far. It's the weekend here and I need to
unplug, but I'll test these changes and if all looks good push on
Monday.
On Sat, Jun 19, 2021 at 11:59:16AM +1200, Thomas Munro wrote:
Thanks for looking so far. It's the weekend here and I need to
unplug, but I'll test these changes and if all looks good push on
Monday.
Thanks for the update. Have a good weekend.
--
Michael
On 2021-Jun-19, Thomas Munro wrote:
Thanks for looking so far. It's the weekend here and I need to
unplug, but I'll test these changes and if all looks good push on
Monday.
Surely not the same day as the beta stamp...
--
Álvaro Herrera  Valdivia, Chile
"Always assume the user will do much worse than the stupidest thing
you can imagine." (Julien PUYDT)
On Sun, Jun 20, 2021 at 3:18 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Jun-19, Thomas Munro wrote:
Thanks for looking so far. It's the weekend here and I need to
unplug, but I'll test these changes and if all looks good push on
Monday.
Surely not the same day as the beta stamp...
Because of timezones, that'll be Sunday in the Americas. Still too close?
On Sun, Jun 20, 2021 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sun, Jun 20, 2021 at 3:18 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Jun-19, Thomas Munro wrote:
Thanks for looking so far. It's the weekend here and I need to
unplug, but I'll test these changes and if all looks good push on
Monday.
Surely not the same day as the beta stamp...
Because of timezones, that'll be Sunday in the Americas. Still too close?
Upon reflection, that amounts to the same thing really, so yeah,
scratch that plan. I'll defer until after that (and then I'll be
leaning more towards the revert option).
Upon reflection, that amounts to the same thing really, so yeah,
scratch that plan. I'll defer until after that (and then I'll be
leaning more towards the revert option).
Sigh. I do not understand anything about the decision process.
If you do revert, please consider NOT reverting the tps computation
changes intermixed in the patch.
--
Fabien.
On 6/20/21 5:02 AM, Fabien COELHO wrote:
Upon reflection, that amounts to the same thing really, so yeah,
scratch that plan. I'll defer until after that (and then I'll be
leaning more towards the revert option).
Sigh. I do not understand anything about the decision process.
Yes, sometimes it passeth all understanding.
There will certainly be a BETA3, and in every recent year except last
year there has been a BETA4.
If this were core server code threatening data integrity I would be
inclined to be more strict, but after all pg_bench is a utility program,
and I think we can allow a little more latitude.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Sun, Jun 20, 2021 at 10:15:55AM -0400, Andrew Dunstan wrote:
If this were core server code threatening data integrity I would be
inclined to be more strict, but after all pg_bench is a utility program,
and I think we can allow a little more latitude.
+1. Let's be flexible here. It looks better to not rush a fix, and
we still have some time ahead.
--
Michael
Bonjour Michaël,
If this were core server code threatening data integrity I would be
inclined to be more strict, but after all pg_bench is a utility program,
and I think we can allow a little more latitude.
+1. Let's be flexible here. It looks better to not rush a fix, and
we still have some time ahead.
Attached an updated v8 patch which adds (reinstate) an improved TAP test
which would have caught the various regressions on logs.
Given that such tests were removed once before, I'm unsure whether they
will be acceptable, despite that their usefulness has been clearly
demonstrated. At least it is for the record. Sigh :-(
--
Fabien.
Attachments:
pgbench-log-fix-8.patchtext/x-diff; name=pgbench-log-fix-8.patchDownload
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e61055b6b7..dfb2ff6859 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying epoch timestamp, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -649,7 +655,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3766,16 +3772,47 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ (agg->start_time + epoch_shift) / 1000000,
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
+ * Param tx tells whether the call actually represents a transaction,
+ * or if it is used to flush aggregated logs.
+ *
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transaction is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
@@ -3794,43 +3831,36 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (tx)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* switch to epoch */
+ now += epoch_shift;
+
+ /* !tx only used for aggregated data */
+ Assert(tx);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3890,7 +3920,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, true, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5456,7 +5486,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5806,6 +5836,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * 1000000 + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6828,8 +6863,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, false, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 923203ea51..2e34c1abf6 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw( time );
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1173,7 +1176,14 @@ sub list_files
return map { $dir . '/' . $_ } @files;
}
+#
# check log contents and cleanup
+# $dir: directory holding logs
+# $prefix: file prefix for per-thread logs
+# $nb: number of expected files
+# $min/$max: minimum/maximum number of lines in log files
+# $re: regular expression each line should match
+#
sub check_pgbench_logs
{
local $Test::Builder::Level = $Test::Builder::Level + 1;
@@ -1182,7 +1192,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1205,6 +1215,36 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
+# note: this test is time sensitive, and may fail on a very
+# loaded host.
+# note: --progress-timestamp is not tested
+my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+# cool check that we are around 2 seconds
+# The rate may results in an unlucky schedule which triggers
+# an early exit, hence the loose bound.
+ok(0.0 < $delay && $delay < 4.0, "-T 2 run around 2 seconds");
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
# with sampling rate
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
On Tue, Jun 22, 2021 at 12:06:45PM +0200, Fabien COELHO wrote:
Attached an updated v8 patch which adds (reinstate) an improved TAP test
which would have caught the various regressions on logs.
Given that such tests were removed once before, I'm unsure whether they will
be acceptable, despite that their usefulness has been clearly demonstrated.
At least it is for the record. Sigh:-(
Thanks!
This v8 is an addition of the fix for the epoch with the adjustments
for the aggregate reports in the logs. The maths look rather right
after a read and after some tests.
+# note: this test is time sensitive, and may fail on a very
+# loaded host.
+# note: --progress-timestamp is not tested
+my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
Could we make that shorter at 1s? That will shorten the duration of
the test run. It is easy to miss that this test is for
--aggregate-interval (aka the logAgg() path) without a comment.
+# cool check that we are around 2 seconds
+# The rate may results in an unlucky schedule which triggers
+# an early exit, hence the loose bound.
+ok(0.0 < $delay && $delay < 4.0, "-T 2 run around 2 seconds");
The command itself would not fail, but we would just fail on a machine
where the difference in start/stop time is higher than 4 seconds,
right? On RPI-level machines, this could fail reliably. I am not
completely sure what's the additional value we can get from that extra
test, to be honest.
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
Now this one is good and actually stable thanks to the fact that we'd
get at least the final logs, and the main complain we got to discuss
about on this thread was the format of the logs. I would say to give
up on the first test, and keep the second. But, is this regex
correct? Shouldn't we check for six integer fields only with the
first one having a minimal number of digits for the epoch?
--
Michael
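On how picky the regex can reasonably be: with --rate and --latency-limit the aggregate line carries 11 integer fields, 10 without --latency-limit, and 6 without --rate. A rough C sketch of the equivalent validation (the helper names are invented here for illustration, not pgbench functions):

```c
#include <stdbool.h>
#include <string.h>
#include <ctype.h>

/*
 * Count space-separated fields, each of which must be a run of digits,
 * mirroring what the TAP regex accepts for an aggregate log line.
 * Returns -1 on any malformed field.
 */
static int
count_numeric_fields(const char *line)
{
	int			nfields = 0;
	const char *p = line;

	while (*p)
	{
		if (!isdigit((unsigned char) *p))
			return -1;			/* each field must start with a digit */
		while (isdigit((unsigned char) *p))
			p++;
		nfields++;
		if (*p == ' ')
			p++;
		else if (*p != '\0')
			return -1;			/* unexpected character */
	}
	return nfields;
}

/*
 * 6 fields always (epoch, count, latency sum/sum2/min/max), +4 lag
 * fields under --rate, +1 skipped count under --latency-limit; and the
 * leading epoch has had 10 digits since September 2001.
 */
static bool
aggregate_line_ok(const char *line)
{
	int			n = count_numeric_fields(line);
	const char *sp = strchr(line, ' ');

	return (n == 6 || n == 10 || n == 11) &&
		sp != NULL && (sp - line) >= 10;
}
```

The short first field of the buggy output (e.g. 1537380) fails the 10-digit check, which is exactly the regression this thread is about.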
Hello,
+# note: this test is time sensitive, and may fail on a very
+# loaded host.
+# note: --progress-timestamp is not tested
+my $delay = pgbench(
+	'-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+	. ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+	. ' -c 3 -r',
+	0,
+	[ qr{type: multiple},
+	  qr{clients: 3},
+	  qr{threads: $nthreads},
+	  qr{duration: 2 s},
+	  qr{script 1: .* select only},
+	  qr{script 2: .* select only},
+	  qr{statement latencies in milliseconds},
+	  qr{FROM pgbench_accounts} ],
+	[ qr{vacuum}, qr{progress: 1\b} ],
+	'pgbench progress', undef,
+	"--log-prefix=$bdir/001_pgbench_log_1");
Could we make that shorter at 1s? That will shorten the duration of
the test run. It is easy to miss that this test is for
--aggregate-interval (aka the logAgg() path) without a comment.
It is for -T, -P and --aggregate-interval. The units of all of these are
seconds, the minimum is 1, and I put 2 so that it is pretty sure to get some
output. We could try 1, but I'm less confident that an output is ensured
in all cases on a very slow host, which may decide to shorten the run
before having shown a progress report, for instance.
+# cool check that we are around 2 seconds
+# The rate may results in an unlucky schedule which triggers
+# an early exit, hence the loose bound.
+ok(0.0 < $delay && $delay < 4.0, "-T 2 run around 2 seconds");
The command itself would not fail, but we would just fail on a machine
where the difference in start/stop time is higher than 4 seconds,
right?
Yep. It could detect a stuck pgbench process, which was one of the
reported issues. Maybe there should be a timeout added.
On RPI-level machines, this could fail reliably.
Dunno, not sure what RPI means. Probably not "Retail Price Index"… maybe
Raspberry Pi? In that case, the 0-4 second leeway is intended to be loose
enough to accommodate such hosts, but I cannot test that.
I am not completely sure what's the additional value we can get from
that extra test, to be honest.
This would be to detect a somehow stuck process. It could be looser if
necessary. Or removed, or preferably commented out, or enabled with some
options (eg an environment variable? configure option?). Such control
could also skip all 3 calls, in which case the 2 seconds is not an issue.
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+	qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
Now this one is good and actually stable thanks to the fact that we'd
get at least the final logs, and the main complaint we got to discuss
about on this thread was the format of the logs.
Yep, this test would have probably detected the epoch issue reported by
Greg.
I would say to give up on the first test, and keep the second.
You mean remove the time check and keep the log check. I'd like to keep a
time check, or at least have it in comment so that I can enable it simply.
But, is this regex correct? Shouldn't we check for six integer fields
only with the first one having a minimal number of digits for the epoch?
Epoch (seconds since 1970-01-01?) is currently 10 digits. Not sure how
well it would work if some host has another zero start date.
Given the options of the bench run, there are that many fields in the
log output, so I'm not sure why we would want to check for fewer?
--
Fabien.
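To make the digit-count argument concrete: Unix epochs have had 10 digits since 2001-09-09 (1000000000) and keep them until 2286-11-20, while the buggy pre-fix values (uptime-style, e.g. 1537380) are far shorter, which is exactly what \d{10,} catches. A small sketch:

```c
#include <stdint.h>

/*
 * Number of decimal digits in an epoch value.  9 digits until
 * 2001-09-09 (1000000000), 10 digits until 2286-11-20 (10000000000),
 * so \d{10,} pins the first log field to a post-2001 wall-clock
 * timestamp while rejecting monotonic-clock leftovers like 1537380.
 */
static int
epoch_digits(int64_t epoch_sec)
{
	int			n = 1;

	while (epoch_sec >= 10)
	{
		epoch_sec /= 10;
		n++;
	}
	return n;
}
```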
On Wed, Jun 23, 2021 at 08:26:32AM +0200, Fabien COELHO wrote:
Could we make that shorter at 1s? That will shorten the duration of
the test run. It is easy to miss that this test is for
--aggregate-interval (aka the logAgg() path) without a comment.
It is for -T, -P and --aggregate-interval. The units of all of these are
seconds, the minimum is 1, and I put 2 so that it is pretty sure to get some
output. We could try 1, but I'm less confident that an output is ensured in
all cases on a very slow host, which may decide to shorten the run before
having shown a progress report, for instance.
Could it be possible to document the intention of the test and its
coverage then? With the current patch, one has to guess what's the
intention behind this case.
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+	qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
Now this one is good and actually stable thanks to the fact that we'd
get at least the final logs, and the main complaint we got to discuss
about on this thread was the format of the logs.
Yep, this test would have probably detected the epoch issue reported by
Greg.
(Sorry I missed the use of throttle_delay that would generate 10
fields in the log file)
Hm.. Could it be possible to tighten a bit the regex used here then?
I was playing with it and it is not really picky in terms of patterns
allowed or rejected. The follow-up checks for check_pgbench_logs()
could be a bit more restrictive as well, but my regex-fu is bad.
I would say to give up on the first test, and keep the second.
You mean remove the time check and keep the log check. I'd like to keep a
time check, or at least have it in comment so that I can enable it simply.
I'd rather avoid tests that tend to fail on slow machines. There is a
flotilla in the buildfarm.
--
Michael
Bonjour Michaël,
Could it be possible to document the intention of the test and its
coverage then? With the current patch, one has to guess what's the
intention behind this case.
Ok, see attached.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+	qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
Hm.. Could it be possible to tighten a bit the regex used here then?
I was playing with it and it is not really picky in terms of patterns
allowed or rejected.
The follow-up checks for check_pgbench_logs() could be a bit more
restrictive as well, but my regex-fu is bad.
Given the probabilistic nature of a --rate run and the variability of
hosts which may run the tests, it is hard to be more specific than \d+ for
actual performance data. The run may try 0 or 50 transactions within a
second (both with pretty low probabilities), so the test mostly checks the
format and some basic sanity of the output and logs.
I would say to give up on the first test, and keep the second.
You mean remove the time check and keep the log check. I'd like to keep a
time check, or at least have it in comment so that I can enable it simply.
I'd rather avoid tests that tend to fail on slow machines. There is a
flotilla in the buildfarm.
Commented out in attached v9.
--
Fabien.
Attachments:
pgbench-log-fix-9.patchtext/x-diff; name=pgbench-log-fix-9.patchDownload
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e61055b6b7..dfb2ff6859 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying epoch timestamp, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -649,7 +655,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3766,16 +3772,47 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ (agg->start_time + epoch_shift) / 1000000,
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
+ * Param tx tells whether the call actually represents a transaction,
+ * or if it is used to flush aggregated logs.
+ *
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transaction is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
@@ -3794,43 +3831,36 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (tx)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* switch to epoch */
+ now += epoch_shift;
+
+ /* !tx only used for aggregated data */
+ Assert(tx);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3890,7 +3920,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, true, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5456,7 +5486,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5806,6 +5836,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * 1000000 + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6828,8 +6863,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, false, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 923203ea51..89e6164d30 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw( time );
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1173,7 +1176,14 @@ sub list_files
return map { $dir . '/' . $_ } @files;
}
+#
# check log contents and cleanup
+# $dir: directory holding logs
+# $prefix: file prefix for per-thread logs
+# $nb: number of expected files
+# $min/$max: minimum/maximum number of lines in log files
+# $re: regular expression each line should match
+#
sub check_pgbench_logs
{
local $Test::Builder::Level = $Test::Builder::Level + 1;
@@ -1182,7 +1192,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1205,7 +1215,52 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
-# with sampling rate
+# Test time-sensitive features on a light read-only transaction:
+#
+# -T: bench duration, 2 seconds to exercise progress & logs
+# -P: progress report
+# --aggregate-interval: periodic aggregated logs
+# --rate: schedule load
+# --latency-limit: max delay, not deeply exercice
+#
+# note: the --rate behavior is probabilistic in nature.
+# note: --progress-timestamp is not tested.
+my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+# cool check that we are around 2 seconds
+# The rate may results in an unlucky schedule which triggers
+# an early exit, hence the loose bound.
+#
+# THIS TEST IS COMMENTED OUT BUT PLEASE LET IT THERE SO THAT
+# IT CAN BE ENABLED EASILY.
+#
+## ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+# epoch #tx sum sum2 min max [sum sum2 min max [skipped]]
+# first serie about latency ; second about lag (--rate) ;
+# skipped only if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
On 2021-Jun-23, Fabien COELHO wrote:
+# cool check that we are around 2 seconds
+# The rate may results in an unlucky schedule which triggers
+# an early exit, hence the loose bound.
+#
+# THIS TEST IS COMMENTED OUT BUT PLEASE LET IT THERE SO THAT
+# IT CAN BE ENABLED EASILY.
+#
+## ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
I think you should use Test::More's "skip" for this, perhaps something
like this:
SKIP: {
skip "This test is unreliable";
# explain why
ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
}
... or, actually, even better would be to use a TODO block, so that the
test is run and reports its status, but if it happens not to succeed it
will not cause the whole test to fail. That way you'll accumulate some
evidence that may serve to improve the test in the future until it
works fully:
TODO: {
local $TODO = "This test is unreliable";
ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
}
--
Álvaro Herrera  Valdivia, Chile
"El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte"
(Ijon Tichy en Viajes, Stanislaw Lem)
Hola Álvaro,
... or, actually, even better would be to use a TODO block, so that the
test is run and reports its status, but if it happens not to succeed it
will not cause the whole test to fail. That way you'll accumulate some
evidence that may serve to improve the test in the future until it
works fully:
TODO: {
	local $TODO = "This test is unreliable";
	ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
}
Thanks for the hint! Why not: having the ability to collect data is a good
thing, so the attached v10 does that. If something goes wrong, the TODO
section could be extended around all the calls.
--
Fabien.
Attachments:
pgbench-log-fix-10.patchtext/x-diff; name=pgbench-log-fix-10.patchDownload
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e61055b6b7..dfb2ff6859 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying epoch timestamp, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -649,7 +655,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3766,16 +3772,47 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ (agg->start_time + epoch_shift) / 1000000,
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
+ * Param tx tells whether the call actually represents a transaction,
+ * or if it is used to flush aggregated logs.
+ *
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transaction is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
@@ -3794,43 +3831,36 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (tx)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* switch to epoch */
+ now += epoch_shift;
+
+ /* !tx only used for aggregated data */
+ Assert(tx);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3890,7 +3920,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, true, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5456,7 +5486,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5806,6 +5836,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * 1000000 + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6828,8 +6863,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, false, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 923203ea51..60260e08f9 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw( time );
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1173,7 +1176,14 @@ sub list_files
return map { $dir . '/' . $_ } @files;
}
+#
# check log contents and cleanup
+# $dir: directory holding logs
+# $prefix: file prefix for per-thread logs
+# $nb: number of expected files
+# $min/$max: minimum/maximum number of lines in log files
+# $re: regular expression each line should match
+#
sub check_pgbench_logs
{
local $Test::Builder::Level = $Test::Builder::Level + 1;
@@ -1182,7 +1192,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1205,7 +1215,57 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
-# with sampling rate
+# Test time-sensitive features on a light read-only transaction:
+#
+# -T: bench duration, 2 seconds to exercise progress & logs
+# -P: progress report
+# --aggregate-interval: periodic aggregated logs
+# --rate: schedule load
+# --latency-limit: max delay, not deeply exercice
+#
+# note: the --rate behavior is probabilistic in nature.
+# note: --progress-timestamp is not tested.
+my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+# cool check that we are around 2 seconds
+
+TODO: {
+ local $TODO = "possibly unreliable on slow hosts or unlucky runs";
+
+ # The rate may results in an unlucky schedule which triggers
+ # an early exit, hence the loose bound.
+
+ # also, the delay may totally fail on very slow or overloard hosts,
+ # valgrind runs...
+
+ ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+}
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+# epoch #tx sum sum2 min max [sum sum2 min max [skipped]]
+# first serie about latency ; second about lag (--rate) ;
+# skipped only if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
On Wed, Jun 23, 2021 at 10:01:28PM +0200, Fabien COELHO wrote:
Thanks for the hint! Why not: having the ability to collect data is a good
thing, so the attached v10 does that. If something goes wrong, the TODO
section could be extended around all the calls.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
FWIW, I am still seeing problems with the regex pattern you are using
here, because this fails to detect the number of fields we should have
generated here, for one. If you are not convinced, just run your new
test and extend or reduce the amount of data generated by logAgg() in
your patch: the tests will still pass.
So I have investigated this stuff in details. The regular expressions
are correctly shaped, but the use of grep() in check_pgbench_logs()
seems to be incorrect.
For example, let's take an aggregate report generated by your new
test:
"1624498086 13 27632 60597490 1683 2853 3227 883179 120 386 123"
Here are some extra ones, shorter and longer:
"1624498086 13 27632 60597490 1683 2853 3227 8831";
"1624498086 13 27632 60597490 1683 2853 3227 883179 120 386 123 123";
Using grep() with "$re" results in all the fields matching. Using on
the contrary "/$re/" in grep(), like list_files(), would only match
the first one, which is correct. Please see attached a small script
to show my point, called perl_grep.pl.
With this issue fixed, I have bumped into what looks like a different
bug in the tests. 001_pgbench_log_2 uses pgbench with 2 clients, but
expects only patterns in the logs where the first column value uses
only 0. With two clients, those first values can be either 0 or 1 due
to the client ID set.
It seems to me that we had better fix this issue and back-patch it to where
this was introduced, so that we have exact-match checks on the log
formats, no? Please see the attached.
Thoughts?
--
Michael
Attachments:
pgbench-tap-fix.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 923203ea51..ef8e40240b 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1194,7 +1194,7 @@ sub check_pgbench_logs
my $clen = @contents;
ok( $min <= $clen && $clen <= $max,
"transaction count for $log ($clen)");
- ok( grep($re, @contents) == $clen,
+ ok( grep(/$re/, @contents) == $clen,
"transaction format for $prefix");
close $fh or die "$@";
};
@@ -1213,7 +1213,7 @@ pgbench(
"--log-prefix=$bdir/001_pgbench_log_2");
check_pgbench_logs($bdir, '001_pgbench_log_2', 1, 8, 92,
- qr{^0 \d{1,2} \d+ \d \d+ \d+$});
+ qr{^{0,1} \d{1,2} \d+ \d \d+ \d+$});
# check log file in some detail
pgbench(
Bonjour Michaël,
Using grep() with "$re" results in all the fields matching. Using on
the contrary "/$re/" in grep(), like list_files(), would only match
the first one, which is correct.
Ok, good catch. Perl is kind of a strange language.
With this issue fixed, I have bumped into what looks like a different
bug in the tests. 001_pgbench_log_2 uses pgbench with 2 clients, but
expects only patterns in the logs where the first column value uses only
0. With two clients, those first values can be either 0 or 1 due to the
client ID set.
Indeed. The test passes because the number of expected lines is quite
It seems to me that we had better fix this issue and back-patch where
this has been introduced so as we have exact match checks with the log
formarts, no? Please see the attached.
Ok, however the regex should be "^[01] ...".
Attached v11 with your fixes + the above regex fix.
--
Fabien.
Attachments:
pgbench-log-fix-11.patchtext/x-diff; name=pgbench-log-fix-11.patchDownload
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e61055b6b7..dfb2ff6859 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying epoch timestamp, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -649,7 +655,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3766,16 +3772,47 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ (agg->start_time + epoch_shift) / 1000000,
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
+ * Param tx tells whether the call actually represents a transaction,
+ * or if it is used to flush aggregated logs.
+ *
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transaction is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
@@ -3794,43 +3831,36 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (tx)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* switch to epoch */
+ now += epoch_shift;
+
+ /* !tx only used for aggregated data */
+ Assert(tx);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3890,7 +3920,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, true, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5456,7 +5486,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5806,6 +5836,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * 1000000 + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6828,8 +6863,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, false, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 923203ea51..d52c797d5b 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw( time );
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1173,7 +1176,14 @@ sub list_files
return map { $dir . '/' . $_ } @files;
}
+#
# check log contents and cleanup
+# $dir: directory holding logs
+# $prefix: file prefix for per-thread logs
+# $nb: number of expected files
+# $min/$max: minimum/maximum number of lines in log files
+# $re: regular expression each line should match
+#
sub check_pgbench_logs
{
local $Test::Builder::Level = $Test::Builder::Level + 1;
@@ -1182,7 +1192,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1194,7 +1204,7 @@ sub check_pgbench_logs
my $clen = @contents;
ok( $min <= $clen && $clen <= $max,
"transaction count for $log ($clen)");
- ok( grep($re, @contents) == $clen,
+ ok( grep(/$re/, @contents) == $clen,
"transaction format for $prefix");
close $fh or die "$@";
};
@@ -1205,7 +1215,57 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
-# with sampling rate
+# Test time-sensitive features on a light read-only transaction:
+#
+# -T: bench duration, 2 seconds to exercise progress & logs
+# -P: progress report
+# --aggregate-interval: periodic aggregated logs
+# --rate: schedule load
+# --latency-limit: max delay, not deeply exercice
+#
+# note: the --rate behavior is probabilistic in nature.
+# note: --progress-timestamp is not tested.
+my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+# cool check that we are around 2 seconds
+
+TODO: {
+ local $TODO = "possibly unreliable on slow hosts or unlucky runs";
+
+ # The rate may results in an unlucky schedule which triggers
+ # an early exit, hence the loose bound.
+
+ # also, the delay may totally fail on very slow or overloard hosts,
+ # valgrind runs...
+
+ ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+}
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+# epoch #tx sum sum2 min max [sum sum2 min max [skipped]]
+# first serie about latency ; second about lag (--rate) ;
+# skipped only if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
@@ -1213,7 +1273,7 @@ pgbench(
"--log-prefix=$bdir/001_pgbench_log_2");
check_pgbench_logs($bdir, '001_pgbench_log_2', 1, 8, 92,
- qr{^0 \d{1,2} \d+ \d \d+ \d+$});
+ qr{^[01] \d{1,2} \d+ \d \d+ \d+$});
# check log file in some detail
pgbench(
On 6/24/21 2:46 AM, Fabien COELHO wrote:
Bonjour Michaᅵl,
Using grep() with "$re" results in all the fields matching. Using on
the contrary "/$re/" in grep(), like list_files(), would only match
the first one, which is correct.
Ok, good catch. Perl is kind of a strange language.
Not really, the explanation is fairly simple:
grep returns the values for which the test is true.
grep ("$re", @values) doesn't perform a regex test against the values, it
tests the truth of "$re" for each value, i.e. it's more or less the same
as grep (1, @values), which will always return the whole @values list.
By contrast grep (/$re/, @values) returns those elements of @values that
match the regex.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, Jun 24, 2021 at 08:03:27AM -0400, Andrew Dunstan wrote:
On 6/24/21 2:46 AM, Fabien COELHO wrote:
Using grep() with "$re" results in all the fields matching. Using on
the contrary "/$re/" in grep(), like list_files(), would only match
the first one, which is correct.
Ok, good catch. Perl is kind of a strange language.
Okay, I have extracted this part from your patch, and back-patched
this fix down to 11. The comments were a good addition, so I have
kept them. I have also made the second regex of check_pgbench_logs()
pickier with the client ID value expected, as it can only be 0.
By contrast grep (/$re/, @values) returns those elements of @values that
match the regex.
Thanks for the details here.
--
Michael
Bonjour Michaël,
Okay, I have extracted this part from your patch, and back-patched
this fix down to 11. The comments were a good addition, so I have
kept them. I have also made the second regex of check_pgbench_logs()
pickier with the client ID value expected, as it can only be 0.
Attached the remaining part of the patch to fix known issues on pgbench
logging.
I've added an entry on the open item on the wiki. I'm unsure about who the
owner should be.
--
Fabien.
Attachments:
pgbench-log-fix-12.patchtext/x-diff; name=pgbench-log-fix-12.patchDownload
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4aeccd93af..394c8b15ed 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying epoch timestamp, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -649,7 +655,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3768,16 +3774,47 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ (agg->start_time + epoch_shift) / 1000000,
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
+ * Param tx tells whether the call actually represents a transaction,
+ * or if it is used to flush aggregated logs.
+ *
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transaction is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
@@ -3796,43 +3833,36 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (tx)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* switch to epoch */
+ now += epoch_shift;
+
+ /* !tx only used for aggregated data */
+ Assert(tx);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3892,7 +3922,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, true, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5458,7 +5488,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5808,6 +5838,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * 1000000 + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6830,8 +6865,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, false, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 3aa9d5d753..41250bf91d 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw( time );
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1187,7 +1190,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1220,6 +1223,58 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
# Run with sampling rate, 2 clients with 50 transactions each.
+#
+# Test time-sensitive features on a light read-only transaction:
+#
+# -T: bench duration, 2 seconds to exercise progress & logs
+# -P: progress report
+# --aggregate-interval: periodic aggregated logs
+# --rate: schedule load
+# --latency-limit: max delay, not deeply exercice
+#
+# note: the --rate behavior is probabilistic in nature.
+# note: --progress-timestamp is not tested.
+my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+# cool check that we are around 2 seconds
+
+TODO: {
+ local $TODO = "possibly unreliable on slow hosts or unlucky runs";
+
+ # The rate may results in an unlucky schedule which triggers
+ # an early exit, hence the loose bound.
+
+ # also, the delay may totally fail on very slow or overloard hosts,
+ # valgrind runs...
+
+ ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+}
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+# epoch #tx sum sum2 min max [sum sum2 min max [skipped]]
+# first serie about latency ; second about lag (--rate) ;
+# skipped only if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
On Wed, Jun 30, 2021 at 09:45:47AM +0200, Fabien COELHO wrote:
Attached the remaining part of the patch to fix known issues on pgbench
logging.
I've added an entry on the open item on the wiki. I'm unsure about who the
owner should be.
There is already an item: "Incorrect time maths in pgbench".
--
Michael
On Wed, Jun 30, 2021 at 8:05 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jun 30, 2021 at 09:45:47AM +0200, Fabien COELHO wrote:
Attached the remaining part of the patch to fix known issues on pgbench
logging. I've added an entry on the open item on the wiki. I'm unsure about
who the owner should be.
There is already an item: "Incorrect time maths in pgbench".
Fabien, thanks for the updated patch, I'm looking at it. I removed
the duplicate item. More soon.
Hello Thomas,
I've added an entry on the open item on the wiki. I'm unsure about who the
owner should be.
There is already an item: "Incorrect time maths in pgbench".
Argh *shoot*, I went over the list too quickly, looking for "log" as a
keyword.
Fabien, thanks for the updated patch, I'm looking at it. I removed
the duplicate item. More soon.
Thanks. Sorry for the noise.
--
Fabien.
Fabien, thanks for the updated patch, I'm looking at it.
After looking at it again, here is an update which ensures 64 bits in the
epoch_shift computation.
--
Fabien.
Attachments:
pgbench-log-fix-13.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4aeccd93af..7750b5d660 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying epoch timestamp, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -649,7 +655,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3768,16 +3774,47 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ (agg->start_time + epoch_shift) / 1000000,
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
+ * Param tx tells whether the call actually represents a transaction,
+ * or if it is used to flush aggregated logs.
+ *
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transactions is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
@@ -3796,43 +3833,36 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (tx)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* switch to epoch */
+ now += epoch_shift;
+
+ /* !tx only used for aggregated data */
+ Assert(tx);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3892,7 +3922,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, true, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5458,7 +5488,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5808,6 +5838,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * INT64CONST(1000000) + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6830,8 +6865,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, false, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 3aa9d5d753..41250bf91d 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw( time );
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1187,7 +1190,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1220,6 +1223,58 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
# Run with sampling rate, 2 clients with 50 transactions each.
+#
+# Test time-sensitive features on a light read-only transaction:
+#
+# -T: bench duration, 2 seconds to exercise progress & logs
+# -P: progress report
+# --aggregate-interval: periodic aggregated logs
+# --rate: schedule load
+# --latency-limit: max delay, not deeply exercised
+#
+# note: the --rate behavior is probabilistic in nature.
+# note: --progress-timestamp is not tested.
+my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+# coarse check that the run took around 2 seconds
+
+TODO: {
+ local $TODO = "possibly unreliable on slow hosts or unlucky runs";
+
+ # The rate may result in an unlucky schedule which triggers
+ # an early exit, hence the loose bound.
+
+ # also, the delay may totally fail on very slow or overloaded hosts,
+ # valgrind runs...
+
+ ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+}
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+# epoch #tx sum sum2 min max [sum sum2 min max [skipped]]
+# first series is about latency; second is about lag (--rate);
+# skipped only if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
On Wed, Jun 30, 2021 at 9:55 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Fabien, thanks for the updated patch, I'm looking at it.
After looking at it again, here is an update which ensure 64 bits on
epoch_shift computation.
Hi Fabien,
The code in pgbench 13 aggregates into buckets that begin on the
boundaries of wall clock seconds, because it is triggered by changes
in time_t. In the current patch, we aggregate data into buckets that
begin on the boundaries of whole seconds since start_time. Those
boundaries are not aligned with wall clock seconds, and yet we print
out the times rounded to wall clock seconds.
With the following temporary hack:
static void
logAgg(FILE *logfile, StatsData *agg)
{
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- (agg->start_time + epoch_shift) / 1000000,
+ fprintf(logfile, /*INT64_FORMAT*/ "%f " INT64_FORMAT " %.0f
%.0f %.0f %.0f",
+ (agg->start_time + epoch_shift) / 1000000.0,
... you can see what I mean:
1625115080.840406 325 999256 3197232764 1450 6846
Perhaps we should round the start time of the first aggregate down to
the nearest wall clock second? That would mean that the first
aggregate misses a part of a second (as it does in pgbench 13), but
all later aggregates begin at the time we write in the log (as it does
in pgbench 13). That is, if we log 1625115080 we mean "all results >=
1625115080.000000". It's a small detail, but it could be important
for someone trying to correlate the log with other data. What do you
think?
Hello Thomas,
After looking at it again, here is an update which ensure 64 bits on
epoch_shift computation.The code in pgbench 13 aggregates into buckets that begin on the
boundaries of wall clock seconds, because it is triggered by changes
in time_t. In the current patch, we aggregate data into buckets that
begin on the boundaries of whole seconds since start_time. Those
boundaries are not aligned with wall clock seconds, and yet we print
out the times rounded to wall clock seconds.
Yes, I noticed this small change, and did not feel it was an issue at the
time.
I thought of doing something like the format change you are suggesting.
However, whether people would like it would need to be discussed, hence it
stayed that way… People have scripts to process log files and do not like
format changes, basically.
Perhaps we should round the start time of the first aggregate down to
the nearest wall clock second?
Yep, but that requires a common start point for all threads. Why not.
That would mean that the first aggregate misses a part of a second (as
it does in pgbench 13), but all later aggregates begin at the time we
write in the log (as it does in pgbench 13). That is, if we log
1625115080 we mean "all results >= 1625115080.000000". It's a small
detail, but it could be important for someone trying to correlate the
log with other data. What do you think?
I think that you are right. The simplest way is to align on whole seconds,
which is easier than changing the format and getting complaints about that,
or not aligning and getting complaints about the timestamp being rounded.
Attached a v14 in that spirit.
--
Fabien.
Attachments:
pgbench-log-fix-14.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4aeccd93af..003ba3382e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,17 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying epoch timestamp, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
+/*
+ * When the bench run started, just before creating threads
+ */
+pg_time_usec_t run_start_time;
+
/*
* Struct to keep random state.
*/
@@ -649,7 +660,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3768,16 +3779,47 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ (agg->start_time + epoch_shift) / 1000000,
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
+ * Param tx tells whether the call actually represents a transaction,
+ * or if it is used to flush aggregated logs.
+ *
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transactions is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool tx,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
@@ -3796,43 +3838,36 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (tx)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* switch to epoch */
+ now += epoch_shift;
+
+ /* !tx only used for aggregated data */
+ Assert(tx);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3892,7 +3927,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, true, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5458,7 +5493,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5789,7 +5824,6 @@ main(int argc, char **argv)
TState *threads; /* array of thread */
pg_time_usec_t
- start_time, /* start up time */
bench_start = 0, /* first recorded benchmarking time */
conn_total_duration; /* cumulated connection time in
* threads */
@@ -5808,6 +5842,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * INT64CONST(1000000) + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6465,7 +6504,7 @@ main(int argc, char **argv)
Assert(nclients_dealt == nclients);
/* get start up time for the whole computation */
- start_time = pg_time_now();
+ run_start_time = pg_time_now();
/* set alarm if duration is specified. */
if (duration > 0)
@@ -6543,7 +6582,7 @@ main(int argc, char **argv)
* underestimated.
*/
printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
- bench_start - start_time, latency_late);
+ bench_start - run_start_time, latency_late);
THREAD_BARRIER_DESTROY(&barrier);
@@ -6637,7 +6676,8 @@ threadRun(void *arg)
thread->bench_start = start;
thread->throttle_trigger = start;
- initStats(&aggs, start);
+ /* aggregations are done on whole seconds, the same for all threads */
+ initStats(&aggs, run_start_time / 1000000 * 1000000);
last = aggs;
/* loop till all clients have terminated */
@@ -6830,8 +6870,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, false, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 3aa9d5d753..41250bf91d 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw( time );
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1187,7 +1190,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1220,6 +1223,58 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
# Run with sampling rate, 2 clients with 50 transactions each.
+#
+# Test time-sensitive features on a light read-only transaction:
+#
+# -T: bench duration, 2 seconds to exercise progress & logs
+# -P: progress report
+# --aggregate-interval: periodic aggregated logs
+# --rate: schedule load
+# --latency-limit: max delay, not deeply exercised
+#
+# note: the --rate behavior is probabilistic in nature.
+# note: --progress-timestamp is not tested.
+my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+# coarse check that the run took around 2 seconds
+
+TODO: {
+ local $TODO = "possibly unreliable on slow hosts or unlucky runs";
+
+ # The rate may result in an unlucky schedule which triggers
+ # an early exit, hence the loose bound.
+
+ # also, the delay may totally fail on very slow or overloaded hosts,
+ # valgrind runs...
+
+ ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+}
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+# epoch #tx sum sum2 min max [sum sum2 min max [skipped]]
+# first series is about latency; second is about lag (--rate);
+# skipped only if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
On Thu, Jul 1, 2021 at 8:50 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Attached a v14 in that spirit.
Thanks! This doesn't seem to address the complaint, though. Don't
you need to do something like this? (See also attached.)
+ initStats(&aggs, start - (start + epoch_shift) % 1000000);
That should reproduce what pgbench 13 does implicitly when it uses
time(NULL). Namely, it rewinds to the start of the current *wall
clock* second, so that all future aggregates also start at round
number wall clock times, at the cost of making the first aggregate
miss out on a fraction of a second.
I wonder if some of the confusion on the other thread about the final
aggregate[1] was due to this difference. By rounding down, we get a
"head start" (because the first aggregate is short), so we usually
manage to record the expected number of aggregates before time runs
out. It's a race though. Your non-rounding version was more likely
to lose the race and finish before the final expected aggregate was
logged, so you added code to force a final aggregate to be logged. Do
I have this right? I'm not entirely sure how useful a partial final
aggregate is (it's probably one you have to throw away, like the first
one, no? Isn't it better if we only have to throw away the first
one?). I'm not sure, but if we keep that change, a couple of very
minor nits: I found the "tx" parameter name a little confusing. Do
you think it's clearer if we change it to "final" (with inverted
sense)? For the final aggregate, shouldn't we call doLog() only if
agg->cnt > 0?
I think I'd be inclined to take that change back out though, making
this patch very small and net behaviour like pgbench 13, if you agree
with my explanation for why you had to add it and why it's not
actually necessary with the fixed rounding shown above. (And perhaps
in v15 we might consider other ideas like using hi-res times in the
log and not rounding, etc, a topic for later.)
I don't really see the value in the test that checks that $delay falls
in the range 1.5s - 2.5s and then ignores the result. If it hangs
forever, we'll find out about it, and otherwise no human or machine
will ever care about that test. I removed it from this version. Were
you really attached to it?
I made some very minor language tweaks in comments (we don't usually
shorten "benchmark" to "bench" in English, "series" keeps the -s in
singular (blame the Romans), etc).
I think we should make it clear when we mean the *Unix* epoch (a
comment "switch to epoch" isn't meaningful on its own, to me at
least), so I changed that in a few places.
[1]: /messages/by-id/alpine.DEB.2.22.394.2106102323310.3698412@pseudo
Attachments:
v15-0001-Fix-bugs-in-pgbench-log-output.patch (text/x-patch)
From 91f9f5351e8842940714dcce6fdcb12a995fc7c1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 7 Jul 2021 16:39:30 +1200
Subject: [PATCH v15] Fix bugs in pgbench log output.
Commit 547f04e switched from using instr_time to pg_time_usec_t, but
introduced a couple of bugs:
1. The log previously included the current Unix epoch time, but it was
inadvertently changed to use a system-dependent epoch. Restore the Unix
epoch-based reporting, and make sure to begin aggregates on wall clock
second boundaries as before.
2. The aggregation interval needed to be scaled appropriately for the
new accounting unit to avoid logging bogus empty aggregates. Also force
a final partial aggregate to be logged.
Also add a simple new TAP test to verify the logged output format.
Back-patch to 14.
Author: Fabien COELHO <coelho@cri.ensmp.fr>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Yugo NAGATA <nagata@sraoss.co.jp>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reported-by: YoungHwan Joo <rulyox@gmail.com>
Reported-by: Gregory Smith <gregsmithpgsql@gmail.com>
Discussion: https://postgr.es/m/CAF7igB1r6wRfSCEAB-iZBKxkowWY6%2BdFF2jObSdd9%2BiVK%2BvHJg%40mail.gmail.com
Discussion: https://postgr.es/m/CAHLJuCW_8Vpcr0%3Dt6O_gozrg3wXXWXZXDioYJd3NhvKriqgpfQ%40mail.gmail.com
---
src/bin/pgbench/pgbench.c | 103 +++++++++++++------
src/bin/pgbench/t/001_pgbench_with_server.pl | 40 ++++++-
2 files changed, 112 insertions(+), 31 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4aeccd93af..3846ef8f40 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying Unix epoch timestamps, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -649,7 +655,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool final,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3768,16 +3774,46 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ (agg->start_time + epoch_shift) / 1000000,
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transactions is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds). For aggregation, "final"
+ * indicates that no transaction has occurred, but the final aggregate
+ * should be logged.
+ *
* We print Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool final,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
@@ -3796,43 +3832,36 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (!final)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else if (agg->cnt > 0)
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* switch to Unix epoch */
+ now += epoch_shift;
+
+ /* final only used for aggregated data */
+ Assert(!final);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3892,7 +3921,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, false, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5458,7 +5487,8 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s",
+ PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5808,6 +5838,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between Unix epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * INT64CONST(1000000) + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6637,7 +6672,14 @@ threadRun(void *arg)
thread->bench_start = start;
thread->throttle_trigger = start;
- initStats(&aggs, start);
+ /*
+ * The log format currently has Unix epoch timestamps with whole numbers of
+ * seconds. Round the first aggregate's start time down to the nearest
+ * Unix epoch second (the very first aggregate might really have started a
+ * fraction of a second later, but later aggregates are measured from the
+ * whole number time that is actually logged).
+ */
+ initStats(&aggs, start - (start + epoch_shift) % 1000000);
last = aggs;
/* loop till all clients have terminated */
@@ -6830,8 +6872,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, true, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 3aa9d5d753..70bd097b80 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1187,7 +1187,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1220,6 +1220,44 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
# Run with sampling rate, 2 clients with 50 transactions each.
+#
+# Test time-sensitive features on a light read-only transaction:
+#
+# -T: benchmark duration, 2 seconds to exercise progress & logs
+# -P: progress report
+# --aggregate-interval: periodic aggregated logs
+# --rate: schedule load
+# --latency-limit: max delay, not exercised much
+#
+# note: the --rate behavior is probabilistic in nature.
+# note: --progress-timestamp is not tested.
+pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+# epoch #tx sum sum2 min max [sum sum2 min max [skipped]]
+# first series about latency; second about lag (--rate);
+# skipped only if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
--
2.30.2
On Thu, Jun 17, 2021 at 7:18 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
I'm not sure we have transactions that last for such a short time that
nanoseconds matter.
Nanoseconds may not matter yet, but they could be handy when for
example we want to determine the order of parallel query executions.
We are less than an order of magnitude away from being able to do 1M
inserts/updates/deletes per second, so microseconds already are not
always 100% reliable.
We could possibly move to using LSNs fetched as part of the queries
for this case, but this will surely introduce more problems than it
solves :)
Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
Hello Thomas,
Thanks! This doesn't seem to address the complaint, though. Don't
you need to do something like this? (See also attached.)
+ initStats(&aggs, start - (start + epoch_shift) % 1000000);
ISTM that this is: (start + epoch_shift) / 1000000 * 1000000
That should reproduce what pgbench 13 does implicitly when it uses
time(NULL).
I understand that you are shifting the aggregate's internal start time to
the Unix epoch, however ISTM that other points in the program are not
shifted consistently with this, e.g. the while comparison in doLog. Also,
if the start time is already epoch-shifted, then it should not be shifted
again when printed (in logAgg). The attached version tries to be consistent.
Namely, it rewinds to the start of the current *wall clock* second, so
that all future aggregates also start at round number wall clock times,
at the cost of making the first aggregate miss out on a fraction of a
second.
ISTM that it was already wall clock time, but not epoch wall clock.
I'm okay with realigning aggregates on full seconds.
I wonder if some of the confusion on the other thread about the final
aggregate[1] was due to this difference.
Dunno. Parallel execution with threads is a pain when handling the details.
By rounding down, we get a "head start" (because the first aggregate is
short), so we usually manage to record the expected number of aggregates
before time runs out.
Fine with me if everything is consistent.
It's a race though. Your non-rounding version was more likely
to lose the race and finish before the final expected aggregate was
logged, so you added code to force a final aggregate to be logged.
ISTM that we always want to force it, because some modes can have low tps,
and the aggregates should be "full".
Do I have this right? I'm not entirely sure how useful a partial final
aggregate is
If you ask for 10 seconds run with 1 aggregate per second, you expect to
see (at least, about) 10 lines, and I want to ensure that, otherwise
people will ask questions, tools will have to look for special cases,
missing rows, whatever, and it will be a pain there. We want to produce
something simple, consistent, reliable, that tools can depend on.
(it's probably one you have to throw away, like the first one, no?
Isn't it better if we only have to throw away the first one?).
This should be the user's decision to drop it or not, not that of the tool
producing it, IMO.
I'm not sure, but if we keep that change, a couple of very minor nits:
I found the "tx" parameter name a little confusing. Do you think it's
clearer if we change it to "final" (with inverted sense)?
I agree that tx is not a very good name, but the inversion does not look
right to me. The "normal" behavior is
For the final aggregate, shouldn't we call doLog() only if agg->cnt > 0?
No, I think that we should want to have all aggregates, even with zeros,
so that the user can expect a deterministic number of lines.
I think I'd be inclined to take that change back out though, making this
patch very small and net behaviour like pgbench 13, if you agree with my
explanation for why you had to add it and why it's not actually
necessary with the fixed rounding shown above. (And perhaps in v15 we
might consider other ideas like using hi-res times in the log and not
rounding, etc, a topic for later.)
I think that I'm mostly okay.
I don't really see the value in the test that checks that $delay falls
in the range 1.5s - 2.5s and then ignores the result. If it hangs
forever, we'll find out about it, and otherwise no human or machine
will ever care about that test. I removed it from this version. Were
you really attached to it?
YES, REALLY! It would just have caught quite a few of the regressions we
are trying to address here. I want it there even if ignored, because I'll
look for it to avoid regressions in the future. If the test is actually
removed, recreating it is a pain. If you really want to deactivate it,
use if(0), but PLEASE leave it there so that it can be reactivated for
tests very simply; maintaining some test outside of the tree is not great.
Also, if farm logs show that it is okay on all animals, it can be switched
on by removing the ignore trick.
I made some very minor language tweaks in comments (we don't usually
shorten "benchmark" to "bench" in English, "series" keeps the -s in
singular (blame the Romans), etc).
Thanks! My English is kind of fuzzy in the details:-)
I think we should make it clear when we mean the *Unix* epoch (a
comment "switch to epoch" isn't meaningful on its own, to me at
least), so I changed that in a few places.
Ok.
Attached v16:
- tries to be consistent wrt epoch & aggregates, aligning to Unix epoch
as you suggested.
- renames tx as accumulate, but does not invert it.
- always shows aggregates so that the user can depend on the output,
even if stats are zero, because ISTM that cleverness must be avoided.
- put tests back, even if ignored, because I really want them available
easily.
When/if you get to commit this patch, eventually, do not forget that I'm
pushing forward fixes contributed by others, including Kyotaro Horiguchi
and Yugo Nagata.
--
Fabien.
Attachments:
pgbench-log-fix-16.patchtext/x-diff; NAME=pgbench-log-fix-16.patchDownload
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4aeccd93af..57c193d445 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying Unix epoch timestamps, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -649,7 +655,7 @@ static void setDoubleValue(PgBenchValue *pv, double dval);
static bool evaluateExpr(CState *st, PgBenchExpr *expr,
PgBenchValue *retval);
static ConnectionStateEnum executeMetaCommand(CState *st, pg_time_usec_t *now);
-static void doLog(TState *thread, CState *st,
+static void doLog(TState *thread, CState *st, bool accumulate,
StatsData *agg, bool skipped, double latency, double lag);
static void processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
bool skipped, StatsData *agg);
@@ -3768,20 +3774,50 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
return CSTATE_END_COMMAND;
}
+/* print aggregated report to logfile */
+static void
+logAgg(FILE *logfile, StatsData *agg)
+{
+ fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
+ agg->start_time / 1000000, /* this is epoch time */
+ agg->cnt,
+ agg->latency.sum,
+ agg->latency.sum2,
+ agg->latency.min,
+ agg->latency.max);
+ if (throttle_delay)
+ {
+ fprintf(logfile, " %.0f %.0f %.0f %.0f",
+ agg->lag.sum,
+ agg->lag.sum2,
+ agg->lag.min,
+ agg->lag.max);
+ if (latency_limit)
+ fprintf(logfile, " " INT64_FORMAT, agg->skipped);
+ }
+ fputc('\n', logfile);
+}
+
/*
* Print log entry after completing one transaction.
*
- * We print Unix-epoch timestamps in the log, so that entries can be
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transactions is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds). For aggregation,
+ * "accumulate" indicates that we are recording a transaction, otherwise it
+ * is a hint that the final aggregate should be logged.
+ *
+ * We use Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
*/
static void
-doLog(TState *thread, CState *st,
+doLog(TState *thread, CState *st, bool accumulate,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
- pg_time_usec_t now = pg_time_now();
+ pg_time_usec_t now = pg_time_now() + epoch_shift;
Assert(use_log);
@@ -3796,43 +3832,33 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
-
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
- /* print aggregated report to logfile */
- fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
- agg->cnt,
- agg->latency.sum,
- agg->latency.sum2,
- agg->latency.min,
- agg->latency.max);
- if (throttle_delay)
- {
- fprintf(logfile, " %.0f %.0f %.0f %.0f",
- agg->lag.sum,
- agg->lag.sum2,
- agg->lag.min,
- agg->lag.max);
- if (latency_limit)
- fprintf(logfile, " " INT64_FORMAT, agg->skipped);
- }
- fputc('\n', logfile);
+ logAgg(logfile, agg);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
- /* accumulate the current transaction */
- accumStats(agg, skipped, latency, lag);
+ if (accumulate)
+ /* accumulate the current transaction */
+ accumStats(agg, skipped, latency, lag);
+ else
+ /* final call to show the last aggregate */
+ logAgg(logfile, agg);
}
else
{
+ /* accumulate == false is only used for aggregated data */
+ Assert(accumulate);
+
/* no, print raw transactions */
if (skipped)
fprintf(logfile, "%d " INT64_FORMAT " skipped %d " INT64_FORMAT " "
@@ -3892,7 +3918,7 @@ processXactStats(TState *thread, CState *st, pg_time_usec_t *now,
st->cnt++;
if (use_log)
- doLog(thread, st, agg, skipped, latency, lag);
+ doLog(thread, st, true, agg, skipped, latency, lag);
/* XXX could use a mutex here, but we choose not to */
if (per_script_stats)
@@ -5458,7 +5484,8 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s",
+ PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5789,8 +5816,8 @@ main(int argc, char **argv)
TState *threads; /* array of thread */
pg_time_usec_t
- start_time, /* start up time */
- bench_start = 0, /* first recorded benchmarking time */
+ start_time, /* overall start time */
+ bench_start = 0, /* first recorded benchmarking time */
conn_total_duration; /* cumulated connection time in
* threads */
int64 latency_late = 0;
@@ -5808,6 +5835,11 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /* record shift between Unix epoch and now() */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * INT64CONST(1000000) + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6637,7 +6669,14 @@ threadRun(void *arg)
thread->bench_start = start;
thread->throttle_trigger = start;
- initStats(&aggs, start);
+ /*
+ * The log format currently has Unix epoch timestamps with whole numbers of
+ * seconds. Round the first aggregate's start time down to the nearest
+ * Unix epoch second (the very first aggregate might really have started a
+ * fraction of a second later, but later aggregates are measured from the
+ * whole number time that is actually logged).
+ */
+ initStats(&aggs, (start + epoch_shift) / 1000000 * 1000000);
last = aggs;
/* loop till all clients have terminated */
@@ -6830,8 +6869,9 @@ done:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
- doLog(thread, state, &aggs, false, 0, 0);
+ doLog(thread, state, false, &aggs, false, 0.0, 0.0);
}
+
fclose(thread->logfile);
thread->logfile = NULL;
}
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 3aa9d5d753..41250bf91d 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw( time );
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1187,7 +1190,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1220,6 +1223,58 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
# Run with sampling rate, 2 clients with 50 transactions each.
+#
+# Test time-sensitive features on a light read-only transaction:
+#
+# -T: bench duration, 2 seconds to exercise progress & logs
+# -P: progress report
+# --aggregate-interval: periodic aggregated logs
+# --rate: schedule load
+# --latency-limit: max delay, not deeply exercised
+#
+# note: the --rate behavior is probabilistic in nature.
+# note: --progress-timestamp is not tested.
+my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+# check that we are around 2 seconds
+
+TODO: {
+ local $TODO = "possibly unreliable on slow hosts or unlucky runs";
+
+ # The rate may result in an unlucky schedule which triggers
+ # an early exit, hence the loose bound.
+
+ # also, the delay may totally fail on very slow or overloaded hosts,
+ # valgrind runs...
+
+ ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+}
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+# epoch #tx sum sum2 min max [sum sum2 min max [skipped]]
+# first serie about latency ; second about lag (--rate) ;
+# skipped only if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
Hello Hannu,
I'm not sure we have transactions that last for such a short time that
nanoseconds matter.
Nanoseconds may not matter yet, but they could be handy when for
example we want to determine the order of parallel query executions.
We are less than an order of magnitude away from being able to do 1M
inserts/updates/deletes per second, so microseconds already are not
always 100% reliable.
ISTM that 1M tps would require really a lot of parallel clients, thus the
latency of each would be quite measurable, so that µs would still make
sense for measuring their performance? If an actual network is involved,
the network latency is already 100-200 µs even before executing any code.
--
Fabien.
On Fri, Jul 9, 2021 at 5:15 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Thanks! This doesn't seem to address the complaint, though. Don't
you need to do something like this? (See also attached.)
+ initStats(&aggs, start - (start + epoch_shift) % 1000000);
ISTM that this is: (start + epoch_shift) / 1000000 * 1000000
Sure, it's equivalent if you also change doLog() to match the change
in epoch, as your v16 did. Your version seems fine to me. The key
thing is that we only start new aggregates when the Unix epoch time
crosses over a XXX.000000 boundary, since we're only logging the XXX
part. That's much like(*) pgbench13, and I'm happy however you want
to code that. Case closed on that point. Thanks!
Isn't it better if we only have to throw away the first one?).
This should be the user's decision to drop it or not, not that of the
tool producing it, IMO.
Let me try this complaint again. It's admittedly a minor detail, but
my goal in this thread is to match pgbench13's treatment of aggregate
boundaries and partial aggregates, so that we can close the open item
for 14 with a minimal fix that doesn't make any unnecessary changes.
Discussions about improvements or pre-existing problems can wait.
First, let me demonstrate that pgbench13 throws away final partial
aggregates. I hacked REL_13_STABLE like so:
if (agg_interval > 0)
{
/* log aggregated but not yet reported transactions */
+fprintf(thread->logfile, "XXX log aggregated but not yet reported transactions: aggs.cnt = %ld\n", aggs.cnt);
doLog(thread, state, &aggs, false, 0, 0);
}
fclose(thread->logfile);
I ran pgbench -T6 --aggregate-interval 1 -l -S postgres, and it
produced a log file containing:
===BOF===
1625782245 7974 428574 24998642 49 683
1625782246 19165 998755 53217669 49 310
1625782247 19657 998868 51577774 47 189
1625782248 19707 998898 51660408 47 189
1625782249 19969 998867 50454999 48 156
1625782250 19845 998877 51071013 47 210
XXX log aggregated but not yet reported transactions: aggs.cnt = 10988
===EOF===
We can see three interesting things:
1. The first aggregate is partial (only ~7k transactions, because it
started partway through a second). Users have to throw away that
first aggregate because its statistics are noise. That is the price
to pay to have later aggregates start at the time they say.
2. We get 5 more full aggregates (~19k transactions). That's a total
of 6 aggregates, which makes intuitive sense with -T6.
3. At the end, the extra call to doLog() did nothing, and yet cnt =
10988. That's data we're throwing away, because Unix epoch time has
not advanced far enough to reach a new aggregate start time (it's not
impossible, but probability is very low). Checking the commit log, I
see that the code that claims to log the final aggregate came from
commit b6037664960 (2016); apparently it doesn't do what the comments
seem to think it does (did that ever work? Should perhaps be cleaned
up, but separately, it's not an open item for 14).
Now, in your recent patches you force that final partial aggregate to
be logged in that case with that new flag mechanism, as we can see:
===BOF===
1625783726 11823 609143 32170321 48 549
1625783727 19530 998995 52383115 47 210
1625783728 19468 999026 52208898 46 181
1625783729 19826 999001 51238427 46 185
1625783730 19195 999110 52841674 49 172
1625783731 18572 998992 56028876 48 318
1625783732 7484 388620 20951100 48 316
===EOF===
1. We get a partial initial aggregate just like 13. That sacrificial
aggregate helps us synchronize the rest of the aggregates with the
logged timestamps. Good.
2. We get 5 full aggregates (~19k transactions) just like 13. As in
13, that's quite likely, because the first one was "short" so we
almost always reach the end of the 6th one before -T6 runs out of
sand. Good.
3. We get a new partial aggregate at the end. Users would have to
throw that one away too. This is not a big deal, but it's a change in
behaviour that should be discussed.
Given that that last one is a waste of pixels and a (so far)
unjustified change in behaviour, I propose, this time with a little
more force and an updated patch, that we abandon that part of the
change. I submit that you only added that because your earlier
patches didn't have the partial aggregate at the start, so then it
often didn't produce the 6th line of output. So, you forced it to log
whatever it had left, even though the full time hadn't elapsed yet.
Now we don't need that.
The patch and resulting code are simpler, and the user experience matches 13.
See attached.
I don't really see the value in the test that checks that $delay falls
in the range 1.5s - 2.5s and then ignores the result. If it hangs
forever, we'll find out about it, and otherwise no human or machine
will ever care about that test. I removed it from this version. Were
you really attached to it?
YES, REALLY! It would just have caught quite a few of the regressions we
are trying to address here. I want it there even if ignored, because I'll
look for it to avoid regressions in the future. If the test is actually
removed, recreating it is a pain. If you really want to deactivate it,
use if(0), but PLEASE leave it there so that it can be reactivated for
tests very simply; maintaining some test outside of the tree is not great.
Ok, you win :-)
When/if you get to commit this patch, eventually, do not forget that I'm
pushing forward fixes contributed by others, including Kyotaro Horiguchi
and Yugo Nagata.
Fixed, thanks.
* I say "much like" and not "exactly like"; of course there may be a
subtle difference if ntpd adjusts your clock while a benchmark is
running. Something must give, and 14's coding prefers to keep the
duration of aggregates stable at exactly X seconds according to the
high precision timer, so that the statistics it reports are
meaningful, but 13 prefers to keep the timestamps it logs in sync with
other software using gettimeofday() and will give you a weird short or
long aggregate to achieve that (producing bad statistics). I can see
arguments for both but I'm OK with that change and I see that it is in
line with your general goal of switching to modern accurate time
interfaces.
Attachments:
v17-0001-Fix-bugs-in-pgbench-log-output.patchtext/x-patch; charset=US-ASCII; name=v17-0001-Fix-bugs-in-pgbench-log-output.patchDownload
From 5a5074c1ba971197569a7ac83f6f80c56479b806 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 9 Jul 2021 12:09:16 +1200
Subject: [PATCH v17] Fix bugs in pgbench log output.
Commit 547f04e switched from using instr_time to pg_time_usec_t, but
introduced a couple of bugs:
1. The log previously used Unix time stamps, but it was changed to
write out times based on a system-dependent epoch. Restore the Unix
epoch-based reporting, and make sure to begin aggregates on the
boundaries of Unix epoch seconds, as before.
2. The aggregation interval needed to be scaled appropriately for the
new accounting unit, to avoid logging bogus empty aggregates.
Also add simple new TAP tests to verify the logged output format. Since
one of the checks is likely to fail on a heavily loaded system, ignore
failures in that test.
Back-patch to 14.
Author: Fabien COELHO <coelho@cri.ensmp.fr>
Author: Yugo NAGATA <nagata@sraoss.co.jp>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reported-by: YoungHwan Joo <rulyox@gmail.com>
Reported-by: Gregory Smith <gregsmithpgsql@gmail.com>
Discussion: https://postgr.es/m/CAF7igB1r6wRfSCEAB-iZBKxkowWY6%2BdFF2jObSdd9%2BiVK%2BvHJg%40mail.gmail.com
Discussion: https://postgr.es/m/CAHLJuCW_8Vpcr0%3Dt6O_gozrg3wXXWXZXDioYJd3NhvKriqgpfQ%40mail.gmail.com
---
src/bin/pgbench/pgbench.c | 43 ++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 58 +++++++++++++++++++-
2 files changed, 92 insertions(+), 9 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4aeccd93af..a54958930b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying Unix epoch timestamps, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -3771,7 +3777,11 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
/*
* Print log entry after completing one transaction.
*
- * We print Unix-epoch timestamps in the log, so that entries can be
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transactions is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
+ * We use Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
@@ -3781,7 +3791,7 @@ doLog(TState *thread, CState *st,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
- pg_time_usec_t now = pg_time_now();
+ pg_time_usec_t now = pg_time_now() + epoch_shift;
Assert(use_log);
@@ -3796,17 +3806,19 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
/* print aggregated report to logfile */
fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
+ agg->start_time / 1000000, /* seconds since Unix epoch */
agg->cnt,
agg->latency.sum,
agg->latency.sum2,
@@ -3825,7 +3837,7 @@ doLog(TState *thread, CState *st,
fputc('\n', logfile);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
/* accumulate the current transaction */
@@ -5458,7 +5470,8 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s",
+ PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5808,6 +5821,15 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /*
+ * Record difference between Unix epoch and high resolution timer's epoch.
+ * We'll use this for logging and aggregation with Unix epoch-based
+ * buckets.
+ */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * INT64CONST(1000000) + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6637,7 +6659,14 @@ threadRun(void *arg)
thread->bench_start = start;
thread->throttle_trigger = start;
- initStats(&aggs, start);
+ /*
+ * The log format currently has Unix epoch timestamps with whole numbers
+ * of seconds. Round the first aggregate's start time down to the nearest
+ * Unix epoch second (the very first aggregate might really have started a
+ * fraction of a second later, but later aggregates are measured from the
+ * whole number time that is actually logged).
+ */
+ initStats(&aggs, (start + epoch_shift) / 1000000 * 1000000);
last = aggs;
/* loop till all clients have terminated */
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 3aa9d5d753..3bf42fad1b 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw(time);
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1187,7 +1190,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1220,6 +1223,57 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
# Run with sampling rate, 2 clients with 50 transactions each.
+#
+# Test time-sensitive features on a light read-only transaction:
+#
+# -T: benchmark duration, 2 seconds to exercise progress & logs
+# -P: progress report
+# --aggregate-interval: periodic aggregated logs
+# --rate: schedule load
+# --latency-limit: max delay, not deeply exercised
+#
+# note: the --rate behavior is probabilistic in nature.
+# note: --progress-timestamp is not tested.
+my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+# check that we are around 2 seconds
+TODO: {
+ local $TODO = "possibly unreliable on slow hosts or unlucky runs";
+
+ # The rate may result in an unlucky schedule which triggers
+ # an early exit, hence the loose bound.
+
+ # also, the delay may totally fail on very slow or overloaded hosts,
+ # valgrind runs...
+
+ ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+}
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+# unix_epoch_time #tx sum sum2 min max [sum sum2 min max [skipped]]
+# first series about latency; second about lag (--rate) ;
+# skipped only if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
--
2.30.2
Hello Thomas,
Isn't it better if we only have to throw away the first one?).
This should be the user decision to drop it or not, not the tool
producing it, IMO.
Let me try this complaint again. [...]
I understand your point.
For me, silently removing the last bucket is not right because the sum of
the aggregates does not match the final grand total. This change is
intentional and borders on a bug fix, which is what this patch was doing,
even if it is also a small behavioral change: we should want the detailed
and final reports in agreement.
I do agree that users should probably ignore the first and last lines.
See attached.
Works for me: patch applies, global and local check ok. I'm fine with it.
If it was me, I'd still show the last bucket, but it does not matter much.
--
Fabien.
On Fri, Jul 9, 2021 at 5:14 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Works for me: patch applies, global and local check ok. I'm fine with it.
I hoped we were done here but I realised that your check for 1-3 log
lines will not survive the harsh environment of the build farm.
Adding sleep(2) before the final doLog() confirms that. I had two
ideas:
1. Give up and expect 1-180 lines. (180s is the current timeout
tolerance used elsewhere to deal with
swamped/valgrind/swapping/microsd computers, after a few rounds of
inflation, so if you need an arbitrarily large number to avoid buildfarm
measles that's my suggestion....)
2. Change doLog() to give up after end_time. But then I think you'd
need to make it 0-3 :-(
I think the logging could be tightened up to work the way you expected
in future work, though. Perhaps we could change all logging to work
with transaction start time instead of log-writing time, which doLog()
should receive. If you never start a transaction after end_time, then
there can never be an aggregate that begins after that, and the whole
thing becomes more deterministic. That kind of alignment of aggregate
timing with whole-run timing could also get rid of those partial
aggregates completely. But that's an idea for 15.
So I think we should do 1 for now. Objections or better ideas?
At least, we know that it is too much.
What about moving the test as is in the TODO section with a comment, next
to the other one, for now?
I hesitated to suggest that before because of the above risks, but I was
very naively optimistic that it might pass because the test is not that
demanding.
Someone suggested having a "real-time" configure switch to enable/disable
time-sensitive tests.
--
Fabien.
Hello again,
I hoped we were done here but I realised that your check for 1-3 log
lines will not survive the harsh environment of the build farm.
Adding sleep(2) before the final doLog() confirms that. I had two
ideas:
So I think we should do 1 for now. Objections or better ideas?
At least, we know that it is too much.
I misread your point. You think that it should fail, but it is not
tried yet. I'm rather optimistic that it should not fail, but I'm okay
with averting the risk anyway.
What about moving the test as is in the TODO section with a comment, next to
the other one, for now?
I stand by this solution, which should allow us to get some data from the
field; see v18 attached. If all is green then the TODO could be removed
later.
--
Fabien.
Attachments:
pgbench-log-fix-18.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4aeccd93af..a54958930b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -343,6 +343,12 @@ typedef struct StatsData
SimpleStats lag;
} StatsData;
+/*
+ * For displaying Unix epoch timestamps, as some time functions may have
+ * another reference.
+ */
+pg_time_usec_t epoch_shift;
+
/*
* Struct to keep random state.
*/
@@ -3771,7 +3777,11 @@ executeMetaCommand(CState *st, pg_time_usec_t *now)
/*
* Print log entry after completing one transaction.
*
- * We print Unix-epoch timestamps in the log, so that entries can be
+ * The function behavior changes depending on sample_rate (a fraction of
+ * transactions is reported) and agg_interval (transactions are aggregated
+ * and reported once every agg_interval seconds).
+ *
+ * We use Unix-epoch timestamps in the log, so that entries can be
* correlated against other logs. On some platforms this could be obtained
* from the caller, but rather than get entangled with that, we just eat
* the cost of an extra syscall in all cases.
@@ -3781,7 +3791,7 @@ doLog(TState *thread, CState *st,
StatsData *agg, bool skipped, double latency, double lag)
{
FILE *logfile = thread->logfile;
- pg_time_usec_t now = pg_time_now();
+ pg_time_usec_t now = pg_time_now() + epoch_shift;
Assert(use_log);
@@ -3796,17 +3806,19 @@ doLog(TState *thread, CState *st,
/* should we aggregate the results or not? */
if (agg_interval > 0)
{
+ pg_time_usec_t next;
+
/*
* Loop until we reach the interval of the current moment, and print
* any empty intervals in between (this may happen with very low tps,
* e.g. --rate=0.1).
*/
- while (agg->start_time + agg_interval <= now)
+ while ((next = agg->start_time + agg_interval * INT64CONST(1000000)) <= now)
{
/* print aggregated report to logfile */
fprintf(logfile, INT64_FORMAT " " INT64_FORMAT " %.0f %.0f %.0f %.0f",
- agg->start_time,
+ agg->start_time / 1000000, /* seconds since Unix epoch */
agg->cnt,
agg->latency.sum,
agg->latency.sum2,
@@ -3825,7 +3837,7 @@ doLog(TState *thread, CState *st,
fputc('\n', logfile);
/* reset data and move to next interval */
- initStats(agg, agg->start_time + agg_interval);
+ initStats(agg, next);
}
/* accumulate the current transaction */
@@ -5458,7 +5470,8 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
if (progress_timestamp)
{
- snprintf(tbuf, sizeof(tbuf), "%.3f s", PG_TIME_GET_DOUBLE(now));
+ snprintf(tbuf, sizeof(tbuf), "%.3f s",
+ PG_TIME_GET_DOUBLE(now + epoch_shift));
}
else
{
@@ -5808,6 +5821,15 @@ main(int argc, char **argv)
char *env;
int exit_code = 0;
+ struct timeval tv;
+
+ /*
+ * Record difference between Unix epoch and high resolution timer's epoch.
+ * We'll use this for logging and aggregation with Unix epoch-based
+ * buckets.
+ */
+ gettimeofday(&tv, NULL);
+ epoch_shift = tv.tv_sec * INT64CONST(1000000) + tv.tv_usec - pg_time_now();
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
@@ -6637,7 +6659,14 @@ threadRun(void *arg)
thread->bench_start = start;
thread->throttle_trigger = start;
- initStats(&aggs, start);
+ /*
+ * The log format currently has Unix epoch timestamps with whole numbers
+ * of seconds. Round the first aggregate's start time down to the nearest
+ * Unix epoch second (the very first aggregate might really have started a
+ * fraction of a second later, but later aggregates are measured from the
+ * whole number time that is actually logged).
+ */
+ initStats(&aggs, (start + epoch_shift) / 1000000 * 1000000);
last = aggs;
/* loop till all clients have terminated */
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 3aa9d5d753..d674cc59a5 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw(time);
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1187,7 +1190,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1219,7 +1222,58 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
-# Run with sampling rate, 2 clients with 50 transactions each.
+TODO: {
+ #
+ # Test time-sensitive features on a light read-only transaction
+ #
+ local $TODO = "possibly unreliable on slow hosts or unlucky runs";
+
+ # Run with sampling rate, 2 clients with 50 transactions each.
+ #
+ # -T: bench duration, 2 seconds to exercise progress & logs
+ # -P: progress report
+ # --aggregate-interval: periodic aggregated logs
+ # --rate: schedule load
+ # --latency-limit: max delay, not deeply exercised
+ #
+ # note: the --rate behavior is probabilistic in nature.
+ # note: --progress-timestamp is not tested.
+ my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+ # The rate may result in an unlucky schedule which triggers
+ # an early exit, hence the loose bound.
+
+ # also, the delay may totally fail on very slow or overloaded hosts,
+ # valgrind runs...
+
+ ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+}
+
+# $nthreads threads, 2 seconds, but due to timing imprecision we might get
+# only 1 or as many as 3 progress reports per thread.
+# aggregate log format is:
+# unix_epoch_time #tx sum sum2 min max [sum sum2 min max [skipped]]
+# first series about latency; second about lag (--rate) ;
+# skipped only if --latency-limit is set.
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
Hi Fabien,
I committed the code change without the new TAP tests, because I
didn't want to leave the open item hanging any longer. As for the
test, ...
On Sat, Jul 10, 2021 at 9:36 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I hoped we were done here but I realised that your check for 1-3 log
lines will not survive the harsh environment of the build farm.
Adding sleep(2) before the final doLog() confirms that. I had two
ideas:
I misread your point. You think that it should fail, but it is not
tried yet. I'm rather optimistic that it should not fail, but I'm okay
with averting the risk anyway.
... I know it can fail, and your v18 didn't fix that, because...
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
^
|
... this range can be exceeded.
That's because extra aggregations are created based on doLog() reading
the clock after a transaction is finished, entirely independently of
the -T mechanism deciding when to stop the benchmark, and potentially
many seconds later in adverse conditions. As I mentioned, you can see
it fail with your own eyes if you hack the code like so:
if (agg_interval > 0)
{
+ /*
+ * XXX: simulate an overloaded raspberry pi swapping to a microsd
+ * card or other random delays as we can expect in the build farm
+ */
+ sleep(3);
/* log aggregated but not yet reported transactions */
doLog(thread, state, &aggs, false, 0, 0);
}
I stand by this solution which should allow to get some data from the
field, as v18 attached. If all is green then the TODO could be removed
later.
I suspect the number of aggregates could be made deterministic, as I
described in an earlier message. What do you think about doing
something like that first for the next release, before trying to add
assertions about the number of aggregates? I'm with you on the
importance of testing, but it seems better to start by making the
thing more testable.
Hello Thomas,
I committed the code change without the new TAP tests, because I
didn't want to leave the open item hanging any longer.
Ok. Good.
As for the test, ... [...]
Argh, so there are no tests that would have caught the regressions :-(
... I know it can fail, and your v18 didn't fix that, because...
+check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
... this range can be exceeded.
Indeed. I meant to move that one into the TODO section as well, not just
the previous call, so that all time-sensitive tests are fully ignored but
still reported, which would be enough for me.
I suspect the number of aggregates could be made deterministic, as I
described in an earlier message. What do you think about doing
something like that first for the next release, before trying to add
assertions about the number of aggregates?
I think that the last time I did something to get more deterministic
results in pgbench, which involved a few lines of hocus-pocus, the patch
got rejected :-)
An "ignored" test looked like a good compromise: it lets us see how things
are going in the farm and adds non-regression checks useful when developing
pgbench, without introducing behavioral changes.
I'm with you on the importance of testing, but it seems better to start
by making the thing more testable.
I'm used to my test patches being rejected, including ones that modify
pgbench behavior to make it more testable. Am I mad enough to retry? Maybe,
maybe not.
Attached the fully "ignored" version of the time features test as a patch.
--
Fabien.
Attachments:
pgbench-time-test-1.patch (text/x-diff)
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 3aa9d5d753..f2a1c861b2 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -8,6 +8,7 @@ use PostgresNode;
use TestLib;
use Test::More;
use Config;
+use Time::HiRes qw(time);
# start a pgbench specific server
my $node = get_new_node('main');
@@ -54,12 +55,14 @@ sub pgbench
push @cmd, @args;
+ my $start = time();
$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
+ my $stop = time();
# cleanup?
#unlink @filenames or die "cannot unlink files (@filenames): $!";
- return;
+ return $stop - $start;
}
# tablespace for testing, because partitioned tables cannot use pg_default
@@ -1187,7 +1190,7 @@ sub check_pgbench_logs
# $prefix is simple enough, thus does not need escaping
my @logs = list_files($dir, qr{^$prefix\..*$});
- ok(@logs == $nb, "number of log files");
+ ok(@logs == $nb, "number of log files (@logs)");
ok(grep(/\/$prefix\.\d+(\.\d+)?$/, @logs) == $nb, "file name format");
my $log_number = 0;
@@ -1219,7 +1222,58 @@ sub check_pgbench_logs
my $bdir = $node->basedir;
-# Run with sampling rate, 2 clients with 50 transactions each.
+TODO: {
+ #
+ # Test time-sensitive features on a light read-only transaction
+ #
+ local $TODO = "possibly unreliable on slow hosts or unlucky runs";
+
+ # Run with sampling rate, 2 clients with 50 transactions each.
+ #
+ # -T: bench duration, 2 seconds to exercise progress & logs
+ # -P: progress report
+ # --aggregate-interval: periodic aggregated logs
+ # --rate: schedule load
+ # --latency-limit: max delay, not deeply exercised
+ #
+ # note: the --rate behavior is probabilistic in nature.
+ # note: --progress-timestamp is not tested.
+ my $delay = pgbench(
+ '-T 2 -P 1 -l --aggregate-interval=1 -S -b se@2'
+ . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
+ . ' -c 3 -r',
+ 0,
+ [ qr{type: multiple},
+ qr{clients: 3},
+ qr{threads: $nthreads},
+ qr{duration: 2 s},
+ qr{script 1: .* select only},
+ qr{script 2: .* select only},
+ qr{statement latencies in milliseconds},
+ qr{FROM pgbench_accounts} ],
+ [ qr{vacuum}, qr{progress: 1\b} ],
+ 'pgbench progress', undef,
+ "--log-prefix=$bdir/001_pgbench_log_1");
+
+ # The rate may result in an unlucky schedule which triggers
+ # an early exit, hence the loose bound.
+
+ # also, the delay may totally fail on very slow or overloaded hosts,
+ # valgrind runs...
+
+ ok(1.5 < $delay && $delay < 2.5, "-T 2 run around 2 seconds");
+
+ # $nthreads threads, 2 seconds, but due to timing imprecision we might get
+ # only 1 or as many as 3 progress reports per thread.
+ # aggregate log format is:
+ # unix_epoch_time #tx sum sum2 min max [sum sum2 min max [skipped]]
+ # first series about latency; second about lag (--rate) ;
+ # skipped only if --latency-limit is set.
+ check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
+ qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
+}
+
+# with sampling rate, 2 clients with 50 tx each
pgbench(
"-n -S -t 50 -c 2 --log --sampling-rate=0.5", 0,
[ qr{select only}, qr{processed: 100/100} ], [qr{^$}],
On 11 Jul 2021, at 15:07, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Attached the fully "ignored" version of the time features test as a patch.
This version of the patch is failing to apply on top of HEAD; can you
please submit a rebased version?
--
Daniel Gustafsson https://vmware.com/
On 4 Nov 2021, at 13:38, Daniel Gustafsson <daniel@yesql.se> wrote:
On 11 Jul 2021, at 15:07, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Attached the fully "ignored" version of the time features test as a patch.
This version of the patch is failing to apply on top of HEAD; can you
please submit a rebased version?
I'm marking this patch Returned with Feedback, please feel free to resubmit
this when there is an updated version of the patch available.
--
Daniel Gustafsson https://vmware.com/