pgbench bug candidate: negative "initial connection time"

Started by kuroda.hayato@fujitsu.comover 4 years ago27 messages

kuroda.hayato@fujitsu.com

over 4 years ago

1 attachment(s)

Hi Hackers,

I played pgbench with wrong parameters, and I found bug-candidate.
The latest commit in my source is 3a09d75.

1. Do initdb and start.
2. Initialize schema and data with "scale factor" = 1.
3. execute following command many times:

$ pgbench -c 101 -j 10 postgres

Then, sometimes the negative " initial connection time" was returned.
Lateyncy average is also strange.

```
$ pgbench -c 101 -j 10 postgres
starting vacuum...end.
pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: sorry, too many clients already
pgbench (PostgreSQL) 14.0
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 101
number of threads: 10
number of transactions per client: 10
number of transactions actually processed: 910/1010
latency average = 41387686.662 ms
initial connection time = -372896921.586 ms
tps = 0.002440 (without initial connection time)
```

I sought pgbench.c and found a reason.
When a thread failed to get some connections, they do not fill any values to thread->bench_start in threadRun().
And if the failure is caused in the final thread (this means threads[nthreads - 1]->bench_start is zero),
the following if-statement sets bench_start to zero.

```
6494 /* first recorded benchmarking start time */
6495 if (bench_start == 0 || thread->bench_start < bench_start)
6496 bench_start = thread->bench_start;
```

The wrong bench_start propagates to printResult() and then some invalid values are appered.

```
6509 printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
6510 bench_start - start_time, latency_late);
```

I cannot distinguish whether we have to fix it, but I attache the patch.
This simply ignores a result when therad->bench_start is zero.

How do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

ignore_failed_thread.patchapplication/octet-stream; name=ignore_failed_thread.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index dc84b7b9b7..0694893d0b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6483,6 +6483,10 @@ main(int argc, char **argv)
 			if (thread->state[j].state == CSTATE_ABORTED)
 				exit_code = 2;
 
+		/* skip if the thread faild to get connection */
+		if (thread->bench_start == 0)
+			continue;
+
 		/* aggregate thread level stats */
 		mergeSimpleStats(&stats.latency, &thread->stats.latency);
 		mergeSimpleStats(&stats.lag, &thread->stats.lag);

Fabien COELHO

coelho@cri.ensmp.fr

over 4 years ago

In reply to: kuroda.hayato@fujitsu.com (#1)

Re: pgbench bug candidate: negative "initial connection time"

Hello Hayato-san,

I played pgbench with wrong parameters,

That's good:-)

and I found bug-candidate.

1. Do initdb and start.
2. Initialize schema and data with "scale factor" = 1.
3. execute following command many times:

$ pgbench -c 101 -j 10 postgres

Then, sometimes the negative " initial connection time" was returned.
Lateyncy average is also strange.

```
$ pgbench -c 101 -j 10 postgres
starting vacuum...end.
pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: sorry, too many clients already

Hmmm.

AFAICR there was a decision to generate a report even if something went
very wrong, in this case some client could not connect, so some values
are not initialized, hence the absurd figures, as you show below.

Maybe we should revisit this decision.

initial connection time = -372896921.586 ms

I sought pgbench.c and found a reason.

When a thread failed to get some connections, they do not fill any values to thread->bench_start in threadRun().
And if the failure is caused in the final thread (this means threads[nthreads - 1]->bench_start is zero),
the following if-statement sets bench_start to zero.

I cannot distinguish whether we have to fix it, but I attache the patch.
This simply ignores a result when therad->bench_start is zero.

How do you think?

Hmmm. Possibly. Another option could be not to report anything after some
errors. I'm not sure, because it would depend on the use case. I guess the
command returned an error status as well.

I'm going to give it some thoughts.

--
Fabien.

kuroda.hayato@fujitsu.com

over 4 years ago

In reply to: Fabien COELHO (#2)

RE: pgbench bug candidate: negative "initial connection time"

Dear Fabien,

Thank you for replying!

Hmmm. Possibly. Another option could be not to report anything after some
errors. I'm not sure, because it would depend on the use case. I guess the
command returned an error status as well.

I did not know any use cases and decisions , but I vote to report nothing when error occurs.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Yugo NAGATA

nagata@sraoss.co.jp

over 4 years ago

In reply to: kuroda.hayato@fujitsu.com (#1)

Re: pgbench bug candidate: negative "initial connection time"

Hello Kuroda-san,

On Fri, 11 Jun 2021 08:58:45 +0000
"kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com> wrote:

Hi Hackers,

I played pgbench with wrong parameters, and I found bug-candidate.
The latest commit in my source is 3a09d75.

1. Do initdb and start.
2. Initialize schema and data with "scale factor" = 1.
3. execute following command many times:

$ pgbench -c 101 -j 10 postgres

Then, sometimes the negative " initial connection time" was returned.
Lateyncy average is also strange.

```
$ pgbench -c 101 -j 10 postgres
starting vacuum...end.
pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: sorry, too many clients already
pgbench (PostgreSQL) 14.0
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 101
number of threads: 10
number of transactions per client: 10
number of transactions actually processed: 910/1010
latency average = 41387686.662 ms
initial connection time = -372896921.586 ms
tps = 0.002440 (without initial connection time)
```

I sought pgbench.c and found a reason.
When a thread failed to get some connections, they do not fill any values to thread->bench_start in threadRun().
And if the failure is caused in the final thread (this means threads[nthreads - 1]->bench_start is zero),
the following if-statement sets bench_start to zero.

```
6494 /* first recorded benchmarking start time */
6495 if (bench_start == 0 || thread->bench_start < bench_start)
6496 bench_start = thread->bench_start;
```

The wrong bench_start propagates to printResult() and then some invalid values are appered.

```
6509 printResults(&stats, pg_time_now() - bench_start, conn_total_duration,
6510 bench_start - start_time, latency_late);
```

I cannot distinguish whether we have to fix it, but I attache the patch.
This simply ignores a result when therad->bench_start is zero.

 +		/* skip if the thread faild to get connection */
+		if (thread->bench_start == 0)
+			continue;

It detects if a thread failed to get the initial connection by thread->bench_start == 0, but this assumes the initial value is zero. For ensuring this, I think it is better to initialize it in an early state, for example like this.

@@ -6419,6 +6419,7 @@ main(int argc, char **argv)
initRandomState(&thread->ts_throttle_rs);
initRandomState(&thread->ts_sample_rs);
thread->logfile = NULL; /* filled in later */
+ thread->bench_start = 0;
thread->latency_late = 0;
initStats(&thread->stats, 0)

typo: faild -> failed

Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>

Yugo NAGATA

nagata@sraoss.co.jp

over 4 years ago

In reply to: kuroda.hayato@fujitsu.com (#3)

1 attachment(s)

Re: pgbench bug candidate: negative "initial connection time"

On Mon, 14 Jun 2021 00:42:12 +0000
"kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com> wrote:

Dear Fabien,

Thank you for replying!

Hmmm. Possibly. Another option could be not to report anything after some
errors. I'm not sure, because it would depend on the use case. I guess the
command returned an error status as well.

I did not know any use cases and decisions , but I vote to report nothing when error occurs.

I would prefer to abort the thread whose connection got an error and report
results for other threads, as handled when doConnect fails in CSTATE_START_TX
state.

In this case, we have to set the state to CSTATE_ABORT before going to 'done'
as fixed in the attached patch, in order to ensure that exit status is 2 and the
result reports "pgbench: fatal: Run was aborted; the above results are incomplete."

Otherwise, if we want pgbench to exit immediately when a connection error occurs,
we have tocall exit(1) to ensure the exit code is 1, of course. Anyway, it is wrong
that thecurrent pgbench exit successfully with exit code 0 when doConnnect fails.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

Attachments:

pgbench_connect_abort.patchtext/x-diff; name=pgbench_connect_abort.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..35b67265c7 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6581,6 +6581,7 @@ threadRun(void *arg)
 				 * than coldly exiting with an error message.
 				 */
 				THREAD_BARRIER_WAIT(&barrier);
+				state[i].state = CSTATE_ABORTED;
 				goto done;
 			}
 		}
@@ -6647,6 +6648,7 @@ threadRun(void *arg)
 				if (sock < 0)
 				{
 					pg_log_error("invalid socket: %s", PQerrorMessage(st->con));
+					st->state = CSTATE_ABORTED;
 					goto done;
 				}
 
@@ -6708,6 +6710,7 @@ threadRun(void *arg)
 				}
 				/* must be something wrong */
 				pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+				state[0].state = CSTATE_ABORTED;
 				goto done;
 			}
 		}
@@ -6733,6 +6736,7 @@ threadRun(void *arg)
 				if (sock < 0)
 				{
 					pg_log_error("invalid socket: %s", PQerrorMessage(st->con));
+					st->state = CSTATE_ABORTED;
 					goto done;
 				}

Fabien COELHO

coelho@cri.ensmp.fr

over 4 years ago

In reply to: Yugo NAGATA (#5)

Re: pgbench bug candidate: negative "initial connection time"

Hmmm. Possibly. Another option could be not to report anything after some
errors. I'm not sure, because it would depend on the use case. I guess the
command returned an error status as well.

I did not know any use cases and decisions , but I vote to report nothing when error occurs.

I would prefer to abort the thread whose connection got an error and report
results for other threads, as handled when doConnect fails in CSTATE_START_TX
state.

It is unclear to me whether it makes much sense to report performance when
things go wrong. At least when a one connection per client bench is run
ISTM that it should not proceed, because the bench could not even start
as prescribe. When connection break while the bench has already started,
maybe it makes more sense to proceed, although I guess that maybe
reattempting connections would make also sense in such case.

In this case, we have to set the state to CSTATE_ABORT before going to 'done'
as fixed in the attached patch, in order to ensure that exit status is 2 and the
result reports "pgbench: fatal: Run was aborted; the above results are incomplete."

Hmmm. I agree that at least reporting that there was an issue is a good
idea.

Otherwise, if we want pgbench to exit immediately when a connection error occurs,
we have tocall exit(1) to ensure the exit code is 1, of course. Anyway, it is wrong
that thecurrent pgbench exit successfully with exit code 0 when doConnnect fails.

Indeed, I can only agree on this one.

--
Fabien.

Yugo NAGATA

nagata@sraoss.co.jp

over 4 years ago

In reply to: Fabien COELHO (#6)

1 attachment(s)

Re: pgbench bug candidate: negative "initial connection time"

Hello Fabien,

On Mon, 14 Jun 2021 11:30:14 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hmmm. Possibly. Another option could be not to report anything after some
errors. I'm not sure, because it would depend on the use case. I guess the
command returned an error status as well.

I did not know any use cases and decisions , but I vote to report nothing when error occurs.

I would prefer to abort the thread whose connection got an error and report
results for other threads, as handled when doConnect fails in CSTATE_START_TX
state.

It is unclear to me whether it makes much sense to report performance when
things go wrong. At least when a one connection per client bench is run
ISTM that it should not proceed, because the bench could not even start
as prescribe.

I agreed that when an initial connections fails we cannot start a bench
in the condition that the user wants and that we should stop early to let
the user know it and check the conf.

I attached a patch, which is a fusion of my previous patch that changes the
state to CSTATE_ABORT when the socket get failure during the bench, and a
part of your patch attached in [1]/messages/by-id/alpine.DEB.2.22.394.2106141011100.1338009@pseudo that exits for initial failures.

[1]: /messages/by-id/alpine.DEB.2.22.394.2106141011100.1338009@pseudo

When connection break while the bench has already started,
maybe it makes more sense to proceed,

The result would be incomplete also in this case. However, the reason why
it is worth to proceed is that such information is still useful for users,
or we don't want to waste the bench that has already started?

although I guess that maybe
reattempting connections would make also sense in such case.

This might become possible after pgbench gets the feature to retry in deadlock
or serialization errors. I am working on rebase of the patch [2]/messages/by-id/20210524112910.444fbfdfbff747bd3b9720ee@sraoss.co.jp and I will
submit this in a few days.

[2]: /messages/by-id/20210524112910.444fbfdfbff747bd3b9720ee@sraoss.co.jp

--
Yugo NAGATA <nagata@sraoss.co.jp>

Attachments:

pgbench_connect_abort-2.patchtext/x-diff; name=pgbench_connect_abort-2.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..2d9080c444 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3171,6 +3171,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 					if ((st->con = doConnect()) == NULL)
 					{
+						/* as the bench is already running, we do not abort */
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
@@ -4405,7 +4406,10 @@ runInitSteps(const char *initialize_steps)
 	initPQExpBuffer(&stats);
 
 	if ((con = doConnect()) == NULL)
+	{
+		pg_log_fatal("connection for initialization failed");
 		exit(1);
+	}
 
 	setup_cancel_handler(NULL);
 	SetCancelConn(con);
@@ -6332,7 +6336,10 @@ main(int argc, char **argv)
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
+	{
+		pg_log_fatal("setup connection failed");
 		exit(1);
+	}
 
 	pg_log_debug("pghost: %s pgport: %s nclients: %d %s: %d dbName: %s",
 				 PQhost(con), PQport(con), nclients,
@@ -6437,7 +6444,10 @@ main(int argc, char **argv)
 
 	errno = THREAD_BARRIER_INIT(&barrier, nthreads);
 	if (errno != 0)
+	{
 		pg_log_fatal("could not initialize barrier: %m");
+		exit(1);
+	}
 
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
@@ -6548,7 +6558,7 @@ threadRun(void *arg)
 		if (thread->logfile == NULL)
 		{
 			pg_log_fatal("could not open logfile \"%s\": %m", logpath);
-			goto done;
+			exit(1);
 		}
 	}
 
@@ -6572,16 +6582,10 @@ threadRun(void *arg)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 			{
-				/*
-				 * On connection failure, we meet the barrier here in place of
-				 * GO before proceeding to the "done" path which will cleanup,
-				 * so as to avoid locking the process.
-				 *
-				 * It is unclear whether it is worth doing anything rather
-				 * than coldly exiting with an error message.
-				 */
-				THREAD_BARRIER_WAIT(&barrier);
-				goto done;
+				/* coldly abort on connection failure */
+				pg_log_fatal("cannot create connection for thread %d client %d",
+							 thread->tid, i);
+				exit(1);
 			}
 		}
 
@@ -6647,6 +6651,7 @@ threadRun(void *arg)
 				if (sock < 0)
 				{
 					pg_log_error("invalid socket: %s", PQerrorMessage(st->con));
+					st->state = CSTATE_ABORTED;
 					goto done;
 				}
 
@@ -6707,7 +6712,8 @@ threadRun(void *arg)
 					continue;
 				}
 				/* must be something wrong */
-				pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+				pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
+				state[0].state = CSTATE_ABORTED;
 				goto done;
 			}
 		}
@@ -6733,6 +6739,7 @@ threadRun(void *arg)
 				if (sock < 0)
 				{
 					pg_log_error("invalid socket: %s", PQerrorMessage(st->con));
+					st->state = CSTATE_ABORTED;
 					goto done;
 				}

Fabien COELHO

coelho@cri.ensmp.fr

over 4 years ago

In reply to: Yugo NAGATA (#7)

Re: pgbench bug candidate: negative "initial connection time"

Hello Yugo-san,

When connection break while the bench has already started,
maybe it makes more sense to proceed,

The result would be incomplete also in this case. However, the reason why
it is worth to proceed is that such information is still useful for users,
or we don't want to waste the bench that has already started?

Hmmm. It depends on what the user is testing. If one is interested in
client resilience under errors, the bench should probably attempt a
reconnect. If one is interested in best performance when all is well,
then clearly something is amiss and there is no point to go on.

although I guess that maybe reattempting connections would make also
sense in such case.

This might become possible after pgbench gets the feature to retry in deadlock
or serialization errors.

Yes, I agree that part of the needed infrastructure would be in place for
that. As reconnecting is already in place under -c, so possibly it is just
a matter of switching between states with some care.

I am working on rebase of the patch [2] and I will submit this in a few
days.

Ok. Very good, I look forward to your submission! I'll be sure to look at
it.

--
Fabien.

Yugo NAGATA

nagata@sraoss.co.jp

over 4 years ago

In reply to: Fabien COELHO (#8)

Re: pgbench bug candidate: negative "initial connection time"

On Wed, 16 Jun 2021 20:25:31 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Yugo-san,

When connection break while the bench has already started,
maybe it makes more sense to proceed,

The result would be incomplete also in this case. However, the reason why
it is worth to proceed is that such information is still useful for users,
or we don't want to waste the bench that has already started?

Hmmm. It depends on what the user is testing. If one is interested in
client resilience under errors, the bench should probably attempt a
reconnect. If one is interested in best performance when all is well,
then clearly something is amiss and there is no point to go on.

Agreed. After --max-tries options is implemented on pgbench, we would be
able to add a new feature to allow users to choose if we try to reconnect
or not. However, we don't have it yet for now, so we should just abort
the client and report the abortion at the end of the bench when a connection
or socket error occurs during the bench, as same the existing behaviour.

By the way, the issue of inital connection erros reported in this thread
will be fixed by the patch attached in my previous post (a major part are
written by you :-) ). Is this acceptable for you?

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

#10

Fabien COELHO

coelho@cri.ensmp.fr

over 4 years ago

In reply to: Yugo NAGATA (#9)

Re: pgbench bug candidate: negative "initial connection time"

Hello Yugo-san,

By the way, the issue of inital connection erros reported in this thread
will be fixed by the patch attached in my previous post (a major part are
written by you :-)

That does not, on its own, ensure that it is bug free:-)

). Is this acceptable for you?

I disagree on two counts:

First, thread[0] should not appear.

Second, currently the *only* function to change the client state is
advanceConnectionState, so it can be checked there and any bug is only
there. We had issues before when several functions where doing updates,
and it was a mess to understand what was going on. I really want that it
stays that way, so I disagree with setting the state to ABORTED from
threadRun. Moreover I do not see that it brings a feature, so ISTM that it
is not an actual issue not to do it?

--
Fabien.

#11

Yugo NAGATA

nagata@sraoss.co.jp

over 4 years ago

In reply to: Fabien COELHO (#10)

1 attachment(s)

Re: pgbench bug candidate: negative "initial connection time"

Hello Fabien,

On Thu, 17 Jun 2021 10:37:05 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

). Is this acceptable for you?

I disagree on two counts:

First, thread[0] should not appear.

Second, currently the *only* function to change the client state is
advanceConnectionState, so it can be checked there and any bug is only
there. We had issues before when several functions where doing updates,
and it was a mess to understand what was going on. I really want that it
stays that way, so I disagree with setting the state to ABORTED from
threadRun. Moreover I do not see that it brings a feature, so ISTM that it
is not an actual issue not to do it?

Ok. I gave up to change the state in threadRun. Instead, I changed the
condition at the end of bench, which enables to report abortion due to
socket errors.

+@@ -6480,7 +6490,7 @@ main(int argc, char **argv)
+ #endif							/* ENABLE_THREAD_SAFETY */
+ 
+ 		for (int j = 0; j < thread->nstate; j++)
+-			if (thread->state[j].state == CSTATE_ABORTED)
++			if (thread->state[j].state != CSTATE_FINISHED)
+ 				exit_code = 2;
+ 
+ 		/* aggregate thread level stats */

Does this make sense?

--
Yugo NAGATA <nagata@sraoss.co.jp>

Attachments:

pgbench_connect_abort-3.patchtext/x-diff; name=pgbench_connect_abort-3.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..901f25aab7 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3171,6 +3171,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 					if ((st->con = doConnect()) == NULL)
 					{
+						/* as the bench is already running, we do not abort */
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
@@ -4405,7 +4406,10 @@ runInitSteps(const char *initialize_steps)
 	initPQExpBuffer(&stats);
 
 	if ((con = doConnect()) == NULL)
+	{
+		pg_log_fatal("connection for initialization failed");
 		exit(1);
+	}
 
 	setup_cancel_handler(NULL);
 	SetCancelConn(con);
@@ -6332,7 +6336,10 @@ main(int argc, char **argv)
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
+	{
+		pg_log_fatal("setup connection failed");
 		exit(1);
+	}
 
 	pg_log_debug("pghost: %s pgport: %s nclients: %d %s: %d dbName: %s",
 				 PQhost(con), PQport(con), nclients,
@@ -6437,7 +6444,10 @@ main(int argc, char **argv)
 
 	errno = THREAD_BARRIER_INIT(&barrier, nthreads);
 	if (errno != 0)
+	{
 		pg_log_fatal("could not initialize barrier: %m");
+		exit(1);
+	}
 
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
@@ -6480,7 +6490,7 @@ main(int argc, char **argv)
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
-			if (thread->state[j].state == CSTATE_ABORTED)
+			if (thread->state[j].state != CSTATE_FINISHED)
 				exit_code = 2;
 
 		/* aggregate thread level stats */
@@ -6548,7 +6558,7 @@ threadRun(void *arg)
 		if (thread->logfile == NULL)
 		{
 			pg_log_fatal("could not open logfile \"%s\": %m", logpath);
-			goto done;
+			exit(1);
 		}
 	}
 
@@ -6572,16 +6582,10 @@ threadRun(void *arg)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 			{
-				/*
-				 * On connection failure, we meet the barrier here in place of
-				 * GO before proceeding to the "done" path which will cleanup,
-				 * so as to avoid locking the process.
-				 *
-				 * It is unclear whether it is worth doing anything rather
-				 * than coldly exiting with an error message.
-				 */
-				THREAD_BARRIER_WAIT(&barrier);
-				goto done;
+				/* coldly abort on connection failure */
+				pg_log_fatal("cannot create connection for thread %d client %d",
+							 thread->tid, i);
+				exit(1);
 			}
 		}
 
@@ -6707,7 +6711,7 @@ threadRun(void *arg)
 					continue;
 				}
 				/* must be something wrong */
-				pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+				pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
 				goto done;
 			}
 		}

#12

Fabien COELHO

coelho@cri.ensmp.fr

over 4 years ago

In reply to: Yugo NAGATA (#11)

Re: pgbench bug candidate: negative "initial connection time"

Second, currently the *only* function to change the client state is
advanceConnectionState, so it can be checked there and any bug is only
there. We had issues before when several functions where doing updates,
and it was a mess to understand what was going on. I really want that it
stays that way, so I disagree with setting the state to ABORTED from
threadRun. Moreover I do not see that it brings a feature, so ISTM that it
is not an actual issue not to do it?

Ok. I gave up to change the state in threadRun. Instead, I changed the
condition at the end of bench, which enables to report abortion due to
socket errors.
+@@ -6480,7 +6490,7 @@ main(int argc, char **argv)
+ #endif							/* ENABLE_THREAD_SAFETY */
+
+ 		for (int j = 0; j < thread->nstate; j++)
+-			if (thread->state[j].state == CSTATE_ABORTED)
++			if (thread->state[j].state != CSTATE_FINISHED)
+ 				exit_code = 2;
+
+ 		/* aggregate thread level stats */
Does this make sense?

Yes, definitely.

--
Fabien.

#13

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 4 years ago

In reply to: Fabien COELHO (#12)

Re: pgbench bug candidate: negative "initial connection time"

At Thu, 17 Jun 2021 11:52:04 +0200 (CEST), Fabien COELHO <coelho@cri.ensmp.fr> wrote in

Ok. I gave up to change the state in threadRun. Instead, I changed the
condition at the end of bench, which enables to report abortion due to
socket errors.
+@@ -6480,7 +6490,7 @@ main(int argc, char **argv)
+ #endif							/* ENABLE_THREAD_SAFETY */
+
+ 		for (int j = 0; j < thread->nstate; j++)
+-			if (thread->state[j].state == CSTATE_ABORTED)
++			if (thread->state[j].state != CSTATE_FINISHED)
+ 				exit_code = 2;
+
+ 		/* aggregate thread level stats */
Does this make sense?
Yes, definitely.

I sought for a simple way to enforce all client finishes with the
states abort or finished but I didn't find. So +1 for the
change. However, as a matter of style. if we touch the code maybe we
want to enclose the if statement.

Doing this means we regard any state other than CSTATE_FINISHED as
aborted. So, the current goto's to done in threadRun are effectively
aborting a part or the all clients running on the thread. So for
example the following place:

pgbench.c:6713

/* must be something wrong */
pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
goto done;

Should say such like "thread %d aborted: %s() failed: ...".

I'm not sure what is the consensus here about the case where aborted
client can recoonect to the same server. This patch doesn't that. However, I think causing reconnection needs more work than accepted as a fix while beta.

====

+ /* as the bench is already running, we do not abort */
pg_log_error("client %d aborted while establishing connection", st->id);
st->state = CSTATE_ABORTED;

The comment looks strange that it is saying "we don't abort" while
setting the state to CSTATE_ABORT then showing "client %d aborted".

====
   if ((con = doConnect()) == NULL)
+  {
+    pg_log_fatal("connection for initialization failed");
     exit(1);

doConnect() prints an error emssage given from libpq. So The
additional messaget is redundant.

====
errno = THREAD_BARRIER_INIT(&barrier, nthreads);
if (errno != 0)
+ {
pg_log_fatal("could not initialize barrier: %m");
+ exit(1);

This is a run-time error. Maybe we should return 2 in that case.

===
if (thread->logfile == NULL)
{
pg_log_fatal("could not open logfile \"%s\": %m", logpath);
- goto done;
+ exit(1);

Maybe we should exit with 2 this case. If we exit this case, we might
also want to exit when fclose() fails. (Currently the error of
fclose() is ignored.)

===
+        /* coldly abort on connection failure */
+        pg_log_fatal("cannot create connection for thread %d client %d",
+               thread->tid, i);
+        exit(1);

It seems to me that the "thread %d client %d(not clinent id but the
client index within the thread)" doesn't make sense to users. Even if
we showed a message like that, it should show only the global clinent
id (cstate->id).

I think that we should return with 2 here but we return with 1
in another place for the same reason..

===
         /* must be something wrong */
-        pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
         goto done;

Why doesn't a fatal error cause an immediate exit? (And if we change
this to fatal, we also need to change similar errors in the same
function to fatal.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#14

Fabien COELHO

coelho@cri.ensmp.fr

over 4 years ago

In reply to: Kyotaro Horiguchi (#13)

Re: pgbench bug candidate: negative "initial connection time"

Hello,

Doing this means we regard any state other than CSTATE_FINISHED as
aborted. So, the current goto's to done in threadRun are effectively
aborting a part or the all clients running on the thread. So for
example the following place:

pgbench.c:6713

/* must be something wrong */
pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
goto done;

Should say such like "thread %d aborted: %s() failed: ...".

Yep, possibly.

I'm not sure what is the consensus here about the case where aborted
client can recoonect to the same server. This patch doesn't that.

Non trivial future work.

However, I think causing reconnection needs more work than accepted as
a fix while beta.

It is an entire project which requires some thinking about.

+ /* as the bench is already running, we do not abort */
pg_log_error("client %d aborted while establishing connection", st->id);
st->state = CSTATE_ABORTED;

The comment looks strange that it is saying "we don't abort" while
setting the state to CSTATE_ABORT then showing "client %d aborted".

Indeed. There is abort from the client, which just means that it stops
sending transaction, and abort for the process, which is basically
"exit(1)".

====
if ((con = doConnect()) == NULL)
+  {
+    pg_log_fatal("connection for initialization failed");
exit(1);
doConnect() prints an error emssage given from libpq. So The
additional messaget is redundant.

This is not the same for me: doConnect may fail but we may decide to go
retry the connection later, or just one client may be disconnected but
others are going on, which is different from deciding to stop the whole
program, which deserves a message on its own.

====
errno = THREAD_BARRIER_INIT(&barrier, nthreads);
if (errno != 0)
+ {
pg_log_fatal("could not initialize barrier: %m");
+ exit(1);

This is a run-time error. Maybe we should return 2 in that case.

Hmmm. Yep.

===
if (thread->logfile == NULL)
{
pg_log_fatal("could not open logfile \"%s\": %m", logpath);
- goto done;
+ exit(1);

Maybe we should exit with 2 this case.

Yep.

If we exit this case, we might also want to exit when fclose() fails.
(Currently the error of fclose() is ignored.)

Not sure. I'd let it at that for now.

===
+        /* coldly abort on connection failure */
+        pg_log_fatal("cannot create connection for thread %d client %d",
+               thread->tid, i);
+        exit(1);
It seems to me that the "thread %d client %d(not clinent id but the
client index within the thread)" doesn't make sense to users. Even if
we showed a message like that, it should show only the global clinent
id (cstate->id).

This is not obvious to me. I think that we should be homogeneous with what
is already done around.

I think that we should return with 2 here but we return with 1
in another place for the same reason..

Possibly.

/* must be something wrong */
-        pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
goto done;

Why doesn't a fatal error cause an immediate exit?

Good point. I do not know, but I would expect it to be the case, and
AFAICR it does not.

(And if we change this to fatal, we also need to change similar errors
in the same function to fatal.)

Possibly.

I'll look into it over the week-end.

Thanks for the feedback!

--
Fabien.

#15

Fabien COELHO

coelho@cri.ensmp.fr

over 4 years ago

In reply to: Fabien COELHO (#14)

1 attachment(s)

Re: pgbench bug candidate: negative "initial connection time"

/* must be something wrong */
pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
goto done;

Should say such like "thread %d aborted: %s() failed: ...".

After having a lookg, there are already plenty such cases. I'd say not to
change anything for beta, and think of it for the next round.

====
errno = THREAD_BARRIER_INIT(&barrier, nthreads);
if (errno != 0)
+ {
pg_log_fatal("could not initialize barrier: %m");
+ exit(1);

This is a run-time error. Maybe we should return 2 in that case.

I think that you are right, but there are plenty such places where exit
should be 2 instead of 1 if the doc is followed:

"""Errors during the run such as database errors or problems in the script
will result in exit status 2."""

My beta take is to let these as they are, i.e. pretty inconsistent all
over pgbench, and schedule a cleanup on the next round.

===
if (thread->logfile == NULL)
{
pg_log_fatal("could not open logfile \"%s\": %m", logpath);
- goto done;
+ exit(1);

Maybe we should exit with 2 this case.

Yep.

The bench is not even started, this is not really run time yet, 1 seems
ok. The failure may be due to a typo in the path which comes from the
user.

If we exit this case, we might also want to exit when fclose() fails.
(Currently the error of fclose() is ignored.)

Not sure. I'd let it at that for now.

I stand by this.

+        /* coldly abort on connection failure */
+        pg_log_fatal("cannot create connection for thread %d client %d",
+               thread->tid, i);
+        exit(1);
It seems to me that the "thread %d client %d(not clinent id but the
client index within the thread)" doesn't make sense to users. Even if
we showed a message like that, it should show only the global clinent
id (cstate->id).
This is not obvious to me. I think that we should be homogeneous with what is
already done around.

ok for only giving the global client id.

I think that we should return with 2 here but we return with 1
in another place for the same reason..

Possibly.

Again for this one, the bench has not really started, so 1 seems fine.

/* must be something wrong */
-        pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
goto done;
Why doesn't a fatal error cause an immediate exit?
Good point. I do not know, but I would expect it to be the case, and AFAICR
it does not.

(And if we change this to fatal, we also need to change similar errors in
the same function to fatal.)

Possibly.

On second look, I think that error is fine, indeed we do not stop the
process, so "fatal" it is not;

Attached Yugo-san patch with some updates discussed in the previous mails,
so as to move things along.

--
Fabien.

Attachments:

pgbench-conn-abort-4.patchtext/x-diff; name=pgbench-conn-abort-4.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..202507f5d5 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3171,6 +3171,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 					if ((st->con = doConnect()) == NULL)
 					{
+						/* as the bench is already running, we do not abort the process */
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
@@ -4405,7 +4406,10 @@ runInitSteps(const char *initialize_steps)
 	initPQExpBuffer(&stats);
 
 	if ((con = doConnect()) == NULL)
+	{
+		pg_log_fatal("connection for initialization failed");
 		exit(1);
+	}
 
 	setup_cancel_handler(NULL);
 	SetCancelConn(con);
@@ -6332,7 +6336,10 @@ main(int argc, char **argv)
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
+	{
+		pg_log_fatal("setup connection failed");
 		exit(1);
+	}
 
 	pg_log_debug("pghost: %s pgport: %s nclients: %d %s: %d dbName: %s",
 				 PQhost(con), PQport(con), nclients,
@@ -6437,7 +6444,10 @@ main(int argc, char **argv)
 
 	errno = THREAD_BARRIER_INIT(&barrier, nthreads);
 	if (errno != 0)
+	{
 		pg_log_fatal("could not initialize barrier: %m");
+		exit(1);
+	}
 
 #ifdef ENABLE_THREAD_SAFETY
 	/* start all threads but thread 0 which is executed directly later */
@@ -6480,7 +6490,7 @@ main(int argc, char **argv)
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
-			if (thread->state[j].state == CSTATE_ABORTED)
+			if (thread->state[j].state != CSTATE_FINISHED)
 				exit_code = 2;
 
 		/* aggregate thread level stats */
@@ -6548,7 +6558,7 @@ threadRun(void *arg)
 		if (thread->logfile == NULL)
 		{
 			pg_log_fatal("could not open logfile \"%s\": %m", logpath);
-			goto done;
+			exit(1);
 		}
 	}
 
@@ -6572,16 +6582,10 @@ threadRun(void *arg)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 			{
-				/*
-				 * On connection failure, we meet the barrier here in place of
-				 * GO before proceeding to the "done" path which will cleanup,
-				 * so as to avoid locking the process.
-				 *
-				 * It is unclear whether it is worth doing anything rather
-				 * than coldly exiting with an error message.
-				 */
-				THREAD_BARRIER_WAIT(&barrier);
-				goto done;
+				/* coldly abort on initial connection failure */
+				pg_log_fatal("cannot create connection for client %d",
+							 state[i].id);
+				exit(1);
 			}
 		}
 
@@ -6707,7 +6711,7 @@ threadRun(void *arg)
 					continue;
 				}
 				/* must be something wrong */
-				pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+				pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
 				goto done;
 			}
 		}

#16

Yugo NAGATA

nagata@sraoss.co.jp

over 4 years ago

In reply to: Fabien COELHO (#15)

Re: pgbench bug candidate: negative "initial connection time"

Hello Horiguchi-san, Fabien,

On Fri, 18 Jun 2021 15:58:48 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

/* must be something wrong */
pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
goto done;

Should say such like "thread %d aborted: %s() failed: ...".

After having a lookg, there are already plenty such cases. I'd say not to
change anything for beta, and think of it for the next round.

Agreed. Basically, I think the existing message should be left as is.

====
errno = THREAD_BARRIER_INIT(&barrier, nthreads);
if (errno != 0)
+ {
pg_log_fatal("could not initialize barrier: %m");
+ exit(1);

This is a run-time error. Maybe we should return 2 in that case.

I think that you are right, but there are plenty such places where exit
should be 2 instead of 1 if the doc is followed:

"""Errors during the run such as database errors or problems in the script
will result in exit status 2."""

My beta take is to let these as they are, i.e. pretty inconsistent all
over pgbench, and schedule a cleanup on the next round.

As same as the below Fabian's comment about thread->logfile,

===
if (thread->logfile == NULL)
{
pg_log_fatal("could not open logfile \"%s\": %m", logpath);
- goto done;
+ exit(1);

Maybe we should exit with 2 this case.

Yep.

The bench is not even started, this is not really run time yet, 1 seems
ok. The failure may be due to a typo in the path which comes from the
user.

the bench is not started at THREAD_BARRIER_INIT, so I think exit(1) is ok.

/* must be something wrong */
-        pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+        pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
goto done;
Why doesn't a fatal error cause an immediate exit?
Good point. I do not know, but I would expect it to be the case, and AFAICR
it does not.

(And if we change this to fatal, we also need to change similar errors in
the same function to fatal.)

Possibly.
On second look, I think that error is fine, indeed we do not stop the
process, so "fatal" it is not;

I replaced this 'fatal' with 'error' because we are aborting the client
instead of exit(1). When pgbench was rewritten to use common logging API
by the commit 30a3e772b40, somehow pg_log_fatal was used, but I am
wondering it should have be pg_log_error.

Attached Yugo-san patch with some updates discussed in the previous mails,
so as to move things along.

Thank you for update. I agree with this fix.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

#17

Yugo NAGATA

nagata@sraoss.co.jp

over 4 years ago

In reply to: Fabien COELHO (#15)

1 attachment(s)

Re: pgbench bug candidate: negative "initial connection time"

Hello,

On Fri, 18 Jun 2021 15:58:48 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Attached Yugo-san patch with some updates discussed in the previous mails,
so as to move things along.

I attached the patch rebased to a change due to 856de3b39cf.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

Attachments:

pgbench-conn-abort-5.patchtext/x-diff; name=pgbench-conn-abort-5.patchDownload

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 55d14604c0..e460dc7e5b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3181,6 +3181,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 					if ((st->con = doConnect()) == NULL)
 					{
+						/* as the bench is already running, we do not abort the process */
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
@@ -4418,7 +4419,10 @@ runInitSteps(const char *initialize_steps)
 	initPQExpBuffer(&stats);
 
 	if ((con = doConnect()) == NULL)
+	{
+		pg_log_fatal("connection for initialization failed");
 		exit(1);
+	}
 
 	setup_cancel_handler(NULL);
 	SetCancelConn(con);
@@ -6361,7 +6365,10 @@ main(int argc, char **argv)
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
+	{
+		pg_log_fatal("setup connection failed");
 		exit(1);
+	}
 
 	/* report pgbench and server versions */
 	printVersion(con);
@@ -6515,7 +6522,7 @@ main(int argc, char **argv)
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
-			if (thread->state[j].state == CSTATE_ABORTED)
+			if (thread->state[j].state != CSTATE_FINISHED)
 				exit_code = 2;
 
 		/* aggregate thread level stats */
@@ -6583,7 +6590,7 @@ threadRun(void *arg)
 		if (thread->logfile == NULL)
 		{
 			pg_log_fatal("could not open logfile \"%s\": %m", logpath);
-			goto done;
+			exit(1);
 		}
 	}
 
@@ -6607,16 +6614,10 @@ threadRun(void *arg)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 			{
-				/*
-				 * On connection failure, we meet the barrier here in place of
-				 * GO before proceeding to the "done" path which will cleanup,
-				 * so as to avoid locking the process.
-				 *
-				 * It is unclear whether it is worth doing anything rather
-				 * than coldly exiting with an error message.
-				 */
-				THREAD_BARRIER_WAIT(&barrier);
-				goto done;
+				/* coldly abort on initial connection failure */
+				pg_log_fatal("cannot create connection for client %d",
+							 state[i].id);
+				exit(1);
 			}
 		}
 
@@ -6749,7 +6750,7 @@ threadRun(void *arg)
 					continue;
 				}
 				/* must be something wrong */
-				pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+				pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
 				goto done;
 			}
 		}

#18

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Yugo NAGATA (#17)

Re: pgbench bug candidate: negative "initial connection time"

On 2021/07/29 13:23, Yugo NAGATA wrote:

Hello,

On Fri, 18 Jun 2021 15:58:48 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Attached Yugo-san patch with some updates discussed in the previous mails,
so as to move things along.

I attached the patch rebased to a change due to 856de3b39cf.

+		pg_log_fatal("connection for initialization failed");
+		pg_log_fatal("setup connection failed");
+				pg_log_fatal("cannot create connection for client %d",

These fatal messages output when doConnect() fails should be a bit more consistent each other? For example,

could not create connection for initialization
could not create connection for setup
could not create connection for client %d

I'm not sure, but *if* "xxx failed" is more proper for pgbench, what about

connection for initialization failed
connection for setup failed
connection for client %d failed

Exit status 1 indicates static problems such as invalid command-line options.
Errors during the run such as database errors or problems in the script will
result in exit status 2.

While reading the code and docs related to the patch, I found
these descriptions in pgbench docs. The first description needs to be
updated? Because even database error (e.g., failure of connection for setup)
can result in exit status 1 if it happens before the benchmark actually runs.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#19

Yugo NAGATA

nagata@sraoss.co.jp

over 4 years ago

In reply to: Fujii Masao (#18)

1 attachment(s)

Re: pgbench bug candidate: negative "initial connection time"

Hello Fujii-san,

On Tue, 7 Sep 2021 02:34:17 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/07/29 13:23, Yugo NAGATA wrote:

Hello,

On Fri, 18 Jun 2021 15:58:48 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Attached Yugo-san patch with some updates discussed in the previous mails,
so as to move things along.

I attached the patch rebased to a change due to 856de3b39cf.
+		pg_log_fatal("connection for initialization failed");
+		pg_log_fatal("setup connection failed");
+				pg_log_fatal("cannot create connection for client %d",
These fatal messages output when doConnect() fails should be a bit more consistent each other? For example,

could not create connection for initialization
could not create connection for setup
could not create connection for client %d

Ok. I fixed as your suggestion.

Exit status 1 indicates static problems such as invalid command-line options.
Errors during the run such as database errors or problems in the script will
result in exit status 2.

While reading the code and docs related to the patch, I found
these descriptions in pgbench docs. The first description needs to be
updated? Because even database error (e.g., failure of connection for setup)
can result in exit status 1 if it happens before the benchmark actually runs.

That makes sense. Failures of setup connection or initial connection doesn't
seem 'static problems'. I rewrote this description to explain exit status 1
indicates also interal errors and early errors.

Exit status 1 indicates static problems such as invalid command-line options
or internal errors which are supposed to never occur. Early errors that occur
when starting benchmark such as initial connection failures also exit with
status 1.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

Attachments:

pgbench-conn-abort-6.patchtext/x-diff; name=pgbench-conn-abort-6.patchDownload

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 0f432767c2..c71dab644c 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -904,10 +904,12 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
 
   <para>
    A successful run will exit with status 0.  Exit status 1 indicates static
-   problems such as invalid command-line options.  Errors during the run such
-   as database errors or problems in the script will result in exit status 2.
-   In the latter case, <application>pgbench</application> will print partial
-   results.
+   problems such as invalid command-line options or internal errors which
+   are supposed to never occur.  Early errors that occur when starting
+   benchmark such as initial connection failures also exit with status 1.
+   Errors during the run such as database errors or problems in the script
+   will result in exit status 2. In the latter case,
+   <application>pgbench</application> will print partial results.
   </para>
  </refsect1>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 433abd954b..8d44399c16 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3181,6 +3181,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 					if ((st->con = doConnect()) == NULL)
 					{
+						/* as the bench is already running, we do not abort the process */
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
@@ -4456,7 +4457,10 @@ runInitSteps(const char *initialize_steps)
 	initPQExpBuffer(&stats);
 
 	if ((con = doConnect()) == NULL)
+	{
+		pg_log_fatal("could not create connection for initialization");
 		exit(1);
+	}
 
 	setup_cancel_handler(NULL);
 	SetCancelConn(con);
@@ -6399,7 +6403,10 @@ main(int argc, char **argv)
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
+	{
+		pg_log_fatal("could not create connection for setup");
 		exit(1);
+	}
 
 	/* report pgbench and server versions */
 	printVersion(con);
@@ -6553,7 +6560,7 @@ main(int argc, char **argv)
 #endif							/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
-			if (thread->state[j].state == CSTATE_ABORTED)
+			if (thread->state[j].state != CSTATE_FINISHED)
 				exit_code = 2;
 
 		/* aggregate thread level stats */
@@ -6625,7 +6632,7 @@ threadRun(void *arg)
 		if (thread->logfile == NULL)
 		{
 			pg_log_fatal("could not open logfile \"%s\": %m", logpath);
-			goto done;
+			exit(1);
 		}
 	}
 
@@ -6650,16 +6657,10 @@ threadRun(void *arg)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 			{
-				/*
-				 * On connection failure, we meet the barrier here in place of
-				 * GO before proceeding to the "done" path which will cleanup,
-				 * so as to avoid locking the process.
-				 *
-				 * It is unclear whether it is worth doing anything rather
-				 * than coldly exiting with an error message.
-				 */
-				THREAD_BARRIER_WAIT(&barrier);
-				goto done;
+				/* coldly abort on initial connection failure */
+				pg_log_fatal("could not create connection for client %d",
+							 state[i].id);
+				exit(1);
 			}
 		}
 	}
@@ -6784,7 +6785,7 @@ threadRun(void *arg)
 					continue;
 				}
 				/* must be something wrong */
-				pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+				pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
 				goto done;
 			}
 		}

#20

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Yugo NAGATA (#19)

Re: pgbench bug candidate: negative "initial connection time"

On 2021/09/24 7:26, Yugo NAGATA wrote:

That makes sense. Failures of setup connection or initial connection doesn't
seem 'static problems'. I rewrote this description to explain exit status 1
indicates also interal errors and early errors.

Exit status 1 indicates static problems such as invalid command-line options
or internal errors which are supposed to never occur. Early errors that occur
when starting benchmark such as initial connection failures also exit with
status 1.

LGTM. Barring any objection, I will commit the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#21

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Fujii Masao (#20)

Re: pgbench bug candidate: negative "initial connection time"

On 2021/09/24 11:26, Fujii Masao wrote:

On 2021/09/24 7:26, Yugo NAGATA wrote:

That makes sense. Failures of setup connection or initial connection doesn't
seem 'static problems'. I rewrote this description to explain exit status 1
indicates also interal errors and early errors.

ï¿½ï¿½ Exit status 1 indicates static problems such as invalid command-line options
ï¿½ï¿½ or internal errors which are supposed to never occur.ï¿½ Early errors that occur
ï¿½ï¿½ when starting benchmark such as initial connection failures also exit with
ï¿½ï¿½ status 1.

LGTM. Barring any objection, I will commit the patch.

I extracted two changes from the patch and pushed (also back-patched) them.

The remainings are the changes of handling of initial connection or
logfile open failures. I agree to push them at least for the master.
But I'm not sure if they should be back-patched. Without these changes,
even when those failures happen, pgbench proceeds the benchmark and
reports the result. But with the changes, pgbench exits immediately in
that case. I'm not sure if there are people who expect this behavior,
but if there are, maybe we should not change it at least at stable branches.
Thought?

BTW, when logfile fails to be opened, pgbench gets stuck due to commit
aeb57af8e6. So even if we decided not to back-patch those changes,
we should improve the handling of logfile open failure, to fix the issue.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#22

Michael Paquier

michael@paquier.xyz

over 4 years ago

In reply to: Fujii Masao (#21)

Re: pgbench bug candidate: negative "initial connection time"

On Wed, Sep 29, 2021 at 10:11:53PM +0900, Fujii Masao wrote:

BTW, when logfile fails to be opened, pgbench gets stuck due to commit
aeb57af8e6. So even if we decided not to back-patch those changes,
we should improve the handling of logfile open failure, to fix the issue.

There is an entry in the CF for this thread:
https://commitfest.postgresql.org/34/3219/

I have moved that to the next one as some pieces are missing. If you
are planning to handle the rest, could you register your name as a
committer?
--
Michael

#23

Fujii Masao

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Michael Paquier (#22)

1 attachment(s)

Re: pgbench bug candidate: negative "initial connection time"

On 2021/10/01 15:27, Michael Paquier wrote:

On Wed, Sep 29, 2021 at 10:11:53PM +0900, Fujii Masao wrote:

BTW, when logfile fails to be opened, pgbench gets stuck due to commit
aeb57af8e6. So even if we decided not to back-patch those changes,
we should improve the handling of logfile open failure, to fix the issue.

There is an entry in the CF for this thread:
https://commitfest.postgresql.org/34/3219/

I have moved that to the next one as some pieces are missing. If you
are planning to handle the rest, could you register your name as a
committer?

Thanks for letting me know that!
I registered myself as a committer of the patch again.

pg_time_usec_t conn_duration; /* cumulated connection and deconnection
* delays */

BTW, while reading the patch, I found the above comment in pgbench.c.
"deconnection" seems a valid word in French (?), but isn't it better to
replace it with "disconnection"? Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

deconnection.patchtext/plain; charset=UTF-8; name=deconnection.patch; x-mac-creator=0; x-mac-type=0Download

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d6b31bbf53..81693e1765 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -509,7 +509,7 @@ typedef struct
 	pg_time_usec_t create_time; /* thread creation time */
 	pg_time_usec_t started_time;	/* thread is running */
 	pg_time_usec_t bench_start; /* thread is benchmarking */
-	pg_time_usec_t conn_duration;	/* cumulated connection and deconnection
+	pg_time_usec_t conn_duration;	/* cumulated connection and disconnection
 									 * delays */
 
 	StatsData	stats;

#24

Fujii Masao

masao.fujii@oss.nttdata.com

about 4 years ago

In reply to: Fujii Masao (#21)

1 attachment(s)

Re: pgbench bug candidate: negative "initial connection time"

On 2021/09/29 22:11, Fujii Masao wrote:

On 2021/09/24 11:26, Fujii Masao wrote:

On 2021/09/24 7:26, Yugo NAGATA wrote:

That makes sense. Failures of setup connection or initial connection doesn't
seem 'static problems'. I rewrote this description to explain exit status 1
indicates also interal errors and early errors.

ï¿½ï¿½ Exit status 1 indicates static problems such as invalid command-line options
ï¿½ï¿½ or internal errors which are supposed to never occur.ï¿½ Early errors that occur
ï¿½ï¿½ when starting benchmark such as initial connection failures also exit with
ï¿½ï¿½ status 1.

LGTM. Barring any objection, I will commit the patch.

I extracted two changes from the patch and pushed (also back-patched) them.

The remainings are the changes of handling of initial connection or
logfile open failures. I agree to push them at least for the master.
But I'm not sure if they should be back-patched. Without these changes,
even when those failures happen, pgbench proceeds the benchmark and
reports the result. But with the changes, pgbench exits immediately in
that case. I'm not sure if there are people who expect this behavior,
but if there are, maybe we should not change it at least at stable branches.
Thought?

The current behavior should be improved, but not a bug.
So I don't think that the patch needs to be back-patched.
Barring any objection, I will push the attached patch to the master.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

pgbench-conn-abort-7.patchtext/plain; charset=UTF-8; name=pgbench-conn-abort-7.patch; x-mac-creator=0; x-mac-type=0Download

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 0f432767c2..c71dab644c 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -904,10 +904,12 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
 
   <para>
    A successful run will exit with status 0.  Exit status 1 indicates static
-   problems such as invalid command-line options.  Errors during the run such
-   as database errors or problems in the script will result in exit status 2.
-   In the latter case, <application>pgbench</application> will print partial
-   results.
+   problems such as invalid command-line options or internal errors which
+   are supposed to never occur.  Early errors that occur when starting
+   benchmark such as initial connection failures also exit with status 1.
+   Errors during the run such as database errors or problems in the script
+   will result in exit status 2. In the latter case,
+   <application>pgbench</application> will print partial results.
   </para>
  </refsect1>
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d17f69333f..f8331bbb60 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3181,6 +3181,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 					if ((st->con = doConnect()) == NULL)
 					{
+						/* as the bench is already running, we do not abort the process */
 						pg_log_error("client %d aborted while establishing connection", st->id);
 						st->state = CSTATE_ABORTED;
 						break;
@@ -4456,7 +4457,10 @@ runInitSteps(const char *initialize_steps)
 	initPQExpBuffer(&stats);
 
 	if ((con = doConnect()) == NULL)
+	{
+		pg_log_fatal("could not create connection for initialization");
 		exit(1);
+	}
 
 	setup_cancel_handler(NULL);
 	SetCancelConn(con);
@@ -6399,7 +6403,10 @@ main(int argc, char **argv)
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
+	{
+		pg_log_fatal("could not create connection for setup");
 		exit(1);
+	}
 
 	/* report pgbench and server versions */
 	printVersion(con);
@@ -6625,7 +6632,7 @@ threadRun(void *arg)
 		if (thread->logfile == NULL)
 		{
 			pg_log_fatal("could not open logfile \"%s\": %m", logpath);
-			goto done;
+			exit(1);
 		}
 	}
 
@@ -6650,16 +6657,10 @@ threadRun(void *arg)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 			{
-				/*
-				 * On connection failure, we meet the barrier here in place of
-				 * GO before proceeding to the "done" path which will cleanup,
-				 * so as to avoid locking the process.
-				 *
-				 * It is unclear whether it is worth doing anything rather
-				 * than coldly exiting with an error message.
-				 */
-				THREAD_BARRIER_WAIT(&barrier);
-				goto done;
+				/* coldly abort on initial connection failure */
+				pg_log_fatal("could not create connection for client %d",
+							 state[i].id);
+				exit(1);
 			}
 		}
 	}

#25

Fujii Masao

masao.fujii@oss.nttdata.com

about 4 years ago

In reply to: Fujii Masao (#23)

Re: pgbench bug candidate: negative "initial connection time"

On 2021/10/09 0:41, Fujii Masao wrote:

On 2021/10/01 15:27, Michael Paquier wrote:

On Wed, Sep 29, 2021 at 10:11:53PM +0900, Fujii Masao wrote:

BTW, when logfile fails to be opened, pgbench gets stuck due to commit
aeb57af8e6. So even if we decided not to back-patch those changes,
we should improve the handling of logfile open failure, to fix the issue.

There is an entry in the CF for this thread:
https://commitfest.postgresql.org/34/3219/

I have moved that to the next one as some pieces are missing.ï¿½ If you
are planning to handle the rest, could you register your name as a
committer?

Thanks for letting me know that!
I registered myself as a committer of the patch again.

ï¿½ï¿½ï¿½ï¿½pg_time_usec_t conn_duration;ï¿½ï¿½ï¿½ /* cumulated connection and deconnection
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ * delays */

BTW, while reading the patch, I found the above comment in pgbench.c.
"deconnection" seems a valid word in French (?), but isn't it better to
replace it with "disconnection"? Patch attached.

Barring any objection, I will push this patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#26

Fujii Masao

masao.fujii@oss.nttdata.com

about 4 years ago

In reply to: Fujii Masao (#24)

Re: pgbench bug candidate: negative "initial connection time"

On 2021/11/01 23:01, Fujii Masao wrote:

The remainings are the changes of handling of initial connection or
logfile open failures. I agree to push them at least for the master.
But I'm not sure if they should be back-patched. Without these changes,
even when those failures happen, pgbench proceeds the benchmark and
reports the result. But with the changes, pgbench exits immediately in
that case. I'm not sure if there are people who expect this behavior,
but if there are, maybe we should not change it at least at stable branches.
Thought?

The current behavior should be improved, but not a bug.
So I don't think that the patch needs to be back-patched.
Barring any objection, I will push the attached patch to the master.

Pushed. Thanks!

I also pushed the typo-fix patch that I proposed upthread.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#27

Yugo NAGATA

nagata@sraoss.co.jp

about 4 years ago

In reply to: Fujii Masao (#26)

Re: pgbench bug candidate: negative "initial connection time"

On Tue, 2 Nov 2021 23:11:39 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/11/01 23:01, Fujii Masao wrote:

The remainings are the changes of handling of initial connection or
logfile open failures. I agree to push them at least for the master.
But I'm not sure if they should be back-patched. Without these changes,
even when those failures happen, pgbench proceeds the benchmark and
reports the result. But with the changes, pgbench exits immediately in
that case. I'm not sure if there are people who expect this behavior,
but if there are, maybe we should not change it at least at stable branches.
Thought?

The current behavior should be improved, but not a bug.
So I don't think that the patch needs to be back-patched.
Barring any objection, I will push the attached patch to the master.

Pushed. Thanks!

Thanks!

I also pushed the typo-fix patch that I proposed upthread.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

--
Yugo NAGATA <nagata@sraoss.co.jp>