Suggestion to add --continue-client-on-abort option to pgbench
Hi hackers,
I would like to suggest adding a new option to pgbench, which enables
the client to continue processing transactions even if some errors occur
during a transaction.
Currently, a client stops sending requests when its transaction is
aborted due to reasons other than serialization failures or deadlocks. I
think in some cases, especially when using custom scripts, the client
should be able to roll back the failed transaction and start a new one.
For example, my custom script (insert_to_unique_column.sql) is as follows:
```
CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));
```
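With this script, a transaction aborts on a duplicate key error; the server reports something like the following (the constraint name assumes the default one generated for the unique column):
```
ERROR:  duplicate key value violates unique constraint "test_col2_key"
DETAIL:  Key (col2)=(12345) already exists.
```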
Assume we need to continuously apply load to the server using 5 clients
for a certain period of time. However, a client sometimes stops when its
transaction in my custom script is aborted due to a unique constraint
violation. As a result, the load on the server is lower than expected,
which is the problem I want to address.
The proposed new option solves this problem. When
--continue-client-on-error is specified, the client rolls back the
failed transaction and starts a new one. This allows all 5 clients to
continuously apply load to the server, even if some transactions fail.
```
% bin/pgbench -d postgres -f ../insert_to_unique_column.sql -T 10
--failures-detailed --continue-client-on-error
transaction type: ../custom_script_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 33552
number of failed transactions: 21901 (39.495%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other failures: 21901 (39.495%)
latency average = 0.180 ms (including failures)
initial connection time = 2.857 ms
tps = 3356.092385 (without initial connection time)
```
I have attached the patch. I would appreciate your feedback.
Best regards,
Rintaro Ikeda
NTT DATA Corporation Japan
Attachments:
0001-add-continue-client-on-error-option-to-pgbench.patch (text/x-diff)
From a15432989e2539d55e6ad2f26b3aac7b2221413f Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <51394766+rinikeda@users.noreply.github.com>
Date: Sat, 10 May 2025 16:49:17 +0900
Subject: [PATCH] add continue-client-on-error option to pgbench
---
src/bin/pgbench/pgbench.c | 41 +++++++++++++++++++++++++++++++++++----
1 file changed, 37 insertions(+), 4 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..8eaf6ea38e3 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -440,6 +440,8 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions, which is enabled
+ * if --continue-client-on-error is used */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +772,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -1467,6 +1470,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1520,12 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3247,7 +3257,8 @@ static bool
canRetryError(EStatus estatus)
{
return (estatus == ESTATUS_SERIALIZATION_ERROR ||
- estatus == ESTATUS_DEADLOCK_ERROR);
+ estatus == ESTATUS_DEADLOCK_ERROR ||
+ (continue_on_error && estatus == ESTATUS_OTHER_SQL_ERROR));
}
/*
@@ -4528,7 +4539,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4548,6 +4560,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "error";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4603,6 +4617,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4643,10 +4658,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6285,6 +6302,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6443,6 +6461,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6546,6 +6567,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6705,6 +6730,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-client-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7058,6 +7084,12 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-client-on-error */
+ continue_on_error = true;
+ break;
+ case 18: /* continue-client-on-error */
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7413,6 +7445,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
--
2.39.5 (Apple Git-154)
On Sat, May 10, 2025 at 8:45 PM ikedarintarof <ikedarintarof@oss.nttdata.com> wrote:
[...]
Hi Rintaro,
Thanks for the patch and explanation. I understand your goal is to ensure
that pgbench clients continue running even when some transactions fail due
to application-level errors (e.g., constraint violations), especially when
running custom scripts.
However, I wonder if the intended behavior can't already be achieved using
standard SQL constructs — specifically ON CONFLICT or careful transaction
structure. For example, your sample script:
CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));
can be rewritten as:
\setrandom val 0 50000
INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;
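(\setrandom is the old pgbench meta-command and has since been removed; on current versions the equivalent form would be:)
```
\set val random(0, 50000)
INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;
```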
This avoids transaction aborts entirely in the presence of uniqueness
violations and ensures the client continues to issue load without
interruption. In many real-world benchmarking scenarios, this is the
preferred and simplest approach.
So from that angle, could you elaborate on specific cases where this
SQL-level workaround wouldn't be sufficient? Are there error types you
intend to handle that cannot be gracefully avoided or recovered from using
SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?
Best regards,
Stepan Neretin
On Sat, May 10, 2025 at 8:45 PM ikedarintarof <ikedarintarof@oss.nttdata.com> wrote:
[...]
+1. I've had similar cases before too, where I'd wanted pgbench to
continue creating load on the server even if a transaction failed
server-side for any reason. Sometimes, I'd even want that type of
load.
On Sat, 10 May 2025 at 17:02, Stepan Neretin <slpmcf@gmail.com> wrote:
INSERT INTO test (col2) VALUES (random(0, 50000));
can be rewritten as:
\setrandom val 0 50000
INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;
That won't test the same execution paths, so an option to explicitly
roll back or ignore failed transactions (rather than stopping the
benchmark) would be a nice feature.
With e.g. ON CONFLICT DO NOTHING you'll have a much higher workload if
there are many conflicting entries, as that triggers and catches errors
per row, rather than per statement. E.g., an INSERT INTO ... SELECT ...
could conflict on multiple rows, but will fail on the first conflict,
while DO NOTHING causes full execution of the SELECT statement, which
has an inherently different performance profile.
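A sketch of that difference, reusing the test table from upthread (row counts arbitrary):
```
-- Aborts the whole statement (and transaction) at the first duplicate:
INSERT INTO test (col2) SELECT g % 100 FROM generate_series(1, 10000) g;

-- Runs the SELECT to completion and skips each conflicting row individually:
INSERT INTO test (col2) SELECT g % 100 FROM generate_series(1, 10000) g
ON CONFLICT DO NOTHING;
```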
This avoids transaction aborts entirely in the presence of uniqueness violations and ensures the client continues to issue load without interruption. In many real-world benchmarking scenarios, this is the preferred and simplest approach.
So from that angle, could you elaborate on specific cases where this SQL-level workaround wouldn't be sufficient? Are there error types you intend to handle that cannot be gracefully avoided or recovered from using SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?
The issue isn't necessarily whether you can construct SQL scripts that
don't raise such errors (indeed, it's possible to do so for nearly any
command; you can run pl/pgsql procedures or DO blocks which catch and
ignore errors), but rather whether we can make pgbench function in a
way that can keep load on the server even when it notices an error.
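For instance, a DO block along these lines (a sketch, reusing the example table) absorbs unique violations server-side, so the client never sees an aborted transaction:
```
DO $$
BEGIN
  INSERT INTO test (col2) VALUES (trunc(random() * 50000)::int);
EXCEPTION WHEN unique_violation THEN
  NULL;  -- swallow the duplicate-key error; the transaction survives
END
$$;
```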
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Sun, May 11, 2025 at 7:07 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
[...]
Hi Matthias,
Thanks for your detailed explanation — it really helped clarify the
usefulness of the patch. I agree that the feature is indeed valuable, and
it's great to see it being pushed forward.
Regarding the patch code, I noticed that there are duplicate case entries
in the command-line option handling and in accumStats() (the case 18 and
case ESTATUS_OTHER_SQL_ERROR labels for the continue-client-on-error
option each appear twice). These duplicated cases can be merged to
simplify the logic and reduce redundancy.
Best regards,
Stepan Neretin
Hi Stepan and Matthias,
Thank you both for your replies. I agree with Matthias's detailed explanation regarding the purpose of the patch.
I also appreciate your pointing out my mistakes in the previous version of the patch. I fixed the duplicated lines. I’ve attached the updated patch.
Best regards,
Rintaro Ikeda
Attachments:
0001-add-continue-client-on-error-option-to-pgbench_ver2.patch (application/octet-stream)
From 1d36f296b3672c2b9570022705931f9a7b265f47 Mon Sep 17 00:00:00 2001
From: "Rintaro.Ikeda" <ikedarintarof@oss.nttdata.com>
Date: Mon, 12 May 2025 21:57:39 +0900
Subject: [PATCH] add continue-client-on-error option to pgbench
---
src/bin/pgbench/pgbench.c | 34 ++++++++++++++++++++++++++++++----
1 file changed, 30 insertions(+), 4 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..f1c4e7a3ea8 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -440,6 +440,10 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for reasons
+ * other than serialization/deadlock failures,
+ * which is counted only when
+ * --continue-client-on-error is used */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +774,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -1467,6 +1472,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -3247,7 +3253,8 @@ static bool
canRetryError(EStatus estatus)
{
return (estatus == ESTATUS_SERIALIZATION_ERROR ||
- estatus == ESTATUS_DEADLOCK_ERROR);
+ estatus == ESTATUS_DEADLOCK_ERROR ||
+ (continue_on_error && estatus == ESTATUS_OTHER_SQL_ERROR));
}
/*
@@ -4528,7 +4535,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4548,6 +4556,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "error (except serialization/deadlock)";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4603,6 +4613,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4643,10 +4654,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6285,6 +6298,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6443,6 +6457,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6546,6 +6563,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6705,6 +6726,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-client-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7058,6 +7080,9 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-client-on-error */
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7413,6 +7438,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
--
2.39.5 (Apple Git-154)
On Tue, May 13, 2025 at 9:20 AM <Rintaro.Ikeda@nttdata.com> wrote:
I also appreciate your pointing out my mistakes in the previous version of the patch. I fixed the duplicated lines. I’ve attached the updated patch.
This is a useful feature, so +1 from my side. Here are some initial
comments from a quick look at the patch.
1. You need to update the stats for this new counter in the
"accumStats()" function.
2. IMHO, " continue-on-error " is more user-friendly than
"continue-client-on-error".
3. There are a lot of whitespace errors, so those can be fixed. You
can just try to apply using git am, and it will report those
whitespace warnings. And for fixing, you can just use
"--whitespace=fix" along with git am.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
On Tue, May 13, 2025 at 11:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
[...]
Hi, +1 for the idea. I’ve reviewed and tested the patch. Aside from Dilip’s
feedback and the missing usage information for this option, the patch LGTM.
Here's a diff adding the missing usage information for this option and,
as Dilip mentioned, updating the new counter in the "accumStats()" function.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index baaf1379be2..20d456bc4b9 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -959,6 +959,8 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-client-on-error\n"
+ " Continue and retry transactions that failed due to errors other than serialization or deadlocks.\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1522,6 +1524,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
Hi, hackers.
On Tue, May 13, 2025 at 11:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
1. You need to update the stats for this new counter in the "accumStats()" function.
2. IMHO, "continue-on-error" is more user-friendly than "continue-client-on-error".
3. There are a lot of whitespace errors, so those can be fixed. You can just try to apply using git am, and it will report those whitespace warnings. And for fixing, you can just use "--whitespace=fix" along with git am.
On May 14, 2025, at 18:08, Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote:
Here's the diff for the missing usage information for this option and as Dilip mentioned updating the new counter in the "accumStats()" function.
Thank you very much for the helpful comments, and apologies for my delayed reply.
I've updated the patch based on your suggestions:
- Modified the name of the option.
- Added the missing explanation.
- Updated the new counter in the `accumStats()` function as pointed out.
- Fixed the whitespace issues.
Additionally, I've included documentation for the new option.
I'm submitting this updated patch to the current CommitFest.
Best Regards,
Rintaro Ikeda
Hi, Hackers.
I've attached the patch that I failed to include in my previous email.
(I'm still a bit confused about how to attach files using the standard
Mail client on macOS.)
Best Regards,
Rintaro Ikeda
Attachments:
v3-0001-Add-continue-on-error-option-to-pgbench.patch (text/x-diff)
From fca20d18dbedc8a9c66408da3e7139cd4192ff5b Mon Sep 17 00:00:00 2001
From: "Rintaro.Ikeda" <ikedarintarof@oss.nttdata.com>
Date: Mon, 12 May 2025 21:57:39 +0900
Subject: [PATCH] Add continue-on-error option to pgbench
When the option is set, client rolls back the failed transaction and starts a
new one when its transaction fails due to the reason other than the deadlock and
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 12 +++++++++++
src/bin/pgbench/pgbench.c | 39 +++++++++++++++++++++++++++++++----
2 files changed, 47 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..dcb8c1c487c 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -914,6 +914,18 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Client rolls back the failed transaction and starts a new one when its
+ transaction fails due to the reason other than the deadlock and
+ serialization failure. This allows all clients specified with -c option
+ to continuously apply load to the server, even if some transactions fail.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..5db222f2c1e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -440,6 +440,10 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for reasons
+ * other than serialization/deadlock failures,
+ * which is counted only when
+ * --continue-on-error is used */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +774,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +959,8 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error\n"
+ " Continue and retry transactions that failed due to errors other than serialization or deadlocks.\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1474,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1524,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3247,7 +3258,8 @@ static bool
canRetryError(EStatus estatus)
{
return (estatus == ESTATUS_SERIALIZATION_ERROR ||
- estatus == ESTATUS_DEADLOCK_ERROR);
+ estatus == ESTATUS_DEADLOCK_ERROR ||
+ (continue_on_error && estatus == ESTATUS_OTHER_SQL_ERROR));
}
/*
@@ -4528,7 +4540,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4548,6 +4561,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "error (except serialization/deadlock)";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4603,6 +4618,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4643,10 +4659,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6285,6 +6303,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6443,6 +6462,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6546,6 +6568,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6705,6 +6731,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7058,6 +7085,9 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7413,6 +7443,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
--
2.39.5 (Apple Git-154)
Dear Ikeda-san,
Thanks for starting the new thread! I had not known about the issue before I
heard about it at PGConf.dev.
A few comments:
1.
This parameter seems to be a benchmarking option, so should we set
benchmarking_option_set as well?
2.
Not sure, but exit-on-abort seems like a similar option. What if both are specified?
Is that allowed?
3.
Can you add a test case for the new parameter?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Dear Kuroda-san, hackers,
On 2025/06/04 21:57, Hayato Kuroda (Fujitsu) wrote:
[...]
Thank you for your valuable comments!
1. I should've also set benchmarking_option_set. I've modified it accordingly.
2. The exit-on-abort and continue-on-error options are mutually exclusive.
Therefore, I've updated the patch to throw a FATAL error when the two options
are set simultaneously (see the sketch below), and added a corresponding
explanation to the documentation.
(I'm wondering whether the parameter should be named continue-on-abort so that
users understand that the two options are mutually exclusive.)
3. I've added the test.
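For point 2, the check is roughly the following (a sketch; the exact message and placement are in the attached patch):
```
if (exit_on_abort && continue_on_error)
    pg_fatal("--exit-on-abort and --continue-on-error are mutually exclusive");
```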
Additionally, I modified the patch so that st->state does not transition to
CSTATE_RETRY when a transaction fails and the continue-on-error option is enabled.
In the previous patch, we retried the failed transaction up to max-tries times,
which is unnecessary for our purpose: clients should simply not exit when their
transactions fail.
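The idea is roughly the following (a sketch against pgbench's client state machine; identifier names as in pgbench.c, exact placement differs in the attached patch):
```
/* after a failed SQL command, decide what to do with the transaction */
if (continue_on_error && st->estatus == ESTATUS_OTHER_SQL_ERROR)
    st->state = CSTATE_FAILURE;  /* roll back, count it, start a new transaction */
else if (doRetry(st, &now))
    st->state = CSTATE_RETRY;    /* serialization/deadlock: retry up to max_tries */
else
    st->state = CSTATE_FAILURE;
```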
I've attached the updated patches.
v3-0001-Add-continue-on-error-option-to-pgbench.patch is identical to
v4-0001-Add-continue-on-error-option-to-pgbench.patch. The v4-0002 patch
contains the changes relative to it.
Best regards,
Rintaro Ikeda
Attachments:
v4-0001-Add-continue-on-error-option-to-pgbench.patch (text/plain)
From 6a539c2d467b671728a564a0368eb474d398310c Mon Sep 17 00:00:00 2001
From: "Rintaro.Ikeda" <ikedarintarof@oss.nttdata.com>
Date: Mon, 12 May 2025 21:57:39 +0900
Subject: [PATCH 1/2] Add continue-on-error option to pgbench
When the option is set, client rolls back the failed transaction and starts a
new one when its transaction fails due to the reason other than the deadlock and
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 12 +++++++++++
src/bin/pgbench/pgbench.c | 39 +++++++++++++++++++++++++++++++----
2 files changed, 47 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..dcb8c1c487c 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -914,6 +914,18 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Client rolls back the failed transaction and starts a new one when its
+ transaction fails due to the reason other than the deadlock and
+ serialization failure. This allows all clients specified with -c option
+ to continuously apply load to the server, even if some transactions fail.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..5db222f2c1e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -440,6 +440,10 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for reasons
+ * other than serialization/deadlock failures,
+ * which is counted only when
+ * --continue-on-error is used */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +774,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +959,8 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error\n"
+ " Continue and retry transactions that failed due to errors other than serialization or deadlocks.\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1474,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1524,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3247,7 +3258,8 @@ static bool
canRetryError(EStatus estatus)
{
return (estatus == ESTATUS_SERIALIZATION_ERROR ||
- estatus == ESTATUS_DEADLOCK_ERROR);
+ estatus == ESTATUS_DEADLOCK_ERROR ||
+ (continue_on_error && estatus == ESTATUS_OTHER_SQL_ERROR));
}
/*
@@ -4528,7 +4540,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4548,6 +4561,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "error (except serialization/deadlock)";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4603,6 +4618,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4643,10 +4659,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6285,6 +6303,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6443,6 +6462,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6546,6 +6568,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6705,6 +6731,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7058,6 +7085,9 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7413,6 +7443,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
--
2.39.5 (Apple Git-154)
v4-0002-1.-Do-not-retry-failed-transaction-due-to-other_sql_.patchtext/plain; charset=UTF-8; name=v4-0002-1.-Do-not-retry-failed-transaction-due-to-other_sql_.patchDownload
From d9d363c7c298f44063a2aa33530622548ee45cbf Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintaro@ikedarintarous-MacBook-Air.local>
Date: Sun, 8 Jun 2025 23:40:32 +0900
Subject: [PATCH 2/2] 1. Do not retry failed transaction due to
other_sql_failures. 2. modify documentation and comments. 3. add test.
---
doc/src/sgml/ref/pgbench.sgml | 3 +++
src/bin/pgbench/pgbench.c | 26 +++++++++++++++-----
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +++++++++++++++++
3 files changed, 45 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index dcb8c1c487c..2086dd59cb3 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -923,6 +923,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
serialization failure. This allows all clients specified with -c option
to continuously apply load to the server, even if some transactions fail.
</para>
+ <para>
+ Note that this option can not be used together with
+ <option>--exit-on-abort</option>.
</listitem>
</varlistentry>
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 5db222f2c1e..2333110c29f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -287,6 +287,9 @@ static int main_pid; /* main process id used in log filename */
/*
* We cannot retry a transaction after the serialization/deadlock error if its
* number of tries reaches this maximum; if its value is zero, it is not used.
+ * We can ignore errors including serialization/deadlock errors and other errors
+ * if --continue-on-error is set, but in this case the failed transaction is not
+ * retried.
*/
static uint32 max_tries = 1;
@@ -402,7 +405,8 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
+ * A failed transaction is defined as unsuccessfully retried transactions
+ * unless continue-on-error option is specified.
* It can be one of two types:
*
* failed (the number of failed transactions) =
@@ -411,6 +415,11 @@ typedef struct StatsData
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When continue-on-error option is specified,
+ * failed (the number of failed transactions) =
+ * 'other_sql_failures' (they got a error when continue-on-error option
+ * was specified).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -960,7 +969,7 @@ usage(void)
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
" --continue-on-error\n"
- " Continue and retry transactions that failed due to errors other than serialization or deadlocks.\n"
+ " continue to process transactions after a trasaction fails due to errors other than serialization or deadlocks.\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -3258,8 +3267,7 @@ static bool
canRetryError(EStatus estatus)
{
return (estatus == ESTATUS_SERIALIZATION_ERROR ||
- estatus == ESTATUS_DEADLOCK_ERROR ||
- (continue_on_error && estatus == ESTATUS_OTHER_SQL_ERROR));
+ estatus == ESTATUS_DEADLOCK_ERROR);
}
/*
@@ -4019,7 +4027,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (canRetryError(st->estatus) | continue_on_error)
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4111,6 +4119,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
* can retry the error.
*/
st->state = timer_exceeded ? CSTATE_FINISHED :
+ continue_on_error ? CSTATE_FAILURE :
doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;
}
else
@@ -6446,7 +6455,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to others than serialization or deadlock errors and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -7086,6 +7096,7 @@ main(int argc, char **argv)
pg_logging_increase_verbosity();
break;
case 18: /* continue-on-error */
+ benchmarking_option_set = true;
continue_on_error = true;
break;
default:
@@ -7242,6 +7253,9 @@ main(int argc, char **argv)
pg_fatal("an unlimited number of transaction tries can only be used with --latency-limit or a duration (-T)");
}
+ if (exit_on_abort && continue_on_error)
+ pg_fatal("--exit-on-abort and --continue-on-error are mutually exclusive options");
+
/*
* save main process id in the global variable because process id will be
* changed after fork.
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..afb49b554d0 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique); ' . 'INSERT INTO unique_table VALUES (0);');
+
+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 0/10\b},
+ qr{other failures: 10\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{
+ insert into unique_table values (0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.39.5 (Apple Git-154)
Dear Ikeda-san,
Thanks for updating the patch!
1. I should've also set benchmarking_option_set. I've modified it accordingly.
Confirmed it has been fixed. Thanks.
2. The exit-on-abort option and continue-on-error option are mutually exclusive.
Therefore, I've updated the patch to throw a FATAL error when two options are
set simultaneously. Corresponding explanation was also added.
(I'm wondering whether the parameter should be named continue-on-abort so that users
understand the two options are mutually exclusive.)
Make sense, +1.
Here are new comments.
01. build failure
According to the cfbot [1], the documentation cannot be built. IIUC a </para> seems
to be missing here:
```
+ <para>
+ Note that this option can not be used together with
+ <option>--exit-on-abort</option>.
+ </listitem>
+ </varlistentry>
```
02. patch separation
How about separating the patch series like:
0001 - contains option handling and retry part, and documentation
0002 - contains accumulation/reporting part
0003 - contains tests.
I hope above style is more helpful for reviewers.
03. documentation
```
+ Note that this option can not be used together with
+ <option>--exit-on-abort</option>.
```
I feel we should add a similar description in the `exit-on-abort` part.
04. documentation
```
+ Client rolls back the failed transaction and starts a new one when its
+ transaction fails due to the reason other than the deadlock and
+ serialization failure. This allows all clients specified with -c option
+ to continuously apply load to the server, even if some transactions fail.
```
I feel the description contains a somewhat redundant part and misses the default behavior.
How about:
```
<para>
Clients survive when their transactions are aborted, and they continue
their run. Without the option, clients exit when transactions they run
are aborted.
</para>
<para>
Note that serialization failures or deadlock failures do not abort the
client, so they are not affected by this option.
See <xref linkend="failures-and-retries"/> for more information.
</para>
```
05. StatsData
```
+ * When continue-on-error option is specified,
+ * failed (the number of failed transactions) =
+ * 'other_sql_failures' (they got a error when continue-on-error option
+ * was specified).
```
Let me confirm one point; can serialization_failures and deadlock_failures be
counted when continue-on-error is true? If so, the comment seems incorrect to me.
The formula should be 'serialization_failures' + 'deadlock_failures' +
'other_sql_failures' in that case.
06. StatsData
Another point; can other_sql_failures be counted when continue-on-error is NOT
specified? I feel it should be...
06. usage()
Added line is too long. According to program_help_ok(), the help output should
be less than 80 columns.
07.
Please run pgindent/pgperltidy, I got some diffs.
[1]: https://cirrus-ci.com/task/5210061275922432
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, 9 Jun 2025 09:34:03 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
2. The exit-on-abort option and continue-on-error option are mutually exclusive.
Therefore, I've updated the patch to throw a FATAL error when two options are
set simultaneously. Corresponding explanation was also added.
I don't think that's right since "abort" and "error" are different concepts in pgbench.
(Here, "abort" refers to the termination of a client, not a transaction abort.)
The --exit-on-abort option forces pgbench to exit immediately when any client is aborted
due to some error. When the --continue-on-error option is not set, SQL errors other than
deadlock or serialization error cause a client to be aborted. On the other hand, when the option
is set, clients are not aborted due to any SQL errors; instead they continue to run after them.
However, clients can still be aborted for other reasons, such as connection failures or
meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option remains
useful even when --continue-on-error is enabled.
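As an illustration, consider a hypothetical two-line script (not from the patch; unique_table
is the table used in the TAP test below). The \set line fails inside pgbench itself, which is
a meta-command error that aborts the client even with --continue-on-error, while the INSERT
can only fail at the SQL level and would be survived:
```
-- hypothetical sketch: the meta-command below fails in pgbench itself
\set x 1/0
-- an SQL-level failure here would be survived under --continue-on-error
INSERT INTO unique_table VALUES (:x);
```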
(I'm wondering whether the parameter should be named continue-on-abort so that users
understand the two options are mutually exclusive.)
For the same reason as above, I believe --continue-on-error is a more accurate description
of the option's behavior.
02. patch separation
How about separating the patch series like:
0001 - contains option handling and retry part, and documentation
0002 - contains accumulation/reporting part
0003 - contains tests.
I hope above style is more helpful for reviewers.
I'm not sure whether it's necessary to split the patch, as the change doesn't seem very
complex. However, the current separation appears inconsistent. For example, patch 0001
modifies canRetryError(), but patch 0002 reverts that change, and so on.
04. documentation
```
+ Client rolls back the failed transaction and starts a new one when its
+ transaction fails due to the reason other than the deadlock and
+ serialization failure. This allows all clients specified with -c option
+ to continuously apply load to the server, even if some transactions fail.
```
I feel the description contains a somewhat redundant part and misses the default behavior.
How about:
```
<para>
Clients survive when their transactions are aborted, and they continue
their run. Without the option, clients exit when transactions they run
are aborted.
</para>
<para>
Note that serialization failures or deadlock failures do not abort the
client, so they are not affected by this option.
See <xref linkend="failures-and-retries"/> for more information.
</para>
```
I think we can make it clearer as follows:
Allows clients to continue their run even if an SQL statement fails due to errors other
than serialization or deadlock. Without this option, the client is aborted after
such errors.
Note that serialization and deadlock failures never cause the client to be aborted,
so they are not affected by this option. See <xref linkend="failures-and-retries"/>
for more information.
That said, a review by a native English speaker would still be appreciated.
Also, we would need to update several parts of the documentation. For example, the
"Failures and Serialization/Deadlock Retries" section should be revised to describe the
behavior change. In addition, we should update the explanations of output result examples
and logging, the description of the --failures-detailed option, and so on.
If transactions are not retried after SQL errors other than serialization or deadlock,
this should also be explicitly documented.
05. StatsData
```
+ * When continue-on-error option is specified,
+ * failed (the number of failed transactions) =
+ * 'other_sql_failures' (they got a error when continue-on-error option
+ * was specified).
```
Let me confirm one point; can serialization_failures and deadlock_failures be
counted when continue-on-error is true? If so, the comment seems incorrect to me.
The formula should be 'serialization_failures' + 'deadlock_failures' +
'other_sql_failures' in that case.
+1
06. StatsData
Another point; can other_sql_failures be counted when continue-on-error is NOT
specified? I feel it should be...
We could do that. However, if an SQL error other than a serialization or deadlock error
occurs when --continue-on-error is not set, pgbench will be aborted midway and the printed
results will be incomplete. Therefore, this might not make much sense.
06. usage()
Added line is too long. According to program_help_ok(), the help output should
be less than 80 columns.
+1
Here are additional comments from me.
@@ -4548,6 +4570,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "error (except serialization/deadlock)";
Strings returned by getResultString() are printed in the "time" field of the
log when both the -l and --failures-detailed options are set. Therefore, they
should be single words that do not contain any space characters. I wonder if
something like "other" or "other_sql_error" would be appropriate.
@@ -4099,6 +4119,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
* can retry the error.
*/
st->state = timer_exceeded ? CSTATE_FINISHED :
+ continue_on_error ? CSTATE_FAILURE :
doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;
}
else
This fix is not necessary because doRetry() (and canRetryError(), which is called
within it) will return false when continue_on_error is set (after applying patch 0002).
case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
if (canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
goto error;
}
/* fall through */
default:
/* anything else is unexpected */
pg_log_error("client %d script %d aborted in command %d query %d: %s",
st->id, st->use_file, st->command, qrynum,
PQerrorMessage(st->con));
goto error;
}
When an SQL error other than a serialization or deadlock error occurs, an error message is
output via pg_log_error in this code path. However, I think this should be reported only
when verbose_errors is set, similar to how serialization and deadlock errors are handled when
--continue-on-error is enabled.
Best regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Dear Nagata-san,
2. The exit-on-abort option and continue-on-error option are mutually exclusive.
Therefore, I've updated the patch to throw a FATAL error when two options are
set simultaneously. Corresponding explanation was also added.
I don't think that's right since "abort" and "error" are different concepts in pgbench.
(Here, "abort" refers to the termination of a client, not a transaction abort.)
The --exit-on-abort option forces pgbench to exit immediately when any client is aborted
due to some error. When the --continue-on-error option is not set, SQL errors other than
deadlock or serialization error cause a client to be aborted. On the other hand, when the
option is set, clients are not aborted due to any SQL errors; instead they continue to run
after them.
However, clients can still be aborted for other reasons, such as connection failures or
meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option remains
useful even when --continue-on-error is enabled.
To clarify: another approach is to allow the --continue-on-error option to keep clients
running even when they hit such errors. Which one is better?
02. patch separation
How about separating the patch series like:
0001 - contains option handling and retry part, and documentation
0002 - contains accumulation/reporting part
0003 - contains tests.
I hope above style is more helpful for reviewers.
I'm not sure whether it's necessary to split the patch, as the change doesn't seem very
complex. However, the current separation appears inconsistent. For example, patch 0001
modifies canRetryError(), but patch 0002 reverts that change, and so on.
Either way is fine for me if they are changed from the current method.
04. documentation
```
+ Client rolls back the failed transaction and starts a new one when its
+ transaction fails due to the reason other than the deadlock and
+ serialization failure. This allows all clients specified with -c option
+ to continuously apply load to the server, even if some transactions fail.
```
I feel the description contains a somewhat redundant part and misses the default
behavior.
How about:
```
<para>
Clients survive when their transactions are aborted, and they continue
their run. Without the option, clients exit when transactions they run
are aborted.
</para>
<para>
Note that serialization failures or deadlock failures do not abort the
client, so they are not affected by this option.
See <xref linkend="failures-and-retries"/> for more information.
</para>
```
I think we can make it clearer as follows:
I am not confident in my English, so a native speaker's review is needed....
06. usage()
Added line is too long. According to program_help_ok(), the help output should
be less than 80 columns.
+1
FYI - I posted a patch which adds the test. You can apply it and confirm how it behaves.
[1]: /messages/by-id/OSCPR01MB1496610451F5896375B2562E6F56BA@OSCPR01MB14966.jpnprd01.prod.outlook.com
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Tue, 17 Jun 2025 03:47:00 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
Dear Nagata-san,
2. The exit-on-abort option and continue-on-error option are mutually exclusive.
Therefore, I've updated the patch to throw a FATAL error when two options are
set simultaneously. Corresponding explanation was also added.
I don't think that's right since "abort" and "error" are different concepts in pgbench.
(Here, "abort" refers to the termination of a client, not a transaction abort.)
The --exit-on-abort option forces pgbench to exit immediately when any client is aborted
due to some error. When the --continue-on-error option is not set, SQL errors other than
deadlock or serialization error cause a client to be aborted. On the other hand, when the
option is set, clients are not aborted due to any SQL errors; instead they continue to run
after them.
However, clients can still be aborted for other reasons, such as connection failures or
meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option remains
useful even when --continue-on-error is enabled.
To clarify: another approach is to allow the --continue-on-error option to keep clients
running even when they hit such errors. Which one is better?
It might be worth discussing which types of errors this option should allow pgbench
to continue after. As I understand it, the current patch's goal is to allow only SQL-level
errors like constraint violations. It seems good because this could simulate the
behavior of applications that ignore or retry such errors (although they are not
retried in the current patch). Perhaps it makes sense to allow continuing after
some network errors because it would enable benchmarks using a cluster system or a
cloud service that could report a temporary error during a failover.
It might be worth discussing which types of errors this option should allow pgbench to
continue after.
As I understand it, the current patch aims to allow continuation only after SQL-level
errors, such as constraint violations. That seems reasonable, as it can simulate the
behavior of applications that ignore or retry such errors (even though retries are not
implemented in the current patch).
Perhaps it also makes sense to allow continuation after certain network errors, as this
would enable benchmarking with cluster systems or cloud services, which might report
temporary errors during a failover. We would need additional work to properly detect
and handle network errors, though.
However, I'm not sure it's reasonable to allow continuation after other types of errors,
such as misuse of meta-commands or unexpected errors during their execution, since these
wouldn't simulate any real application behavior and would more likely indicate a failure
in the benchmarking process itself.
Best regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Tue, 17 Jun 2025 16:28:28 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Tue, 17 Jun 2025 03:47:00 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:Dear Nagata-san,
2. The exit-on-abort option and continue-on-error option are mutually
exclusive.
Therefore, I've updated the patch to throw a FATAL error when two options
are
set simultaneously. Corresponding explanation was also added.
I don't think that's right since "abort" and "error" are different concept in pgbench.
(Here, "abort" refers to the termination of a client, not a transaction abort.)The --exit-on-abort option forces to exit pgbench immediately when any client is
aborted
due to some error. When the --continue-on-error option is not set, SQL errors
other than
deadlock or serialization error cause a client to be aborted. On the other hand,
when the option
is set, clients are not aborted due to any SQL errors; instead they continue to run
after them.
However, clients can still be aborted for other reasons, such as connection
failures or
meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option
remains
useful even when --continue-on-error is enabled.To clarify: another approach is that allow --continue-on-error option to continue
running even when clients meet such errors. Which one is better?It might be worth discussing which types of errors this option should allow pgbench
to continue after. On my understand the current patch's goal is to allow only SQL
level errors like comstraint violations. It seems good because this could simulate
behaviour of applications that ignore or retry such errors (although they are not
retried in the current patch). Perhaps, it makes sense to allow to continue after
some network errors because it would enable benchmarks usign a cluster system as a
cloud service that could report a temporary error during a failover.
I apologize for accidentally leaving the draft paragraph just above in my previous post.
Please ignore it.
It might be worth discussing which types of errors this option should allow pgbench to
continue after.
As I understand it, the current patch aims to allow continuation only after SQL-level
errors, such as constraint violations. That seems reasonable, as it can simulate the
behavior of applications that ignore or retry such errors (even though retries are not
implemented in the current patch).
Perhaps it also makes sense to allow continuation after certain network errors, as this
would enable benchmarking with cluster systems or cloud services, which might report
temporary errors during a failover. We would need additional work to properly detect
and handle network errors, though.
However, I'm not sure it's reasonable to allow continuation after other types of errors,
such as misuse of meta-commands or unexpected errors during their execution, since these
wouldn't simulate any real application behavior and would more likely indicate a failure
in the benchmarking process itself.
Best regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
--
Yugo Nagata <nagata@sraoss.co.jp>
Dear Nagata-san,
As I understand it, the current patch aims to allow continuation only after SQL-level
errors, such as constraint violations. That seems reasonable, as it can simulate the
behavior of applications that ignore or retry such errors (even though retries are not
implemented in the current patch).
Yes, no one has objections to retry in this case. This is a main part of the proposal.
However, I'm not sure it's reasonable to allow continuation after other types of errors,
such as misuse of meta-commands or unexpected errors during their execution, since these
wouldn't simulate any real application behavior and would more likely indicate a failure
in the benchmarking process itself.
I have a concern about the \gset meta-command.
According to the docs and source code, \gset assumes that the executed command returns
exactly one tuple:
```
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
st->id, st->use_file, st->command, qrynum, PQntuples(res));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
```
But sometimes the SQL may return no tuples, or multiple ones, due to concurrent
transactions. I feel retrying the transaction would be very useful in this case.
Anyway, we must confirm the opinion from the proposer.
[1]: https://github.com/ryogrid/tpcc_like_with_pgbench
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, 26 Jun 2025 05:45:12 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
Dear Nagata-san,
As I understand it, the current patch aims to allow continuation only after SQL-level
errors, such as constraint violations. That seems reasonable, as it can simulate the
behavior of applications that ignore or retry such errors (even though retries are not
implemented in the current patch).
Yes, no one has objections to retry in this case. This is a main part of the proposal.
As I understand it, the proposed --continue-on-error option does not retry the transaction
in any case; it simply gives up on the transaction. That is, when an SQL-level error occurs,
the transaction is reported as "failed" rather than "retried", and the random state is discarded.
However, I'm not sure it's reasonable to allow continuation after other types of errors,
such as misuse of meta-commands or unexpected errors during their execution, since these
wouldn't simulate any real application behavior and would more likely indicate a failure
in the benchmarking process itself.
I have a concern about the \gset meta-command.
According to the docs and source code, \gset assumes that the executed command returns
exactly one tuple:
```
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
st->id, st->use_file, st->command, qrynum, PQntuples(res));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
```
But sometimes the SQL may return no tuples, or multiple ones, due to concurrent
transactions. I feel retrying the transaction would be very useful in this case.
You can use the \aset command instead to avoid the pgbench error. If the query doesn't
return any row, a subsequent SQL command trying to use the variable will fail, but this
failure would be ignored without terminating the benchmark when the --continue-on-error
option is enabled.
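A minimal sketch of that workaround (hypothetical script; unique_table is the table from
the TAP test): \gset would abort the client if the SELECT returned zero rows, whereas
\aset just leaves the variable unset, so the next command fails with an ordinary SQL error
that --continue-on-error survives:
```
-- \aset assigns :i only if a row is returned; zero rows is not an error
SELECT i FROM unique_table WHERE i = 0 \aset
-- if :i was never set, this fails as a plain SQL error, not a client abort
UPDATE unique_table SET i = i + 1 WHERE i = :i;
```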
Anyway, we must confirm the opinion from the proposer.
+1
Best regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Hi,
Thank you very much for your valuable comments and kind advice. I'm
currently working on revising the previous patch based on the feedback
received. I would like to share my thoughts regarding the conditions
under which the --continue-on-error option should initiate a new
transaction or a new connection.
In my opinion, when the --continue-on-error option is enabled, pgbench
clients do not need to start new transactions after network errors or
other errors except for SQL-level errors.
Network errors are relatively rare, except in failover scenarios.
Outside of failover, any network issues should be resolved rather than
worked around. In the context of failover, the key metric is not TPS,
but system downtime. While one might infer the timing of a failover by
observing the output of the --progress option, you can easily determine the
downtime by executing a simple SQL query such as `psql -c 'SELECT 1'` every
second.
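For example, a rough availability probe along those lines (a sketch, not part of the
patch) could be run alongside the benchmark:
```
% while :; do psql -qAtc 'SELECT 1' postgres >/dev/null 2>&1 || date '+%H:%M:%S down'; sleep 1; done
```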
On 2025/06/26 18:47, Yugo Nagata wrote:
On Thu, 26 Jun 2025 05:45:12 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:Dear Nagata-san,
As I understand it, the current patch aims to allow continuation only
after
SQL-level
errors, such as constraint violations. That seems reasonable, as it
can simulate
the
behavior of applications that ignore or retry such errors (even
though retries are
not
implemented in the current patch).Yes, no one has objections to retry in this case. This is a main part
of the proposal.,As I understand it, the proposed --continue-on-error option does not
retry the transaction
in any case; it simply gives up on the transaction. That is, when an
SQL-level error occurs,
the transaction is reported as "failed" rather than "retried", and the
random state is discarded.
Retrying the failed transaction is not necessary when the transaction
failed due to SQL-level errors. Unlike real-world applications, pgbench
does not need to complete a specific transaction successfully. In the case
of unique constraint violations, retrying the same transaction will
likely result in the same error again.
I want to hear your thoughts on this.
Best regards,
Rintaro Ikeda
On Fri, 27 Jun 2025 14:06:24 +0900
ikedarintarof <ikedarintarof@oss.nttdata.com> wrote:
Hi,
Thank you very much for your valuable comments and kind advice. I'm
currently working on revising the previous patch based on the feedback
received. I would like to share my thoughts regarding the conditions
under which the --continue-on-error option should initiate a new
transaction or a new connection.
In my opinion, when the --continue-on-error option is enabled, pgbench
clients do not need to start new transactions after network errors or
other errors except for SQL-level errors.
+1
I agree that --continue-on-error prevents pgbench from terminating only when
SQL-level errors occur, and does not change the behavior in the case of other
types of errors, including network errors.
As I understand it, the proposed --continue-on-error option does not retry the transaction
in any case; it simply gives up on the transaction. That is, when an SQL-level error occurs,
the transaction is reported as "failed" rather than "retried", and the random state is
discarded.
Retrying the failed transaction is not necessary when the transaction
failed due to SQL-level errors. Unlike real-world applications, pgbench
does not need to complete a specific transaction successfully. In the case
of unique constraint violations, retrying the same transaction will
likely result in the same error again.
Agreed.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Dear Nagata-san, Ikeda-san,
In my opinion, when the --continue-on-error option is enabled, pgbench
clients do not need to start new transactions after network errors or
other errors except for SQL-level errors.
+1
I agree that --continue-on-error prevents pgbench from terminating only when
SQL-level errors occur, and does not change the behavior in the case of other
types of errors, including network errors.
OK, so let's do it like that.
BTW, initially we were discussing the combination of --continue-on-error and
--exit-on-abort. What is the conclusion?
I feel Nagata-san's point [1] is valid in this approach.
As I understand it, the proposed --continue-on-error option does not retry the transaction
in any case; it simply gives up on the transaction. That is, when an SQL-level error occurs,
the transaction is reported as "failed" rather than "retried", and the random state is
discarded.
Retrying the failed transaction is not necessary when the transaction
failed due to SQL-level errors. Unlike real-world applications, pgbench
does not need to complete a specific transaction successfully. In the case
of unique constraint violations, retrying the same transaction will
likely result in the same error again.
I intended here that clients could throw away the failed transaction and start
a new one in that case. I hope we are on the same page...
[1]: /messages/by-id/20250614002453.5c72f2ec80864d840150a642@sraoss.co.jp
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Fri, 27 Jun 2025 10:59:09 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
Retrying the failed transaction is not necessary when the transaction
failed due to SQL-level errors. Unlike real-world applications, pgbench
does not need to complete a specific transaction successfully. In the case
of unique constraint violations, retrying the same transaction will
likely result in the same error again.
I intended here that clients could throw away the failed transaction and start
a new one in that case. I hope we are on the same page...
Could I confirm what you mean by "start a new one"?
In the current pgbench, when a query raises an error (a deadlock or
serialization failure), it can be retried using the same random state.
This typically means the query will be retried with the same parameter values.
On the other hand, when the query ultimately fails (possibly after some retries),
the transaction is marked as a "failure", and the next transaction starts with a
new random state (i.e., with new parameter values).
Therefore, if a query fails due to a unique constraint violation and is retried
with the same parameters, it will keep failing on each retry.
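To illustrate with a hypothetical script (not from the patch): on a retry pgbench restores
the saved random state, so the same :v is replayed and the INSERT below keeps colliding,
whereas a new transaction draws a fresh :v:
```
\set v random(1, 50000)
-- a retry replays the same :v; a new transaction draws a new one
INSERT INTO unique_table VALUES (:v);
```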
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Hi,
I've updated the previous patch based on your feedback. Below is a summary of
the changes from v4 to v5:
1. (v5-0001) Added documentation and removed some code paths in response to the
comments.
2. (v5-0001) Modified the condition to transition from CSTATE_WAIT_RESULT to
CSTATE_ERROR, instead of adding a condition in canRetryError(), which had
enabled clients to continue after their transactions failed. This is because, when the
--continue-on-error option is set, clients do not retry failed transactions but
start new ones.
3. (v5-0002) Renamed the enumerator TSTATUS_OTHER_ERROR, which could be
mistakenly interpreted as being related to other SQL errors. It represents an
unknown transaction status, so it has been renamed to TSTATUS_UNKNOWN_ERROR.
On 2025/06/14 0:24, Yugo Nagata wrote:
case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
if (canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
goto error;
}
/* fall through */
default:
/* anything else is unexpected */
pg_log_error("client %d script %d aborted in command %d query %d: %s",
st->id, st->use_file, st->command, qrynum,
PQerrorMessage(st->con));
goto error;
}
When an SQL error other than a serialization or deadlock error occurs, an error message is
output via pg_log_error in this code path. However, I think this should be reported only
when verbose_errors is set, similar to how serialization and deadlock errors are handled
when --continue-on-error is enabled.
I think the error message logged via pg_log_error is useful when verbose_errors
is not specified, because it informs users that the client has exited. Without
it, users may not notice that something went wrong.
On 2025/06/27 19:59, Hayato Kuroda (Fujitsu) wrote:
BTW, initially we were discussing the combination of --continue-on-error and
--exit-on-abort. What is the conclusion?
I feel Nagata-san's point [1] is valid in this approach.
I agree with the conclusion. I've removed the code path that prohibited using
--continue-on-error and --exit-on-abort options together.
On 2025/06/30 15:02, Yugo Nagata wrote:
On Fri, 27 Jun 2025 10:59:09 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:Retrying the failed transaction is not necessary when the transaction
failed due to SQL-level errors. Unlike real-world applications, pgbench
does not need to complete specific transaction successfully. In the case
of unique constraint violations, retrying the same transaction will
likely to result in the same error again.I intended here that clients could throw away the failed transaction and start
new one again in the case. I hope we are on the same page...Could I confirm what you mean by "start new one"?
In the current pgbench, when a query raises an error (a deadlock or
serialization failure), it can be retried using the same random state.
This typically means the query will be retried with the same parameter values.
On the other hand, when the query ultimately fails (possibly after some retries),
the transaction is marked as a "failure", and the next transaction starts with a
new random state (i.e., with new parameter values).
Therefore, if a query fails due to a unique constraint violation and is retried
with the same parameters, it will keep failing on each retry.
Thank you for your explanation. I understand it as you described. I've also
attached a schematic diagram of the state machine. I hope it will help clarify
the behavior of pgbench. Red arrows represent the state transitions taken when an SQL
command fails and the --continue-on-error option is specified.
Best Regards,
Rintaro Ikeda
Attachments:
v5-0001-Add-continue-on-error-option-to-pgbench.patchtext/plain; charset=UTF-8; name=v5-0001-Add-continue-on-error-option-to-pgbench.patchDownload
From e9b8d4579c4adf0582f739327aaa3b9877311633 Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Tue, 1 Jul 2025 14:18:44 +0900
Subject: [PATCH v5 1/2] When the option is set, client rolls back the failed
transaction and starts a new one when its transaction fails due to the reason
other than the deadlock and serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 70 +++++++++++++++-----
src/bin/pgbench/pgbench.c | 51 ++++++++++++--
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 ++++++
3 files changed, 121 insertions(+), 22 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..cc5ab173f2f 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -77,8 +77,8 @@ tps = 896.967014 (without initial connection time)
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ serialization or deadlock errors by default (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +790,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +917,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start new transaction.
+ This option is useful when your custom script may raise errors due to some
+ reason like unique constraints violation. Without this option, the client is
+ aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted even after clients retries <option>--max-tries</option> times by
+ default, so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2432,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2661,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got a SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2679,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2839,9 +2872,11 @@ statement latencies in milliseconds, failures and retries:
<option>--exit-on-abort</option> is specified. Otherwise in the worst
case they only lead to the abortion of the failed client while other
clients continue their run (but some client errors are handled without
- an abortion of the client and reported separately, see below). Later in
- this section it is assumed that the discussed errors are only the
- direct client errors and they are not internal
+ an abortion of the client and reported separately, see below). When
+ <option>--continue-on-error</option> is specified, the client
+ continues to process new transactions even if it encounters an error.
+ Later in this section it is assumed that the discussed errors are only
+ the direct client errors and they are not internal
<application>pgbench</application> errors.
</para>
</listitem>
@@ -2853,12 +2888,14 @@ statement latencies in milliseconds, failures and retries:
connection with the database server was lost or the end of script was reached
without completing the last transaction. In addition, if execution of an SQL
or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
- transaction is rolled back, which also includes setting the client variables
- as they were before the run of this transaction (it is assumed that one
- transaction script contains only one transaction; see
- <xref linkend="transactions-and-scripts"/> for more information).
+ the client is aborted by default. However, if the --continue-on-error option
+ is specified, the client does not abort and proceeds to the next transaction
+ regardless of the error. This case is reported as other failures in the output.
+ Otherwise, if an SQL command fails with serialization or deadlock errors, the
+ client is not aborted. In such cases, the current transaction is rolled back,
+ which also includes setting the client variables as they were before the run
+ of this transaction (it is assumed that one transaction script contains only
+ one transaction; see <xref linkend="transactions-and-scripts"/> for more information).
Transactions with serialization or deadlock errors are repeated after
rollbacks until they complete successfully or reach the maximum
number of tries (specified by the <option>--max-tries</option> option) / the maximum
@@ -2898,7 +2935,8 @@ statement latencies in milliseconds, failures and retries:
<para>
The main report contains the number of failed transactions. If the
- <option>--max-tries</option> option is not equal to 1, the main report also
+ <option>--max-tries</option> option is not equal to 1 and
+ <option>--continue-on-error</option> is not specified, the main report also
contains statistics related to retries: the total number of retried
transactions and total number of retries. The per-script report inherits all
these fields from the main report. The per-statement report displays retry
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..15207290811 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,7 +402,8 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
+ * A failed transaction is defined as unsuccessfully retried transactions
+ * unless continue-on-error option is specified.
* It can be one of two types:
*
* failed (the number of failed transactions) =
@@ -411,6 +412,12 @@ typedef struct StatsData
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When continue-on-error option is specified,
+ * failed (the number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got a error when continue-on-error option
+ * was specified).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +447,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure , which
+ * is enabled if --continue-on-error is
+ * used */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +782,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +967,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue processing transactions after a trasaction fails\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1481,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1531,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4007,7 +4025,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (continue_on_error | canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4528,7 +4546,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4548,6 +4567,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4603,6 +4624,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4643,10 +4665,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6285,6 +6309,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6427,7 +6452,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to other than serialization or deadlock errors and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -6443,6 +6469,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6546,6 +6575,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6705,6 +6738,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7058,6 +7092,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7413,6 +7451,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..afb49b554d0 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique); ' . 'INSERT INTO unique_table VALUES (0);');
+
+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 0/10\b},
+ qr{other failures: 10\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{
+ insert into unique_table values (0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.39.5 (Apple Git-154)
v5-0002-Rename-confusing-enumerator.patchtext/plain; charset=UTF-8; name=v5-0002-Rename-confusing-enumerator.patchDownload
From 690a4ec636eae6fcaf171abb2480e29f07cc0a88 Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Tue, 1 Jul 2025 14:28:04 +0900
Subject: [PATCH v5 2/2] Rename the confusing enumerator which may be
mistakenly assumed to be related to other_sql_errors
---
src/bin/pgbench/pgbench.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 15207290811..3435a8894b1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -484,7 +484,7 @@ typedef enum TStatus
TSTATUS_IDLE,
TSTATUS_IN_BLOCK,
TSTATUS_CONN_ERROR,
- TSTATUS_OTHER_ERROR,
+ TSTATUS_UNKNOWN_ERROR,
} TStatus;
/* Various random sequences are initialized from this one. */
@@ -3576,12 +3576,12 @@ getTransactionStatus(PGconn *con)
* not. Internal error which should never occur.
*/
pg_log_error("unexpected transaction status %d", tx_status);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/* not reached */
Assert(false);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/*
--
2.39.5 (Apple Git-154)
Dear Ikeda-san, Nagata-san,
Thanks for updating the patch!
Could I confirm what you mean by "start a new one"?
In the current pgbench, when a query raises an error (a deadlock or
serialization failure), it can be retried using the same random state.
This typically means the query will be retried with the same parameter values.
On the other hand, when the query ultimately fails (possibly after some retries),
the transaction is marked as a "failure", and the next transaction starts with a
new random state (i.e., with new parameter values).
Therefore, if a query fails due to a unique constraint violation and is retried
with the same parameters, it will keep failing on each retry.
Thank you for your explanation. I understand it as you described. I've also
attached a schematic diagram of the state machine. I hope it will help clarify
the behavior of pgbench. Red arrows represent the state transitions taken when an SQL
command fails and the --continue-on-error option is specified.
Thanks for the diagram, it's quite helpful. Let me share my understanding and opinion.
The terminology "retry" is being used for the transition CSTATE_ERROR->CSTATE_RETRY,
and here the random state would be restored to be the begining:
```
/*
* Reset the random state as they were at the beginning of the
* transaction.
*/
st->cs_func_rs = st->random_state;
```
In the --continue-on-error case, the transition CSTATE_WAIT_RESULT->CSTATE_ERROR
can happen even when the reason for the failure is not serialization or deadlock.
Ultimately the path will reach ...->CSTATE_END_TX->CSTATE_CHOOSE_SCRIPT, the
beginning of the state machine. cs_func_rs is not overwritten along this route, so
a different random value will be generated, or even another script may be
chosen. Is that correct?
And I feel this behavior is OK. The most likely failure here is a unique constraint
violation. Clients should roll the dice again; otherwise they would face the same
error again.
Below are my comments for the latest patch.
01.
```
$ git am ../patches/pgbench/v5-0001-Add-continue-on-error-option-to-pgbench.patch
Applying: When the option is set, client rolls back the failed transaction and...
.git/rebase-apply/patch:65: trailing whitespace.
<literal>serialization</literal>, <literal>deadlock</literal>, or
.git/rebase-apply/patch:139: trailing whitespace.
<option>--max-tries</option> option is not equal to 1 and
warning: 2 lines add whitespace errors.
```
I got warnings when I applied the patch. Please fix it.
02.
```
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got a error when continue-on-error option
```
The first line has a tab, but it should be a normal blank.
03.
```
+ else if (continue_on_error | canRetryError(st->estatus))
```
I feel "|" should be "||".
04.
```
<term><replaceable>retries</replaceable></term>
<listitem>
<para>
number of retries after serialization or deadlock errors
(zero unless <option>--max-tries</option> is not equal to one)
</para>
</listitem>
```
To confirm: failures under --continue-on-error won't be counted here because they
are not "retries"; in other words, they do not reach CSTATE_RETRY, right?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi,
On Tue, 1 Jul 2025 17:43:18 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
I've updated the previous patch based on your feedback. Below is a summary of
the changes from v4 to v5:
Thank you for updating the patch.
On 2025/06/14 0:24, Yugo Nagata wrote:
case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
if (canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
goto error;
}
/* fall through */
default:
/* anything else is unexpected */
pg_log_error("client %d script %d aborted in command %d query %d: %s",
st->id, st->use_file, st->command, qrynum,
PQerrorMessage(st->con));
goto error;
}
When an SQL error other than a serialization or deadlock error occurs, an error message is
output via pg_log_error in this code path. However, I think this should be reported only
when verbose_errors is set, similar to how serialization and deadlock errors are handled when
--continue-on-error is enabled.
I think the error message logged via pg_log_error is useful when verbose_errors
is not specified, because it informs users that the client has exited. Without
it, users may not notice that something went wrong.
However, if a large number of errors occur, this could result in a significant increase
in stderr output during the benchmark.
Users can still notice that something went wrong by checking the “number of other failures”
reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
users aren’t particularly interested in seeing individual error messages as they happen.
It’s true that seeing error messages during the benchmark might be useful in some cases, but
the same could be said for serialization or deadlock errors, and that’s exactly what the
--verbose-errors option is for.
Here are some comments on the patch.
(1)
}
- else if (canRetryError(st->estatus))
+ else if (continue_on_error | canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
Due to this change, when --continue-on-error is enabled, st->state is set to
CSTATE_ERROR regardless of the type of error returned by readCommandResponse,
even when the error is not ESTATUS_OTHER_SQL_ERROR, e.g. ESTATUS_META_COMMAND_ERROR
due to a failure of \gset with a query returning more than one row.
Therefore, this should be something like:
else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
canRetryError(st->estatus))
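For context, a self-contained sketch of this decision logic (the EStatus names
and the serialization/deadlock-only behavior of canRetryError() follow
pgbench.c; the rest is stripped down for illustration, so treat it as an
approximation rather than the actual code):
```
#include <assert.h>
#include <stdbool.h>

/* Subset of pgbench.c's status enums, for illustration only. */
typedef enum
{
    ESTATUS_SERIALIZATION_ERROR,
    ESTATUS_DEADLOCK_ERROR,
    ESTATUS_OTHER_SQL_ERROR,
    ESTATUS_META_COMMAND_ERROR,
} EStatus;

typedef enum { CSTATE_ERROR, CSTATE_ABORTED } ConnectionStateEnum;

static bool continue_on_error = true;

/* In pgbench.c this returns true only for serialization/deadlock errors. */
static bool
canRetryError(EStatus estatus)
{
    return (estatus == ESTATUS_SERIALIZATION_ERROR ||
            estatus == ESTATUS_DEADLOCK_ERROR);
}

/* The corrected branch: --continue-on-error only downgrades plain SQL
 * errors to CSTATE_ERROR; meta-command failures still abort the client. */
static ConnectionStateEnum
next_state(EStatus estatus)
{
    if ((estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
        canRetryError(estatus))
        return CSTATE_ERROR;
    return CSTATE_ABORTED;
}

int
main(void)
{
    assert(next_state(ESTATUS_OTHER_SQL_ERROR) == CSTATE_ERROR);
    assert(next_state(ESTATUS_META_COMMAND_ERROR) == CSTATE_ABORTED);
    assert(next_state(ESTATUS_DEADLOCK_ERROR) == CSTATE_ERROR);
    return 0;
}
```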
(2)
+ " --continue-on-error continue processing transations after a trasaction fails\n"
"trasaction" is a typo and including "transaction" twice looks a bit redundant.
Instead of using the word "transaction", how about
"--continue-on-error continue running after an SQL error"?
This version is shorter, avoids repetition, and describes the actual behavior
well when SQL statements fail.
As for the comments:
(3)
- * A failed transaction is defined as unsuccessfully retried transactions.
+ * A failed transaction is defined as unsuccessfully retried transactions
+ * unless continue-on-error option is specified.
* It can be one of two types:
*
* failed (the number of failed transactions) =
@@ -411,6 +412,12 @@ typedef struct StatsData
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When continue-on-error option is specified,
+ * failed (the number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got a error when continue-on-error option
+ * was specified).
+ *
To explain explicitly that there are two definitions of failed transactions
depending on the situation, how about:
"""
A failed transaction is counted differently depending on whether
the --continue-on-error option is specified.
Without --continue-on-error:
failed (the number of failed transactions) =
'serialization_failures' (they got a serialization error and were not
successfully retried) +
'deadlock_failures' (they got a deadlock error and were not
successfully retried).
When --continue-on-error is specified:
failed (number of failed transactions) =
'serialization_failures' + 'deadlock_failures' +
'other_sql_failures' (they got some other SQL error; the transaction was
not retried and counted as failed due to
--continue-on-error).
"""
(4)
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure , which
+ * is enabled if --continue-on-error is
+ * used */
Is "counted" is more proper than "enabled" here?
Af for the documentations:
(5)
The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ serialization or deadlock errors by default (see
+ <xref linkend="failures-and-retries"/> for more information).
Would it be more readable to simply say
"The next line reports the number of failed transactions (see ... for more information)",
since the definition of "failed transaction" has become a bit messy?
(6)
connection with the database server was lost or the end of script was reached
without completing the last transaction. In addition, if execution of an SQL
or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
- transaction is rolled back, which also includes setting the client variables
- as they were before the run of this transaction (it is assumed that one
- transaction script contains only one transaction; see
- <xref linkend="transactions-and-scripts"/> for more information).
+ the client is aborted by default. However, if the --continue-on-error option
+ is specified, the client does not abort and proceeds to the next transaction
+ regardless of the error. This case is reported as other failures in the output.
+ Otherwise, if an SQL command fails with serialization or deadlock errors, the
+ client is not aborted. In such cases, the current transaction is rolled back,
+ which also includes setting the client variables as they were before the run
+ of this transaction (it is assumed that one transaction script contains only
+ one transaction; see <xref linkend="transactions-and-scripts"/> for more information).
To emphasize the default behavior, I wonder if it would be better to move "by default"
to the beginning of the sentence, like:
"By default, if execution of an SQL or meta command fails for reasons other than
serialization or deadlock errors, the client is aborted."
How about quoting "other failures"? Like:
"These cases are reported as "other failures" in the output."
Also, I feel the meaning of "Otherwise" has become somewhat unclear since the
explanation of --continue-on-error was added between the sentences. So, how about
clarifying that clients are not aborted due to serialization/deadlock errors even
without --continue-on-error? For example:
"In contrast, if an SQL command fails with serialization or deadlock errors, the
client is not aborted even without <option>--continue-on-error</option>.
Instead, the current transaction is rolled back, which also includes setting
the client variables as they were before the run of this transaction
(it is assumed that one transaction script contains only
one transaction; see <xref linkend="transactions-and-scripts"/> for more information)."
(7)
The main report contains the number of failed transactions. If the
- <option>--max-tries</option> option is not equal to 1, the main report also
+ <option>--max-tries</option> option is not equal to 1 and
+ <option>--continue-on-error</option> is not specified, the main report also
contains statistics related to retries: the total number of retried
Is that true?
The retries statistics would be included even without --continue-on-error.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Fri, 4 Jul 2025 13:01:12 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
Thanks for the diagram, it's quite helpful. Let me share my understanding and opinion.
The terminology "retry" is used for the transition CSTATE_ERROR->CSTATE_RETRY,
and here the random state is restored to what it was at the beginning:
```
/*
* Reset the random state as they were at the beginning of the
* transaction.
*/
st->cs_func_rs = st->random_state;
```
Yes. The random state is reset in the CSTATE_RETRY state, which then transitions
directly to CSTATE_START_COMMAND.
In the --continue-on-error case, the transition CSTATE_WAIT_RESULT->CSTATE_ERROR
can happen even when the reason for the failure is neither serialization nor
deadlock. Ultimately the path reaches ...->CSTATE_END_TX->CSTATE_CHOOSE_SCRIPT,
the beginning of the state machine. cs_func_rs is not overwritten along this
route, so a different random value would be generated, or even another script
may be chosen. Is that correct?
Yes, that matches my understanding.
04.
```
<term><replaceable>retries</replaceable></term>
<listitem>
<para>
number of retries after serialization or deadlock errors
(zero unless <option>--max-tries</option> is not equal to one)
</para>
</listitem>
```
To confirm: failures under --continue-on-error won't be counted here because they
are not "retries"; in other words, they do not reach CSTATE_RETRY, right?
Right. Transactions marked as failed due to --continue-on-error are not retried
and should not be counted here.
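To spell out the bookkeeping, here is a compilable toy version of the counters
(the field names mirror the StatsData additions in the patch; the helper names
are mine and hypothetical):
```
#include <stdint.h>
#include <stdio.h>

/* Toy subset of StatsData with the field the patch adds. */
typedef struct
{
    int64_t serialization_failures;
    int64_t deadlock_failures;
    int64_t other_sql_failures; /* counted only with --continue-on-error */
    int64_t retries;            /* bumped only via CSTATE_RETRY */
} Stats;

/* A --continue-on-error failure increments other_sql_failures; since the
 * client never enters CSTATE_RETRY for it, 'retries' stays untouched. */
static void
record_other_failure(Stats *s)
{
    s->other_sql_failures++;
}

/* Mirrors getFailures() as extended by the patch. */
static int64_t
failures(const Stats *s)
{
    return s->serialization_failures + s->deadlock_failures +
        s->other_sql_failures;
}

int
main(void)
{
    Stats s = {0};

    record_other_failure(&s);
    printf("failed = %lld, retries = %lld\n",
           (long long) failures(&s), (long long) s.retries);
    return 0;
}
```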
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Hi,
Thank you for the kind comments.
I've updated the previous patch.
Below is a summary of the changes:
1. The code path and documentation have been corrected based on your feedback.
2. The following message is now suppressed by default. Instead, an error message
is added when a client aborts during SQL execution. (v6-0003-Suppress-xxx.patch)
```
if (verbose_errors)
pg_log_error("client %d script %d aborted in command %d query %d: %s",
st->id, st->use_file, st->command, qrynum,
PQerrorMessage(st->con));
```
On 2025/07/04 22:01, Hayato Kuroda (Fujitsu) wrote:
Could I confirm what you mean by "start a new one"?
In the current pgbench, when a query raises an error (a deadlock or
serialization failure), it can be retried using the same random state.
This typically means the query will be retried with the same parameter values.
On the other hand, when the query ultimately fails (possibly after some retries),
the transaction is marked as a "failure", and the next transaction starts with a
new random state (i.e., with new parameter values).
Therefore, if a query fails due to a unique constraint violation and is retried
with the same parameters, it will keep failing on each retry.
Thank you for your explanation. I understand it as you described. I've also
attached a schematic diagram of the state machine. I hope it will help clarify
the behavior of pgbench. Red arrows represent the transition of state when an SQL
command fails and the --continue-on-error option is specified.
Thanks for the diagram, it's quite helpful. Let me share my understanding and opinion.
The terminology "retry" is used for the transition CSTATE_ERROR->CSTATE_RETRY,
and here the random state is restored to what it was at the beginning:
```
/*
* Reset the random state as they were at the beginning of the
* transaction.
*/
st->cs_func_rs = st->random_state;
```
In the --continue-on-error case, the transition CSTATE_WAIT_RESULT->CSTATE_ERROR
can happen even when the reason for the failure is neither serialization nor
deadlock. Ultimately the path reaches ...->CSTATE_END_TX->CSTATE_CHOOSE_SCRIPT,
the beginning of the state machine. cs_func_rs is not overwritten along this
route, so a different random value would be generated, or even another script
may be chosen. Is that correct?
Yes, I believe that’s correct.
01.
```
$ git am ../patches/pgbench/v5-0001-Add-continue-on-error-option-to-pgbench.patch
Applying: When the option is set, client rolls back the failed transaction and...
.git/rebase-apply/patch:65: trailing whitespace.
<literal>serialization</literal>, <literal>deadlock</literal>, or
.git/rebase-apply/patch:139: trailing whitespace.
<option>--max-tries</option> option is not equal to 1 and
warning: 2 lines add whitespace errors.
```
I got warnings when I applied the patch. Please fix it.
It's been fixed.
02.
```
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got a error when continue-on-error option
```
The first line has a tab, but it should be a normal blank.
I hadn't noticed it. It's fixed.
03.
```
+ else if (continue_on_error | canRetryError(st->estatus))
```
I feel "|" should be "||".
Thank you for pointing that out. Fixed.
04.
```
<term><replaceable>retries</replaceable></term>
<listitem>
<para>
number of retries after serialization or deadlock errors
(zero unless <option>--max-tries</option> is not equal to one)
</para>
</listitem>
```
To confirm: failures under --continue-on-error won't be counted here because they
are not "retries"; in other words, they do not reach CSTATE_RETRY, right?
Yes. I agree with Nagata-san [1]: --continue-on-error is not considered a
"retry" because it doesn't reach CSTATE_RETRY.
On 2025/07/05 0:03, Yugo Nagata wrote:
case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
if (canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
goto error;
}
/* fall through */
default:
/* anything else is unexpected */
pg_log_error("client %d script %d aborted in command %d query %d: %s",
st->id, st->use_file, st->command, qrynum,
PQerrorMessage(st->con));
goto error;
}
When an SQL error other than a serialization or deadlock error occurs, an error message is
output via pg_log_error in this code path. However, I think this should be reported only
when verbose_errors is set, similar to how serialization and deadlock errors are handled when
--continue-on-error is enabled.
I think the error message logged via pg_log_error is useful when verbose_errors
is not specified, because it informs users that the client has exited. Without
it, users may not notice that something went wrong.
However, if a large number of errors occur, this could result in a significant increase
in stderr output during the benchmark.
Users can still notice that something went wrong by checking the “number of other failures”
reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
users aren’t particularly interested in seeing individual error messages as they happen.
It’s true that seeing error messages during the benchmark might be useful in some cases, but
the same could be said for serialization or deadlock errors, and that’s exactly what the
--verbose-errors option is for.
I understand your concern. The condition for calling pg_log_error() was modified
to reduce stderr output.
Additionally, an error message was added for cases where some clients aborted
while executing SQL commands, similar to other code paths that transition to
st->state = CSTATE_ABORTED, as shown in the example below:
```
pg_log_error("client %d aborted while establishing connection", st->id);
st->state = CSTATE_ABORTED;
```
Here are some comments on the patch.
(1)
}
- else if (canRetryError(st->estatus))
+ else if (continue_on_error | canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
Due to this change, when --continue-on-error is enabled, st->state is set to
CSTATE_ERROR regardless of the type of error returned by readCommandResponse,
even when the error is not ESTATUS_OTHER_SQL_ERROR, e.g. ESTATUS_META_COMMAND_ERROR
due to a failure of \gset with a query returning more than one row.
Therefore, this should be something like:
else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
canRetryError(st->estatus))
Thanks for pointing that out — I’ve corrected it.
(2)
+ " --continue-on-error continue processing transations after a trasaction fails\n"
"trasaction" is a typo and including "transaction" twice looks a bit redundant.
Instead of using the word "transaction", how about
"--continue-on-error continue running after an SQL error"?
This version is shorter, avoids repetition, and describes the actual behavior
well when SQL statements fail.
Fixed it.
(3)
- * A failed transaction is defined as unsuccessfully retried transactions.
+ * A failed transaction is defined as unsuccessfully retried transactions
+ * unless continue-on-error option is specified.
* It can be one of two types:
*
* failed (the number of failed transactions) =
@@ -411,6 +412,12 @@ typedef struct StatsData
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When continue-on-error option is specified,
+ * failed (the number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got a error when continue-on-error option
+ * was specified).
+ *
To explain explicitly that there are two definitions of failed transactions
depending on the situation, how about:
"""
A failed transaction is counted differently depending on whether
the --continue-on-error option is specified.
Without --continue-on-error:
failed (the number of failed transactions) =
'serialization_failures' (they got a serialization error and were not
successfully retried) +
'deadlock_failures' (they got a deadlock error and were not
successfully retried).
When --continue-on-error is specified:
failed (number of failed transactions) =
'serialization_failures' + 'deadlock_failures' +
'other_sql_failures' (they got some other SQL error; the transaction was
not retried and counted as failed due to
--continue-on-error).
"""
Thank you for your suggestion. I modified it accordingly.
(4)
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure , which
+ * is enabled if --continue-on-error is
+ * used */
Is "counted" more proper than "enabled" here?
Fixed.
As for the documentation:
(5)
The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ serialization or deadlock errors by default (see
+ <xref linkend="failures-and-retries"/> for more information).
Would it be more readable to simply say
"The next line reports the number of failed transactions (see ... for more information)",
since the definition of "failed transaction" has become a bit messy?
I fixed it to the simple explanation.
(6)
connection with the database server was lost or the end of script was reached
without completing the last transaction. In addition, if execution of an SQL
or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
- transaction is rolled back, which also includes setting the client variables
- as they were before the run of this transaction (it is assumed that one
- transaction script contains only one transaction; see
- <xref linkend="transactions-and-scripts"/> for more information).
+ the client is aborted by default. However, if the --continue-on-error option
+ is specified, the client does not abort and proceeds to the next transaction
+ regardless of the error. This case is reported as other failures in the output.
+ Otherwise, if an SQL command fails with serialization or deadlock errors, the
+ client is not aborted. In such cases, the current transaction is rolled back,
+ which also includes setting the client variables as they were before the run
+ of this transaction (it is assumed that one transaction script contains only
+ one transaction; see <xref linkend="transactions-and-scripts"/> for more information).
To emphasize the default behavior, I wonder if it would be better to move "by default"
to the beginning of the sentence, like:
"By default, if execution of an SQL or meta command fails for reasons other than
serialization or deadlock errors, the client is aborted."
How about quoting "other failures"? Like:
"These cases are reported as "other failures" in the output."
Also, I feel the meaning of "Otherwise" has become somewhat unclear since the
explanation of --continue-on-error was added between the sentences. So, how about
clarifying that clients are not aborted due to serialization/deadlock errors even
without --continue-on-error? For example:
"In contrast, if an SQL command fails with serialization or deadlock errors, the
client is not aborted even without <option>--continue-on-error</option>.
Instead, the current transaction is rolled back, which also includes setting
the client variables as they were before the run of this transaction
(it is assumed that one transaction script contains only
one transaction; see <xref linkend="transactions-and-scripts"/> for more information)."
I've modified it according to your suggestion.
(7)
The main report contains the number of failed transactions. If the
- <option>--max-tries</option> option is not equal to 1, the main report also
+ <option>--max-tries</option> option is not equal to 1 and
+ <option>--continue-on-error</option> is not specified, the main report also
contains statistics related to retries: the total number of retried
Is that true?
The retries statistics would be included even without --continue-on-error.
That was wrong. I corrected it.
[1]: /messages/by-id/20250705002239.27e6e5a4ba22c047ac2fa16a@sraoss.co.jp
Regards,
Rintaro Ikeda
Attachments:
v6-0001-Add-continue-on-error-option.patchtext/plain; charset=UTF-8; name=v6-0001-Add-continue-on-error-option.patchDownload
From caa1ede6a7b5ac3e19b73943a1a810bf98e32e21 Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:36:37 +0900
Subject: [PATCH v6 1/3] Add --continue-on-error option
When the option is set, the client rolls back the failed transaction and starts
a new one when the transaction fails for reasons other than deadlock or
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 71 +++++++++++++++-----
src/bin/pgbench/pgbench.c | 55 +++++++++++++--
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 ++++++
3 files changed, 124 insertions(+), 24 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..15fcb45e223 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transaction but start a new one.
+ This option is useful when your custom script may raise errors for reasons
+ such as a unique constraint violation. Without this option, the client is
+ aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted, even after the client retries <option>--max-tries</option> times,
+ so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2431,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2660,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2839,9 +2871,11 @@ statement latencies in milliseconds, failures and retries:
<option>--exit-on-abort</option> is specified. Otherwise in the worst
case they only lead to the abortion of the failed client while other
clients continue their run (but some client errors are handled without
- an abortion of the client and reported separately, see below). Later in
- this section it is assumed that the discussed errors are only the
- direct client errors and they are not internal
+ an abortion of the client and reported separately, see below). When
+ <option>--continue-on-error</option> is specified, the client
+ continues to process new transactions even if it encounters an error.
+ Later in this section it is assumed that the discussed errors are only
+ the direct client errors and they are not internal
<application>pgbench</application> errors.
</para>
</listitem>
@@ -2851,14 +2885,17 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
+ without completing the last transaction. By default, if execution of an SQL
or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
- transaction is rolled back, which also includes setting the client variables
- as they were before the run of this transaction (it is assumed that one
- transaction script contains only one transaction; see
- <xref linkend="transactions-and-scripts"/> for more information).
+ the client is aborted. However, if the --continue-on-error option is specified,
+ the client does not abort and proceeds to the next transaction regardless of
+ the error. These cases are reported as "other failures" in the output.
+ In contrast, if an SQL command fails with serialization or deadlock errors, the
+ client is not aborted even without <option>--continue-on-error</option>.
+ Instead, the current transaction is rolled back, which also includes setting
+ the client variables as they were before the run of this transaction
+ (it is assumed that one transaction script contains only one transaction;
+ see <xref linkend="transactions-and-scripts"/> for more information).
Transactions with serialization or deadlock errors are repeated after
rollbacks until they complete successfully or reach the maximum
number of tries (specified by the <option>--max-tries</option> option) / the maximum
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..4b3ddb3146f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,15 +402,23 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
*
+ * Without --continue-on-error:
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +448,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +783,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +968,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1482,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1532,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4007,7 +4026,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4528,7 +4548,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4548,6 +4569,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4603,6 +4626,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4643,10 +4667,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6285,6 +6311,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6427,7 +6454,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to errors other than serialization or deadlock and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -6443,6 +6471,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6546,6 +6577,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6705,6 +6740,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7058,6 +7094,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7413,6 +7453,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..8bb35dda5f7 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
+
+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 0/10\b},
+ qr{other failures: 10\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{
+ insert into unique_table values (0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.39.5 (Apple Git-154)
v6-0002-Rename-a-confusing-enumerator.patchtext/plain; charset=UTF-8; name=v6-0002-Rename-a-confusing-enumerator.patchDownload
From c1074c2a076e879196e5c68bc641995bface8453 Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:50:36 +0900
Subject: [PATCH v6 2/3] Rename a confusing enumerator
Rename the confusing enumerator which may be mistakenly assumed to be related to
other_sql_errors
---
src/bin/pgbench/pgbench.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4b3ddb3146f..95a7083ede0 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -485,7 +485,7 @@ typedef enum TStatus
TSTATUS_IDLE,
TSTATUS_IN_BLOCK,
TSTATUS_CONN_ERROR,
- TSTATUS_OTHER_ERROR,
+ TSTATUS_UNKNOWN_ERROR,
} TStatus;
/* Various random sequences are initialized from this one. */
@@ -3577,12 +3577,12 @@ getTransactionStatus(PGconn *con)
* not. Internal error which should never occur.
*/
pg_log_error("unexpected transaction status %d", tx_status);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/* not reached */
Assert(false);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/*
--
2.39.5 (Apple Git-154)
v6-0003-Suppress-error-messages-unless-client-abort.patchtext/plain; charset=UTF-8; name=v6-0003-Suppress-error-messages-unless-client-abort.patchDownload
From 6d916730e26384e7f3a559515bd16d0d9831064b Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:46:19 +0900
Subject: [PATCH v6 3/3] Suppress error messages unless client abort
Suppress error messages for individual failed SQL commands and report them only
when the client aborts.
---
src/bin/pgbench/pgbench.c | 10 +++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 14 +++++++-------
2 files changed, 14 insertions(+), 10 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 95a7083ede0..26995b93313 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3385,9 +3385,10 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ if (verbose_errors)
+ pg_log_error("client %d script %d aborted in command %d query %d: %s",
+ st->id, st->use_file, st->command, qrynum,
+ PQerrorMessage(st->con));
goto error;
}
@@ -4030,7 +4031,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
+ {
+ pg_log_error("client %d aborted while executing SQL commands", st->id);
st->state = CSTATE_ABORTED;
+ }
break;
/*
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 8bb35dda5f7..a38a1cf4ab7 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -301,7 +301,7 @@ $node->append_conf('postgresql.conf',
. "log_parameter_max_length_on_error = 0");
$node->reload;
$node->pgbench(
- '-n -t1 -c1 -M prepared',
+ '-n -t1 -c1 -M prepared --verbose',
2,
[],
[
@@ -328,7 +328,7 @@ $node->append_conf('postgresql.conf',
. "log_parameter_max_length_on_error = 64");
$node->reload;
$node->pgbench(
- '-n -t1 -c1 -M prepared',
+ '-n -t1 -c1 -M prepared --verbose',
2,
[],
[
@@ -342,7 +342,7 @@ SELECT 1 / (random() / 2)::int, :one::int, :two::int;
}
});
$node->pgbench(
- '-n -t1 -c1 -M prepared',
+ '-n -t1 -c1 -M prepared --verbose',
2,
[],
[
@@ -370,7 +370,7 @@ $node->append_conf('postgresql.conf',
. "log_parameter_max_length_on_error = -1");
$node->reload;
$node->pgbench(
- '-n -t1 -c1 -M prepared',
+ '-n -t1 -c1 -M prepared --verbose',
2,
[],
[
@@ -387,7 +387,7 @@ SELECT 1 / (random() / 2)::int, :one::int, :two::int;
$node->append_conf('postgresql.conf', "log_min_duration_statement = 0");
$node->reload;
$node->pgbench(
- '-n -t1 -c1 -M prepared',
+ '-n -t1 -c1 -M prepared --verbose',
2,
[],
[
@@ -410,7 +410,7 @@ $log = undef;
# Check that bad parameters are reported during typinput phase of BIND
$node->pgbench(
- '-n -t1 -c1 -M prepared',
+ '-n -t1 -c1 -M prepared --verbose',
2,
[],
[
@@ -1464,7 +1464,7 @@ for my $e (@errors)
my $n = '001_pgbench_error_' . $name;
$n =~ s/ /_/g;
$node->pgbench(
- '-n -t 1 -Dfoo=bla -Dnull=null -Dtrue=true -Done=1 -Dzero=0.0 -Dbadtrue=trueXXX'
+ '-n -t 1 -Dfoo=bla -Dnull=null -Dtrue=true -Done=1 -Dzero=0.0 -Dbadtrue=trueXXX --verbose'
. ' -Dmaxint=9223372036854775807 -Dminint=-9223372036854775808'
. ($no_prepare ? '' : ' -M prepared'),
$status,
--
2.39.5 (Apple Git-154)
On Wed, 9 Jul 2025 23:58:32 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
Hi,
Thank you for the kind comments.
I've updated the previous patch.
Thank you for updating the patch!
However, if a large number of errors occur, this could result in a significant increase
in stderr output during the benchmark.
Users can still notice that something went wrong by checking the “number of other failures”
reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
users aren’t particularly interested in seeing individual error messages as they happen.
It’s true that seeing error messages during the benchmark might be useful in some cases, but
the same could be said for serialization or deadlock errors, and that’s exactly what the
--verbose-errors option is for.
I understand your concern. The condition for calling pg_log_error() was modified
to reduce stderr output.
Additionally, an error message was added for cases where some clients aborted
while executing SQL commands, similar to other code paths that transition to
st->state = CSTATE_ABORTED, as shown in the example below:
```
pg_log_error("client %d aborted while establishing connection", st->id);
st->state = CSTATE_ABORTED;
```
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ if (verbose_errors)
+ pg_log_error("client %d script %d aborted in command %d query %d: %s",
+ st->id, st->use_file, st->command, qrynum,
+ PQerrorMessage(st->con));
goto error;
}
Thanks to this fix, error messages caused by SQL errors are now output only when
--verbose-errors is enabled. However, the comment describes the condition as "unexpected",
and the message states that the client was "aborted". This does not seem accurate, since
clients are not aborted due to SQL errors when --continue-on-error is enabled.
I think the error message should be emitted using commandError() when both
--continue-on-error and --verbose-errors are specified, like this:
case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
goto error;
}
/* fall through */
In addition, the error message in the "default" case should be shown regardless
of --verbose-errors, since it represents an unexpected situation and should
always be reported.
Finally, I believe this fix should be included in patch 0001 rather than 0003,
as it would be part of the implementation of --continue-on-error.
As for 0003:
+ {
+ pg_log_error("client %d aborted while executing SQL commands", st->id);
st->state = CSTATE_ABORTED;
+ }
break;
I understand that the patch is not directly related to --continue-on-error, similar to 0002,
and that it aims to improve the error message to indicate that the client was aborted due to
some error during readCommandResponse().
However, this message doesn't seem entirely accurate, since the error is not always caused
by an SQL command failure itself. For example, it could also be due to a failure of the \gset
meta-command.
In addition, this fix causes error messages to be emitted twice. For example, if \gset fails,
the following similar messages are printed:
pgbench: error: client 0 script 0 command 0 query 0: expected one row, got 0
pgbench: error: client 0 aborted while executing SQL commands
Even worse, if an unexpected error occurs in readCommandResponse() (i.e., the default case),
the following messages are emitted, both indicating that the client was aborted:
pgbench: error: client 0 script 0 aborted in command ... query ...
pgbench: error: client 0 aborted while executing SQL commands
I feel this is a bit redundant.
Therefore, if we are to improve these messages to indicate explicitly that the client
was aborted, I would suggest modifying the error messages in readCommandResponse() rather
than adding a new one in advanceConnectionState().
I've attached patch 0003 incorporating my suggestion. What do you think?
Additionally, the patch 0001 includes the fix that was originally part of
your proposed 0003, as previously discussed.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v7-0003-Improve-error-messages-for-errors-that-cause-clie.patchtext/x-diff; name=v7-0003-Improve-error-messages-for-errors-that-cause-clie.patchDownload
From f9c3ad15d2cac1e536b0eb3c93aabc2f127b4f30 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Thu, 10 Jul 2025 17:21:05 +0900
Subject: [PATCH v7 3/3] Improve error messages for errors that cause client
abortion
This commit modifies relevant error messages to explicitly indicate that the
client was aborted. As part of this change, pg_log_error was replaced with
commandFailed().
---
src/bin/pgbench/pgbench.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 7dbeb79ca8d..41a7c19fff5 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3309,8 +3309,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
if (is_last && meta == META_GSET)
{
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3324,8 +3323,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, PQntuples(res));
+ commandFailed(st, "gset", psprintf("expected one row, got %d", PQntuples(res)));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3339,18 +3337,18 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
for (int fld = 0; fld < PQnfields(res); fld++)
{
char *varname = PQfname(res, fld);
+ char *cmd = (meta == META_ASET ? "aset" : "gset");
/* allocate varname only if necessary, freed below */
if (*varprefix != '\0')
varname = psprintf("%s%s", varprefix, varname);
/* store last row result as a string */
- if (!putVariable(&st->variables, meta == META_ASET ? "aset" : "gset", varname,
+ if (!putVariable(&st->variables, cmd, varname,
PQgetvalue(res, ntuples - 1, fld)))
{
/* internal error */
- pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
- st->id, st->use_file, st->command, qrynum, varname);
+ commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
--
2.43.0
v7-0002-Rename-a-confusing-enumerator.patchtext/x-diff; name=v7-0002-Rename-a-confusing-enumerator.patchDownload
From b1da757cb71acb7df7174a3a3dd2755461e80276 Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:50:36 +0900
Subject: [PATCH v7 2/3] Rename a confusing enumerator
Rename the confusing enumerator which may be mistakenly assumed to be related to
other_sql_errors
---
src/bin/pgbench/pgbench.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index edd8b01f794..7dbeb79ca8d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -485,7 +485,7 @@ typedef enum TStatus
TSTATUS_IDLE,
TSTATUS_IN_BLOCK,
TSTATUS_CONN_ERROR,
- TSTATUS_OTHER_ERROR,
+ TSTATUS_UNKNOWN_ERROR,
} TStatus;
/* Various random sequences are initialized from this one. */
@@ -3577,12 +3577,12 @@ getTransactionStatus(PGconn *con)
* not. Internal error which should never occur.
*/
pg_log_error("unexpected transaction status %d", tx_status);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/* not reached */
Assert(false);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/*
--
2.43.0
v7-0001-Add-continue-on-error-option.patchtext/x-diff; name=v7-0001-Add-continue-on-error-option.patchDownload
From d0d3ec97abe98d77501dbf38bbb248493535be52 Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:36:37 +0900
Subject: [PATCH v7 1/3] Add --continue-on-error option
When the option is set, the client rolls back the failed transaction and starts
a new one when the transaction fails for reasons other than deadlock or
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 71 +++++++++++++++-----
src/bin/pgbench/pgbench.c | 57 +++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 ++++++
3 files changed, 125 insertions(+), 25 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..15fcb45e223 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transaction but start a new one.
+ This option is useful when your custom script may raise errors for reasons
+ such as a unique constraint violation. Without this option, the client is
+ aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted, even after the client retries <option>--max-tries</option> times,
+ so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2431,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2660,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2839,9 +2871,11 @@ statement latencies in milliseconds, failures and retries:
<option>--exit-on-abort</option> is specified. Otherwise in the worst
case they only lead to the abortion of the failed client while other
clients continue their run (but some client errors are handled without
- an abortion of the client and reported separately, see below). Later in
- this section it is assumed that the discussed errors are only the
- direct client errors and they are not internal
+ an abortion of the client and reported separately, see below). When
+ <option>--continue-on-error</option> is specified, the client
+ continues to process new transactions even if it encounters an error.
+ Later in this section it is assumed that the discussed errors are only
+ the direct client errors and they are not internal
<application>pgbench</application> errors.
</para>
</listitem>
@@ -2851,14 +2885,17 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
+ without completing the last transaction. By default, if execution of an SQL
or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
- transaction is rolled back, which also includes setting the client variables
- as they were before the run of this transaction (it is assumed that one
- transaction script contains only one transaction; see
- <xref linkend="transactions-and-scripts"/> for more information).
+ the client is aborted. However, if the --continue-on-error option is specified,
+ the client does not abort and proceeds to the next transaction regardless of
+ the error. These cases are reported as "other failures" in the output.
+ In contrast, if an SQL command fails with serialization or deadlock errors, the
+ client is not aborted even without <option>--continue-on-error</option>.
+ Instead, the current transaction is rolled back, which also includes setting
+ the client variables as they were before the run of this transaction
+ (it is assumed that one transaction script contains only one transaction;
+ see <xref linkend="transactions-and-scripts"/> for more information).
Transactions with serialization or deadlock errors are repeated after
rollbacks until they complete successfully or reach the maximum
number of tries (specified by the <option>--max-tries</option> option) / the maximum
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..edd8b01f794 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,15 +402,23 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
*
+ * Without --continue-on-error:
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +448,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +783,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +968,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1482,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1532,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3356,7 +3375,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
@@ -4007,7 +4026,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4528,7 +4548,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4548,6 +4569,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4603,6 +4626,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4643,10 +4667,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6285,6 +6311,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6427,7 +6454,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to errors other than serialization or deadlock errors and
+ * --continue-on-error is not set.
*/
if (total_cnt <= 0)
return;
@@ -6443,6 +6471,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6546,6 +6577,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6705,6 +6740,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7058,6 +7094,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7413,6 +7453,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..8bb35dda5f7 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
+
+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 0/10\b},
+ qr{other failures: 10\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{
insert into unique_table values (0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.43.0
Hi,
On 2025/07/10 18:17, Yugo Nagata wrote:
On Wed, 9 Jul 2025 23:58:32 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

Hi,
Thank you for the kind comments.
I've updated the previous patch.

Thank you for updating the patch!
However, if a large number of errors occur, this could result in a significant increase
in stderr output during the benchmark.

Users can still notice that something went wrong by checking the “number of other failures”
reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
users aren’t particularly interested in seeing individual error messages as they happen.

It’s true that seeing error messages during the benchmark might be useful in some cases, but
the same could be said for serialization or deadlock errors, and that’s exactly what the
--verbose-errors option is for.

I understand your concern. The condition for calling pg_log_error() was modified
to reduce stderr output.
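With that change, per-error messages become opt-in. For illustration only, a run that still prints each error as it happens would combine the new option with --verbose-errors (assuming the insert_to_unique_column.sql script from earlier in the thread):
```
% pgbench -d postgres -f insert_to_unique_column.sql -T 10 \
    --continue-on-error --verbose-errors --failures-detailed
```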
Additionally, an error message was added for cases where some clients aborted
while executing SQL commands, similar to other code paths that transition to
st->state = CSTATE_ABORTED, as shown in the example below:
```
pg_log_error("client %d aborted while establishing connection", st->id);
st->state = CSTATE_ABORTED;
```

```
            default:
                /* anything else is unexpected */
-               pg_log_error("client %d script %d aborted in command %d query %d: %s",
-                            st->id, st->use_file, st->command, qrynum,
-                            PQerrorMessage(st->con));
+               if (verbose_errors)
+                   pg_log_error("client %d script %d aborted in command %d query %d: %s",
+                                st->id, st->use_file, st->command, qrynum,
+                                PQerrorMessage(st->con));
                goto error;
        }
```

Thanks to this fix, error messages caused by SQL errors are now output only when
--verbose-errors is enabled. However, the comment describes the condition as "unexpected",
and the message states that the client was "aborted". This does not seem accurate, since
clients are not aborted due to SQL errors when --continue-on-error is enabled.

I think the error message should be emitted using commandError() when both
--continue-on-error and --verbose-errors are specified, like this;
```
            case PGRES_NONFATAL_ERROR:
            case PGRES_FATAL_ERROR:
                st->estatus = getSQLErrorStatus(PQresultErrorField(res,
                                                                   PG_DIAG_SQLSTATE));
                if (continue_on_error || canRetryError(st->estatus))
                {
                    if (verbose_errors)
                        commandError(st, PQerrorMessage(st->con));
                    goto error;
                }
                /* fall through */
```

In addition, the error message in the "default" case should be shown regardless
of --verbose-errors, since it represents an unexpected situation and should
always be reported.

Finally, I believe this fix should be included in patch 0001 rather than 0003,
as it would be a part of the implementation of --continue-on-error.

As of 0003:
```
+                   {
+                       pg_log_error("client %d aborted while executing SQL commands", st->id);
                        st->state = CSTATE_ABORTED;
+                   }
                    break;
```

I understand that the patch is not directly related to --continue-on-error, similar to 0002,
and that it aims to improve the error message to indicate that the client was aborted due to
some error during readCommandResponse().

However, this message doesn't seem entirely accurate, since the error is not always caused
by an SQL command failure itself. For example, it could also be due to a failure of the \gset
meta-command.

In addition, this fix causes error messages to be emitted twice. For example, if \gset fails,
the following similar messages are printed:
```
pgbench: error: client 0 script 0 command 0 query 0: expected one row, got 0
pgbench: error: client 0 aborted while executing SQL commands
```

Even worse, if an unexpected error occurs in readCommandResponse() (i.e., the default case),
the following messages are emitted, both indicating that the client was aborted;
```
pgbench: error: client 0 script 0 aborted in command ... query ...
pgbench: error: client 0 aborted while executing SQL commands
```

I feel this is a bit redundant.
Therefore, if we are to improve these messages to indicate explicitly that the client
was aborted, I would suggest modifying the error messages in readCommandResponse() rather
than adding a new one in advanceConnectionState().

I've attached patch 0003 incorporating my suggestion. What do you think?
Thank you very much for the updated patch!
I reviewed 0003 and it looks great - the error messages have become easier to understand.
I noticed one small thing I’d like to discuss. I'm not sure that users can clearly
tell which one was aborted in the following error message, the client or the script.
pgbench: error: client 0 script 0 aborted in command ... query ...
Since the code path always results in a client abort, I wonder if the following
message might be clearer:
pgbench: error: client 0 aborted in script 0 command ... query ...
Regards,
Rintaro Ikeda
Hi,
On Sun, 13 Jul 2025 23:15:24 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
I noticed one small thing I’d like to discuss. I'm not sure that users can clearly
tell which one was aborted in the following error message, the client or the script.

pgbench: error: client 0 script 0 aborted in command ... query ...

Since the code path always results in a client abort, I wonder if the following
message might be clearer:

pgbench: error: client 0 aborted in script 0 command ... query ...
Indeed, it seems clearer to explicitly state that it is the client that
was aborted.
I've attached an updated patch that replaces the remaining message mentioned
above with a call to commandFailed(). With this change, the output in such
situations will now be:
"client 0 aborted in command 0 (SQL) of script 0; ...."
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v8-0003-Improve-error-messages-for-errors-that-cause-clie.patchtext/x-diff; name=v8-0003-Improve-error-messages-for-errors-that-cause-clie.patchDownload
From 19c5ee6c077091eaf99e133b26a3e822a39f3964 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Thu, 10 Jul 2025 17:21:05 +0900
Subject: [PATCH v8 3/3] Improve error messages for errors that cause client
abortion
This commit modifies relevant error messages to explicitly indicate that the
client was aborted. As part of this change, pg_log_error was replaced with
commandFailed().
---
src/bin/pgbench/pgbench.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 7dbeb79ca8d..41a7c19fff5 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3309,8 +3309,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
if (is_last && meta == META_GSET)
{
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3324,8 +3323,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, PQntuples(res));
+ commandFailed(st, "gset", psprintf("expected one row, got %d", PQntuples(res)));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3339,18 +3337,18 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
for (int fld = 0; fld < PQnfields(res); fld++)
{
char *varname = PQfname(res, fld);
+ char *cmd = (meta == META_ASET ? "aset" : "gset");
/* allocate varname only if necessary, freed below */
if (*varprefix != '\0')
varname = psprintf("%s%s", varprefix, varname);
/* store last row result as a string */
- if (!putVariable(&st->variables, meta == META_ASET ? "aset" : "gset", varname,
+ if (!putVariable(&st->variables, cmd, varname,
PQgetvalue(res, ntuples - 1, fld)))
{
/* internal error */
- pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
- st->id, st->use_file, st->command, qrynum, varname);
+ commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
--
2.43.0
v8-0002-Rename-a-confusing-enumerator.patchtext/x-diff; name=v8-0002-Rename-a-confusing-enumerator.patchDownload
From 8a1583068ee4737ba82664a359638902c93e56a3 Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:50:36 +0900
Subject: [PATCH v8 2/3] Rename a confusing enumerator
Rename the confusing enumerator, which may be mistakenly assumed to be related to
ESTATUS_OTHER_SQL_ERROR.
---
src/bin/pgbench/pgbench.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index edd8b01f794..7dbeb79ca8d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -485,7 +485,7 @@ typedef enum TStatus
TSTATUS_IDLE,
TSTATUS_IN_BLOCK,
TSTATUS_CONN_ERROR,
- TSTATUS_OTHER_ERROR,
+ TSTATUS_UNKNOWN_ERROR,
} TStatus;
/* Various random sequences are initialized from this one. */
@@ -3577,12 +3577,12 @@ getTransactionStatus(PGconn *con)
* not. Internal error which should never occur.
*/
pg_log_error("unexpected transaction status %d", tx_status);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/* not reached */
Assert(false);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/*
--
2.43.0
v8-0001-Add-continue-on-error-option.patchtext/x-diff; name=v8-0001-Add-continue-on-error-option.patchDownload
From a5d4081648105990d1ce9085ea9ffe23f09e01f9 Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:36:37 +0900
Subject: [PATCH v8 1/3] Add --continue-on-error option
When the option is set, the client rolls back the failed transaction and starts a
new one when its transaction fails for a reason other than a deadlock or
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 71 +++++++++++++++-----
src/bin/pgbench/pgbench.c | 57 +++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 ++++++
3 files changed, 125 insertions(+), 25 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..15fcb45e223 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start a new transaction.
+ This option is useful when your custom script may raise errors for some
+ reason, such as a unique constraint violation. Without this option, the client is
+ aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted, even after a transaction has been retried <option>--max-tries</option>
+ times, so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2431,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2660,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2839,9 +2871,11 @@ statement latencies in milliseconds, failures and retries:
<option>--exit-on-abort</option> is specified. Otherwise in the worst
case they only lead to the abortion of the failed client while other
clients continue their run (but some client errors are handled without
- an abortion of the client and reported separately, see below). Later in
- this section it is assumed that the discussed errors are only the
- direct client errors and they are not internal
+ an abortion of the client and reported separately, see below). When
+ <option>--continue-on-error</option> is specified, the client
+ continues to process new transactions even if it encounters an error.
+ Later in this section it is assumed that the discussed errors are only
+ the direct client errors and they are not internal
<application>pgbench</application> errors.
</para>
</listitem>
@@ -2851,14 +2885,17 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
+ without completing the last transaction. By default, if execution of an SQL
or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
- transaction is rolled back, which also includes setting the client variables
- as they were before the run of this transaction (it is assumed that one
- transaction script contains only one transaction; see
- <xref linkend="transactions-and-scripts"/> for more information).
+ the client is aborted. However, if the <option>--continue-on-error</option> option is specified,
+ the client does not abort and proceeds to the next transaction regardless of
+ the error. These cases are reported as "other failures" in the output.
+ In contrast, if an SQL command fails with serialization or deadlock errors, the
+ client is not aborted even without <option>--continue-on-error</option>.
+ Instead, the current transaction is rolled back, which also includes setting
+ the client variables as they were before the run of this transaction
+ (it is assumed that one transaction script contains only one transaction;
+ see <xref linkend="transactions-and-scripts"/> for more information).
Transactions with serialization or deadlock errors are repeated after
rollbacks until they complete successfully or reach the maximum
number of tries (specified by the <option>--max-tries</option> option) / the maximum
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..edd8b01f794 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,15 +402,23 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
*
+ * Without --continue-on-error:
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +448,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +783,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +968,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1482,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1532,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3356,7 +3375,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
@@ -4007,7 +4026,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4528,7 +4548,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4548,6 +4569,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4603,6 +4626,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4643,10 +4667,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6285,6 +6311,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6427,7 +6454,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to errors other than serialization or deadlock errors and
+ * --continue-on-error is not set.
*/
if (total_cnt <= 0)
return;
@@ -6443,6 +6471,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6546,6 +6577,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6705,6 +6740,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7058,6 +7094,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7413,6 +7453,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..8bb35dda5f7 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
+
+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 0/10\b},
+ qr{other failures: 10\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{
insert into unique_table values (0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.43.0
Hi,
On 2025/07/15 11:16, Yugo Nagata wrote:
I noticed one small thing I’d like to discuss. I'm not sure that users can clearly
tell which one was aborted in the following error message, the client or the script.

pgbench: error: client 0 script 0 aborted in command ... query ...

Since the code path always results in a client abort, I wonder if the following
message might be clearer:

pgbench: error: client 0 aborted in script 0 command ... query ...

Indeed, it seems clearer to explicitly state that it is the client that
was aborted.

I've attached an updated patch that replaces the remaining message mentioned
above with a call to commandFailed(). With this change, the output in such
situations will now be:

"client 0 aborted in command 0 (SQL) of script 0; ...."
Thank you for updating the patch!
When I executed a custom script that may raise a unique constraint violation, I
got the following output:
pgbench: error: client 0 script 0 aborted in command 1 query 0: ERROR:
duplicate key value violates unique constraint "test_col2_key"
I think we should also change the error message in pg_log_error. I modified the
patch v8-0003 as follows:
```
@@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
             default:
                 /* anything else is unexpected */
-                pg_log_error("client %d script %d aborted in command %d query %d: %s",
-                             st->id, st->use_file, st->command, qrynum,
+                pg_log_error("client %d aborted in command %d query %d of script %d: %s",
+                             st->id, st->command, qrynum, st->use_file,
                              PQerrorMessage(st->con));
                 goto error;
         }
```
With this change, the output now is like this:
pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
duplicate key value violates unique constraint "test_col2_key"
I want to hear your thoughts.
Also, let me ask one question. In this case, I directly modified your commit in
the v8-0003 patch. Is that the right way to update the patch?
Regards,
Rintaro Ikeda
Attachments:
v9-0001-Add-continue-on-error-option.patchtext/plain; charset=UTF-8; name=v9-0001-Add-continue-on-error-option.patchDownload
From 202e24cfad77763bf4da2f3023845223adb60e2c Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:36:37 +0900
Subject: [PATCH v9 1/3] Add --continue-on-error option
When the option is set, the client rolls back the failed transaction and starts a
new one when its transaction fails for a reason other than a deadlock or
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 71 +++++++++++++++-----
src/bin/pgbench/pgbench.c | 57 +++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 ++++++
3 files changed, 125 insertions(+), 25 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..15fcb45e223 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start a new transaction.
+ This option is useful when your custom script may raise errors for some
+ reason, such as a unique constraint violation. Without this option, the client is
+ aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted, even after a transaction has been retried <option>--max-tries</option>
+ times, so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2431,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2660,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2839,9 +2871,11 @@ statement latencies in milliseconds, failures and retries:
<option>--exit-on-abort</option> is specified. Otherwise in the worst
case they only lead to the abortion of the failed client while other
clients continue their run (but some client errors are handled without
- an abortion of the client and reported separately, see below). Later in
- this section it is assumed that the discussed errors are only the
- direct client errors and they are not internal
+ an abortion of the client and reported separately, see below). When
+ <option>--continue-on-error</option> is specified, the client
+ continues to process new transactions even if it encounters an error.
+ Later in this section it is assumed that the discussed errors are only
+ the direct client errors and they are not internal
<application>pgbench</application> errors.
</para>
</listitem>
@@ -2851,14 +2885,17 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
+ without completing the last transaction. By default, if execution of an SQL
or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
- transaction is rolled back, which also includes setting the client variables
- as they were before the run of this transaction (it is assumed that one
- transaction script contains only one transaction; see
- <xref linkend="transactions-and-scripts"/> for more information).
+ the client is aborted. However, if the <option>--continue-on-error</option> option is specified,
+ the client does not abort and proceeds to the next transaction regardless of
+ the error. These cases are reported as "other failures" in the output.
+ In contrast, if an SQL command fails with serialization or deadlock errors, the
+ client is not aborted even without <option>--continue-on-error</option>.
+ Instead, the current transaction is rolled back, which also includes setting
+ the client variables as they were before the run of this transaction
+ (it is assumed that one transaction script contains only one transaction;
+ see <xref linkend="transactions-and-scripts"/> for more information).
Transactions with serialization or deadlock errors are repeated after
rollbacks until they complete successfully or reach the maximum
number of tries (specified by the <option>--max-tries</option> option) / the maximum
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..edd8b01f794 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,15 +402,23 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
*
+ * Without --continue-on-error:
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +448,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +783,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +968,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1482,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1532,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3356,7 +3375,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
@@ -4007,7 +4026,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4528,7 +4548,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4548,6 +4569,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4603,6 +4626,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4643,10 +4667,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6285,6 +6311,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6427,7 +6454,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to errors other than serialization or deadlock errors and
+ * --continue-on-error is not set.
*/
if (total_cnt <= 0)
return;
@@ -6443,6 +6471,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6546,6 +6577,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6705,6 +6740,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7058,6 +7094,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7413,6 +7453,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..8bb35dda5f7 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
+
+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 0/10\b},
+ qr{other failures: 10\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{
insert into unique_table values (0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.39.5 (Apple Git-154)
v9-0002-Rename-a-confusing-enumerator.patchtext/plain; charset=UTF-8; name=v9-0002-Rename-a-confusing-enumerator.patchDownload
From e92761bfffc97117b732d589a786f8ee4d9e29a7 Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:50:36 +0900
Subject: [PATCH v9 2/3] Rename a confusing enumerator
Rename the confusing enumerator, which may be mistakenly assumed to be related to
ESTATUS_OTHER_SQL_ERROR.
---
src/bin/pgbench/pgbench.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index edd8b01f794..7dbeb79ca8d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -485,7 +485,7 @@ typedef enum TStatus
TSTATUS_IDLE,
TSTATUS_IN_BLOCK,
TSTATUS_CONN_ERROR,
- TSTATUS_OTHER_ERROR,
+ TSTATUS_UNKNOWN_ERROR,
} TStatus;
/* Various random sequences are initialized from this one. */
@@ -3577,12 +3577,12 @@ getTransactionStatus(PGconn *con)
* not. Internal error which should never occur.
*/
pg_log_error("unexpected transaction status %d", tx_status);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/* not reached */
Assert(false);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/*
--
2.39.5 (Apple Git-154)
v9-0003-Improve-error-messages-for-errors-that-cause-clie.patchtext/plain; charset=UTF-8; name=v9-0003-Improve-error-messages-for-errors-that-cause-clie.patchDownload
From e83f635c5dbafa6d7c2de6c2b9a111b4fd906e55 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Thu, 10 Jul 2025 17:21:05 +0900
Subject: [PATCH v9 3/3] Improve error messages for errors that cause client
abortion
This commit modifies relevant error messages to explicitly indicate that the
client was aborted. As part of this change, pg_log_error was replaced with
commandFailed().
---
src/bin/pgbench/pgbench.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 7dbeb79ca8d..3e855c1b0aa 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3309,8 +3309,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
if (is_last && meta == META_GSET)
{
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3324,8 +3323,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, PQntuples(res));
+ commandFailed(st, "gset", psprintf("expected one row, got %d", PQntuples(res)));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3339,18 +3337,18 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
for (int fld = 0; fld < PQnfields(res); fld++)
{
char *varname = PQfname(res, fld);
+ char *cmd = (meta == META_ASET ? "aset" : "gset");
/* allocate varname only if necessary, freed below */
if (*varprefix != '\0')
varname = psprintf("%s%s", varprefix, varname);
/* store last row result as a string */
- if (!putVariable(&st->variables, meta == META_ASET ? "aset" : "gset", varname,
+ if (!putVariable(&st->variables, cmd, varname,
PQgetvalue(res, ntuples - 1, fld)))
{
/* internal error */
- pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
- st->id, st->use_file, st->command, qrynum, varname);
+ commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3385,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
+ pg_log_error("client %d aborted in command %d query %d of script %d: %s",
+ st->id, st->command, qrynum, st->use_file,
PQerrorMessage(st->con));
goto error;
}
--
2.39.5 (Apple Git-154)
On Wed, 16 Jul 2025 21:35:01 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
Hi,
On 2025/07/15 11:16, Yugo Nagata wrote:
I noticed one small thing I’d like to discuss. I'm not sure that users can clearly
tell which one was aborted in the following error message, the client or the script.

pgbench: error: client 0 script 0 aborted in command ... query ...

Since the code path always results in a client abort, I wonder if the following
message might be clearer:

pgbench: error: client 0 aborted in script 0 command ... query ...

Indeed, it seems clearer to explicitly state that it is the client that
was aborted.

I've attached an updated patch that replaces the remaining message mentioned
above with a call to commandFailed(). With this change, the output in such
situations will now be:

"client 0 aborted in command 0 (SQL) of script 0; ...."
Thank you for updating the patch!
When I executed a custom script that may raise a unique constraint violation, I
got the following output:

pgbench: error: client 0 script 0 aborted in command 1 query 0: ERROR:
duplicate key value violates unique constraint "test_col2_key"
I'm sorry. I must have failed to attach the correct patch in my previous post.
As a result, patch v8 was actually the same as v7, and the message in question
was not modified as intended.
I think we should also change the error message in pg_log_error. I modified the
patch v8-0003 as follows:
```
@@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
             default:
                 /* anything else is unexpected */
-                pg_log_error("client %d script %d aborted in command %d query %d: %s",
-                             st->id, st->use_file, st->command, qrynum,
+                pg_log_error("client %d aborted in command %d query %d of script %d: %s",
+                             st->id, st->command, qrynum, st->use_file,
                              PQerrorMessage(st->con));
                 goto error;
         }
```

With this change, the output now is like this:
pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
duplicate key value violates unique constraint "test_col2_key"
I want to hear your thoughts.
My idea is to modify this as follows;
```
             default:
                 /* anything else is unexpected */
-                pg_log_error("client %d script %d aborted in command %d query %d: %s",
-                             st->id, st->use_file, st->command, qrynum,
-                             PQerrorMessage(st->con));
+                commandFailed(st, "SQL", PQerrorMessage(st->con));
                 goto error;
         }
```
This fix was originally planned to be included in patch v8, but was missed.
It is now included in the attached patch, v10.
With this change, the output becomes:
pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
ERROR: duplicate key value violates unique constraint "t2_pkey"
Although there is a slight difference, the message is essentially the same as
your proposal. Also, I believe the use of commandFailed() makes the code simpler
and more consistent.
What do you think?
Also, let me ask one question. In this case, I directly modified your commit in
the v8-0003 patch. Is that the right way to update the patch?
I’m not sure if that’s the best way, but I think modifying the patch directly is a
valid way to propose an alternative approach during discussion, as long as the original
patch is respected. It can often help clarify suggestions.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v10-0003-Improve-error-messages-for-errors-that-cause-cli.patch (text/x-diff)
From 9b45e1a0d5a2efd9443002bd84e0f3b93e6a4332 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Thu, 10 Jul 2025 17:21:05 +0900
Subject: [PATCH v10 3/3] Improve error messages for errors that cause client
abortion
This commit modifies relevant error messages to explicitly indicate that the
client was aborted. As part of this change, pg_log_error was replaced with
commandFailed().
---
src/bin/pgbench/pgbench.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 7dbeb79ca8d..4124c7b341c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3309,8 +3309,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
if (is_last && meta == META_GSET)
{
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3324,8 +3323,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, PQntuples(res));
+ commandFailed(st, "gset", psprintf("expected one row, got %d", PQntuples(res)));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3339,18 +3337,18 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
for (int fld = 0; fld < PQnfields(res); fld++)
{
char *varname = PQfname(res, fld);
+ char *cmd = (meta == META_ASET ? "aset" : "gset");
/* allocate varname only if necessary, freed below */
if (*varprefix != '\0')
varname = psprintf("%s%s", varprefix, varname);
/* store last row result as a string */
- if (!putVariable(&st->variables, meta == META_ASET ? "aset" : "gset", varname,
+ if (!putVariable(&st->variables, cmd, varname,
PQgetvalue(res, ntuples - 1, fld)))
{
/* internal error */
- pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
- st->id, st->use_file, st->command, qrynum, varname);
+ commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3385,9 +3383,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ commandFailed(st, "SQL", PQerrorMessage(st->con));
goto error;
}
--
2.43.0
v10-0002-Rename-a-confusing-enumerator.patch (text/x-diff)
From 54ae59a76bd9b465f546c02ac248df14d82aa36c Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:50:36 +0900
Subject: [PATCH v10 2/3] Rename a confusing enumerator
Rename the confusing enumerator which may be mistakenly assumed to be related to
other_sql_errors
---
src/bin/pgbench/pgbench.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index edd8b01f794..7dbeb79ca8d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -485,7 +485,7 @@ typedef enum TStatus
TSTATUS_IDLE,
TSTATUS_IN_BLOCK,
TSTATUS_CONN_ERROR,
- TSTATUS_OTHER_ERROR,
+ TSTATUS_UNKNOWN_ERROR,
} TStatus;
/* Various random sequences are initialized from this one. */
@@ -3577,12 +3577,12 @@ getTransactionStatus(PGconn *con)
* not. Internal error which should never occur.
*/
pg_log_error("unexpected transaction status %d", tx_status);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/* not reached */
Assert(false);
- return TSTATUS_OTHER_ERROR;
+ return TSTATUS_UNKNOWN_ERROR;
}
/*
--
2.43.0
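As background for the later discussion of this rename: TStatus is derived from libpq's PQtransactionStatus(). A condensed sketch of that mapping, simplified from getTransactionStatus() in pgbench.c (error logging trimmed, so details differ):
```
#include <libpq-fe.h>

typedef enum TStatus
{
	TSTATUS_IDLE,
	TSTATUS_IN_BLOCK,
	TSTATUS_CONN_ERROR,
	TSTATUS_UNKNOWN_ERROR,
} TStatus;

/* Condensed sketch of the mapping behind TSTATUS_*; simplified, not the
 * verbatim pgbench.c function. */
TStatus
getTransactionStatusSketch(PGconn *con)
{
	switch (PQtransactionStatus(con))
	{
		case PQTRANS_IDLE:
			return TSTATUS_IDLE;
		case PQTRANS_INTRANS:
		case PQTRANS_INERROR:
			return TSTATUS_IN_BLOCK;
		case PQTRANS_UNKNOWN:
			/* in pgbench this usually indicates a bad connection */
			return TSTATUS_CONN_ERROR;
		default:
			/* anything else is an internal error */
			return TSTATUS_UNKNOWN_ERROR;
	}
}
```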
v10-0001-Add-continue-on-error-option.patch (text/x-diff)
From 7d948731260679b7dde6861a7176a0cf8cb2b8b9 Mon Sep 17 00:00:00 2001
From: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Date: Wed, 9 Jul 2025 23:36:37 +0900
Subject: [PATCH v10 1/3] Add --continue-on-error option
When the option is set, the client rolls back the failed transaction and starts
a new one when its transaction fails for reasons other than deadlock and
serialization failures.
---
doc/src/sgml/ref/pgbench.sgml | 71 +++++++++++++++-----
src/bin/pgbench/pgbench.c | 57 +++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 ++++++
3 files changed, 125 insertions(+), 25 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..15fcb45e223 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start a new transaction.
+ This option is useful when your custom script may raise errors for some
+ reason, such as a unique constraint violation. Without this option, the client is
+ aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted, even after the client has retried <option>--max-tries</option> times,
+ so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2431,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2660,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2839,9 +2871,11 @@ statement latencies in milliseconds, failures and retries:
<option>--exit-on-abort</option> is specified. Otherwise in the worst
case they only lead to the abortion of the failed client while other
clients continue their run (but some client errors are handled without
- an abortion of the client and reported separately, see below). Later in
- this section it is assumed that the discussed errors are only the
- direct client errors and they are not internal
+ an abortion of the client and reported separately, see below). When
+ <option>--continue-on-error</option> is specified, the client
+ continues to process new transactions even if it encounters an error.
+ Later in this section it is assumed that the discussed errors are only
+ the direct client errors and they are not internal
<application>pgbench</application> errors.
</para>
</listitem>
@@ -2851,14 +2885,17 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
+ without completing the last transaction. By default, if execution of an SQL
or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
- transaction is rolled back, which also includes setting the client variables
- as they were before the run of this transaction (it is assumed that one
- transaction script contains only one transaction; see
- <xref linkend="transactions-and-scripts"/> for more information).
+ the client is aborted. However, if the --continue-on-error option is specified,
+ the client does not abort and proceeds to the next transaction regardless of
+ the error. These cases are reported as "other failures" in the output.
+ In contrast, if an SQL command fails with serialization or deadlock errors, the
+ client is not aborted even without <option>--continue-on-error</option>.
+ Instead, the current transaction is rolled back, which also includes setting
+ the client variables as they were before the run of this transaction
+ (it is assumed that one transaction script contains only one transaction;
+ see <xref linkend="transactions-and-scripts"/> for more information).
Transactions with serialization or deadlock errors are repeated after
rollbacks until they complete successfully or reach the maximum
number of tries (specified by the <option>--max-tries</option> option) / the maximum
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..edd8b01f794 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,15 +402,23 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
*
+ * Without --continue-on-error:
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +448,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +783,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +968,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1482,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1532,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3356,7 +3375,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
@@ -4007,7 +4026,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4528,7 +4548,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4548,6 +4569,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4603,6 +4626,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4643,10 +4667,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6285,6 +6311,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6427,7 +6454,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to errors other than serialization or deadlock errors and
+ * --continue-on-error is not set.
*/
if (total_cnt <= 0)
return;
@@ -6443,6 +6471,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6546,6 +6577,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6705,6 +6740,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7058,6 +7094,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7413,6 +7453,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..8bb35dda5f7 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
+
+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 0/10\b},
+ qr{other failures: 10\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{
+ insert into unique_table values 0;
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.43.0
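To make the core behavioral change of 0001 easy to see at a glance: after an SQL error the client now moves to CSTATE_ERROR (roll back, then start a new transaction) rather than CSTATE_ABORTED whenever --continue-on-error is set. A reduced, self-contained sketch of that decision, with types simplified from pgbench.c:
```
#include <stdbool.h>
#include <stdio.h>

typedef enum
{
	ESTATUS_SERIALIZATION_ERROR,
	ESTATUS_DEADLOCK_ERROR,
	ESTATUS_OTHER_SQL_ERROR,
} EStatus;

typedef enum { CSTATE_ERROR, CSTATE_ABORTED } ClientState;

static bool continue_on_error = true;	/* set by --continue-on-error */

static bool
canRetryError(EStatus estatus)
{
	/* only serialization and deadlock failures are ever retried */
	return estatus == ESTATUS_SERIALIZATION_ERROR ||
		estatus == ESTATUS_DEADLOCK_ERROR;
}

static ClientState
nextStateAfterSQLError(EStatus estatus)
{
	/* with --continue-on-error, other SQL errors also roll back and start
	 * a new transaction instead of aborting the client */
	if ((estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
		canRetryError(estatus))
		return CSTATE_ERROR;
	return CSTATE_ABORTED;
}

int
main(void)
{
	printf("unique violation -> %s\n",
		   nextStateAfterSQLError(ESTATUS_OTHER_SQL_ERROR) == CSTATE_ERROR
		   ? "CSTATE_ERROR (new transaction)" : "CSTATE_ABORTED");
	return 0;
}
```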
Hi,
On 2025/07/16 22:49, Yugo Nagata wrote:
I think we should also change the error message in pg_log_error. I modified the
patch v8-0003 as follows:
@@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
+ pg_log_error("client %d aborted in command %d query %d of script %d: %s",
+ st->id, st->command, qrynum, st->use_file,
PQerrorMessage(st->con));
goto error;
}
With this change, the output now is like this:
pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
duplicate key value violates unique constraint "test_col2_key"
I want to hear your thoughts.
My idea is to modify this as follows:
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ commandFailed(st, "SQL", PQerrorMessage(st->con));
goto error;
}
This fix was originally planned to be included in patch v8, but was missed.
It is now included in the attached patch, v10.
With this change, the output becomes:
pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
ERROR: duplicate key value violates unique constraint "t2_pkey"Although there is a slight difference, the message is essentially the same as
your proposal. Also, I believe the use of commandFailed() makes the code simpler
and more consistent.What do you think?
Thank you for the new patch! I think Nagata-san's v10 patch is a clear
improvement over my v9 patch. I'm happy with the changes.
Also, let me ask one question. In this case, I directly modified your commit in
the v8-0003 patch. Is that the right way to update the patch?
I’m not sure if that’s the best way, but I think modifying the patch directly is a
valid way to propose an alternative approach during discussion, as long as the original
patch is respected. It can often help clarify suggestions.
I understand that. Thank you.
Regards,
Rintaro Ikeda
On Fri, 18 Jul 2025 17:07:53 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
Hi,
On 2025/07/16 22:49, Yugo Nagata wrote:
I think we should also change the error message in pg_log_error. I modified the
patch v8-0003 as follows:
@@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
+ pg_log_error("client %d aborted in command %d query %d of script %d: %s",
+ st->id, st->command, qrynum, st->use_file,
PQerrorMessage(st->con));
goto error;
}
With this change, the output now is like this:
pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
duplicate key value violates unique constraint "test_col2_key"
I want to hear your thoughts.
My idea is to modify this as follows:
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ commandFailed(st, "SQL", PQerrorMessage(st->con));
goto error;
}
This fix was originally planned to be included in patch v8, but was missed.
It is now included in the attached patch, v10.
With this change, the output becomes:
pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
ERROR: duplicate key value violates unique constraint "t2_pkey"Although there is a slight difference, the message is essentially the same as
your proposal. Also, I believe the use of commandFailed() makes the code simpler
and more consistent.What do you think?
Thank you for the new patch! I think Nagata-san's v10 patch is a clear
improvement over my v9 patch. I'm happy with the changes.
Thank you.
I believe the patches implement the expected behavior, include appropriate doc and test
modifications, and are in good shape overall, so if there are no objections,
I'll mark this as Ready-for-Committer.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Tue, 22 Jul 2025 17:49:49 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Fri, 18 Jul 2025 17:07:53 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
Hi,
On 2025/07/16 22:49, Yugo Nagata wrote:
I think we should also change the error message in pg_log_error. I modified the
patch v8-0003 as follows:
@@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
+ pg_log_error("client %d aborted in command %d query %d of script %d: %s",
+ st->id, st->command, qrynum, st->use_file,
PQerrorMessage(st->con));
goto error;
}
With this change, the output now is like this:
pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
duplicate key value violates unique constraint "test_col2_key"
I want to hear your thoughts.
My idea is to modify this as follows:
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ commandFailed(st, "SQL", PQerrorMessage(st->con));
goto error;
}
This fix was originally planned to be included in patch v8, but was missed.
It is now included in the attached patch, v10.
With this change, the output becomes:
pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
ERROR: duplicate key value violates unique constraint "t2_pkey"Although there is a slight difference, the message is essentially the same as
your proposal. Also, I believe the use of commandFailed() makes the code simpler
and more consistent.What do you think?
Thank you for the new patch! I think Nagata-san's v10 patch is a clear
improvement over my v9 patch. I'm happy with the changes.
Thank you.
I believe the patches implement the expected behavior, include appropriate doc and test
modifications, and are in good shape overall, so if there are no objections,
I'll mark this as Ready-for-Committer.
I've updated the CF status to Ready for Committer.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Jul 24, 2025 at 5:44 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I believe the patches implement the expected behavior, include appropriate doc and test
modifications, and are in good shape overall, so if there are no objections,
I'll mark this as Ready-for-Committer.
I've updated the CF status to Ready for Committer.
Thanks for working on it! As Matthias, Dilip, Srinath, and many others
pointed out, it would be a very nice and helpful addition to pgbench.
I've just used it out of necessity and it worked as advertised for me,
and it even adds a cool-looking "XXX failed" when used with the -P
progress meter:
progress: 1.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 3854 failed
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 3796 failed
-J.
On Tue, Sep 16, 2025 at 5:34 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
On Thu, Jul 24, 2025 at 5:44 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I believe the patches implement the expected behavior, include appropriate doc and test
modifications, and are in good shape overall, so if there are no objections,
I'll mark this as Ready-for-Committer.
I've updated the CF status to Ready for Committer.
Since this patch is marked as ready for committer, I've started reviewing it.
The patch basically looks good to me.
+ the client is aborted. However, if the --continue-on-error option is specified,
"--continue-on-error" should be enclosed in <option> tags.
+ without completing the last transaction. By default, if execution of an SQL
or meta command fails for reasons other than serialization or deadlock errors,
<snip>
+ the client is aborted. However, if the --continue-on-error option is specified,
+ the client does not abort and proceeds to the next transaction regardless of
+ the error. These cases are reported as "other failures" in the output.
This explanation can be read as if --continue-on-error allows the client to
proceed to the next transaction even when a meta command (not SQL) fails,
but that is not correct, right? If so, the description should be updated to
make it clear that only SQL errors are affected, while meta command failures
are not.
+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',
Isn't it better to also specify the -n option to skip the unnecessary VACUUM and
speed the test up?
+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{
Regarding the test file name, perhaps 001 would be a better prefix than 002,
since other tests in 001_pgbench_with_server.pl use 001 as the prefix.
+ insert into unique_table values 0;
This INSERT causes a syntax error. Was this intentional? If the intention was
to test unique constraint violations, it should instead be
INSERT INTO unique_table VALUES (0);.
To further improve the test, it might also be useful to mix successful and
failed transactions in the --continue-on-error case. For example,
the following change would result in one successful transaction and
nine failures:
-----------------------------
$node->safe_psql('postgres',
- 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
+ 'CREATE TABLE unique_table(i int unique);');
$node->pgbench(
'-t 10 --continue-on-error --failures-detailed',
0,
[
- qr{processed: 0/10\b},
- qr{other failures: 10\b}
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
-----------------------------
Regards,
--
Fujii Masao
On Thu, 18 Sep 2025 01:52:46 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Sep 16, 2025 at 5:34 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
On Thu, Jul 24, 2025 at 5:44 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I believe the patches implement the expected behavior, include appropriate doc and test
modifications, and are in good shape overall, so if there are no objections,
I'll mark this as Ready-for-Committer.
I've updated the CF status to Ready for Committer.
Since this patch is marked as ready for committer, I've started reviewing it.
The patch basically looks good to me.
+ the client is aborted. However, if the --continue-on-error option is specified,
"--continue-on-error" should be enclosed in <option> tags.
+1
+ without completing the last transaction. By default, if execution of an SQL
or meta command fails for reasons other than serialization or deadlock errors,
<snip>
+ the client is aborted. However, if the --continue-on-error option is specified,
+ the client does not abort and proceeds to the next transaction regardless of
+ the error. These cases are reported as "other failures" in the output.
This explanation can be read as if --continue-on-error allows the client to
proceed to the next transaction even when a meta command (not SQL) fails,
but that is not correct, right? If so, the description should be updated to
make it clear that only SQL errors are affected, while meta command failures
are not.
That makes sense. How about rewriting this like:
However, if the --continue-on-error option is specified and the error occurs in
an SQL command, the client does not abort and proceeds to the next
transaction regardless of the error. These cases are reported as "other failures"
in the output. Note that if the error occurs in a meta-command, the client will
still abort even when this option is specified.
+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',
Isn't it better to also specify the -n option to skip the unnecessary VACUUM and
speed the test up?
+1
+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{
Regarding the test file name, perhaps 001 would be a better prefix than 002,
since other tests in 001_pgbench_with_server.pl use 001 as the prefix.
Right. This filename is shown in the “transaction type:” field of the results
when the test fails, so it should be aligned with the test file name.
+ insert into unique_table values 0;
This INSERT causes a syntax error. Was this intentional? If the intention was
to test unique constraint violations, it should instead be
INSERT INTO unique_table VALUES (0);.
This was clearly unintentional. I happened to overlook it during my review.
To further improve the test, it might also be useful to mix successful and
failed transactions in the --continue-on-error case. For example,
the following change would result in one successful transaction and
nine failures:
-----------------------------
$node->safe_psql('postgres',
- 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
+ 'CREATE TABLE unique_table(i int unique);');
$node->pgbench(
'-t 10 --continue-on-error --failures-detailed',
0,
[
- qr{processed: 0/10\b},
- qr{other failures: 10\b}
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
-----------------------------
+1
This makes the purpose of the test clearer.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
That makes sense. How about rewriting this like:
However, if the --continue-on-error option is specified and the error occurs in
an SQL command, the client does not abort and proceeds to the next
transaction regardless of the error. These cases are reported as "other failures"
in the output. Note that if the error occurs in a meta-command, the client will
still abort even when this option is specified.
How about phrasing it like this, based on your version?
----------------------------
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
without completing the last transaction. The client also aborts
if a meta-command fails, or if an SQL command fails for reasons other than
serialization or deadlock errors when --continue-on-error is not specified.
With --continue-on-error, the client does not abort on such SQL errors
and instead proceeds to the next transaction. These cases are reported
as "other failures" in the output. If the error occurs in a meta-command,
however, the client still aborts even when this option is specified.
----------------------------
Regards,
--
Fujii Masao
On Thu, 18 Sep 2025 14:37:29 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
That makes sense. How about rewriting this like:
However, if the --continue-on-error option is specified and the error occurs in
an SQL command, the client does not abort and proceeds to the next
transaction regardless of the error. These cases are reported as "other failures"
in the output. Note that if the error occurs in a meta-command, the client will
still abort even when this option is specified.
How about phrasing it like this, based on your version?
----------------------------
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
without completing the last transaction. The client also aborts
if a meta-command fails, or if an SQL command fails for reasons other than
serialization or deadlock errors when --continue-on-error is not specified.
With --continue-on-error, the client does not abort on such SQL errors
and instead proceeds to the next transaction. These cases are reported
as "other failures" in the output. If the error occurs in a meta-command,
however, the client still aborts even when this option is specified.
----------------------------
I'm fine with that. This version is clearer.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Thu, 18 Sep 2025 14:37:29 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
That makes sense. How about rewriting this like:
However, if the --continue-on-error option is specified and the error occurs in
an SQL command, the client does not abort and proceeds to the next
transaction regardless of the error. These cases are reported as "other failures"
in the output. Note that if the error occurs in a meta-command, the client will
still abort even when this option is specified.
How about phrasing it like this, based on your version?
----------------------------
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
without completing the last transaction. The client also aborts
if a meta-command fails, or if an SQL command fails for reasons other than
serialization or deadlock errors when --continue-on-error is not specified.
With --continue-on-error, the client does not abort on such SQL errors
and instead proceeds to the next transaction. These cases are reported
as "other failures" in the output. If the error occurs in a meta-command,
however, the client still aborts even when this option is specified.
----------------------------
I'm fine with that. This version is clearer.
Thanks for checking!
Also, I'd like to share the review comments for 0002 and 0003.
Regarding 0002:
- TSTATUS_OTHER_ERROR,
+ TSTATUS_UNKNOWN_ERROR,
You did this rename to avoid confusion with other_sql_errors.
I see the intention, but I'm not sure if this concern is really valid
and if the rename adds much value. Also, TSTATUS_UNKNOWN_ERROR
might be mistakenly assumed to be related to PQTRANS_UNKNOWN,
even though they aren't related...
But if we agree with this change, I think it should be folded into 0001,
since there's no strong reason to keep it separate.
Regarding 0003:
- pg_log_error("client %d script %d command %d query %d: expected one
row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
The change to use commandFailed() seems to remove
the "query %d" detail that the current pg_log_error() call reports.
Is it OK to lose that information?
Regards,
--
Fujii Masao
On Fri, Sep 19, 2025 at 11:43 AM Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Thu, 18 Sep 2025 14:37:29 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
That makes sense. How about rewriting this like:
However, if the --continue-on-error option is specified and the error occurs in
an SQL command, the client does not abort and proceeds to the next
transaction regardless of the error. These cases are reported as "other failures"
in the output. Note that if the error occurs in a meta-command, the client will
still abort even when this option is specified.
How about phrasing it like this, based on your version?
----------------------------
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
without completing the last transaction. The client also aborts
if a meta-command fails, or if an SQL command fails for reasons other than
serialization or deadlock errors when --continue-on-error is not specified.
With --continue-on-error, the client does not abort on such SQL errors
and instead proceeds to the next transaction. These cases are reported
as "other failures" in the output. If the error occurs in a meta-command,
however, the client still aborts even when this option is specified.
----------------------------
I'm fine with that. This version is clearer.
Thanks for checking!
I've updated the 0001 patch based on the comments.
The revised version is attached.
While testing, I found that running pgbench with --continue-on-error and
pipeline mode triggers the following assertion failure. Could this be
a bug in the patch?
---------------------------------------------------
$ cat pipeline.pgbench
\startpipeline
DO $$
BEGIN
PERFORM pg_sleep(3);
PERFORM pg_terminate_backend(pg_backend_pid());
END $$;
\endpipeline
$ pgbench -n --debug --verbose-errors -f pipeline.pgbench -c 2 -t 4 -M extended --continue-on-error
...
Assertion failed:
(sql_script[st->use_file].commands[st->command]->type == 1), function
commandError, file pgbench.c, line 3081.
Abort trap: 6
---------------------------------------------------
When I ran the same command without --continue-on-error,
the assertion failure did not occur.
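One plausible reading of this failure, inferred only from the v11 hunks and the reproducer above (an assumption, not a verified diagnosis): with --continue-on-error, the PGRES_FATAL_ERROR branch now reaches the --verbose-errors reporting call for non-retryable errors too, and in pipeline mode that happens while the current command is the \endpipeline meta-command, which commandError() appears to assert against. A reduced, self-contained sketch of that shape:
```
/*
 * Sketch of the *suspected* failure path; an assumption based on the
 * reproducer above, not a confirmed diagnosis.
 */
#include <assert.h>
#include <stdbool.h>

typedef enum { SQL_COMMAND, META_COMMAND } CommandType;

static bool continue_on_error = true;	/* --continue-on-error */
static bool verbose_errors = true;	/* --verbose-errors */

static void
reportVerboseError(CommandType current)
{
	/* stand-in for commandError(), which asserts an SQL command */
	assert(current == SQL_COMMAND);
}

static void
onFatalSQLError(CommandType current, bool retryable)
{
	if (continue_on_error || retryable)
	{
		if (verbose_errors)
			reportVerboseError(current);	/* fires for a meta-command */
	}
}

int
main(void)
{
	/* pipeline mode: the error is consumed while processing \endpipeline */
	onFatalSQLError(META_COMMAND, false);	/* aborts via assert, as reported */
	return 0;
}
```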
Regards,
--
Fujii Masao
Attachments:
v11-0001-Add-continue-on-error-option.patch (application/octet-stream)
From 85febac195673e375f4847815261f36adcc6b860 Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Fri, 19 Sep 2025 16:54:49 +0900
Subject: [PATCH v11] Add --continue-on-error option
When the option is set, the client rolls back the failed transaction and starts
a new one when its transaction fails for reasons other than deadlock and
serialization failures.
---
doc/src/sgml/ref/pgbench.sgml | 64 ++++++++++++++++----
src/bin/pgbench/pgbench.c | 57 ++++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +++++++
3 files changed, 124 insertions(+), 19 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..828ce0d90cf 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start a new transaction.
+ This option is useful when your custom script may raise errors for some
+ reason, such as a unique constraint violation. Without this option, the client is
+ aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted, even after the client has retried <option>--max-tries</option> times,
+ so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2431,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2660,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2851,10 +2883,20 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
- or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
+ without completing the last transaction. The client also aborts
+ if a meta command fails, or if an SQL command fails for reasons other than
+ serialization or deadlock errors when <option>--continue-on-error</option>
+ is not specified. With <option>--continue-on-error</option>,
+ the client does not abort on such SQL errors and instead proceeds to
+ the next transaction. These cases are reported as
+ <literal>other failures</literal> in the output. If the error occurs
+ in a meta command, however, the client still aborts even when this option
+ is specified.
+ </para>
+ <para>
+ If an SQL command fails due to serialization or deadlock errors, the
+ client is not aborted, regardless of whether
+ <option>--continue-on-error</option> is used. Instead, the current
transaction is rolled back, which also includes setting the client variables
as they were before the run of this transaction (it is assumed that one
transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3cafd88ac53..c6e0444d6e2 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,15 +402,23 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
*
+ * Without --continue-on-error:
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +448,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +783,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +968,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1482,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1532,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3356,7 +3375,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
@@ -4020,7 +4039,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4541,7 +4561,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4561,6 +4582,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4616,6 +4639,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4656,10 +4680,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6298,6 +6324,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6440,7 +6467,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to errors other than serialization or deadlock errors and
+ * --continue-on-error is not set.
*/
if (total_cnt <= 0)
return;
@@ -6456,6 +6484,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6559,6 +6590,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6718,6 +6753,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7071,6 +7107,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7426,6 +7466,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..3c19a36a005 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+ '-n -t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '001_continue_on_error' => q{
+ INSERT INTO unique_table VALUES(0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.50.1
On Fri, 19 Sep 2025 11:43:28 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Thu, 18 Sep 2025 14:37:29 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
That makes sense. How about rewriting this like:
However, if the --continue-on-error option is specified and the error occurs in
an SQL command, the client does not abort and proceeds to the next
transaction regardless of the error. These cases are reported as "other failures"
in the output. Note that if the error occurs in a meta-command, the client will
still abort even when this option is specified.
How about phrasing it like this, based on your version?
----------------------------
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
without completing the last transaction. The client also aborts
if a meta-command fails, or if an SQL command fails for reasons other than
serialization or deadlock errors when --continue-on-error is not specified.
With --continue-on-error, the client does not abort on such SQL errors
and instead proceeds to the next transaction. These cases are reported
as "other failures" in the output. If the error occurs in a meta-command,
however, the client still aborts even when this option is specified.
----------------------------
I'm fine with that. This version is clearer.
Thanks for checking!
Also, I'd like to share the review comments for 0002 and 0003.
Regarding 0002:
- TSTATUS_OTHER_ERROR,
+ TSTATUS_UNKNOWN_ERROR,
You did this rename to avoid confusion with other_sql_errors.
I see the intention, but I'm not sure if this concern is really valid
and if the rename adds much value. Also, TSTATUS_UNKNOWN_ERROR
might be mistakenly assumed to be related to PQTRANS_UNKNOWN,
even though they aren't related...
I don’t have a strong opinion on this, but I think TSTATUS_* is slightly
related to PQTRANS_*, since getTransactionStatus() determines the TSTATUS
value based on PQTRANS. There is no one-to-one relationship, of course,
but it is more related than ESTATUS_OTHER_SQL_ERROR, which is entirely
separate.
But if we agree with this change, I think it should be folded into 0001,
since there's no strong reason to keep it separate.
+1
I personally don't mind omitting this change, but I would like to wait
for Ikeda-san's response because he is the author of these two patches.
Regarding 0003:
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d", - st->id, st->use_file, st->command, qrynum, 0); + commandFailed(st, "gset", psprintf("expected one row, got %d", 0));The change to use commandFailed() seems to remove
the "query %d" detail that the current pg_log_error() call reports.
Is it OK to lose that information?
"qrynum" is the index of SQL queries combined by "\;", but reporting it
in \gset errors is almost useless, since \gset can only be applied to the
last query of a compound query. So I think it’s fine to commit this.
That said, it might still be useful for debugging when an internal error like
the following occurs (mainly for developers rather than users):
/* internal error */
commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
For that case, I’d be fine with adding information like this:
/* internal error */
commandFailed(st, cmd, psprintf("error storing into variable %s, at query %d", varname, qrynum));
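To illustrate the earlier point about compound queries, a hypothetical script:
```
-- two queries combined with \; are sent as one compound command;
-- \gset applies only to the last sub-query, so only :two is set
SELECT 1 \; SELECT 2 AS two \gset
```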
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Fri, 19 Sep 2025 19:21:29 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
[...]
I've updated the 0001 patch based on the comments.
The revised version is attached.
Thank you for updating the patch.
While testing, I found that running pgbench with --continue-on-error and
pipeline mode triggers the following assertion failure. Could this be
a bug in the patch?
---------------------------------------------------
$ cat pipeline.pgbench
\startpipeline
DO $$
BEGIN
PERFORM pg_sleep(3);
PERFORM pg_terminate_backend(pg_backend_pid());
END $$;
\endpipeline

$ pgbench -n --debug --verbose-errors -f pipeline.pgbench -c 2 -t 4 -M
extended --continue-on-error
...
Assertion failed:
(sql_script[st->use_file].commands[st->command]->type == 1), function
commandError, file pgbench.c, line 3081.
Abort trap: 6
---------------------------------------------------
When I ran the same command without --continue-on-error,
the assertion failure did not occur.
I think this bug was introduced by commit 4a39f87acd6e, which enabled pgbench
to retry and added the --verbose-errors option, rather than by this patch itself.
The assertion failure occurs in commandError(), which is called to report an error
when it can be retried (i.e., a serialization failure or deadlock), or, after this
patch, when --continue-on-error is used.
Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
This assumes the error is always detected during SQL command execution, but
that’s not correct, since in pipeline mode, the error can be detected when
a \endpipeline meta-command is executed.
$ cat deadlock.sql
\startpipeline
begin;
lock b;
lock a;
end;
\endpipeline
$ cat deadlock2.sql
\startpipeline
begin;
lock a;
lock b;
end;
\endpipeline
$ pgbench --verbose-errors -f deadlock.sql -f deadlock2.sql -c 2 -T 3 -M extended
pgbench (19devel)
starting vacuum...end.
pgbench: pgbench.c:3062: commandError: Assertion `sql_script[st->use_file].commands[st->command]->type == 1' failed.
Although one option would be to remove this assertion, if we prefer to keep it,
the attached patch fixes the issue.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
fix_pgbench_assertion_failure_in_pipeline.patch.txt (text/plain)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3cafd88ac53..35e17939190 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3059,7 +3059,9 @@ commandFailed(CState *st, const char *cmd, const char *message)
static void
commandError(CState *st, const char *message)
{
- Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
+ Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND ||
+ sql_script[st->use_file].commands[st->command]->meta == META_ENDPIPELINE);
+
pg_log_info("client %d got an error in command %d (SQL) of script %d; %s",
st->id, st->command, st->use_file, message);
}
Thank you for reviewing the patches.
On 2025/09/19 20:56, Yugo Nagata wrote:
A client's run is aborted in case of a serious error; [...]
I'm fine with that. This version is clearer.
I also agree with this.
Also I'd like to share the review comments for 0002 and 0003. [...]
The points you both raise make sense to me.
Changing the macro name is not important for the purpose of the patch, so I now
feel it would be reasonable to drop patch 0002.
Regards,
Rintaro Ikeda
On Sat, Sep 20, 2025 at 12:21 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
[...]
Although one option would be to remove this assertion, if we prefer to keep it,
the attached patch fixes the issue.
Thanks for the analysis and the patch!
I think we should fix the issue rather than just removing the assertion.
I'd like to apply your patch with the following source comment:
---------------------------
Errors should only be detected during an SQL command or the \endpipeline
meta command. Any other case triggers an assertion failure.
--------------------------
With your patch and the continue-on-error patches, running the same pgbench
command I used to reproduce the assertion failure upthread causes pgbench
to hang. From my analysis, it enters an infinite loop in discardUntilSync().
That loop waits for PGRES_PIPELINE_SYNC, but since the connection has already
been closed, it never arrives, leaving pgbench stuck.
Could this also happen without the continue-on-error patch, or is it a new bug
introduced by it? Either way, it seems pgbench needs to exit the loop when
the result status is PGRES_FATAL_ERROR.
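To sketch the loop-exit behavior being suggested (simplified; the real
discardUntilSync() also tracks st->num_syncs and the trailing NULL result):
```
#include <libpq-fe.h>

/* Discard pipeline results until a sync arrives, but stop on a fatal
 * error: on a dead connection PGRES_PIPELINE_SYNC never arrives, so
 * waiting for it would loop forever. */
static void
discard_until_sync_sketch(PGconn *con)
{
	PGresult   *res;

	while ((res = PQgetResult(con)) != NULL)
	{
		ExecStatusType s = PQresultStatus(res);

		PQclear(res);
		if (s == PGRES_PIPELINE_SYNC)
			break;		/* sync seen; earlier results discarded */
		if (s == PGRES_FATAL_ERROR)
			break;		/* e.g. connection failure; stop waiting */
	}
	/* may also fail if the connection is gone; ignored in this sketch */
	(void) PQexitPipelineMode(con);
}
```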
Regards,
--
Fujii Masao
On Sat, Sep 20, 2025 at 9:58 PM Rintaro Ikeda
<ikedarintarof@oss.nttdata.com> wrote:
The points you both raise make sense to me.
Changing the macro name is not important for the purpose of the patch, so I now
feel it would be reasonable to drop patch 0002.
Thanks for your thoughts! So let's focus on the 0001 patch for now.
Regards,
--
Fujii Masao
Hi,
On 2025/09/22 11:56, Fujii Masao wrote:
[...]
Could this also happen without the continue-on-error patch, or is it a new bug
introduced by it? Either way, it seems pgbench needs to exit the loop when
the result status is PGRES_FATAL_ERROR.
Thank you for the analysis and the patches.
I think the issue is a new bug because we have transitioned to CSTATE_ABORT
immediately after queries failed, without executing discardUntilSync().
I've attached a patch that fixes the assertion error. The content of the v1 patch
by Mr. Nagata is also included. I would appreciate it if you review my patch.
Regards,
Rintaro Ikeda
Attachments:
v2_fix_pgbench_fix_assertion_in_pipeline.patch.txt (text/plain; charset=UTF-8)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 6e9304e254f..cd5faf3370a 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3078,7 +3078,13 @@ commandFailed(CState *st, const char *cmd, const char *message)
static void
commandError(CState *st, const char *message)
{
- Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
+ /*
+ Errors should only be detected during an SQL command or the \endpipeline
+ meta command. Any other case triggers an assertion failure.
+ */
+ Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND ||
+ sql_script[st->use_file].commands[st->command]->meta == META_ENDPIPELINE);
+
pg_log_info("client %d got an error in command %d (SQL) of script %d; %s",
st->id, st->command, st->use_file, message);
}
@@ -3525,9 +3531,7 @@ discardUntilSync(CState *st)
{
PGresult *res = PQgetResult(st->con);
- if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
- received_sync = true;
- else if (received_sync)
+ if (received_sync == true)
{
/*
* PGRES_PIPELINE_SYNC must be followed by another
@@ -3541,11 +3545,23 @@ discardUntilSync(CState *st)
*/
st->num_syncs = 0;
PQclear(res);
- break;
+ goto done;
}
- PQclear(res);
+
+ switch (PQresultStatus(res))
+ {
+ case PGRES_PIPELINE_SYNC:
+ received_sync = true;
+ case PGRES_FATAL_ERROR:
+ PQclear(res);
+ goto done;
+ default:
+ PQclear(res);
+ }
+
}
+done:
/* exit pipeline */
if (PQexitPipelineMode(st->con) != 1)
{
On Tue, Sep 23, 2025 at 11:58 AM Rintaro Ikeda
<ikedarintarof@oss.nttdata.com> wrote:
I think the issue is a new bug because we have transitioned to CSTATE_ABORT
immediately after queries failed, without executing discardUntilSync().
If so, the fix in discardUntilSync() should be part of the continue-on-error
patch. The assertion failure fix should be a separate patch, since only
that needs to be backpatched (the failure can also occur in back branches).
I've attached a patch that fixes the assertion error. The content of v1 patch by
Mr. Nagata is also included. I would appreciate it if you review my patch.
+ if (received_sync == true)
For boolean flags, we usually just use the variable itself instead of
"== true/false".
+ switch (PQresultStatus(res))
+ {
+ case PGRES_PIPELINE_SYNC:
+ received_sync = true;
In the PGRES_PIPELINE_SYNC case, PQclear(res) isn't called but should be.
+ case PGRES_FATAL_ERROR:
+ PQclear(res);
+ goto done;
I don't think we need goto here. How about this instead?
-----------------------
@@ -3525,11 +3525,18 @@ discardUntilSync(CState *st)
* results have been discarded.
*/
st->num_syncs = 0;
- PQclear(res);
break;
}
+ /*
+ * Stop receiving further results if PGRES_FATAL_ERROR is returned
+ * (e.g., due to a connection failure) before PGRES_PIPELINE_SYNC,
+ * since PGRES_PIPELINE_SYNC will never arrive.
+ */
+ else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ break;
PQclear(res);
}
+ PQclear(res);
/* exit pipeline */
if (PQexitPipelineMode(st->con) != 1)
-----------------------
Regards,
--
Fujii Masao
On Thu, 25 Sep 2025 02:19:27 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Sep 23, 2025 at 11:58 AM Rintaro Ikeda
<ikedarintarof@oss.nttdata.com> wrote:
I think the issue is a new bug because we have transitioned to CSTATE_ABORT
immediately after queries failed, without executing discardUntilSync().
Agreed.
If so, the fix in discardUntilSync() should be part of the continue-on-error
patch. The assertion failure fix should be a separate patch, since only
that needs to be backpatched (the failure can also occur in back branches).
+1
I've attached a patch that fixes the assertion error. The content of the v1 patch
by Mr. Nagata is also included. I would appreciate it if you review my patch.
In the PGRES_PIPELINE_SYNC case, PQclear(res) isn't called but should be.
I don't think we need goto here. How about this instead? [...]
I think Fujii-san's version is better because Ikeda-san's version doesn't
consider the case where PGRES_PIPELINE_SYNC is followed by another one.
In that situation, the loop would terminate without getting NULL, which would
cause an error in PQexitPipelineMode().
However, I would like to suggest an alternative solution: checking the connection
status when readCommandResponse() returns false. This seems more straightforward,
since the cause of the error can be investigated immediately after it is detected.
@@ -4024,7 +4043,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (PQstatus(st->con) == CONNECTION_BAD)
+ st->state = CSTATE_ABORTED;
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
What do you think?
Additionally, I noticed that in pipeline mode, the error message reported in
readCommandResponse() is lost, because it is reset when PQgetResult() returns
NULL to indicate the end of query processing. For example:
pgbench: client 0 got an error in command 3 (SQL) of script 0;
pgbench: client 1 got an error in command 3 (SQL) of script 0;
This can be fixed by saving the previous error message and using it
for the report. After the fix:
pgbench: client 0 got an error in command 3 (SQL) of script 0; FATAL: terminating connection due to administrator command
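The core of the fix, as a simplified sketch with hypothetical names (the
actual patch does this inside readCommandResponse() using pg_strdup/pg_free):
```
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

/* Copy the connection's error text *before* the look-ahead PQgetResult()
 * call, which can reset it when it returns NULL at the end of processing;
 * the caller reports the saved copy and then free()s it. */
static char *
save_error_then_peek(PGconn *con, PGresult **next_res, bool *is_last)
{
	char	   *errmsg = strdup(PQerrorMessage(con));	/* save first */

	*next_res = PQgetResult(con);	/* may clear the error message */
	*is_last = (*next_res == NULL);
	return errmsg;
}
```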
I've attached updated patches.
0001 fixes the assertion failure in commandError() and the lost error message
in readCommandResponse().
0002 was the previous 0001 that adds --continue-on-error, including the
fix to handle connection termination errors.
0003 is for improving error messages for errors that cause client abortion.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v12-0003-Improve-error-messages-for-errors-that-cause-cli.patch (text/x-diff)
From e6b4022ec06f97a1ed100de9aca9eebd5fd4bc02 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Thu, 10 Jul 2025 17:21:05 +0900
Subject: [PATCH v12 3/3] Improve error messages for errors that cause client
abortion
This commit modifies relevant error messages to explicitly indicate that the
client was aborted. As part of this change, pg_log_error was replaced with
commandFailed().
---
src/bin/pgbench/pgbench.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 36d15c95f3e..43450b4b54a 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3318,8 +3318,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
if (is_last && meta == META_GSET)
{
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3333,8 +3332,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, PQntuples(res));
+ commandFailed(st, "gset", psprintf("expected one row, got %d", PQntuples(res)));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3348,18 +3346,18 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
for (int fld = 0; fld < PQnfields(res); fld++)
{
char *varname = PQfname(res, fld);
+ char *cmd = (meta == META_ASET ? "aset" : "gset");
/* allocate varname only if necessary, freed below */
if (*varprefix != '\0')
varname = psprintf("%s%s", varprefix, varname);
/* store last row result as a string */
- if (!putVariable(&st->variables, meta == META_ASET ? "aset" : "gset", varname,
+ if (!putVariable(&st->variables, cmd, varname,
PQgetvalue(res, ntuples - 1, fld)))
{
/* internal error */
- pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
- st->id, st->use_file, st->command, qrynum, varname);
+ commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3394,8 +3392,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum, errmsg);
+ commandFailed(st, "SQL", errmsg);
goto error;
}
--
2.43.0
v12-0002-Add-continue-on-error-option.patch (text/x-diff)
From b8f05f605176232bf0aa1eaf8a2783c17059a39a Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Fri, 19 Sep 2025 16:54:49 +0900
Subject: [PATCH v12 2/3] Add --continue-on-error option
When the option is set, the client rolls back the failed transaction and starts
a new one when its transaction fails for a reason other than a deadlock or
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 64 ++++++++++++++++----
src/bin/pgbench/pgbench.c | 59 +++++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +++++++
3 files changed, 126 insertions(+), 19 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..828ce0d90cf 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start new ones.
+ This option is useful when your custom script may raise errors due to some
+ reason such as a unique constraint violation. Without this option, the client is
+ aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted, even after the client retries <option>--max-tries</option> times,
+ so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2431,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2660,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2851,10 +2883,20 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
- or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
+ without completing the last transaction. The client also aborts
+ if a meta command fails, or if an SQL command fails for reasons other than
+ serialization or deadlock errors when <option>--continue-on-error</option>
+ is not specified. With <option>--continue-on-error</option>,
+ the client does not abort on such SQL errors and instead proceeds to
+ the next transaction. These cases are reported as
+ <literal>other failures</literal> in the output. If the error occurs
+ in a meta command, however, the client still aborts even when this option
+ is specified.
+ </para>
+ <para>
+ If an SQL command fails due to serialization or deadlock errors, the
+ client is not aborted, regardless of whether
+ <option>--continue-on-error</option> is used. Instead, the current
transaction is rolled back, which also includes setting the client variables
as they were before the run of this transaction (it is assumed that one
transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f25a2e20e70..36d15c95f3e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,15 +402,23 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
*
+ * Without --continue-on-error:
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +448,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +783,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +968,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1482,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1532,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3365,7 +3384,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, errmsg);
@@ -4029,7 +4048,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (PQstatus(st->con) == CONNECTION_BAD)
+ st->state = CSTATE_ABORTED;
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4550,7 +4572,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4570,6 +4593,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4625,6 +4650,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4665,10 +4691,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6307,6 +6335,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6449,7 +6478,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to errors other than serialization or deadlock errors and
+ * --continue-on-error is not set.
*/
if (total_cnt <= 0)
return;
@@ -6465,6 +6495,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6568,6 +6601,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6727,6 +6764,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7080,6 +7118,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7435,6 +7477,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..3c19a36a005 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+ '-n -t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '001_continue_on_error' => q{
+ INSERT INTO unique_table VALUES(0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.43.0
v12-0001-Fix-assertion-failure-and-verbose-messages-in-pi.patch (text/x-diff)
From d141ea4422d76b59021e2c25ad378bfd12d97651 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Wed, 24 Sep 2025 22:23:25 +0900
Subject: [PATCH v12 1/3] Fix assertion failure and verbose messages in
pipeline mode
commandError() is called to report errors when they can be retried, and
it previously assumed that errors are always detected during SQL command
execution. However, in pipeline mode, an error may also be detected when
a \endpipeline meta-command is executed.
This caused an assertion failure. To fix this, the assertion now also
accepts errors detected at a \endpipeline meta-command.
Additionally, in pipeline mode, the error message reported in
readCommandResponse() was lost, because it was reset when PQgetResult()
returned NULL to indicate the end of query processing. To fix this, save
the previous error message and use it for reporting.
---
src/bin/pgbench/pgbench.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3cafd88ac53..f25a2e20e70 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3059,7 +3059,13 @@ commandFailed(CState *st, const char *cmd, const char *message)
static void
commandError(CState *st, const char *message)
{
- Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
+ /*
+ Errors should only be detected during an SQL command or the \endpipeline
+ meta command. Any other case triggers an assertion failure.
+ */
+ Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND ||
+ sql_script[st->use_file].commands[st->command]->meta == META_ENDPIPELINE);
+
pg_log_info("client %d got an error in command %d (SQL) of script %d; %s",
st->id, st->command, st->use_file, message);
}
@@ -3265,6 +3271,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
PGresult *res;
PGresult *next_res;
int qrynum = 0;
+ char *errmsg;
/*
* varprefix should be set only with \gset or \aset, and \endpipeline and
@@ -3280,6 +3287,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
{
bool is_last;
+ errmsg = pg_strdup(PQerrorMessage(st->con));
+
/* peek at the next result to know whether the current is last */
next_res = PQgetResult(st->con);
is_last = (next_res == NULL);
@@ -3349,7 +3358,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
st->num_syncs--;
if (st->num_syncs == 0 && PQexitPipelineMode(st->con) != 1)
pg_log_error("client %d failed to exit pipeline mode: %s", st->id,
- PQerrorMessage(st->con));
+ errmsg);
break;
case PGRES_NONFATAL_ERROR:
@@ -3359,7 +3368,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (canRetryError(st->estatus))
{
if (verbose_errors)
- commandError(st, PQerrorMessage(st->con));
+ commandError(st, errmsg);
goto error;
}
/* fall through */
@@ -3367,14 +3376,14 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ st->id, st->use_file, st->command, qrynum, errmsg);
goto error;
}
PQclear(res);
qrynum++;
res = next_res;
+ pg_free(errmsg);
}
if (qrynum == 0)
--
2.43.0
On Thu, 25 Sep 2025 11:09:40 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:
[...]
I think the patch 0001 should be back-patched, since the issue can occur
even for retries after serialization failures or deadlocks in pipeline mode.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Sep 25, 2025 at 11:17 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Thu, 25 Sep 2025 11:09:40 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:
[...]
I think the patch 0001 should be back-patched, since the issue can occur
even for retries after serialization failures or deadlocks in pipeline mode.
Agreed.
Regarding 0001:
+ /*
+ Errors should only be detected during an SQL command or the \endpipeline
+ meta command. Any other case triggers an assertion failure.
+ */
* should be added before "Errors" and "meta".
+ errmsg = pg_strdup(PQerrorMessage(st->con));
It would be good to add a comment explaining why we do this.
+ pg_free(errmsg);
Shouldn't pg_free() be called also in the error case, i.e., after
jumping to the error label?
Regards,
--
Fujii Masao
On Thu, 25 Sep 2025 13:49:05 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
[...]
Regarding 0001:
+ /*
+ Errors should only be detected during an SQL command or the \endpipeline
+ meta command. Any other case triggers an assertion failure.
+ */
* should be added before "Errors" and "meta".
Oops. Fixed.
+ errmsg = pg_strdup(PQerrorMessage(st->con));
It would be good to add a comment explaining why we do this.
+ pg_free(errmsg);
Shouldn't pg_free() be called also in the error case, i.e., after
jumping to the error label?
Yes, it should be.
Fixed.
I've attached updated patches.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v13-0003-Improve-error-messages-for-errors-that-cause-cli.patch (text/x-diff)
From 4e4edc2cc8d1bb565059c72836b026ecceee882f Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Thu, 10 Jul 2025 17:21:05 +0900
Subject: [PATCH v13 3/3] Improve error messages for errors that cause client
abortion
This commit modifies relevant error messages to explicitly indicate that the
client was aborted. As part of this change, pg_log_error was replaced with
commandFailed().
---
src/bin/pgbench/pgbench.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index ee288e19bd0..7d078de3457 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3318,8 +3318,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
if (is_last && meta == META_GSET)
{
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3333,8 +3332,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, PQntuples(res));
+ commandFailed(st, "gset", psprintf("expected one row, got %d", PQntuples(res)));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3348,18 +3346,18 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
for (int fld = 0; fld < PQnfields(res); fld++)
{
char *varname = PQfname(res, fld);
+ char *cmd = (meta == META_ASET ? "aset" : "gset");
/* allocate varname only if necessary, freed below */
if (*varprefix != '\0')
varname = psprintf("%s%s", varprefix, varname);
/* store last row result as a string */
- if (!putVariable(&st->variables, meta == META_ASET ? "aset" : "gset", varname,
+ if (!putVariable(&st->variables, cmd, varname,
PQgetvalue(res, ntuples - 1, fld)))
{
/* internal error */
- pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
- st->id, st->use_file, st->command, qrynum, varname);
+ commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3394,8 +3392,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum, errmsg);
+ commandFailed(st, "SQL", errmsg);
goto error;
}
--
2.43.0
v13-0002-Add-continue-on-error-option.patch (text/x-diff)
From e531d12a192a2db529a7844450e6af72d80e244b Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Fri, 19 Sep 2025 16:54:49 +0900
Subject: [PATCH v13 2/3] Add --continue-on-error option
When the option is set, the client rolls back the failed transaction and starts
a new one when its transaction fails for reasons other than deadlock and
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 64 ++++++++++++++++----
src/bin/pgbench/pgbench.c | 59 +++++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +++++++
3 files changed, 126 insertions(+), 19 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..828ce0d90cf 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start new transaction.
+ This option is useful when your custom script may raise errors due to some
+ reason like unique constraints violation. Without this option, the client is
+ aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted, even after the client retries <option>--max-tries</option> times,
+ so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2431,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2660,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got a SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2851,10 +2883,20 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
- or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
+ without completing the last transaction. The client also aborts
+ if a meta command fails, or if an SQL command fails for reasons other than
+ serialization or deadlock errors when <option>--continue-on-error</option>
+ is not specified. With <option>--continue-on-error</option>,
+ the client does not abort on such SQL errors and instead proceeds to
+ the next transaction. These cases are reported as
+ <literal>other failures</literal> in the output. If the error occurs
+ in a meta command, however, the client still aborts even when this option
+ is specified.
+ </para>
+ <para>
+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not aborted, regardless of whether
+ <option>--continue-on-error</option> is used. Instead, the current
transaction is rolled back, which also includes setting the client variables
as they were before the run of this transaction (it is assumed that one
transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index de00669f288..ee288e19bd0 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,15 +402,23 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
*
+ * Without --continue-on-error:
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +448,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +783,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +968,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1482,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1532,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3365,7 +3384,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, errmsg);
@@ -4030,7 +4049,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (PQstatus(st->con) == CONNECTION_BAD)
+ st->state = CSTATE_ABORTED;
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4551,7 +4573,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4571,6 +4594,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4626,6 +4651,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4666,10 +4692,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6308,6 +6336,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6450,7 +6479,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to errors other than serialization or deadlock and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -6466,6 +6496,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6569,6 +6602,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6728,6 +6765,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7081,6 +7119,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7436,6 +7478,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..3c19a36a005 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+ '-n -t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '001_continue_on_error' => q{
+ INSERT INTO unique_table VALUES(0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.43.0
v13-0001-Fix-assertion-failure-and-verbose-messages-in-pi.patchtext/x-diff; name=v13-0001-Fix-assertion-failure-and-verbose-messages-in-pi.patchDownload
From 84935ea888d8ef607af6afb521cee8d98a85d1db Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Wed, 24 Sep 2025 22:23:25 +0900
Subject: [PATCH v13 1/3] Fix assertion failure and verbose messages in
pipeline mode
commandError() is called to report errors when they can be retried, and
it previously assumed that errors are always detected during SQL command
execution. However, in pipeline mode, an error may also be detected when
a \endpipeline meta-command is executed.
This caused an assertion failure. To fix this, it is now assumed that
errors can also be detected in this case.
Additionally, in pipeline mode, the error message reported in
readCommandResponse() was lost, because it was reset when PQgetResult()
returned NULL to indicate the end of query processing. To fix this, save
the previous error message and use it for reporting.
---
src/bin/pgbench/pgbench.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3cafd88ac53..de00669f288 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3059,7 +3059,13 @@ commandFailed(CState *st, const char *cmd, const char *message)
static void
commandError(CState *st, const char *message)
{
- Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
+ /*
+ * Errors should only be detected during an SQL command or the \endpipeline
+ * meta command. Any other case triggers an assertion failure.
+ */
+ Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND ||
+ sql_script[st->use_file].commands[st->command]->meta == META_ENDPIPELINE);
+
pg_log_info("client %d got an error in command %d (SQL) of script %d; %s",
st->id, st->command, st->use_file, message);
}
@@ -3265,6 +3271,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
PGresult *res;
PGresult *next_res;
int qrynum = 0;
+ char *errmsg;
/*
* varprefix should be set only with \gset or \aset, and \endpipeline and
@@ -3280,6 +3287,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
{
bool is_last;
+ errmsg = pg_strdup(PQerrorMessage(st->con));
+
/* peek at the next result to know whether the current is last */
next_res = PQgetResult(st->con);
is_last = (next_res == NULL);
@@ -3349,7 +3358,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
st->num_syncs--;
if (st->num_syncs == 0 && PQexitPipelineMode(st->con) != 1)
pg_log_error("client %d failed to exit pipeline mode: %s", st->id,
- PQerrorMessage(st->con));
+ errmsg);
break;
case PGRES_NONFATAL_ERROR:
@@ -3359,7 +3368,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (canRetryError(st->estatus))
{
if (verbose_errors)
- commandError(st, PQerrorMessage(st->con));
+ commandError(st, errmsg);
goto error;
}
/* fall through */
@@ -3367,14 +3376,14 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ st->id, st->use_file, st->command, qrynum, errmsg);
goto error;
}
PQclear(res);
qrynum++;
res = next_res;
+ pg_free(errmsg);
}
if (qrynum == 0)
@@ -3388,6 +3397,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
error:
PQclear(res);
PQclear(next_res);
+ pg_free(errmsg);
do
{
res = PQgetResult(st->con);
--
2.43.0
Hi,
The patch looks good; I've spotted some typos in the doc.
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start new transaction.
Should be "but start a new transaction.", although "proceed to the
next transaction." may be clearer here that ?
+ number of transactions that got a SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
It seems like both "a SQL" and "an SQL" are used in the codebase and
doc, but this page only uses "an SQL", so using "an SQL" may be better
for consistency.
+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not aborted, regardless of whether
Should be "the client does not abort."
Regards,
Anthonin Bonnefoy
Hi Yugo,
Thanks for the patch. After reviewing it, I got a few small comments:
On Sep 25, 2025, at 15:22, Yugo Nagata <nagata@sraoss.co.jp> wrote:
--
Yugo Nagata <nagata@sraoss.co.jp>
<v13-0003-Improve-error-messages-for-errors-that-cause-cli.patch><v13-0002-Add-continue-on-error-option.patch><v13-0001-Fix-assertion-failure-and-verbose-messages-in-pi.patch>
1 - 0001
```
@@ -3265,6 +3271,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
PGresult *res;
PGresult *next_res;
int qrynum = 0;
+ char *errmsg;
```
I think we should initialize errmsg to NULL. The compiler won’t auto-initialize a local variable. If it happens to not enter the while loop, errmsg will hold a random value, and pg_free(errmsg) will have trouble.
2 - 0002
```
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start new transaction.
+ This option is useful when your custom script may raise errors due to some
+ reason like unique constraints violation. Without this option, the client is
+ aborted after such errors.
+ </para>
```
A few nit suggestions:
* “continue their run” => “continue running”
* “clients to not retry the same transactions but start new transaction” => “clients do not retry the same transaction but start a new transaction instead"
* “due to some reason like” => “for reasons such as"
3 - 0002
```
+ * Without --continue-on-error:
* failed (the number of failed transactions) =
```
Maybe add an empty line after the “without” line.
4 - 0002
```
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
```
Maybe change to “With --continue-on-error”, which sounds consistent with the previous “without”.
5 - 0002
```
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
```
How about renaming this variable to “sql_errors”, which reflects the new option name.
6 - 0002
```
@@ -4571,6 +4594,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other”;
```
I think this can just return “error”. I checked where this function is called; no other words such as “error” are appended.
7 - 0002
```
/* it can be non-zero only if max_tries is not equal to one */
@@ -6569,6 +6602,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
```
Do we only want to print this number when “--continue-on-error” is given?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Sep 25, 2025 at 4:22 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I've attached updated patches.
Thanks for updating the patches!
About 0001: you mentioned that the lost error message issue occurs in
pipeline mode.
Just to confirm, are you sure it never happens in non-pipeline mode?
From a quick look,
readCommandResponse() seems to have this problem regardless of whether pipeline
mode is used.
If it can also happen outside pipeline mode, maybe we should split this from
the assertion failure fix, since they'd need to be backpatched to
different branches.
What do you think?
Regards,
--
Fujii Masao
On Fri, 26 Sep 2025 00:03:06 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 25, 2025 at 4:22 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I've attached updated patches.
Thanks for updating the patches!
About 0001: you mentioned that the lost error message issue occurs in
pipeline mode.
Just to confirm, are you sure it never happens in non-pipeline mode?
From a quick look,
readCommandResponse() seems to have this problem regardless of whether pipeline
mode is used. If it can also happen outside pipeline mode, maybe we should split this from
the assertion failure fix, since they'd need to be backpatched to
different branches.
I could not find a code path that resets the error state before reporting in
non-pipeline mode, since it is typically reset when starting to send a query.
However, referencing an error message after another PQgetResult() does not seem
like a good idea in general, so I agree with splitting the patch.
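To illustrate the hazard, here is a minimal sketch; peek_next_result() is a hypothetical helper written for this mail only, not the actual readCommandResponse() logic:
```
#include "postgres_fe.h"
#include "common/fe_memutils.h"
#include "libpq-fe.h"

/*
 * PQerrorMessage() returns a pointer into the PGconn's internal buffer,
 * and the next PQgetResult() call may reset that buffer. Copy the message
 * before peeking so it stays valid for later reporting.
 */
static PGresult *
peek_next_result(PGconn *conn, char **saved_errmsg)
{
	*saved_errmsg = pg_strdup(PQerrorMessage(conn));	/* private copy */
	return PQgetResult(conn);	/* may clobber the buffer we copied from */
}
```
The caller then reports *saved_errmsg where needed and pg_free()s it on both the normal and the error paths.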
I'll submit updated patches soon.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, 25 Sep 2025 17:17:36 +0800
Chao Li <li.evan.chao@gmail.com> wrote:
Hi Yugo,
Thanks for the patch. After reviewing it, I got a few small comments:
Thank you for your reviewing and comments.
On Sep 25, 2025, at 15:22, Yugo Nagata <nagata@sraoss.co.jp> wrote:
--
Yugo Nagata <nagata@sraoss.co.jp>
<v13-0003-Improve-error-messages-for-errors-that-cause-cli.patch><v13-0002-Add-continue-on-error-option.patch><v13-0001-Fix-assertion-failure-and-verbose-messages-in-pi.patch>
1 - 0001
```
@@ -3265,6 +3271,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
PGresult *res;
PGresult *next_res;
int qrynum = 0;
+ char *errmsg;
```
I think we should initialize errmsg to NULL. The compiler won’t auto-initialize a local variable. If it happens to not enter the while loop, errmsg will hold a random value, and pg_free(errmsg) will have trouble.
I think this initialization is unnecessary, just like for res and next_res.
If the code happens not to enter the while loop, pg_free(errmsg) will not be
called anyway, since the error: label is only reachable from inside the loop.
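To sketch that control flow (an assumed, simplified shape for illustration; consume_results() is hypothetical, not the actual pgbench code):
```
#include "postgres_fe.h"
#include "common/fe_memutils.h"
#include "libpq-fe.h"

static void
consume_results(PGconn *conn)
{
	PGresult   *res = PQgetResult(conn);
	char	   *errmsg;			/* no NULL initialization needed */

	while (res != NULL)
	{
		PGresult   *next_res;

		/* assigned on every iteration before any goto can fire */
		errmsg = pg_strdup(PQerrorMessage(conn));
		next_res = PQgetResult(conn);

		if (PQresultStatus(res) == PGRES_FATAL_ERROR)
			goto error;			/* errmsg is always valid here */

		pg_free(errmsg);
		PQclear(res);
		res = next_res;
	}
	return;						/* loop never entered: pg_free() never runs */

error:
	fprintf(stderr, "aborted: %s", errmsg);
	pg_free(errmsg);			/* reachable only from inside the loop */
	PQclear(res);				/* next_res cleanup omitted for brevity */
}
```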
2 - 0002
```
+ <para>
+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start new transaction.
+ This option is useful when your custom script may raise errors due to some
+ reason like unique constraints violation. Without this option, the client is
+ aborted after such errors.
+ </para>
```
A few nit suggestions:
* “continue their run” => “continue running”
Fixed.
* “clients to not retry the same transactions but start new transaction” => “clients do not retry the same transaction but start a new transaction instead"
I see your point. Maybe we could follow Anthonin Bonnefoy's suggestion
to use "proceed to the next transaction", as it may sound a bit more natural.
* “due to some reason like” => “for reasons such as"
Fixed.
3 - 0002
```
+ * Without --continue-on-error:
* failed (the number of failed transactions) =
```
Maybe add an empty line after the “without” line.
Makes sense. Fixed.
4 - 0002
```
+ * When --continue-on-error is specified:
+ *
+ * failed (number of failed transactions) =
```
Maybe change to “With --continue-on-error”, which sounds consistent with the previous “without”.
Fixed.
5 - 0002
```
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
```
How about renaming this variable to “sql_errors”, which reflects the new option name.
I think it’s better to keep the current name, since the variable counts failed transactions,
even though that happens to be equivalent to the number of SQL errors. It’s also consistent
with the other variables, serialization_failures and deadlock_failures.
6 - 0002
```
@@ -4571,6 +4594,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other”;
```
I think this can just return “error”. I checked where this function is called; no other words such as “error” are appended.
getResultString() is called to get a string that represents the type of error
causing the transaction failure, so simply returning "error" doesn’t seem very
useful.
7 - 0002
```
/* it can be non-zero only if max_tries is not equal to one */
@@ -6569,6 +6602,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
```
Do we only want to print this number when “--continue-on-error” is given?
We could do that, but this message is printed only when
--failures-detailed is specified. So I think users would not mind
if it shows that the number of other failures is zero, even when
--continue-on-error is not specified.
I would appreciate hearing other people's opinions on this.
I've attached updated patches that include fixes for some of your
suggestions and for Anthonin Bonnefoy's suggestion on the documentation.
I also split the patch according to Fujii-san's suggestion.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v14-0004-pgbench-Improve-error-messages-for-errors-that-c.patchtext/x-diff; name=v14-0004-pgbench-Improve-error-messages-for-errors-that-c.patchDownload
From b8cfbb44bae06def9ed2ad78edd8e3ec80e34a16 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Thu, 10 Jul 2025 17:21:05 +0900
Subject: [PATCH v14 4/4] pgbench: Improve error messages for errors that cause
client abortion
This commit modifies relevant error messages to explicitly indicate that the
client was aborted. As part of this change, pg_log_error was replaced with
commandFailed().
---
src/bin/pgbench/pgbench.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 382c0367157..4468ff38d33 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3320,8 +3320,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
if (is_last && meta == META_GSET)
{
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3335,8 +3334,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, PQntuples(res));
+ commandFailed(st, "gset", psprintf("expected one row, got %d", PQntuples(res)));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3350,18 +3348,18 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
for (int fld = 0; fld < PQnfields(res); fld++)
{
char *varname = PQfname(res, fld);
+ char *cmd = (meta == META_ASET ? "aset" : "gset");
/* allocate varname only if necessary, freed below */
if (*varprefix != '\0')
varname = psprintf("%s%s", varprefix, varname);
/* store last row result as a string */
- if (!putVariable(&st->variables, meta == META_ASET ? "aset" : "gset", varname,
+ if (!putVariable(&st->variables, cmd, varname,
PQgetvalue(res, ntuples - 1, fld)))
{
/* internal error */
- pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
- st->id, st->use_file, st->command, qrynum, varname);
+ commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3396,8 +3394,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum, errmsg);
+ commandFailed(st, "SQL", errmsg);
goto error;
}
--
2.43.0
v14-0003-pgbench-Add-continue-on-error-option.patchtext/x-diff; name=v14-0003-pgbench-Add-continue-on-error-option.patchDownload
From e6259f68aa3683ae52baefc4f7616e0674dcd4c1 Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Fri, 19 Sep 2025 16:54:49 +0900
Subject: [PATCH v14 3/4] pgbench: Add --continue-on-error option
When the option is set, the client rolls back the failed transaction and starts
a new one when its transaction fails for reasons other than deadlock and
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 64 ++++++++++++++++----
src/bin/pgbench/pgbench.c | 60 +++++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +++++++
3 files changed, 127 insertions(+), 19 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..63230102357 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue running even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but proceed to the next
+ transaction. This option is useful when your custom script may raise errors for
+ reasons such as unique constraints violation. Without this option, the
+ client is aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted, even after the client retries <option>--max-tries</option> times,
+ so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2431,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2660,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2851,10 +2883,20 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
- or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
+ without completing the last transaction. The client also aborts
+ if a meta command fails, or if an SQL command fails for reasons other than
+ serialization or deadlock errors when <option>--continue-on-error</option>
+ is not specified. With <option>--continue-on-error</option>,
+ the client does not abort on such SQL errors and instead proceeds to
+ the next transaction. These cases are reported as
+ <literal>other failures</literal> in the output. If the error occurs
+ in a meta command, however, the client still aborts even when this option
+ is specified.
+ </para>
+ <para>
+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not abort, regardless of whether
+ <option>--continue-on-error</option> is used. Instead, the current
transaction is rolled back, which also includes setting the client variables
as they were before the run of this transaction (it is assumed that one
transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 36c52303a9a..382c0367157 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,8 +402,10 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
+ *
+ * Without --continue-on-error:
*
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
@@ -411,6 +413,13 @@ typedef struct StatsData
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * With --continue-on-error:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +449,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +784,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +969,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1483,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1533,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3366,7 +3386,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, errmsg);
@@ -4031,7 +4051,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (PQstatus(st->con) == CONNECTION_BAD)
+ st->state = CSTATE_ABORTED;
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4552,7 +4575,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4572,6 +4596,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4627,6 +4653,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4667,10 +4694,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6309,6 +6338,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6451,7 +6481,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to errors other than serialization or deadlock and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -6467,6 +6498,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6570,6 +6604,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6729,6 +6767,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7082,6 +7121,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7437,6 +7480,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..3c19a36a005 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+ '-n -t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '001_continue_on_error' => q{
+ INSERT INTO unique_table VALUES(0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.43.0
v14-0002-pgbench-Fix-assertion-failure-at-using-verbose-e.patchtext/x-diff; name=v14-0002-pgbench-Fix-assertion-failure-at-using-verbose-e.patchDownload
From c158421f3b152a89d5f2e411cbea07b2588b5c76 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Fri, 26 Sep 2025 10:43:01 +0900
Subject: [PATCH v14 2/4] pgbench: Fix assertion failure at using
--verbose-errors in pipeline mode
commandError() is called to report errors when they can be retried, and
it previously assumed that errors are always detected during SQL command
execution. However, in pipeline mode, an error may also be detected when
a \endpipeline meta-command is executed.
This caused an assertion failure. To fix this, it is now assumed that
errors can also be detected in this case.
---
src/bin/pgbench/pgbench.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f0a405ca129..36c52303a9a 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3059,7 +3059,13 @@ commandFailed(CState *st, const char *cmd, const char *message)
static void
commandError(CState *st, const char *message)
{
- Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
+ /*
+ * Errors should only be detected during an SQL command or the \endpipeline
+ * meta command. Any other case triggers an assertion failure.
+ */
+ Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND ||
+ sql_script[st->use_file].commands[st->command]->meta == META_ENDPIPELINE);
+
pg_log_info("client %d got an error in command %d (SQL) of script %d; %s",
st->id, st->command, st->use_file, message);
}
--
2.43.0
v14-0001-pgbench-Do-not-reference-error-message-after-ano.patchtext/x-diff; name=v14-0001-pgbench-Do-not-reference-error-message-after-ano.patchDownload
From 1e9bccb2d43c0d2264133ef239655947c8e0864e Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Fri, 26 Sep 2025 10:41:34 +0900
Subject: [PATCH v14 1/4] pgbench: Do not reference error message after another
PQgetResult() call
Previously, readCommandResponse() accessed the error message
after calling another PQgetResult() to peek at the next result
in order to determine whether the current one was the last.
This caused the error message to be lost in pipeline mode.
Although this issue has never been observed in non-pipeline mode,
referencing an error message after another PQgetResult() call
does not seem like a good idea in general.
Fix this by saving the previous error message and using it for reporting.
---
src/bin/pgbench/pgbench.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3cafd88ac53..f0a405ca129 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3265,6 +3265,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
PGresult *res;
PGresult *next_res;
int qrynum = 0;
+ char *errmsg;
/*
* varprefix should be set only with \gset or \aset, and \endpipeline and
@@ -3280,6 +3281,9 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
{
bool is_last;
+ /* save the previous error message before peeking at the next result */
+ errmsg = pg_strdup(PQerrorMessage(st->con));
+
/* peek at the next result to know whether the current is last */
next_res = PQgetResult(st->con);
is_last = (next_res == NULL);
@@ -3349,7 +3353,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
st->num_syncs--;
if (st->num_syncs == 0 && PQexitPipelineMode(st->con) != 1)
pg_log_error("client %d failed to exit pipeline mode: %s", st->id,
- PQerrorMessage(st->con));
+ errmsg);
break;
case PGRES_NONFATAL_ERROR:
@@ -3359,7 +3363,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (canRetryError(st->estatus))
{
if (verbose_errors)
- commandError(st, PQerrorMessage(st->con));
+ commandError(st, errmsg);
goto error;
}
/* fall through */
@@ -3367,14 +3371,14 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ st->id, st->use_file, st->command, qrynum, errmsg);
goto error;
}
PQclear(res);
qrynum++;
res = next_res;
+ pg_free(errmsg);
}
if (qrynum == 0)
@@ -3388,6 +3392,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
error:
PQclear(res);
PQclear(next_res);
+ pg_free(errmsg);
do
{
res = PQgetResult(st->con);
--
2.43.0
On Thu, 25 Sep 2025 10:27:44 +0200
Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:
Hi,
The patch looks good, I've spotted some typos in the doc.
[...]
Thank you for your review.
I've attached the updated patch in my previous post in this thread.
By the way, on the pgsql-hackers list, top-posting is generally discouraged [1],
so replying below the quoted messages is usually preferred.
[1]: https://wiki.postgresql.org/wiki/Mailing_Lists
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Fri, 26 Sep 2025 11:44:42 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:
[...]
Fujii-san, thank you for committing the patch that fixes the assertion failure.
I've attached the remaining patches so that cfbot stays green.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v15-0003-pgbench-Improve-error-messages-for-errors-that-c.patchtext/x-diff; name=v15-0003-pgbench-Improve-error-messages-for-errors-that-c.patchDownload
From 353376bd49fa322c222049fa2fada540b0b7f2b3 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Thu, 10 Jul 2025 17:21:05 +0900
Subject: [PATCH v15 3/3] pgbench: Improve error messages for errors that cause
client abortion
This commit modifies relevant error messages to explicitly indicate that the
client was aborted. As part of this change, pg_log_error was replaced with
commandFailed().
---
src/bin/pgbench/pgbench.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 18bce17a245..9afdf9e6d6c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3321,8 +3321,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
if (is_last && meta == META_GSET)
{
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3336,8 +3335,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, PQntuples(res));
+ commandFailed(st, "gset", psprintf("expected one row, got %d", PQntuples(res)));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3351,18 +3349,18 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
for (int fld = 0; fld < PQnfields(res); fld++)
{
char *varname = PQfname(res, fld);
+ char *cmd = (meta == META_ASET ? "aset" : "gset");
/* allocate varname only if necessary, freed below */
if (*varprefix != '\0')
varname = psprintf("%s%s", varprefix, varname);
/* store last row result as a string */
- if (!putVariable(&st->variables, meta == META_ASET ? "aset" : "gset", varname,
+ if (!putVariable(&st->variables, cmd, varname,
PQgetvalue(res, ntuples - 1, fld)))
{
/* internal error */
- pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
- st->id, st->use_file, st->command, qrynum, varname);
+ commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3397,8 +3395,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum, errmsg);
+ commandFailed(st, "SQL", errmsg);
goto error;
}
--
2.43.0
v15-0002-pgbench-Add-continue-on-error-option.patchtext/x-diff; name=v15-0002-pgbench-Add-continue-on-error-option.patchDownload
From 426e6cb4d711f61a792c3d4ec38e2a07bd59d2ac Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Fri, 19 Sep 2025 16:54:49 +0900
Subject: [PATCH v15 2/3] pgbench: Add --continue-on-error option
When the option is set, client rolls back the failed transaction and starts a
new one when its transaction fails due to the reason other than the deadlock and
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 64 ++++++++++++++++----
src/bin/pgbench/pgbench.c | 60 +++++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +++++++
3 files changed, 127 insertions(+), 19 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index a5edf612443..0305f4553d3 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue running even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but proceed to the next
+ transaction. This option is useful when your custom script may raise errors for
+ reasons such as unique constraints violation. Without this option, the
+ client is aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted even after clients retries <option>--max-tries</option> times by
+ default, so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2408,8 +2430,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2637,6 +2659,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2645,8 +2677,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2850,10 +2882,20 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
- or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
+ without completing the last transaction. The client also aborts
+ if a meta command fails, or if an SQL command fails for reasons other than
+ serialization or deadlock errors when <option>--continue-on-error</option>
+ is not specified. With <option>--continue-on-error</option>,
+ the client does not abort on such SQL errors and instead proceeds to
+ the next transaction. These cases are reported as
+ <literal>other failures</literal> in the output. If the error occurs
+ in a meta command, however, the client still aborts even when this option
+ is specified.
+ </para>
+ <para>
+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not abort, regardless of whether
+ <option>--continue-on-error</option> is used. Instead, the current
transaction is rolled back, which also includes setting the client variables
as they were before the run of this transaction (it is assumed that one
transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index a84c68705de..18bce17a245 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,8 +402,10 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
+ *
+ * Without --continue-on-error:
*
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
@@ -411,6 +413,13 @@ typedef struct StatsData
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * With --continue-on-error:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +449,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +784,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +969,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1483,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1533,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3367,7 +3387,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, errmsg);
@@ -4032,7 +4052,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (PQstatus(st->con) == CONNECTION_BAD)
+ st->state = CSTATE_ABORTED;
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4553,7 +4576,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4573,6 +4597,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4628,6 +4654,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4668,10 +4695,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6310,6 +6339,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6452,7 +6482,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to other than serialization or deadlock errors and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -6468,6 +6499,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6571,6 +6605,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6730,6 +6768,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7083,6 +7122,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7438,6 +7481,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..3c19a36a005 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+ '-n -t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '001_continue_on_error' => q{
+ INSERT INTO unique_table VALUES(0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.43.0
v15-0001-pgbench-Do-not-reference-error-message-after-ano.patchtext/x-diff; name=v15-0001-pgbench-Do-not-reference-error-message-after-ano.patchDownload
From 8e4b9d2489f12d342b3a613466dfa043130e8e5f Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Fri, 26 Sep 2025 10:41:34 +0900
Subject: [PATCH v15 1/3] pgbench: Do not reference error message after another
PQgetResult() call
Previously, readCommandResponse() accessed the error message
after calling another PQgetResult() to peek at the next result
in order to determine whether the current one was the last.
This caused the error message to be lost in pipeline mode.
Although this issue has never been observed in non-pipeline mode,
referencing an error message after another PQgetResult() call
does not seem like a good idea in general.
Fix this by saving the previous error message and using it for reporting.
---
src/bin/pgbench/pgbench.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index cc03af05447..a84c68705de 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3272,6 +3272,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
PGresult *res;
PGresult *next_res;
int qrynum = 0;
+ char *errmsg;
/*
* varprefix should be set only with \gset or \aset, and \endpipeline and
@@ -3287,6 +3288,9 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
{
bool is_last;
+ /* save the previous error message before peek at the next result */
+ errmsg = pg_strdup(PQerrorMessage(st->con));
+
/* peek at the next result to know whether the current is last */
next_res = PQgetResult(st->con);
is_last = (next_res == NULL);
@@ -3356,7 +3360,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
st->num_syncs--;
if (st->num_syncs == 0 && PQexitPipelineMode(st->con) != 1)
pg_log_error("client %d failed to exit pipeline mode: %s", st->id,
- PQerrorMessage(st->con));
+ errmsg);
break;
case PGRES_NONFATAL_ERROR:
@@ -3366,7 +3370,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (canRetryError(st->estatus))
{
if (verbose_errors)
- commandError(st, PQerrorMessage(st->con));
+ commandError(st, errmsg);
goto error;
}
/* fall through */
@@ -3374,14 +3378,14 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ st->id, st->use_file, st->command, qrynum, errmsg);
goto error;
}
PQclear(res);
qrynum++;
res = next_res;
+ pg_free(errmsg);
}
if (qrynum == 0)
@@ -3395,6 +3399,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
error:
PQclear(res);
PQclear(next_res);
+ pg_free(errmsg);
do
{
res = PQgetResult(st->con);
--
2.43.0
On Tue, Sep 30, 2025 at 10:24 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
Fujii-san, thank you for committing the patch that fixes the assertion failure.
I've attached the remaining patches so that cfbot stays green.
Thanks for reattaching the patches!
For 0001, after reading the docs on PQresultErrorMessage(), I wonder if it would
be better to just use that to get the error message. Thoughts?
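For context, the difference between the two libpq calls (a sketch):
```
/* connection-level message: may be overwritten by a later PQgetResult() */
char *conn_msg = PQerrorMessage(st->con);

/* result-level message: tied to this PGresult for its whole lifetime */
char *res_msg = PQresultErrorMessage(res);
```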
Regards,
--
Fujii Masao
On Tue, 30 Sep 2025 13:46:11 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Sep 30, 2025 at 10:24 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
Fujii-san, thank you for committing the patch that fixes the assertion failure.
I've attached the remaining patches so that cfbot stays green.

Thanks for reattaching the patches!
For 0001, after reading the docs on PQresultErrorMessage(), I wonder if it would
be better to just use that to get the error message. Thoughts?
Thank you for your suggestion.
I agree that it is better to use PQresultErrorMessage().
I had overlooked the existence of this interface.
I've attached the updated patches.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v16-0003-pgbench-Improve-error-messages-for-errors-that-c.patchtext/x-diff; name=v16-0003-pgbench-Improve-error-messages-for-errors-that-c.patchDownload
From 04813c8a3af687fda6bb6141eff8d8e97a0ff52f Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Thu, 10 Jul 2025 17:21:05 +0900
Subject: [PATCH v16 3/3] pgbench: Improve error messages for errors that cause
client abortion
This commit modifies relevant error messages to explicitly indicate that the
client was aborted. As part of this change, pg_log_error was replaced with
commandFailed().
---
src/bin/pgbench/pgbench.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 316d95cc1fe..680283a0122 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3317,8 +3317,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
if (is_last && meta == META_GSET)
{
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3332,8 +3331,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (meta == META_GSET && ntuples != 1)
{
/* under \gset, report the error */
- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, PQntuples(res));
+ commandFailed(st, "gset", psprintf("expected one row, got %d", PQntuples(res)));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3347,18 +3345,18 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
for (int fld = 0; fld < PQnfields(res); fld++)
{
char *varname = PQfname(res, fld);
+ char *cmd = (meta == META_ASET ? "aset" : "gset");
/* allocate varname only if necessary, freed below */
if (*varprefix != '\0')
varname = psprintf("%s%s", varprefix, varname);
/* store last row result as a string */
- if (!putVariable(&st->variables, meta == META_ASET ? "aset" : "gset", varname,
+ if (!putVariable(&st->variables, cmd, varname,
PQgetvalue(res, ntuples - 1, fld)))
{
/* internal error */
- pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
- st->id, st->use_file, st->command, qrynum, varname);
+ commandFailed(st, cmd, psprintf("error storing into variable %s", varname));
st->estatus = ESTATUS_META_COMMAND_ERROR;
goto error;
}
@@ -3393,9 +3391,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQresultErrorMessage(res));
+ commandFailed(st, "SQL", PQresultErrorMessage(res));
goto error;
}
--
2.43.0
v16-0002-pgbench-Add-continue-on-error-option.patchtext/x-diff; name=v16-0002-pgbench-Add-continue-on-error-option.patchDownload
From 137c557a27b5af9b8fdbfb9cc31b77e63ce5492b Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Fri, 19 Sep 2025 16:54:49 +0900
Subject: [PATCH v16 2/3] pgbench: Add --continue-on-error option
When the option is set, client rolls back the failed transaction and starts a
new one when its transaction fails due to the reason other than the deadlock and
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 64 ++++++++++++++++----
src/bin/pgbench/pgbench.c | 60 +++++++++++++++---
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +++++++
3 files changed, 127 insertions(+), 19 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index a5edf612443..0305f4553d3 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue running even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but proceed to the next
+ transaction. This option is useful when your custom script may raise errors for
+ reasons such as unique constraints violation. Without this option, the
+ client is aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted even after clients retries <option>--max-tries</option> times by
+ default, so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2408,8 +2430,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2637,6 +2659,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2645,8 +2677,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2850,10 +2882,20 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
- or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
+ without completing the last transaction. The client also aborts
+ if a meta command fails, or if an SQL command fails for reasons other than
+ serialization or deadlock errors when <option>--continue-on-error</option>
+ is not specified. With <option>--continue-on-error</option>,
+ the client does not abort on such SQL errors and instead proceeds to
+ the next transaction. These cases are reported as
+ <literal>other failures</literal> in the output. If the error occurs
+ in a meta command, however, the client still aborts even when this option
+ is specified.
+ </para>
+ <para>
+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not abort, regardless of whether
+ <option>--continue-on-error</option> is used. Instead, the current
transaction is rolled back, which also includes setting the client variables
as they were before the run of this transaction (it is assumed that one
transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 36c6469149e..316d95cc1fe 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,8 +402,10 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
+ * A failed transaction is counted differently depending on whether
+ * the --continue-on-error option is specified.
+ *
+ * Without --continue-on-error:
*
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
@@ -411,6 +413,13 @@ typedef struct StatsData
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * With --continue-on-error:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
+ *
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
*
@@ -440,6 +449,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +784,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +969,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1483,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1533,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3363,7 +3383,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQresultErrorMessage(res));
@@ -4027,7 +4047,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (PQstatus(st->con) == CONNECTION_BAD)
+ st->state = CSTATE_ABORTED;
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4548,7 +4571,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4568,6 +4592,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4623,6 +4649,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4663,10 +4690,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6305,6 +6334,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6447,7 +6477,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to other than serialization or deadlock errors and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -6463,6 +6494,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6566,6 +6600,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6725,6 +6763,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7078,6 +7117,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7433,6 +7476,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..3c19a36a005 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+ '-n -t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '001_continue_on_error' => q{
+ INSERT INTO unique_table VALUES(0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.43.0
v16-0001-pgbench-Use-PQresultErrorMessage-instead-of-PQer.patchtext/x-diff; name=v16-0001-pgbench-Use-PQresultErrorMessage-instead-of-PQer.patchDownload
From 30cd08bac78448afa18630b3c479b2630b4279ba Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Fri, 26 Sep 2025 10:41:34 +0900
Subject: [PATCH v16 1/3] pgbench: Use PQresultErrorMessage() instead of
PQerrorMessage()
Previously, readCommandResponse() used PQerrorMessage() to get the error
message after calling another PQgetResult() to peek at the next result
in order to determine whether the current one was the last.
This caused the error message to be lost in pipeline mode.
Although this issue has never been observed in non-pipeline mode,
referencing an error message using PQerrorMessage() after another
PQgetResult() call does not seem like a good idea in general.
Fix this by using PQresultErrorMessage() instead.
---
src/bin/pgbench/pgbench.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index cc03af05447..36c6469149e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3356,7 +3356,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
st->num_syncs--;
if (st->num_syncs == 0 && PQexitPipelineMode(st->con) != 1)
pg_log_error("client %d failed to exit pipeline mode: %s", st->id,
- PQerrorMessage(st->con));
+ PQresultErrorMessage(res));
break;
case PGRES_NONFATAL_ERROR:
@@ -3366,7 +3366,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
if (canRetryError(st->estatus))
{
if (verbose_errors)
- commandError(st, PQerrorMessage(st->con));
+ commandError(st, PQresultErrorMessage(res));
goto error;
}
/* fall through */
@@ -3375,7 +3375,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
/* anything else is unexpected */
pg_log_error("client %d script %d aborted in command %d query %d: %s",
st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ PQresultErrorMessage(res));
goto error;
}
--
2.43.0
On Tue, Sep 30, 2025 at 3:17 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Tue, 30 Sep 2025 13:46:11 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Sep 30, 2025 at 10:24 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
Fujii-san, thank you for committing the patch that fixes the assertion failure.
I've attached the remaining patches so that cfbot stays green.

Thanks for reattaching the patches!
For 0001, after reading the docs on PQresultErrorMessage(), I wonder if it would
be better to just use that to get the error message. Thoughts?

Thank you for your suggestion.
I agree that it is better to use PQresultErrorMessage().
I had overlooked the existence of this interface.

I've attached the updated patches.
Thanks for updating the patches! I've pushed 0001.
Regarding 0002:
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQresultErrorMessage(res));
goto error;
With this change, even non-SQL errors (e.g., connection failures) would
satisfy the condition when --continue-on-error is set. Isn't that a problem?
Shouldn't we also check that the error status is one that
--continue-on-error is meant to handle?
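For example, the continue path could be restricted to plain SQL errors along these
lines (a rough sketch, not the wording of any posted patch):
```
/* sketch: only retryable errors or, with --continue-on-error, other
 * SQL errors take the error path; anything else aborts the client */
if (canRetryError(st->estatus) ||
    (continue_on_error && st->estatus == ESTATUS_OTHER_SQL_ERROR))
{
    if (verbose_errors)
        commandError(st, PQresultErrorMessage(res));
    goto error;
}
```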
+ * Without --continue-on-error:
*
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried).
*
+ * With --continue-on-error:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).
About the comments on failed transactions: I don't think we need
to split them into separate "with/without --continue-on-error" sections.
How about simplifying them like this?
------------------------
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried) +
* 'other_sql_failures' (they failed on the first try or after retries
* due to a SQL error other than serialization or
* deadlock; they are counted as a failed transaction
* only when --continue-on-error is specified).
------------------------
* 'retried' (number of all retried transactions) =
* successfully retried transactions +
* failed transactions.
Since transactions that failed on the first try (i.e., no retries) due to
an SQL error are not counted as 'retried', shouldn't this source comment
be updated?
Regards,
--
Fujii Masao
Hi,
On 2025/10/02 1:22, Fujii Masao wrote:
Regarding 0002:
- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQresultErrorMessage(res));
goto error;

With this change, even non-SQL errors (e.g., connection failures) would
satisfy the condition when --continue-on-error is set. Isn't that a problem?
Shouldn't we also check that the error status is one that
--continue-on-error is meant to handle?
I agree that connection failures should not be ignored even when
--continue-on-error is specified.
For now, I’m not sure if other cases would cause issues, so the updated patch
explicitly checks the connection status and emits an error message when the
connection is lost.
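For reference, the corresponding condition in readCommandResponse() in the attached
v17 patch becomes roughly:
```
case PGRES_FATAL_ERROR:
    st->estatus = getSQLErrorStatus(PQresultErrorField(res,
                                                       PG_DIAG_SQLSTATE));
    /* take the error path only while the connection is still usable */
    if ((continue_on_error || canRetryError(st->estatus)) &&
        PQstatus(st->con) != CONNECTION_BAD)
    {
        if (verbose_errors)
            commandError(st, PQresultErrorMessage(res));
        goto error;
    }
```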
+ * Without --continue-on-error:
 *
 * failed (the number of failed transactions) =
 * 'serialization_failures' (they got a serialization error and were not
 * successfully retried) +
 * 'deadlock_failures' (they got a deadlock error and were not
 * successfully retried).
 *
+ * With --continue-on-error:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).

About the comments on failed transactions: I don't think we need
to split them into separate "with/without --continue-on-error" sections.
How about simplifying them like this?
------------------------
* failed (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
* successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
* successfully retried) +
* 'other_sql_failures' (they failed on the first try or after retries
* due to a SQL error other than serialization or
* deadlock; they are counted as a failed transaction
* only when --continue-on-error is specified).
------------------------
Thank you for the suggestion. I’ve updated the comments as you proposed.
* 'retried' (number of all retried transactions) =
* successfully retried transactions +
* failed transactions.

Since transactions that failed on the first try
an SQL error are not counted as 'retried', shouldn't this source comment
be updated?
Agreed. I updated the comment so that failed transactions are counted in 'retried' only when they were actually retried.
I've attached the updated patch v17-0002. 0003 remains unchanged.
Best regards,
Rintaro Ikeda
Attachments:
v17-0002-pgbench-Add-continue-on-error-option.patchtext/plain; charset=UTF-8; name=v17-0002-pgbench-Add-continue-on-error-option.patchDownload
From 8ae5be55a2704f813e200917968ae040146486ab Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Fri, 19 Sep 2025 16:54:49 +0900
Subject: [PATCH v17 2/3] pgbench: Add --continue-on-error option
When the option is set, client rolls back the failed transaction and starts a
new one when its transaction fails due to the reason other than the deadlock and
serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 64 ++++++++++++++++----
src/bin/pgbench/pgbench.c | 63 +++++++++++++++----
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +++++++
3 files changed, 125 insertions(+), 24 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..63230102357 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue running even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but proceed to the next
+ transaction. This option is useful when your custom script may raise errors for
+ reasons such as unique constraints violation. Without this option, the
+ client is aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted even after clients retries <option>--max-tries</option> times by
+ default, so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2409,8 +2431,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2638,6 +2660,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2646,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2851,10 +2883,20 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
- or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
+ without completing the last transaction. The client also aborts
+ if a meta command fails, or if an SQL command fails for reasons other than
+ serialization or deadlock errors when <option>--continue-on-error</option>
+ is not specified. With <option>--continue-on-error</option>,
+ the client does not abort on such SQL errors and instead proceeds to
+ the next transaction. These cases are reported as
+ <literal>other failures</literal> in the output. If the error occurs
+ in a meta command, however, the client still aborts even when this option
+ is specified.
+ </para>
+ <para>
+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not abort, regardless of whether
+ <option>--continue-on-error</option> is used. Instead, the current
transaction is rolled back, which also includes setting the client variables
as they were before the run of this transaction (it is assumed that one
transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 8656a87d280..7aa4dd0a893 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,14 +402,15 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
- *
- * failed (the number of failed transactions) =
+ * 'failed' (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
- * successfully retried) +
+ * successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
- * successfully retried).
+ * successfully retried) +
+ * 'other_sql_failures' (they failed on the first try or after retries
+ * due to a SQL error other than serialization or
+ * deadlock; they are counted as a failed transaction
+ * only when --continue-on-error is specified).
*
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
@@ -421,7 +422,7 @@ typedef struct StatsData
*
* 'retried' (number of all retried transactions) =
* successfully retried transactions +
- * failed transactions.
+ * unsuccessful retried transactions.
*----------
*/
int64 cnt; /* number of successful transactions, not
@@ -440,6 +441,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -770,6 +776,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +961,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1475,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1525,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3356,7 +3368,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ if ((continue_on_error || canRetryError(st->estatus)) &&
+ PQstatus(st->con) != CONNECTION_BAD)
{
if (verbose_errors)
commandError(st, PQresultErrorMessage(res));
@@ -4020,7 +4033,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (PQstatus(st->con) == CONNECTION_BAD)
+ st->state = CSTATE_ABORTED;
+ else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+ canRetryError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4541,7 +4557,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4561,6 +4578,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4616,6 +4635,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4656,10 +4676,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6298,6 +6320,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6440,7 +6463,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to other than serialization or deadlock errors and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -6456,6 +6480,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6559,6 +6586,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6718,6 +6749,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7071,6 +7103,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7426,6 +7462,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 7dd78940300..3c19a36a005 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1813,6 +1813,28 @@ update counter set i = i+1 returning i \gset
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+ '-n -t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '001_continue_on_error' => q{
+ INSERT INTO unique_table VALUES(0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.39.5 (Apple Git-154)
On Sun, Oct 19, 2025 at 10:12 PM Rintaro Ikeda
<ikedarintarof@oss.nttdata.com> wrote:
Hi,
On 2025/10/02 1:22, Fujii Masao wrote:
Regarding 0002:
-        if (canRetryError(st->estatus))
+        if (continue_on_error || canRetryError(st->estatus))
         {
             if (verbose_errors)
                 commandError(st, PQresultErrorMessage(res));
             goto error;

With this change, even non-SQL errors (e.g., connection failures) would
satisfy the condition when --continue-on-error is set. Isn't that a problem?
Shouldn't we also check that the error status is one that
--continue-on-error is meant to handle?

I agree that connection failures should not be ignored even when
--continue-on-error is specified.
For now, I’m not sure if other cases would cause issues, so the updated patch
explicitly checks the connection status and emits an error message when the
connection is lost.
I agree that connection failures should prevent further processing even with
--continue-on-error, and pgbench should focus on handling that first.
However, the patch doesn't seem to handle cases where the connection is
terminated by an admin (e.g., via pg_terminate_backend()) correctly.
Please see the following test case, which is the same one I shared earlier:
-----------------------------------------
$ cat pipeline.sql
\startpipeline
DO $$
BEGIN
PERFORM pg_sleep(3);
PERFORM pg_terminate_backend(pg_backend_pid());
END $$;
\endpipeline
$ pgbench -n -f pipeline.sql -c 2 -t 4 -M extended --continue-on-error
-----------------------------------------
In this case, PQstatus() (added in readCommandResponse() by the patch)
still returns CONNECTION_OK (BTW, the SQLSTATE is 57P01 in this case).
As a result, the expected error message like “client ... script ... aborted
in command ...” isn't reported. So the PQstatus() check alone that
the patch added doesn't fully fix the issue.
Regards,
--
Fujii Masao
On Tue, Oct 21, 2025 at 9:58 AM Fujii Masao <masao.fujii@gmail.com> wrote:
I agree that connection failures should prevent further processing even with
--continue-on-error, and pgbench should focus on handling that first.
However, the patch doesn't seem to handle cases where the connection is
terminated by an admin (e.g., via pg_terminate_backend()) correctly.
Please see the following test case, which is the same one I shared earlier:
-----------------------------------------
$ cat pipeline.sql
\startpipeline
DO $$
BEGIN
PERFORM pg_sleep(3);
PERFORM pg_terminate_backend(pg_backend_pid());
END $$;
\endpipeline

$ pgbench -n -f pipeline.sql -c 2 -t 4 -M extended --continue-on-error
-----------------------------------------

In this case, PQstatus() (added in readCommandResponse() by the patch)
still returns CONNECTION_OK (BTW, the SQLSTATE is 57P01 in this case).
As a result, the expected error message like “client ... script ... aborted
in command ...” isn't reported. So the PQstatus() check alone that
the patch added doesn't fully fix the issue.
One approach to address this issue is to keep calling PQgetResult() until
it returns NULL, and then check the connection status when getSQLErrorStatus()
determines the error state. If the connection status is CONNECTION_BAD
at that point, we can treat it as a connection failure and stop processing
even when --continue-on-error is specified. Attached is a WIP patch
implementing this idea based on the v17 patch. It still needs more testing,
review, and possibly documentation updates.
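In condensed form, the idea looks like this (a minimal sketch rather than the patch itself; connectionFailed() is a hypothetical helper, and the pipeline-mode subtleties discussed later in this thread are omitted):
```
#include <libpq-fe.h>
#include <stdbool.h>

/*
 * Drain all pending results, then report whether the connection itself
 * has failed.  Right after an error result is read, PQstatus() can still
 * report CONNECTION_OK; only once libpq has consumed the remaining input
 * does it notice that the server closed the connection.
 */
static bool
connectionFailed(PGconn *conn)
{
    PGresult *res;

    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);

    return PQstatus(conn) == CONNECTION_BAD;
}
```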
Another option would be to explicitly list all SQLSTATE codes (e.g., 57P01)
that should prevent continued processing, even with --continue-on-error,
inside getSQLErrorStatus(). However, maintaining such a list would be
cumbersome, so I believe the first approach is preferable. Thoughts?
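For reference, such a list-based check might look like the sketch below. The codes other than 57P01 are only examples of what would need to be enumerated and kept in sync with the server, which is the maintenance burden mentioned above:
```
#include <stdbool.h>
#include <string.h>

/* Hypothetical list of SQLSTATEs that should still abort the client. */
#define ERRCODE_ADMIN_SHUTDOWN      "57P01"    /* e.g., pg_terminate_backend() */
#define ERRCODE_CRASH_SHUTDOWN      "57P02"
#define ERRCODE_CANNOT_CONNECT_NOW  "57P03"

static bool
isFatalSqlState(const char *sqlState)
{
    return sqlState != NULL &&
        (strcmp(sqlState, ERRCODE_ADMIN_SHUTDOWN) == 0 ||
         strcmp(sqlState, ERRCODE_CRASH_SHUTDOWN) == 0 ||
         strcmp(sqlState, ERRCODE_CANNOT_CONNECT_NOW) == 0);
}
```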
Regards,
--
Fujii Masao
Attachments:
v18-0001-pgbench-Add-continue-on-error-option.patchapplication/octet-stream; name=v18-0001-pgbench-Add-continue-on-error-option.patchDownload
From 7fc0d4de78fa3873068e9fc4672a97b9c0181686 Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Mon, 27 Oct 2025 15:23:11 +0900
Subject: [PATCH v18] pgbench: Add --continue-on-error option
When this option is set, the client rolls back the failed transaction and
starts a new one when the transaction fails for a reason other than a
deadlock or serialization failure.
---
doc/src/sgml/ref/pgbench.sgml | 64 +++++++++--
src/bin/pgbench/pgbench.c | 108 +++++++++++++++----
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 ++++
3 files changed, 161 insertions(+), 33 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index a5edf612443..0305f4553d3 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue running even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but proceed to the next
+ transaction. This option is useful when your custom script may raise errors for
+ reasons such as unique constraint violations. Without this option, the
+ client is aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted even after the client retries <option>--max-tries</option> times by
+ default, so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2408,8 +2430,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2637,6 +2659,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2645,8 +2677,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2850,10 +2882,20 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
- or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
+ without completing the last transaction. The client also aborts
+ if a meta command fails, or if an SQL command fails for reasons other than
+ serialization or deadlock errors when <option>--continue-on-error</option>
+ is not specified. With <option>--continue-on-error</option>,
+ the client does not abort on such SQL errors and instead proceeds to
+ the next transaction. These cases are reported as
+ <literal>other failures</literal> in the output. If the error occurs
+ in a meta command, however, the client still aborts even when this option
+ is specified.
+ </para>
+ <para>
+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not abort, regardless of whether
+ <option>--continue-on-error</option> is used. Instead, the current
transaction is rolled back, which also includes setting the client variables
as they were before the run of this transaction (it is assumed that one
transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1515ed405ba..dd8c2b2748e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,14 +402,15 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
- *
- * failed (the number of failed transactions) =
+ * 'failed' (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
- * successfully retried) +
+ * successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
- * successfully retried).
+ * successfully retried) +
+ * 'other_sql_failures' (they failed on the first try or after retries
+ * due to a SQL error other than serialization or
+ * deadlock; they are counted as a failed transaction
+ * only when --continue-on-error is specified).
*
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
@@ -421,7 +422,7 @@ typedef struct StatsData
*
* 'retried' (number of all retried transactions) =
* successfully retried transactions +
- * failed transactions.
+ * unsuccessfully retried transactions.
*----------
*/
int64 cnt; /* number of successful transactions, not
@@ -440,6 +441,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -462,6 +468,7 @@ typedef enum EStatus
ESTATUS_SERIALIZATION_ERROR,
ESTATUS_DEADLOCK_ERROR,
ESTATUS_OTHER_SQL_ERROR,
+ ESTATUS_CONN_ERROR,
} EStatus;
/*
@@ -770,6 +777,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -954,6 +962,7 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-on-error continue running after an SQL error\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1467,6 +1476,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1526,11 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
+ case ESTATUS_CONN_ERROR:
+ break; /* don't count connection failures */
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3231,11 +3246,30 @@ sendCommand(CState *st, Command *command)
}
/*
- * Get the error status from the error code.
+ * Read and discard all available results from the connection.
+ */
+static void
+discardAvailableResults(CState *st)
+{
+ PGresult *res;
+
+ do
+ {
+ res = PQgetResult(st->con);
+ PQclear(res);
+ } while (res);
+}
+
+/*
+ * Determine the error status based on the connection status and error code.
*/
static EStatus
-getSQLErrorStatus(const char *sqlState)
+getSQLErrorStatus(CState *st, const char *sqlState)
{
+ discardAvailableResults(st);
+ if (PQstatus(st->con) == CONNECTION_BAD)
+ return ESTATUS_CONN_ERROR;
+
if (sqlState != NULL)
{
if (strcmp(sqlState, ERRCODE_T_R_SERIALIZATION_FAILURE) == 0)
@@ -3257,6 +3291,17 @@ canRetryError(EStatus estatus)
estatus == ESTATUS_DEADLOCK_ERROR);
}
+/*
+ * Returns true if --continue-on-error is specified and this error allows
+ * processing to continue.
+ */
+static bool
+canContinueOnError(EStatus estatus)
+{
+ return (continue_on_error &&
+ estatus == ESTATUS_OTHER_SQL_ERROR);
+}
+
/*
* Process query response from the backend.
*
@@ -3375,9 +3420,9 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
- st->estatus = getSQLErrorStatus(PQresultErrorField(res,
- PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ st->estatus = getSQLErrorStatus(st, PQresultErrorField(res,
+ PG_DIAG_SQLSTATE));
+ if (canRetryError(st->estatus) || canContinueOnError(st->estatus))
{
if (verbose_errors)
commandError(st, PQresultErrorMessage(res));
@@ -3409,11 +3454,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
error:
PQclear(res);
PQclear(next_res);
- do
- {
- res = PQgetResult(st->con);
- PQclear(res);
- } while (res);
+ discardAvailableResults(st);
return false;
}
@@ -4041,7 +4082,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (canRetryError(st->estatus) || canContinueOnError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4562,7 +4603,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4582,6 +4624,10 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
+ case ESTATUS_CONN_ERROR:
+ return "connection";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4637,6 +4683,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4677,10 +4724,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6319,6 +6368,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6461,7 +6511,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to other than serialization or deadlock errors and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -6477,6 +6528,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6580,6 +6634,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6739,6 +6797,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7092,6 +7151,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7447,6 +7510,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index f820e88abe4..581e9af7907 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1835,6 +1835,28 @@ $node->pgbench(
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+ '-n -t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '001_continue_on_error' => q{
+ INSERT INTO unique_table VALUES(0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.50.1
On Mon, Oct 27, 2025 at 6:13 PM Fujii Masao <masao.fujii@gmail.com> wrote:
One approach to address this issue is to keep calling PQgetResult() until
it returns NULL, and then check the connection status when getSQLErrorStatus()
determines the error state. If the connection status is CONNECTION_BAD
at that point, we can treat it as a connection failure and stop processing
even when --continue-on-error is specified. Attached is a WIP patch
implementing this idea based on the v17 patch. It still needs more testing,
review, and possibly documentation updates.

Another option would be to explicitly list all SQLSTATE codes (e.g., 57P01)
that should prevent continued processing, even with --continue-on-error,
inside getSQLErrorStatus(). However, maintaining such a list would be
cumbersome, so I believe the first approach is preferable. Thoughts?

Nagata-san let me know off-list that there was a case where the previous
patch didn't work correctly in pipeline mode. I've updated the patch so that
--continue-on-error now works properly in that mode, and also revised
the commit message. Updated patch attached.
Regards,
--
Fujii Masao
Attachments:
v19-0001-pgbench-Add-continue-on-error-option.patchapplication/octet-stream; name=v19-0001-pgbench-Add-continue-on-error-option.patchDownload
From 92e1ea78e898c2df11bc505216e059f6c9d3714b Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Tue, 28 Oct 2025 10:33:39 +0900
Subject: [PATCH v19] pgbench: Add --continue-on-error option.
This commit adds the --continue-on-error option, allowing pgbench clients
to continue running even when SQL statements fail for reasons other than
serialization or deadlock errors. Without this option (by default),
the clients aborted in such cases, which was the only available behavior
previously.
This option is useful for benchmarks using custom scripts that may
raise errors, such as unique constraint violations, where users want
pgbench to complete the run despite individual statement failures.
Author: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Co-authored-by: Yugo Nagata <nagata@sraoss.co.jp>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Stepan Neretin <slpmcf@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/44334231a4d214fac382a69cceb7d9fc@oss.nttdata.com
---
doc/src/sgml/ref/pgbench.sgml | 64 ++++++++--
src/bin/pgbench/pgbench.c | 117 +++++++++++++++----
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 ++++
3 files changed, 170 insertions(+), 33 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index a5edf612443..0305f4553d3 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -790,6 +789,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -914,6 +916,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue running even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but proceed to the next
+ transaction. This option is useful when your custom script may raise errors for
+ reasons such as unique constraint violations. Without this option, the
+ client is aborted after such errors.
+ </para>
+ <para>
+ Note that serialization and deadlock failures never cause the client to be
+ aborted even after the client retries <option>--max-tries</option> times by
+ default, so they are not affected by this option.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</para>
@@ -2408,8 +2430,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2637,6 +2659,16 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless <option>--failures-detailed</option> is specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2645,8 +2677,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2850,10 +2882,20 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
- or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
+ without completing the last transaction. The client also aborts
+ if a meta command fails, or if an SQL command fails for reasons other than
+ serialization or deadlock errors when <option>--continue-on-error</option>
+ is not specified. With <option>--continue-on-error</option>,
+ the client does not abort on such SQL errors and instead proceeds to
+ the next transaction. These cases are reported as
+ <literal>other failures</literal> in the output. If the error occurs
+ in a meta command, however, the client still aborts even when this option
+ is specified.
+ </para>
+ <para>
+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not abort, regardless of whether
+ <option>--continue-on-error</option> is used. Instead, the current
transaction is rolled back, which also includes setting the client variables
as they were before the run of this transaction (it is assumed that one
transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1515ed405ba..d8764ba6fe0 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,14 +402,15 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
- *
- * failed (the number of failed transactions) =
+ * 'failed' (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
- * successfully retried) +
+ * successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
- * successfully retried).
+ * successfully retried) +
+ * 'other_sql_failures' (they failed on the first try or after retries
+ * due to a SQL error other than serialization or
+ * deadlock; they are counted as a failed transaction
+ * only when --continue-on-error is specified).
*
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
@@ -421,7 +422,7 @@ typedef struct StatsData
*
* 'retried' (number of all retried transactions) =
* successfully retried transactions +
- * failed transactions.
+ * unsuccessfully retried transactions.
*----------
*/
int64 cnt; /* number of successful transactions, not
@@ -440,6 +441,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -457,6 +463,7 @@ typedef enum EStatus
{
ESTATUS_NO_ERROR = 0,
ESTATUS_META_COMMAND_ERROR,
+ ESTATUS_CONN_ERROR,
/* SQL errors */
ESTATUS_SERIALIZATION_ERROR,
@@ -770,6 +777,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -949,6 +957,7 @@ usage(void)
" -T, --time=NUM duration of benchmark test in seconds\n"
" -v, --vacuum-all vacuum all four standard tables before tests\n"
" --aggregate-interval=NUM aggregate data over NUM seconds\n"
+ " --continue-on-error continue running after an SQL error\n"
" --exit-on-abort exit when any client is aborted\n"
" --failures-detailed report the failures grouped by basic types\n"
" --log-prefix=PREFIX prefix for transaction time log file\n"
@@ -1467,6 +1476,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1526,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3231,11 +3244,43 @@ sendCommand(CState *st, Command *command)
}
/*
- * Get the error status from the error code.
+ * Read and discard all available results from the connection.
+ */
+static void
+discardAvailableResults(CState *st)
+{
+ PGresult *res = NULL;
+
+ for (;;)
+ {
+ res = PQgetResult(st->con);
+
+ /*
+ * Read and discard results until PQgetResult() returns NULL (no more
+ * results) or a connection failure is detected. If the pipeline
+ * status is PQ_PIPELINE_ABORTED, more results may still be available
+ * even after PQgetResult() returns NULL, so continue reading in that
+ * case.
+ */
+ if ((res == NULL && PQpipelineStatus(st->con) != PQ_PIPELINE_ABORTED) ||
+ PQstatus(st->con) == CONNECTION_BAD)
+ break;
+
+ PQclear(res);
+ }
+ PQclear(res);
+}
+
+/*
+ * Determine the error status based on the connection status and error code.
*/
static EStatus
-getSQLErrorStatus(const char *sqlState)
+getSQLErrorStatus(CState *st, const char *sqlState)
{
+ discardAvailableResults(st);
+ if (PQstatus(st->con) == CONNECTION_BAD)
+ return ESTATUS_CONN_ERROR;
+
if (sqlState != NULL)
{
if (strcmp(sqlState, ERRCODE_T_R_SERIALIZATION_FAILURE) == 0)
@@ -3257,6 +3302,17 @@ canRetryError(EStatus estatus)
estatus == ESTATUS_DEADLOCK_ERROR);
}
+/*
+ * Returns true if --continue-on-error is specified and this error allows
+ * processing to continue.
+ */
+static bool
+canContinueOnError(EStatus estatus)
+{
+ return (continue_on_error &&
+ estatus == ESTATUS_OTHER_SQL_ERROR);
+}
+
/*
* Process query response from the backend.
*
@@ -3375,9 +3431,9 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
- st->estatus = getSQLErrorStatus(PQresultErrorField(res,
- PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ st->estatus = getSQLErrorStatus(st, PQresultErrorField(res,
+ PG_DIAG_SQLSTATE));
+ if (canRetryError(st->estatus) || canContinueOnError(st->estatus))
{
if (verbose_errors)
commandError(st, PQresultErrorMessage(res));
@@ -3409,11 +3465,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
error:
PQclear(res);
PQclear(next_res);
- do
- {
- res = PQgetResult(st->con);
- PQclear(res);
- } while (res);
+ discardAvailableResults(st);
return false;
}
@@ -4041,7 +4093,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (canRetryError(st->estatus) || canContinueOnError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4562,7 +4614,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4582,6 +4635,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4637,6 +4692,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4677,10 +4733,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6319,6 +6377,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6461,7 +6520,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to other than serialization or deadlock errors and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -6477,6 +6537,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6580,6 +6643,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6739,6 +6806,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7092,6 +7160,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7447,6 +7519,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index f820e88abe4..581e9af7907 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1835,6 +1835,28 @@ $node->pgbench(
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+ '-n -t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '001_continue_on_error' => q{
+ INSERT INTO unique_table VALUES(0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.50.1
On Wed, Oct 29, 2025 at 1:00 AM Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Oct 27, 2025 at 6:13 PM Fujii Masao <masao.fujii@gmail.com> wrote:
One approach to address this issue is to keep calling PQgetResult() until
it returns NULL, and then check the connection status when getSQLErrorStatus()
determines the error state. If the connection status is CONNECTION_BAD
at that point, we can treat it as a connection failure and stop processing
even when --continue-on-error is specified. Attached is a WIP patch
implementing this idea based on the v17 patch. It still needs more testing,
review, and possibly documentation updates.

Another option would be to explicitly list all SQLSTATE codes (e.g., 57P01)
that should prevent continued processing, even with --continue-on-error,
inside getSQLErrorStatus(). However, maintaining such a list would be
cumbersome, so I believe the first approach is preferable. Thoughts?

Nagata-san let me know off-list that there was a case where the previous
patch didn't work correctly in pipeline mode. I've updated the patch so that
--continue-on-error now works properly in that mode, and also revised
the commit message. Updated patch attached.
In the v19 patch, the description of --continue-on-error was placed right after
--verbose-errors in the docs. Since pgbench long option descriptions are listed
in alphabetical order, I've moved it to follow --aggregate-interval instead.
I've also refined the wording of the --continue-on-error description.
Attached is the updated patch. Unless there are any objections, I will
commit it.
Regards,
--
Fujii Masao
Attachments:
v20-0001-pgbench-Add-continue-on-error-option.patchapplication/octet-stream; name=v20-0001-pgbench-Add-continue-on-error-option.patchDownload
From e8edd13874145bf9d888b1ba6a3e9154a25ba4fe Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Wed, 5 Nov 2025 13:57:18 +0900
Subject: [PATCH v20] pgbench: Add --continue-on-error option.
This commit adds the --continue-on-error option, allowing pgbench clients
to continue running even when SQL statements fail for reasons other than
serialization or deadlock errors. Without this option (by default),
the clients aborted in such cases, which was the only available behavior
previously.
This option is useful for benchmarks using custom scripts that may
raise errors, such as unique constraint violations, where users want
pgbench to complete the run despite individual statement failures.
Author: Rintaro Ikeda <ikedarintarof@oss.nttdata.com>
Co-authored-by: Yugo Nagata <nagata@sraoss.co.jp>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Stepan Neretin <slpmcf@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/44334231a4d214fac382a69cceb7d9fc@oss.nttdata.com
---
doc/src/sgml/ref/pgbench.sgml | 65 +++++++++--
src/bin/pgbench/pgbench.c | 117 +++++++++++++++----
src/bin/pgbench/t/001_pgbench_with_server.pl | 22 ++++
3 files changed, 171 insertions(+), 33 deletions(-)
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index a5edf612443..ecfc3d2f2b7 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
and number of transactions per client); these will be equal unless the run
failed before completion or some SQL command(s) failed. (In
<option>-T</option> mode, only the actual number of transactions is printed.)
- The next line reports the number of failed transactions due to
- serialization or deadlock errors (see <xref linkend="failures-and-retries"/>
- for more information).
+ The next line reports the number of failed transactions (see
+ <xref linkend="failures-and-retries"/> for more information).
The last line reports the number of transactions per second.
</para>
@@ -759,6 +758,26 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
</listitem>
</varlistentry>
+ <varlistentry id="pgbench-option-continue-on-error">
+ <term><option>--continue-on-error</option></term>
+ <listitem>
+ <para>
+ Allows clients to continue running even if an SQL statement fails
+ due to errors other than serialization or deadlock. By default,
+ clients abort after such errors, but with this option enabled,
+ they proceed to the next transaction instead. Note that
+ clients still abort even with this option if an error causes
+ the connection to fail.
+ See <xref linkend="failures-and-retries"/> for more information.
+ </para>
+ <para>
+ This option is useful when your custom script may raise errors
+ such as unique constraint violations, but you want the benchmark
+ to continue and measure performance including those failures.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="pgbench-option-exit-on-abort">
<term><option>--exit-on-abort</option></term>
<listitem>
@@ -790,6 +809,9 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
<listitem>
<para>deadlock failures;</para>
</listitem>
+ <listitem>
+ <para>other failures;</para>
+ </listitem>
</itemizedlist>
See <xref linkend="failures-and-retries"/> for more information.
</para>
@@ -2408,8 +2430,8 @@ END;
will be reported as <literal>failed</literal>. If you use the
<option>--failures-detailed</option> option, the
<replaceable>time</replaceable> of the failed transaction will be reported as
- <literal>serialization</literal> or
- <literal>deadlock</literal> depending on the type of failure (see
+ <literal>serialization</literal>, <literal>deadlock</literal>, or
+ <literal>other</literal> depending on the type of failure (see
<xref linkend="failures-and-retries"/> for more information).
</para>
@@ -2637,6 +2659,17 @@ END;
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><replaceable>other_sql_failures</replaceable></term>
+ <listitem>
+ <para>
+ number of transactions that got an SQL error
+ (zero unless both <option>--failures-detailed</option> and
+ <option>--continue-on-error</option> are specified)
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
@@ -2645,8 +2678,8 @@ END;
<screen>
<userinput>pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 test</userinput>
-1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0
-1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0
+1650260552 5178 26171317 177284491527 1136 44462 2647617 7321113867 0 9866 64 7564 28340 4148 0 0
+1650260562 4808 25573984 220121792172 1171 62083 3037380 9666800914 0 9998 598 7392 26621 4527 0 0
</screen>
</para>
@@ -2850,10 +2883,20 @@ statement latencies in milliseconds, failures and retries:
<para>
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
- without completing the last transaction. In addition, if execution of an SQL
- or meta command fails for reasons other than serialization or deadlock errors,
- the client is aborted. Otherwise, if an SQL command fails with serialization or
- deadlock errors, the client is not aborted. In such cases, the current
+ without completing the last transaction. The client also aborts
+ if a meta command fails, or if an SQL command fails for reasons other than
+ serialization or deadlock errors when <option>--continue-on-error</option>
+ is not specified. With <option>--continue-on-error</option>,
+ the client does not abort on such SQL errors and instead proceeds to
+ the next transaction. These cases are reported as
+ <literal>other failures</literal> in the output. If the error occurs
+ in a meta command, however, the client still aborts even when this option
+ is specified.
+ </para>
+ <para>
+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not abort, regardless of whether
+ <option>--continue-on-error</option> is used. Instead, the current
transaction is rolled back, which also includes setting the client variables
as they were before the run of this transaction (it is assumed that one
transaction script contains only one transaction; see
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1515ed405ba..d8764ba6fe0 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -402,14 +402,15 @@ typedef struct StatsData
* directly successful transactions (they were successfully completed on
* the first try).
*
- * A failed transaction is defined as unsuccessfully retried transactions.
- * It can be one of two types:
- *
- * failed (the number of failed transactions) =
+ * 'failed' (the number of failed transactions) =
* 'serialization_failures' (they got a serialization error and were not
- * successfully retried) +
+ * successfully retried) +
* 'deadlock_failures' (they got a deadlock error and were not
- * successfully retried).
+ * successfully retried) +
+ * 'other_sql_failures' (they failed on the first try or after retries
+ * due to a SQL error other than serialization or
+ * deadlock; they are counted as a failed transaction
+ * only when --continue-on-error is specified).
*
* If the transaction was retried after a serialization or a deadlock
* error this does not guarantee that this retry was successful. Thus
@@ -421,7 +422,7 @@ typedef struct StatsData
*
* 'retried' (number of all retried transactions) =
* successfully retried transactions +
- * failed transactions.
+ * unsuccessfully retried transactions.
*----------
*/
int64 cnt; /* number of successful transactions, not
@@ -440,6 +441,11 @@ typedef struct StatsData
int64 deadlock_failures; /* number of transactions that were not
* successfully retried after a deadlock
* error */
+ int64 other_sql_failures; /* number of failed transactions for
+ * reasons other than
+ * serialization/deadlock failure, which
+ * is counted if --continue-on-error is
+ * specified */
SimpleStats latency;
SimpleStats lag;
} StatsData;
@@ -457,6 +463,7 @@ typedef enum EStatus
{
ESTATUS_NO_ERROR = 0,
ESTATUS_META_COMMAND_ERROR,
+ ESTATUS_CONN_ERROR,
/* SQL errors */
ESTATUS_SERIALIZATION_ERROR,
@@ -770,6 +777,7 @@ static int64 total_weight = 0;
static bool verbose_errors = false; /* print verbose messages of all errors */
static bool exit_on_abort = false; /* exit when any client is aborted */
+static bool continue_on_error = false; /* continue after errors */
/* Builtin test scripts */
typedef struct BuiltinScript
@@ -949,6 +957,7 @@ usage(void)
" -T, --time=NUM duration of benchmark test in seconds\n"
" -v, --vacuum-all vacuum all four standard tables before tests\n"
" --aggregate-interval=NUM aggregate data over NUM seconds\n"
+ " --continue-on-error continue running after an SQL error\n"
" --exit-on-abort exit when any client is aborted\n"
" --failures-detailed report the failures grouped by basic types\n"
" --log-prefix=PREFIX prefix for transaction time log file\n"
@@ -1467,6 +1476,7 @@ initStats(StatsData *sd, pg_time_usec_t start)
sd->retried = 0;
sd->serialization_failures = 0;
sd->deadlock_failures = 0;
+ sd->other_sql_failures = 0;
initSimpleStats(&sd->latency);
initSimpleStats(&sd->lag);
}
@@ -1516,6 +1526,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -3231,11 +3244,43 @@ sendCommand(CState *st, Command *command)
}
/*
- * Get the error status from the error code.
+ * Read and discard all available results from the connection.
+ */
+static void
+discardAvailableResults(CState *st)
+{
+ PGresult *res = NULL;
+
+ for (;;)
+ {
+ res = PQgetResult(st->con);
+
+ /*
+ * Read and discard results until PQgetResult() returns NULL (no more
+ * results) or a connection failure is detected. If the pipeline
+ * status is PQ_PIPELINE_ABORTED, more results may still be available
+ * even after PQgetResult() returns NULL, so continue reading in that
+ * case.
+ */
+ if ((res == NULL && PQpipelineStatus(st->con) != PQ_PIPELINE_ABORTED) ||
+ PQstatus(st->con) == CONNECTION_BAD)
+ break;
+
+ PQclear(res);
+ }
+ PQclear(res);
+}
+
+/*
+ * Determine the error status based on the connection status and error code.
*/
static EStatus
-getSQLErrorStatus(const char *sqlState)
+getSQLErrorStatus(CState *st, const char *sqlState)
{
+ discardAvailableResults(st);
+ if (PQstatus(st->con) == CONNECTION_BAD)
+ return ESTATUS_CONN_ERROR;
+
if (sqlState != NULL)
{
if (strcmp(sqlState, ERRCODE_T_R_SERIALIZATION_FAILURE) == 0)
@@ -3257,6 +3302,17 @@ canRetryError(EStatus estatus)
estatus == ESTATUS_DEADLOCK_ERROR);
}
+/*
+ * Returns true if --continue-on-error is specified and this error allows
+ * processing to continue.
+ */
+static bool
+canContinueOnError(EStatus estatus)
+{
+ return (continue_on_error &&
+ estatus == ESTATUS_OTHER_SQL_ERROR);
+}
+
/*
* Process query response from the backend.
*
@@ -3375,9 +3431,9 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
- st->estatus = getSQLErrorStatus(PQresultErrorField(res,
- PG_DIAG_SQLSTATE));
- if (canRetryError(st->estatus))
+ st->estatus = getSQLErrorStatus(st, PQresultErrorField(res,
+ PG_DIAG_SQLSTATE));
+ if (canRetryError(st->estatus) || canContinueOnError(st->estatus))
{
if (verbose_errors)
commandError(st, PQresultErrorMessage(res));
@@ -3409,11 +3465,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
error:
PQclear(res);
PQclear(next_res);
- do
- {
- res = PQgetResult(st->con);
- PQclear(res);
- } while (res);
+ discardAvailableResults(st);
return false;
}
@@ -4041,7 +4093,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
st->state = CSTATE_END_COMMAND;
}
- else if (canRetryError(st->estatus))
+ else if (canRetryError(st->estatus) || canContinueOnError(st->estatus))
st->state = CSTATE_ERROR;
else
st->state = CSTATE_ABORTED;
@@ -4562,7 +4614,8 @@ static int64
getFailures(const StatsData *stats)
{
return (stats->serialization_failures +
- stats->deadlock_failures);
+ stats->deadlock_failures +
+ stats->other_sql_failures);
}
/*
@@ -4582,6 +4635,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "other";
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
@@ -4637,6 +4692,7 @@ doLog(TState *thread, CState *st,
int64 skipped = 0;
int64 serialization_failures = 0;
int64 deadlock_failures = 0;
+ int64 other_sql_failures = 0;
int64 retried = 0;
int64 retries = 0;
@@ -4677,10 +4733,12 @@ doLog(TState *thread, CState *st,
{
serialization_failures = agg->serialization_failures;
deadlock_failures = agg->deadlock_failures;
+ other_sql_failures = agg->other_sql_failures;
}
- fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT,
+ fprintf(logfile, " " INT64_FORMAT " " INT64_FORMAT " " INT64_FORMAT,
serialization_failures,
- deadlock_failures);
+ deadlock_failures,
+ other_sql_failures);
fputc('\n', logfile);
@@ -6319,6 +6377,7 @@ printProgressReport(TState *threads, int64 test_start, pg_time_usec_t now,
cur.serialization_failures +=
threads[i].stats.serialization_failures;
cur.deadlock_failures += threads[i].stats.deadlock_failures;
+ cur.other_sql_failures += threads[i].stats.other_sql_failures;
}
/* we count only actually executed transactions */
@@ -6461,7 +6520,8 @@ printResults(StatsData *total,
/*
* Remaining stats are nonsensical if we failed to execute any xacts due
- * to others than serialization or deadlock errors
+ * to other than serialization or deadlock errors and --continue-on-error
+ * is not set.
*/
if (total_cnt <= 0)
return;
@@ -6477,6 +6537,9 @@ printResults(StatsData *total,
printf("number of deadlock failures: " INT64_FORMAT " (%.3f%%)\n",
total->deadlock_failures,
100.0 * total->deadlock_failures / total_cnt);
+ printf("number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ total->other_sql_failures,
+ 100.0 * total->other_sql_failures / total_cnt);
}
/* it can be non-zero only if max_tries is not equal to one */
@@ -6580,6 +6643,10 @@ printResults(StatsData *total,
sstats->deadlock_failures,
(100.0 * sstats->deadlock_failures /
script_total_cnt));
+ printf(" - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
+ sstats->other_sql_failures,
+ (100.0 * sstats->other_sql_failures /
+ script_total_cnt));
}
/*
@@ -6739,6 +6806,7 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"continue-on-error", no_argument, NULL, 18},
{NULL, 0, NULL, 0}
};
@@ -7092,6 +7160,10 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* continue-on-error */
+ benchmarking_option_set = true;
+ continue_on_error = true;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7447,6 +7519,7 @@ main(int argc, char **argv)
stats.retried += thread->stats.retried;
stats.serialization_failures += thread->stats.serialization_failures;
stats.deadlock_failures += thread->stats.deadlock_failures;
+ stats.other_sql_failures += thread->stats.other_sql_failures;
latency_late += thread->latency_late;
conn_total_duration += thread->conn_duration;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index f820e88abe4..581e9af7907 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1835,6 +1835,28 @@ $node->pgbench(
# Clean up
$node->safe_psql('postgres', 'DROP TABLE counter;');
+# Test --continue-on-error
+$node->safe_psql('postgres',
+ 'CREATE TABLE unique_table(i int unique);');
+
+$node->pgbench(
+ '-n -t 10 --continue-on-error --failures-detailed',
+ 0,
+ [
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
+ ],
+ [],
+ 'test --continue-on-error',
+ {
+ '001_continue_on_error' => q{
+ INSERT INTO unique_table VALUES(0);
+ }
+ });
+
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE unique_table;');
+
# done
$node->safe_psql('postgres', 'DROP TABLESPACE regress_pgbench_tap_1_ts');
$node->stop;
--
2.51.2
On Nov 5, 2025, at 23:12, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Oct 29, 2025 at 1:00 AM Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Oct 27, 2025 at 6:13 PM Fujii Masao <masao.fujii@gmail.com> wrote:
One approach to address this issue is to keep calling PQgetResult() until
it returns NULL, and then check the connection status when getSQLErrorStatus()
determines the error state. If the connection status is CONNECTION_BAD
at that point, we can treat it as a connection failure and stop processing
even when --continue-on-error is specified. Attached is a WIP patch
implementing this idea based on the v17 patch. It still needs more testing,
review, and possibly documentation updates.
Another option would be to explicitly list all SQLSTATE codes (e.g., 57P01)
that should prevent continued processing, even with --continue-on-error,
inside getSQLErrorStatus(). However, maintaining such a list would be
cumbersome, so I believe the first approach is preferable. Thoughts?
Nagata-san let me know off-list that there was a case where the previous
patch didn't work correctly in pipeline mode. I've updated the patch so that
--continue-on-error now works properly in that mode, and also revised
the commit message. Updated patch attached.
In the v19 patch, the description of --continue-on-error was placed right after
--verbose-errors in the docs. Since pgbench long option descriptions are listed
in alphabetical order, I've moved it to follow --aggregate-interval instead.
I've also refined the wording of the --continue-on-error description.
Attached is the updated patch. Unless there are any objections, I will
commit it.
Regards,
--
Fujii Masao
<v20-0001-pgbench-Add-continue-on-error-option.patch>
I just eyeball reviewed v20 and got a doubt:
```
+static void
+discardAvailableResults(CState *st)
+{
+ PGresult *res = NULL;
+
+ for (;;)
+ {
+ res = PQgetResult(st->con);
+
+ /*
+ * Read and discard results until PQgetResult() returns NULL (no more
+ * results) or a connection failure is detected. If the pipeline
+ * status is PQ_PIPELINE_ABORTED, more results may still be available
+ * even after PQgetResult() returns NULL, so continue reading in that
+ * case.
+ */
+ if ((res == NULL && PQpipelineStatus(st->con) != PQ_PIPELINE_ABORTED) ||
+ PQstatus(st->con) == CONNECTION_BAD)
+ break;
+
+ PQclear(res);
+ }
+ PQclear(res);
+}
```
If the pipeline is aborted and there are no more results, the “if” becomes “true && false”. And in that case, I guess PQstatus(st->con) != CONNECTION_BAD because it’s not a connection error, so overall the “if” evaluates to “false”, and it falls into an infinite loop.
Except for that, everything else looks good to me.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Nov 6, 2025 at 8:38 AM Chao Li <li.evan.chao@gmail.com> wrote:
I just eyeball reviewed v20 and got a doubt:
Thanks for the review!
```
+static void
+discardAvailableResults(CState *st)
+{
+ PGresult *res = NULL;
+
+ for (;;)
+ {
+ res = PQgetResult(st->con);
+
+ /*
+ * Read and discard results until PQgetResult() returns NULL (no more
+ * results) or a connection failure is detected. If the pipeline
+ * status is PQ_PIPELINE_ABORTED, more results may still be available
+ * even after PQgetResult() returns NULL, so continue reading in that
+ * case.
+ */
+ if ((res == NULL && PQpipelineStatus(st->con) != PQ_PIPELINE_ABORTED) ||
+ PQstatus(st->con) == CONNECTION_BAD)
+ break;
+
+ PQclear(res);
+ }
+ PQclear(res);
+}
```
If the pipeline is aborted and there are no more results, the “if” becomes “true && false”. And in that case, I guess PQstatus(st->con) != CONNECTION_BAD because it’s not a connection error, so overall the “if” evaluates to “false”, and it falls into an infinite loop.
Can this situation actually happen? It would be helpful if you could share
the custom script that triggers it.
When the pipeline is aborted, PGRES_PIPELINE_SYNC should arrive afterward,
changing the status from PQ_PIPELINE_ABORTED to PQ_PIPELINE_ON. That should
make the condition true and prevent an infinite loop, right?
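For reference, here is a minimal standalone libpq sketch (not part of the patch; it assumes a server reachable through the usual PGHOST/PGDATABASE environment variables) that drives an aborted pipeline through the same exit condition and prints each status:
```
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK || PQenterPipelineMode(conn) != 1)
		return 1;

	/* Queue a failing statement and a victim statement, then a sync. */
	PQsendQueryParams(conn, "select 1/0", 0, NULL, NULL, NULL, NULL, 0);
	PQsendQueryParams(conn, "select 1", 0, NULL, NULL, NULL, NULL, 0);
	PQpipelineSync(conn);

	for (;;)
	{
		res = PQgetResult(conn);

		printf("result=%s pipeline=%d\n",
			   res ? PQresStatus(PQresultStatus(res)) : "NULL",
			   (int) PQpipelineStatus(conn));

		/* The same exit condition as discardAvailableResults() */
		if ((res == NULL && PQpipelineStatus(conn) != PQ_PIPELINE_ABORTED) ||
			PQstatus(conn) == CONNECTION_BAD)
			break;

		PQclear(res);
	}
	PQclear(res);
	PQfinish(conn);
	return 0;
}
```
Once the PGRES_PIPELINE_SYNC result is consumed, PQpipelineStatus() goes back to PQ_PIPELINE_ON, so the next NULL makes the condition true and the loop ends.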
Regards,
--
Fujii Masao
On Nov 7, 2025, at 00:38, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Nov 6, 2025 at 8:38 AM Chao Li <li.evan.chao@gmail.com> wrote:
I just eyeball reviewed v20 and got a doubt:
Thanks for the review!
```
+static void
+discardAvailableResults(CState *st)
+{
+ PGresult *res = NULL;
+
+ for (;;)
+ {
+ res = PQgetResult(st->con);
+
+ /*
+ * Read and discard results until PQgetResult() returns NULL (no more
+ * results) or a connection failure is detected. If the pipeline
+ * status is PQ_PIPELINE_ABORTED, more results may still be available
+ * even after PQgetResult() returns NULL, so continue reading in that
+ * case.
+ */
+ if ((res == NULL && PQpipelineStatus(st->con) != PQ_PIPELINE_ABORTED) ||
+ PQstatus(st->con) == CONNECTION_BAD)
+ break;
+
+ PQclear(res);
+ }
+ PQclear(res);
+}
```
If the pipeline is aborted and there are no more results, the “if” becomes “true && false”. And in that case, I guess PQstatus(st->con) != CONNECTION_BAD because it’s not a connection error, so overall the “if” evaluates to “false”, and it falls into an infinite loop.
Can this situation actually happen? It would be helpful if you could share
the custom script that triggers it.
No, I don’t have such a script. I am on vacation, traveling with my family this week; I only found a little time to work today, which is why I did just an eyeball review.
When the pipeline is aborted, PGRES_PIPELINE_SYNC should arrive afterward,
changing the status from PQ_PIPELINE_ABORTED to PQ_PIPELINE_ON. That should
make the condition true and prevent an infinite loop, right?
If you put this explanation into the inline comment, things would be clearer. But based on this explanation, I have another doubt. When a pipeline is aborted, res is NULL, but we still stay in the for loop; PQclear(res) does nothing then, so the “for” loop is effectively an empty loop. Would that lead to high CPU usage? From this perspective, while waiting for PIPELINE_SYNC after the pipeline is aborted, adding a tiny sleep might be better.
I will be back at work next Monday; then I will try to run a test and reproduce the pipeline-abort scenario.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Fri, Nov 7, 2025 at 9:07 AM Chao Li <li.evan.chao@gmail.com> wrote:
If you put this explanation into the inline comment, things would be clearer. But based on this explanation, I have another doubt. When a pipeline is aborted, res is NULL, but we still stay in the for loop; PQclear(res) does nothing then, so the “for” loop is effectively an empty loop. Would that lead to high CPU usage? From this perspective, while waiting for PIPELINE_SYNC after the pipeline is aborted, adding a tiny sleep might be better.
You're concerned about cases where the server response is delayed,
causing the pipeline status to take time to reach PIPELINE_SYNC, right?
In that situation, since discardAvailableResults() waits on PQgetResult(),
it shouldn't enter a busy loop, correct?
I will be back at work next Monday; then I will try to run a test and reproduce the pipeline-abort scenario.
I plan to commit the patch soon, but let's keep discussing and
investigating the case you mentioned afterward!
Regards,
--
Fujii Masao
On Nov 7, 2025, at 17:33, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Nov 7, 2025 at 9:07 AM Chao Li <li.evan.chao@gmail.com> wrote:
If you put this explanation into the inline comment, things would be clearer. But based on this explanation, I have another doubt. When a pipeline is aborted, res is NULL, but we still stay in the for loop; PQclear(res) does nothing then, so the “for” loop is effectively an empty loop. Would that lead to high CPU usage? From this perspective, while waiting for PIPELINE_SYNC after the pipeline is aborted, adding a tiny sleep might be better.
You're concerned about cases where the server response is delayed,
causing the pipeline status to take time to reach PIPELINE_SYNC, right?
In that situation, since discardAvailableResults() waits on PQgetResult(),
it shouldn't enter a busy loop, correct?
I will be back at work next Monday; then I will try to run a test and reproduce the pipeline-abort scenario.
I plan to commit the patch soon, but let's keep discussing and
investigating the case you mentioned afterward!
I just did a test. In the test, I inserted a tuple with the same primary key so that the insert fails due to the unique key constraint, which breaks the pipeline, with some random select statements following. I added some debug messages in discardAvailableResults(), which showed me that the function discards the rest of the statements’ results until \endpipeline. As there is anyway only a limited number of statements before \endpipeline, my concern is actually not valid. So I am now good with this patch.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Mon, Nov 10, 2025 at 11:07 AM Chao Li <li.evan.chao@gmail.com> wrote:
I just did a test. In the test, I inserted a tuple with the same primary key so that the insert fails due to the unique key constraint, which breaks the pipeline, with some random select statements following. I added some debug messages in discardAvailableResults(), which showed me that the function discards the rest of the statements’ results until \endpipeline. As there is anyway only a limited number of statements before \endpipeline, my concern is actually not valid. So I am now good with this patch.
Thanks a lot for testing!
Regards,
--
Fujii Masao
On Fri, 7 Nov 2025 18:33:17 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
I plan to commit the patch soon, but let's keep discussing and
investigating the case you mentioned afterward!
I'm sorry for the late reply and for not joining the discussion earlier.
I've spent some time investigating the code in pgbench and libpq, and
it seems to me that your commit looks fine.
However, I found another issue related to the --continue-on-error option,
where an assertion failure occurs in the following test case:
$ cat pgbench_error.sql
\startpipeline
select 1/0;
\syncpipeline
select 1;
\syncpipeline
select 1;
\syncpipeline
select 1;
\endpipeline
$ pgbench -f pgbench_error.sql -M extended --continue-on-error -T 1
pgbench (19devel)
starting vacuum...end.
pgbench: pgbench.c:3594: discardUntilSync: Assertion `res == ((void *)0)' failed.
Even after removing the Assert(), we get the following error:
pgbench: error: client 0 aborted: failed to exit pipeline mode for rolling back the failed transaction
This happens because discardUntilSync() does not expect that a PGRES_TUPLES_OK may be
received after \syncpipeline, and also fails to discard all PGRES_PIPELINE_SYNC results
when multiple \syncpipeline commands are used.
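Roughly, the result stream that discardUntilSync() sees for the script above, once the division-by-zero error itself has been consumed, looks like this (an illustrative sequence, statuses only):
```
/*
 * PGRES_PIPELINE_SYNC    first \syncpipeline: received_sync = true
 * PGRES_TUPLES_OK        "select 1": neither SYNC nor NULL, so the old
 *                        Assert(res == NULL) fires here
 * PGRES_PIPELINE_SYNC    second \syncpipeline
 * ...                    more TUPLES_OK/SYNC pairs, then the sync sent
 *                        by discardUntilSync() itself, then NULL
 */
```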
I've attached a patch to fix this.
If a PGRES_PIPELINE_SYNC is followed by something other than PGRES_PIPELINE_SYNC or NULL,
it means that another PGRES_PIPELINE_SYNC will eventually follow after some other results.
In this case, we should reset the receive_sync flag and continue discarding results.
I think this fix should be back-patched, since this is not a bug introduced by
--continue-on-error. The same assertion failure occurs in the following test case,
where transactions are retried after a deadlock error:
$ cat deadlock.sql
\startpipeline
select * from a order by i for update;
select 1;
\syncpipeline
select 1;
\syncpipeline
select 1;
\syncpipeline
select 1;
\endpipeline
$ cat deadlock2.sql
\startpipeline
select * from a order by i desc for update;
select 1;
\syncpipeline
select 1;
\syncpipeline
select 1;
\syncpipeline
select 1;
\endpipeline
$ pgbench -f deadlock.sql -f deadlock2.sql -j 2 -c 2 -M extended
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
0001-Make-sure-discardUntilSync-discards-until-the-last-s.patchtext/x-diff; name=0001-Make-sure-discardUntilSync-discards-until-the-last-s.patchDownload
From 6a20315b9d25ddc9f77b96d2e8318d9853b105eb Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Tue, 11 Nov 2025 10:14:30 +0900
Subject: [PATCH] Make sure discardUntilSync() discards until the last sync
point
---
src/bin/pgbench/pgbench.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d8764ba6fe0..c31dd30672b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3563,14 +3563,14 @@ doRetry(CState *st, pg_time_usec_t *now)
}
/*
- * Read results and discard it until a sync point.
+ * Read and discard results until the last sync point.
*/
static int
discardUntilSync(CState *st)
{
bool received_sync = false;
- /* send a sync */
+ /* Send a sync since all PGRES_PIPELINE_SYNC results may already have been received. */
if (!PQpipelineSync(st->con))
{
pg_log_error("client %d aborted: failed to send a pipeline sync",
@@ -3588,10 +3588,15 @@ discardUntilSync(CState *st)
else if (received_sync)
{
/*
- * PGRES_PIPELINE_SYNC must be followed by another
- * PGRES_PIPELINE_SYNC or NULL; otherwise, assert failure.
+ * If a PGRES_PIPELINE_SYNC is followed by something other than
+ * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
+ * eventually follow.
*/
- Assert(res == NULL);
+ if (res)
+ {
+ received_sync = false;
+ continue;
+ }
/*
* Reset ongoing sync count to 0 since all PGRES_PIPELINE_SYNC
--
2.43.0
On Nov 11, 2025, at 09:50, Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Fri, 7 Nov 2025 18:33:17 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
I plan to commit the patch soon, but let's keep discussing and
investigating the case you mentioned afterward!
I'm sorry for the late reply and for not joining the discussion earlier.
I've spent some time investigating the code in pgbench and libpq, and
it seems to me that your commit looks fine.
However, I found another issue related to the --continue-on-error option,
where an assertion failure occurs in the following test case:
$ cat pgbench_error.sql
\startpipeline
select 1/0;
\syncpipeline
select 1;
\syncpipeline
select 1;
\syncpipeline
select 1;
\endpipeline
$ pgbench -f pgbench_error.sql -M extended --continue-on-error -T 1
pgbench (19devel)
starting vacuum...end.
pgbench: pgbench.c:3594: discardUntilSync: Assertion `res == ((void *)0)' failed.
Even after removing the Assert(), we get the following error:
pgbench: error: client 0 aborted: failed to exit pipeline mode for rolling back the failed transaction
This happens because discardUntilSync() does not expect that a PGRES_TUPLES_OK may be
received after \syncpipeline, and also fails to discard all PGRES_PIPELINE_SYNC results
when multiple \syncpipeline commands are used.
I've attached a patch to fix this.
If a PGRES_PIPELINE_SYNC is followed by something other than PGRES_PIPELINE_SYNC or NULL,
it means that another PGRES_PIPELINE_SYNC will eventually follow after some other results.
In this case, we should reset the receive_sync flag and continue discarding results.
I think this fix should be back-patched, since this is not a bug introduced by
--continue-on-error. The same assertion failure occurs in the following test case,
where transactions are retried after a deadlock error:
$ cat deadlock.sql
\startpipeline
select * from a order by i for update;
select 1;
\syncpipeline
select 1;
\syncpipeline
select 1;
\syncpipeline
select 1;
\endpipeline
$ cat deadlock2.sql
\startpipeline
select * from a order by i desc for update;
select 1;
\syncpipeline
select 1;
\syncpipeline
select 1;
\syncpipeline
select 1;
\endpipeline
$ pgbench -f deadlock.sql -f deadlock2.sql -j 2 -c 2 -M extended
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
<0001-Make-sure-discardUntilSync-discards-until-the-last-s.patch>
Hi Yugo-san,
I am also debugging the patch for the other purpose when I saw your email, so I tried to reproduce the problem with your script.
I think in master branch, we can simply fix the problem by calling discardAvailableResults(st) before discardUntilSync(st), like this:
```
/* Read and discard until a sync point in pipeline mode */
if (PQpipelineStatus(st->con) != PQ_PIPELINE_OFF)
{
discardAvailableResults(st); # <=== Add this line
if (!discardUntilSync(st))
{
st->state = CSTATE_ABORTED;
break;
}
}
```
But this is not good for back-patch.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Nov 10, 2025, at 12:45, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Nov 10, 2025 at 11:07 AM Chao Li <li.evan.chao@gmail.com> wrote:
I just did a test. In the test, I inserted a tuple with the same primary key so that the insert fails due to the unique key constraint, which breaks the pipeline, with some random select statements following. I added some debug messages in discardAvailableResults(), which showed me that the function discards the rest of the statements’ results until \endpipeline. As there is anyway only a limited number of statements before \endpipeline, my concern is actually not valid. So I am now good with this patch.
Thanks a lot for testing!
Hi Fujii-san,
I just did more tests in both pipeline mode and non-pipeline mode. I think the main purpose of discardAvailableResults() is to drain results in pipeline mode. In non-pipeline mode, a NULL res indicates there are no more results to read, while in pipeline mode, when a pipeline is aborted, either a valid result or NULL can still be returned, so we need to wait until the pipeline state switches back to PQ_PIPELINE_ON. From this perspective, the current inline comment is correct, but I feel it’s not clear enough.
So I am proposing the function comment and inline comment like the following:
```
/*
* Read and discard all available results from the connection.
*
* Non-pipeline mode:
* ------------------
* PQgetResult() returns each PGresult in order for the last command sent.
* When it returns NULL, that definitively means there are no more results
* for that command. We stop on NULL (or on CONNECTION_BAD).
*
* Pipeline mode:
* --------------
* If an earlier command in the pipeline errors, libpq enters the
* PQ_PIPELINE_ABORTED state. In this state, PQgetResult() may return
* either a valid PGresult or NULL, and a NULL return does NOT mean
* that the connection is drained. More results for later commands (or
* protocol housekeeping such as the pipeline sync result) can still
* arrive afterward. Therefore we must continue calling PQgetResult()
* while PQpipelineStatus(conn) == PQ_PIPELINE_ABORTED, even if we see
* intermittent NULLs.
*/
static void
discardAvailableResults(CState *st)
{
PGresult *res = NULL;
for (;;)
{
res = PQgetResult(st->con);
/*
* Stop when there are no more results *and* the pipeline is not
* in the aborted state, or if the connection has failed.
*/
if ((res == NULL && PQpipelineStatus(st->con) != PQ_PIPELINE_ABORTED) ||
PQstatus(st->con) == CONNECTION_BAD)
break;
PQclear(res);
}
PQclear(res);
}
```
What do you think?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Tue, Nov 11, 2025 at 10:50 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I've attached a patch to fix this.
Thanks for reporting the issue and providing the patch!
If a PGRES_PIPELINE_SYNC is followed by something other than PGRES_PIPELINE_SYNC or NULL,
it means that another PGRES_PIPELINE_SYNC will eventually follow after some other results.
In this case, we should reset the receive_sync flag and continue discarding results.
Yes.
+ if (res)
+ {
+ received_sync = false;
+ continue;
Shouldn't we also call PQclear(res) here? For example:
---------------------------
if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
received_sync = true;
- else if (received_sync)
+ else if (received_sync && res == NULL)
{
- /*
- * PGRES_PIPELINE_SYNC must be followed by another
- * PGRES_PIPELINE_SYNC or NULL; otherwise,
assert failure.
- */
- Assert(res == NULL);
-
/*
* Reset ongoing sync count to 0 since all
PGRES_PIPELINE_SYNC
* results have been discarded.
@@ -3601,6 +3595,8 @@ discardUntilSync(CState *st)
PQclear(res);
break;
}
+ else
+ received_sync = false;
PQclear(res);
---------------------------
Regards,
--
Fujii Masao
On Tue, Nov 11, 2025 at 11:41 AM Chao Li <li.evan.chao@gmail.com> wrote:
I was also debugging the patch for another purpose when I saw your email, so I tried to reproduce the problem with your script.
I think in master branch, we can simply fix the problem by calling discardAvailableResults(st) before discardUntilSync(st), like this:
This change doesn't seem to fix the issue. If the custom script includes
many \syncpipeline commands, the assertion failure can still occur. No?
Regards,
--
Fujii Masao
On Tue, Nov 11, 2025 at 11:49 AM Chao Li <li.evan.chao@gmail.com> wrote:
I just did more tests in both pipeline mode and non-pipeline mode. I think the main purpose of discardAvailableResults() is to drain results in pipeline mode. In non-pipeline mode, a NULL res indicates there are no more results to read, while in pipeline mode, when a pipeline is aborted, either a valid result or NULL can still be returned, so we need to wait until the pipeline state switches back to PQ_PIPELINE_ON. From this perspective, the current inline comment is correct, but I feel it’s not clear enough.
Thanks for working on this!
After reconsidering, I think the main goal here is to determine whether
the error causes a connection failure after it occurs.
If we can read and discard results without PQstatus() becoming CONNECTION_BAD
either until the end (in non-pipeline mode) or until the first sync point
after an error (in pipeline mode), that means the connection is still alive,
and processing can continue when --continue-on-error is specified.
The current function comments don’t mention this purpose enough,
so it seems they should be updated to clarify that.
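For example, something along these lines (just a sketch, not final wording):
```
/*
 * Read and discard results after an error until we can tell whether the
 * error also broke the connection: up to the end of the results in
 * non-pipeline mode, or up to the first sync point following the error
 * in pipeline mode. If PQstatus() never becomes CONNECTION_BAD along the
 * way, the connection is still usable and the client can keep processing
 * when --continue-on-error is specified.
 */
```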
Regards,
--
Fujii Masao
On Wed, 12 Nov 2025 00:22:38 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Nov 11, 2025 at 11:41 AM Chao Li <li.evan.chao@gmail.com> wrote:
I was also debugging the patch for another purpose when I saw your email, so I tried to reproduce the problem with your script.
I think in master branch, we can simply fix the problem by calling discardAvailableResults(st) before discardUntilSync(st), like this:
This change doesn't seem to fix the issue. If the custom script includes
many \syncpipeline commands, the assertion failure can still occur. No?
Yes. In pipeline mode without a connection failure, discardAvailableResults()
does not discard all syncs; it only discards up to the NULL following the first sync.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Wed, 12 Nov 2025 00:20:15 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Nov 11, 2025 at 10:50 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I've attached a patch to fix this.
Thanks for reporting the issue and providing the patch!
If a PGRES_PIPELINE_SYNC is followed by something other than PGRES_PIPELINE_SYNC or NULL,
it means that another PGRES_PIPELINE_SYNC will eventually follow after some other results.
In this case, we should reset the receive_sync flag and continue discarding results.
Yes.
+ if (res)
+ {
+ received_sync = false;
+ continue;
Shouldn't we also call PQclear(res) here? For example:
Thank you for your review!
Yes, we need PQclear() here.
I've attached an updated patch.
The comment for the PQpipelineSync() call has also been updated to clarify
why it is necessary.
In addition, I added a connection status check in the loop to avoid an
infinite loop waiting for PGRES_PIPELINE_SYNC after a connection failure.
I packed these changes in the same patch, but they can be split into separate
patches.
What do you think?
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v2-0001-pgbench-Fix-assertion-failure-with-multiple-syncp.patchtext/x-diff; name=v2-0001-pgbench-Fix-assertion-failure-with-multiple-syncp.patchDownload
From 8170406337210755442eedcf6b253031e22297e1 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Tue, 11 Nov 2025 10:14:30 +0900
Subject: [PATCH v2] pgbench: Fix assertion failure with multiple \syncpipeline
in pipeline mode.
When running pgbench with a custom script that triggered retriable errors
(e.g., deadlock errors) followed by multiple \syncpipeline commands in
pipeline mode, an assertion failure could occur:
pgbench.c:3594: discardUntilSync: Assertion `res == ((void *)0)' failed
This happened because discardUntilSync() did not expect that a result
other than NULL (e.g. PGRES_TUPLES_OK) might be received after \syncpipeline.
This commit fixes the assertion failure by resetting the receive_sync flag
and continuing to discard results to ensure that all results are discarded
until the last sync point.
Also, if the connection was unexpectedly closed, this function could get
stuck in an infinite loop waiting for PGRES_PIPELINE_SYNC, which would never
be received. To fix this, exit the loop immediately if a connection failure
is detected.
---
src/bin/pgbench/pgbench.c | 35 +++++++++++++++++++++++++----------
1 file changed, 25 insertions(+), 10 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d8764ba6fe0..f165fabce36 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3563,14 +3563,18 @@ doRetry(CState *st, pg_time_usec_t *now)
}
/*
- * Read results and discard it until a sync point.
+ * Read and discard results until the last sync point.
*/
static int
discardUntilSync(CState *st)
{
bool received_sync = false;
- /* send a sync */
+ /*
+ * Send a sync to ensure at least one PGRES_PIPELINE_SYNC is received
+ * and to avoid an infinite loop, since all earlier ones may have
+ * already been received.
+ */
if (!PQpipelineSync(st->con))
{
pg_log_error("client %d aborted: failed to send a pipeline sync",
@@ -3578,21 +3582,15 @@ discardUntilSync(CState *st)
return 0;
}
- /* receive PGRES_PIPELINE_SYNC and null following it */
+ /* receive the last PGRES_PIPELINE_SYNC and null following it */
for (;;)
{
PGresult *res = PQgetResult(st->con);
if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
received_sync = true;
- else if (received_sync)
+ else if (received_sync && res == NULL)
{
- /*
- * PGRES_PIPELINE_SYNC must be followed by another
- * PGRES_PIPELINE_SYNC or NULL; otherwise, assert failure.
- */
- Assert(res == NULL);
-
/*
* Reset ongoing sync count to 0 since all PGRES_PIPELINE_SYNC
* results have been discarded.
@@ -3601,6 +3599,23 @@ discardUntilSync(CState *st)
PQclear(res);
break;
}
+ else
+ {
+ if (PQstatus(st->con) == CONNECTION_BAD)
+ {
+ pg_log_error("client %d aborted: the backend died while rolling back the failed transaction after",
+ st->id);
+ PQclear(res);
+ return 0;
+ }
+
+ /*
+ * If a PGRES_PIPELINE_SYNC is followed by something other than
+ * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
+ * eventually follow.
+ */
+ received_sync = false;
+ }
PQclear(res);
}
--
2.43.0
On Nov 12, 2025, at 17:34, Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Wed, 12 Nov 2025 00:20:15 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Nov 11, 2025 at 10:50 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I've attached a patch to fix this.
Thanks for reporting the issue and providing the patch!
If a PGRES_PIPELINE_SYNC is followed by something other than PGRES_PIPELINE_SYNC or NULL,
it means that another PGRES_PIPELINE_SYNC will eventually follow after some other results.
In this case, we should reset the receive_sync flag and continue discarding results.
Yes.
+ if (res)
+ {
+ received_sync = false;
+ continue;
Shouldn't we also call PQclear(res) here? For example:
Thank you for your review!
Yes, we need PQclear() here.
I've attached an updated patch.
The comment for the PQpipelineSync() call has also been updated to clarify
why it is necessary.
In addition, I added a connection status check in the loop to avoid an
infinite loop waiting for PGRES_PIPELINE_SYNC after a connection failure.
I packed these changes in the same patch, but they can be split into separate
patches.
What do you think?
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
<v2-0001-pgbench-Fix-assertion-failure-with-multiple-syncp.patch>
I debugged further this morning, and I think I have found the root cause. Ultimately, the problem is not with discardUntilSync(), instead, discardAvailableResults() mistakenly eats PGRES_PIPELINE_SYNC.
In my debugging, I slightly updated Yugo's script so that every select returns a different value:
```
% cat pgbench_error.sql
\startpipeline
select 1/0;
\syncpipeline
select 2;
\syncpipeline
select 3;
\syncpipeline
select 4;
\endpipeline
```
Please see my dirty fix in the attachment. The diff is based on master + Yugo's v2 patch.
In my fix, I make discardAvailableResults() return the PGRES_PIPELINE_SYNC it reads, and I moved discardAvailableResults() out of getSQLErrorStatus(), so that if discardAvailableResults() returns a result, readCommandResponse() uses it as next_res to continue the reading loop.
Here is my execution output:
```
% pgbench -n --failures-detailed --continue-on-error -M extended -t 5 -f pgbench_error.sql evantest
pgbench (19devel)
EVAN: readCommandResponse: Got result: res=7, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: discardAvailableResults: Got result: res=10, conn=0
EVAN: discardAvailableResults: Got sync, returning, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 2, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 3, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 4, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=7, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: discardAvailableResults: Got result: res=10, conn=0
EVAN: discardAvailableResults: Got sync, returning, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 2, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 3, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 4, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=7, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: discardAvailableResults: Got result: res=10, conn=0
EVAN: discardAvailableResults: Got sync, returning, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 2, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 3, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 4, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=7, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: discardAvailableResults: Got result: res=10, conn=0
EVAN: discardAvailableResults: Got sync, returning, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 2, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 3, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 4, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=7, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: discardAvailableResults: Got result: res=10, conn=0
EVAN: discardAvailableResults: Got sync, returning, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 2, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 3, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=2, conn=0
EVAN: readCommandResponse2: Got next-result value: 4, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
EVAN: readCommandResponse: Got result: res=10, conn=0
EVAN: readCommandResponse2: Got result: next_res=7, conn=0
EVAN: readCommandResponse: completed, conn=0
transaction type: pgbench_error.sql
scaling factor: 1
query mode: extended
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 5
number of transactions actually processed: 5/5
number of failed transactions: 0 (0.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other failures: 0 (0.000%)
latency average = 0.265 ms
initial connection time = 2.092 ms
tps = 3773.584906 (without initial connection time)
```
You can see that select 2/3/4 are properly handled.
Yugo-san, if you add some debug logging, you will see that with your patch, 2 and 3 will be discarded by discardUntilSync(), so I don't think your patch works.
To apply my dirty diff:
* git checkout master
* git am Yugo’s v2 patch
* git apply dirty-fix.diff
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Attachments:
dirty-fix.diffapplication/octet-stream; name=dirty-fix.diff; x-unix-mode=0644Download
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bdf21c319c..663fa7fafbb 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -1855,7 +1855,7 @@ asyncQueueReadAllNotifications(void)
Snapshot snapshot;
/* page_buffer must be adequately aligned */
- alignas(AsyncQueueEntry) char page_buffer[QUEUE_PAGESIZE];
+ alignas(alignof(AsyncQueueEntry)) char page_buffer[QUEUE_PAGESIZE];
/* Fetch current state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f165fabce36..8e8564ae8ae 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3246,7 +3246,7 @@ sendCommand(CState *st, Command *command)
/*
* Read and discard all available results from the connection.
*/
-static void
+static PGresult *
discardAvailableResults(CState *st)
{
PGresult *res = NULL;
@@ -3255,6 +3255,21 @@ discardAvailableResults(CState *st)
{
res = PQgetResult(st->con);
+ printf("EVAN: discardAvailableResults: Got result: res=%d, conn=%d\n",
+ PQresultStatus(res), PQstatus(st->con));
+ if (PQresultStatus(res) == PGRES_TUPLES_OK)
+ {
+ char *val = PQgetvalue(res, 0, 0);
+ printf("EVAN: discardAvailableResults: Got result value: %s, conn=%d\n",
+ val, PQstatus(st->con));
+ }
+ if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
+ {
+ printf("EVAN: discardAvailableResults: Got sync, returning, conn=%d\n",
+ PQstatus(st->con));
+ return res;
+ }
+
/*
* Read and discard results until PQgetResult() returns NULL (no more
* results) or a connection failure is detected. If the pipeline
@@ -3264,11 +3279,16 @@ discardAvailableResults(CState *st)
*/
if ((res == NULL && PQpipelineStatus(st->con) != PQ_PIPELINE_ABORTED) ||
PQstatus(st->con) == CONNECTION_BAD)
+ {
+ printf("EVAN: discardAvailableResults: breaking loop, conn=%d\n",
+ PQstatus(st->con));
break;
+ }
PQclear(res);
}
PQclear(res);
+ return NULL;
}
/*
@@ -3277,7 +3297,10 @@ discardAvailableResults(CState *st)
static EStatus
getSQLErrorStatus(CState *st, const char *sqlState)
{
- discardAvailableResults(st);
+ //discardAvailableResults(st);
+ if (st->estatus == ESTATUS_NO_ERROR)
+ return ESTATUS_NO_ERROR;
+
if (PQstatus(st->con) == CONNECTION_BAD)
return ESTATUS_CONN_ERROR;
@@ -3338,13 +3361,28 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
((meta == META_GSET || meta == META_ASET) && varprefix != NULL));
res = PQgetResult(st->con);
-
+ printf("EVAN: readCommandResponse: Got result: res=%d, conn=%d\n",
+ PQresultStatus(res), PQstatus(st->con));
+ if (PQresultStatus(res) == PGRES_TUPLES_OK)
+ {
+ char *val = PQgetvalue(res, 0, 0);
+ printf("EVAN: readCommandResponse: Got result value: %s, conn=%d\n",
+ val, PQstatus(st->con));
+ }
while (res != NULL)
{
bool is_last;
/* peek at the next result to know whether the current is last */
next_res = PQgetResult(st->con);
+ printf("EVAN: readCommandResponse2: Got result: next_res=%d, conn=%d\n",
+ PQresultStatus(next_res), PQstatus(st->con));
+ if (PQresultStatus(next_res) == PGRES_TUPLES_OK)
+ {
+ char *val = PQgetvalue(next_res, 0, 0);
+ printf("EVAN: readCommandResponse2: Got next-result value: %s, conn=%d\n",
+ val, PQstatus(st->con));
+ }
is_last = (next_res == NULL);
switch (PQresultStatus(res))
@@ -3431,8 +3469,19 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
+ {
+ PGresult *temp_res = discardAvailableResults(st);
+ if (temp_res != NULL)
+ {
+ next_res = temp_res;
+ break;
+ }
st->estatus = getSQLErrorStatus(st, PQresultErrorField(res,
PG_DIAG_SQLSTATE));
+ if (st->estatus == ESTATUS_NO_ERROR)
+ {
+ break;
+ }
if (canRetryError(st->estatus) || canContinueOnError(st->estatus))
{
if (verbose_errors)
@@ -3440,6 +3489,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
goto error;
}
/* fall through */
+ }
default:
/* anything else is unexpected */
@@ -3460,6 +3510,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
return false;
}
+ printf("EVAN: readCommandResponse: completed, conn=%d\n",
+ PQstatus(st->con));
return true;
error:
@@ -3588,9 +3640,14 @@ discardUntilSync(CState *st)
PGresult *res = PQgetResult(st->con);
if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
+ {
+ printf("EVAN: Got sync while discarding until sync, conn=%d\n", PQstatus(st->con));
received_sync = true;
- else if (received_sync && res == NULL)
+ }
+ else if (received_sync) // && res == NULL)
{
+ printf("EVAN: Got null while discarding until sync, conn=%d\n", PQstatus(st->con));
+ Assert(res == NULL);
/*
* Reset ongoing sync count to 0 since all PGRES_PIPELINE_SYNC
* results have been discarded.
@@ -3599,23 +3656,31 @@ discardUntilSync(CState *st)
PQclear(res);
break;
}
- else
- {
- if (PQstatus(st->con) == CONNECTION_BAD)
- {
- pg_log_error("client %d aborted: the backend died while rolling back the failed transaction after",
- st->id);
- PQclear(res);
- return 0;
- }
-
- /*
- * If a PGRES_PIPELINE_SYNC is followed by something other than
- * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
- * eventually follow.
- */
- received_sync = false;
- }
+ //else
+ //{
+ // printf("EVAN: Got result while discarding until sync: %d, conn=%d\n",
+ // PQresultStatus(res), PQstatus(st->con));
+ // if (PQresultStatus(res) == PGRES_TUPLES_OK)
+ // {
+ // char *val = PQgetvalue(res, 0, 0);
+ // printf("EVAN: Got result value while discarding until sync: %s, conn=%d\n",
+ // val, PQstatus(st->con));
+ // }
+ // if (PQstatus(st->con) == CONNECTION_BAD)
+ // {
+ // pg_log_error("client %d aborted: the backend died while rolling back the failed transaction after",
+ // st->id);
+ // PQclear(res);
+ // return 0;
+ // }
+ //
+ // /*
+ // * If a PGRES_PIPELINE_SYNC is followed by something other than
+ // * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
+ // * eventually follow.
+ // */
+ // received_sync = false;
+ //}
PQclear(res);
}
On Wed, 12 Nov 2025 01:47:37 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Nov 11, 2025 at 11:49 AM Chao Li <li.evan.chao@gmail.com> wrote:
I just did more tests in both pipeline mode and non-pipeline mode. I think the main purpose of discardAvailableResults() is to drain results in pipeline mode. In non-pipeline mode, a NULL res indicates there are no more results to read, while in pipeline mode, when a pipeline is aborted, either a valid result or NULL can still be returned, so we need to wait until the pipeline state switches back to PQ_PIPELINE_ON. From this perspective, the current inline comment is correct, but I feel it’s not clear enough.
Thanks for working on this!
After reconsidering, I think the main goal here is to determine whether
the error causes a connection failure after it occurs.
If we can read and discard results without PQstatus() becoming CONNECTION_BAD
either until the end (in non-pipeline mode) or until the first sync point
after an error (in pipeline mode), that means the connection is still alive,
and processing can continue when --continue-on-error is specified.
The current function comments don't mention this purpose enough,
so it seems they should be updated to clarify that.
I agree that the goal of this function is to discard results until the point
where a connection failure can be detected. When the socket reaches EOF,
PQgetResult() returns PGRES_FATAL_ERROR to report it, followed by NULL.
However, in an aborted pipeline, several NULLs following each PGRES_PIPELINE_ABORTED
may be returned before that, so we need to discard those NULLs beforehand.
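In other words, the two sequences we need to tell apart look roughly like this (illustrative):
```
/*
 * Aborted pipeline, connection still alive:
 *   PGRES_FATAL_ERROR, NULL, PGRES_PIPELINE_ABORTED, NULL, ...,
 *   PGRES_PIPELINE_SYNC, NULL               -> continue processing
 *
 * Socket reached EOF:
 *   ..., PGRES_FATAL_ERROR (reporting the broken connection), NULL,
 *   with PQstatus() == CONNECTION_BAD       -> ESTATUS_CONN_ERROR
 *
 * The intermediate NULLs in the first case must be discarded, or we
 * would stop before the point where a dead connection becomes visible.
 */
```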
Considering this, the function name "discardAvailableResults" might be a bit misleading,
since it doesn’t actually discard all available results. How about renaming it to something
like "discardForErrorStatusCheck" (a bit long, though)?
Related to this, I doubt the necessity of calling this function after the error: label in
readCommandResponse(). If the error is retriable, all results will be discarded later by
discardUntilSync(). If it’s not retriable, the thread will immediately exit and the connection
will be abandoned, so discarding results here seems unnecessary.
If discardAvailableResults() is unnecessary here, we could embed its logic into
getSQLErrorStatus() instead of leaving it as a separate function.
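Roughly like this, for example (a sketch; the existing SQLSTATE-to-EStatus mapping is unchanged and elided):
```
static EStatus
getSQLErrorStatus(CState *st, const char *sqlState)
{
	PGresult   *res = NULL;

	/*
	 * Drain results first so that a dead connection becomes visible
	 * through PQstatus() below (the same loop discardAvailableResults()
	 * uses today).
	 */
	for (;;)
	{
		res = PQgetResult(st->con);

		if ((res == NULL && PQpipelineStatus(st->con) != PQ_PIPELINE_ABORTED) ||
			PQstatus(st->con) == CONNECTION_BAD)
			break;

		PQclear(res);
	}
	PQclear(res);

	if (PQstatus(st->con) == CONNECTION_BAD)
		return ESTATUS_CONN_ERROR;

	/* ... existing SQLSTATE-to-EStatus mapping ... */
}
```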
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Wed, Nov 12, 2025 at 6:34 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I've attached an updated patch.
Thanks for updating the patch!
The comment for the PQpipelineSync() call has also been updated to clarify
why it is necessary.
+ /*
+ * If a PGRES_PIPELINE_SYNC is followed by something other than
+ * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
+ * eventually follow.
+ */
LGTM. I'd like to append "Reset received_sync to false to wait for
it." to this comment.
In addition, I added a connection status check in the loop to avoid an
infinite loop waiting for PGRES_PIPELINE_SYNC after a connection failure.
Would it be better to move this status check right after PQgetResult()
so that connection failures can be detected regardless of what result
it returns?
+ pg_log_error("client %d aborted: the backend died while rolling back
the failed transaction after",
The trailing “after” seems unnecessary.
Since there's no guarantee the backend actually died in this case,
it might be better to use something like "client %d aborted while rolling back
the transaction after an error; perhaps the backend died while processing"
which matches the wording used under CSTATE_WAIT_ROLLBACK_RESULT
in advanceConnectionState().
Regards,
--
Fujii Masao
On Thu, Nov 13, 2025 at 11:21 AM Chao Li <li.evan.chao@gmail.com> wrote:
I debugged further this morning, and I think I have found the root cause. Ultimately, the problem is not with discardUntilSync(), instead, discardAvailableResults() mistakenly eats PGRES_PIPELINE_SYNC.
Thanks for debugging!
Yes, discardAvailableResults() can discard PGRES_PIPELINE_SYNC,
but do you mean that's the root cause of the assertion failure
Nagata-san reported?
Since that failure can occur even in older branches, I was thinking
that newer code
like discardAvailableResults() in master isn't the root cause...
Regards,
--
Fujii Masao
On Nov 13, 2025, at 11:47, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Nov 13, 2025 at 11:21 AM Chao Li <li.evan.chao@gmail.com> wrote:
I debugged further this morning, and I think I have found the root cause. Ultimately, the problem is not with discardUntilSync(), instead, discardAvailableResults() mistakenly eats PGRES_PIPELINE_SYNC.
Thanks for debugging!
Yes, discardAvailableResults() can discard PGRES_PIPELINE_SYNC,
but do you mean that's the root cause of the assertion failure
Nagata-san reported?
Since that failure can occur even in older branches, I was thinking
that newer code
like discardAvailableResults() in master isn't the root cause...
I haven't debugged with old code, but the old code also discards non-NULL results:
```
- do
- {
- res = PQgetResult(st->con);
- PQclear(res);
- } while (res);
+ discardAvailableResults(st);
```
It may also discard the sync message; that's my guess. I can also debug the old code this afternoon.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Nov 13, 2025, at 12:02, Chao Li <li.evan.chao@gmail.com> wrote:
On Nov 13, 2025, at 11:47, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Nov 13, 2025 at 11:21 AM Chao Li <li.evan.chao@gmail.com> wrote:
I debugged further this morning, and I think I have found the root cause. Ultimately, the problem is not with discardUntilSync(), instead, discardAvailableResults() mistakenly eats PGRES_PIPELINE_SYNC.
Thanks for debugging!
Yes, discardAvailableResults() can discard PGRES_PIPELINE_SYNC,
but do you mean that's the root cause of the assertion failure
Nagata-san reported?
Since that failure can occur even in older branches, I was thinking
that newer code
like discardAvailableResults() in master isn't the root cause...
I haven't debugged with old code, but the old code also discards non-NULL results:
```
- do
- {
- res = PQgetResult(st->con);
- PQclear(res);
- } while (res);
+ discardAvailableResults(st);
```
It may also discard the sync message; that's my guess. I can also debug the old code this afternoon.
I just tried the old code but it didn’t trigger the assert with Yugo’s deadlock scripts.
I did "git reset --hard a3ea5330fcf47390c8ab420bbf433a97a54505d6", that is the commit just before --continue-on-error. Then I ran Yugo's deadlock scripts, but I didn't get the assert:
```
% pgbench -n --failures-detailed -M extended -j 2 -c 2 -f deadlock.sql -f deadlock2.sql evantest
pgbench (19devel)
transaction type: multiple scripts
scaling factor: 1
query mode: extended
number of clients: 2
number of threads: 2
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 20/20
number of failed transactions: 0 (0.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
latency average = 0.341 ms
initial connection time = 2.637 ms
tps = 5865.102639 (without initial connection time)
SQL script 1: deadlock.sql
- weight: 1 (targets 50.0% of total)
- 12 transactions (60.0% of total)
- number of transactions actually processed: 12 (tps = 3519.061584)
- number of failed transactions: 0 (0.000%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 0 (0.000%)
- latency average = 0.311 ms
- latency stddev = 0.304 ms
SQL script 2: deadlock2.sql
- weight: 1 (targets 50.0% of total)
- 8 transactions (40.0% of total)
- number of transactions actually processed: 8 (tps = 2346.041056)
- number of failed transactions: 0 (0.000%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 0 (0.000%)
- latency average = 0.366 ms
- latency stddev = 0.364 ms
```
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, 13 Nov 2025 13:14:37 +0800
Chao Li <li.evan.chao@gmail.com> wrote:
On Nov 13, 2025, at 12:02, Chao Li <li.evan.chao@gmail.com> wrote:
On Nov 13, 2025, at 11:47, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Nov 13, 2025 at 11:21 AM Chao Li <li.evan.chao@gmail.com> wrote:
I debugged further this morning, and I think I have found the root cause. Ultimately, the problem is not with discardUntilSync(), instead, discardAvailableResults() mistakenly eats PGRES_PIPELINE_SYNC.
Thanks for debugging!
Yes, discardAvailableResults() can discard PGRES_PIPELINE_SYNC,
but do you mean that's the root cause of the assertion failure
Nagata-san reported?
Since that failure can occur even in older branches, I was thinking
that newer code
like discardAvailableResults() in master isn't the root cause...
I haven't debugged with old code, but the old code also discards non-NULL results:
```
- do
- {
- res = PQgetResult(st->con);
- PQclear(res);
- } while (res);
+ discardAvailableResults(st);
```
It may also discard the sync message; that's my guess. I can also debug the old code this afternoon.
I just tried the old code but it didn’t trigger the assert with Yugo’s deadlock scripts.
To trigger a deadlock error, the tables need to have enough rows so that the scan takes some
time. In my environment, about 1,000 rows were enough to cause a deadlock.
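For example, a setup along the lines of "CREATE TABLE a (i int); INSERT INTO a SELECT generate_series(1, 1000);" should be enough to reproduce it.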
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Nov 13, 2025, at 13:50, Yugo Nagata <nagata@sraoss.co.jp> wrote:
To trigger a deadlock error, the tables need to have enough rows so that the scan takes some
time. In my environment, about 1,000 rows were enough to cause a deadlock.
Yes, after inserting 1000 rows, I got the assert triggered. I added some logs to track what had been read:
```
% pgbench -n --failures-detailed -M extended -j 2 -c 2 -f deadlock.sql -f deadlock2.sql evantest
pgbench (19devel)
EVAN: on error discard: Got result: res=11, conn=0
EVAN: on error discard: Got result: res=7, conn=0
EVAN: discardUntilSync: Got result: res=10, conn=0 <== received sync
EVAN: discardUntilSync: Got sync, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0 <== then immediately received result of next select, without a null res in between
EVAN: discardUntilSync: Got result value: 2, conn=0
Assertion failed: (res == ((void*)0)), function discardUntilSync, file pgbench.c, line 3579.
zsh: abort pgbench -n --failures-detailed -M extended -j 2 -c 2 -f deadlock.sql -f
```
Looks like there is no null result following the PIPELINE_SYNC message,
so the code comment seems inaccurate:
```
/*
* PGRES_PIPELINE_SYNC must be followed by another
* PGRES_PIPELINE_SYNC or NULL; otherwise, assert failure.
*/
Assert(res == NULL);
```
Then I made a dirty change that returns from discardUntilSync() once it receives a SYNC:
```
if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
{
printf("EVAN: discardUntilSync: Got sync, conn=%d\n",
PQstatus(st->con));
received_sync = true;
st->num_syncs = 0;
PQclear(res);
break;
}
```
That eliminates the assert:
```
% pgbench -n --failures-detailed -M extended -j 2 -c 2 -f deadlock.sql -f deadlock2.sql evantest
pgbench (19devel)
EVAN: on error discard: Got result: res=11, conn=0
EVAN: on error discard: Got result: res=7, conn=0
EVAN: discardUntilSync: Got result: res=10, conn=0
EVAN: discardUntilSync: Got sync, conn=0
pgbench: error: client 0 aborted: failed to exit pipeline mode for rolling back the failed transaction
transaction type: multiple scripts
scaling factor: 1
query mode: extended
number of clients: 2
number of threads: 2
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 10/20
number of failed transactions: 0 (0.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
latency average = 203.933 ms
initial connection time = 3.006 ms
tps = 9.807152 (without initial connection time)
SQL script 1: deadlock.sql
- weight: 1 (targets 50.0% of total)
- 8 transactions (80.0% of total)
- number of transactions actually processed: 8 (tps = 7.845722)
- number of failed transactions: 0 (0.000%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 0 (0.000%)
- latency average = 127.115 ms
- latency stddev = 332.002 ms
SQL script 2: deadlock2.sql
- weight: 1 (targets 50.0% of total)
- 2 transactions (20.0% of total)
- number of transactions actually processed: 2 (tps = 1.961430)
- number of failed transactions: 0 (0.000%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 0 (0.000%)
- latency average = 1.347 ms
- latency stddev = 0.207 ms
pgbench: error: Run was aborted; the above results are incomplete.
```
So I think the key problem now is to confirm whether there must be a NULL following PGRES_PIPELINE_SYNC.
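For what it's worth, a drain loop that does not rely on that assumption would look roughly like this (a sketch only: the function name is made up, it assumes a sync has already been queued with PQpipelineSync(), and the fix eventually adopted in this thread additionally bails out on CONNECTION_BAD):
```
#include <stdbool.h>
#include <libpq-fe.h>

/* Sketch: read and discard results until the last sync point. */
static void
drain_until_last_sync(PGconn *conn)
{
	bool		received_sync = false;

	for (;;)
	{
		PGresult   *res = PQgetResult(conn);

		if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
			received_sync = true;	/* possibly the last sync */
		else if (received_sync && res == NULL)
			break;					/* sync followed by NULL: done */
		else
			received_sync = false;	/* another sync is still queued */

		PQclear(res);				/* no-op for NULL */
	}
}
```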
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, 13 Nov 2025 11:55:25 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Nov 12, 2025 at 6:34 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I've attached an updated patch.
Thanks for updating the patch!
The comment for the PQpipelineSync() call has also been updated to clarify
why it is necessary.

+ /*
+  * If a PGRES_PIPELINE_SYNC is followed by something other than
+  * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
+  * eventually follow.
+  */

LGTM. I'd like to append "Reset received_sync to false to wait for
it." into this comment.

In addition, I added a connection status check in the loop to avoid an
infinite loop waiting for PQpipelineSync after a connection failure.

Would it be better to move this status check right after PQgetResult()
so that connection failures can be detected regardless of what result
it returns?

+ pg_log_error("client %d aborted: the backend died while rolling back
the failed transaction after",

The trailing "after" seems unnecessary.
Since there's no guarantee the backend actually died in this case,
it might be better to use something like "client %d aborted while rolling back
the transaction after an error; perhaps the backend died while processing",
which matches the wording used under CSTATE_WAIT_ROLLBACK_RESULT
in advanceConnectionState().
Thank you for your review!
I've attached an updated patch reflecting your suggestion.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v3-0001-pgbench-Fix-assertion-failure-with-multiple-syncp.patchtext/x-diff; name=v3-0001-pgbench-Fix-assertion-failure-with-multiple-syncp.patchDownload
From c7b3d71880dc1bbd9efbf22f663c30a0b7e01a9a Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Tue, 11 Nov 2025 10:14:30 +0900
Subject: [PATCH v3] pgbench: Fix assertion failure with multiple \syncpipeline
in pipeline mode.
When running pgbench with a custom script that triggered retriable errors
(e.g., deadlock errors) followed by multiple \syncpipeline commands in
pipeline mode, an assertion failure could occur:
pgbench.c:3594: discardUntilSync: Assertion `res == ((void *)0)' failed
This happened because discardUntilSync() did not expect that a result
other than NULL (e.g. PGRES_TUPLES_OK) might be received after \syncpipeline.
This commit fixes the assertion failure by resetting the received_sync flag
and continuing to discard results to ensure that all results are discarded
until the last sync point.
Also, if the connection was unexpectedly closed, this function could get
stuck in an infinite loop waiting for PGRES_PIPELINE_SYNC, which would never
be received. To fix this, exit the loop immediately if a connection failure
is detected.
---
src/bin/pgbench/pgbench.c | 35 +++++++++++++++++++++++++----------
1 file changed, 25 insertions(+), 10 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d8764ba6fe0..7d50ee38399 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3563,14 +3563,18 @@ doRetry(CState *st, pg_time_usec_t *now)
}
/*
- * Read results and discard it until a sync point.
+ * Read and discard results until the last sync point.
*/
static int
discardUntilSync(CState *st)
{
bool received_sync = false;
- /* send a sync */
+ /*
+ * Send a sync to ensure at least one PGRES_PIPELINE_SYNC is received
+ * and to avoid an infinite loop, since all earlier ones may have
+ * already been received.
+ */
if (!PQpipelineSync(st->con))
{
pg_log_error("client %d aborted: failed to send a pipeline sync",
@@ -3578,21 +3582,23 @@ discardUntilSync(CState *st)
return 0;
}
- /* receive PGRES_PIPELINE_SYNC and null following it */
+ /* receive the last PGRES_PIPELINE_SYNC and null following it */
for (;;)
{
PGresult *res = PQgetResult(st->con);
+ if (PQstatus(st->con) == CONNECTION_BAD)
+ {
+ pg_log_error("client %d aborted while rolling back the transaction after an error; perhaps the backend died while processing",
+ st->id);
+ PQclear(res);
+ return 0;
+ }
+
if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
received_sync = true;
- else if (received_sync)
+ else if (received_sync && res == NULL)
{
- /*
- * PGRES_PIPELINE_SYNC must be followed by another
- * PGRES_PIPELINE_SYNC or NULL; otherwise, assert failure.
- */
- Assert(res == NULL);
-
/*
* Reset ongoing sync count to 0 since all PGRES_PIPELINE_SYNC
* results have been discarded.
@@ -3601,6 +3607,15 @@ discardUntilSync(CState *st)
PQclear(res);
break;
}
+ else
+ {
+ /*
+ * If a PGRES_PIPELINE_SYNC is followed by something other than
+ * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
+ * eventually follow. Reset received_sync to false to wait for it.
+ */
+ received_sync = false;
+ }
PQclear(res);
}
--
2.43.0
On Nov 13, 2025, at 15:09, Yugo Nagata <nagata@sraoss.co.jp> wrote:
With v3 patch, the assert is gone, but test result is no longer accurate, because discardAvailableResults() discarded PIPELINE_SYNC messages. This is my test result with v3:
```
% pgbench -n --failures-detailed -M extended -j 2 -c 2 -f deadlock.sql -f deadlock2.sql evantest
pgbench (19devel)
EVAN: discardAvailableResults: discarding result: res=11, conn=0
EVAN: discardAvailableResults: discarding result: res=7, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 6, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0
EVAN: discardUntilSync: discarding result value=8, conn=0
EVAN: discardUntilSync: Got result: res=7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got NULL, conn=0
EVAN: discardAvailableResults: discarding result: res=11, conn=0
EVAN: discardAvailableResults: discarding result: res=7, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 6, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0
EVAN: discardUntilSync: discarding result value=8, conn=0
EVAN: discardUntilSync: Got result: res=7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got NULL, conn=0
EVAN: discardAvailableResults: discarding result: res=11, conn=0
EVAN: discardAvailableResults: discarding result: res=7, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 2, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 3, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0
EVAN: discardUntilSync: discarding result value=4, conn=0
EVAN: discardUntilSync: Got result: res=7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got NULL, conn=0
EVAN: discardAvailableResults: discarding result: res=11, conn=0
EVAN: discardAvailableResults: discarding result: res=7, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 6, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0
EVAN: discardUntilSync: discarding result value=8, conn=0
EVAN: discardUntilSync: Got result: res=7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got NULL, conn=0
EVAN: discardAvailableResults: discarding result: res=11, conn=0
EVAN: discardAvailableResults: discarding result: res=7, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 2, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 3, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0
EVAN: discardUntilSync: discarding result value=4, conn=0
EVAN: discardUntilSync: Got result: res=7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got NULL, conn=0
transaction type: multiple scripts
scaling factor: 1
query mode: extended
number of clients: 2
number of threads: 2
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 15/20
number of failed transactions: 5 (25.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 5 (25.000%)
number of other failures: 0 (0.000%)
latency average = 502.741 ms (including failures)
initial connection time = 2.882 ms
tps = 2.983644 (without initial connection time)
SQL script 1: deadlock.sql
- weight: 1 (targets 50.0% of total)
- 11 transactions (55.0% of total)
- number of transactions actually processed: 9 (tps = 1.790186)
- number of failed transactions: 2 (18.182%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 2 (18.182%)
- number of other failures: 0 (0.000%)
- latency average = 336.030 ms
- latency stddev = 472.160 ms
SQL script 2: deadlock2.sql
- weight: 1 (targets 50.0% of total)
- 9 transactions (45.0% of total)
- number of transactions actually processed: 6 (tps = 1.193457)
- number of failed transactions: 3 (33.333%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 3 (33.333%)
- number of other failures: 0 (0.000%)
- latency average = 335.757 ms
- latency stddev = 472.107 ms
```
We can see:
* number of transactions actually processed: 15/20
* number of failed transactions: 5 (25.000%)
However, with the dirty diff I sent in the morning:
```
% pgbench -n --failures-detailed -M extended -j 2 -c 2 -f deadlock.sql -f deadlock2.sql evantest
… omit debug logs …
transaction type: multiple scripts
scaling factor: 1
query mode: extended
number of clients: 2
number of threads: 2
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 20/20
number of failed transactions: 0 (0.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other failures: 0 (0.000%)
latency average = 302.863 ms
initial connection time = 2.749 ms
tps = 6.603655 (without initial connection time)
SQL script 1: deadlock.sql
- weight: 1 (targets 50.0% of total)
- 11 transactions (55.0% of total)
- number of transactions actually processed: 11 (tps = 3.632010)
- number of failed transactions: 0 (0.000%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 0 (0.000%)
- number of other failures: 0 (0.000%)
- latency average = 275.532 ms
- latency stddev = 445.629 ms
SQL script 2: deadlock2.sql
- weight: 1 (targets 50.0% of total)
- 9 transactions (45.0% of total)
- number of transactions actually processed: 9 (tps = 2.971645)
- number of failed transactions: 0 (0.000%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 0 (0.000%)
- number of other failures: 0 (0.000%)
- latency average = 336.150 ms
- latency stddev = 472.091 ms
```
Now all transactions are processed and there are no failures. I think that is expected, because \syncpipeline should roll back failures, so all scripts should succeed.
It feels to me that, because of the newly introduced discardAvailableResults(), we need different fixes for master and the old branches.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, 13 Nov 2025 16:13:30 +0800
Chao Li <li.evan.chao@gmail.com> wrote:
Now all transactions are processed and there are no failures. I think that is expected, because \syncpipeline should roll back failures, so all scripts should succeed.
It feels to me that, because of the newly introduced discardAvailableResults(), we need different fixes for master and the old branches.
I understand your claim that scripts rolled back by \syncpipeline should
be considered successful. However, I believe treating them as failed
transactions is the expected behavior in pgbench, since it assumes that
a transaction script contains only one transaction, as described in the
documentation [1].
The following script:
\startpipeline
<queries list 1>
\syncpipeline
<queries list 2>
\endpipeline
can be considered equivalent to:
BEGIN;
<queries list 1>
END;
BEGIN;
<queries list 2>
END;
with respect to the scope of queries rolled back.
In the latter script, an error (such as a deadlock or serialization failure)
in any query is recorded as a failed transaction in the current pgbench, even
if part of the script has already been committed.
Therefore, the same behavior would be expected for the former script using a
pipeline.
[1]: https://www.postgresql.org/docs/current/pgbench.html#FAILURES-AND-RETRIES
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Nov 13, 2025, at 17:40, Yugo Nagata <nagata@sraoss.co.jp> wrote:
The following script:
\startpipeline
<queries list 1>
\syncpipeline
<queries list 2>
\endpipeline

can be considered equivalent to:
BEGIN;
<queries list 1>
END;
BEGIN;
<queries list 2>
END;
It looks like every \syncpipeline starts a new transaction, is that true?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, 13 Nov 2025 18:17:37 +0800
Chao Li <li.evan.chao@gmail.com> wrote:
On Nov 13, 2025, at 17:40, Yugo Nagata <nagata@sraoss.co.jp> wrote:
The following script:
\startpipeline
<queries list 1>
\syncpipeline
<queries list 2>
\endpipeline

can be considered equivalent to:
BEGIN;
<queries list 1>
END;
BEGIN;
<queries list 2>
END;

It looks like every \syncpipeline starts a new transaction, is that true?
Yes, it causes a new transaction to start.
In a pipeline, an implicit transaction block is started, and \syncpipeline closes it.
Then, a new implicit transaction begins.
Here’s a simple example to illustrate this:
$ cat pipeline_tx.sql
drop table if exists tbl;
create table tbl (i int);
\startpipeline
insert into tbl values(1);
insert into tbl values(2);
\syncpipeline
insert into tbl values(3);
insert into tbl values(4);
\endpipeline
$ pgbench -f pipeline_tx.sql -t 1 -M extended -n > /dev/null
$ psql -c "select xmin, i from tbl"
xmin | i
------+---
1268 | 1
1268 | 2
1269 | 3
1269 | 4
(4 rows)
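Since xmin records the ID of the inserting transaction, the two distinct values above (1268 for rows 1 and 2, 1269 for rows 3 and 4) confirm that \syncpipeline closed the first implicit transaction and a new one started for the remaining inserts.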
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Nov 13, 2025, at 21:55, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Nov 13, 2025 at 4:09 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
Thank you for your review!
I've attached an updated patch reflecting your suggestion.

Thanks for updating the patch! LGTM.

You mentioned that the assertion failure could occur when using \syncpipeline,
but it seems that multiple PGRES_PIPELINE_SYNC results can also appear
even without it, which can still trigger the same issue. For example,
I was able to reproduce the assertion failure in v16 (which doesn't support
\syncpipeline) with the following setup:

--------------------------------
$ cat deadlock.sql
\startpipeline
select * from a order by i for update;
select 1;
\endpipeline

$ cat deadlock2.sql
\startpipeline
select * from a order by i desc for update;
select 1;
\endpipeline

$ psql -c "create table a (i int primary key); insert into a
values(generate_series(1,1000));"

$ pgbench -n -j 4 -c 4 -T 5 -M extended -f deadlock.sql -f deadlock2.sql
...
Assertion failed: (res == ((void *)0)), function discardUntilSync,
file pgbench.c, line 3479.
--------------------------------

So I've updated the commit message to clarify that while using \syncpipeline
makes the failure more likely, it can still occur without it. Since the issue
can also happen in v15 and v16 (which both lack \syncpipeline), I plan to
backpatch the fix to v15. The failure doesn't occur in v14 because it doesn't
support retriable error retries.

I've also made a few cosmetic tweaks to the patch. Attached is the updated
version, which I plan to push.

Regards,
--
Fujii Masao

<v4-0001-pgbench-PG15-PG16-Fix-assertion-failure-when-discarding-res.txt><v4-0001-pgbench-Fix-assertion-failure-when-discarding-res.patch>
I was misunderstanding; I thought "\syncpipeline" would recover the transaction. Now that the confusion is resolved, I think the v4 patch is overall good. Only one small comment:
```
+ else if (received_sync && res == NULL)
{
- /*
- * PGRES_PIPELINE_SYNC must be followed by another
- * PGRES_PIPELINE_SYNC or NULL; otherwise, assert failure.
- */
- Assert(res == NULL);
-
/*
* Reset ongoing sync count to 0 since all PGRES_PIPELINE_SYNC
* results have been discarded.
@@ -3601,6 +3610,15 @@ discardUntilSync(CState *st)
PQclear(res);
break;
}
```
As we now add "res == NULL" to the "else if", once entering "else if (received_sync && res == NULL)", res must be NULL, so the "PQclear(res);" should be deleted. Leaving it there does no harm today, but it is error-prone: if someone later removes "res == NULL" from the "else if", it will lead to a double free, because after the "break", PQclear(res) will be called again.
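Concretely, the cleaned-up branch would look something like this (a sketch of the suggestion, not a hunk from the patch):
```
else if (received_sync && res == NULL)
{
	/*
	 * Reset ongoing sync count to 0 since all PGRES_PIPELINE_SYNC
	 * results have been discarded. res is known to be NULL here, so
	 * there is nothing to PQclear() before leaving the loop.
	 */
	st->num_syncs = 0;
	break;
}
```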
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, 13 Nov 2025 22:55:53 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Nov 13, 2025 at 4:09 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
Thank you for your review!
I've attached an updated patch reflecting your suggestion.

Thanks for updating the patch! LGTM.

You mentioned that the assertion failure could occur when using \syncpipeline,
but it seems that multiple PGRES_PIPELINE_SYNC results can also appear
even without it, which can still trigger the same issue. For example,
I was able to reproduce the assertion failure in v16 (which doesn't support
\syncpipeline) with the following setup:

--------------------------------
$ cat deadlock.sql
\startpipeline
select * from a order by i for update;
select 1;
\endpipeline

$ cat deadlock2.sql
\startpipeline
select * from a order by i desc for update;
select 1;
\endpipeline

$ psql -c "create table a (i int primary key); insert into a
values(generate_series(1,1000));"

$ pgbench -n -j 4 -c 4 -T 5 -M extended -f deadlock.sql -f deadlock2.sql
...
Assertion failed: (res == ((void *)0)), function discardUntilSync,
file pgbench.c, line 3479.
--------------------------------

So I've updated the commit message to clarify that while using \syncpipeline
makes the failure more likely, it can still occur without it. Since the issue
can also happen in v15 and v16 (which both lack \syncpipeline), I plan to
backpatch the fix to v15. The failure doesn't occur in v14 because it doesn't
support retriable error retries.
I could not reproduce it with the latest REL_16_STABLE branch.
Perhaps, the assertion failure you mentioned above was the one
fixed by 1d3ded521?
Or, I am missing something...
I've also made a few cosmetic tweaks to the patch. Attached is the updated
version, which I plan to push.
Thank you for updating the patch.
By the way, your previous email has not been archived [1].
I guess it was not received by the server due to some issue.
Therefore, I've attached the patches you sent.
[1]: https://www.postgresql.org/list/pgsql-hackers/since/202511130000/
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v4-0001-pgbench-Fix-assertion-failure-when-discarding-res.patchtext/x-diff; name=v4-0001-pgbench-Fix-assertion-failure-when-discarding-res.patchDownload
From f62f3acb82ebea71cb322b5a4b4effb3de557261 Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Thu, 13 Nov 2025 18:43:19 +0900
Subject: [PATCH v4] pgbench: Fix assertion failure when discarding results
after retriable errors.
Previously, when pgbench ran a custom script that triggered retriable errors
(e.g., deadlocks) in pipeline mode, the following assertion failure could occur:
Assertion failed: (res == ((void*)0)), function discardUntilSync, file pgbench.c, line 3594.
This typically happened when multiple \syncpipeline commands followed
a statement that caused a retriable error. However, even in v15 and v16
where \syncpipeline is not supported, scripts without it could still trigger
this failure.
The issue was that discardUntilSync() assumed a pipeline sync result
(PGRES_PIPELINE_SYNC) would always be followed by either another sync result
or NULL. This assumption was incorrect: when multiple sync requests were sent,
a sync result could instead be followed by another result type. In such cases,
discardUntilSync() mishandled the results, leading to the assertion failure.
This commit fixes the issue by making discardUntilSync() correctly handle cases
where a pipeline sync result is followed by other result types. It now continues
discarding results until another pipeline sync followed by NULL is reached.
Backpatched to v15, where support for retrying retriable errors in pgbench
was introduced.
Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/20251111105037.f3fc554616bc19891f926c5b@sraoss.co.jp
Backpatch-through: 15
---
src/bin/pgbench/pgbench.c | 38 ++++++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 10 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d8764ba6fe0..8caf7b8bdaf 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3563,14 +3563,18 @@ doRetry(CState *st, pg_time_usec_t *now)
}
/*
- * Read results and discard it until a sync point.
+ * Read and discard results until the last sync point.
*/
static int
discardUntilSync(CState *st)
{
bool received_sync = false;
- /* send a sync */
+ /*
+ * Send a Sync message to ensure at least one PGRES_PIPELINE_SYNC is
+ * received and to avoid an infinite loop, since all earlier ones may have
+ * already been received.
+ */
if (!PQpipelineSync(st->con))
{
pg_log_error("client %d aborted: failed to send a pipeline sync",
@@ -3578,21 +3582,26 @@ discardUntilSync(CState *st)
return 0;
}
- /* receive PGRES_PIPELINE_SYNC and null following it */
+ /*
+ * Continue reading results until the last sync point, i.e., until
+ * reaching null just after PGRES_PIPELINE_SYNC.
+ */
for (;;)
{
PGresult *res = PQgetResult(st->con);
+ if (PQstatus(st->con) == CONNECTION_BAD)
+ {
+ pg_log_error("client %d aborted while rolling back the transaction after an error; perhaps the backend died while processing",
+ st->id);
+ PQclear(res);
+ return 0;
+ }
+
if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
received_sync = true;
- else if (received_sync)
+ else if (received_sync && res == NULL)
{
- /*
- * PGRES_PIPELINE_SYNC must be followed by another
- * PGRES_PIPELINE_SYNC or NULL; otherwise, assert failure.
- */
- Assert(res == NULL);
-
/*
* Reset ongoing sync count to 0 since all PGRES_PIPELINE_SYNC
* results have been discarded.
@@ -3601,6 +3610,15 @@ discardUntilSync(CState *st)
PQclear(res);
break;
}
+ else
+ {
+ /*
+ * If a PGRES_PIPELINE_SYNC is followed by something other than
+ * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
+ * appear later. Reset received_sync to false to wait for it.
+ */
+ received_sync = false;
+ }
PQclear(res);
}
--
2.51.2
v4-0001-pgbench-PG15-PG16-Fix-assertion-failure-when-discarding-res.txttext/plain; name=v4-0001-pgbench-PG15-PG16-Fix-assertion-failure-when-discarding-res.txtDownload
From 1865eabfd65232feff106e7c01c8c6c9161571c8 Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Thu, 13 Nov 2025 18:43:19 +0900
Subject: [PATCH v4] pgbench: Fix assertion failure when discarding results
after retriable errors.
Previously, when pgbench ran a custom script that triggered retriable errors
(e.g., deadlocks) in pipeline mode, the following assertion failure could occur:
Assertion failed: (res == ((void*)0)), function discardUntilSync, file pgbench.c, line 3594.
This typically happened when multiple \syncpipeline commands followed
a statement that caused a retriable error. However, even in v15 and v16
where \syncpipeline is not supported, scripts without it could still trigger
this failure.
The issue was that discardUntilSync() assumed a pipeline sync result
(PGRES_PIPELINE_SYNC) would always be followed by either another sync result
or NULL. This assumption was incorrect: when multiple sync requests were sent,
a sync result could instead be followed by another result type. In such cases,
discardUntilSync() mishandled the results, leading to the assertion failure.
This commit fixes the issue by making discardUntilSync() correctly handle cases
where a pipeline sync result is followed by other result types. It now continues
discarding results until another pipeline sync followed by NULL is reached.
Backpatched to v15, where support for retrying retriable errors in pgbench
was introduced.
Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/20251111105037.f3fc554616bc19891f926c5b@sraoss.co.jp
Backpatch-through: 15
---
src/bin/pgbench/pgbench.c | 38 ++++++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 10 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index adf6e45953b..4bdd507582a 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3475,14 +3475,18 @@ doRetry(CState *st, pg_time_usec_t *now)
}
/*
- * Read results and discard it until a sync point.
+ * Read and discard results until the last sync point.
*/
static int
discardUntilSync(CState *st)
{
bool received_sync = false;
- /* send a sync */
+ /*
+ * Send a Sync message to ensure at least one PGRES_PIPELINE_SYNC is
+ * received and to avoid an infinite loop, since all earlier ones may have
+ * already been received.
+ */
if (!PQpipelineSync(st->con))
{
pg_log_error("client %d aborted: failed to send a pipeline sync",
@@ -3490,24 +3494,38 @@ discardUntilSync(CState *st)
return 0;
}
- /* receive PGRES_PIPELINE_SYNC and null following it */
+ /*
+ * Continue reading results until the last sync point, i.e., until
+ * reaching null just after PGRES_PIPELINE_SYNC.
+ */
for (;;)
{
PGresult *res = PQgetResult(st->con);
+ if (PQstatus(st->con) == CONNECTION_BAD)
+ {
+ pg_log_error("client %d aborted while rolling back the transaction after an error; perhaps the backend died while processing",
+ st->id);
+ PQclear(res);
+ return 0;
+ }
+
if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
received_sync = true;
- else if (received_sync)
+ else if (received_sync && res == NULL)
{
- /*
- * PGRES_PIPELINE_SYNC must be followed by another
- * PGRES_PIPELINE_SYNC or NULL; otherwise, assert failure.
- */
- Assert(res == NULL);
-
PQclear(res);
break;
}
+ else
+ {
+ /*
+ * If a PGRES_PIPELINE_SYNC is followed by something other than
+ * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
+ * appear later. Reset received_sync to false to wait for it.
+ */
+ received_sync = false;
+ }
PQclear(res);
}
--
2.51.2
On Fri, Nov 14, 2025 at 4:50 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I could not reproduce it with the latest REL_16_STABLE branch.
Perhaps, the assertion failure you mentioned above was the one
fixed by 1d3ded521?
Yeah, you're right! Thanks for catching that.
I've updated the commit message to explicitly mention the \syncpipeline command.
Patch attached.
Since the assertion failure can occur only in versions that support \syncpipeline,
the fix doesn't need to be backpatched to v16 or older.
Regards,
--
Fujii Masao
Attachments:
v5-0001-pgbench-Fix-assertion-failure-with-multiple-syncp.patchapplication/octet-stream; name=v5-0001-pgbench-Fix-assertion-failure-with-multiple-syncp.patchDownload
From 29816a73e048e1a35a99ef467c03366df9b5b249 Mon Sep 17 00:00:00 2001
From: Fujii Masao <fujii@postgresql.org>
Date: Thu, 13 Nov 2025 18:43:19 +0900
Subject: [PATCH v5] pgbench: Fix assertion failure with multiple \syncpipeline
in pipeline mode.
Previously, when pgbench ran a custom script that triggered retriable errors
(e.g., deadlocks) followed by multiple \syncpipeline commands in pipeline mode,
the following assertion failure could occur:
Assertion failed: (res == ((void*)0)), function discardUntilSync, file pgbench.c, line 3594.
The issue was that discardUntilSync() assumed a pipeline sync result
(PGRES_PIPELINE_SYNC) would always be followed by either another sync result
or NULL. This assumption was incorrect: when multiple sync requests were sent,
a sync result could instead be followed by another result type. In such cases,
discardUntilSync() mishandled the results, leading to the assertion failure.
This commit fixes the issue by making discardUntilSync() correctly handle cases
where a pipeline sync result is followed by other result types. It now continues
discarding results until another pipeline sync followed by NULL is reached.
Backpatched to v17, where support for \syncpipeline command in pgbench was
introduced.
Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/20251111105037.f3fc554616bc19891f926c5b@sraoss.co.jp
Backpatch-through: 17
---
src/bin/pgbench/pgbench.c | 39 ++++++++++++++++++++++++++++-----------
1 file changed, 28 insertions(+), 11 deletions(-)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d8764ba6fe0..a425176ecdc 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3563,14 +3563,18 @@ doRetry(CState *st, pg_time_usec_t *now)
}
/*
- * Read results and discard it until a sync point.
+ * Read and discard results until the last sync point.
*/
static int
discardUntilSync(CState *st)
{
bool received_sync = false;
- /* send a sync */
+ /*
+ * Send a Sync message to ensure at least one PGRES_PIPELINE_SYNC is
+ * received and to avoid an infinite loop, since all earlier ones may have
+ * already been received.
+ */
if (!PQpipelineSync(st->con))
{
pg_log_error("client %d aborted: failed to send a pipeline sync",
@@ -3578,29 +3582,42 @@ discardUntilSync(CState *st)
return 0;
}
- /* receive PGRES_PIPELINE_SYNC and null following it */
+ /*
+ * Continue reading results until the last sync point, i.e., until
+ * reaching null just after PGRES_PIPELINE_SYNC.
+ */
for (;;)
{
PGresult *res = PQgetResult(st->con);
+ if (PQstatus(st->con) == CONNECTION_BAD)
+ {
+ pg_log_error("client %d aborted while rolling back the transaction after an error; perhaps the backend died while processing",
+ st->id);
+ PQclear(res);
+ return 0;
+ }
+
if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
received_sync = true;
- else if (received_sync)
+ else if (received_sync && res == NULL)
{
- /*
- * PGRES_PIPELINE_SYNC must be followed by another
- * PGRES_PIPELINE_SYNC or NULL; otherwise, assert failure.
- */
- Assert(res == NULL);
-
/*
* Reset ongoing sync count to 0 since all PGRES_PIPELINE_SYNC
* results have been discarded.
*/
st->num_syncs = 0;
- PQclear(res);
break;
}
+ else
+ {
+ /*
+ * If a PGRES_PIPELINE_SYNC is followed by something other than
+ * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
+ * appear later. Reset received_sync to false to wait for it.
+ */
+ received_sync = false;
+ }
PQclear(res);
}
--
2.51.2
On Fri, Nov 14, 2025 at 4:45 PM Chao Li <li.evan.chao@gmail.com> wrote:
```
+ else if (received_sync && res == NULL)
{
- /*
- * PGRES_PIPELINE_SYNC must be followed by another
- * PGRES_PIPELINE_SYNC or NULL; otherwise, assert failure.
- */
- Assert(res == NULL);
-
/*
* Reset ongoing sync count to 0 since all PGRES_PIPELINE_SYNC
* results have been discarded.
@@ -3601,6 +3610,15 @@ discardUntilSync(CState *st)
PQclear(res);
break;
}
```

As we now add "res == NULL" to the "else if", once entering "else if (received_sync && res == NULL)", res must be NULL, so the "PQclear(res);" should be deleted.
OK, the PQclear() there is unnecessary, so I removed it in the patch I
posted earlier.
Regards,
--
Fujii Masao
On Fri, 14 Nov 2025 18:08:24 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Nov 14, 2025 at 4:50 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
I could not reproduce it with the latest REL_16_STABLE branch.
Perhaps, the assertion failure you mentioned above was the one
fixed by 1d3ded521?

Yeah, you're right! Thanks for catching that.
I've updated the commit message to explicitly mention the \syncpipeline command.
Patch attached.

Since the assertion failure can occur only in versions that support \syncpipeline,
the fix doesn't need to be backpatched to v16 or older.
Thank you for updating and pushing the patch!
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>