Avoid pgbench getting stuck due to skipped transactions
Hi,
I found that pgbench could get stuck when every transaction
ends up being skipped and the number of transactions is not limited
by the -t option.
For example, when I used a large rate (-R) for throttling and a
small latency limit (-L) together with a duration (-T), pgbench
got stuck.
$ pgbench -T 5 -R 100000000 -L 1;
When we specify the number of transactions with -t, it doesn't get
stuck because the number of skipped transactions is counted and
checked during the loop. However, the timer expiration is not
checked in the loop, although it is checked before and after a
sleep for throttling.
I think it is better to check the timer expiration even in the
transaction-skipping loop and to finish pgbench successfully, because we
should correctly report how many transactions were processed and
skipped in this case too, and getting stuck would not be good
anyway.
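To make the failure mode concrete, here is a minimal, self-contained sketch of a skip loop with and without a timer check. It is not pgbench code: SketchState, SkipResult, skip_late_slots and the cost and max_iter parameters are hypothetical names introduced only for illustration, and time is simulated in nanoseconds instead of being read from a clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the throttling state (not pgbench's TState). */
typedef struct
{
	int64_t		throttle_trigger;	/* scheduled start of the next slot, ns */
	int64_t		skipped;			/* slots counted as skipped so far */
} SketchState;

typedef enum
{
	CAUGHT_UP,					/* the next slot is no longer late */
	TIMER_EXPIRED,				/* the simulated -T deadline was reached */
	GAVE_UP						/* iteration cap hit: the loop would spin on */
} SkipResult;

/*
 * Skip late slots on a simulated clock.  Each pass advances the schedule by
 * `interval` and the clock by `cost`, the assumed price of counting and
 * rescheduling one slot.  `max_iter` exists only so the sketch terminates.
 */
static SkipResult
skip_late_slots(SketchState *s, int64_t now, int64_t latency_limit,
				int64_t interval, int64_t cost, int64_t deadline,
				bool check_timer, int64_t max_iter)
{
	assert(interval > 0 && cost > 0);

	for (int64_t i = 0; i < max_iter; i++)
	{
		if (s->throttle_trigger >= now - latency_limit)
			return CAUGHT_UP;
		if (check_timer && now >= deadline)
			return TIMER_EXPIRED;

		s->skipped++;						/* count the skipped slot */
		s->throttle_trigger += interval;	/* schedule the next slot */
		now += cost;						/* time passes while we do this */
	}
	return GAVE_UP;
}
```

When cost is larger than interval, the schedule can never catch up with the clock, so only the timer check (or, in the sketch, the iteration cap) ends the loop. When interval is larger than cost, the loop catches up on its own.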
I attached a patch for this fix.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachments:
pgbench_avoiding_stuck.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index dc84b7b9b7..1aa3e6b7be 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3232,7 +3232,8 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
pg_time_now_lazy(&now);
while (thread->throttle_trigger < now - latency_limit &&
- (nxacts <= 0 || st->cnt < nxacts))
+ (nxacts <= 0 || st->cnt < nxacts) &&
+ !timer_exceeded)
{
processXactStats(thread, st, &now, true, agg);
/* next rendez-vous */
Hello Yugo-san,
For example, when I used a large rate (-R) for throttling and a
small latency limit (-L) together with a duration (-T), pgbench
got stuck.
$ pgbench -T 5 -R 100000000 -L 1;
Indeed, it does not get out of the catchup loop for a long time because
even scheduling takes more time than the expected transaction time!
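As a rough worked example of the arithmetic behind that remark, the sketch below computes the mean gap between scheduled transactions and the number of slots over the run. It assumes only that -R is a target rate in transactions per second, so the mean interval is one second divided by the rate; mean_interval_us and expected_slots are hypothetical helper names, not pgbench functions.

```c
#include <assert.h>

/* Mean gap between scheduled transaction starts, in microseconds. */
static double
mean_interval_us(double rate_tps)
{
	assert(rate_tps > 0.0);
	return 1000000.0 / rate_tps;
}

/* Number of slots scheduled, on average, over the whole run. */
static double
expected_slots(double rate_tps, double duration_s)
{
	return rate_tps * duration_s;
}
```

For -R 100000000 the mean interval comes out at about 0.01 microseconds (10 ns), and -T 5 implies about 500 million slots. If counting and rescheduling one skipped slot takes longer than 10 ns, which seems likely, the loop falls further behind on every pass.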
I think it is better to check the timer expiration even in the
transaction-skipping loop and to finish pgbench successfully, because we
should correctly report how many transactions were processed and
skipped in this case too, and getting stuck would not be good
anyway.
I attached a patch for this fix.
The patch mostly works for me, and I agree that the bench should not get
stuck in a loop whatever the parameters, even when "crazy" parameters are given…
However I'm not sure this is the right way to handle this issue.
The catch-up loop can be dropped and the automaton can loop over itself to
reschedule. Doing that, as in the attached patch, fixes this issue, makes
progress reporting work properly in more cases, and reduces the number of
lines of code. I did not add a test case because time-sensitive tests have
been removed (which is too bad, IMHO).
--
Fabien.
Attachments:
pgbench-stuck-2.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..fe75533e3e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3223,31 +3223,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
/*
* If --latency-limit is used, and this slot is already late
* so that the transaction will miss the latency limit even if
- * it completed immediately, skip this time slot and schedule
- * to continue running on the next slot that isn't late yet.
- * But don't iterate beyond the -t limit, if one is given.
+ * it completed immediately, skip this time slot and loop to
+ * reschedule.
*/
if (latency_limit)
{
pg_time_now_lazy(&now);
- while (thread->throttle_trigger < now - latency_limit &&
- (nxacts <= 0 || st->cnt < nxacts))
+ if (thread->throttle_trigger < now - latency_limit)
{
processXactStats(thread, st, &now, true, agg);
- /* next rendez-vous */
- thread->throttle_trigger +=
- getPoissonRand(&thread->ts_throttle_rs, throttle_delay);
- st->txn_scheduled = thread->throttle_trigger;
- }
- /*
- * stop client if -t was exceeded in the previous skip
- * loop
- */
- if (nxacts > 0 && st->cnt >= nxacts)
- {
- st->state = CSTATE_FINISHED;
+ /* stop client if -T/-t was exceeded. */
+ if (timer_exceeded || (nxacts > 0 && st->cnt >= nxacts))
+ /*
+ * For very unrealistic rates under -T, some skipped
+ * transactions are not counted because the catchup
+ * loop is not fast enough just to do the scheduling
+ * and counting at the expected speed.
+ *
+ * We do not bother with such a degenerate case.
+ */
+ st->state = CSTATE_FINISHED;
+
+ /* otherwise loop over PREPARE_THROTTLE */
break;
}
}
Hello Fabien,
On Sun, 13 Jun 2021 08:56:59 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I attached a patch for this fix.
The patch mostly works for me, and I agree that the bench should not get
stuck in a loop whatever the parameters, even when "crazy" parameters are given…
However I'm not sure this is the right way to handle this issue.
The catch-up loop can be dropped and the automaton can loop over itself to
reschedule. Doing that, as in the attached patch, fixes this issue, makes
progress reporting work properly in more cases, and reduces the number of
lines of code. I did not add a test case because time-sensitive tests have
been removed (which is too bad, IMHO).
I agree with your way of fixing this. However, the progress reporting didn't work,
because break alone does not return from advanceConnectionState to threadRun;
it just continues the loop.
+ /* otherwise loop over PREPARE_THROTTLE */
break;
I attached a fixed patch that uses return instead of break, and I confirmed
that this made the progress reporting work properly.
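The difference between break and return here can be illustrated with a small sketch of a switch nested in an endless loop. It is only an assumption-laden model of the state machine: SketchPhase, SketchClient and advance_sketch are hypothetical names, and the real advanceConnectionState is far more involved.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	SK_PREPARE_THROTTLE,		/* deciding whether the next slot is late */
	SK_START_TX					/* ready to run a transaction */
} SketchPhase;

typedef struct
{
	SketchPhase phase;
	int			late_slots;		/* slots that are already too late */
	int			skipped;		/* slots counted as skipped */
} SketchClient;

/* One call to the state machine; returns how many slots this call skipped. */
static int
advance_sketch(SketchClient *c, bool use_return)
{
	int			before = c->skipped;

	assert(c->late_slots >= 0);

	for (;;)
	{
		switch (c->phase)
		{
			case SK_PREPARE_THROTTLE:
				if (c->late_slots > 0)
				{
					c->late_slots--;
					c->skipped++;
					if (use_return)
						return c->skipped - before; /* caller regains control */
					break;		/* leaves only the switch; the loop goes on */
				}
				c->phase = SK_START_TX;
				break;
			case SK_START_TX:
				return c->skipped - before;
		}
	}
}
```

With use_return set to false, a single call consumes every late slot before the caller sees anything. With use_return set to true, the caller gets control back after each skipped slot, which is where something like progress reporting could run.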
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachments:
pgbench-stuck-3.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index dc84b7b9b7..8f5d000938 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3223,32 +3223,31 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
/*
* If --latency-limit is used, and this slot is already late
* so that the transaction will miss the latency limit even if
- * it completed immediately, skip this time slot and schedule
- * to continue running on the next slot that isn't late yet.
- * But don't iterate beyond the -t limit, if one is given.
+ * it completed immediately, skip this time slot and loop to
+ * reschedule.
*/
if (latency_limit)
{
pg_time_now_lazy(&now);
- while (thread->throttle_trigger < now - latency_limit &&
- (nxacts <= 0 || st->cnt < nxacts))
+ if (thread->throttle_trigger < now - latency_limit)
{
processXactStats(thread, st, &now, true, agg);
- /* next rendez-vous */
- thread->throttle_trigger +=
- getPoissonRand(&thread->ts_throttle_rs, throttle_delay);
- st->txn_scheduled = thread->throttle_trigger;
- }
- /*
- * stop client if -t was exceeded in the previous skip
- * loop
- */
- if (nxacts > 0 && st->cnt >= nxacts)
- {
- st->state = CSTATE_FINISHED;
- break;
+ /* stop client if -T/-t was exceeded. */
+ if (timer_exceeded || (nxacts > 0 && st->cnt >= nxacts))
+ /*
+ * For very unrealistic rates under -T, some skipped
+ * transactions are not counted because the catchup
+ * loop is not fast enough just to do the scheduling
+ * and counting at the expected speed.
+ *
+ * We do not bother with such a degenerate case.
+ */
+ st->state = CSTATE_FINISHED;
+
+ /*otherwise loop over PREPARE_THROTTLE */
+ return;
}
}
I attached the fixed patch that uses return instead of break, and I confirmed
that this made the progress reporting work properly.
I'm hesitant to make such a structural change for a degenerate case linked
to "insane" parameters, as pg is unlikely to reach 100 million tps, ever.
It seems to me enough that the command is not blocked in such cases.
--
Fabien.
On Mon, 14 Jun 2021 08:47:40 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I'm hesitant to make such a structural change for a degenerate case linked
to "insane" parameters, as pg is unlikely to reach 100 million tps, ever.
It seems to me enough that the command is not blocked in such cases.
Sure. The change from "break" to "return" is just for making the progress
reporting work in the loop, as you mentioned. However, my original intention
was to avoid getting stuck in a corner case where an unrealistic parameter is used, and
I agree with you that this change is not really necessary for handling such a
special situation.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
On Mon, 14 Jun 2021 16:06:10 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:
On Mon, 14 Jun 2021 08:47:40 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I attached the fixed patch that uses return instead of break, and I confirmed
that this made the progress reporting work properly.
I'm hesitant to make such a structural change for a degenerate case linked
to "insane" parameters, as pg is unlikely to reach 100 million tps, ever.
It seems to me enough that the command is not blocked in such cases.
Sure. The change from "break" to "return" is just for making the progress
reporting work in the loop, as you mentioned. However, my original intention
was to avoid getting stuck in a corner case where an unrealistic parameter is used, and
I agree with you that this change is not really necessary for handling such a
special situation.
I attached the v2 patch to clarify that I withdrew the v3 patch.
Regards
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachments:
pgbench-stuck-2.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index d7479925cb..fe75533e3e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3223,31 +3223,30 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
/*
* If --latency-limit is used, and this slot is already late
* so that the transaction will miss the latency limit even if
- * it completed immediately, skip this time slot and schedule
- * to continue running on the next slot that isn't late yet.
- * But don't iterate beyond the -t limit, if one is given.
+ * it completed immediately, skip this time slot and loop to
+ * reschedule.
*/
if (latency_limit)
{
pg_time_now_lazy(&now);
- while (thread->throttle_trigger < now - latency_limit &&
- (nxacts <= 0 || st->cnt < nxacts))
+ if (thread->throttle_trigger < now - latency_limit)
{
processXactStats(thread, st, &now, true, agg);
- /* next rendez-vous */
- thread->throttle_trigger +=
- getPoissonRand(&thread->ts_throttle_rs, throttle_delay);
- st->txn_scheduled = thread->throttle_trigger;
- }
- /*
- * stop client if -t was exceeded in the previous skip
- * loop
- */
- if (nxacts > 0 && st->cnt >= nxacts)
- {
- st->state = CSTATE_FINISHED;
+ /* stop client if -T/-t was exceeded. */
+ if (timer_exceeded || (nxacts > 0 && st->cnt >= nxacts))
+ /*
+ * For very unrealistic rates under -T, some skipped
+ * transactions are not counted because the catchup
+ * loop is not fast enough just to do the scheduling
+ * and counting at the expected speed.
+ *
+ * We do not bother with such a degenerate case.
+ */
+ st->state = CSTATE_FINISHED;
+
+ /* otherwise loop over PREPARE_THROTTLE */
break;
}
}
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, failed
Spec compliant: not tested
Documentation: not tested
Looks fine to me, as a way of catching this edge case.
Hello Greg,
On Tue, 22 Jun 2021 19:22:38 +0000
Greg Sabino Mullane <htamfids@gmail.com> wrote:
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, failed
Spec compliant: not tested
Documentation: not tested
Looks fine to me, as a way of catching this edge case.
Thank you for looking into this!
'make installcheck-world' and 'Implements feature' are marked "failed",
but did you find any problem with this patch?
--
Yugo NAGATA <nagata@sraoss.co.jp>
Apologies, just saw this. I found no problems, those "failures" were just
me missing checkboxes on the commitfest interface. +1 on the patch.
Cheers,
Greg
On Tue, 10 Aug 2021 10:50:20 -0400
Greg Sabino Mullane <htamfids@gmail.com> wrote:
Apologies, just saw this. I found no problems, those "failures" were just
me missing checkboxes on the commitfest interface. +1 on the patch.
Thank you!
--
Yugo NAGATA <nagata@sraoss.co.jp>
On 2021/06/17 1:23, Yugo NAGATA wrote:
I attached the v2 patch to clarify that I withdrew the v3 patch.
Thanks for the patch!
+ * For very unrealistic rates under -T, some skipped
+ * transactions are not counted because the catchup
+ * loop is not fast enough just to do the scheduling
+ * and counting at the expected speed.
+ *
+ * We do not bother with such a degenerate case.
+ */
ISTM that the patch changes pgbench so that it can skip counting
some skipped transactions here even for realistic rates under -T.
Of course, that would happen very rarely. Is this understanding right?
On the other hand, even without the patch, there seems to be
no guarantee in the first place that all the skipped transactions are counted under -T.
When the timer is exceeded in CSTATE_END_TX, a client ends without
checking outstanding skipped transactions. Therefore the "issue" that
some skipped transactions are not counted is not one the patch newly introduces.
So the behavior change made by the patch would be acceptable.
Is this understanding right?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hello Fujii-san,
ISTM that the patch changes pgbench so that it can skip counting
some skipped transactions here even for realistic rates under -T.
Of course, that would happen very rarely. Is this understanding right?
Yes. The point is to get out of the scheduling loop when time has expired,
as soon as it is known, instead of looping there for some possibly long time.
On the other hand, even without the patch, there seems to be
no guarantee in the first place that all the skipped transactions are counted under -T.
When the timer is exceeded in CSTATE_END_TX, a client ends without
checking outstanding skipped transactions.
Indeed. But that should be very few transactions under latency limit.
Therefore the "issue" that some skipped transactions are not counted is
not one the patch newly introduces.
Yep. The patch counts fewer of them though, because of the early exit
introduced in the patch in the scheduling state. Before, it could be stuck
in the "while (late) { count; schedule; }" loop.
So the behavior change made by the patch would be acceptable. Is this
understanding right?
I think so.
--
Fabien.
On 2021/09/04 15:27, Fabien COELHO wrote:
Hello Fujii-san,
ISTM that the patch changes pgbench so that it can skip counting
some skipped transactions here even for realistic rates under -T.
Of course, that would happen very rarely. Is this understanding right?
Yes. The point is to get out of the scheduling loop when time has expired, as soon as it is known, instead of looping there for some possibly long time.
Thanks for checking my understanding!
+ * For very unrealistic rates under -T, some skipped
+ * transactions are not counted because the catchup
+ * loop is not fast enough just to do the scheduling
+ * and counting at the expected speed.
+ *
+ * We do not bother with such a degenerate case.
So this comment is a bit misleading? What about updating this as follows?
------------------------------
Stop counting skipped transactions under -T as soon as the timer is exceeded.
Because otherwise it can take a very long time to count all of them especially
when quite a lot of them happen with unrealistically high rate setting in -R,
which would prevent pgbench from ending immediately. Because of this behavior,
note that there is no guarantee that all skipped transactions are counted
under -T though there is under -t. This is OK in practice because it's very
unlikely to happen with realistic setting.
------------------------------
So the behavior change made by the patch would be acceptable. Is this understanding right?
I think so.
+1
One question is: which version do we want to back-patch to?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hello Fujii-san,
Stop counting skipped transactions under -T as soon as the timer is
exceeded. Because otherwise it can take a very long time to count all of
them especially when quite a lot of them happen with unrealistically
high rate setting in -R, which would prevent pgbench from ending
immediately. Because of this behavior, note that there is no guarantee
that all skipped transactions are counted under -T though there is under
-t. This is OK in practice because it's very unlikely to happen with
realistic setting.
Ok, I find this text quite clear.
One question is: which version do we want to back-patch to?
If we consider it a "very minor bug fix" which is triggered by somewhat
unrealistic options, I'd say 14 & dev, or possibly only dev.
--
Fabien.
On 2021/09/07 18:24, Fabien COELHO wrote:
Hello Fujii-san,
Stop counting skipped transactions under -T as soon as the timer is exceeded. Because otherwise it can take a very long time to count all of them especially when quite a lot of them happen with unrealistically high rate setting in -R, which would prevent pgbench from ending immediately. Because of this behavior, note that there is no guarantee that all skipped transactions are counted under -T though there is under -t. This is OK in practice because it's very unlikely to happen with realistic setting.
Ok, I find this text quite clear.
Thanks for the check! So attached is the updated version of the patch.
One question is: which version do we want to back-patch to?
If we consider it a "very minor bug fix" which is triggered by somewhat unrealistic options, I'd say 14 & dev, or possibly only dev.
Agreed. Since it's hard to imagine the issue happening in practice,
we don't need to bother back-patching to the stable branches.
So I'm thinking of committing the patch to 15dev and 14.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments:
pgbench-stuck-3.patch (text/plain; charset=UTF-8)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4c9952a85a..433abd954b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3233,31 +3233,36 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
/*
* If --latency-limit is used, and this slot is already late
* so that the transaction will miss the latency limit even if
- * it completed immediately, skip this time slot and schedule
- * to continue running on the next slot that isn't late yet.
- * But don't iterate beyond the -t limit, if one is given.
+ * it completed immediately, skip this time slot and loop to
+ * reschedule.
*/
if (latency_limit)
{
pg_time_now_lazy(&now);
- while (thread->throttle_trigger < now - latency_limit &&
- (nxacts <= 0 || st->cnt < nxacts))
+ if (thread->throttle_trigger < now - latency_limit)
{
processXactStats(thread, st, &now, true, agg);
- /* next rendez-vous */
- thread->throttle_trigger +=
- getPoissonRand(&thread->ts_throttle_rs, throttle_delay);
- st->txn_scheduled = thread->throttle_trigger;
- }
- /*
- * stop client if -t was exceeded in the previous skip
- * loop
- */
- if (nxacts > 0 && st->cnt >= nxacts)
- {
- st->state = CSTATE_FINISHED;
+ /*
+ * Finish client if -T or -t was exceeded.
+ *
+ * Stop counting skipped transactions under -T as soon
+ * as the timer is exceeded. Because otherwise it can
+ * take a very long time to count all of them
+ * especially when quite a lot of them happen with
+ * unrealistically high rate setting in -R, which
+ * would prevent pgbench from ending immediately.
+ * Because of this behavior, note that there is no
+ * guarantee that all skipped transactions are counted
+ * under -T though there is under -t. This is OK in
+ * practice because it's very unlikely to happen with
+ * realistic setting.
+ */
+ if (timer_exceeded || (nxacts > 0 && st->cnt >= nxacts))
+ st->state = CSTATE_FINISHED;
+
+ /* Go back to top of loop with CSTATE_PREPARE_THROTTLE */
break;
}
}
On 2021/09/08 23:40, Fujii Masao wrote:
Agreed. Since it's hard to imagine the issue happening in practice,
we don't need to bother back-patching to the stable branches.
So I'm thinking of committing the patch to 15dev and 14.
Pushed. Thanks!
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION