Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)
<Bounced from commits to have it on hackers>
Hello Tom,
Well, I think it's mostly about valgrind making everything really slow. Since
we have seen some passes from skink recently, perhaps there was also a
component of more-load-on-the-machine-than-usual. But in the end this is
just evidence for my point that regression tests have to be very much not
timing-sensitive. We run them under all kinds of out-of-the-ordinary stress.
Attached is an attempt at improving the situation:
(1) there are added comments to explain the whys, which is to provide
coverage for pgbench time-related features... while still not being
time-sensitive, which is a challenge.
(2) the test now only expects "progress: \d" from the output, so it is enough
that one progress is shown, whenever it is shown.
(3) if the test is detected to have gone AWOL, detailed log checks are
coldly skipped.
This would have passed on "skink" under the special conditions it encountered.
I cannot guarantee that it would pass under all circumstances, though.
If it still encounters a failure, ISTM that it should only be a missing
"progress:" in the output, which has not been encountered so far.
If it occurs, a few options would remain, none of them very convincing:
- give the test some more time, eg 3 seconds (hmmm... could still fail
after any time...)
- drop the "progress:" expectation (hmmm... but then it does not check
anything).
- make the "progress:" output check conditional on the running time
(hmmm... it would require changing the command_checks_all function,
and there is still a chance that the bench was stuck for 2 seconds
then given time to realize that it has to stop right now...)
- use an even simpler transaction, eg "SELECT;" which is less likely to
get stuck (but still could get stuck...).
For the randomness-related test (--sample-rate), we could add a feature to
pgbench that allows controlling the random seed, so that the number of
samples could be known in advance for testing purposes.
--
Fabien.
Attachments:
pgbench-tap-fix-4.patch (text/x-diff, charset=us-ascii, +46 -3)
On Sun, Sep 24, 2017 at 11:30 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Attached is an attempt at improving the situation:
(1) there are added comments to explain the whys, which is to provide
coverage for pgbench time-related features... while still not being
time-sensitive, which is a challenge.
(2) the test now only expects "progress: \d" from the output, so it is
enough that one progress is shown, whenever it is shown.
(3) if the test is detected to have gone AWOL, detailed log checks are
coldly skipped.
This would have passed on "skink" under the special conditions it
encountered.
I cannot guarantee that it would pass under all circumstances, though.
If it still encounters a failure, ISTM that it should only be a missing
"progress:" in the output, which has not been encountered so far.
If it occurs, a few options would remain, none of them very convincing:
- give the test some more time, eg 3 seconds (hmmm... could still fail
after any time...)
- drop the "progress:" expectation (hmmm... but then it does not check
anything).
- make the "progress:" output check conditional on the running time
(hmmm... it would require changing the command_checks_all function,
and there is still a chance that the bench was stuck for 2 seconds
then given time to realize that it has to stop right now...)
- use an even simpler transaction, eg "SELECT;" which is less likely to
get stuck (but still could get stuck...).
For the randomness-related test (--sample-rate), we could add a feature to
pgbench that allows controlling the random seed, so that the number of
samples could be known in advance for testing purposes.
This didn't get any reviews, so bumped to CF 2018-01.
--
Michael
Fabien COELHO <coelho@cri.ensmp.fr> writes:
Attached is an attempt at improving the situation:
I took a quick look at this and can't really convince myself that it
improves our situation. The fact that it prints nothing if the runtime
is outside of a tightly constrained range seems likely to hide the sort
of bug you might hope such a test would detect. A simple example is
that, if there's an off-by-one bug causing the test to run for 3 seconds
not 2, this test script won't complain about it. Worse, if pgbench dumps
core instantly on startup when given -p, this test script won't complain
about it. And the test is of no value at all on very slow buildfarm
critters such as valgrind or clobber-cache-always animals.
(2) the test now only expects "progress: \d" from the output, so it is enough
that one progress is shown, whenever it is shown.
Hm. Could we get somewhere by making the test look for that, and
adjusting the loop logic inside pgbench so that (maybe only with the
tested switch values) it's guaranteed to print at least one progress
output regardless of timing, because it won't check for exit until after
it's printed a log message?
For the random-ness related test (--sample-rate), we could add a feature to
pgbench which allows to control the random seed, so that the number of samples
could be known in advance for testing purposes.
Don't see how locking down the seed gets us anywhere. The behavior of
random() can't be assumed identical across different machines, even
with the same seed.
regards, tom lane
Hello Tom,
Thanks for having a look at this attempt.
Attached is an attempt at improving the situation:
I took a quick look at this and can't really convince myself that it
improves our situation. The fact that it prints nothing if the runtime
is outside of a tightly constrained range seems likely to hide the sort
of bug you might hope such a test would detect. A simple example is
that, if there's an off-by-one bug causing the test to run for 3 seconds
not 2, this test script won't complain about it. Worse, if pgbench dumps
core instantly on startup when given -p, this test script won't complain
about it. And the test is of no value at all on very slow buildfarm
critters such as valgrind or clobber-cache-always animals.
Hmmm.
Note that an instant core dump does not fall under "too slow" and is caught
anyway, because pgbench's return status is checked, and I expect it would
not be zero after a core dump.
Alas, testing time-related features with zero assumptions about time is
quite difficult:-)
The compromise I'm proposing is to skip time-related stuff if too slow.
The value I see is that it provides coverage for all time-related
features. I can also add a check that the run is *at least* 2 seconds when
two seconds are asked for, because otherwise something was certainly
amiss.
However it cannot do the impossible, i.e. check that something happens
every second if we cannot assume that some cycles are spared for the
process within a second. There is no miracle to expect.
(2) the test now only expects "progress: \d" from the output, so it is enough
that one progress is shown, whenever it is shown.
Hm. Could we get somewhere by making the test look for that, and
adjusting the loop logic inside pgbench so that (maybe only with the
tested switch values) it's guaranteed to print at least one progress
output regardless of timing, because it won't check for exit until after
it's printed a log message?
I'll look into ensuring structurally that at least one progress is shown.
For the randomness-related test (--sample-rate), we could add a feature to
pgbench that allows controlling the random seed, so that the number of
samples could be known in advance for testing purposes.
Don't see how locking down the seed gets us anywhere. The behavior of
random() can't be assumed identical across different machines, even
with the same seed.
Indeed, I surmised in between that expecting some portability from
random() is naïve. Controlling the seed could still be useful for testing
though, so I have submitted a patch for that.
--
Fabien.
Hello Tom,
From my previous answer, to motivate these tests:
The compromise I'm proposing is to skip time-related stuff if too slow. The
value I see is that it provides coverage for all time-related features. I can
also add a check that the run is *at least* 2 seconds when two seconds are
asked for, because otherwise something was certainly amiss.
Also,
Hm. Could we get somewhere by making the test look for that, and
adjusting the loop logic inside pgbench so that (maybe only with the
tested switch values) it's guaranteed to print at least one progress
output regardless of timing, because it won't check for exit until after
it's printed a log message?
I'll look into ensuring structurally that at least one progress is shown.
Now pgbench prints a progress report if none were printed, or if the last
progress was over 0.5 seconds ago, so as to have a kind of catch-up at the
end. The progress report generation is moved into a separate function,
which is an improvement of its own for the readability of threadRun.
Also, I have added a slight behavioral change when under tap testing
(through an environment variable) to avoid the end of processing shortcut
when there is nothing to do. This ensures that the -T 2 tap test runs for
at least 2 seconds, whatever happens. If the host is overloaded it might be
more, but it cannot be less unless something went wrong.
All that is not perfect, but ISTM that having some minimal coverage of
time-related features is worth the compromise.
--
Fabien.
Attachments:
pgbench-tap-progress-2.patch (text/x-diff, +177 -86)
Minor rebase after vpath fix (e94f2bc809a0c684185666f19d81f6496e732a3a).
--
Fabien.
Attachments:
pgbench-tap-progress-3.patch (text/plain, +174 -85)
On 18/01/18 12:26, Fabien COELHO wrote:
Hm. Could we get somewhere by making the test look for that, and
adjusting the loop logic inside pgbench so that (maybe only with the
tested switch values) it's guaranteed to print at least one progress
output regardless of timing, because it won't check for exit until after
it's printed a log message?
I'll look into ensuring structurally that at least one progress is shown.
Now pgbench prints a progress report if none were printed, or if the last
progress was over 0.5 seconds ago, so as to have a kind of catch-up at the
end.
I don't understand the 0.5 second rule. For the tests, we only need to
ensure that at least one progress report is printed, right?
Looking at the code as it exists, I think it already works like that,
although it's by accident. Not sure though, and if we're going to rely
on that, it makes sense to make it more explicit.
The progress report generation is moved into a separate function,
which is an improvement of its own for the readability of threadRun.
Agreed on that.
Also, I have added a slight behavioral change when under tap testing
(through an environment variable) to avoid the end of processing shortcut
when there is nothing to do. This ensures that the -T 2 tap test runs for
at least 2 seconds, whatever happens. If the host is overloaded it might be
more, but it cannot be less unless something went wrong.
If you want to write a test that checks that a two-second test takes at
least two seconds, can't you just not use throttling in that test?
- Heikki
Hello Heikki,
Thanks for having a look at this small patch, which aims at improving
pgbench coverage.
Now pgbench prints a progress report if none were printed, or if the last
progress was over 0.5 seconds ago, so as to have a kind of catch-up at the
end.
I don't understand the 0.5 second rule. For the tests, we only need to
ensure that at least one progress report is printed, right?
I'm not so sure;-) I do not want to trust the threadRun loop in case of
heavy load or whatever issue a run may encounter, so this approach ensures
that structurally there is always one output even if the whole loop went
very wrong.
It also adds a small feature: there is always a final progress report when
the run is completed, which can be useful when computing progress
statistics; otherwise some transactions might not be reported in any
progress at all.
Looking at the code as it exists, I think it already works like that,
although it's by accident. Not sure though, and if we're going to rely on
that, it makes sense to make it more explicit.
Yep.
Also, by default, when running under throttling for 2 seconds at 20 tps,
there is a slight chance that no transactions are scheduled and that
pgbench returns immediately, hence the added variable to ensure that it
lasts for some time anyway, and that some minimal reporting is always done
regardless.
when there is nothing to do. This ensures that the -T 2 tap test runs for
at least 2 seconds, whatever happens. If the host is overloaded it might be
more, but it cannot be less unless something went wrong.
If you want to write a test that checks that a two-second test takes at
least two seconds, can't you just not use throttling in that test?
Indeed… but then throttling would not be tested:-) The point of the test
is to exercise all time-related options, including throttling with a
reasonably small value.
Attached is a rebase.
--
Fabien.
Attachments:
pgbench-tap-progress-4.patch (text/plain, +174 -86)
On 12/07/18 19:00, Fabien COELHO wrote:
Now pgbench prints a progress report if none were printed, or if the last
progress was over 0.5 seconds ago, so as to have a kind of catch-up at the
end.
I don't understand the 0.5 second rule. For the tests, we only need to
ensure that at least one progress report is printed, right?
I'm not so sure;-) I do not want to trust the threadRun loop in case of
heavy load or whatever issue a run may encounter, so this approach ensures
that structurally there is always one output even if the whole loop went
very wrong.
I still don't understand.
For the testing, we just need to make sure that at least one progress
report is always printed, if -P is used. Right? So where does the 0.5
second rule come in? Can't we just do "if (no progress reports were
printed) { print progress report; }" at the end?
It also adds a small feature which is that there is always a final
progress when the run is completed, which can be useful when computing
progress statistics, otherwise some transactions could not be reported in
any progress at all.
Any transactions in the last 0.5 seconds might still not get reported in
any progress reports.
when there is nothing to do. This ensures that the -T 2 tap test runs for
at least 2 seconds, whatever happens. If the host is overloaded it might be
more, but it cannot be less unless something went wrong.
If you want to write a test that checks that a two-second test takes at
least two seconds, can't you just not use throttling in that test?
Indeed… but then throttling would not be tested:-) The point of the test
is to exercise all time-related options, including throttling with a
reasonably small value.
Ok. I don't think that's really worthwhile. If we add some code that
only runs in testing, then we're not really testing the real thing. I
wouldn't trust the test to tell much. Let's just leave out that magic
environment variable thing, and try to get the rest of the patch finished.
- Heikki
I don't understand the 0.5 second rule. For the tests, we only need to
ensure that at least one progress report is printed, right?
[...]
I still don't understand.
Let's look at the code:
    if (progress && thread->tid == 0)
    {
        ...
        if (last_report == thread_start || now - last_report >= 500000)
            doProgressReport(thread, &last, &last_report, now, thread_start);
For the testing, we just need to make sure that at least one progress report
is always printed, if -P is used. Right?
Yep. That is the first condition above: last_report is set to thread_start,
meaning that there has been no report yet.
So where does the 0.5 second rule come in? Can't we just do "if (no
progress reports were printed) { print progress report; }" at the end?
The second, 0.5s condition is to print a closing report if some significant
time has passed since the last one, but we do not want to print two reports
within the same second.
pgbench -T 5 -P 2
would then print reports at 2, 4 and 5. The 0.5s threshold ensures that we
are not going to print "2 4.0[00] 4.0[01]" on -t whatever -P 2, which would
look silly.
If you do not like it then the second condition can be removed, fine with
me.
It also adds a small feature which is that there is always a final
progress when the run is completed, which can be useful when computing
progress statistics, otherwise some transactions could not be reported in
any progress at all.
Any transactions in the last 0.5 seconds might still not get reported in
any progress reports.
Yep. I wanted a reasonable threshold, given that both -T and -P are in
seconds anyway, so it probably could only happen with -t.
Indeed… but then throttling would not be tested:-) The point of the test
is to exercise all time-related options, including throttling with a
reasonably small value.
Ok. I don't think that's really worthwhile. If we add some code that only
runs in testing, then we're not really testing the real thing. I wouldn't
trust the test to tell much. Let's just leave out that magic environment
variable thing, and try to get the rest of the patch finished.
If you remove the environment variable, then some checks need to be
removed, because the 2 second run may be randomly shortened when there is
nothing to do. If not, the test will fail nondeterministically, which is
not acceptable. Hence the hack. I agree that it is not beautiful.
The more reasonable alternative could be to always last 2 seconds under
-T 2, even if the execution could be shortened because there is nothing to
do at all, i.e. remove the environment-based condition but keep the sleep.
--
Fabien.
Indeed… but then throttling would not be tested:-) The point of the test
is to exercise all time-related options, including throttling with a
reasonably small value.
Ok. I don't think that's really worthwhile. If we add some code that only
runs in testing, then we're not really testing the real thing. I wouldn't
trust the test to tell much. Let's just leave out that magic environment
variable thing, and try to get the rest of the patch finished.
If you remove the environment variable, then some checks need to be
removed, because the 2 second run may be randomly shortened when there is
nothing to do. If not, the test will fail nondeterministically, which is
not acceptable. Hence the hack. I agree that it is not beautiful.
The more reasonable alternative could be to always last 2 seconds under
-T 2, even if the execution could be shortened because there is nothing to
do at all, i.e. remove the environment-based condition but keep the sleep.
Yet another option would be to replace the env variable with an option, eg
"--strict-time", that would probably be used only by the TAP test, but
would be an available feature.
--
Fabien.
On 12/07/18 21:27, Fabien COELHO wrote:
For the testing, we just need to make sure that at least one progress
report is always printed, if -P is used. Right?
Yep. That is the first condition above: last_report is set to thread_start,
meaning that there has been no report yet.
So where does the 0.5 second rule come in? Can't we just do "if (no
progress reports were printed) { print progress report; }" at the end?
The second, 0.5s condition is to print a closing report if some significant
time has passed since the last one, but we do not want to print two reports
within the same second.
pgbench -T 5 -P 2
would then print reports at 2, 4 and 5. The 0.5s threshold ensures that we
are not going to print "2 4.0[00] 4.0[01]" on -t whatever -P 2, which would
look silly.
If you do not like it then the second condition can be removed, fine with
me.
As the code stands, you would get reports at "2 4.0[00]", right? Let's
keep it that way. I think the only change we need to make in the logic
is to check at the end, if *any* progress reports at all have been
printed, and print one if not. And do that only when the -P option is
smaller than the -T option, I suppose.
It also adds a small feature which is that there is always a final
progress when the run is completed, which can be useful when computing
progress statistics, otherwise some transactions could not be reported in
any progress at all.
Any transactions in the last 0.5 seconds might still not get reported in
any progress reports.
Yep. I wanted a reasonable threshold, given that both -T and -P are in
seconds anyway, so it probably could only happen with -t.
Oh. I'm a bit surprised we don't support decimals, i.e. -P 0.5.
Actually, it seems to be accepted, but it's truncated down to the nearest
integer. That's not very nice :-(. But it's a separate issue.
Indeed… but then throttling would not be tested:-) The point of the test
is to exercise all time-related options, including throttling with a
reasonably small value.
Ok. I don't think that's really worthwhile. If we add some code that only
runs in testing, then we're not really testing the real thing. I wouldn't
trust the test to tell much. Let's just leave out that magic environment
variable thing, and try to get the rest of the patch finished.
If you remove the environment variable, then some checks need to be
removed, because the 2 second run may be randomly shortened when there is
nothing to do. If not, the test will fail nondeterministically, which is
not acceptable. Hence the hack. I agree that it is not beautiful.
The more reasonable alternative could be to always last 2 seconds under
-T 2, even if the execution could be shortened because there is nothing to
do at all, i.e. remove the environment-based condition but keep the sleep.
That sounds reasonable. It's a bit silly to wait when there's nothing to
do, but it's also weird if the test exits before the specified time is
up. Seems less surprising to always sleep.
- Heikki
Hello Heikki,
[...] Let's keep it that way. I think the only change we need to make in
the logic is to check at the end, if *any* progress reports at all have
been printed, and print one if not.
Ok, this simplifies the condition.
And do that only when the -P option is smaller than the -T option, I
suppose.
Yep, why not.
Oh. I'm a bit surprised we don't support decimals, i.e. -P 0.5.
Actually, it seems to be accepted, but it's truncated down to the nearest
integer.
Indeed, "atoi" does not detect errors, and that is true of its many uses
in pgbench: clients, threads, scale, duration, fillfactor...
That's not very nice :-(. But it's a separate issue.
Yep. For human consumption, seconds seem okay.
The more reasonable alternative could be to always last 2 seconds under
-T 2, even if the execution could be shortened because there is nothing to
do at all, i.e. remove the environment-based condition but keep the sleep.
That sounds reasonable. It's a bit silly to wait when there's nothing to
do, but it's also weird if the test exits before the specified time is up.
Seems less surprising to always sleep.
I did that in the attached version: no more environment variable hack, and
no execution shortcut even if there is nothing to do.
I also had to replicate the progress logic so that (no-)progress reports
keep being printed during this tailing phase.
I'm not very happy because it is a change of behavior. I suggest that I
could add a special "--strict-time-compliance" option to do this only when
required... and it would only be required by the TAP tests.
--
Fabien.
Attachments:
pgbench-tap-progress-5.patch (text/plain, +181 -87)
On 18/07/18 01:43, Fabien COELHO wrote:
The more reasonable alternative could be to always last 2 seconds under
-T 2, even if the execution could be shortened because there is nothing to
do at all, i.e. remove the environment-based condition but keep the sleep.
That sounds reasonable. It's a bit silly to wait when there's nothing to
do, but it's also weird if the test exits before the specified time is up.
Seems less surprising to always sleep.
I did that in the attached version: no more environment variable hack, and
no execution shortcut even if there is nothing to do.
I also had to replicate the progress logic so that (no-)progress reports
keep being printed during this tailing phase.
On second thoughts, there's one problem with this approach of always
waiting until -T is up. What if all the threads died because of errors?
For example:
pgbench postgres -T 10 -P 1 -c 2
starting vacuum...end.
client 0 aborted in command 5 (SQL) of script 0; ERROR: relation
"pgbench_accounts" does not exist
LINE 1: UPDATE pgbench_accounts SET abalance = abalance + -4963 WHER...
^
client 1 aborted in command 5 (SQL) of script 0; ERROR: relation
"pgbench_accounts" does not exist
LINE 1: UPDATE pgbench_accounts SET abalance = abalance + -2985 WHER...
^
progress: 1.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 4.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 5.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 6.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 7.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 8.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 9.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 10.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
duration: 10 s
number of transactions actually processed: 0
I don't think you want to wait in that situation. I think we should wait
at the end only if there are some threads still alive, with nothing to do
only because of --rate.
- Heikki
Hello Heikki,
I did that in the attached version: no more environment variable hack, and
no execution shortcut even if there is nothing to do.
I also had to replicate the progress logic so that (no-)progress reports
keep being printed during this tailing phase.
On second thoughts, there's one problem with this approach of always
waiting until -T is up. What if all the threads died because of errors?
For example:
Good corner-case catch! This behavior is indeed silly.
I don't think you want to wait in that situation. I think we should wait
at the end only if there are some threads still alive, with nothing to do
only because of --rate.
Yep. The attached version does the tailing stuff only under -R and only
when not all threads were stopped on errors, with comments to explain why.
I'm still wondering about a specific option to explicitly require this
behavioral change.
--
Fabien.
Attachments:
pgbench-tap-progress-6.patch (text/plain, +191 -87)
On 18/07/18 16:01, Fabien COELHO wrote:
I don't think you want to wait in that situation. I think we should wait
at the end only if there are some threads still alive, with nothing to do
only because of --rate.
Yep. The attached version does the tailing stuff only under -R and only
when not all threads were stopped on errors, with comments to explain why.
Hmm. How about we just remove this special case from doCustom():
    case CSTATE_START_THROTTLE:
        ...
        /*
         * stop client if next transaction is beyond pgbench end of
         * execution
         */
        if (duration > 0 && st->txn_scheduled > end_time)
        {
            st->state = CSTATE_FINISHED;
            break;
        }
That way, we let the client go into CSTATE_THROTTLE state, even though
we know that the timer will run out before we reach txn_scheduled. Then
it will work the way we want, right? One small difference is that then
the clients will keep the connections open longer, until the timer
expires, but I think that's reasonable. Less surprising than the current
behavior, even.
- Heikki
Hello Heikki,
Yep. The attached version does the tailing stuff only under -R and only
when not all threads were stopped on errors, with comments to explain why.
Hmm. How about we just remove this special case from doCustom():

    case CSTATE_START_THROTTLE:
        // ...
        if (duration > 0 && st->txn_scheduled > end_time)
        {
            st->state = CSTATE_FINISHED;
            break;
        }

That way, we let the client go into CSTATE_THROTTLE state, even though we
know that the timer will run out before we reach txn_scheduled. Then it
will work the way we want, right? One small difference is that then the
clients will keep the connections open longer, until the timer expires,
but I think that's reasonable. Less surprising than the current behavior,
even.
Hmmm... in this instance, and if this test is removed, ISTM that it can
start the transaction, re-establishing a connection under --connect, and
the transaction will run to its end even if it is beyond the expected end
of run. So removing this test does not seem desirable.
--
Fabien.
On 18/07/18 22:56, Fabien COELHO wrote:
Hello Heikki,
Yep. The attached version does the tailing stuff only under -R and only
when not all threads were stopped on errors, with comments to explain why.
Hmm. How about we just remove this special case from doCustom():

    case CSTATE_START_THROTTLE:
        // ...
        if (duration > 0 && st->txn_scheduled > end_time)
        {
            st->state = CSTATE_FINISHED;
            break;
        }

That way, we let the client go into CSTATE_THROTTLE state, even though we
know that the timer will run out before we reach txn_scheduled. Then it
will work the way we want, right? One small difference is that then the
clients will keep the connections open longer, until the timer expires,
but I think that's reasonable. Less surprising than the current behavior,
even.
Hmmm... in this instance, and if this test is removed, ISTM that it can
start the transaction, re-establishing a connection under --connect, and
the transaction will run to its end even if it is beyond the expected end
of run. So removing this test does not seem desirable.
Can you elaborate? I don't think that's how it works. In threadRun(), we
have this:
    for (i = 0; i < nstate; i++)
    {
        CState *st = &state[i];

        if (st->state == CSTATE_THROTTLE && timer_exceeded)
        {
            /* interrupt client that has not started a transaction */
            st->state = CSTATE_FINISHED;
            finishCon(st);
            remains--;
        }
        else if (st->state == CSTATE_SLEEP || st->state == CSTATE_THROTTLE)
            ...
As soon as the -T timer is exceeded, the above code will close all
connections that are in CSTATE_THROTTLE state.
- Heikki
Hmm. How about we just remove this special case from doCustom():

    case CSTATE_START_THROTTLE:
        // ...
        if (duration > 0 && st->txn_scheduled > end_time)
        {
            st->state = CSTATE_FINISHED;
            break;
        }

That way, we let the client go into CSTATE_THROTTLE state, even though
we know that the timer will run out before we reach txn_scheduled.
Then it will work the way we want, right? One small difference is that
then the clients will keep the connections open longer, until the
timer expires, but I think that's reasonable. Less surprising than the
current behavior, even.
Hmmm... in this instance, and if this test is removed, ISTM that it can
start the transaction, re-establishing a connection under --connect, and
the transaction will run to its end even if it is beyond the expected end
of run. So removing this test does not seem desirable.
Can you elaborate? I don't think that's how it works. In threadRun(), we
have this:
The state path I want to avoid, without getting out of doCustom, is:

    CHOOSE_SCRIPT:
        -> START_THROTTLE (i.e. under -R)
    START_THROTTLE:
        update txn_scheduled, which happens to be after end_time,
        i.e. the transaction is scheduled after the expected end of the
        run, but if the condition is removed, then proceed directly to
        -> THROTTLE
    THROTTLE:
        if now has passed txn_scheduled (we are late), no return, proceed
        directly to -> START_TX
    START_TX:
        -> START_COMMAND
    START_COMMAND:
        executed... & "return" on SQL, but it is too late for stopping
        the tx now, it has started.
Maybe I missed something, but it looks to me that we can get into a
situation where a transaction scheduled beyond the end of the run is
started anyway if it happens to be a little late.
[... threadRun ...]
As soon as the -T timer is exceeded, the above code will close all
connections that are in CSTATE_THROTTLE state.
So threadRun() would not have the opportunity to stop the scheduled
transaction, even if beyond the end of the run, because we would not have
got out of doCustom in the case I outlined above.
--
Fabien.
On 18/07/18 23:29, Fabien COELHO wrote:
Hmm. How about we just remove this special case from doCustom():

    case CSTATE_START_THROTTLE:
        // ...
        if (duration > 0 && st->txn_scheduled > end_time)
        {
            st->state = CSTATE_FINISHED;
            break;
        }

That way, we let the client go into CSTATE_THROTTLE state, even though
we know that the timer will run out before we reach txn_scheduled.
Then it will work the way we want, right? One small difference is that
then the clients will keep the connections open longer, until the
timer expires, but I think that's reasonable. Less surprising than the
current behavior, even.
Hmmm... in this instance, and if this test is removed, ISTM that it can
start the transaction, re-establishing a connection under --connect, and
the transaction will run to its end even if it is beyond the expected end
of run. So removing this test does not seem desirable.
Can you elaborate? I don't think that's how it works. In threadRun(), we
have this:
The state path I want to avoid, without getting out of doCustom, is:

    CHOOSE_SCRIPT:
        -> START_THROTTLE (i.e. under -R)
    START_THROTTLE:
        update txn_scheduled, which happens to be after end_time,
        i.e. the transaction is scheduled after the expected end of the
        run, but if the condition is removed, then proceed directly to
        -> THROTTLE
    THROTTLE:
        if now has passed txn_scheduled (we are late), no return, proceed
        directly to -> START_TX
    START_TX:
        -> START_COMMAND
    START_COMMAND:
        executed... & "return" on SQL, but it is too late for stopping
        the tx now, it has started.

Maybe I missed something, but it looks to me that we can get into a
situation where a transaction scheduled beyond the end of the run is
started anyway if it happens to be a little late.
[... threadRun ...]
As soon as the -T timer is exceeded, the above code will close all
connections that are in CSTATE_THROTTLE state.
So threadRun() would not have the opportunity to stop the scheduled
transaction, even if beyond the end of the run, because we would not have
got out of doCustom in the case I outlined above.
I see. Instead of moving to the FINISHED state, then, we could stay in
THROTTLE state and 'return' out of doCustom(), to give the code in
threadRun() a chance to detect that the timer is up. Something like the
attached. (I moved the check after the check for latency_limit, because
that code updates txn_scheduled. That seems like a more correct place,
although that's a separate issue.)
- Heikki