pgbench - minor fix for meta command only scripts
While testing meta-command pgbench only scripts, I noticed that there is
an infinite loop in threadRun, which means that other tasks such as
reporting progress do not get a chance.
The attached patch breaks this loop by always returning at the end of a
script.
On "pgbench -T 3 -P 1 -f noop.sql", before this patch, the progress is not
shown, after it is.
--
Fabien.
On Sat, Jul 9, 2016 at 4:09 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
While testing meta-command pgbench only scripts, I noticed that there is an
infinite loop in threadRun, which means that other tasks such as reporting
progress do not get a chance.
The attached patch breaks this loop by always returning at the end of a
script.
On "pgbench -T 3 -P 1 -f noop.sql", before this patch, the progress is not
shown, after it is.
You may want to name your patches with .patch or .diff. Using .sql is
disturbing style :)
Indeed, not reporting the progress back to the client in the case of a
script with only meta commands is non-intuitive.
- /* after a meta command, immediately proceed with next command */
- goto top;
+ /*
+ * After a meta command, immediately proceed with next command...
+ * although not if last. This exception ensures that a meta command
+ * only script does not always loop in doCustom, so that other tasks
+ * in threadRun, eg progress reporting or switching client,
+ * get a chance.
+ */
+ if (commands[st->state + 1] != NULL)
+ goto top;
This looks good to me. I'd just rewrite the comment block with
something like that, more simplified:
+ /*
+ * After a meta command, immediately proceed with next command.
+ * But if this is the last command, just leave.
+ */
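The control flow discussed above can be sketched in isolation. This is a minimal sketch and not the pgbench source: ClientSketch and do_custom_sketch are hypothetical names, and the meta command itself is reduced to a counter. It only shows the idea of proceeding immediately to the next command unless the one just run was the last, in which case the function returns so that the caller can do its other work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified client running a script of meta commands only. */
typedef struct
{
	const char **commands;	/* NULL-terminated list of meta commands */
	int			state;		/* index of the command to run next */
	int			completed;	/* number of completed script runs */
} ClientSketch;

/*
 * Run meta commands back to back, but return once the last command of the
 * script has been run. Returns the number of commands executed in this call.
 */
int
do_custom_sketch(ClientSketch *st)
{
	int			executed = 0;

	assert(st->commands[st->state] != NULL);

top:
	/* "execute" the current meta command (a no-op in this sketch) */
	executed++;

	if (st->commands[st->state + 1] != NULL)
	{
		/* not the last command: immediately proceed with the next one */
		st->state++;
		goto top;
	}

	/* last command: rewind the script and hand control back to the caller */
	st->state = 0;
	st->completed++;
	return executed;
}
```

With an unconditional "goto top" and a script that is restarted in place, this function would never return, which is the behavior reported at the start of the thread.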
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hello Michaël,
You may want to name your patches with .patch or .diff. Using .sql is
disturbing style :)
Indeed! :-)
Indeed, not reporting the progress back to the client in the case of a
script with only meta commands is non-intuitive.
This looks good to me. I'd just rewrite the comment block with
something like that, more simplified:
Ok. Here is an updated version, with a better suffix and a simplified
comment.
Thanks,
--
Fabien.
Attachments:
pgbench-no-sql-fix-2.patch (text/x-diff, +6 -2)
Fabien COELHO <coelho@cri.ensmp.fr> writes:
Ok. Here is an updated version, with a better suffix and a simplified
comment.
Doesn't this break the handling of latency calculations, or at least make
the results completely different for the last metacommand than what they
would be for a non-last command? It looks like it needs to loop back so
that the latency calculation is completed for the metacommand before it
can exit. Seems to me it would probably make more sense to fall out at
the end of the "transaction finished" if-block, around line 1923 in HEAD.
(The code structure in here seems like a complete mess to me, but probably
now is not the time to refactor it.)
regards, tom lane
Hello Tom,
Ok. Here is an updated version, with a better suffix and a simplified
comment.
Doesn't this break the handling of latency calculations, or at least make
the results completely different for the last metacommand than what they
would be for a non-last command? It looks like it needs to loop back so
that the latency calculation is completed for the metacommand before it
can exit. Seems to me it would probably make more sense to fall out at
the end of the "transaction finished" if-block, around line 1923 in HEAD.
Indeed, it would trouble a little bit the stats computation by delaying
the recording of the end of statement & transaction.
However line 1923 is a shortcut for ending pgbench, but at the end of a
transaction more stuff must be done, eg choosing the next script and
reconnecting, before exiting. The solution is more contrived.
The attached patch provides a solution which ensures the return in the
right condition and after the stat collection. The code structure requires
another ugly boolean to proceed so as to preserve doing the reconnection
between the decision that the return must be done and the place where it
can be done, after reconnecting.
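The "ugly boolean" pattern described above can be illustrated as follows. This is a sketch under stated assumptions, not the attached patch: TxSketch, end_of_command_sketch and the log strings are hypothetical, and the real stats and reconnection code is replaced by entries appended to a small log so that the ordering is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-client state for this sketch. */
typedef struct
{
	bool		last_command;	/* the command just run ended the script */
	bool		reconnect;		/* models reconnecting between transactions */
	char		log[64];		/* order in which the steps happened */
} TxSketch;

/* Returns true when control should go back to the caller. */
bool
end_of_command_sketch(TxSketch *tx)
{
	bool		must_return = false;

	assert(tx != NULL);
	tx->log[0] = '\0';

	/* the latency of the command is recorded first, in every case */
	strcat(tx->log, "stats;");

	if (tx->last_command)
	{
		/* the decision is taken here... */
		must_return = true;

		/* ...but end-of-transaction work still has to run before leaving */
		if (tx->reconnect)
			strcat(tx->log, "reconnect;");
	}

	if (must_return)
	{
		/* ...and only acted upon here, after stats and reconnection */
		strcat(tx->log, "return;");
		return true;
	}

	strcat(tx->log, "next;");
	return false;
}
```

The flag exists only to carry the decision across the intermediate work, which is why it reads as a workaround rather than as a clean structure.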
(The code structure in here seems like a complete mess to me, but probably
now is not the time to refactor it.)
I fully agree that the code structure is a total mess:-( Maybe I'll try to
submit a simpler one some day.
Basically the doCustom function is not resilient, you cannot exit from
anywhere and hope that re-entering would achieve a consistent behavior.
While reading the code to find a better place for a return, I noted some
possible inconsistencies in recording stats, which are noted as comments
in the attached patch.
Calling chooseScript is done both from outside for initialization and from
inside doCustom, where it could be done once and more clearly in doCustom.
Boolean listen is not reset because the script is expected to execute
directly the start of the next statement. I succeeded in convincing myself
that it actually works, but it is unobvious to spot why. I think that a
simpler pattern would be welcome. Also, some other things (eg prepared)
are not reset in all cases, not sure why.
The goto should probably be replaced by a while.
...
--
Fabien.
Attachments:
pgbench-latency-t-2.patch (text/x-diff, +15 -1)
The attached patch provides a solution which ensures the return in the right
condition and after the stat collection. The code structure requires another
ugly boolean to proceed so as to preserve doing the reconnection between the
decision that the return must be done and the place where it can be done,
after reconnecting.
Oops, the attached patch had the right content but the wrong name:-(
Here it is again with a consistent name.
Sorry for the noise.
--
Fabien.
Attachments:
pgbench-no-sql-fix-3.patch (text/x-diff, +15 -1)
On 07/13/2016 11:14 AM, Fabien COELHO wrote:
(The code structure in here seems like a complete mess to me, but probably
now is not the time to refactor it.)
I fully agree that the code structure is a total mess:-( Maybe I'll try to
submit a simpler one some day.
Basically the doCustom function is not resilient, you cannot exit from
anywhere and hope that re-entering would achieve a consistent behavior.
While reading the code to find a better place for a return, I noted some
possible inconsistencies in recording stats, which are noted as comments
in the attached patch.
Calling chooseScript is done both from outside for initialization and from
inside doCustom, where it could be done once and more clearly in doCustom.
Boolean listen is not reset because the script is expected to execute
directly the start of the next statement. I succeeded in convincing myself
that it actually works, but it is unobvious to spot why. I think that a
simpler pattern would be welcome. Also, some other things (eg prepared)
are not reset in all cases, not sure why.
The goto should probably be replaced by a while.
...
Yeah, it really is quite a mess. I tried to review your patch, and I
think it's correct, but I couldn't totally convince myself, because of
the existing messiness of the logic. So I bit the bullet and started
refactoring.
I came up with the attached. It refactors the logic in doCustom() into a
state machine. I think this is much clearer, what do you think?
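To make the idea of a state machine concrete, here is a minimal sketch. It is not the attached refactoring: the SK_* states, SketchClient and advance_sketch are hypothetical names chosen for illustration, and the actual work of each state is left out. What it shows is the shape: one explicit state variable, one switch, and a single well-defined place where the function returns.

```c
#include <assert.h>

/* Hypothetical states; the real patch defines its own set. */
typedef enum
{
	SK_CHOOSE_SCRIPT,
	SK_START_COMMAND,
	SK_END_COMMAND,
	SK_END_TX
} SketchState;

typedef struct
{
	SketchState state;
	int			command;	/* index of the current command */
	int			ncommands;	/* number of commands in the script */
	int			tx_done;	/* completed transactions */
} SketchClient;

/* Advance the client until one transaction has finished, then return. */
void
advance_sketch(SketchClient *c)
{
	assert(c->ncommands > 0);

	for (;;)
	{
		switch (c->state)
		{
			case SK_CHOOSE_SCRIPT:
				c->command = 0;
				c->state = SK_START_COMMAND;
				break;

			case SK_START_COMMAND:
				/* a real client would send a query or run a meta command */
				c->state = SK_END_COMMAND;
				break;

			case SK_END_COMMAND:
				/* per-command bookkeeping lives in exactly one place */
				c->command++;
				c->state = (c->command < c->ncommands)
					? SK_START_COMMAND : SK_END_TX;
				break;

			case SK_END_TX:
				c->tx_done++;
				c->state = SK_CHOOSE_SCRIPT;
				return;		/* the only exit: caller regains control */
		}
	}
}
```

Because every transition is an assignment to c->state, leaving and re-entering the function is safe: the next call resumes from whatever state was stored.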
@@ -1892,6 +1895,7 @@ top:
  /*
   * Read and discard the query result; note this is not included in
   * the statement latency numbers.
+  * Should this be done before recording the statement stats?
   */
  res = PQgetResult(st->con);
  switch (PQresultStatus(res))
Well, the comment right there says "note this is not included in the
statement latency numbers", so apparently it's intentional. Whether it's
a good idea or not, I don't know :-). It does seem a bit surprising.
But what seems more bogus to me is that we do that after recording the
*transaction* stats, if this was the last command. So the PQgetResult()
of the last command in the transaction is not included in the
transaction stats, even though the PQgetResult() calls for any previous
commands are. (Perhaps that's what you meant too?)
I changed that in my patch, it would've been inconvenient to keep that
old behavior, and it doesn't make any sense to me anyway.
- Heikki
Attachments:
refactor-pgbench-doCustom.patch (text/x-patch, +1078 -1061)
Hello Heikki,
Yeah, it really is quite a mess. I tried to review your patch, and I think
it's correct, but I couldn't totally convince myself, because of the existing
messiness of the logic.
Alas:-(
So I bit the bullet and started refactoring.
Wow!
I came up with the attached. It refactors the logic in doCustom() into a
state machine.
Sounds good! This can only help.
I think this is much clearer, what do you think?
I think that something was really needed. I'm going to review and test
this patch very carefully, probably over next week-end, and report.
--
Fabien.
Hello Heikki,
Yeah, it really is quite a mess. I tried to review your patch, and I think
it's correct, but I couldn't totally convince myself, because of the existing
messiness of the logic. So I bit the bullet and started refactoring.
I came up with the attached. It refactors the logic in doCustom() into a
state machine. I think this is much clearer, what do you think?
The patch did not apply to master because you committed the sleep fix
in between. I updated the patch so that the fix is included as well.
I think that this is really needed. The code is much clearer and simple to
understand with the state machines & additional functions. This is a
definite improvement to the code base.
I've done quite some testing with various options (-r, --rate,
--latency-limit, -C...) and got pretty reasonable results.
Although I cannot be absolutely sure that the refactoring does not
introduce any new bug, I'm convinced that it will be much easier to find
them:-)
Attached are some small changes to your version:
I have added the sleep_until fix.
I have fixed a bug introduced in the patch by changing && by || in the
(min_usec > 0 && maxsock != -1) condition which was inducing errors with
multi-threads & clients...
I have factored out several error messages in "commandFailed", in place of
the "metaCommandFailed", and added the script number as well in the error
messages. All messages are now specific to the failed command.
I have added two states to the machine:
- CSTATE_CHOOSE_SCRIPT which simplifies threadRun, there is now one call
to chooseScript instead of two before.
- CSTATE_END_COMMAND which manages is_latencies and proceeding to the
next command, thus merging the three instances of updating the stats
that were in the first version.
The latter state means that processing query results is included in the per
statement latency, which is an improvement because before I was getting
some transaction latency significantly larger than the apparent sum of the
per-statement latencies, which did not make much sense...
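The arithmetic behind that observation can be sketched as follows. The numbers used in any call are illustrative only and are not measurements from pgbench; StmtTiming and both functions are hypothetical names. The point is that if the time spent reading results is left out of each statement's latency but still elapses inside the transaction, the per-statement sum falls short of the transaction latency, and the gap closes once result processing is counted per statement as CSTATE_END_COMMAND does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical timing of one statement, in microseconds. */
typedef struct
{
	long		exec_us;	/* time until the result is available */
	long		result_us;	/* time spent reading and discarding the result */
} StmtTiming;

/* Sum of per-statement latencies, with or without result processing. */
long
sum_statement_latency(const StmtTiming *t, size_t n, int include_result)
{
	long		total = 0;

	assert(t != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += t[i].exec_us + (include_result ? t[i].result_us : 0);
	return total;
}

/* The transaction spans everything, result processing included. */
long
transaction_latency(const StmtTiming *t, size_t n)
{
	long		total = 0;

	assert(t != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += t[i].exec_us + t[i].result_us;
	return total;
}
```

This sketch ignores any time spent between statements, so in practice the two figures need not match exactly even when results are included.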
I have added & updated a few comments. There are some places where the
break could be a pass through instead, not sure how desirable it is, I'm
fine with break.
Well, the comment right there says "note this is not included in the
statement latency numbers", so apparently it's intentional. Whether it's a
good idea or not, I don't know :-). It does seem a bit surprising.
Indeed, it also results in apparently inconsistent numbers, and it creates
a mess for recording the statement latency because it meant that in some
case the latency was collected before the actual end of the command, see
the discussion about CSTATE_END_COMMAND above.
But what seems more bogus to me is that we do that after recording the
*transaction* stats, if this was the last command. So the PQgetResult() of
the last command in the transaction is not included in the transaction stats,
even though the PQgetResult() calls for any previous commands are. (Perhaps
that's what you meant too?)
I changed that in my patch, it would've been inconvenient to keep that old
behavior, and it doesn't make any sense to me anyway.
Fine with me.
--
Fabien.
Attachments:
pgbench-refactor-2.patch (text/x-diff, +677 -461)
On 09/24/2016 12:45 PM, Fabien COELHO wrote:
Although I cannot be absolutely sure that the refactoring does not
introduce any new bug, I'm convinced that it will be much easier to find
them:-)
:-)
Attached are some small changes to your version:
I have added the sleep_until fix.
I have fixed a bug introduced in the patch by changing && by || in the
(min_usec > 0 && maxsock != -1) condition which was inducing errors with
multi-threads & clients...
I have factored out several error messages in "commandFailed", in place of
the "metaCommandFailed", and added the script number as well in the error
messages. All messages are now specific to the failed command.
I have added two states to the machine:
- CSTATE_CHOOSE_SCRIPT which simplifies threadRun, there is now one call
to chooseScript instead of two before.
- CSTATE_END_COMMAND which manages is_latencies and proceeding to the
next command, thus merging the three instances of updating the stats
that were in the first version.
The latter state means that processing query results is included in the per
statement latency, which is an improvement because before I was getting
some transaction latency significantly larger than the apparent sum of the
per-statement latencies, which did not make much sense...
Ok. I agree that makes more sense.
I have added & updated a few comments.
Thanks! Committed.
There are some places where the break could be a pass through
instead, not sure how desirable it is, I'm fine with break.
I left them as "break". Pass-throughs are error-prone, and make it more
difficult to read, IMHO. The compiler will optimize it into a
pass-through anyway, if possible and worthwhile, so there should be no
performance difference.
- Heikki
On Mon, Sep 26, 2016 at 1:01 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 09/24/2016 12:45 PM, Fabien COELHO wrote:
Attached are some small changes to your version:
I have added the sleep_until fix.
I have fixed a bug introduced in the patch by changing && by || in the
(min_usec > 0 && maxsock != -1) condition which was inducing errors with
multi-threads & clients...
I have factored out several error messages in "commandFailed", in place of
the "metaCommandFailed", and added the script number as well in the error
messages. All messages are now specific to the failed command.
I have added two states to the machine:
- CSTATE_CHOOSE_SCRIPT which simplifies threadRun, there is now one call
to chooseScript instead of two before.
- CSTATE_END_COMMAND which manages is_latencies and proceeding to the
next command, thus merging the three instances of updating the stats
that were in the first version.
The latter state means that processing query results is included in the per
statement latency, which is an improvement because before I was getting
some transaction latency significantly larger than the apparent sum of the
per-statement latencies, which did not make much sense...
Ok. I agree that makes more sense.
I have added & updated a few comments.
Thanks! Committed.
There are some places where the break could be a pass through
instead, not sure how desirable it is, I'm fine with break.
I left them as "break". Pass-throughs are error-prone, and make it more
difficult to read, IMHO. The compiler will optimize it into a pass-through
anyway, if possible and worthwhile, so there should be no performance
difference.
Since this commit (12788ae49e1933f463bc5), if I use the --rate to throttle
the transaction rate, it does get throttled to about the indicated speed,
but pgbench consumes the entire CPU.
At the block of code starting
if (min_usec > 0 && maxsock != -1)
If maxsock == -1, then there is no sleep happening.
Cheers,
Jeff
Hello Jeff,
I have fixed a bug introduced in the patch by changing && by || in the
(min_usec > 0 && maxsock != -1) condition which was inducing errors with
multi-threads & clients...
Since this commit (12788ae49e1933f463bc5), if I use the --rate to throttle
the transaction rate, it does get throttled to about the indicated speed,
but pgbench consumes the entire CPU.
At the block of code starting
if (min_usec > 0 && maxsock != -1)
If maxsock == -1, then there is no sleep happening.
Argh, shame on me:-(
I cannot find the "induced errors" I was referring to in the message...
Sleeping is definitely needed to avoid a hard loop.
Patch attached fixes it and does not seem to introduce any special issue...
Should probably be backpatched.
Thanks for the debug!
--
Fabien.
Attachments:
pgbench-rate-bug-1.patch (text/x-diff, +1 -1)
On Mon, Sep 4, 2017 at 1:56 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Jeff,
I have fixed a bug introduced in the patch by changing && by || in the
(min_usec > 0 && maxsock != -1) condition which was inducing errors with
multi-threads & clients...
Since this commit (12788ae49e1933f463bc5), if I use the --rate to throttle
the transaction rate, it does get throttled to about the indicated speed,
but pgbench consumes the entire CPU.
At the block of code starting
if (min_usec > 0 && maxsock != -1)
If maxsock == -1, then there is no sleep happening.
Argh, shame on me:-(
I cannot find the "induced errors" I was referring to in the message...
Sleeping is definitely needed to avoid a hard loop.
Patch attached fixes it and does not seem to introduce any special issue...
Should probably be backpatched.
Thanks for the debug!
Thanks Fabien, that works for me.
But if min_usec <= 0, do we want to do whatever it is that we already know
is overdue, before stopping to do the select? If it is safe to go through
this code path when maxsock == -1, then should we just change it to this?
if (min_usec > 0)
Cheers,
Jeff
Hello Jeff,
Ok, the problem was a little bit more trivial than I thought.
The issue is that under a low rate there may be no transaction in
progress, however the wait procedure was relying on select's timeout. If
nothing is active there is nothing to wait for, thus it was an active loop
in this case...
I've introduced a usleep call in place of select for this particular
case. Hopefully this is portable.
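The decision described above can be sketched as a small pure function. This is an illustration under assumptions, not the attached patch: WaitKind and choose_wait are hypothetical names, min_usec is assumed to be the delay until the next scheduled transaction, and maxsock is assumed to be -1 when no client has a query in progress.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of the wait decision in the thread loop. */
typedef enum
{
	WAIT_NONE,		/* something is already due: do not wait */
	WAIT_SELECT,	/* sockets to watch: wait on them, with a timeout */
	WAIT_SLEEP		/* nothing to watch: plain timed sleep */
} WaitKind;

WaitKind
choose_wait(int64_t min_usec, int maxsock)
{
	assert(maxsock >= -1);

	if (min_usec <= 0)
		return WAIT_NONE;	/* a transaction is due now */

	if (maxsock != -1)
		return WAIT_SELECT; /* at least one socket is active */

	/*
	 * No active socket but a delay remains. Without this branch the loop
	 * would spin at full speed until the delay expired.
	 */
	return WAIT_SLEEP;
}
```

The third branch is the case reported earlier in the thread: throttled at a low rate, with no transaction in progress.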
ISTM that this bug exists since rate was introduced, so shame on me and
back-patching should be needed.
--
Fabien.
Attachments:
pgbench-rate-bug-2.patch (text/x-diff, +19 -9)
On Mon, Sep 11, 2017 at 1:49 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Jeff,
Ok, the problem was a little bit more trivial than I thought.
The issue is that under a low rate there may be no transaction in
progress, however the wait procedure was relying on select's timeout. If
nothing is active there is nothing to wait for, thus it was an active loop
in this case...
I've introduced a usleep call in place of select for this particular case.
Hopefully this is portable.
Shouldn't we use pg_usleep to ensure portability? It is defined for
front-end code. But it returns void, so the error check will have to be
changed.
I didn't see the problem before the commit I originally indicated, so I
don't think it has to be back-patched to before v10.
Cheers,
Jeff
Hello Jeff,
Shouldn't we use pg_usleep to ensure portability? It is defined for
front-end code. But it returns void, so the error check will have to be
changed.
Attached v3 with pg_usleep called instead.
I didn't see the problem before the commit I originally indicated, so I
don't think it has to be back-patched to before v10.
Hmmm.... you've got a point, although I'm not sure how it could work
without sleeping explicitly. Maybe the path was calling select with an
empty wait list plus timeout, and select is kind enough to just sleep on
an empty list, or some other miracle. ISTM clearer to explicitly sleep in
that case.
--
Fabien.
Attachments:
pgbench-rate-bug-3.patch (text/x-diff, +18 -8)
On Tue, Sep 12, 2017 at 03:27:13AM +0200, Fabien COELHO wrote:
Shouldn't we use pg_usleep to ensure portability? It is defined for
front-end code. But it returns void, so the error check will have to be
changed.
Attached v3 with pg_usleep called instead.
I didn't see the problem before the commit I originally indicated, so I
don't think it has to be back-patched to before v10.
Hmmm.... you've got a point, although I'm not sure how it could work without
sleeping explicitly. Maybe the path was calling select with an empty wait
list plus timeout, and select is kind enough to just sleep on an empty list,
or some other miracle. ISTM clearer to explicitly sleep in that case.
[Action required within three days. This is a generic notification.]
The above-described topic is currently a PostgreSQL 10 open item. Heikki,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.
[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com
On Mon, Sep 11, 2017 at 6:27 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Hello Jeff,
Shouldn't we use pg_usleep to ensure portability? It is defined for
front-end code. But it returns void, so the error check will have to be
changed.
Attached v3 with pg_usleep called instead.
I didn't see the problem before the commit I originally indicated, so I
don't think it has to be back-patched to before v10.
Hmmm.... you've got a point, although I'm not sure how it could work
without sleeping explicitly. Maybe the path was calling select with an
empty wait list plus timeout, and select is kind enough to just sleep on an
empty list, or some other miracle.
Not really a miracle, calling select with an empty list of file handles is
a standard way to sleep on Unix-like platforms. (Indeed, that is how
pg_usleep is implemented on non-Windows platforms, see
"src/port/pgsleep.c"). The problem is that it is reportedly not portable
to Windows. But I tested pgbench.exe for 9.6.5-1 from EDB installer, and I
don't see excessive CPU usage for a throttled run, and it throttles to
about the correct speed. So maybe the non-portability is more rumor than
reality. So I don't know if this needs backpatching or not. But it should
be fixed for v10, as there it becomes a demonstrably live issue.
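For readers unfamiliar with the idiom mentioned above, here is a minimal sketch of sleeping through select() with no file descriptors on a POSIX system. sleep_with_select is a hypothetical helper written for this illustration; it is not pg_usleep and not code from pgbench, and it makes no claim about behavior on Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Sleep for roughly usec microseconds by calling select() with empty
 * descriptor sets, so that only the timeout applies. Returns the value of
 * select(): 0 when the timeout expired, -1 on error (for instance when
 * interrupted by a signal).
 */
int
sleep_with_select(long usec)
{
	struct timeval tv;

	assert(usec >= 0);

	tv.tv_sec = usec / 1000000L;
	tv.tv_usec = usec % 1000000L;

	/* nfds = 0 and NULL sets: there is nothing to wait for but time */
	return select(0, NULL, NULL, NULL, &tv);
}
```

A caller that must sleep for the full duration would need to handle the -1 case, for example by retrying with the remaining time.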
ISTM clearer to explicitly sleep in that case.
Yes.
Cheers,
Jeff
So I don't know if this needs backpatching or not. But it
should be fixed for v10, as there it becomes a demonstrably live issue.
Yes.
--
Fabien.
On Mon, Sep 11, 2017 at 4:49 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Ok, the problem was a little bit more trivial than I thought.
The issue is that under a low rate there may be no transaction in progress,
however the wait procedure was relying on select's timeout. If nothing is
active there is nothing to wait for, thus it was an active loop in this
case...
I've introduced a usleep call in place of select for this particular case.
Hopefully this is portable.
ISTM that this bug exists since rate was introduced, so shame on me and
back-patching should be needed.
I took a look at this and found that the proposed patch applies
cleanly all the way back to 9.5, but the regression is reported to
have begun with a commit that starts in v10. I haven't probed into
this in any depth, but are we sure that
12788ae49e1933f463bc59a6efe46c4a01701b76 is in fact where this problem
originated?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company