pgbench stuck with 100% cpu usage

Started by Pavan Deolasee over 8 years ago, 9 messages
#1Pavan Deolasee
pavan.deolasee@gmail.com

Hello,

While running some tests, I encountered a situation where pgbench gets
stuck in an infinite loop, consuming 100% cpu. The setup was:

- Start postgres server from the master branch
- Initialise pgbench
- Run pgbench -c 10 -T 100
- Stop postgres with -m immediate

Now it seems that pgbench gets stuck and its state machine does not
advance. Attaching a debugger to it, I saw that one of the clients remains
stuck in this loop forever.

    if (command->type == SQL_COMMAND)
    {
        if (!sendCommand(st, command))
        {
            /*
             * Failed. Stay in CSTATE_START_COMMAND state, to
             * retry. ??? What the point or retrying? Should
             * rather abort?
             */
            return;
        }
        else
            st->state = CSTATE_WAIT_RESULT;
    }

sendCommand() returns false because the underlying connection is bad
and PQsendQuery returns 0. Reading the comment, it seems that the author
thought about this situation but decided to retry instead of aborting. Not
sure what the rationale for that decision was; maybe to deal with
transient failures?
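To make the failure mode concrete, here is a small stand-alone model of it. This is only a sketch, not the pgbench source: the types, the `conn_ok` flag, `send_command()`, `step_old()` and `spin_old()` are all hypothetical stand-ins, and it assumes the caller re-invokes the state machine immediately, with nothing to wait on, while the client is still in the "start command" state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the client states; not the real pgbench enum. */
typedef enum { START_COMMAND, WAIT_RESULT } ClientState;

typedef struct {
    ClientState state;
    bool        conn_ok;    /* false models a connection the server dropped */
} Client;

/* Models sendCommand(): it fails whenever the connection is bad. */
static bool send_command(const Client *c)
{
    return c->conn_ok;
}

/* One pass of the old logic: on failure, return without changing state. */
static void step_old(Client *c)
{
    if (c->state == START_COMMAND)
    {
        if (!send_command(c))
            return;             /* stays in START_COMMAND "to retry" */
        c->state = WAIT_RESULT;
    }
}

/*
 * Assumed caller: keep stepping while the client still wants to start a
 * command. The cap exists only so that this sketch terminates; it returns
 * the number of calls that were made.
 */
static long spin_old(Client *c, long cap)
{
    long calls = 0;

    assert(cap > 0);
    while (c->state == START_COMMAND && calls < cap)
    {
        step_old(c);
        calls++;
    }
    return calls;
}
```

With `conn_ok` set to false, `spin_old()` uses up whatever cap it is given and the client never leaves `START_COMMAND`. Without the cap that would be the endless, CPU-bound retry described above. With a healthy connection a single call moves the client to `WAIT_RESULT`.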

The commit that introduced this code is 12788ae49e1933f463bc. So I am
copying Heikki.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#2Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Pavan Deolasee (#1)
Re: pgbench stuck with 100% cpu usage

While running some tests, I encountered a situation where pgbench gets
stuck in an infinite loop, consuming 100% cpu. The setup was:

- Start postgres server from the master branch
- Initialise pgbench
- Run pgbench -c 10 -T 100
- Stop postgres with -m immediate

That is a strange test to run, but it would be better if the behavior was
not that one.

Now it seems that pgbench gets stuck and its state machine does not
advance. Attaching a debugger to it, I saw that one of the clients remains
stuck in this loop forever.

    if (!sendCommand(st, command))
    {
        /*
         * Failed. Stay in CSTATE_START_COMMAND state, to
         * retry. ??? What the point or retrying? Should
         * rather abort?
         */

As the comments indicate and your situation shows, stopping the client
would probably be a much better option when the send fails, instead of
retrying... indefinitely.

The commit that introduced this code is 12788ae49e1933f463bc. So I am
copying Heikki.

AFAICR the commit was mostly a heavy restructuring of previous
unmaintainable spaghetti code. I'm not sure the problem was not there
before in one form or another.

I agree that it should error out & stop the client in this case at least.

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fabien COELHO (#2)
1 attachment(s)
Re: pgbench stuck with 100% cpu usage

The commit that introduced this code is 12788ae49e1933f463bc. So I am
copying Heikki.

AFAICR the commit was mostly a heavy restructuring of previous unmaintainable
spaghetti code. I'm not sure the problem was not there before in one form
or another.

I agree that it should error out & stop the client in this case at least.

Here is a probable "fix", which does what the comment said should be done.

I could not trigger an infinite loop with various kill -9 and other quick
stops. Could you try it on your side?

--
Fabien.

Attachments:

pgbench-send-fail-1.patch (text/x-diff)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e37496c..f039413 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2194,12 +2194,8 @@ doCustom(TState *thread, CState *st, StatsData *agg)
 				{
 					if (!sendCommand(st, command))
 					{
-						/*
-						 * Failed. Stay in CSTATE_START_COMMAND state, to
-						 * retry. ??? What the point or retrying? Should
-						 * rather abort?
-						 */
-						return;
+						commandFailed(st, "SQL command send failed");
+						st->state = CSTATE_ABORTED;
 					}
 					else
 						st->state = CSTATE_WAIT_RESULT;
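The effect of the patch can be checked in isolation with a small model. Again, this is only a sketch under stated assumptions, not the pgbench code: `SketchClient`, `try_send()`, `note_failure()`, `step_fixed()` and `run_fixed()` are hypothetical names. `note_failure()` merely counts failures where the patch calls commandFailed(), and the driver loop is the same assumed "re-invoke while the client wants to start a command" caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical client states, with the terminal state the patch relies on. */
typedef enum { ST_START_COMMAND, ST_WAIT_RESULT, ST_ABORTED } SketchState;

typedef struct {
    SketchState state;
    bool        conn_ok;    /* false models a connection the server dropped */
    int         failures;   /* how many send failures were reported */
} SketchClient;

/* Models sendCommand(): it fails whenever the connection is bad. */
static bool try_send(const SketchClient *c)
{
    return c->conn_ok;
}

/* Stand-in for the error report; the sketch only counts the failure. */
static void note_failure(SketchClient *c)
{
    c->failures++;
}

/* One pass of the patched logic: a failed send aborts the client. */
static void step_fixed(SketchClient *c)
{
    if (c->state == ST_START_COMMAND)
    {
        if (!try_send(c))
        {
            note_failure(c);
            c->state = ST_ABORTED;
        }
        else
            c->state = ST_WAIT_RESULT;
    }
}

/* Same assumed caller as before; the cap is kept only as a safety net. */
static long run_fixed(SketchClient *c, long cap)
{
    long calls = 0;

    assert(cap > 0);
    while (c->state == ST_START_COMMAND && calls < cap)
    {
        step_fixed(c);
        calls++;
    }
    return calls;
}
```

Because every path out of `ST_START_COMMAND` now changes the state, the driver loop ends after one call whether the send succeeded or not. A dead connection is reported once and the client ends up in `ST_ABORTED`.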

#4Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Fabien COELHO (#2)
Re: pgbench stuck with 100% cpu usage

On Fri, Sep 29, 2017 at 12:22 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

While running some tests, I encountered a situation where pgbench gets
stuck in an infinite loop, consuming 100% cpu. The setup was:

- Start postgres server from the master branch
- Initialise pgbench
- Run pgbench -c 10 -T 100
- Stop postgres with -m immediate

That is a strange test to run, but it would be better if the behavior was
not that one.

Well, I think it's a very legitimate test, not for testing performance, but
testing crash recovery and I use it very often. This particular test was
run to catch another bug which will be reported separately.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#5Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Fabien COELHO (#3)
Re: pgbench stuck with 100% cpu usage

On Fri, Sep 29, 2017 at 1:03 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

The commit that introduced this code is 12788ae49e1933f463bc. So I am
copying Heikki.

AFAICR the commit was mostly a heavy restructuring of previous
unmaintainable spaghetti code. I'm not sure the problem was not there
before in one form or another.

I agree that it should error out & stop the client in this case at least.

Here is a probable "fix", which does what the comment said should be done.

Looks good to me.

I could not trigger an infinite loop with various kill -9 and other quick
stops. Could you try it on your side?

Ok, I will try. But TBH I did not try to reproduce that either and I am not
sure if I can. I discovered the problem when my laptop's battery started
draining out much more quickly. Having seen the problem, it seems very
obvious though.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#6Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Pavan Deolasee (#4)
Re: pgbench stuck with 100% cpu usage

- Run pgbench -c 10 -T 100
- Stop postgres with -m immediate

That is a strange test to run, but it would be better if the behavior was
not that one.

Well, I think it's a very legitimate test, not for testing performance, but
testing crash recovery and I use it very often.

Ok, interesting. Now I understand your purpose.

You may consider something like "BEGIN; UPDATE ...; \sleep 100 ms;
COMMIT;" so that a crash is most likely to occur with plenty of
transactions in progress but without much load.

--
Fabien.

#7Robert Haas
robertmhaas@gmail.com
In reply to: Pavan Deolasee (#5)
Re: pgbench stuck with 100% cpu usage

On Fri, Sep 29, 2017 at 1:39 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Looks good to me.

Committed and back-patched to v10. I have to say I'm kind of
surprised that the comment removed by this patch got committed in the
first place. It's got a ??? in it and isn't very grammatical either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#7)
Re: pgbench stuck with 100% cpu usage

Committed and back-patched to v10. I have to say I'm kind of
surprised that the comment removed by this patch got committed in the
first place. It's got a ??? in it and isn't very grammatical either.

ISTM that I reviewed the initial patch.

AFAICR I agreed with the comment that whether it was appropriate to go on
was unclear, but it did not strike me as obviously a bad idea at the time,
so I let it pass. Now it does strike me as a bad idea (tm):-) My English
is kind of fuzzy, so I tend not to comment too much on English unless I'm
really sure that it is wrong.

Note that there is another 100% cpu pgbench bug, see
https://commitfest.postgresql.org/15/1292/, which seems more likely to
occur in the wild than this one, but there has been no review of the fix I
sent.

--
Fabien.

#9Peter Geoghegan
In reply to: Pavan Deolasee (#4)
Re: pgbench stuck with 100% cpu usage

On Thu, Sep 28, 2017 at 10:36 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Well, I think it's a very legitimate test, not for testing performance, but
testing crash recovery and I use it very often. This particular test was run
to catch another bug which will be reported separately.

Yeah, I use pgbench for stuff like that all the time.

--
Peter Geoghegan
