<IDLE> and waiting

Started by Gurjeet Singhover 18 years ago10 messageshackers

singh.gurjeet@gmail.com

over 18 years ago

Hi guys,

I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
it (understandably, since its IDLE), there are no entries for this pid in
pg_locks!

Following are the snapshots of the two system views.

procpid | current_query | waiting | duration |
backend_start
---------+-----------------------+---------+------------------+-------------------------------
20762 | <IDLE> | f | | 2008-01-31
13:38:30.848898-08
19776 | <IDLE> | t | 00:38:34.76833 | 2008-01-31
12:51:29.005744-08
20356 | <IDLE> | f | 00:38:29.971425 | 2008-01-31
13:17:37.617497-08
19775 | <IDLE> | f | 00:38:27.187201 | 2008-01-31
12:51:28.999242-08
19774 | <IDLE> | f | 00:38:27.187068 | 2008-01-31
12:51:28.90554-08
20728 | <IDLE> | f | 00:14:03.913027 | 2008-01-31
13:36:11.345822-08
9727 | <IDLE> | f | 00:03:07.444273 | 2008-01-24
22:25:00.289931-08
9684 | <IDLE> | f | 00:00:07.704656 | 2008-01-24
22:22:00.007377-08
19390 | <IDLE> in transaction | f | 00:00:00.027585 | 2008-01-31
12:30:07.999246-08
19389 | <IDLE> in transaction | t | -00:00:00.000255 | 2008-01-31
12:30:07.973868-08

select * from pg_locks where pid in ( 19776, 19389 );

locktype | database | relation | page | tuple | transactionid |
classid | objid | objsubid | transaction | pid | mode |
granted
---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+-------+------------------+---------
relation | 16584 | 17070 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
relation | 16584 | 17106 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
relation | 16584 | 17068 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
transactionid | | | | | 3700350056
| | | | 3700350056 | 19389 | ExclusiveLock | t
relation | 16584 | 17108 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
(5 rows)

The 'duration' column above is just now()-query_start. These are not
just two instant snapshots, but we could see this output consistently for
quite long.

I tracked the 'waiting' column a little bit in the source code, and saw
that it is actually generated from PgBackendStatus.st_waiting . Is it
possible that, for some reason, postgres forgot to update this for a
backend?

select version();
version
--------------------------------------------------------------------------------------------
PostgreSQL 8.2.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC)
4.1.0 (SUSE Linux)

This issue has been seen twice now.

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

Tom Lane

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Gurjeet Singh (#1)

Re: <IDLE> and waiting

"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:

I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
it (understandably, since its IDLE), there are no entries for this pid in
pg_locks!

Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.

regards, tom lane

Gurjeet Singh

singh.gurjeet@gmail.com

over 18 years ago

In reply to: Tom Lane (#2)

Re: <IDLE> and waiting

The situation seems pretty bad!!

Here are the steps to reproduce in 'PostgreSQL 8.3beta2 on
x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.3.3 (SuSE Linux)':

session 1: begin;
session 1: update test set a = 112 where a = 112;
session 2: update test set a = 113 where a = 112; --waits
session 1: select * from pg_stat_activity; -- executed this a few times
before executing 'select version()' and then following:
session 1: <stat query1> -- see end of mail for this query

session 1: select * from pg_locks

Then,
session 2: ^C
Cancel request sent
ERROR: canceling statement due to user request

session1: <stat query1>

session 1: select * from pg_locks ;

session 1: select pg_backend_pid();

pg_backend_pid
----------------
11975

The last mentioned output of <stat query> shows session 1 executing 'select
* from p_s_a', whereas the <stat query> _is_ being executed in session 1!!!
This result is consistently returned for a while, and later...

session 2: select pg_backend_pid();

pg_backend_pid
----------------
12577

session 1: <stat query1>

After a while again:

session 1: <stat query2> -- notice 2 not 1; 'select *' comes back to
haunt!!!

session 1: <stat query1> -- 1 back in action

The <stat query1> is:
select
procpid, current_query::varchar(50), waiting, now() - query_start as
duration, backend_start
from pg_stat_activity
where current_query <> '<IDLE>'
and current_query not like '%DONT COUNT ME1 %'
order by duration desc
limit 10;

The <stat query2> is:
select
procpid, current_query::varchar(50), waiting, now() - query_start as
duration, backend_start
from pg_stat_activity
where current_query not like '%DONT COUNT ME1 %'
order by duration desc
limit 10;

Found more bugs than I was looking for, to reproduce!!!

The reporter also made an observation (on 8.2.4) that there were
deadlocks detected at around the same time. Looked at WaitOnLock(), and
clearly there's a problem, but is it at the same/only place we are
suspecting it to be?

Best regards,

Clearly, there's something wrong

On Feb 1, 2008 8:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:

I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top

of

it (understandably, since its IDLE), there are no entries for this pid

in

pg_locks!

Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.

regards, tom lane

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

Tom Lane

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Gurjeet Singh (#3)

Re: <IDLE> and waiting

"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:

The situation seems pretty bad!!

I think at least part of your problem is not understanding that a single
transaction sees a frozen snapshot of pg_stat_activity.

regards, tom lane

Gurjeet Singh

singh.gurjeet@gmail.com

over 18 years ago

In reply to: Tom Lane (#4)

Re: <IDLE> and waiting

On Feb 1, 2008 3:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:

The situation seems pretty bad!!

I think at least part of your problem is not understanding that a single
transaction sees a frozen snapshot of pg_stat_activity.

It does! I assumed that pg_stat_activity produced the
transaction-independent snapshot of internal memory structures! Is that the
case with pg_locks too!? I hope not.

BTW, we cannot say that the pg_stat_activity behaves in a consistent manner
(transactions-wise). From what I could infer, this view's results are frozen
when you first query the view, not when the transaction started (which is
how other (normal) relations behave). It's a bit confusing, and should be
documented if this is the way it is intended to work; Something along the
lines of : "In a transaction, this view will repeatedly show the same
results that were returned by it's first invocation in the transaction." in
a less confusing way :)

So we are back to the original problem... Canceling a 'waiting' transaction
does not revert the session's 'waiting' state back to 'false' (consistently
reproducible).

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

Tom Lane

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Tom Lane (#2)

Re: <IDLE> and waiting

I wrote:

"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:

I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
it (understandably, since its IDLE), there are no entries for this pid in
pg_locks!

Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.

I've committed a fix for this. (Too late for 8.3.0, unfortunately.)

regards, tom lane

Gurjeet Singh

singh.gurjeet@gmail.com

over 18 years ago

In reply to: Tom Lane (#6)

Re: <IDLE> and waiting

On Feb 2, 2008 2:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:

I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top

of

it (understandably, since its IDLE), there are no entries for this pid

in

pg_locks!

Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.

I've committed a fix for this. (Too late for 8.3.0, unfortunately.)

Thanks. Like 8.2, can it not be back-patched on 8.3 too?

I just looked at the patch... Isn't PG_TRY() an expensive call to make in
the lock.c code? I was thinking of registering a Xact callback using
RegisterXactCallback() and performing 'waiting' reset in that callback if
the Xact event is XACT_EVENT_ABORT.

That would have been compliant with the previous comments ('if we fail, any
cleanup must happen in xact abort processing, not here').

Comments.

Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

Alvaro Herrera

alvherre@2ndquadrant.com

over 18 years ago

In reply to: Gurjeet Singh (#7)

Re: <IDLE> and waiting

Gurjeet Singh escribiï¿½:

I just looked at the patch... Isn't PG_TRY() an expensive call to make in
the lock.c code? I was thinking of registering a Xact callback using
RegisterXactCallback() and performing 'waiting' reset in that callback if
the Xact event is XACT_EVENT_ABORT.

PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
another stack frame.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Gurjeet Singh

singh.gurjeet@gmail.com

over 18 years ago

In reply to: Alvaro Herrera (#8)

Re: <IDLE> and waiting

On Feb 2, 2008 3:27 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:

Gurjeet Singh escribió:

I just looked at the patch... Isn't PG_TRY() an expensive call to make

in

the lock.c code? I was thinking of registering a Xact callback using
RegisterXactCallback() and performing 'waiting' reset in that callback

if

the Xact event is XACT_EVENT_ABORT.

PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
another stack frame.

Thats why I asked. I assumed that creating stacks was expensive. Isn't this
the reason the compilers came up with the function inline capability; to
avoid stacks, because they can be expensive. Or am I confusing two different
types of stacks!

Moreover, calling a callback, once in a while (only upon XACT abort), may
prove to be much cheaper than setting up an additional stack on every
lock-acquire call.

Really, my 2 cents.

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

#10

Tom Lane

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Alvaro Herrera (#8)

Re: <IDLE> and waiting

Alvaro Herrera <alvherre@commandprompt.com> writes:

Gurjeet Singh escribi�:

I just looked at the patch... Isn't PG_TRY() an expensive call to make in
the lock.c code? I was thinking of registering a Xact callback using
RegisterXactCallback() and performing 'waiting' reset in that callback if
the Xact event is XACT_EVENT_ABORT.

PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
another stack frame.

Also, since we're about to block here, shaving microseconds is not all
that important.

The reason I did it that way was to avoid having to export the saved
ps-display string out to someplace LockWaitCancel could find it.

regards, tom lane