<IDLE> and waiting
Hi guys,
I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
it (understandably, since its IDLE), there are no entries for this pid in
pg_locks!
Following are the snapshots of the two system views.
procpid | current_query | waiting | duration |
backend_start
---------+-----------------------+---------+------------------+-------------------------------
20762 | <IDLE> | f | | 2008-01-31
13:38:30.848898-08
19776 | <IDLE> | t | 00:38:34.76833 | 2008-01-31
12:51:29.005744-08
20356 | <IDLE> | f | 00:38:29.971425 | 2008-01-31
13:17:37.617497-08
19775 | <IDLE> | f | 00:38:27.187201 | 2008-01-31
12:51:28.999242-08
19774 | <IDLE> | f | 00:38:27.187068 | 2008-01-31
12:51:28.90554-08
20728 | <IDLE> | f | 00:14:03.913027 | 2008-01-31
13:36:11.345822-08
9727 | <IDLE> | f | 00:03:07.444273 | 2008-01-24
22:25:00.289931-08
9684 | <IDLE> | f | 00:00:07.704656 | 2008-01-24
22:22:00.007377-08
19390 | <IDLE> in transaction | f | 00:00:00.027585 | 2008-01-31
12:30:07.999246-08
19389 | <IDLE> in transaction | t | -00:00:00.000255 | 2008-01-31
12:30:07.973868-08
select * from pg_locks where pid in ( 19776, 19389 );
locktype | database | relation | page | tuple | transactionid |
classid | objid | objsubid | transaction | pid | mode |
granted
---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+-------+------------------+---------
relation | 16584 | 17070 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
relation | 16584 | 17106 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
relation | 16584 | 17068 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
transactionid | | | | | 3700350056
| | | | 3700350056 | 19389 | ExclusiveLock | t
relation | 16584 | 17108 | | |
| | | | 3700350056 | 19389 | RowExclusiveLock | t
(5 rows)
The 'duration' column above is just now()-query_start. These are not
just two instant snapshots, but we could see this output consistently for
quite long.
I tracked the 'waiting' column a little bit in the source code, and saw
that it is actually generated from PgBackendStatus.st_waiting . Is it
possible that, for some reason, postgres forgot to update this for a
backend?
select version();
version
--------------------------------------------------------------------------------------------
PostgreSQL 8.2.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC)
4.1.0 (SUSE Linux)
This issue has been seen twice now.
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
EnterpriseDB http://www.enterprisedb.com
17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *
Mail sent from my BlackLaptop device
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
it (understandably, since its IDLE), there are no entries for this pid in
pg_locks!
Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.
regards, tom lane
The situation seems pretty bad!!
Here are the steps to reproduce in 'PostgreSQL 8.3beta2 on
x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.3.3 (SuSE Linux)':
session 1: begin;
session 1: update test set a = 112 where a = 112;
session 2: update test set a = 113 where a = 112; --waits
session 1: select * from pg_stat_activity; -- executed this a few times
before executing 'select version()' and then following:
session 1: <stat query1> -- see end of mail for this query
procpid | current_query | waiting |
duration | backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
12577 | update test set a = 113 where a = 112; | t | -00:01:
35.782881 | 2008-02-01 13:36:15.31027-08
11975 | select * from pg_stat_activity ; | f | -00:01:
52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)
session 1: select * from pg_locks
locktype | database | relation | page | tuple | virtualxid |
transactionid | classid | objid | objsubid | virtualtransaction | pid
| mode | granted
---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+---------------
-----+-------+------------------+---------
transactionid | | | | | |
390 | | | | 2/14 | 12577 | ShareLock | f
transactionid | | | | | |
390 | | | | 1/9 | 11975 | ExclusiveLock | t
<snip irrelevant>
Then,
session 2: ^C
Cancel request sent
ERROR: canceling statement due to user request
session1: <stat query1>
procpid | current_query | waiting |
duration | backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
12577 | update test set a = 113 where a = 112; | t | -00:01:
35.782881 | 2008-02-01 13:36:15.31027-08
11975 | select * from pg_stat_activity ; | f | -00:01:
52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)
session 1: select * from pg_locks ;
<no traces of pid 12577>
session 1: select pg_backend_pid();
pg_backend_pid
----------------
11975
The last mentioned output of <stat query> shows session 1 executing 'select
* from p_s_a', whereas the <stat query> _is_ being executed in session 1!!!
This result is consistently returned for a while, and later...
session 2: select pg_backend_pid();
pg_backend_pid
----------------
12577
session 1: <stat query1>
procpid | current_query | waiting | duration |
backend_start
---------+-----------------------+---------+-----------------+-------------------------------
11975 | <IDLE> in transaction | f | 00:06:08.671029 | 2008-02-01
13:30:40.396392-08
(1 row)
After a while again:
session 1: <stat query2> -- notice 2 not 1; 'select *' comes back to
haunt!!!
procpid | current_query | waiting |
duration | backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
12577 | update test set a = 113 where a = 112; | t | -00:01:
35.782881 | 2008-02-01 13:36:15.31027-08
11975 | select * from pg_stat_activity ; | f | -00:01:
52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)
session 1: <stat query1> -- 1 back in action
procpid | current_query | waiting |
duration | backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
12577 | update test set a = 113 where a = 112; | t | -00:01:
35.782881 | 2008-02-01 13:36:15.31027-08
11975 | select * from pg_stat_activity ; | f | -00:01:
52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)
The <stat query1> is:
select
procpid, current_query::varchar(50), waiting, now() - query_start as
duration, backend_start
from pg_stat_activity
where current_query <> '<IDLE>'
and current_query not like '%DONT COUNT ME1 %'
order by duration desc
limit 10;
The <stat query2> is:
select
procpid, current_query::varchar(50), waiting, now() - query_start as
duration, backend_start
from pg_stat_activity
where current_query not like '%DONT COUNT ME1 %'
order by duration desc
limit 10;
Found more bugs than I was looking for, to reproduce!!!
The reporter also made an observation (on 8.2.4) that there were
deadlocks detected at around the same time. Looked at WaitOnLock(), and
clearly there's a problem, but is it at the same/only place we are
suspecting it to be?
Best regards,
PS: Ran the <stat query>ies 1 and 2 again, just before hitting 'send', and
the result is the same:
procpid | current_query | waiting |
duration | backend_start
---------+----------------------------------------+---------+------------------+-------------------------------
12577 | update test set a = 113 where a = 112; | t | -00:01:
35.782881 | 2008-02-01 13:36:15.31027-08
11975 | select * from pg_stat_activity ; | f | -00:01:
52.554697 | 2008-02-01 13:30:40.396392-08
(2 rows)
Clearly, there's something wrong
On Feb 1, 2008 8:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On topof
it (understandably, since its IDLE), there are no entries for this pid
in
pg_locks!
Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.regards, tom lane
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
EnterpriseDB http://www.enterprisedb.com
17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *
Mail sent from my BlackLaptop device
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
The situation seems pretty bad!!
I think at least part of your problem is not understanding that a single
transaction sees a frozen snapshot of pg_stat_activity.
regards, tom lane
On Feb 1, 2008 3:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
The situation seems pretty bad!!
I think at least part of your problem is not understanding that a single
transaction sees a frozen snapshot of pg_stat_activity.
It does! I assumed that pg_stat_activity produced the
transaction-independent snapshot of internal memory structures! Is that the
case with pg_locks too!? I hope not.
BTW, we cannot say that the pg_stat_activity behaves in a consistent manner
(transactions-wise). From what I could infer, this view's results are frozen
when you first query the view, not when the transaction started (which is
how other (normal) relations behave). It's a bit confusing, and should be
documented if this is the way it is intended to work; Something along the
lines of : "In a transaction, this view will repeatedly show the same
results that were returned by it's first invocation in the transaction." in
a less confusing way :)
So we are back to the original problem... Canceling a 'waiting' transaction
does not revert the session's 'waiting' state back to 'false' (consistently
reproducible).
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
EnterpriseDB http://www.enterprisedb.com
17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *
Mail sent from my BlackLaptop device
I wrote:
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of
it (understandably, since its IDLE), there are no entries for this pid in
pg_locks!
Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.
I've committed a fix for this. (Too late for 8.3.0, unfortunately.)
regards, tom lane
On Feb 2, 2008 2:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
I saw a strange behaviour on one of the production boxes. The
pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On topof
it (understandably, since its IDLE), there are no entries for this pid
in
pg_locks!
Hmm, I can reproduce something like this by aborting a wait for lock.
It seems the problem is that WaitOnLock() is ignoring its own good
advice, assuming that it can do cleanup work after waiting.I've committed a fix for this. (Too late for 8.3.0, unfortunately.)
Thanks. Like 8.2, can it not be back-patched on 8.3 too?
I just looked at the patch... Isn't PG_TRY() an expensive call to make in
the lock.c code? I was thinking of registering a Xact callback using
RegisterXactCallback() and performing 'waiting' reset in that callback if
the Xact event is XACT_EVENT_ABORT.
That would have been compliant with the previous comments ('if we fail, any
cleanup must happen in xact abort processing, not here').
Comments.
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
EnterpriseDB http://www.enterprisedb.com
17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *
Mail sent from my BlackLaptop device
Gurjeet Singh escribi�:
I just looked at the patch... Isn't PG_TRY() an expensive call to make in
the lock.c code? I was thinking of registering a Xact callback using
RegisterXactCallback() and performing 'waiting' reset in that callback if
the Xact event is XACT_EVENT_ABORT.
PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
another stack frame.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Feb 2, 2008 3:27 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Gurjeet Singh escribió:
I just looked at the patch... Isn't PG_TRY() an expensive call to make
in
the lock.c code? I was thinking of registering a Xact callback using
RegisterXactCallback() and performing 'waiting' reset in that callbackif
the Xact event is XACT_EVENT_ABORT.
PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
another stack frame.
Thats why I asked. I assumed that creating stacks was expensive. Isn't this
the reason the compilers came up with the function inline capability; to
avoid stacks, because they can be expensive. Or am I confusing two different
types of stacks!
Moreover, calling a callback, once in a while (only upon XACT abort), may
prove to be much cheaper than setting up an additional stack on every
lock-acquire call.
Really, my 2 cents.
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
EnterpriseDB http://www.enterprisedb.com
17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco *
Mail sent from my BlackLaptop device
Alvaro Herrera <alvherre@commandprompt.com> writes:
Gurjeet Singh escribi�:
I just looked at the patch... Isn't PG_TRY() an expensive call to make in
the lock.c code? I was thinking of registering a Xact callback using
RegisterXactCallback() and performing 'waiting' reset in that callback if
the Xact event is XACT_EVENT_ABORT.
PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and
another stack frame.
Also, since we're about to block here, shaving microseconds is not all
that important.
The reason I did it that way was to avoid having to export the saved
ps-display string out to someplace LockWaitCancel could find it.
regards, tom lane