circular wait not triggering deadlock ?

Started by Justin Pryzbyabout 8 years ago5 messagesgeneral
Jump to latest
#1Justin Pryzby
pryzby@telsasoft.com

Running pg10.2, I have a handful of maintenance jobs run in the middle of the
night, which appear to have gotten stuck waiting on each other..

. reindex - loop around pg_repack or psql -c REINDEX $table for those which can't
be repacked)
. vacuum - mostly runs ANALYZE, like on empty parent tables

I believe I've come across something similar before, but hadn't previously
stopped to dig into it.

ts=# SELECT * FROM pg_stat_activity WHERE pid=20488;
-[ RECORD 1 ]----+------------------------------------------------------------------------------------------------------------------------------------------
datid | 16402
datname | ts
pid | 20488
usesysid | 10
usename | postgres
application_name | pg_repack
client_addr |
client_hostname |
client_port | -1
backend_start | 2018-03-07 01:00:15.72433-07
xact_start | 2018-03-07 01:00:15.747619-07
query_start | 2018-03-07 01:00:15.747619-07
state_change | 2018-03-07 01:00:15.74762-07
wait_event_type | Lock
wait_event | virtualxid
state | active
backend_xid |
backend_xmin |
query | CREATE INDEX CONCURRENTLY index_1113966210 ON eric_enodeb_cell_20180304 USING btree (site_id) WITH (fillfactor='100') TABLESPACE oldindex
backend_type | client backend

ts=# SELECT * FROM pg_stat_activity WHERE pid=530;
-[ RECORD 1 ]----+--------------------------------------------------------------------------
datid | 16402
datname | ts
pid | 530
usesysid | 10
usename | postgres
application_name | psql
client_addr |
client_hostname |
client_port | -1
backend_start | 2018-03-07 02:11:59.00832-07
xact_start | 2018-03-07 02:11:59.009855-07
query_start | 2018-03-07 02:11:59.009855-07
state_change | 2018-03-07 02:11:59.009861-07
wait_event_type | Lock
wait_event | relation
state | active
backend_xid |
backend_xmin | 67578467
query | ALTER TABLE eric_enodeb_cell_metrics* ALTER start_time SET STATISTICS 400
backend_type | client backend

ts=# SELECT relation::regclass, granted, * FROM pg_locks WHERE (pid=530 AND NOT granted) OR relation='eric_enodeb_cell_20180304'::regclass; -- virtualxid='16/38159763' OR relation='eric_enodeb_cell_metrics'::regclass ORDER BY 1,2,3;
relation | granted | locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath
---------------------------+---------+----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------+----------
eric_enodeb_cell_20180304 | t | relation | 16402 | 1113966172 | | | | | | | | 19/28401734 | 20488 | ShareUpdateExclusiveLock | t | f
eric_enodeb_cell_20180304 | f | relation | 16402 | 1113966172 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | f | f

I see PID 530 has ShareUpdateExclusiveLock on many child tables.
..but NOT on eric_enodeb_cell_20180304 , which #20488 is trying to reindex.

ts=# SELECT granted, relation::regclass, * FROM pg_locks WHERE pid=530 OR relation='eric_enodeb_cell_metrics'::regclass ORDER BY 1, 2, 3;
granted | relation | locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath
---------+---------------------------+------------+----------+------------+------+-------+-------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------+----------
f | eric_enodeb_cell_metrics | relation | 16402 | 948644697 | | | | | | | | 32/838769 | 18941 | AccessShareLock | f | f
f | eric_enodeb_cell_metrics | relation | 16402 | 948644697 | | | | | | | | 18/38156962 | 16592 | AccessExclusiveLock | f | f
f | eric_enodeb_cell_metrics | relation | 16402 | 948644697 | | | | | | | | 31/1337307 | 22961 | AccessShareLock | f | f
f | eric_enodeb_cell_metrics | relation | 16402 | 948644697 | | | | | | | | 27/1755026 | 17248 | AccessShareLock | f | f
f | eric_enodeb_cell_metrics | relation | 16402 | 948644697 | | | | | | | | 28/1118276 | 28357 | AccessShareLock | f | f
f | eric_enodeb_cell_metrics | relation | 16402 | 948644697 | | | | | | | | 29/830016 | 21846 | AccessExclusiveLock | f | f
f | eric_enodeb_cell_20180304 | relation | 16402 | 1113966172 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | f | f
t | eric_enodeb_cell_201609 | relation | 16402 | 193318232 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | t | f
t | eric_enodeb_cell_201610 | relation | 16402 | 193318241 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | t | f
t | eric_enodeb_cell_201611 | relation | 16402 | 206004215 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | t | f
t | eric_enodeb_cell_201612 | relation | 16402 | 232980573 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | t | f
t | eric_enodeb_cell_201701 | relation | 16402 | 259721731 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | t | f
t | eric_enodeb_cell_201702 | relation | 16402 | 291841516 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | t | f
t | eric_enodeb_cell_20170301 | relation | 16402 | 384321707 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | t | f
t | eric_enodeb_cell_20170302 | relation | 16402 | 385254657 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | t | f
t | eric_enodeb_cell_20170303 | relation | 16402 | 386180686 | | | | | | | | 16/38159763 | 530 | ShareUpdateExclusiveLock | t | f
...

ts=# SELECT granted, relation::regclass, * FROM pg_locks WHERE pid=20488 ORDER BY 1,2,3;
-[ RECORD 1 ]------+--------------------------
granted | f
locktype | virtualxid
virtualxid | 22/4317099
virtualtransaction | 19/28401734
pid | 20488
mode | ShareLock
granted | f
fastpath | f

ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1,2,3;
granted | pid | mode | virtualtransaction | virtualxid
---------+-------+---------------+--------------------+------------
f | 20488 | ShareLock | 19/28401734 | 22/4317099
t | 6471 | ExclusiveLock | 22/4317099 | 22/4317099

[pryzbyj@database ~]$ ps -O lstart -C reindex,pg_repack
PID STARTED S TTY TIME COMMAND
20038 Wed Mar 7 01:00:00 2018 S ? 00:00:00 /bin/sh /usr/local/sbin/reindex
20484 Wed Mar 7 01:00:14 2018 S ? 00:00:00 /bin/sh /usr/local/sbin/reindex
20485 Wed Mar 7 01:00:14 2018 S ? 00:00:00 /usr/pgsql-10/bin/pg_repack -E warning --no-kill-backend -d ts -i eric_enodeb_cell_20180304_site_idx -Ss oldindex

Note, I believe last night our backup job would've run for longer and processed
(many) more tables than usual, looping around pg_dump --snapshot.

Is it a bug that this isn't caught by a deadlock detector and cancelled?

Thanks,
Justin

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#1)
Re: circular wait not triggering deadlock ?

Justin Pryzby <pryzby@telsasoft.com> writes:

Running pg10.2, I have a handful of maintenance jobs run in the middle of the
night, which appear to have gotten stuck waiting on each other..

ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1,2,3;
granted | pid | mode | virtualtransaction | virtualxid
---------+-------+---------------+--------------------+------------
f | 20488 | ShareLock | 19/28401734 | 22/4317099
t | 6471 | ExclusiveLock | 22/4317099 | 22/4317099

PID 20488 is evidently waiting for PID 6471 to finish its transaction.
What's that one doing?

Is it a bug that this isn't caught by a deadlock detector and cancelled?

We did find a bug recently with concurrent CREATE INDEX CONCURRENTLY's
all waiting for each other to commit.

regards, tom lane

#3Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#2)
Re: circular wait not triggering deadlock ?

On Thu, Mar 08, 2018 at 01:57:06PM -0500, Tom Lane wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

Running pg10.2, I have a handful of maintenance jobs run in the middle of the
night, which appear to have gotten stuck waiting on each other..

ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1,2,3;
granted | pid | mode | virtualtransaction | virtualxid
---------+-------+---------------+--------------------+------------
f | 20488 | ShareLock | 19/28401734 | 22/4317099
t | 6471 | ExclusiveLock | 22/4317099 | 22/4317099

PID 20488 is evidently waiting for PID 6471 to finish its transaction.
What's that one doing?

Um, I thought I had kept track of all two pids but looks not..

query | SELECT pg_export_snapshot();

So that explains that.
I already killed one proc, but..

ts=# SELECT granted, relation::regclass, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE pid=17248 ORDER BY 1;
granted | relation | pid | mode | virtualtransaction | virtualxid
---------+-----------------------------------------+-------+-----------------+--------------------+------------
f | eric_enodeb_cell_metrics | 17248 | AccessShareLock | 27/1755026 |

ts=# SELECT granted, relation::regclass, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE relation='eric_enodeb_cell_metrics'::regclass ORDER BY 1; --pid=17248 virtualxid='22/4317099' ORDER BY 1,2,3;
granted | relation | pid | mode | virtualtransaction | virtualxid
---------+--------------------------+-------+--------------------------+--------------------+------------
f | eric_enodeb_cell_metrics | 22961 | AccessShareLock | 31/1337307 |
f | eric_enodeb_cell_metrics | 17248 | AccessShareLock | 27/1755026 |
f | eric_enodeb_cell_metrics | 28357 | AccessShareLock | 28/1118276 |
f | eric_enodeb_cell_metrics | 21846 | AccessExclusiveLock | 29/830016 |
f | eric_enodeb_cell_metrics | 16592 | AccessExclusiveLock | 18/38156962 |
f | eric_enodeb_cell_metrics | 18941 | AccessShareLock | 32/838769 |
t | eric_enodeb_cell_metrics | 530 | ShareUpdateExclusiveLock | 16/38159763 |
(7 rows)

ts=# SELECT pid, state, left(query,199) FROM pg_stat_activity WHERE pid IN(530,17248);
530 | active | ALTER TABLE eric_enodeb_cell_metrics* ALTER start_time SET STATISTICS 400
17248 | active | SELECT c.tableoid, c.oid, c.relname, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM pg_catalog.unnest(coalesce(c.relacl,pg_catalog.acldefault(
CASE WHEN c.relkind = 'S'

ts=# SELECT granted, pid, mode, virtualtransaction, virtualxid FROM pg_locks WHERE relation='eric_enodeb_cell_20180304'::regclass ;
granted | pid | mode | virtualtransaction | virtualxid
---------+-------+--------------------------+--------------------+------------
t | 20488 | ShareUpdateExclusiveLock | 19/28401734 |
f | 530 | ShareUpdateExclusiveLock | 16/38159763 |

ts=# SELECT pid, state, left(query,199) FROM pg_stat_activity WHERE pid=20488;
pid | state | left
-------+--------+-------------------------------------------------------------------------------------------------------------------------------------------
20488 | active | CREATE INDEX CONCURRENTLY index_1113966210 ON eric_enodeb_cell_20180304 USING btree (site_id) WITH (fillfactor='100') TABLESPACE oldindex

ts=# SELECT granted, pid, relation::regclass, locktype, mode virtualxid, virtualxid, virtualtransaction FROM pg_locks WHERE pid=20488 ORDER BY 1;
granted | pid | relation | locktype | virtualxid | virtualxid | virtualtransaction
---------+-------+---------------------------+------------+--------------------------+-------------+--------------------
f | 20488 | | virtualxid | ShareLock | 22/4317099 | 19/28401734

ts=# SELECT granted, pid, relation::regclass, locktype, mode virtualxid, virtualtransaction FROM pg_locks WHERE virtualxid='22/4317099' ORDER BY 1;
granted | pid | relation | locktype | virtualxid | virtualtransaction
---------+-------+----------+------------+---------------+--------------------
f | 20488 | | virtualxid | ShareLock | 19/28401734
t | 6471 | | virtualxid | ExclusiveLock | 22/4317099
(2 rows)

So...I gather ALTER SET STATS is waiting on pg_dump which is waiting on CREATE
INDEX which is waiting on SELECT pg_export_snapshot(), which I take to mean
that I should avoid running pg_repack until the backup is finished..

ts=# SELECT pid, state, application_name app, left(query,99) FROM pg_stat_activity ORDER BY xact_start LIMIT 9;
pid | state | app | left
-------+---------------------+-----------+-----------------------------------------------------------------------------------------------------
6471 | idle in transaction | psql | SELECT pg_export_snapshot();
20488 | active | pg_repack | CREATE INDEX CONCURRENTLY index_1113966210 ON eric_enodeb_cell_20180304 USING btree (site_id) WITH
530 | active | psql | ALTER TABLE eric_enodeb_cell_metrics* ALTER start_time SET STATISTICS 400
17248 | active | pg_dump | SELECT c.tableoid, c.oid, c.relname, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT

Justin

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#3)
Re: circular wait not triggering deadlock ?

Justin Pryzby <pryzby@telsasoft.com> writes:

On Thu, Mar 08, 2018 at 01:57:06PM -0500, Tom Lane wrote:

PID 20488 is evidently waiting for PID 6471 to finish its transaction.
What's that one doing?

Um, I thought I had kept track of all two pids but looks not..
query | SELECT pg_export_snapshot();

pg_export_snapshot doesn't block for other transactions, though.
Further down, you have output that confirms that:

6471 | idle in transaction | psql | SELECT pg_export_snapshot();

That SELECT is the most recently completed command, not the current one.
So now the question is what the connected application is waiting for
before committing.

regards, tom lane

#5Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#4)
Re: circular wait not triggering deadlock ?

On Thu, Mar 08, 2018 at 03:05:36PM -0500, Tom Lane wrote:

Justin Pryzby <pryzby@telsasoft.com> writes:

On Thu, Mar 08, 2018 at 01:57:06PM -0500, Tom Lane wrote:

PID 20488 is evidently waiting for PID 6471 to finish its transaction.
What's that one doing?

Um, I thought I had kept track of all two pids but looks not..
query | SELECT pg_export_snapshot();

pg_export_snapshot doesn't block for other transactions, though.
Further down, you have output that confirms that:

6471 | idle in transaction | psql | SELECT pg_export_snapshot();

That SELECT is the most recently completed command, not the current one.
So now the question is what the connected application is waiting for
before committing.

I believe it does an idle loop around sleep(1), until all the pg_dump
--snapshot children are done.

Justin