Concurrent deadlock scenario with DROP INDEX on partitioned index

Started by Jimmy Yih · 7 messages · pgsql-hackers
#1 Jimmy Yih
jyih@vmware.com

Hello pgsql-hackers,

When dropping a partitioned index, the locking order starts with the
partitioned table, then its partitioned index, then the partition
indexes dependent on that partitioned index, and finally the dependent
partition indexes' parent tables. This order allows a deadlock
scenario to happen if for example an ANALYZE happens on one of the
partition tables which locks the partition table and then blocks when
it attempts to lock its index (the DROP INDEX has the index lock but
cannot get the partition table lock).

Deadlock Reproduction
==================================================
Initial setup:
CREATE TABLE foopart (a int) PARTITION BY RANGE (a);
CREATE TABLE foopart_child PARTITION OF foopart FOR VALUES FROM (1) TO (5);
CREATE INDEX foopart_idx ON foopart USING btree(a);

Attach a debugger to session 1 and set a breakpoint on the function deleteObjectsInList:
1: DROP INDEX foopart_idx;

While session 1 is blocked by debugger breakpoint, run the following:
2: ANALYZE foopart_child;

After a couple of seconds, detach the debugger from session 1 and observe the deadlock.

In order to prevent this deadlock scenario, we should find and lock
all the sub-partition and partition tables beneath the partitioned
index's partitioned table before attempting to lock the sub-partition
and partition indexes.
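
The inversion is easy to see abstractly: two sessions can deadlock exactly when they acquire some pair of locks in opposite orders. Below is a minimal, PostgreSQL-independent sketch of that condition; the lock names and the deadlock_possible helper are illustrative only, not code from the patch.

```python
def deadlock_possible(order_a, order_b):
    """Two sessions can deadlock iff they acquire some pair of locks in
    opposite orders (each can then hold one lock and wait on the other)."""
    pos_a = {lock: i for i, lock in enumerate(order_a)}
    pos_b = {lock: i for i, lock in enumerate(order_b)}
    shared = set(pos_a) & set(pos_b)
    return any(
        (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
        for x in shared for y in shared if x != y
    )

# Unpatched DROP INDEX: partitioned table, partitioned index,
# partition index, and only then the partition table.
drop_index_old = ["foopart", "foopart_idx",
                  "foopart_child_idx", "foopart_child"]

# ANALYZE locks the partition table first, then its index.
analyze = ["foopart_child", "foopart_child_idx"]

print(deadlock_possible(drop_index_old, analyze))  # True: opposite order

# Proposed ordering: lock every partition table beneath the
# partitioned index's table before touching any partition index.
drop_index_new = ["foopart", "foopart_child",
                  "foopart_idx", "foopart_child_idx"]

print(deadlock_possible(drop_index_new, analyze))  # False
```

With the table locks taken up front, both sessions acquire the shared locks in the same relative order, so no cycle can form.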

Attached is a patch (plus an isolation test) that fixes the issue. We
observed this on the latest head of Postgres.

Regards,
Jimmy Yih and Gaurab Dey

Attachments:

0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-on-.patch (+130 -1)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jimmy Yih (#1)
Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

Jimmy Yih <jyih@vmware.com> writes:

> When dropping a partitioned index, the locking order starts with the
> partitioned table, then its partitioned index, then the partition
> indexes dependent on that partitioned index, and finally the dependent
> partition indexes' parent tables. This order allows a deadlock
> scenario to happen if for example an ANALYZE happens on one of the
> partition tables which locks the partition table and then blocks when
> it attempts to lock its index (the DROP INDEX has the index lock but
> cannot get the partition table lock).

I agree this is a bug, and I think that you may have the right
general idea about how to fix it: acquire the necessary locks in
RangeVarCallbackForDropRelation. However, this patch still needs
a fair amount of work.

1. RangeVarCallbackForDropRelation can get called more than once
during a lookup (in case of concurrent rename and suchlike).
If so it needs to be prepared to drop the lock(s) it got last time.
You have not implemented any such logic. This doesn't seem hard
to fix, just store the OID list into struct DropRelationCallbackState.

2. I'm not sure you have fully thought through the interaction
with the subsequent "is_partition" stanza. If the target is
an intermediate partitioned index, don't we need to acquire lock
on its parent index before starting to lock children? (It may
be that that stanza is already in the wrong place relative to
the table-locking stanza it currently follows, but not sure.)

3. Calling find_all_inheritors with lockmode NoLock, and then
locking those relation OIDs later, is both unsafe and code-wasteful.
Just tell find_all_inheritors to get the locks you want.

4. This code will invoke find_all_inheritors on InvalidOid if
IndexGetRelation fails. It needs to be within the if-test
just above.

5. Reading classform again at this point in the function is
not merely inefficient, but outright wrong, because we
already released the syscache entry. Use the local variable.

6. You've not paid enough attention to updating existing comments,
particularly the header comment for RangeVarCallbackForDropRelation.

Actually though, maybe you *don't* want to do this in
RangeVarCallbackForDropRelation. Because of point 2, it might be
better to run find_all_inheritors after we've successfully
identified and locked the direct target relation, ie do it back
in RemoveRelations. I've not thought hard about that, but it's
attractive if only because it'd mean you don't have to fix point 1.

regards, tom lane

#3 Jimmy Yih
jyih@vmware.com
In reply to: Tom Lane (#2)
Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> 1. RangeVarCallbackForDropRelation can get called more than once
> during a lookup (in case of concurrent rename and suchlike).
> If so it needs to be prepared to drop the lock(s) it got last time.
> You have not implemented any such logic. This doesn't seem hard
> to fix, just store the OID list into struct DropRelationCallbackState.

Fixed in attached patch. We added the OID list to the
DropRelationCallbackState as you suggested.

> 2. I'm not sure you have fully thought through the interaction
> with the subsequent "is_partition" stanza. If the target is
> an intermediate partitioned index, don't we need to acquire lock
> on its parent index before starting to lock children? (It may
> be that that stanza is already in the wrong place relative to
> the table-locking stanza it currently follows, but not sure.)

It's not required to acquire lock on a possible parent index because
of the restriction where we can only run DROP INDEX on the top-most
partitioned index.

> 3. Calling find_all_inheritors with lockmode NoLock, and then
> locking those relation OIDs later, is both unsafe and code-wasteful.
> Just tell find_all_inheritors to get the locks you want.

Fixed in attached patch.

> 4. This code will invoke find_all_inheritors on InvalidOid if
> IndexGetRelation fails. It needs to be within the if-test
> just above.

Fixed in attached patch.

> 5. Reading classform again at this point in the function is
> not merely inefficient, but outright wrong, because we
> already released the syscache entry. Use the local variable.

Fixed in attached patch. Added another local variable
is_partitioned_index to store the classform value. The main reason we
need the classform is because the existing relkind and
expected_relkind local variables would only show RELKIND_INDEX whereas
we needed exactly RELKIND_PARTITIONED_INDEX.

> 6. You've not paid enough attention to updating existing comments,
> particularly the header comment for RangeVarCallbackForDropRelation.

Fixed in attached patch. Updated the header comment and aggregated our
introduced comment to another relative comment slightly above the
proposed locking section.

> Actually though, maybe you *don't* want to do this in
> RangeVarCallbackForDropRelation. Because of point 2, it might be
> better to run find_all_inheritors after we've successfully
> identified and locked the direct target relation, ie do it back
> in RemoveRelations. I've not thought hard about that, but it's
> attractive if only because it'd mean you don't have to fix point 1.

We think that RangeVarCallbackForDropRelation is probably the most
correct function to place the fix. It would look a bit out-of-place
being in RemoveRelations seeing how there's already relative DROP
INDEX code in RangeVarCallbackForDropRelation. With point 2 explained
and point 1 addressed, the fix seems to look okay in
RangeVarCallbackForDropRelation.

Thanks for the feedback. Attached new patch with feedback addressed.

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)

Attachments:

v2-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch (+164 -9)
#4 Zhihong Yu
zyu@yugabyte.com
In reply to: Jimmy Yih (#3)
Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

On Wed, Mar 16, 2022 at 11:20 AM Jimmy Yih <jyih@vmware.com> wrote:

> [...]

Hi,
For RangeVarCallbackForDropRelation():

+ LockRelationOid(indexRelationOid, heap_lockmode);

Since the above is called for both if and else blocks, it can be lifted
outside.

Cheers

#5 Jimmy Yih
jyih@vmware.com
In reply to: Zhihong Yu (#4)
Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

Zhihong Yu <zyu@yugabyte.com> wrote:

> Hi,
> For RangeVarCallbackForDropRelation():
>
> + LockRelationOid(indexRelationOid, heap_lockmode);
>
> Since the above is called for both if and else blocks, it can be lifted outside.

Thanks for the feedback. Attached new v3 patch with feedback addressed.

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)

Attachments:

v3-0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-.patch (+161 -9)
#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jimmy Yih (#3)
Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

Jimmy Yih <jyih@vmware.com> writes:

> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Actually though, maybe you *don't* want to do this in
>> RangeVarCallbackForDropRelation. Because of point 2, it might be
>> better to run find_all_inheritors after we've successfully
>> identified and locked the direct target relation, ie do it back
>> in RemoveRelations. I've not thought hard about that, but it's
>> attractive if only because it'd mean you don't have to fix point 1.
>
> We think that RangeVarCallbackForDropRelation is probably the most
> correct function to place the fix. It would look a bit out-of-place
> being in RemoveRelations seeing how there's already relative DROP
> INDEX code in RangeVarCallbackForDropRelation.

I really think you made the wrong choice here. Doing the locking in
RemoveRelations leads to an extremely simple patch, as I demonstrate
below. Moreover, since RemoveRelations also has special-case code
for partitioned indexes, it's hard to argue that it mustn't cover
this case too.

Also, I think the proposed test case isn't very good, because when
I run it without applying the code patch, it fails to demonstrate
any deadlock. The query output is different, but not obviously
wrong.

> Fixed in attached patch. Added another local variable
> is_partitioned_index to store the classform value. The main reason we
> need the classform is because the existing relkind and
> expected_relkind local variables would only show RELKIND_INDEX whereas
> we needed exactly RELKIND_PARTITIONED_INDEX.

Yeah. As I looked at that I realized that it was totally confusing:
at least one previous author thought that "relkind" stored the rel's
actual relkind, which it doesn't as the code stands. In particular,
in this bit:

    if ((relkind == RELKIND_INDEX || relkind == RELKIND_PARTITIONED_INDEX) &&
        relOid != oldRelOid)
    {
        state->heapOid = IndexGetRelation(relOid, true);

the test for relkind == RELKIND_PARTITIONED_INDEX is dead code
because relkind will never be that. It accidentally works anyway
because the other half of the || does the right thing, but somebody
was confused, and so will readers be.

Hence, I propose the attached. 0001 is pure refactoring: it hopefully
clears up the confusion about which "relkind" is which, and it also
saves a couple of redundant syscache fetches in RemoveRelations.
Then 0002 adds the actual bug fix as well as a test case that does
deadlock with unpatched code.

regards, tom lane

Attachments:

v4-0001-preliminary-refactoring.patch (+28 -15)
v4-0002-fix-concurrent-deadlock.patch (+157 -0)
#7 Jimmy Yih
jyih@vmware.com
In reply to: Tom Lane (#6)
Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Hence, I propose the attached. 0001 is pure refactoring: it hopefully
> clears up the confusion about which "relkind" is which, and it also
> saves a couple of redundant syscache fetches in RemoveRelations.
> Then 0002 adds the actual bug fix as well as a test case that does
> deadlock with unpatched code.

The proposed patches look great and make much more sense. I see you've
already squashed and committed in
7b6ec86532c2ca585d671239bba867fe380448ed. Thanks!

--
Jimmy Yih (VMware)
Gaurab Dey (VMware)