BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

Started by wangchuantingalmost 9 years ago30 messageshackersbugs
Jump to latest
#1wangchuanting
wangchuanting@huawei.com
hackersbugs

The following bug has been logged on the website:

Bug reference: 14680
Logged by: chuanting wang
Email address: wangchuanting@huawei.com
PostgreSQL version: 10beta1
Operating system: SuSE
Description:

startup process on standby encounter a deadlock of TwoPhaseStateLock when
redo 2PC xlog.

the stack is:
Thread 6 (Thread 0x7f002e3ff700 (LWP 35305)):
#0 0x00007f1106167627 in semop () from /lib64/libc.so.6
#1 0x0000000000ce5622 in PGSemaphoreLock(PGSemaphoreData*, bool) ()
#2 0x0000000000d73e4d in LWLockAcquire(LWLockId, LWLockMode) ()
#3 0x0000000000969028 in RemoveGXact(GlobalTransactionData*) ()
#4 0x000000000096b254 in ProcessTwoPhaseBuffer(unsigned int, XLogRecPtr,
bool, bool, bool) ()
#5 0x000000000096b6b8 in PrescanPreparedTransactions(unsigned int**, int*)
()
#6 0x000000000097af97 in xlog_redo(XLogReaderState*) ()
#7 0x00000000009839df in StartupXLOG() ()
#8 0x0000000000cffd0b in StartupProcessMain() ()
#9 0x0000000000991539 in AuxiliaryProcessMain(int, char**) ()
#10 0x0000000000cfb209 in SubPostmasterMain(int, char**) ()
#11 0x0000000000cfb4f2 in MainStarterThreadFunc(void*) ()
#12 0x00007f11099a17b6 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f1106165d6d in clone () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()

the reason is:
when do PrescanPreparedTransactions, startup process already granted a
shared lock on TwoPhaseStateLock, then it called RemoveGXact, but in
RemoveGXact it acquired exclusive lock on TwoPhaseStateLock, so it blocked
by itself

anyone, has some idea on how to fix it?

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: wangchuanting (#1)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

wangchuanting@huawei.com writes:

startup process on standby encounter a deadlock of TwoPhaseStateLock when
redo 2PC xlog.

Please provide an example of how to get into this state.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#2)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

On Wed, May 31, 2017 at 6:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

wangchuanting@huawei.com writes:

startup process on standby encounter a deadlock of TwoPhaseStateLock when
redo 2PC xlog.

Please provide an example of how to get into this state.

That would help. Are you seeing in the logs something like "removing
future two-phase state from memory for XXX" or "removing stale
two-phase state from shared memory for XXX"?

Even with that, the light-weight lock sequence taken in those code
paths look definitely wrong to me, we should not take twice
TwoPhaseStateLock in the same code path. I think that we should remove
the lock acquisitions in RemoveGXact() and PrepareRedoRemove, and then
upgrade the locks of PrescanPreparedTransactions() and
StandbyRecoverPreparedTransactions() to be exclusive. We still need to
keep a lock as CheckPointTwoPhase() may still be triggered by the
checkpoint. Tom, what do you think?
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#4wangchuanting
wangchuanting@huawei.com
In reply to: Michael Paquier (#3)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

we use Postgres-XC+pg10(4 coordinator + 4 datanode(pg 10 as datanode, and 1
master 1 standby for each datanode)), and benchmark tpcc, there is some
cross datanode transactions that use 2pc, during testing, we restart the
cluster, then one datanode standby can not recovery done and hangup with
TwoPhaseStateLock deadlock.

and sorry, we reinstall the cluster, and the log is removed, we will try to
reproduce, but anyway, the code is not right like Michael saied.

--
View this message in context: http://www.postgresql-archive.org/BUG-14680-startup-process-on-standby-encounter-a-deadlock-of-TwoPhaseStateLock-when-redo-2PC-xlog-tp5964069p5964241.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#5Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#3)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

On Wed, May 31, 2017 at 12:30 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, May 31, 2017 at 6:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

wangchuanting@huawei.com writes:

startup process on standby encounter a deadlock of TwoPhaseStateLock when
redo 2PC xlog.

Please provide an example of how to get into this state.

That would help. Are you seeing in the logs something like "removing
future two-phase state from memory for XXX" or "removing stale
two-phase state from shared memory for XXX"?

Even with that, the light-weight lock sequence taken in those code
paths look definitely wrong to me, we should not take twice
TwoPhaseStateLock in the same code path. I think that we should remove
the lock acquisitions in RemoveGXact() and PrepareRedoRemove, and then
upgrade the locks of PrescanPreparedTransactions() and
StandbyRecoverPreparedTransactions() to be exclusive. We still need to
keep a lock as CheckPointTwoPhase() may still be triggered by the
checkpoint. Tom, what do you think?

Attached is what I was thinking about for reference. I just came back
from a long flight and I am pretty tired, so my brain may have missed
something. I'll take again a look at this issue on Monday, an open
item has been added for now.
--
Michael

Attachments:

2pc-redo-lwlock-fix.patchapplication/octet-stream; name=2pc-redo-lwlock-fix.patchDownload+20-15
#6Michael Paquier
michael@paquier.xyz
In reply to: wangchuanting (#4)
hackersbugs
Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

On Thu, Jun 1, 2017 at 12:11 AM, wangchuanting <wangchuanting@huawei.com> wrote:

we use Postgres-XC+pg10(4 coordinator + 4 datanode(pg 10 as datanode, and 1
master 1 standby for each datanode)), and benchmark tpcc, there is some
cross datanode transactions that use 2pc, during testing, we restart the
cluster, then one datanode standby can not recovery done and hangup with
TwoPhaseStateLock deadlock.

(Former Postgres-XC maintainer here)
Are you aware of the fact that this is not going to work? Postgres
protocol has been extended between coordinators and datanodes to be
able to push down transaction ID, snapshot as well as timestamps when
running transactions across nodes. So by using a community Postgres as
a datanode you break the global consistency of the cluster. There are
also a couple of other things proper to datanodes.

and sorry, we reinstall the cluster, and the log is removed, we will try to
reproduce, but anyway, the code is not right like Michael said.

On that I agree.
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#7wangchuanting
wangchuanting@huawei.com
In reply to: Michael Paquier (#6)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

So by using a community Postgres as
a datanode you break the global consistency of the cluster. There are
also a couple of other things proper to datanodes.

sorry, i confuse you, actually we merged "Speedup 2PC recovery by skipping
two phase state files in normal path" to pgxc.

--
View this message in context: http://www.postgresql-archive.org/BUG-14680-startup-process-on-standby-encounter-a-deadlock-of-TwoPhaseStateLock-when-redo-2PC-xlog-tp5964069p5964398.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#8wangchuanting
wangchuanting@huawei.com
In reply to: Michael Paquier (#5)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

Hi Michael,

appreciate for your quick response.

i look carefully on the patch, because of we removed TwoPhaseStateLock
lwlock acquire in `RemoveGXact' and let caller held lwlock, so i think:
1. xact_redo also need held lwlock before call PrepareRedoRemove
2. RecoverPreparedTransactions also need held lwlock before call
ProcessTwoPhaseBuffer

--
View this message in context: http://www.postgresql-archive.org/BUG-14680-startup-process-on-standby-encounter-a-deadlock-of-TwoPhaseStateLock-when-redo-2PC-xlog-tp5964069p5964407.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#9Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#5)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

On Thu, Jun 01, 2017 at 01:07:53AM -0700, Michael Paquier wrote:

On Wed, May 31, 2017 at 12:30 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, May 31, 2017 at 6:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

wangchuanting@huawei.com writes:

startup process on standby encounter a deadlock of TwoPhaseStateLock when
redo 2PC xlog.

Please provide an example of how to get into this state.

That would help. Are you seeing in the logs something like "removing
future two-phase state from memory for XXX" or "removing stale
two-phase state from shared memory for XXX"?

Even with that, the light-weight lock sequence taken in those code
paths look definitely wrong to me, we should not take twice
TwoPhaseStateLock in the same code path. I think that we should remove
the lock acquisitions in RemoveGXact() and PrepareRedoRemove, and then
upgrade the locks of PrescanPreparedTransactions() and
StandbyRecoverPreparedTransactions() to be exclusive. We still need to
keep a lock as CheckPointTwoPhase() may still be triggered by the
checkpoint. Tom, what do you think?

Attached is what I was thinking about for reference. I just came back
from a long flight and I am pretty tired, so my brain may have missed
something. I'll take again a look at this issue on Monday, an open
item has been added for now.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Simon,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1]/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#10Michael Paquier
michael@paquier.xyz
In reply to: wangchuanting (#8)
hackersbugs
Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

On Fri, Jun 2, 2017 at 10:59 AM, wangchuanting <wangchuanting@huawei.com> wrote:

Appreciate for your quick response.

Back into business for this issue.

i look carefully on the patch, because of we removed TwoPhaseStateLock
lwlock acquire in `RemoveGXact' and let caller held lwlock, so i think:
1. xact_redo also need held lwlock before call PrepareRedoRemove
2. RecoverPreparedTransactions also need held lwlock before call
ProcessTwoPhaseBuffer

Thanks for the input. I have reviewed all the code paths that have
been modified, and strengthened the code with assertions using
LWLockHeldByMeInMode() to make sure that the correct lock is always
hold in those code paths. There is actually no point in holding the
lock in restoreTwoPhaseData(), but as this makes the code less
consistent with the rest I added one. Also, I have replaced the lock
acquisition in PrepareRedoAdd() with acquisitions at higher levels,
and added an assertion in this routine. This makes the 2PC state data
addition and removal more consistent with each other.
--
Michael

Attachments:

2pc-redo-lwlock-fix-v2.patchapplication/octet-stream; name=2pc-redo-lwlock-fix-v2.patchDownload+51-27
#11Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#9)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

On Mon, Jun 5, 2017 at 7:24 AM, Noah Misch <noah@leadboat.com> wrote:

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Simon,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

I have just provided a patch that takes care of solving this issue and
cleans up the lock handling for all the 2PC redo code paths.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12wangchuanting
wangchuanting@huawei.com
In reply to: Michael Paquier (#10)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

Hi, Michael

I don't understand why the patch remove TwoPhaseStateLock in MarkAsPrepared.

if so:
1. does it need remove lock acquire in MarkAsPreparing also?
2. MarkAsPrepared will call ProcArrayAdd, and in ProcArrayAdd it acquires
ProcArrayLock, we should carefully handle the lock sequence about these two
locks.

--
View this message in context: http://www.postgresql-archive.org/BUG-14680-startup-process-on-standby-encounter-a-deadlock-of-TwoPhaseStateLock-when-redo-2PC-xlog-tp5964069p5964796.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#13Michael Paquier
michael@paquier.xyz
In reply to: wangchuanting (#12)
hackersbugs
Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

On Mon, Jun 5, 2017 at 8:32 PM, wangchuanting <wangchuanting@huawei.com> wrote:

I don't understand why the patch remove TwoPhaseStateLock in MarkAsPrepared.

This one is on purpose, no low-level routines in this patch acquire
TwoPhaseStateLock. Logically I agree that it does not change much
but... You wrote something below about that.

if so:
1. does it need remove lock acquire in MarkAsPreparing also?

No, MarkAsPreparing() is used only in the non-redo code path.

2. MarkAsPrepared will call ProcArrayAdd, and in ProcArrayAdd it acquires
ProcArrayLock, we should carefully handle the lock sequence about these two
locks.

And this gives a reason to not complicate our lives.
--
Michael

Attachments:

2pc-redo-lwlock-fix-v3.patchapplication/octet-stream; name=2pc-redo-lwlock-fix-v3.patchDownload+42-22
#14Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#9)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

On Sun, Jun 04, 2017 at 10:24:30PM +0000, Noah Misch wrote:

On Thu, Jun 01, 2017 at 01:07:53AM -0700, Michael Paquier wrote:

On Wed, May 31, 2017 at 12:30 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, May 31, 2017 at 6:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

wangchuanting@huawei.com writes:

startup process on standby encounter a deadlock of TwoPhaseStateLock when
redo 2PC xlog.

Please provide an example of how to get into this state.

That would help. Are you seeing in the logs something like "removing
future two-phase state from memory for XXX" or "removing stale
two-phase state from shared memory for XXX"?

Even with that, the light-weight lock sequence taken in those code
paths look definitely wrong to me, we should not take twice
TwoPhaseStateLock in the same code path. I think that we should remove
the lock acquisitions in RemoveGXact() and PrepareRedoRemove, and then
upgrade the locks of PrescanPreparedTransactions() and
StandbyRecoverPreparedTransactions() to be exclusive. We still need to
keep a lock as CheckPointTwoPhase() may still be triggered by the
checkpoint. Tom, what do you think?

Attached is what I was thinking about for reference. I just came back
from a long flight and I am pretty tired, so my brain may have missed
something. I'll take again a look at this issue on Monday, an open
item has been added for now.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Simon,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#15Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#14)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

On Fri, Jun 9, 2017 at 3:17 PM, Noah Misch <noah@leadboat.com> wrote:

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I have sent an updated patch simplifying the locking here:
/messages/by-id/CAB7nPqQthLP9GvD2242epHKOBkDMd+03tSuFvK3cVZsGarQyWA@mail.gmail.com
--
Michael

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#16Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#14)
hackersbugs
Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

On Thu, Jun 08, 2017 at 11:17:38PM -0700, Noah Misch wrote:

On Sun, Jun 04, 2017 at 10:24:30PM +0000, Noah Misch wrote:

On Thu, Jun 01, 2017 at 01:07:53AM -0700, Michael Paquier wrote:

On Wed, May 31, 2017 at 12:30 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, May 31, 2017 at 6:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

wangchuanting@huawei.com writes:

startup process on standby encounter a deadlock of TwoPhaseStateLock when
redo 2PC xlog.

Please provide an example of how to get into this state.

That would help. Are you seeing in the logs something like "removing
future two-phase state from memory for XXX" or "removing stale
two-phase state from shared memory for XXX"?

Even with that, the light-weight lock sequence taken in those code
paths look definitely wrong to me, we should not take twice
TwoPhaseStateLock in the same code path. I think that we should remove
the lock acquisitions in RemoveGXact() and PrepareRedoRemove, and then
upgrade the locks of PrescanPreparedTransactions() and
StandbyRecoverPreparedTransactions() to be exclusive. We still need to
keep a lock as CheckPointTwoPhase() may still be triggered by the
checkpoint. Tom, what do you think?

Attached is what I was thinking about for reference. I just came back
from a long flight and I am pretty tired, so my brain may have missed
something. I'll take again a look at this issue on Monday, an open
item has been added for now.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Simon,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due
for your status update. Please reacquaint yourself with the policy on open
item ownership[1]/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com and then reply immediately. If I do not hear from you by
2017-06-11 07:00 UTC, I will transfer this item to release management team
ownership without further notice.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Noah Misch (#16)
hackersbugs
Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

Noah Misch wrote:

IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due
for your status update. Please reacquaint yourself with the policy on open
item ownership[1] and then reply immediately. If I do not hear from you by
2017-06-11 07:00 UTC, I will transfer this item to release management team
ownership without further notice.

I volunteer to own this item. I'll report on Wednesday 14th at the latest.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#10)
hackersbugs
Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

Michael Paquier wrote:

i look carefully on the patch, because of we removed TwoPhaseStateLock
lwlock acquire in `RemoveGXact' and let caller held lwlock, so i think:
1. xact_redo also need held lwlock before call PrepareRedoRemove
2. RecoverPreparedTransactions also need held lwlock before call
ProcessTwoPhaseBuffer

Thanks for the input. I have reviewed all the code paths that have
been modified, and strengthened the code with assertions using
LWLockHeldByMeInMode() to make sure that the correct lock is always
hold in those code paths. There is actually no point in holding the
lock in restoreTwoPhaseData(), but as this makes the code less
consistent with the rest I added one. Also, I have replaced the lock
acquisition in PrepareRedoAdd() with acquisitions at higher levels,
and added an assertion in this routine. This makes the 2PC state data
addition and removal more consistent with each other.

Thanks for the patch -- I agree that the new coding looks better.

I also agree that restoreTwoPhaseData doesn't need to hold the lock,
since we don't expect anything running concurrently, but that it's okay
to hold it and makes the whole thing simpler.

We could try to apply an equivalent rationale to
RecoverPreparedTransactions: even though we have already been running in
HOT standby mode for a while, there's no possibility of concurrent
activity either: since we exited recovery, master cannot write any more
2PC xacts, and we haven't yet set the flag that lets open sessions write
WAL. However, it seems mildly dangerous to run the bottom sections of
the loop with the lock held, since it acquires other lwlocks and
generally does a lot of crap.

Also, let's add an lwlock-is-held assertion to MarkAsPreparingGuts.

BTW: I think that saving one redundant line of code is not worth the
ugliness that it costs in xact_redo, so let's undo that. The patch is a
bit bulky, but the resulting code is simpler.

In short, I propose the attached.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

2pc-redo-lwlock-fix-v4.patchtext/plain; charset=us-asciiDownload+81-65
#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#18)
hackersbugs
Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

Alvaro Herrera wrote:

For some reason I sent a preliminary version of the email I was writing.
Some additional thoughts that were supposed to be in there:

I also agree that restoreTwoPhaseData doesn't need to hold the lock,
since we don't expect anything running concurrently, but that it's okay
to hold it and makes the whole thing simpler.

It's okay to hold the lock for the whole duration of the function.

We could try to apply an equivalent rationale to
RecoverPreparedTransactions: even though we have already been running in
HOT standby mode for a while, there's no possibility of concurrent
activity either: since we exited recovery, master cannot write any more
2PC xacts, and we haven't yet set the flag that lets open sessions write
WAL. However, it seems mildly dangerous to run the bottom sections of
the loop with the lock held, since it acquires other lwlocks and
generally does a lot of crap.

Therefore, let's adopt your idea of acquiring the lock only for the
specific low-level calls that require it. But the comment atop the loop
needs a lot of work to explain why it's doing that (i.e. how come it's
reading TwoPhaseState without a lock), as in my proposed patch.

(At first, I didn't really like this very much and wanted to add more
locking to that function, but it turned quite messy, so I back-tracked).

Thanks

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: wangchuanting (#4)
hackersbugs
Re: Re: BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog

wangchuanting wrote:

we use Postgres-XC+pg10(4 coordinator + 4 datanode(pg 10 as datanode, and 1
master 1 standby for each datanode)), and benchmark tpcc, there is some
cross datanode transactions that use 2pc, during testing, we restart the
cluster, then one datanode standby can not recovery done and hangup with
TwoPhaseStateLock deadlock.

and sorry, we reinstall the cluster, and the log is removed, we will try to
reproduce, but anyway, the code is not right like Michael saied.

Did you have any luck reproducing the original issue? It would be good
if you could confirm that the bug is gone with the patch I posted in
/messages/by-id/20170611022536.goef4cdbmgicye5g@alvherre.pgsql

Thanks

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#21Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#19)
hackersbugs
#22wangchuanting
wangchuanting@huawei.com
In reply to: Alvaro Herrera (#20)
hackersbugs
#23Michael Paquier
michael@paquier.xyz
In reply to: wangchuanting (#22)
hackersbugs
#24wangchuanting
wangchuanting@huawei.com
In reply to: Michael Paquier (#23)
hackersbugs
#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#21)
hackersbugs
#26Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#25)
hackersbugs
#27Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#26)
hackersbugs
#28Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#27)
hackersbugs
#29Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#28)
hackersbugs
#30Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#29)
hackersbugs