TopoSort() fix
Hi Hackers,
I think I found an issue in the TopoSort() function.
As the comments say,
/* .....
* ...... If there are any other processes
* in the same lock group on the queue, set their number of
* beforeConstraints to -1 to indicate that they should be
emitted
* with their groupmates rather than considered separately.
*/
If the line "break;" exists, there is no chance to set beforeConstraints to
-1 for other processes in the same lock group.
So, I think we need delete the line "break;" . See the patch.
I just took a look, and I found all the following versions have this line .
postgresql-12beta2, postgresql-12beta1, postgresql-11.4,
postgresql-11.3,postgresql-11.0,
postgresql-10.9,postgresql-10.5, postgresql-10.0
Thanks,
Ruihai
Attachments:
TopoSort.patchapplication/octet-stream; name=TopoSort.patchDownload+0-1
Rui Hai Jiang <ruihaij@gmail.com> writes:
I think I found an issue in the TopoSort() function.
This indeed seems like a live bug introduced by a1c1af2a.
Robert?
regards, tom lane
Could the attached patch fix this issue? Or does any one else plan to fix
it?
If people are busy and have not time, I can go ahead to fix it. To fix
this issue, do we need a patch for each official branch?
Regards,
Ruihai
On Tue, Jul 2, 2019 at 11:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
Rui Hai Jiang <ruihaij@gmail.com> writes:
I think I found an issue in the TopoSort() function.
This indeed seems like a live bug introduced by a1c1af2a.
Robert?regards, tom lane
On Wed, Jul 03, 2019 at 10:41:59AM +0800, Rui Hai Jiang wrote:
Could the attached patch fix this issue? Or does any one else plan to fix
it?If people are busy and have not time, I can go ahead to fix it. To fix
this issue, do we need a patch for each official branch?
Only a committer could merge any fix you produce. What you have sent
looks fine to me, so let's wait for Robert, who has visiblu broken
this part to comment. Back-patched versions are usually taken care of
by the committer merging the fix, and by experience it is better to
agree about the shape of a patch on HEAD before working on other
branches. Depending on the review done, the patch's shape may change
slightly...
--
Michael
On Tue, Jul 2, 2019 at 11:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Rui Hai Jiang <ruihaij@gmail.com> writes:
I think I found an issue in the TopoSort() function.
This indeed seems like a live bug introduced by a1c1af2a.
Robert?
This is pretty thoroughly swapped out of my head, but it looks like
that analysis might be correct.
Is it practical to come up with a test case that demonstrates the problem?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I'll try to figure out some scenarios to do the test. A parallel process
group is needed for the test.
Actually I was trying to do some testing against the locking mechanism. I
happened to see this issue.
On Wed, Jul 3, 2019 at 9:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
Show quoted text
On Tue, Jul 2, 2019 at 11:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Rui Hai Jiang <ruihaij@gmail.com> writes:
I think I found an issue in the TopoSort() function.
This indeed seems like a live bug introduced by a1c1af2a.
Robert?This is pretty thoroughly swapped out of my head, but it looks like
that analysis might be correct.Is it practical to come up with a test case that demonstrates the problem?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2019-Jul-04, Rui Hai Jiang wrote:
I'll try to figure out some scenarios to do the test. A parallel process
group is needed for the test.Actually I was trying to do some testing against the locking mechanism. I
happened to see this issue.
Hello, is anybody looking into this issue?
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2019-07-26 18:05:38 -0400, Alvaro Herrera wrote:
On 2019-Jul-04, Rui Hai Jiang wrote:
I'll try to figure out some scenarios to do the test. A parallel process
group is needed for the test.
Rui, have you made any progress on this?
Actually I was trying to do some testing against the locking mechanism. I
happened to see this issue.Hello, is anybody looking into this issue?
I guess this is on Robert's docket otherwise. He's on vacation till
early next week...
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2019-07-26 18:05:38 -0400, Alvaro Herrera wrote:
Hello, is anybody looking into this issue?
I guess this is on Robert's docket otherwise. He's on vacation till
early next week...
I think this is a sufficiently obvious bug, and a sufficiently
obvious fix, that we should just fix it and not insist on getting
a reproducible test case first. I think a test case would soon
bit-rot anyway, and no longer exercise the problem.
I certainly do *not* want to wait so long that we miss the
upcoming minor releases.
regards, tom lane
On Fri, Jul 26, 2019 at 08:24:16PM -0400, Tom Lane wrote:
I think this is a sufficiently obvious bug, and a sufficiently
obvious fix, that we should just fix it and not insist on getting
a reproducible test case first. I think a test case would soon
bit-rot anyway, and no longer exercise the problem.I certainly do *not* want to wait so long that we miss the
upcoming minor releases.
+1. Any volunteers?
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
On Fri, Jul 26, 2019 at 08:24:16PM -0400, Tom Lane wrote:
I think this is a sufficiently obvious bug, and a sufficiently
obvious fix, that we should just fix it and not insist on getting
a reproducible test case first. I think a test case would soon
bit-rot anyway, and no longer exercise the problem.
I certainly do *not* want to wait so long that we miss the
upcoming minor releases.
+1. Any volunteers?
If Robert doesn't weigh in pretty soon, I'll take responsibility for it.
regards, tom lane
On Mon, Jul 29, 2019 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael@paquier.xyz> writes:
On Fri, Jul 26, 2019 at 08:24:16PM -0400, Tom Lane wrote:
I think this is a sufficiently obvious bug, and a sufficiently
obvious fix, that we should just fix it and not insist on getting
a reproducible test case first. I think a test case would soon
bit-rot anyway, and no longer exercise the problem.
I certainly do *not* want to wait so long that we miss the
upcoming minor releases.+1. Any volunteers?
If Robert doesn't weigh in pretty soon, I'll take responsibility for it.
That's fine, or if you prefer that I commit it, I will.
I just got back from a week's vacation and am only very gradually
unburying myself from mounds of email. (Of course, the way
pgsql-hackers is getting, there's sort of always a mound of email
these days.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[ removing <ruihaij@gmail.com>, as that mailing address seems to be MIA ]
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jul 29, 2019 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
If Robert doesn't weigh in pretty soon, I'll take responsibility for it.
That's fine, or if you prefer that I commit it, I will.
FYI, I just got done inventing a way to reach that code, and I have
to suspect that it's impossible to do so in production, because under
ordinary circumstances no parallel worker will take any exclusive lock
that isn't already held by its leader. (If you happen to know an
easy counterexample, let's see it.)
The attached heavily-hacked version of deadlock-soft.spec makes it go by
forcing duplicate advisory locks to be taken in worker processes, which
of course first requires disabling PreventAdvisoryLocksInParallelMode().
I kind of wonder if we should provide some debug-only, here-be-dragons
way to disable that restriction so that we could make this an official
regression test, because I'm now pretty suspicious that none of this code
has ever executed before.
Anyway, armed with this, I was able to prove that HEAD just hangs up
on this test case; apparently the deadlock checker never detects that
the additional holders of the advisory lock need to be rearranged.
And removing that "break" fixes it.
So I'll go commit the break-ectomy, but what do people think about
testing this better?
regards, tom lane
Attachments:
hack-to-test-parallel-locking.patchtext/x-diff; charset=us-ascii; name=hack-to-test-parallel-locking.patchDownload+91-24
On Mon, Jul 29, 2019 at 5:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
FYI, I just got done inventing a way to reach that code, and I have
to suspect that it's impossible to do so in production, because under
ordinary circumstances no parallel worker will take any exclusive lock
that isn't already held by its leader. (If you happen to know an
easy counterexample, let's see it.)
I think the way you could make that happen would be to run a parallel
query that calls a user-defined function which does LOCK TABLE.
Anyway, armed with this, I was able to prove that HEAD just hangs up
on this test case; apparently the deadlock checker never detects that
the additional holders of the advisory lock need to be rearranged.
And removing that "break" fixes it.
Nice!
So I'll go commit the break-ectomy, but what do people think about
testing this better?
I think it's a great idea. I was never very happy with the amount of
exercise I was able to give this code.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jul 29, 2019 at 5:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
FYI, I just got done inventing a way to reach that code, and I have
to suspect that it's impossible to do so in production, because under
ordinary circumstances no parallel worker will take any exclusive lock
that isn't already held by its leader. (If you happen to know an
easy counterexample, let's see it.)
I think the way you could make that happen would be to run a parallel
query that calls a user-defined function which does LOCK TABLE.
I tried that first. There are backstops preventing doing LOCK TABLE
in a worker, just like for advisory locks.
I believe the only accessible route to taking any sort of new lock
in a parallel worker is catalog lookups causing AccessShareLock on
a catalog. In principle, maybe you could make a deadlock situation
by combining parallel workers with something that takes
AccessExclusiveLock on a catalog ... but making that into a reliable
test case sounds about impossible, because AEL on a catalog will
have all sorts of unpleasant side-effects, such as blocking
isolationtester's own queries. (Getting it to work in a
CLOBBER_CACHE_ALWAYS build seems right out, for instance.)
regards, tom lane
On Mon, Jul 29, 2019 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I tried that first. There are backstops preventing doing LOCK TABLE
in a worker, just like for advisory locks.I believe the only accessible route to taking any sort of new lock
in a parallel worker is catalog lookups causing AccessShareLock on
a catalog.
Can't the worker just query a previously-untouched table, maybe by
constructing a string and then using EXECUTE to execute it?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 29, 2019 at 10:56:05AM -0400, Tom Lane wrote:
Michael Paquier <michael@paquier.xyz> writes:
+1. Any volunteers?
If Robert doesn't weigh in pretty soon, I'll take responsibility for it.
Thanks Tom for taking care of it!
--
Michael
On 2019-Jul-29, Tom Lane wrote:
FYI, I just got done inventing a way to reach that code, and I have
to suspect that it's impossible to do so in production, because under
ordinary circumstances no parallel worker will take any exclusive lock
that isn't already held by its leader.
Hmm, okay, so this wasn't a bug that would have bit anyone in practice,
yeah? That's reassuring.
Thanks,
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
On 2019-Jul-29, Tom Lane wrote:
FYI, I just got done inventing a way to reach that code, and I have
to suspect that it's impossible to do so in production, because under
ordinary circumstances no parallel worker will take any exclusive lock
that isn't already held by its leader.
Hmm, okay, so this wasn't a bug that would have bit anyone in practice,
yeah? That's reassuring.
At the least, you'd have to go well out of your way to make it happen.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Jul 29, 2019 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I believe the only accessible route to taking any sort of new lock
in a parallel worker is catalog lookups causing AccessShareLock on
a catalog.
Can't the worker just query a previously-untouched table, maybe by
constructing a string and then using EXECUTE to execute it?
Hm, yeah, looks like you could get a new AccessShareLock that way too.
But not any exclusive lock.
I also looked into whether one could use SELECT FOR UPDATE/SHARE to get
stronger locks at a tuple level, but that's been blocked off as well.
You guys really did a pretty good job of locking that down.
After thinking about this for awhile, though, I believe it might be
reasonable to just remove PreventAdvisoryLocksInParallelMode()
altogether. The "parallel unsafe" markings on the advisory-lock
functions seem like adequate protection against somebody running
them in a parallel worker. If you defeat that by calling them from
a mislabeled-parallel-safe wrapper (as the proposed test case does),
then any negative consequences are on your own head. AFAICT the
only actual negative consequence is that the locks disappear the
moment the parallel worker exits, so we'd not be opening any large
holes even to people who rip off the safety cover.
(BTW, why aren't these functions just "parallel restricted"?)
regards, tom lane