pgsql: Add some isolation tests for deadlock detection and resolution.

Started by Robert Haasabout 10 years ago30 messageshackers

Jump to latest

Robert Haas

robertmhaas@gmail.com

about 10 years ago

Add some isolation tests for deadlock detection and resolution.

Previously, we had no test coverage for the deadlock detector.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/4c9864b9b4d87d02f07f40bb27976da737afdcab

Modified Files
--------------
src/test/isolation/expected/deadlock-hard.out | 35 +++++++++++++
src/test/isolation/expected/deadlock-simple.out | 11 ++++
src/test/isolation/expected/deadlock-soft-2.out | 17 ++++++
src/test/isolation/expected/deadlock-soft.out | 17 ++++++
src/test/isolation/isolation_schedule | 4 ++
src/test/isolation/specs/deadlock-hard.spec | 70 +++++++++++++++++++++++++
src/test/isolation/specs/deadlock-simple.spec | 29 ++++++++++
src/test/isolation/specs/deadlock-soft-2.spec | 38 ++++++++++++++
src/test/isolation/specs/deadlock-soft.spec | 40 ++++++++++++++
9 files changed, 261 insertions(+)

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Robert Haas (#1)

Re: pgsql: Add some isolation tests for deadlock detection and resolution.

Robert Haas <rhaas@postgresql.org> writes:

Add some isolation tests for deadlock detection and resolution.

Buildfarm says this needs work ...

dromedary is one of mine, do you need me to look into what is
happening?

regards, tom lane

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Tom Lane (#2)

Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

On Thu, Feb 11, 2016 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <rhaas@postgresql.org> writes:

Add some isolation tests for deadlock detection and resolution.

Buildfarm says this needs work ...

dromedary is one of mine, do you need me to look into what is
happening?

That would be great. Taking a look at what happened, I have a feeling
this may be a race condition of some kind in the isolation tester. It
seems to have failed to recognize that a1 started waiting, and that
caused the "deadlock detected" message to reported differently. I'm
not immediately sure what to do about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Robert Haas (#3)

Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

On Thu, Feb 11, 2016 at 9:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Feb 11, 2016 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <rhaas@postgresql.org> writes:

Add some isolation tests for deadlock detection and resolution.

Buildfarm says this needs work ...

dromedary is one of mine, do you need me to look into what is
happening?

That would be great. Taking a look at what happened, I have a feeling
this may be a race condition of some kind in the isolation tester. It
seems to have failed to recognize that a1 started waiting, and that
caused the "deadlock detected" message to reported differently. I'm
not immediately sure what to do about that.

Yeah, so: try_complete_step() waits 10ms, and if it still hasn't
gotten any data back from the server, then it uses a separate query to
see whether the step in question is waiting on a lock. So what
must've happened here is that it took more than 10ms for the process
to show up as waiting in pg_stat_activity.

It might be possible to fix this by not passing STEP_NONBLOCK if
there's only one connection that isn't waiting. I think I had it like
that at one point, and then took it out because it caused some other
problem. Another option is to lengthen the timeout. It doesn't seem
great to be dependent on a fixed timeout, but the server doesn't send
any protocol traffic to indicate a lock wait. If we declared which
steps are supposed to wait, then there'd be no ambiguity, but that
seems like a drag.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Robert Haas (#4)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

Robert Haas <robertmhaas@gmail.com> writes:

That would be great. Taking a look at what happened, I have a feeling
this may be a race condition of some kind in the isolation tester. It
seems to have failed to recognize that a1 started waiting, and that
caused the "deadlock detected" message to reported differently. I'm
not immediately sure what to do about that.

Yeah, so: try_complete_step() waits 10ms, and if it still hasn't
gotten any data back from the server, then it uses a separate query to
see whether the step in question is waiting on a lock. So what
must've happened here is that it took more than 10ms for the process
to show up as waiting in pg_stat_activity.

No, because the machines that are failing are showing a "<waiting ...>"
annotation that your reference output *doesn't* have. I think what is
actually happening is that these machines are seeing the process as
waiting and reporting it, whereas on your machine the backend detects
the deadlock and completes the query (with an error) before
isolationtester realizes that the process is waiting.

It would probably help if you didn't do this:

setup { BEGIN; SET deadlock_timeout = '10ms'; }

which pretty much guarantees that there is a race condition: you've set it
so that the deadlock detector will run at approximately the same time when
isolationtester will be probing the state. I'm surprised that it seemed
to act consistently for you. I would suggest putting all the other
sessions to deadlock_timeout of 100s and the one you want to fail to
timeout of ~ 5s. That will mean that the "<waiting ...>" output should
show up pretty reliably even on overloaded buildfarm critters.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Tom Lane (#5)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

I wrote:

No, because the machines that are failing are showing a "<waiting ...>"
annotation that your reference output *doesn't* have. I think what is
actually happening is that these machines are seeing the process as
waiting and reporting it, whereas on your machine the backend detects
the deadlock and completes the query (with an error) before
isolationtester realizes that the process is waiting.

I confirmed this theory by the expedient of changing the '10ms' setting
in the test script to 1ms (which worked) and 100ms (which did not, on
the same machine).

I've committed an update that adjusts the timeouts to hopefully ensure
that isolationtester always sees the query as blocked before the deadlock
detector unblocks it; which seems like the behavior we want to test for,
anyway.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Tom Lane (#6)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

On Thu, Feb 11, 2016 at 12:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

No, because the machines that are failing are showing a "<waiting ...>"
annotation that your reference output *doesn't* have. I think what is
actually happening is that these machines are seeing the process as
waiting and reporting it, whereas on your machine the backend detects
the deadlock and completes the query (with an error) before
isolationtester realizes that the process is waiting.

I confirmed this theory by the expedient of changing the '10ms' setting
in the test script to 1ms (which worked) and 100ms (which did not, on
the same machine).

I've committed an update that adjusts the timeouts to hopefully ensure
that isolationtester always sees the query as blocked before the deadlock
detector unblocks it; which seems like the behavior we want to test for,
anyway.

Thanks. I really appreciate it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Tom Lane (#6)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

We're not out of the woods on this :-( ... jaguarundi, which is the first
of the CLOBBER_CACHE_ALWAYS animals to run these tests, didn't like them
at all. I think I fixed the deadlock-soft-2 failure, but its take on
deadlock-hard is:

*** 17,25 ****
  step s6a7: LOCK TABLE a7; <waiting ...>
  step s7a8: LOCK TABLE a8; <waiting ...>
  step s8a1: LOCK TABLE a1; <waiting ...>
- step s8a1: <... completed>
  step s7a8: <... completed>
! error in steps s8a1 s7a8: ERROR:  deadlock detected
  step s8c: COMMIT;
  step s7c: COMMIT;
  step s6a7: <... completed>
--- 17,25 ----
  step s6a7: LOCK TABLE a7; <waiting ...>
  step s7a8: LOCK TABLE a8; <waiting ...>
  step s8a1: LOCK TABLE a1; <waiting ...>
  step s7a8: <... completed>
! step s8a1: <... completed>
! ERROR:  deadlock detected
  step s8c: COMMIT;
  step s7c: COMMIT;
  step s6a7: <... completed>

The problem here is that when the deadlock detector kills s8's
transaction, s7a8 is also left free to proceed, so there is a race
condition as to which query completion will get back to
isolationtester first.

One grotty way to handle that would be something like

-step "s7a8"	{ LOCK TABLE a8; }
+step "s7a8"	{ LOCK TABLE a8; SELECT pg_sleep(5); }

Or we could simplify the locking structure enough so that no other
transactions are released by the deadlock failure. I do not know
exactly what you had in mind to be testing here?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Tom Lane (#8)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

On Thu, Feb 11, 2016 at 11:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

We're not out of the woods on this :-( ... jaguarundi, which is the first
of the CLOBBER_CACHE_ALWAYS animals to run these tests, didn't like them
at all. I think I fixed the deadlock-soft-2 failure, but its take on
deadlock-hard is:
*** 17,25 ****
step s6a7: LOCK TABLE a7; <waiting ...>
step s7a8: LOCK TABLE a8; <waiting ...>
step s8a1: LOCK TABLE a1; <waiting ...>
- step s8a1: <... completed>
step s7a8: <... completed>
! error in steps s8a1 s7a8: ERROR:  deadlock detected
step s8c: COMMIT;
step s7c: COMMIT;
step s6a7: <... completed>
--- 17,25 ----
step s6a7: LOCK TABLE a7; <waiting ...>
step s7a8: LOCK TABLE a8; <waiting ...>
step s8a1: LOCK TABLE a1; <waiting ...>
step s7a8: <... completed>
! step s8a1: <... completed>
! ERROR:  deadlock detected
step s8c: COMMIT;
step s7c: COMMIT;
step s6a7: <... completed>
The problem here is that when the deadlock detector kills s8's
transaction, s7a8 is also left free to proceed, so there is a race
condition as to which query completion will get back to
isolationtester first.

One grotty way to handle that would be something like
-step "s7a8"    { LOCK TABLE a8; }
+step "s7a8"    { LOCK TABLE a8; SELECT pg_sleep(5); }
Or we could simplify the locking structure enough so that no other
transactions are released by the deadlock failure. I do not know
exactly what you had in mind to be testing here?

Basically just that the deadlock actually got detected. That may
sound a bit weak, but considering we had no test for it at all before
this...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Robert Haas (#9)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Feb 11, 2016 at 11:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The problem here is that when the deadlock detector kills s8's
transaction, s7a8 is also left free to proceed, so there is a race
condition as to which query completion will get back to
isolationtester first.

One grotty way to handle that would be something like
-step "s7a8"    { LOCK TABLE a8; }
+step "s7a8"    { LOCK TABLE a8; SELECT pg_sleep(5); }
Or we could simplify the locking structure enough so that no other
transactions are released by the deadlock failure. I do not know
exactly what you had in mind to be testing here?

Basically just that the deadlock actually got detected. That may
sound a bit weak, but considering we had no test for it at all before
this...

I tried fixing it as shown above, and was dismayed to find out that
it didn't work, ie, there was still a difference between the regular
output and the results with CLOBBER_CACHE_ALWAYS. In the latter case
the printout makes it appear that s7a8 completed before s8a1, which
is nonsensical.

Investigation showed that there are a couple of reasons. One,
isolationtester's is-it-waiting query takes an insane amount of
time under CLOBBER_CACHE_ALWAYS --- over half a second on my
reasonably new server. Probing the state of half a dozen blocked
sessions thus takes a while. Second, once s8 has been booted out
of its transaction, s7 is no longer "blocked" according to
isolationtester's definition (it's doing the pg_sleep query
instead). Therefore, when we're rechecking all the other blocked
steps after detecting that s8 has become blocked, two things
happen: enough time elapses for the deadlock detector to fire,
and then when we get around to checking s7, we don't see it as
blocked and therefore wait until it finishes. So s7a8 is reported
first despite the pg_sleep, and would be no matter large a pg_sleep
delay is used.

We could possibly fix this by using a deadlock timeout even higher than
5 seconds, but that way madness lies.

Instead, what I propose we do about this is to change isolationtester
so that once it's decided that a given step is blocked, it no longer
issues the is-it-waiting query for that step; it just assumes that the
step should be treated as blocked. So all we need do for "backlogged"
steps is check PQisBusy/PQconsumeInput. That both greatly reduces the
number of is-it-waiting queries that are needed and avoids any flappy
behavior of the answer.

Comments?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Tom Lane (#10)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

I wrote:

Instead, what I propose we do about this is to change isolationtester
so that once it's decided that a given step is blocked, it no longer
issues the is-it-waiting query for that step; it just assumes that the
step should be treated as blocked. So all we need do for "backlogged"
steps is check PQisBusy/PQconsumeInput. That both greatly reduces the
number of is-it-waiting queries that are needed and avoids any flappy
behavior of the answer.

Hmm, that seemed to work fine here, but the buildfarm is not very happy
with it, and on reflection I guess it's just moving the indeterminacy
somewhere else. If we check for completion of a given step, and don't
wait till it's either completed or known blocked, then we have a race
condition that can change the order in which completions are reported.

The fundamental problem I fear is that isolationtester is designed around
the assumption that only its own actions (query issuances) change the
state of the database. Trying to use it to test deadlock detection is
problematic because deadlock-breaking changes the DB state asynchronously.

I think what we have to do is revert that change and dumb down
deadlock-hard until it produces stable results even on the CLOBBER
critters. One thing that'd help is reducing the number of processes
involved --- AFAICS, testing an 8-way deadlock is not really any more
interesting than testing, say, 4-way, and that would halve the amount
of time isolationtester spends figuring out the state.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Tom Lane (#11)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

On Fri, Feb 12, 2016 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Instead, what I propose we do about this is to change isolationtester
so that once it's decided that a given step is blocked, it no longer
issues the is-it-waiting query for that step; it just assumes that the
step should be treated as blocked. So all we need do for "backlogged"
steps is check PQisBusy/PQconsumeInput. That both greatly reduces the
number of is-it-waiting queries that are needed and avoids any flappy
behavior of the answer.

Hmm, that seemed to work fine here, but the buildfarm is not very happy
with it, and on reflection I guess it's just moving the indeterminacy
somewhere else. If we check for completion of a given step, and don't
wait till it's either completed or known blocked, then we have a race
condition that can change the order in which completions are reported.

The fundamental problem I fear is that isolationtester is designed around
the assumption that only its own actions (query issuances) change the
state of the database. Trying to use it to test deadlock detection is
problematic because deadlock-breaking changes the DB state asynchronously.

I think what we have to do is revert that change and dumb down
deadlock-hard until it produces stable results even on the CLOBBER
critters. One thing that'd help is reducing the number of processes
involved --- AFAICS, testing an 8-way deadlock is not really any more
interesting than testing, say, 4-way, and that would halve the amount
of time isolationtester spends figuring out the state.

Maybe we should introduce a way to declare whether a step is expected
to wait or not. I thought about doing that, and the only reason I
didn't is because I couldn't figure out a reasonable syntax. But, in
many respects, that would actually be better than the current system
of having isolationtester try to figure it out itself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Andres Freund

andres@anarazel.de

about 10 years ago

In reply to: Tom Lane (#10)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

On 2016-02-12 13:16:54 -0500, Tom Lane wrote:

Investigation showed that there are a couple of reasons. One,
isolationtester's is-it-waiting query takes an insane amount of
time under CLOBBER_CACHE_ALWAYS --- over half a second on my
reasonably new server.

I wonder if we shouldn't just expose a 'which pid is process X waiting
for' API, implemented serverside. That's generally really useful, and
looks like it's actually going to be less complicated than that
query... And it's surely going to be faster.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Andres Freund (#13)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

On Fri, Feb 12, 2016 at 6:22 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-02-12 13:16:54 -0500, Tom Lane wrote:

Investigation showed that there are a couple of reasons. One,
isolationtester's is-it-waiting query takes an insane amount of
time under CLOBBER_CACHE_ALWAYS --- over half a second on my
reasonably new server.

I wonder if we shouldn't just expose a 'which pid is process X waiting
for' API, implemented serverside. That's generally really useful, and
looks like it's actually going to be less complicated than that
query... And it's surely going to be faster.

If PID 12000 and PID 13000 hold AccessShareLock on relation foo, and
PID 14000 awaits AccessExclusiveLock on that relation, what does the
function return when 14000 is passed as an argument?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Robert Haas (#14)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Feb 12, 2016 at 6:22 PM, Andres Freund <andres@anarazel.de> wrote:

I wonder if we shouldn't just expose a 'which pid is process X waiting
for' API, implemented serverside. That's generally really useful, and
looks like it's actually going to be less complicated than that
query... And it's surely going to be faster.

If PID 12000 and PID 13000 hold AccessShareLock on relation foo, and
PID 14000 awaits AccessExclusiveLock on that relation, what does the
function return when 14000 is passed as an argument?

Yeah. In general, it's not that easy to say that A is waiting
specifically on B --- there may be multiple blockers and/or multiple ways
it could get released. isolationtester's query is not really correct
IMO. It's okay as long as nothing else besides autovacuum is taking
locks, but I wouldn't want to try to make it work in a general purpose
environment.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Robert Haas (#12)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Feb 12, 2016 at 4:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The fundamental problem I fear is that isolationtester is designed around
the assumption that only its own actions (query issuances) change the
state of the database. Trying to use it to test deadlock detection is
problematic because deadlock-breaking changes the DB state asynchronously.

Maybe we should introduce a way to declare whether a step is expected
to wait or not. I thought about doing that, and the only reason I
didn't is because I couldn't figure out a reasonable syntax. But, in
many respects, that would actually be better than the current system
of having isolationtester try to figure it out itself.

Meh. I'm not sure that would actually help anything. The problem is not
so much with figuring out whether a step blocks, as knowing when it's
expected to complete. And for sure I don't want to annotate the spec
files to the point of saying "this action should cause these other steps
to complete".

Actually ... the thing we've been fighting over the past couple days is
having to make step completion orders deterministic when in principle
they aren't and don't need to be. Maybe the solution is something like
the core tests' variant expected files, ugly though those are.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Bruce Momjian

bruce@momjian.us

about 10 years ago

In reply to: Tom Lane (#16)

Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

The tests worked fine on faster build animals, right? And the clobber
animals are much much slower.... So it seems perfectly sensible that their
deadlock timeout would just have to be much much higher to have the same
behaviour. I see nothing wrong in just setting deadlock timeout to a minute
or more on these machines.

The invariant is just that the deadlock timeout needs enough head room over
the actual time the tester's queries take. If they normally take a 1/10th
of a second then why not just set the timeout to 10x however long they take
on the clobber animals?

--
Greg

#18

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Bruce Momjian (#17)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

Greg Stark <stark@mit.edu> writes:

The tests worked fine on faster build animals, right? And the clobber
animals are much much slower.... So it seems perfectly sensible that their
deadlock timeout would just have to be much much higher to have the same
behaviour. I see nothing wrong in just setting deadlock timeout to a minute
or more on these machines.

We don't have a way to make the isolation tests change behavior depending
on how the backend is compiled. So the only actually available fix is to
make that test take "a minute or more" for *everybody*.

Aside from that, it's just disturbing that these tests aren't
deterministic regardless of machine speed. We don't seem to have a
way around that right now, but I wish we did.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Andres Freund (#13)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

Andres Freund <andres@anarazel.de> writes:

I wonder if we shouldn't just expose a 'which pid is process X waiting
for' API, implemented serverside. That's generally really useful, and
looks like it's actually going to be less complicated than that
query... And it's surely going to be faster.

Attached is a draft patch for a new function that reports the set of PIDs
directly blocking a given PID (that is, holding or awaiting conflicting
locks on the lockable object it's waiting for).

I replaced isolationtester's pg_locks query with this, and found that
it's about 9X faster in a normal build, and 3X faster with
CLOBBER_CACHE_ALWAYS turned on. That would give us some nice headroom
for the isolation tests with CLOBBER_CACHE_ALWAYS animals. (Note that
in view of
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2016-02-14%2007%3A38%3A37
we still need to do *something* about the speed of the new deadlock-hard
test; this patch could avoid the need to dumb down or slow down that test
even further.)

Not to be neglected also is that (I believe) this gives the right answer,
whereas isolationtester's existing query is currently completely broken by
parallel queries, and it doesn't understand non-conflicting lock modes
either. (It did, at least partially, before commit 38f8bdcac4982215;
I am not sure that taking out the mode checks was a good idea. But
putting them back would make the query slower yet.)

This is WIP, in part because I've written no user docs for the new
function, and in part because I think people might want to bikeshed the
API. What is here is:

"pg_blocker_pids(integer) returns integer[]" returns a possibly-empty
array of the PIDs of backend processes that block the backend with
specified PID. You get an empty array, not an error, if the argument
isn't a valid backend PID or that backend isn't waiting. In parallel
query situations, the output includes PIDs that are blocking any PID
in the given process's lock group, and what is reported is always
the PID of the lock group leader for whichever process is kdoing the
blocking. Also, in parallel query situations, the same PID might appear
multiple times in the output; it didn't seem worth trying to eliminate
duplicates.

Reasonable API change requests might include returning a rowset rather
than an array and returning more data per lock conflict. I've not
bothered with either thing here because isolationtester doesn't care
and it would make the query somewhat slower for isolationtester's
usage (though, probably, not enough slower to really matter).

It should also be noted that I've not really tested the parallel
query aspects of this, because I'm not sure how to create a conflicting
lock request in a parallel worker. However, if I'm not still
misunderstanding the new semantics of the lock data structures, that
aspect of it seems unlikely to be wrong.

Comments?

regards, tom lane

#20

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Tom Lane (#19)

Re: Re: [COMMITTERS] pgsql: Add some isolation tests for deadlock detection and resolution.

On Tue, Feb 16, 2016 at 2:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

I wonder if we shouldn't just expose a 'which pid is process X waiting
for' API, implemented serverside. That's generally really useful, and
looks like it's actually going to be less complicated than that
query... And it's surely going to be faster.

Attached is a draft patch for a new function that reports the set of PIDs
directly blocking a given PID (that is, holding or awaiting conflicting
locks on the lockable object it's waiting for).

I replaced isolationtester's pg_locks query with this, and found that
it's about 9X faster in a normal build, and 3X faster with
CLOBBER_CACHE_ALWAYS turned on. That would give us some nice headroom
for the isolation tests with CLOBBER_CACHE_ALWAYS animals. (Note that
in view of
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2016-02-14%2007%3A38%3A37
we still need to do *something* about the speed of the new deadlock-hard
test; this patch could avoid the need to dumb down or slow down that test
even further.)

Not to be neglected also is that (I believe) this gives the right answer,
whereas isolationtester's existing query is currently completely broken by
parallel queries, and it doesn't understand non-conflicting lock modes
either. (It did, at least partially, before commit 38f8bdcac4982215;
I am not sure that taking out the mode checks was a good idea. But
putting them back would make the query slower yet.)

The reason I took that out is because it breaks the deadlock-soft
test. It's possible to have a situation where no granted lock
conflicts with an awaited lock. If that happens, the old query
wrongly concluded that the waiting process was not in fact waiting.
(Consider A hold AccessShareLock, B awaits AccessExclusiveLock, C now
requests AccessShareLock and *waits*.)

As for the patch itself, I'm having trouble grokking what it's trying
to do. I think it might be worth having a comment defining precisely
what we mean by "A blocks B". I would define "A blocks B" in general
as either A holds a lock which conflicts with one sought by B
(hard-blocked) or A awaits a lock which conflicts with one sought by B
and precedes it in the wait queue (soft-blocked). I have wondered
before if we shouldn't modify pg_locks to expose the wait-queue
ordering; without that, you can't reliably determine in general
whether A soft-blocks B, which means every view anyone has ever
written over pg_locks that purports to say who blocks who is
necessarily buggy.

For parallel queries, there's a further relevant distinction when we
say "A blocks B". We might mean either that (1) process B cannot
resume execution until the lock conflict is resolved or (2) the group
leader for process B cannot complete the current parallel operation
until the lock conflict is resolved. If you're trying to figure out
why one particular member of a parallel group is stuck, you want to
answer question #1. If you're trying to figure out what all the
things that need to get out of the way to finish the query, you want
to answer question #2. I think this function is aiming to answer
question #2, not question #1, but I'm less clear on the reason behind
that definitional choice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Robert Haas (#20)

#22

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Tom Lane (#21)

#23

Tom Lane