FOR KEY LOCK foreign keys

robertmhaas@gmail.com

almost 15 years ago

In reply to: David E. Wheeler (#2)

Re: FOR KEY LOCK foreign keys

On Fri, Jan 14, 2011 at 1:00 PM, David E. Wheeler <david@kineticode.com> wrote:

On Jan 13, 2011, at 1:58 PM, Alvaro Herrera wrote:

Something else that might be of interest: the patch as presented here
does NOT solve the deadlock problem originally presented by Joel
Jacobson. It does solve the second, simpler example I presented in my
blog article referenced above, however. I need to have a closer look at
that problem to figure out if we could fix the deadlock too.

Sounds like a big win already. Should this be considered a WIP patch, though, if you still plan to look at Joel's deadlock example?

Alvaro, are you planning to add this to the CF?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

alvherre@commandprompt.com

almost 15 years ago

In reply to: David E. Wheeler (#2)

Re: FOR KEY LOCK foreign keys

Excerpts from David E. Wheeler's message of vie ene 14 15:00:48 -0300 2011:

On Jan 13, 2011, at 1:58 PM, Alvaro Herrera wrote:

Something else that might be of interest: the patch as presented here
does NOT solve the deadlock problem originally presented by Joel
Jacobson. It does solve the second, simpler example I presented in my
blog article referenced above, however. I need to have a closer look at
that problem to figure out if we could fix the deadlock too.

Sounds like a big win already. Should this be considered a WIP patch, though, if you still plan to look at Joel's deadlock example?

Not necessarily -- we can implement that as a later refinement/improvement.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

alvherre@commandprompt.com

almost 15 years ago

In reply to: Robert Haas (#3)

Re: FOR KEY LOCK foreign keys

Excerpts from Robert Haas's message of vie ene 14 15:08:27 -0300 2011:

On Fri, Jan 14, 2011 at 1:00 PM, David E. Wheeler <david@kineticode.com> wrote:

On Jan 13, 2011, at 1:58 PM, Alvaro Herrera wrote:

Something else that might be of interest: the patch as presented here
does NOT solve the deadlock problem originally presented by Joel
Jacobson. It does solve the second, simpler example I presented in my
blog article referenced above, however. I need to have a closer look at
that problem to figure out if we could fix the deadlock too.

Sounds like a big win already. Should this be considered a WIP patch, though, if you still plan to look at Joel's deadlock example?

Alvaro, are you planning to add this to the CF?

Eh, yes.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Dimitri Fontaine

dimitri@2ndQuadrant.fr

almost 15 years ago

In reply to: Alvaro Herrera (#1)

Re: FOR KEY LOCK foreign keys

Hi,

This is a first level of review for the patch. I finally didn't get as
much time as I hoped I would, so couldn't get familiar with the locking
internals and machinery… as a result, I can't much comment on the code.

The patch applies cleanly (patch moves one hunk all by itself) and
compiles with no warning. It includes no docs, and I think it will be
required to document the user visible SELECT … FOR KEY LOCK OF x new
feature.

Code wise, very few comments here. It looks like the new code had been
there from the beginning by the reading of the patch. I only have one
question about a variable naming:

! COPY_SCALAR_FIELD(forUpdate);
! COPY_SCALAR_FIELD(strength);

forUpdate used to be a boolean, strength is now one of LCS_FORUPDATE,
LCS_FORSHARE or LCS_FORKEYLOCK. I wonder if that's a fortunate naming
here, but IANANS (I Am Not A Native Speaker).

Alvaro Herrera <alvherre@commandprompt.com> writes:

As previously commented, here's a proposal with patch to turn foreign
key checks into something less intrusive.

The basic idea, as proposed by Simon Riggs, was discussed in a previous
pgsql-hackers thread here:
http://archives.postgresql.org/message-id/AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com

This link here provides a test case that will issue a deadlock, and

Something else that might be of interest: the patch as presented here
does NOT solve the deadlock problem originally presented by Joel

Indeed, that's the first thing I tried… I'm not sure about why fixing
the deadlock issue wouldn't be in this patch scope?

The thing that I'm able to confirm by running this test case is that the
RI trigger check is done with the new code from the patch:

CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."a" x WHERE "aid" OPERATOR(pg_catalog.=) $1 FOR KEY LOCK OF x"

Sorry for not posting more tests yet, but seeing how late I am to find
the time for the first level review I figured I might as well send that
already. I will try some other test cases, but sure enough, that should
be part of the user level documentation…

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

robertmhaas@gmail.com

almost 15 years ago

In reply to: Dimitri Fontaine (#6)

Re: FOR KEY LOCK foreign keys

On Sat, Jan 22, 2011 at 4:25 PM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:

Hi,

This is a first level of review for the patch. I finally didn't get as
much time as I hoped I would, so couldn't get familiar with the locking
internals and machinery… as a result, I can't much comment on the code.

The patch applies cleanly (patch moves one hunk all by itself) and
compiles with no warning. It includes no docs, and I think it will be
required to document the user visible SELECT … FOR KEY LOCK OF x new
feature.

I feel like this should be called "KEY SHARE" rather than "KEY LOCK".
It's essentially a weaker version of the SHARE lock we have now, but
that's not clear from the name.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Marti Raudsepp

marti@juffo.org

almost 15 years ago

In reply to: Alvaro Herrera (#1)

Re: FOR KEY LOCK foreign keys

On Thu, Jan 13, 2011 at 23:58, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

It goes like this: instead of acquiring a shared lock on the involved
tuple, we only acquire a "key lock", that is, something that prevents
the tuple from going away entirely but not from updating fields that are
not covered by any unique index.

As discussed, this is still more restrictive than necessary (we could
lock only those columns that are involved in the foreign key being
checked), but that has all sorts of implementation level problems, so we
settled for this, which is still much better than the current state of
affairs.

Seems to me that you can go a bit further without much trouble, if you
only consider indexes that *can* be referenced by foreign keys --
indexes that don't have expressions or predicates.

I frequently create unique indexes on (lower(name)) where I want
case-insensitive unique indexes, or use predicates like WHERE
deleted=false to allow duplicates after deleting the old item.

So, instead of:
if (indexInfo->ii_Unique)
you can write:
if (indexInfo->ii_Unique
&& indexInfo->ii_Expressions == NIL
&& indexInfo->ii_Predicate == NIL)

This would slightly simplify RelationGetIndexAttrBitmap() because you
no longer have to worry about including columns that are part of index
expressions/predicates.

I guess rd_uindexattr should be renamed to something like
rd_keyindexattr or rd_keyattr.

Is this worthwhile? I can write and submit a patch if it sounds good.

Regards,
Marti

[1]: http://archives.postgresql.org/message-id/22196.1282757644@sss.pgh.pa.us

noah@leadboat.com

almost 15 years ago

In reply to: Alvaro Herrera (#1)

1 attachment(s)

Re: FOR KEY LOCK foreign keys

Hi Alvaro,

On Thu, Jan 13, 2011 at 06:58:09PM -0300, Alvaro Herrera wrote:

As previously commented, here's a proposal with patch to turn foreign
key checks into something less intrusive.

The basic idea, as proposed by Simon Riggs, was discussed in a previous
pgsql-hackers thread here:
http://archives.postgresql.org/message-id/AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com

It goes like this: instead of acquiring a shared lock on the involved
tuple, we only acquire a "key lock", that is, something that prevents
the tuple from going away entirely but not from updating fields that are
not covered by any unique index.

First off, this is highly-valuable work. My experience echoes that of some
other commenter (I *think* it was Josh Berkus, but I can't find the original
reference now): this is the #1 cause of production deadlocks. To boot, the
patch is small and fits cleanly into the current code.

The patch had a trivial conflict in planner.c, plus plenty of offsets. I've
attached the rebased patch that I used for review. For anyone following along,
all the interesting hunks touch heapam.c; the rest is largely mechanical. A
"diff -w" patch is also considerably easier to follow.

Incidentally, HeapTupleSatisfiesMVCC has some bits of code like this (not new):

/* MultiXacts are currently only allowed to lock tuples */
Assert(tuple->t_infomask & HEAP_IS_LOCKED);

They're specifically only allowed for SHARE and KEY locks, right?
heap_lock_tuple seems to assume as much.

Having read [1]http://archives.postgresql.org/message-id/22196.1282757644@sss.pgh.pa.us, I tried to work out what kind of table-level lock we must hold
before proceeding with a DDL operation that changes the set of "key" columns.
The thing we must prevent is an UPDATE making a concurrent decision about its
need to conflict with a FOR KEY LOCK lock. Therefore, it's sufficient for the
DDL to take ShareLock. CREATE INDEX does just this, so we're good.

I observe visibility breakage with this test case:

-- Setup
BEGIN;
DROP TABLE IF EXISTS child, parent;
CREATE TABLE parent (
parent_key int PRIMARY KEY,
aux text NOT NULL
);
CREATE TABLE child (
child_key int PRIMARY KEY,
parent_key int NOT NULL REFERENCES parent
);
INSERT INTO parent VALUES (1, 'foo');
COMMIT;
TABLE parent; -- set hint bit
SELECT to_hex(t_infomask::int), * FROM heap_page_items(get_raw_page('parent', 0));
to_hex | lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
--------+----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
902 | 1 | 8160 | 1 | 32 | 1125 | 0 | 33 | (0,1) | 2 | 2306 | 24 | NULL | NULL

-- Interleaved part
P0:
BEGIN;
INSERT INTO child VALUES (1, 1);
P1:
BEGIN;
SELECT to_hex(t_infomask::int), * FROM heap_page_items(get_raw_page('parent', 0));
to_hex | lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
--------+----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
112 | 1 | 8160 | 1 | 32 | 1125 | 1126 | 33 | (0,1) | 2 | 274 | 24 | NULL | NULL
UPDATE parent SET aux = 'baz'; -- UPDATE 1
TABLE parent; -- 0 rows
SELECT to_hex(t_infomask::int), * FROM heap_page_items(get_raw_page('parent', 0));
to_hex | lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
--------+----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
102 | 1 | 8160 | 1 | 32 | 1125 | 1128 | 0 | (0,2) | 16386 | 258 | 24 | NULL | NULL
2012 | 2 | 8128 | 1 | 32 | 1128 | 1126 | 2249 | (0,2) | -32766 | 8210 | 24 | NULL | NULL

The problem seems to be that funny t_cid (2249). Tracing through heap_update,
the new code is not setting t_cid during this test case.

My own deadlock test case, which is fixed by the patch, uses the same setup.
Its interleaved part is as follows:

P0: INSERT INTO child VALUES (1, 1);
P1: INSERT INTO child VALUES (2, 1);
P0: UPDATE parent SET aux = 'bar';
P1: UPDATE parent SET aux = 'baz';

As discussed, this is still more restrictive than necessary (we could
lock only those columns that are involved in the foreign key being
checked), but that has all sorts of implementation level problems, so we
settled for this, which is still much better than the current state of
affairs.

Agreed. What about locking only the columns that are actually used in any
incoming foreign key (not just the FK in question at the time)? We'd just have
more work to do on a cold relcache, a pg_depend scan per unique index.

Usually, each of my tables has no more than one candidate key referenced by
FOREIGN KEY constraints: the explicit or notional primary key. I regularly add
UNIQUE indexes not used by any foreign key, though. YMMV. Given this
optimization, constraining the lock even further by individual FOREIGN KEY
constraint would be utterly unimportant for my databases.

I published about this here:
http://commandprompt.com/blogs/alvaro_herrera/2010/11/fixing_foreign_key_deadlocks_part_2/

So, as a rough design,

1. Create a new SELECT locking clause. For now, we're calling it SELECT FOR KEY LOCK
2. This will acquire a new type of lock in the tuple, dubbed a "keylock".
3. This lock will conflict with DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE.

It does not conflict with SELECT FOR SHARE, does it?

4. It also conflicts with UPDATE if the UPDATE modifies an attribute
indexed by a unique index.

This is the per-tuple lock conflict table before your change:

FOR SHARE conflicts with FOR UPDATE
FOR UPDATE conflicts with FOR UPDATE and FOR SHARE

After:

FOR KEY LOCK conflicts with FOR UPDATE
FOR SHARE conflicts with FOR UPDATE
FOR UPDATE conflicts with FOR UPDATE, FOR SHARE, (FOR KEY LOCK if cols <@ keycols)

The odd thing here is the checking of an outside condition to decide whether
locks conflict. Normally, to get a different conflict list, we add another lock
type. What about this?

FOR KEY SHARE conflicts with FOR KEY UPDATE
FOR SHARE conflicts with FOR KEY UPDATE, FOR UPDATE
FOR UPDATE conflicts with FOR KEY UPDATE, FOR UPDATE, FOR SHARE
FOR KEY UPDATE conflicts with FOR KEY UPDATE, FOR UPDATE, FOR SHARE, FOR KEY SHARE

This would also fix Joel's test case. A disadvantage is that we'd check for
changes in FK-referenced columns-change even when there's no key lock activity.
That seems acceptable, but it's a point for debate.

Either way, SELECT ... FOR UPDATE will probably end up different than a true
update. The full behavior relies on having an old tuple to bear the UPDATE lock
and a new tuple to bear the KEY lock. In the current patch, SELECT ... FOR
UPDATE blocks on KEY just like SHARE. So there will be that wart in the
conflict lists, no matter what.

Here's a patch for this, on which I need to do some more testing and
update docs.

Some patch details:

1. We use a new bit in t_infomask for HEAP_XMAX_KEY_LOCK, 0x0010.
2. Key-locking a tuple means setting the XMAX_KEY_LOCK bit, and setting the
Xmax to the locker (just like the other lock marks). If the tuple is
already key-locked, a MultiXactId needs to be created from the
original locker(s) and the new transaction.

Makes sense.

3. The original tuple needs to be marked with the Cmax of the locking
command, to prevent it from being seen in the same transaction.

Could you elaborate on this requirement?

4. A non-conflicting update to the tuple must carry forward some fields
from the original tuple into the updated copy. Those include Xmax,
XMAX_IS_MULTI, XMAX_KEY_LOCK, and the CommandId and COMBO_CID flag.

HeapTupleHeaderGetCmax() has this assertion:

/* We do not store cmax when locking a tuple */
Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED)));

Assuming that assertion is still valid, there will never be a HEAP_COMBOCID flag
to copy. Right?

5. We check for the is-indexed condition early in heap_update. This
check is independent of the HOT check, which occurs later in the
routine.
6. The relcache entry now keeps two lists of indexed attributes; the new
one only covers unique indexes. Both lists are built in a single
pass over the index list and saved in the relcache entry, so a
heap_update call only does this once. The main difference between
the two checks is that the one for HOT is done after the tuple has
been toasted. This cannot be done for this check, because the
toaster runs too late. This means some work is duplicated. We
could optimize this further.

Seems reasonable.

Something else that might be of interest: the patch as presented here
does NOT solve the deadlock problem originally presented by Joel
Jacobson. It does solve the second, simpler example I presented in my
blog article referenced above, however. I need to have a closer look at
that problem to figure out if we could fix the deadlock too.

One thing that helped me to think through Joel's test case is that the two
middle statements take tuple-level locks, but that's inessential. Granted, FOR
UPDATE tuple locks are by far the most common kind of blocking in production.
Here's another formulation that also still gets a deadlock:

P1: BEGIN;
P2: BEGIN;
P1: UPDATE A SET Col1 = 1 WHERE AID = 1; -- FOR UPDATE tuple lock
P2: LOCK TABLE pg_am IN ROW SHARE MODE
P1: LOCK TABLE pg_am IN ROW SHARE MODE -- blocks
P2: UPDATE B SET Col2 = 1 WHERE BID = 2; -- blocks for KEY => deadlock

As best I can tell, the explanation is that this patch only improves things when
the FOR KEY LOCK precedes the FOR UPDATE. Splitting out FOR KEY UPDATE fixes
that. It would also optimize this complement to your own blog post example,
which still blocks needlessly:

-- Session 1
CREATE TABLE foo (a int PRIMARY KEY, b text);
CREATE TABLE bar (a int NOT NULL REFERENCES foo);
INSERT INTO foo VALUES (42);

BEGIN;
UPDATE foo SET b = 'Hello World' ;

-- Session 2
INSERT INTO bar VALUES (42);

Automated tests would go a long way toward building confidence that this patch
does the right thing. Thanks to the SSI patch, we now have an in-tree test
framework for testing interleaved transactions. The only thing it needs to be
suitable for this work is a way to handle blocked commands. If you like, I can
try to whip something up for that.

Hunk-specific comments (based on diff -w version of patch):

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c

***************
*** 2484,2489 **** l2:
--- 2487,2508 ----
xwait = HeapTupleHeaderGetXmax(oldtup.t_data);
infomask = oldtup.t_data->t_infomask;

+ 		/*
+ 		 * if it's only key-locked and we're not updating an indexed column,
+ 		 * we can act though MayBeUpdated was returned, but the resulting tuple
+ 		 * needs a bunch of fields copied from the original.
+ 		 */
+ 		if ((infomask & HEAP_XMAX_KEY_LOCK) &&
+ 			!(infomask & HEAP_XMAX_SHARED_LOCK) &&
+ 			HeapSatisfiesHOTUpdate(relation, keylck_attrs,
+ 								   &oldtup, newtup))
+ 		{
+ 			result = HeapTupleMayBeUpdated;
+ 			keylocked_update = true;
+ 		}

The condition for getting here is "result == HeapTupleBeingUpdated && wait". If
!wait, we'd never get the chance to see if this would avoid the wait. Currently
all callers pass wait = true, so this is academic.

+ 
+ 		if (!keylocked_update)
+ 		{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/*
***************
*** 2563,2568 **** l2:
--- 2582,2588 ----
else
result = HeapTupleUpdated;
}
+ 	}
if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
{
***************
*** 2609,2621 **** l2:

newtup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
newtup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
! newtup->t_data->t_infomask |= (HEAP_XMAX_INVALID | HEAP_UPDATED);
HeapTupleHeaderSetXmin(newtup->t_data, xid);
- HeapTupleHeaderSetCmin(newtup->t_data, cid);
- HeapTupleHeaderSetXmax(newtup->t_data, 0); /* for cleanliness */
newtup->t_tableOid = RelationGetRelid(relation);
/*
* Replace cid with a combo cid if necessary.  Note that we already put
* the plain cid into the new tuple.
*/
--- 2629,2671 ----
newtup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
newtup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
! newtup->t_data->t_infomask |= HEAP_UPDATED;
HeapTupleHeaderSetXmin(newtup->t_data, xid);
newtup->t_tableOid = RelationGetRelid(relation);
/*
+ 	 * If this update is touching a tuple that was key-locked, we need to
+ 	 * carry forward some bits from the old tuple into the new copy.
+ 	 */
+ 	if (keylocked_update)
+ 	{
+ 		HeapTupleHeaderSetXmax(newtup->t_data,
+ 							   HeapTupleHeaderGetXmax(oldtup.t_data));
+ 		newtup->t_data->t_infomask |= (oldtup.t_data->t_infomask & 
+ 									   (HEAP_XMAX_IS_MULTI |
+ 										HEAP_XMAX_KEY_LOCK));
+ 		/*
+ 		 * we also need to copy the combo CID stuff, but only if the original
+ 		 * tuple was created by us; otherwise the combocid module complains
+ 		 * (Alternatively we could use HeapTupleHeaderGetRawCommandId)
+ 		 */

This comment should describe why it's correct, not just indicate that another
module complains if we do otherwise.

+ 		if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(oldtup.t_data)))
+ 		{
+ 			newtup->t_data->t_infomask |= (oldtup.t_data->t_infomask & 
+ 										   HEAP_COMBOCID);

HeapTupleHeaderSetCmin unsets HEAP_COMBOCID, so this is a no-op.

+ 			HeapTupleHeaderSetCmin(newtup->t_data,
+ 								   HeapTupleHeaderGetCmin(oldtup.t_data));
+ 		}
+ 
+ 	}
+ 	else
+ 	{
+ 		newtup->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ 		HeapTupleHeaderSetXmax(newtup->t_data, 0);	/* for cleanliness */
+ 		HeapTupleHeaderSetCmin(newtup->t_data, cid);
+ 	}

As mentioned above, this code can fail to set Cmin entirely.

+
+ /*
* Replace cid with a combo cid if necessary. Note that we already put
* the plain cid into the new tuple.
*/
***************
*** 3142,3148 **** heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
LOCKMODE tuple_lock_type;
bool have_tuple_lock = false;

! tuple_lock_type = (mode == LockTupleShared) ? ShareLock : ExclusiveLock;
*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
--- 3192,3211 ----
LOCKMODE	tuple_lock_type;
bool		have_tuple_lock = false;
! /* in FOR KEY LOCK mode, we use a share lock temporarily */

I found this comment confusing. The first several times I read it, I thought it
meant that we start out by setting HEAP_XMAX_SHARED_LOCK in the tuple, then
downgrade it. However, this is talking about the ephemeral heavyweight lock.
Maybe it's just me, but consider deleting this comment.

! switch (mode)
! {
! case LockTupleShared:
! case LockTupleKeylock:
! tuple_lock_type = ShareLock;
! break;
! case LockTupleExclusive:
! tuple_lock_type = ExclusiveLock;
! break;
! default:
! elog(ERROR, "invalid tuple lock mode");
! tuple_lock_type = 0; /* keep compiler quiet */
! }

*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
***************
*** 3175,3192 **** l3:
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
/*
! 		 * If we wish to acquire share lock, and the tuple is already
! 		 * share-locked by a multixact that includes any subtransaction of the
* current top transaction, then we effectively hold the desired lock
* already.  We *must* succeed without trying to take the tuple lock,
* else we will deadlock against anyone waiting to acquire exclusive
* lock.  We don't need to make any state changes in this case.
*/
! 		if (mode == LockTupleShared &&
(infomask & HEAP_XMAX_IS_MULTI) &&
MultiXactIdIsCurrent((MultiXactId) xwait))
{
! 			Assert(infomask & HEAP_XMAX_SHARED_LOCK);
/* Probably can't hold tuple lock here, but may as well check */
if (have_tuple_lock)
UnlockTuple(relation, tid, tuple_lock_type);
--- 3238,3255 ----
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
/*
! * If we wish to acquire a key or share lock, and the tuple is already
! * share- or key-locked by a multixact that includes any subtransaction of the
* current top transaction, then we effectively hold the desired lock
* already. We *must* succeed without trying to take the tuple lock,
* else we will deadlock against anyone waiting to acquire exclusive
* lock. We don't need to make any state changes in this case.
*/
! if ((mode == LockTupleShared || mode == LockTupleKeylock) &&
(infomask & HEAP_XMAX_IS_MULTI) &&
MultiXactIdIsCurrent((MultiXactId) xwait))
{
! Assert(infomask & HEAP_IS_SHARE_LOCKED);
/* Probably can't hold tuple lock here, but may as well check */
if (have_tuple_lock)
UnlockTuple(relation, tid, tuple_lock_type);

If we're upgrading from KEY LOCK to a SHARE, we can't take this shortcut. At a
minimum, we need to update t_infomask.

Then there's a choice: do we queue up normally and risk deadlock, or do we skip
the heavyweight lock queue and risk starvation? Your last blog post suggests a
preference for the latter. I haven't formed a strong preference, but given this
behavior, ...

P0: FOR SHARE -- acquired
P1: UPDATE -- blocks
P2: FOR SHARE -- blocks

... I'm not sure why making the first lock FOR KEY LOCK ought to change things.

Some documentation may be in order about the deadlock hazards of mixing FOR
SHARE locks with foreign key usage.

***************
*** 3217,3226 **** l3:
have_tuple_lock = true;
}
! 		if (mode == LockTupleShared && (infomask & HEAP_XMAX_SHARED_LOCK))
{
/*
! 			 * Acquiring sharelock when there's at least one sharelocker
* already.  We need not wait for him/them to complete.
*/
LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
--- 3280,3290 ----
have_tuple_lock = true;
}
! if ((mode == LockTupleShared || mode == LockTupleKeylock) &&
! (infomask & HEAP_IS_SHARE_LOCKED))
{
/*
! * Acquiring sharelock or keylock when there's at least one such locker
* already. We need not wait for him/them to complete.
*/
LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);

Likewise: we cannot implicitly upgrade someone else's KEY LOCK to SHARE.

***************
*** 3476,3482 **** l3:
xlrec.target.tid = tuple->t_self;
xlrec.locking_xid = xid;
xlrec.xid_is_mxact = ((new_infomask & HEAP_XMAX_IS_MULTI) != 0);
! 		xlrec.shared_lock = (mode == LockTupleShared);
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapLock;
rdata[0].buffer = InvalidBuffer;
--- 3543,3549 ----
xlrec.target.tid = tuple->t_self;
xlrec.locking_xid = xid;
xlrec.xid_is_mxact = ((new_infomask & HEAP_XMAX_IS_MULTI) != 0);
! 		xlrec.lock_strength = mode == LockTupleShared ? 's' : mode == LockTupleKeylock ? 'k' : 'x';

Seems strange having these character literals. Why not just cast the mode to a
char? Could even set the enum values to the ASCII values of those characters,
if you were so inclined. Happily, they fall in the right order.

rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapLock;
rdata[0].buffer = InvalidBuffer;

*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c

***************
*** 112,119 **** lnext:
/* okay, try to lock the tuple */
if (erm->markType == ROW_MARK_EXCLUSIVE)
lockmode = LockTupleExclusive;
! else
lockmode = LockTupleShared;

test = heap_lock_tuple(erm->relation, &tuple, &buffer,
&update_ctid, &update_xmax,
--- 112,126 ----
/* okay, try to lock the tuple */
if (erm->markType == ROW_MARK_EXCLUSIVE)
lockmode = LockTupleExclusive;
! 		else if (erm->markType == ROW_MARK_SHARE)
lockmode = LockTupleShared;
+ 		else if (erm->markType == ROW_MARK_KEYLOCK)
+ 			lockmode = LockTupleKeylock;
+ 		else
+ 		{
+ 			elog(ERROR, "unsupported rowmark type");
+ 			lockmode = LockTupleExclusive;	/* keep compiler quiet */
+ 		}

A switch statement would be more consistent with what you've done elsewhere.

*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c

***************
*** 2181,2187 **** _outRowMarkClause(StringInfo str, RowMarkClause *node)
WRITE_NODE_TYPE("ROWMARKCLAUSE");
WRITE_UINT_FIELD(rti);
! 	WRITE_BOOL_FIELD(forUpdate);
WRITE_BOOL_FIELD(noWait);
WRITE_BOOL_FIELD(pushedDown);
}
--- 2181,2187 ----
WRITE_NODE_TYPE("ROWMARKCLAUSE");
WRITE_UINT_FIELD(rti);
! WRITE_BOOL_FIELD(strength);

WRITE_ENUM_FIELD?

WRITE_BOOL_FIELD(noWait);
WRITE_BOOL_FIELD(pushedDown);
}
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
***************
*** 299,305 **** _readRowMarkClause(void)
READ_LOCALS(RowMarkClause);
READ_UINT_FIELD(rti);
! READ_BOOL_FIELD(forUpdate);
READ_BOOL_FIELD(noWait);
READ_BOOL_FIELD(pushedDown);
--- 299,305 ----
READ_LOCALS(RowMarkClause);
READ_UINT_FIELD(rti);
! READ_BOOL_FIELD(strength);

READ_ENUM_FIELD?

*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c

***************
*** 1887,1896 **** preprocess_rowmarks(PlannerInfo *root)
newrc = makeNode(PlanRowMark);
newrc->rti = newrc->prti = rc->rti;
newrc->rowmarkId = ++(root->glob->lastRowMarkId);
! if (rc->forUpdate)
newrc->markType = ROW_MARK_EXCLUSIVE;
! else
newrc->markType = ROW_MARK_SHARE;
newrc->noWait = rc->noWait;
newrc->isParent = false;
--- 1887,1904 ----
newrc = makeNode(PlanRowMark);
newrc->rti = newrc->prti = rc->rti;
newrc->rowmarkId = ++(root->glob->lastRowMarkId);
!  		switch (rc->strength)
!  		{
!  			case LCS_FORUPDATE:
newrc->markType = ROW_MARK_EXCLUSIVE;
!  				break;
!  			case LCS_FORSHARE:
newrc->markType = ROW_MARK_SHARE;
+  				break;
+  			case LCS_FORKEYLOCK:
+  				newrc->markType = ROW_MARK_KEYLOCK;
+  				break;
+  		}

This needs a "default" clause throwing an error. (Seems like the default could
be in #ifdef USE_ASSERT_CHECKING, but we don't seem to ever do that.)

*** a/src/backend/tcop/utility.c
--- b/src/backend/tcop/utility.c
***************
*** 2205,2214 **** CreateCommandTag(Node *parsetree)
else if (stmt->rowMarks != NIL)
{
/* not 100% but probably close enough */
! 							if (((RowMarkClause *) linitial(stmt->rowMarks))->forUpdate)
tag = "SELECT FOR UPDATE";
! 							else
tag = "SELECT FOR SHARE";
}
else
tag = "SELECT";
--- 2205,2225 ----
else if (stmt->rowMarks != NIL)
{
/* not 100% but probably close enough */
! 							switch (((RowMarkClause *) linitial(stmt->rowMarks))->strength)
! 							{
! 								case LCS_FORUPDATE:
tag = "SELECT FOR UPDATE";
! 									break;
! 								case LCS_FORSHARE:
tag = "SELECT FOR SHARE";
+ 									break;
+ 								case LCS_FORKEYLOCK:
+ 									tag = "SELECT FOR KEY LOCK";
+ 									break;
+ 								default:
+ 									tag =  "???";
+ 									break;

elog(ERROR) in the default clause, perhaps? See earlier comment.

*** a/src/backend/utils/adt/ruleutils.c
--- b/src/backend/utils/adt/ruleutils.c
***************
*** 2837,2848 **** get_select_query_def(Query *query, deparse_context *context,
if (rc->pushedDown)
continue;

! 			if (rc->forUpdate)
! 				appendContextKeyword(context, " FOR UPDATE",
-PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
! 			else
appendContextKeyword(context, " FOR SHARE",
-PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
appendStringInfo(buf, " OF %s",
quote_identifier(rte->eref->aliasname));
if (rc->noWait)
--- 2837,2858 ----
if (rc->pushedDown)
continue;

! 			switch (rc->strength)
! 			{
! 				case LCS_FORKEYLOCK:
! 					appendContextKeyword(context, " FOR KEY LOCK",
-PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
! 					break;
! 				case LCS_FORSHARE:
appendContextKeyword(context, " FOR SHARE",
-PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
+ 					break;
+ 				case LCS_FORUPDATE:
+ 					appendContextKeyword(context, " FOR UPDATE",
+ 										 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
+ 					break;
+ 			}

Another switch statement; see earlier comment.

*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c

***************
*** 3661,3675 **** RelationGetIndexAttrBitmap(Relation relation)
--- 3665,3688 ----
int			attrnum = indexInfo->ii_KeyAttrNumbers[i];

if (attrnum != 0)
+ 			{
indexattrs = bms_add_member(indexattrs,
attrnum - FirstLowInvalidHeapAttributeNumber);
+ 				if (indexInfo->ii_Unique)
+ 					uindexattrs = bms_add_member(uindexattrs,
+ 											   	 attrnum - FirstLowInvalidHeapAttributeNumber);
+ 			}
}

/* Collect all attributes used in expressions, too */
pull_varattnos((Node *) indexInfo->ii_Expressions, &indexattrs);
+ 		if (indexInfo->ii_Unique)
+ 			pull_varattnos((Node *) indexInfo->ii_Expressions, &uindexattrs);

No need; as Marti mentioned, such indexes are not usable for FOREIGN KEY.

/* Collect all attributes in the index predicate, too */
pull_varattnos((Node *) indexInfo->ii_Predicate, &indexattrs);
+ 		if (indexInfo->ii_Unique)
+ 			pull_varattnos((Node *) indexInfo->ii_Predicate, &uindexattrs);

Likewise.

*** a/src/include/access/htup.h
--- b/src/include/access/htup.h
***************
*** 163,174 **** typedef HeapTupleHeaderData *HeapTupleHeader;
#define HEAP_HASVARWIDTH		0x0002	/* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL		0x0004	/* has external stored attribute(s) */
#define HEAP_HASOID				0x0008	/* has an object-id field */
! /* bit 0x0010 is available */
#define HEAP_COMBOCID			0x0020	/* t_cid is a combo cid */
#define HEAP_XMAX_EXCL_LOCK		0x0040	/* xmax is exclusive locker */
#define HEAP_XMAX_SHARED_LOCK	0x0080	/* xmax is shared locker */
/* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
! #define HEAP_IS_LOCKED	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK)
#define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
#define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
#define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */
--- 163,177 ----
#define HEAP_HASVARWIDTH		0x0002	/* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL		0x0004	/* has external stored attribute(s) */
#define HEAP_HASOID				0x0008	/* has an object-id field */
! #define HEAP_XMAX_KEY_LOCK		0x0010	/* xmax is a "key" locker */
#define HEAP_COMBOCID			0x0020	/* t_cid is a combo cid */
#define HEAP_XMAX_EXCL_LOCK		0x0040	/* xmax is exclusive locker */
#define HEAP_XMAX_SHARED_LOCK	0x0080	/* xmax is shared locker */
+ /* if either SHARE or KEY lock bit is set, this is a "shared" lock */
+ #define HEAP_IS_SHARE_LOCKED (HEAP_XMAX_SHARED_LOCK | HEAP_XMAX_KEY_LOCK)
/* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */

"either" should now be "any".

! #define HEAP_IS_LOCKED (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK | \
! HEAP_XMAX_KEY_LOCK)
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */

*** a/src/include/nodes/parsenodes.h
--- b/src/include/nodes/parsenodes.h
***************
*** 554,571 **** typedef struct DefElem
} DefElem;
/*
! * LockingClause - raw representation of FOR UPDATE/SHARE options
*
* Note: lockedRels == NIL means "all relations in query". Otherwise it
* is a list of RangeVar nodes. (We use RangeVar mainly because it carries
* a location field --- currently, parse analysis insists on unqualified
* names in LockingClause.)
*/
typedef struct LockingClause
{
NodeTag type;
List *lockedRels; /* FOR UPDATE or FOR SHARE relations */
! bool forUpdate; /* true = FOR UPDATE, false = FOR SHARE */
bool noWait; /* NOWAIT option */
} LockingClause;
--- 554,579 ----
} DefElem;
/*
!  * LockingClause - raw representation of FOR UPDATE/SHARE/KEY LOCK options
*
* Note: lockedRels == NIL means "all relations in query".	Otherwise it
* is a list of RangeVar nodes.  (We use RangeVar mainly because it carries
* a location field --- currently, parse analysis insists on unqualified
* names in LockingClause.)
*/
+ typedef enum LockClauseStrength
+ {
+ 	/* order is important -- see applyLockingClause */
+ 	LCS_FORKEYLOCK,
+ 	LCS_FORSHARE,
+ 	LCS_FORUPDATE
+ } LockClauseStrength;
+ 

It's sure odd having this enum precisely mirror LockTupleMode. Is there
precedent for this? They are at opposite ends of processing stack, I suppose.

typedef struct LockingClause
{
NodeTag type;
List *lockedRels; /* FOR UPDATE or FOR SHARE relations */
! LockClauseStrength strength;
bool noWait; /* NOWAIT option */
} LockingClause;

***************
*** 839,856 **** typedef struct WindowClause
*	   parser output representation of FOR UPDATE/SHARE clauses
*
* Query.rowMarks contains a separate RowMarkClause node for each relation
!  * identified as a FOR UPDATE/SHARE target.  If FOR UPDATE/SHARE is applied
!  * to a subquery, we generate RowMarkClauses for all normal and subquery rels
!  * in the subquery, but they are marked pushedDown = true to distinguish them
!  * from clauses that were explicitly written at this query level.  Also,
!  * Query.hasForUpdate tells whether there were explicit FOR UPDATE/SHARE
!  * clauses in the current query level.
*/
typedef struct RowMarkClause
{
NodeTag		type;
Index		rti;			/* range table index of target relation */
! 	bool		forUpdate;		/* true = FOR UPDATE, false = FOR SHARE */
bool		noWait;			/* NOWAIT option */
bool		pushedDown;		/* pushed down from higher query level? */
} RowMarkClause;
--- 847,864 ----
*	   parser output representation of FOR UPDATE/SHARE clauses
*
* Query.rowMarks contains a separate RowMarkClause node for each relation
!  * identified as a FOR UPDATE/SHARE/KEY LOCK target.  If one of these clauses
!  * is applied to a subquery, we generate RowMarkClauses for all normal and
!  * subquery rels in the subquery, but they are marked pushedDown = true to
!  * distinguish them from clauses that were explicitly written at this query
!  * level.  Also, Query.hasForUpdate tells whether there were explicit FOR
!  * UPDATE/SHARE clauses in the current query level.

Need a "/KEY LOCK" in the last sentence.

*/
typedef struct RowMarkClause
{
NodeTag type;
Index rti; /* range table index of target relation */
! LockClauseStrength strength;
bool noWait; /* NOWAIT option */
bool pushedDown; /* pushed down from higher query level? */
} RowMarkClause;

I'd like to do some more testing around HOT and TOAST, plus run performance
tests. Figured I should get this much fired off, though.

Thanks,
nm

Attachments:

fklocks-20110211.patchtext/plain; charset=us-asciiDownload

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 2417,2422 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
--- 2417,2423 ----
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
  	Bitmapset  *hot_attrs;
+ 	Bitmapset  *keylck_attrs;
  	ItemId		lp;
  	HeapTupleData oldtup;
  	HeapTuple	heaptup;
***************
*** 2430,2435 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
--- 2431,2437 ----
  	bool		have_tuple_lock = false;
  	bool		iscombo;
  	bool		use_hot_update = false;
+ 	bool		keylocked_update = false;
  	bool		all_visible_cleared = false;
  	bool		all_visible_cleared_new = false;
  
***************
*** 2447,2453 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  	 * Note that we get a copy here, so we need not worry about relcache flush
  	 * happening midway through.
  	 */
! 	hot_attrs = RelationGetIndexAttrBitmap(relation);
  
  	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(otid));
  	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
--- 2449,2456 ----
  	 * Note that we get a copy here, so we need not worry about relcache flush
  	 * happening midway through.
  	 */
! 	hot_attrs = RelationGetIndexAttrBitmap(relation, false);
! 	keylck_attrs = RelationGetIndexAttrBitmap(relation, true);
  
  	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(otid));
  	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
***************
*** 2484,2567 **** l2:
  		xwait = HeapTupleHeaderGetXmax(oldtup.t_data);
  		infomask = oldtup.t_data->t_infomask;
  
- 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- 
  		/*
! 		 * Acquire tuple lock to establish our priority for the tuple (see
! 		 * heap_lock_tuple).  LockTuple will release us when we are
! 		 * next-in-line for the tuple.
! 		 *
! 		 * If we are forced to "start over" below, we keep the tuple lock;
! 		 * this arranges that we stay at the head of the line while rechecking
! 		 * tuple state.
  		 */
! 		if (!have_tuple_lock)
  		{
! 			LockTuple(relation, &(oldtup.t_self), ExclusiveLock);
! 			have_tuple_lock = true;
  		}
  
! 		/*
! 		 * Sleep until concurrent transaction ends.  Note that we don't care
! 		 * if the locker has an exclusive or shared lock, because we need
! 		 * exclusive.
! 		 */
! 
! 		if (infomask & HEAP_XMAX_IS_MULTI)
  		{
! 			/* wait for multixact */
! 			MultiXactIdWait((MultiXactId) xwait);
! 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
  
  			/*
! 			 * If xwait had just locked the tuple then some other xact could
! 			 * update this tuple before we get to this point.  Check for xmax
! 			 * change, and start over if so.
  			 */
! 			if (!(oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
! 				!TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
! 									 xwait))
! 				goto l2;
  
  			/*
! 			 * You might think the multixact is necessarily done here, but not
! 			 * so: it could have surviving members, namely our own xact or
! 			 * other subxacts of this backend.	It is legal for us to update
! 			 * the tuple in either case, however (the latter case is
! 			 * essentially a situation of upgrading our former shared lock to
! 			 * exclusive).	We don't bother changing the on-disk hint bits
! 			 * since we are about to overwrite the xmax altogether.
  			 */
! 		}
! 		else
! 		{
! 			/* wait for regular transaction to end */
! 			XactLockTableWait(xwait);
! 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
  
  			/*
! 			 * xwait is done, but if xwait had just locked the tuple then some
! 			 * other xact could update this tuple before we get to this point.
! 			 * Check for xmax change, and start over if so.
  			 */
! 			if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
! 				!TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
! 									 xwait))
! 				goto l2;
! 
! 			/* Otherwise check if it committed or aborted */
! 			UpdateXmaxHintBits(oldtup.t_data, buffer, xwait);
  		}
- 
- 		/*
- 		 * We may overwrite if previous xmax aborted, or if it committed but
- 		 * only locked the tuple without updating it.
- 		 */
- 		if (oldtup.t_data->t_infomask & (HEAP_XMAX_INVALID |
- 										 HEAP_IS_LOCKED))
- 			result = HeapTupleMayBeUpdated;
- 		else
- 			result = HeapTupleUpdated;
  	}
  
  	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
--- 2487,2587 ----
  		xwait = HeapTupleHeaderGetXmax(oldtup.t_data);
  		infomask = oldtup.t_data->t_infomask;
  
  		/*
! 		 * if it's only key-locked and we're not updating an indexed column,
! 		 * we can act though MayBeUpdated was returned, but the resulting tuple
! 		 * needs a bunch of fields copied from the original.
  		 */
! 		if ((infomask & HEAP_XMAX_KEY_LOCK) &&
! 			!(infomask & HEAP_XMAX_SHARED_LOCK) &&
! 			HeapSatisfiesHOTUpdate(relation, keylck_attrs,
! 								   &oldtup, newtup))
  		{
! 			result = HeapTupleMayBeUpdated;
! 			keylocked_update = true;
  		}
  
! 		if (!keylocked_update)
  		{
! 			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
  
  			/*
! 			 * Acquire tuple lock to establish our priority for the tuple (see
! 			 * heap_lock_tuple).  LockTuple will release us when we are
! 			 * next-in-line for the tuple.
! 			 *
! 			 * If we are forced to "start over" below, we keep the tuple lock;
! 			 * this arranges that we stay at the head of the line while rechecking
! 			 * tuple state.
  			 */
! 			if (!have_tuple_lock)
! 			{
! 				LockTuple(relation, &(oldtup.t_self), ExclusiveLock);
! 				have_tuple_lock = true;
! 			}
  
  			/*
! 			 * Sleep until concurrent transaction ends.  Note that we don't care
! 			 * if the locker has an exclusive or shared lock, because we need
! 			 * exclusive.
  			 */
! 
! 			if (infomask & HEAP_XMAX_IS_MULTI)
! 			{
! 				/* wait for multixact */
! 				MultiXactIdWait((MultiXactId) xwait);
! 				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
! 
! 				/*
! 				 * If xwait had just locked the tuple then some other xact could
! 				 * update this tuple before we get to this point.  Check for xmax
! 				 * change, and start over if so.
! 				 */
! 				if (!(oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
! 					!TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
! 										 xwait))
! 					goto l2;
! 
! 				/*
! 				 * You might think the multixact is necessarily done here, but not
! 				 * so: it could have surviving members, namely our own xact or
! 				 * other subxacts of this backend.	It is legal for us to update
! 				 * the tuple in either case, however (the latter case is
! 				 * essentially a situation of upgrading our former shared lock to
! 				 * exclusive).	We don't bother changing the on-disk hint bits
! 				 * since we are about to overwrite the xmax altogether.
! 				 */
! 			}
! 			else
! 			{
! 				/* wait for regular transaction to end */
! 				XactLockTableWait(xwait);
! 				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
! 
! 				/*
! 				 * xwait is done, but if xwait had just locked the tuple then some
! 				 * other xact could update this tuple before we get to this point.
! 				 * Check for xmax change, and start over if so.
! 				 */
! 				if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
! 					!TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
! 										 xwait))
! 					goto l2;
! 
! 				/* Otherwise check if it committed or aborted */
! 				UpdateXmaxHintBits(oldtup.t_data, buffer, xwait);
! 			}
  
  			/*
! 			 * We may overwrite if previous xmax aborted, or if it committed but
! 			 * only locked the tuple without updating it.
  			 */
! 			if (oldtup.t_data->t_infomask & (HEAP_XMAX_INVALID |
! 											 HEAP_IS_LOCKED))
! 				result = HeapTupleMayBeUpdated;
! 			else
! 				result = HeapTupleUpdated;
  		}
  	}
  
  	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
***************
*** 2609,2621 **** l2:
  
  	newtup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
  	newtup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
! 	newtup->t_data->t_infomask |= (HEAP_XMAX_INVALID | HEAP_UPDATED);
  	HeapTupleHeaderSetXmin(newtup->t_data, xid);
- 	HeapTupleHeaderSetCmin(newtup->t_data, cid);
- 	HeapTupleHeaderSetXmax(newtup->t_data, 0);	/* for cleanliness */
  	newtup->t_tableOid = RelationGetRelid(relation);
  
  	/*
  	 * Replace cid with a combo cid if necessary.  Note that we already put
  	 * the plain cid into the new tuple.
  	 */
--- 2629,2671 ----
  
  	newtup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
  	newtup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
! 	newtup->t_data->t_infomask |= HEAP_UPDATED;
  	HeapTupleHeaderSetXmin(newtup->t_data, xid);
  	newtup->t_tableOid = RelationGetRelid(relation);
  
  	/*
+ 	 * If this update is touching a tuple that was key-locked, we need to
+ 	 * carry forward some bits from the old tuple into the new copy.
+ 	 */
+ 	if (keylocked_update)
+ 	{
+ 		HeapTupleHeaderSetXmax(newtup->t_data,
+ 							   HeapTupleHeaderGetXmax(oldtup.t_data));
+ 		newtup->t_data->t_infomask |= (oldtup.t_data->t_infomask & 
+ 									   (HEAP_XMAX_IS_MULTI |
+ 										HEAP_XMAX_KEY_LOCK));
+ 		/*
+ 		 * we also need to copy the combo CID stuff, but only if the original
+ 		 * tuple was created by us; otherwise the combocid module complains
+ 		 * (Alternatively we could use HeapTupleHeaderGetRawCommandId)
+ 		 */
+ 		if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(oldtup.t_data)))
+ 		{
+ 			newtup->t_data->t_infomask |= (oldtup.t_data->t_infomask & 
+ 										   HEAP_COMBOCID);
+ 			HeapTupleHeaderSetCmin(newtup->t_data,
+ 								   HeapTupleHeaderGetCmin(oldtup.t_data));
+ 		}
+ 
+ 	}
+ 	else
+ 	{
+ 		newtup->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ 		HeapTupleHeaderSetXmax(newtup->t_data, 0);	/* for cleanliness */
+ 		HeapTupleHeaderSetCmin(newtup->t_data, cid);
+ 	}
+ 
+ 	/*
  	 * Replace cid with a combo cid if necessary.  Note that we already put
  	 * the plain cid into the new tuple.
  	 */
***************
*** 3142,3148 **** heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
  	LOCKMODE	tuple_lock_type;
  	bool		have_tuple_lock = false;
  
! 	tuple_lock_type = (mode == LockTupleShared) ? ShareLock : ExclusiveLock;
  
  	*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
  	LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
--- 3192,3211 ----
  	LOCKMODE	tuple_lock_type;
  	bool		have_tuple_lock = false;
  
! 	/* in FOR KEY LOCK mode, we use a share lock temporarily */
! 	switch (mode)
! 	{
! 		case LockTupleShared:
! 		case LockTupleKeylock:
! 			tuple_lock_type = ShareLock;
! 			break;
! 		case LockTupleExclusive:
! 			tuple_lock_type = ExclusiveLock;
! 			break;
! 		default:
! 			elog(ERROR, "invalid tuple lock mode");
! 			tuple_lock_type = 0;	/* keep compiler quiet */
! 	}
  
  	*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
  	LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
***************
*** 3175,3192 **** l3:
  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
  
  		/*
! 		 * If we wish to acquire share lock, and the tuple is already
! 		 * share-locked by a multixact that includes any subtransaction of the
  		 * current top transaction, then we effectively hold the desired lock
  		 * already.  We *must* succeed without trying to take the tuple lock,
  		 * else we will deadlock against anyone waiting to acquire exclusive
  		 * lock.  We don't need to make any state changes in this case.
  		 */
! 		if (mode == LockTupleShared &&
  			(infomask & HEAP_XMAX_IS_MULTI) &&
  			MultiXactIdIsCurrent((MultiXactId) xwait))
  		{
! 			Assert(infomask & HEAP_XMAX_SHARED_LOCK);
  			/* Probably can't hold tuple lock here, but may as well check */
  			if (have_tuple_lock)
  				UnlockTuple(relation, tid, tuple_lock_type);
--- 3238,3255 ----
  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
  
  		/*
! 		 * If we wish to acquire a key or share lock, and the tuple is already
! 		 * share- or key-locked by a multixact that includes any subtransaction of the
  		 * current top transaction, then we effectively hold the desired lock
  		 * already.  We *must* succeed without trying to take the tuple lock,
  		 * else we will deadlock against anyone waiting to acquire exclusive
  		 * lock.  We don't need to make any state changes in this case.
  		 */
! 		if ((mode == LockTupleShared || mode == LockTupleKeylock) &&
  			(infomask & HEAP_XMAX_IS_MULTI) &&
  			MultiXactIdIsCurrent((MultiXactId) xwait))
  		{
! 			Assert(infomask & HEAP_IS_SHARE_LOCKED);
  			/* Probably can't hold tuple lock here, but may as well check */
  			if (have_tuple_lock)
  				UnlockTuple(relation, tid, tuple_lock_type);
***************
*** 3217,3226 **** l3:
  			have_tuple_lock = true;
  		}
  
! 		if (mode == LockTupleShared && (infomask & HEAP_XMAX_SHARED_LOCK))
  		{
  			/*
! 			 * Acquiring sharelock when there's at least one sharelocker
  			 * already.  We need not wait for him/them to complete.
  			 */
  			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
--- 3280,3290 ----
  			have_tuple_lock = true;
  		}
  
! 		if ((mode == LockTupleShared || mode == LockTupleKeylock) &&
! 			(infomask & HEAP_IS_SHARE_LOCKED))
  		{
  			/*
! 			 * Acquiring sharelock or keylock when there's at least one such locker
  			 * already.  We need not wait for him/them to complete.
  			 */
  			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
***************
*** 3229,3235 **** l3:
  			 * Make sure it's still a shared lock, else start over.  (It's OK
  			 * if the ownership of the shared lock has changed, though.)
  			 */
! 			if (!(tuple->t_data->t_infomask & HEAP_XMAX_SHARED_LOCK))
  				goto l3;
  		}
  		else if (infomask & HEAP_XMAX_IS_MULTI)
--- 3293,3299 ----
  			 * Make sure it's still a shared lock, else start over.  (It's OK
  			 * if the ownership of the shared lock has changed, though.)
  			 */
! 			if (!(tuple->t_data->t_infomask & HEAP_IS_SHARE_LOCKED))
  				goto l3;
  		}
  		else if (infomask & HEAP_XMAX_IS_MULTI)
***************
*** 3339,3346 **** l3:
  	if (!(old_infomask & (HEAP_XMAX_INVALID |
  						  HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_IS_MULTI)) &&
! 		(mode == LockTupleShared ?
  		 (old_infomask & HEAP_IS_LOCKED) :
  		 (old_infomask & HEAP_XMAX_EXCL_LOCK)) &&
  		TransactionIdIsCurrentTransactionId(xmax))
  	{
--- 3403,3412 ----
  	if (!(old_infomask & (HEAP_XMAX_INVALID |
  						  HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_IS_MULTI)) &&
! 		(mode == LockTupleKeylock ?
  		 (old_infomask & HEAP_IS_LOCKED) :
+ 		 mode == LockTupleShared ?
+ 		 (old_infomask & (HEAP_XMAX_SHARED_LOCK | HEAP_XMAX_EXCL_LOCK)) :
  		 (old_infomask & HEAP_XMAX_EXCL_LOCK)) &&
  		TransactionIdIsCurrentTransactionId(xmax))
  	{
***************
*** 3364,3373 **** l3:
  									HEAP_IS_LOCKED |
  									HEAP_MOVED);
  
! 	if (mode == LockTupleShared)
  	{
  		/*
! 		 * If this is the first acquisition of a shared lock in the current
  		 * transaction, set my per-backend OldestMemberMXactId setting. We can
  		 * be certain that the transaction will never become a member of any
  		 * older MultiXactIds than that.  (We have to do this even if we end
--- 3430,3439 ----
  									HEAP_IS_LOCKED |
  									HEAP_MOVED);
  
! 	if (mode == LockTupleShared || mode == LockTupleKeylock)
  	{
  		/*
! 		 * If this is the first acquisition of a keylock or shared lock in the current
  		 * transaction, set my per-backend OldestMemberMXactId setting. We can
  		 * be certain that the transaction will never become a member of any
  		 * older MultiXactIds than that.  (We have to do this even if we end
***************
*** 3376,3382 **** l3:
  		 */
  		MultiXactIdSetOldestMember();
  
! 		new_infomask |= HEAP_XMAX_SHARED_LOCK;
  
  		/*
  		 * Check to see if we need a MultiXactId because there are multiple
--- 3442,3449 ----
  		 */
  		MultiXactIdSetOldestMember();
  
! 		new_infomask |= mode == LockTupleShared ? HEAP_XMAX_SHARED_LOCK :
! 			HEAP_XMAX_KEY_LOCK;
  
  		/*
  		 * Check to see if we need a MultiXactId because there are multiple
***************
*** 3476,3482 **** l3:
  		xlrec.target.tid = tuple->t_self;
  		xlrec.locking_xid = xid;
  		xlrec.xid_is_mxact = ((new_infomask & HEAP_XMAX_IS_MULTI) != 0);
! 		xlrec.shared_lock = (mode == LockTupleShared);
  		rdata[0].data = (char *) &xlrec;
  		rdata[0].len = SizeOfHeapLock;
  		rdata[0].buffer = InvalidBuffer;
--- 3543,3549 ----
  		xlrec.target.tid = tuple->t_self;
  		xlrec.locking_xid = xid;
  		xlrec.xid_is_mxact = ((new_infomask & HEAP_XMAX_IS_MULTI) != 0);
! 		xlrec.lock_strength = mode == LockTupleShared ? 's' : mode == LockTupleKeylock ? 'k' : 'x';
  		rdata[0].data = (char *) &xlrec;
  		rdata[0].len = SizeOfHeapLock;
  		rdata[0].buffer = InvalidBuffer;
***************
*** 4795,4802 **** heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record)
  						  HEAP_MOVED);
  	if (xlrec->xid_is_mxact)
  		htup->t_infomask |= HEAP_XMAX_IS_MULTI;
! 	if (xlrec->shared_lock)
  		htup->t_infomask |= HEAP_XMAX_SHARED_LOCK;
  	else
  		htup->t_infomask |= HEAP_XMAX_EXCL_LOCK;
  	HeapTupleHeaderClearHotUpdated(htup);
--- 4862,4871 ----
  						  HEAP_MOVED);
  	if (xlrec->xid_is_mxact)
  		htup->t_infomask |= HEAP_XMAX_IS_MULTI;
! 	if (xlrec->lock_strength == 's')
  		htup->t_infomask |= HEAP_XMAX_SHARED_LOCK;
+ 	else if (xlrec->lock_strength == 'k')
+ 		htup->t_infomask |= HEAP_XMAX_KEY_LOCK;
  	else
  		htup->t_infomask |= HEAP_XMAX_EXCL_LOCK;
  	HeapTupleHeaderClearHotUpdated(htup);
***************
*** 4999,5006 **** heap_desc(StringInfo buf, uint8 xl_info, char *rec)
  	{
  		xl_heap_lock *xlrec = (xl_heap_lock *) rec;
  
! 		if (xlrec->shared_lock)
  			appendStringInfo(buf, "shared_lock: ");
  		else
  			appendStringInfo(buf, "exclusive_lock: ");
  		if (xlrec->xid_is_mxact)
--- 5068,5077 ----
  	{
  		xl_heap_lock *xlrec = (xl_heap_lock *) rec;
  
! 		if (xlrec->lock_strength == 's')
  			appendStringInfo(buf, "shared_lock: ");
+ 		else if (xlrec->lock_strength == 'k')
+ 			appendStringInfo(buf, "key_lock: ");
  		else
  			appendStringInfo(buf, "exclusive_lock: ");
  		if (xlrec->xid_is_mxact)
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2863,2869 **** reindex_relation(Oid relid, bool toast_too, int flags)
  
  	/* Ensure rd_indexattr is valid; see comments for RelationSetIndexList */
  	if (is_pg_class)
! 		(void) RelationGetIndexAttrBitmap(rel);
  
  	PG_TRY();
  	{
--- 2863,2869 ----
  
  	/* Ensure rd_indexattr is valid; see comments for RelationSetIndexList */
  	if (is_pg_class)
! 		(void) RelationGetIndexAttrBitmap(rel, false);
  
  	PG_TRY();
  	{
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
***************
*** 701,707 **** InitPlan(QueryDesc *queryDesc, int eflags)
  	}
  
  	/*
! 	 * Similarly, we have to lock relations selected FOR UPDATE/FOR SHARE
  	 * before we initialize the plan tree, else we'd be risking lock upgrades.
  	 * While we are at it, build the ExecRowMark list.
  	 */
--- 701,707 ----
  	}
  
  	/*
! 	 * Similarly, we have to lock relations selected FOR UPDATE/FOR SHARE/KEY LOCK
  	 * before we initialize the plan tree, else we'd be risking lock upgrades.
  	 * While we are at it, build the ExecRowMark list.
  	 */
***************
*** 721,726 **** InitPlan(QueryDesc *queryDesc, int eflags)
--- 721,727 ----
  		{
  			case ROW_MARK_EXCLUSIVE:
  			case ROW_MARK_SHARE:
+ 			case ROW_MARK_KEYLOCK:
  				relid = getrelid(rc->rti, rangeTable);
  				relation = heap_open(relid, RowShareLock);
  				break;
*** a/src/backend/executor/nodeLockRows.c
--- b/src/backend/executor/nodeLockRows.c
***************
*** 112,119 **** lnext:
  		/* okay, try to lock the tuple */
  		if (erm->markType == ROW_MARK_EXCLUSIVE)
  			lockmode = LockTupleExclusive;
! 		else
  			lockmode = LockTupleShared;
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
--- 112,126 ----
  		/* okay, try to lock the tuple */
  		if (erm->markType == ROW_MARK_EXCLUSIVE)
  			lockmode = LockTupleExclusive;
! 		else if (erm->markType == ROW_MARK_SHARE)
  			lockmode = LockTupleShared;
+ 		else if (erm->markType == ROW_MARK_KEYLOCK)
+ 			lockmode = LockTupleKeylock;
+ 		else
+ 		{
+ 			elog(ERROR, "unsupported rowmark type");
+ 			lockmode = LockTupleExclusive;	/* keep compiler quiet */
+ 		}
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 1953,1959 **** _copyRowMarkClause(RowMarkClause *from)
  	RowMarkClause *newnode = makeNode(RowMarkClause);
  
  	COPY_SCALAR_FIELD(rti);
! 	COPY_SCALAR_FIELD(forUpdate);
  	COPY_SCALAR_FIELD(noWait);
  	COPY_SCALAR_FIELD(pushedDown);
  
--- 1953,1959 ----
  	RowMarkClause *newnode = makeNode(RowMarkClause);
  
  	COPY_SCALAR_FIELD(rti);
! 	COPY_SCALAR_FIELD(strength);
  	COPY_SCALAR_FIELD(noWait);
  	COPY_SCALAR_FIELD(pushedDown);
  
***************
*** 2310,2316 **** _copyLockingClause(LockingClause *from)
  	LockingClause *newnode = makeNode(LockingClause);
  
  	COPY_NODE_FIELD(lockedRels);
! 	COPY_SCALAR_FIELD(forUpdate);
  	COPY_SCALAR_FIELD(noWait);
  
  	return newnode;
--- 2310,2316 ----
  	LockingClause *newnode = makeNode(LockingClause);
  
  	COPY_NODE_FIELD(lockedRels);
! 	COPY_SCALAR_FIELD(strength);
  	COPY_SCALAR_FIELD(noWait);
  
  	return newnode;
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
***************
*** 2266,2272 **** static bool
  _equalLockingClause(LockingClause *a, LockingClause *b)
  {
  	COMPARE_NODE_FIELD(lockedRels);
! 	COMPARE_SCALAR_FIELD(forUpdate);
  	COMPARE_SCALAR_FIELD(noWait);
  
  	return true;
--- 2266,2272 ----
  _equalLockingClause(LockingClause *a, LockingClause *b)
  {
  	COMPARE_NODE_FIELD(lockedRels);
! 	COMPARE_SCALAR_FIELD(strength);
  	COMPARE_SCALAR_FIELD(noWait);
  
  	return true;
***************
*** 2335,2341 **** static bool
  _equalRowMarkClause(RowMarkClause *a, RowMarkClause *b)
  {
  	COMPARE_SCALAR_FIELD(rti);
! 	COMPARE_SCALAR_FIELD(forUpdate);
  	COMPARE_SCALAR_FIELD(noWait);
  	COMPARE_SCALAR_FIELD(pushedDown);
  
--- 2335,2341 ----
  _equalRowMarkClause(RowMarkClause *a, RowMarkClause *b)
  {
  	COMPARE_SCALAR_FIELD(rti);
! 	COMPARE_SCALAR_FIELD(strength);
  	COMPARE_SCALAR_FIELD(noWait);
  	COMPARE_SCALAR_FIELD(pushedDown);
  
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 2005,2011 **** _outLockingClause(StringInfo str, LockingClause *node)
  	WRITE_NODE_TYPE("LOCKINGCLAUSE");
  
  	WRITE_NODE_FIELD(lockedRels);
! 	WRITE_BOOL_FIELD(forUpdate);
  	WRITE_BOOL_FIELD(noWait);
  }
  
--- 2005,2011 ----
  	WRITE_NODE_TYPE("LOCKINGCLAUSE");
  
  	WRITE_NODE_FIELD(lockedRels);
! 	WRITE_ENUM_FIELD(strength, LockClauseStrength);
  	WRITE_BOOL_FIELD(noWait);
  }
  
***************
*** 2181,2187 **** _outRowMarkClause(StringInfo str, RowMarkClause *node)
  	WRITE_NODE_TYPE("ROWMARKCLAUSE");
  
  	WRITE_UINT_FIELD(rti);
! 	WRITE_BOOL_FIELD(forUpdate);
  	WRITE_BOOL_FIELD(noWait);
  	WRITE_BOOL_FIELD(pushedDown);
  }
--- 2181,2187 ----
  	WRITE_NODE_TYPE("ROWMARKCLAUSE");
  
  	WRITE_UINT_FIELD(rti);
! 	WRITE_BOOL_FIELD(strength);
  	WRITE_BOOL_FIELD(noWait);
  	WRITE_BOOL_FIELD(pushedDown);
  }
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
***************
*** 299,305 **** _readRowMarkClause(void)
  	READ_LOCALS(RowMarkClause);
  
  	READ_UINT_FIELD(rti);
! 	READ_BOOL_FIELD(forUpdate);
  	READ_BOOL_FIELD(noWait);
  	READ_BOOL_FIELD(pushedDown);
  
--- 299,305 ----
  	READ_LOCALS(RowMarkClause);
  
  	READ_UINT_FIELD(rti);
! 	READ_BOOL_FIELD(strength);
  	READ_BOOL_FIELD(noWait);
  	READ_BOOL_FIELD(pushedDown);
  
*** a/src/backend/optimizer/plan/initsplan.c
--- b/src/backend/optimizer/plan/initsplan.c
***************
*** 561,571 **** make_outerjoininfo(PlannerInfo *root,
  	Assert(jointype != JOIN_RIGHT);
  
  	/*
! 	 * Presently the executor cannot support FOR UPDATE/SHARE marking of rels
  	 * appearing on the nullable side of an outer join. (It's somewhat unclear
  	 * what that would mean, anyway: what should we mark when a result row is
  	 * generated from no element of the nullable relation?)  So, complain if
! 	 * any nullable rel is FOR UPDATE/SHARE.
  	 *
  	 * You might be wondering why this test isn't made far upstream in the
  	 * parser.	It's because the parser hasn't got enough info --- consider
--- 561,571 ----
  	Assert(jointype != JOIN_RIGHT);
  
  	/*
! 	 * Presently the executor cannot support FOR UPDATE/SHARE/KEY LOCK marking of rels
  	 * appearing on the nullable side of an outer join. (It's somewhat unclear
  	 * what that would mean, anyway: what should we mark when a result row is
  	 * generated from no element of the nullable relation?)  So, complain if
! 	 * any nullable rel is FOR UPDATE/SHARE/KEY LOCK.
  	 *
  	 * You might be wondering why this test isn't made far upstream in the
  	 * parser.	It's because the parser hasn't got enough info --- consider
***************
*** 583,589 **** make_outerjoininfo(PlannerInfo *root,
  			(jointype == JOIN_FULL && bms_is_member(rc->rti, left_rels)))
  			ereport(ERROR,
  					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 					 errmsg("SELECT FOR UPDATE/SHARE cannot be applied to the nullable side of an outer join")));
  	}
  
  	sjinfo->syn_lefthand = left_rels;
--- 583,589 ----
  			(jointype == JOIN_FULL && bms_is_member(rc->rti, left_rels)))
  			ereport(ERROR,
  					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 					 errmsg("SELECT FOR UPDATE/SHARE/KEY LOCK cannot be applied to the nullable side of an outer join")));
  	}
  
  	sjinfo->syn_lefthand = left_rels;
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
***************
*** 1830,1836 **** preprocess_rowmarks(PlannerInfo *root)
  	if (parse->rowMarks)
  	{
  		/*
! 		 * We've got trouble if FOR UPDATE/SHARE appears inside grouping,
  		 * since grouping renders a reference to individual tuple CTIDs
  		 * invalid.  This is also checked at parse time, but that's
  		 * insufficient because of rule substitution, query pullup, etc.
--- 1830,1836 ----
  	if (parse->rowMarks)
  	{
  		/*
! 		 * We've got trouble if FOR UPDATE/SHARE/KEY LOCK appears inside grouping,
  		 * since grouping renders a reference to individual tuple CTIDs
  		 * invalid.  This is also checked at parse time, but that's
  		 * insufficient because of rule substitution, query pullup, etc.
***************
*** 1840,1846 **** preprocess_rowmarks(PlannerInfo *root)
  	else
  	{
  		/*
! 		 * We only need rowmarks for UPDATE, DELETE, or FOR UPDATE/SHARE.
  		 */
  		if (parse->commandType != CMD_UPDATE &&
  			parse->commandType != CMD_DELETE)
--- 1840,1846 ----
  	else
  	{
  		/*
! 		 * We only need rowmarks for UPDATE, DELETE, or FOR UPDATE/SHARE/KEY LOCK.
  		 */
  		if (parse->commandType != CMD_UPDATE &&
  			parse->commandType != CMD_DELETE)
***************
*** 1850,1856 **** preprocess_rowmarks(PlannerInfo *root)
  	/*
  	 * We need to have rowmarks for all base relations except the target. We
  	 * make a bitmapset of all base rels and then remove the items we don't
! 	 * need or have FOR UPDATE/SHARE marks for.
  	 */
  	rels = get_base_rel_indexes((Node *) parse->jointree);
  	if (parse->resultRelation)
--- 1850,1856 ----
  	/*
  	 * We need to have rowmarks for all base relations except the target. We
  	 * make a bitmapset of all base rels and then remove the items we don't
! 	 * need or have FOR UPDATE/SHARE/KEY LOCK marks for.
  	 */
  	rels = get_base_rel_indexes((Node *) parse->jointree);
  	if (parse->resultRelation)
***************
*** 1887,1896 **** preprocess_rowmarks(PlannerInfo *root)
  		newrc = makeNode(PlanRowMark);
  		newrc->rti = newrc->prti = rc->rti;
  		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
! 		if (rc->forUpdate)
! 			newrc->markType = ROW_MARK_EXCLUSIVE;
! 		else
! 			newrc->markType = ROW_MARK_SHARE;
  		newrc->noWait = rc->noWait;
  		newrc->isParent = false;
  
--- 1887,1904 ----
  		newrc = makeNode(PlanRowMark);
  		newrc->rti = newrc->prti = rc->rti;
  		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
!  		switch (rc->strength)
!  		{
!  			case LCS_FORUPDATE:
!  				newrc->markType = ROW_MARK_EXCLUSIVE;
!  				break;
!  			case LCS_FORSHARE:
!  				newrc->markType = ROW_MARK_SHARE;
!  				break;
!  			case LCS_FORKEYLOCK:
!  				newrc->markType = ROW_MARK_KEYLOCK;
!  				break;
!  		}
  		newrc->noWait = rc->noWait;
  		newrc->isParent = false;
  
*** a/src/backend/parser/analyze.c
--- b/src/backend/parser/analyze.c
***************
*** 2161,2167 **** transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc,
  	/* make a clause we can pass down to subqueries to select all rels */
  	allrels = makeNode(LockingClause);
  	allrels->lockedRels = NIL;	/* indicates all rels */
! 	allrels->forUpdate = lc->forUpdate;
  	allrels->noWait = lc->noWait;
  
  	if (lockedRels == NIL)
--- 2161,2167 ----
  	/* make a clause we can pass down to subqueries to select all rels */
  	allrels = makeNode(LockingClause);
  	allrels->lockedRels = NIL;	/* indicates all rels */
! 	allrels->strength = lc->strength;
  	allrels->noWait = lc->noWait;
  
  	if (lockedRels == NIL)
***************
*** 2177,2188 **** transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc,
  			{
  				case RTE_RELATION:
  					applyLockingClause(qry, i,
! 									   lc->forUpdate, lc->noWait, pushedDown);
  					rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  					break;
  				case RTE_SUBQUERY:
  					applyLockingClause(qry, i,
! 									   lc->forUpdate, lc->noWait, pushedDown);
  
  					/*
  					 * FOR UPDATE/SHARE of subquery is propagated to all of
--- 2177,2188 ----
  			{
  				case RTE_RELATION:
  					applyLockingClause(qry, i,
! 									   lc->strength, lc->noWait, pushedDown);
  					rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  					break;
  				case RTE_SUBQUERY:
  					applyLockingClause(qry, i,
! 									   lc->strength, lc->noWait, pushedDown);
  
  					/*
  					 * FOR UPDATE/SHARE of subquery is propagated to all of
***************
*** 2226,2238 **** transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc,
  					{
  						case RTE_RELATION:
  							applyLockingClause(qry, i,
! 											   lc->forUpdate, lc->noWait,
  											   pushedDown);
  							rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  							break;
  						case RTE_SUBQUERY:
  							applyLockingClause(qry, i,
! 											   lc->forUpdate, lc->noWait,
  											   pushedDown);
  							/* see comment above */
  							transformLockingClause(pstate, rte->subquery,
--- 2226,2238 ----
  					{
  						case RTE_RELATION:
  							applyLockingClause(qry, i,
! 											   lc->strength, lc->noWait,
  											   pushedDown);
  							rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  							break;
  						case RTE_SUBQUERY:
  							applyLockingClause(qry, i,
! 											   lc->strength, lc->noWait,
  											   pushedDown);
  							/* see comment above */
  							transformLockingClause(pstate, rte->subquery,
***************
*** 2291,2297 **** transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc,
   */
  void
  applyLockingClause(Query *qry, Index rtindex,
! 				   bool forUpdate, bool noWait, bool pushedDown)
  {
  	RowMarkClause *rc;
  
--- 2291,2297 ----
   */
  void
  applyLockingClause(Query *qry, Index rtindex,
! 				   LockClauseStrength strength, bool noWait, bool pushedDown)
  {
  	RowMarkClause *rc;
  
***************
*** 2303,2312 **** applyLockingClause(Query *qry, Index rtindex,
  	if ((rc = get_parse_rowmark(qry, rtindex)) != NULL)
  	{
  		/*
! 		 * If the same RTE is specified both FOR UPDATE and FOR SHARE, treat
! 		 * it as FOR UPDATE.  (Reasonable, since you can't take both a shared
! 		 * and exclusive lock at the same time; it'll end up being exclusive
! 		 * anyway.)
  		 *
  		 * We also consider that NOWAIT wins if it's specified both ways. This
  		 * is a bit more debatable but raising an error doesn't seem helpful.
--- 2303,2312 ----
  	if ((rc = get_parse_rowmark(qry, rtindex)) != NULL)
  	{
  		/*
! 		 * If the same RTE is specified for more than one locking strength,
! 		 * treat is as the strongest.  (Reasonable, since you can't take both a
! 		 * shared and exclusive lock at the same time; it'll end up being
! 		 * exclusive anyway.)
  		 *
  		 * We also consider that NOWAIT wins if it's specified both ways. This
  		 * is a bit more debatable but raising an error doesn't seem helpful.
***************
*** 2315,2321 **** applyLockingClause(Query *qry, Index rtindex,
  		 *
  		 * And of course pushedDown becomes false if any clause is explicit.
  		 */
! 		rc->forUpdate |= forUpdate;
  		rc->noWait |= noWait;
  		rc->pushedDown &= pushedDown;
  		return;
--- 2315,2321 ----
  		 *
  		 * And of course pushedDown becomes false if any clause is explicit.
  		 */
! 		rc->strength = Max(rc->strength, strength);
  		rc->noWait |= noWait;
  		rc->pushedDown &= pushedDown;
  		return;
***************
*** 2324,2330 **** applyLockingClause(Query *qry, Index rtindex,
  	/* Make a new RowMarkClause */
  	rc = makeNode(RowMarkClause);
  	rc->rti = rtindex;
! 	rc->forUpdate = forUpdate;
  	rc->noWait = noWait;
  	rc->pushedDown = pushedDown;
  	qry->rowMarks = lappend(qry->rowMarks, rc);
--- 2324,2330 ----
  	/* Make a new RowMarkClause */
  	rc = makeNode(RowMarkClause);
  	rc->rti = rtindex;
! 	rc->strength = strength;
  	rc->noWait = noWait;
  	rc->pushedDown = pushedDown;
  	qry->rowMarks = lappend(qry->rowMarks, rc);
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
***************
*** 8542,8548 **** for_locking_item:
  				{
  					LockingClause *n = makeNode(LockingClause);
  					n->lockedRels = $3;
! 					n->forUpdate = TRUE;
  					n->noWait = $4;
  					$$ = (Node *) n;
  				}
--- 8542,8548 ----
  				{
  					LockingClause *n = makeNode(LockingClause);
  					n->lockedRels = $3;
! 					n->strength = LCS_FORUPDATE;
  					n->noWait = $4;
  					$$ = (Node *) n;
  				}
***************
*** 8550,8559 **** for_locking_item:
  				{
  					LockingClause *n = makeNode(LockingClause);
  					n->lockedRels = $3;
! 					n->forUpdate = FALSE;
  					n->noWait = $4;
  					$$ = (Node *) n;
  				}
  		;
  
  locked_rels_list:
--- 8550,8567 ----
  				{
  					LockingClause *n = makeNode(LockingClause);
  					n->lockedRels = $3;
! 					n->strength = LCS_FORSHARE;
  					n->noWait = $4;
  					$$ = (Node *) n;
  				}
+ 			| FOR KEY LOCK_P locked_rels_list opt_nowait
+ 				{
+ 					LockingClause *n = makeNode(LockingClause);
+ 					n->lockedRels = $4;
+ 					n->strength = LCS_FORKEYLOCK;
+ 					n->noWait = $5;
+ 					$$ = (Node *) n;
+ 				}
  		;
  
  locked_rels_list:
*** a/src/backend/rewrite/rewriteHandler.c
--- b/src/backend/rewrite/rewriteHandler.c
***************
*** 55,61 **** static void rewriteValuesRTE(RangeTblEntry *rte, Relation target_relation,
  static void rewriteTargetListUD(Query *parsetree, RangeTblEntry *target_rte,
  								Relation target_relation);
  static void markQueryForLocking(Query *qry, Node *jtnode,
! 					bool forUpdate, bool noWait, bool pushedDown);
  static List *matchLocks(CmdType event, RuleLock *rulelocks,
  		   int varno, Query *parsetree);
  static Query *fireRIRrules(Query *parsetree, List *activeRIRs,
--- 55,61 ----
  static void rewriteTargetListUD(Query *parsetree, RangeTblEntry *target_rte,
  								Relation target_relation);
  static void markQueryForLocking(Query *qry, Node *jtnode,
! 					LockClauseStrength strength, bool noWait, bool pushedDown);
  static List *matchLocks(CmdType event, RuleLock *rulelocks,
  		   int varno, Query *parsetree);
  static Query *fireRIRrules(Query *parsetree, List *activeRIRs,
***************
*** 1354,1361 **** ApplyRetrieveRule(Query *parsetree,
  	rte->modifiedCols = NULL;
  
  	/*
! 	 * If FOR UPDATE/SHARE of view, mark all the contained tables as implicit
! 	 * FOR UPDATE/SHARE, the same as the parser would have done if the view's
  	 * subquery had been written out explicitly.
  	 *
  	 * Note: we don't consider forUpdatePushedDown here; such marks will be
--- 1354,1361 ----
  	rte->modifiedCols = NULL;
  
  	/*
! 	 * If FOR UPDATE/SHARE/KEY LOCK of view, mark all the contained tables as implicit
! 	 * FOR UPDATE/SHARE/KEY LOCK, the same as the parser would have done if the view's
  	 * subquery had been written out explicitly.
  	 *
  	 * Note: we don't consider forUpdatePushedDown here; such marks will be
***************
*** 1363,1375 **** ApplyRetrieveRule(Query *parsetree,
  	 */
  	if (rc != NULL)
  		markQueryForLocking(rule_action, (Node *) rule_action->jointree,
! 							rc->forUpdate, rc->noWait, true);
  
  	return parsetree;
  }
  
  /*
!  * Recursively mark all relations used by a view as FOR UPDATE/SHARE.
   *
   * This may generate an invalid query, eg if some sub-query uses an
   * aggregate.  We leave it to the planner to detect that.
--- 1363,1375 ----
  	 */
  	if (rc != NULL)
  		markQueryForLocking(rule_action, (Node *) rule_action->jointree,
! 							rc->strength, rc->noWait, true);
  
  	return parsetree;
  }
  
  /*
!  * Recursively mark all relations used by a view as FOR UPDATE/SHARE/KEY LOCK.
   *
   * This may generate an invalid query, eg if some sub-query uses an
   * aggregate.  We leave it to the planner to detect that.
***************
*** 1381,1387 **** ApplyRetrieveRule(Query *parsetree,
   */
  static void
  markQueryForLocking(Query *qry, Node *jtnode,
! 					bool forUpdate, bool noWait, bool pushedDown)
  {
  	if (jtnode == NULL)
  		return;
--- 1381,1387 ----
   */
  static void
  markQueryForLocking(Query *qry, Node *jtnode,
! 					LockClauseStrength strength, bool noWait, bool pushedDown)
  {
  	if (jtnode == NULL)
  		return;
***************
*** 1392,1406 **** markQueryForLocking(Query *qry, Node *jtnode,
  
  		if (rte->rtekind == RTE_RELATION)
  		{
! 			applyLockingClause(qry, rti, forUpdate, noWait, pushedDown);
  			rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  		}
  		else if (rte->rtekind == RTE_SUBQUERY)
  		{
! 			applyLockingClause(qry, rti, forUpdate, noWait, pushedDown);
! 			/* FOR UPDATE/SHARE of subquery is propagated to subquery's rels */
  			markQueryForLocking(rte->subquery, (Node *) rte->subquery->jointree,
! 								forUpdate, noWait, true);
  		}
  		/* other RTE types are unaffected by FOR UPDATE */
  	}
--- 1392,1406 ----
  
  		if (rte->rtekind == RTE_RELATION)
  		{
! 			applyLockingClause(qry, rti, strength, noWait, pushedDown);
  			rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  		}
  		else if (rte->rtekind == RTE_SUBQUERY)
  		{
! 			applyLockingClause(qry, rti, strength, noWait, pushedDown);
! 			/* FOR UPDATE/SHARE/KEY LOCK of subquery is propagated to subquery's rels */
  			markQueryForLocking(rte->subquery, (Node *) rte->subquery->jointree,
! 								strength, noWait, true);
  		}
  		/* other RTE types are unaffected by FOR UPDATE */
  	}
***************
*** 1410,1423 **** markQueryForLocking(Query *qry, Node *jtnode,
  		ListCell   *l;
  
  		foreach(l, f->fromlist)
! 			markQueryForLocking(qry, lfirst(l), forUpdate, noWait, pushedDown);
  	}
  	else if (IsA(jtnode, JoinExpr))
  	{
  		JoinExpr   *j = (JoinExpr *) jtnode;
  
! 		markQueryForLocking(qry, j->larg, forUpdate, noWait, pushedDown);
! 		markQueryForLocking(qry, j->rarg, forUpdate, noWait, pushedDown);
  	}
  	else
  		elog(ERROR, "unrecognized node type: %d",
--- 1410,1423 ----
  		ListCell   *l;
  
  		foreach(l, f->fromlist)
! 			markQueryForLocking(qry, lfirst(l), strength, noWait, pushedDown);
  	}
  	else if (IsA(jtnode, JoinExpr))
  	{
  		JoinExpr   *j = (JoinExpr *) jtnode;
  
! 		markQueryForLocking(qry, j->larg, strength, noWait, pushedDown);
! 		markQueryForLocking(qry, j->rarg, strength, noWait, pushedDown);
  	}
  	else
  		elog(ERROR, "unrecognized node type: %d",
*** a/src/backend/tcop/utility.c
--- b/src/backend/tcop/utility.c
***************
*** 2205,2214 **** CreateCommandTag(Node *parsetree)
  						else if (stmt->rowMarks != NIL)
  						{
  							/* not 100% but probably close enough */
! 							if (((RowMarkClause *) linitial(stmt->rowMarks))->forUpdate)
! 								tag = "SELECT FOR UPDATE";
! 							else
! 								tag = "SELECT FOR SHARE";
  						}
  						else
  							tag = "SELECT";
--- 2205,2225 ----
  						else if (stmt->rowMarks != NIL)
  						{
  							/* not 100% but probably close enough */
! 							switch (((RowMarkClause *) linitial(stmt->rowMarks))->strength)
! 							{
! 								case LCS_FORUPDATE:
! 									tag = "SELECT FOR UPDATE";
! 									break;
! 								case LCS_FORSHARE:
! 									tag = "SELECT FOR SHARE";
! 									break;
! 								case LCS_FORKEYLOCK:
! 									tag = "SELECT FOR KEY LOCK";
! 									break;
! 								default:
! 									tag =  "???";
! 									break;
! 							}
  						}
  						else
  							tag = "SELECT";
*** a/src/backend/utils/adt/ri_triggers.c
--- b/src/backend/utils/adt/ri_triggers.c
***************
*** 308,314 **** RI_FKey_check(PG_FUNCTION_ARGS)
  	 * Get the relation descriptors of the FK and PK tables.
  	 *
  	 * pk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR SHARE will get on it.
  	 */
  	fk_rel = trigdata->tg_relation;
  	pk_rel = heap_open(riinfo.pk_relid, RowShareLock);
--- 308,314 ----
  	 * Get the relation descriptors of the FK and PK tables.
  	 *
  	 * pk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR KEY LOCK will get on it.
  	 */
  	fk_rel = trigdata->tg_relation;
  	pk_rel = heap_open(riinfo.pk_relid, RowShareLock);
***************
*** 338,349 **** RI_FKey_check(PG_FUNCTION_ARGS)
  
  			/* ---------
  			 * The query string built is
! 			 *	SELECT 1 FROM ONLY <pktable>
  			 * ----------
  			 */
  			quoteRelationName(pkrelname, pk_rel);
  			snprintf(querystr, sizeof(querystr),
! 					 "SELECT 1 FROM ONLY %s x FOR SHARE OF x",
  					 pkrelname);
  
  			/* Prepare and save the plan */
--- 338,349 ----
  
  			/* ---------
  			 * The query string built is
! 			 *	SELECT 1 FROM ONLY <pktable> x FOR KEY LOCK OF x
  			 * ----------
  			 */
  			quoteRelationName(pkrelname, pk_rel);
  			snprintf(querystr, sizeof(querystr),
! 					 "SELECT 1 FROM ONLY %s x FOR KEY LOCK OF x",
  					 pkrelname);
  
  			/* Prepare and save the plan */
***************
*** 463,469 **** RI_FKey_check(PG_FUNCTION_ARGS)
  
  		/* ----------
  		 * The query string built is
! 		 *	SELECT 1 FROM ONLY <pktable> WHERE pkatt1 = $1 [AND ...] FOR SHARE
  		 * The type id's for the $ parameters are those of the
  		 * corresponding FK attributes.
  		 * ----------
--- 463,470 ----
  
  		/* ----------
  		 * The query string built is
! 		 *	SELECT 1 FROM ONLY <pktable> x WHERE pkatt1 = $1 [AND ...]
! 		 *	       FOR KEY LOCK OF x
  		 * The type id's for the $ parameters are those of the
  		 * corresponding FK attributes.
  		 * ----------
***************
*** 487,493 **** RI_FKey_check(PG_FUNCTION_ARGS)
  			querysep = "AND";
  			queryoids[i] = fk_type;
  		}
! 		appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  		/* Prepare and save the plan */
  		qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
--- 488,494 ----
  			querysep = "AND";
  			queryoids[i] = fk_type;
  		}
! 		appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  		/* Prepare and save the plan */
  		qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
***************
*** 625,631 **** ri_Check_Pk_Match(Relation pk_rel, Relation fk_rel,
  
  		/* ----------
  		 * The query string built is
! 		 *	SELECT 1 FROM ONLY <pktable> WHERE pkatt1 = $1 [AND ...] FOR SHARE
  		 * The type id's for the $ parameters are those of the
  		 * PK attributes themselves.
  		 * ----------
--- 626,633 ----
  
  		/* ----------
  		 * The query string built is
! 		 *	SELECT 1 FROM ONLY <pktable> x WHERE pkatt1 = $1 [AND ...]
! 		 *	       FOR KEY LOCK OF x
  		 * The type id's for the $ parameters are those of the
  		 * PK attributes themselves.
  		 * ----------
***************
*** 648,654 **** ri_Check_Pk_Match(Relation pk_rel, Relation fk_rel,
  			querysep = "AND";
  			queryoids[i] = pk_type;
  		}
! 		appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  		/* Prepare and save the plan */
  		qplan = ri_PlanCheck(querybuf.data, riinfo->nkeys, queryoids,
--- 650,656 ----
  			querysep = "AND";
  			queryoids[i] = pk_type;
  		}
! 		appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  		/* Prepare and save the plan */
  		qplan = ri_PlanCheck(querybuf.data, riinfo->nkeys, queryoids,
***************
*** 712,718 **** RI_FKey_noaction_del(PG_FUNCTION_ARGS)
  	 * Get the relation descriptors of the FK and PK tables and the old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR SHARE will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
--- 714,720 ----
  	 * Get the relation descriptors of the FK and PK tables and the old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR KEY LOCK will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
***************
*** 780,786 **** RI_FKey_noaction_del(PG_FUNCTION_ARGS)
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> WHERE $1 = fkatt1 [AND ...]
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
--- 782,789 ----
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> x WHERE $1 = fkatt1 [AND ...]
! 				 *	       FOR KEY LOCK OF x
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
***************
*** 805,811 **** RI_FKey_noaction_del(PG_FUNCTION_ARGS)
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
--- 808,814 ----
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
***************
*** 890,896 **** RI_FKey_noaction_upd(PG_FUNCTION_ARGS)
  	 * old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR SHARE will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
--- 893,899 ----
  	 * old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR KEY LOCK will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
***************
*** 993,999 **** RI_FKey_noaction_upd(PG_FUNCTION_ARGS)
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
--- 996,1002 ----
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
***************
*** 1431,1437 **** RI_FKey_restrict_del(PG_FUNCTION_ARGS)
  	 * Get the relation descriptors of the FK and PK tables and the old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR SHARE will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
--- 1434,1440 ----
  	 * Get the relation descriptors of the FK and PK tables and the old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR KEY LOCK will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
***************
*** 1489,1495 **** RI_FKey_restrict_del(PG_FUNCTION_ARGS)
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> WHERE $1 = fkatt1 [AND ...]
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
--- 1492,1499 ----
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> x WHERE $1 = fkatt1 [AND ...]
! 				 *	       FOR KEY LOCK OF x
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
***************
*** 1514,1520 **** RI_FKey_restrict_del(PG_FUNCTION_ARGS)
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
--- 1518,1524 ----
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
***************
*** 1604,1610 **** RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
  	 * old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR SHARE will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
--- 1608,1614 ----
  	 * old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR KEY LOCK will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
***************
*** 1672,1678 **** RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> WHERE $1 = fkatt1 [AND ...]
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
--- 1676,1683 ----
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> x WHERE $1 = fkatt1 [AND ...]
! 				 *	       FOR KEY LOCK OF x
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
***************
*** 1697,1703 **** RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
--- 1702,1708 ----
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
*** a/src/backend/utils/adt/ruleutils.c
--- b/src/backend/utils/adt/ruleutils.c
***************
*** 2837,2848 **** get_select_query_def(Query *query, deparse_context *context,
  			if (rc->pushedDown)
  				continue;
  
! 			if (rc->forUpdate)
! 				appendContextKeyword(context, " FOR UPDATE",
! 									 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
! 			else
! 				appendContextKeyword(context, " FOR SHARE",
! 									 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
  			appendStringInfo(buf, " OF %s",
  							 quote_identifier(rte->eref->aliasname));
  			if (rc->noWait)
--- 2837,2858 ----
  			if (rc->pushedDown)
  				continue;
  
! 			switch (rc->strength)
! 			{
! 				case LCS_FORKEYLOCK:
! 					appendContextKeyword(context, " FOR KEY LOCK",
! 										 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
! 					break;
! 				case LCS_FORSHARE:
! 					appendContextKeyword(context, " FOR SHARE",
! 										 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
! 					break;
! 				case LCS_FORUPDATE:
! 					appendContextKeyword(context, " FOR UPDATE",
! 										 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
! 					break;
! 			}
! 
  			appendStringInfo(buf, " OF %s",
  							 quote_identifier(rte->eref->aliasname));
  			if (rc->noWait)
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 3608,3613 **** RelationGetIndexPredicate(Relation relation)
--- 3608,3615 ----
   * simple index keys, but attributes used in expressions and partial-index
   * predicates.)
   *
+  * If "unique" is true, only attributes of unique indexes are considered.
+  *
   * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
   * we can include system attributes (e.g., OID) in the bitmap representation.
   *
***************
*** 3615,3630 **** RelationGetIndexPredicate(Relation relation)
   * be bms_free'd when not needed anymore.
   */
  Bitmapset *
! RelationGetIndexAttrBitmap(Relation relation)
  {
  	Bitmapset  *indexattrs;
  	List	   *indexoidlist;
  	ListCell   *l;
  	MemoryContext oldcxt;
  
  	/* Quick exit if we already computed the result. */
  	if (relation->rd_indexattr != NULL)
! 		return bms_copy(relation->rd_indexattr);
  
  	/* Fast path if definitely no indexes */
  	if (!RelationGetForm(relation)->relhasindex)
--- 3617,3633 ----
   * be bms_free'd when not needed anymore.
   */
  Bitmapset *
! RelationGetIndexAttrBitmap(Relation relation, bool unique)
  {
  	Bitmapset  *indexattrs;
+ 	Bitmapset  *uindexattrs;
  	List	   *indexoidlist;
  	ListCell   *l;
  	MemoryContext oldcxt;
  
  	/* Quick exit if we already computed the result. */
  	if (relation->rd_indexattr != NULL)
! 		return bms_copy(unique ? relation->rd_uindexattr : relation->rd_indexattr);
  
  	/* Fast path if definitely no indexes */
  	if (!RelationGetForm(relation)->relhasindex)
***************
*** 3643,3648 **** RelationGetIndexAttrBitmap(Relation relation)
--- 3646,3652 ----
  	 * For each index, add referenced attributes to indexattrs.
  	 */
  	indexattrs = NULL;
+ 	uindexattrs = NULL;
  	foreach(l, indexoidlist)
  	{
  		Oid			indexOid = lfirst_oid(l);
***************
*** 3661,3675 **** RelationGetIndexAttrBitmap(Relation relation)
--- 3665,3688 ----
  			int			attrnum = indexInfo->ii_KeyAttrNumbers[i];
  
  			if (attrnum != 0)
+ 			{
  				indexattrs = bms_add_member(indexattrs,
  							   attrnum - FirstLowInvalidHeapAttributeNumber);
+ 				if (indexInfo->ii_Unique)
+ 					uindexattrs = bms_add_member(uindexattrs,
+ 											   	 attrnum - FirstLowInvalidHeapAttributeNumber);
+ 			}
  		}
  
  		/* Collect all attributes used in expressions, too */
  		pull_varattnos((Node *) indexInfo->ii_Expressions, &indexattrs);
+ 		if (indexInfo->ii_Unique)
+ 			pull_varattnos((Node *) indexInfo->ii_Expressions, &uindexattrs);
  
  		/* Collect all attributes in the index predicate, too */
  		pull_varattnos((Node *) indexInfo->ii_Predicate, &indexattrs);
+ 		if (indexInfo->ii_Unique)
+ 			pull_varattnos((Node *) indexInfo->ii_Predicate, &uindexattrs);
  
  		index_close(indexDesc, AccessShareLock);
  	}
***************
*** 3679,3688 **** RelationGetIndexAttrBitmap(Relation relation)
  	/* Now save a copy of the bitmap in the relcache entry. */
  	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
  	relation->rd_indexattr = bms_copy(indexattrs);
  	MemoryContextSwitchTo(oldcxt);
  
  	/* We return our original working copy for caller to play with */
! 	return indexattrs;
  }
  
  /*
--- 3692,3702 ----
  	/* Now save a copy of the bitmap in the relcache entry. */
  	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
  	relation->rd_indexattr = bms_copy(indexattrs);
+ 	relation->rd_uindexattr = bms_copy(uindexattrs);
  	MemoryContextSwitchTo(oldcxt);
  
  	/* We return our original working copy for caller to play with */
! 	return unique ? uindexattrs : indexattrs;
  }
  
  /*
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 33,38 **** typedef struct BulkInsertStateData *BulkInsertState;
--- 33,39 ----
  
  typedef enum
  {
+ 	LockTupleKeylock,
  	LockTupleShared,
  	LockTupleExclusive
  } LockTupleMode;
*** a/src/include/access/htup.h
--- b/src/include/access/htup.h
***************
*** 163,174 **** typedef HeapTupleHeaderData *HeapTupleHeader;
  #define HEAP_HASVARWIDTH		0x0002	/* has variable-width attribute(s) */
  #define HEAP_HASEXTERNAL		0x0004	/* has external stored attribute(s) */
  #define HEAP_HASOID				0x0008	/* has an object-id field */
! /* bit 0x0010 is available */
  #define HEAP_COMBOCID			0x0020	/* t_cid is a combo cid */
  #define HEAP_XMAX_EXCL_LOCK		0x0040	/* xmax is exclusive locker */
  #define HEAP_XMAX_SHARED_LOCK	0x0080	/* xmax is shared locker */
  /* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
! #define HEAP_IS_LOCKED	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK)
  #define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
  #define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
  #define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */
--- 163,177 ----
  #define HEAP_HASVARWIDTH		0x0002	/* has variable-width attribute(s) */
  #define HEAP_HASEXTERNAL		0x0004	/* has external stored attribute(s) */
  #define HEAP_HASOID				0x0008	/* has an object-id field */
! #define HEAP_XMAX_KEY_LOCK		0x0010	/* xmax is a "key" locker */
  #define HEAP_COMBOCID			0x0020	/* t_cid is a combo cid */
  #define HEAP_XMAX_EXCL_LOCK		0x0040	/* xmax is exclusive locker */
  #define HEAP_XMAX_SHARED_LOCK	0x0080	/* xmax is shared locker */
+ /* if either SHARE or KEY lock bit is set, this is a "shared" lock */
+ #define HEAP_IS_SHARE_LOCKED (HEAP_XMAX_SHARED_LOCK | HEAP_XMAX_KEY_LOCK)
  /* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
! #define HEAP_IS_LOCKED	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK | \
! 						 HEAP_XMAX_KEY_LOCK)
  #define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
  #define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
  #define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */
***************
*** 725,734 **** typedef struct xl_heap_lock
  	xl_heaptid	target;			/* locked tuple id */
  	TransactionId locking_xid;	/* might be a MultiXactId not xid */
  	bool		xid_is_mxact;	/* is it? */
! 	bool		shared_lock;	/* shared or exclusive row lock? */
  } xl_heap_lock;
  
! #define SizeOfHeapLock	(offsetof(xl_heap_lock, shared_lock) + sizeof(bool))
  
  /* This is what we need to know about in-place update */
  typedef struct xl_heap_inplace
--- 728,737 ----
  	xl_heaptid	target;			/* locked tuple id */
  	TransactionId locking_xid;	/* might be a MultiXactId not xid */
  	bool		xid_is_mxact;	/* is it? */
! 	char		lock_strength;	/* keylock, shared, exclusive lock? */
  } xl_heap_lock;
  
! #define SizeOfHeapLock	(offsetof(xl_heap_lock, lock_strength) + sizeof(char))
  
  /* This is what we need to know about in-place update */
  typedef struct xl_heap_inplace
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 404,410 **** typedef struct EState
   * ExecRowMark -
   *	   runtime representation of FOR UPDATE/SHARE clauses
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, we should have an
   * ExecRowMark for each non-target relation in the query (except inheritance
   * parent RTEs, which can be ignored at runtime).  See PlanRowMark for details
   * about most of the fields.  In addition to fields directly derived from
--- 404,410 ----
   * ExecRowMark -
   *	   runtime representation of FOR UPDATE/SHARE clauses
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE/KEY LOCK, we should have an
   * ExecRowMark for each non-target relation in the query (except inheritance
   * parent RTEs, which can be ignored at runtime).  See PlanRowMark for details
   * about most of the fields.  In addition to fields directly derived from
*** a/src/include/nodes/parsenodes.h
--- b/src/include/nodes/parsenodes.h
***************
*** 554,571 **** typedef struct DefElem
  } DefElem;
  
  /*
!  * LockingClause - raw representation of FOR UPDATE/SHARE options
   *
   * Note: lockedRels == NIL means "all relations in query".	Otherwise it
   * is a list of RangeVar nodes.  (We use RangeVar mainly because it carries
   * a location field --- currently, parse analysis insists on unqualified
   * names in LockingClause.)
   */
  typedef struct LockingClause
  {
  	NodeTag		type;
  	List	   *lockedRels;		/* FOR UPDATE or FOR SHARE relations */
! 	bool		forUpdate;		/* true = FOR UPDATE, false = FOR SHARE */
  	bool		noWait;			/* NOWAIT option */
  } LockingClause;
  
--- 554,579 ----
  } DefElem;
  
  /*
!  * LockingClause - raw representation of FOR UPDATE/SHARE/KEY LOCK options
   *
   * Note: lockedRels == NIL means "all relations in query".	Otherwise it
   * is a list of RangeVar nodes.  (We use RangeVar mainly because it carries
   * a location field --- currently, parse analysis insists on unqualified
   * names in LockingClause.)
   */
+ typedef enum LockClauseStrength
+ {
+ 	/* order is important -- see applyLockingClause */
+ 	LCS_FORKEYLOCK,
+ 	LCS_FORSHARE,
+ 	LCS_FORUPDATE
+ } LockClauseStrength;
+ 
  typedef struct LockingClause
  {
  	NodeTag		type;
  	List	   *lockedRels;		/* FOR UPDATE or FOR SHARE relations */
! 	LockClauseStrength strength;
  	bool		noWait;			/* NOWAIT option */
  } LockingClause;
  
***************
*** 839,856 **** typedef struct WindowClause
   *	   parser output representation of FOR UPDATE/SHARE clauses
   *
   * Query.rowMarks contains a separate RowMarkClause node for each relation
!  * identified as a FOR UPDATE/SHARE target.  If FOR UPDATE/SHARE is applied
!  * to a subquery, we generate RowMarkClauses for all normal and subquery rels
!  * in the subquery, but they are marked pushedDown = true to distinguish them
!  * from clauses that were explicitly written at this query level.  Also,
!  * Query.hasForUpdate tells whether there were explicit FOR UPDATE/SHARE
!  * clauses in the current query level.
   */
  typedef struct RowMarkClause
  {
  	NodeTag		type;
  	Index		rti;			/* range table index of target relation */
! 	bool		forUpdate;		/* true = FOR UPDATE, false = FOR SHARE */
  	bool		noWait;			/* NOWAIT option */
  	bool		pushedDown;		/* pushed down from higher query level? */
  } RowMarkClause;
--- 847,864 ----
   *	   parser output representation of FOR UPDATE/SHARE clauses
   *
   * Query.rowMarks contains a separate RowMarkClause node for each relation
!  * identified as a FOR UPDATE/SHARE/KEY LOCK target.  If one of these clauses
!  * is applied to a subquery, we generate RowMarkClauses for all normal and
!  * subquery rels in the subquery, but they are marked pushedDown = true to
!  * distinguish them from clauses that were explicitly written at this query
!  * level.  Also, Query.hasForUpdate tells whether there were explicit FOR
!  * UPDATE/SHARE clauses in the current query level.
   */
  typedef struct RowMarkClause
  {
  	NodeTag		type;
  	Index		rti;			/* range table index of target relation */
! 	LockClauseStrength strength;
  	bool		noWait;			/* NOWAIT option */
  	bool		pushedDown;		/* pushed down from higher query level? */
  } RowMarkClause;
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 706,712 **** typedef struct Limit
   * RowMarkType -
   *	  enums for types of row-marking operations
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, we have to uniquely
   * identify all the source rows, not only those from the target relations, so
   * that we can perform EvalPlanQual rechecking at need.  For plain tables we
   * can just fetch the TID, the same as for a target relation.  Otherwise (for
--- 706,712 ----
   * RowMarkType -
   *	  enums for types of row-marking operations
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE/KEY LOCK, we have to uniquely
   * identify all the source rows, not only those from the target relations, so
   * that we can perform EvalPlanQual rechecking at need.  For plain tables we
   * can just fetch the TID, the same as for a target relation.  Otherwise (for
***************
*** 718,736 **** typedef enum RowMarkType
  {
  	ROW_MARK_EXCLUSIVE,			/* obtain exclusive tuple lock */
  	ROW_MARK_SHARE,				/* obtain shared tuple lock */
  	ROW_MARK_REFERENCE,			/* just fetch the TID */
  	ROW_MARK_COPY				/* physically copy the row value */
  } RowMarkType;
  
! #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_SHARE)
  
  /*
   * PlanRowMark -
   *	   plan-time representation of FOR UPDATE/SHARE clauses
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, we create a separate
   * PlanRowMark node for each non-target relation in the query.	Relations that
!  * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE (if
   * real tables) or ROW_MARK_COPY (if not).
   *
   * Initially all PlanRowMarks have rti == prti and isParent == false.
--- 718,737 ----
  {
  	ROW_MARK_EXCLUSIVE,			/* obtain exclusive tuple lock */
  	ROW_MARK_SHARE,				/* obtain shared tuple lock */
+ 	ROW_MARK_KEYLOCK,			/* obtain keylock tuple lock */
  	ROW_MARK_REFERENCE,			/* just fetch the TID */
  	ROW_MARK_COPY				/* physically copy the row value */
  } RowMarkType;
  
! #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_KEYLOCK)
  
  /*
   * PlanRowMark -
   *	   plan-time representation of FOR UPDATE/SHARE clauses
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE/KEY LOCK, we create a separate
   * PlanRowMark node for each non-target relation in the query.	Relations that
!  * are not specified as FOR UPDATE/SHARE/KEY LOCK are marked ROW_MARK_REFERENCE (if
   * real tables) or ROW_MARK_COPY (if not).
   *
   * Initially all PlanRowMarks have rti == prti and isParent == false.
*** a/src/include/parser/analyze.h
--- b/src/include/parser/analyze.h
***************
*** 31,36 **** extern bool analyze_requires_snapshot(Node *parseTree);
  
  extern void CheckSelectLocking(Query *qry);
  extern void applyLockingClause(Query *qry, Index rtindex,
! 				   bool forUpdate, bool noWait, bool pushedDown);
  
  #endif   /* ANALYZE_H */
--- 31,36 ----
  
  extern void CheckSelectLocking(Query *qry);
  extern void applyLockingClause(Query *qry, Index rtindex,
! 				   LockClauseStrength strength, bool noWait, bool pushedDown);
  
  #endif   /* ANALYZE_H */
*** a/src/include/utils/rel.h
--- b/src/include/utils/rel.h
***************
*** 156,161 **** typedef struct RelationData
--- 156,162 ----
  	Oid			rd_id;			/* relation's object id */
  	List	   *rd_indexlist;	/* list of OIDs of indexes on relation */
  	Bitmapset  *rd_indexattr;	/* identifies columns used in indexes */
+ 	Bitmapset  *rd_uindexattr;	/* identifies columns used in unique indexes */
  	Oid			rd_oidindex;	/* OID of unique index on OID, if any */
  	LockInfoData rd_lockInfo;	/* lock mgr's info for locking relation */
  	RuleLock   *rd_rules;		/* rewrite rules */
*** a/src/include/utils/relcache.h
--- b/src/include/utils/relcache.h
***************
*** 42,48 **** extern List *RelationGetIndexList(Relation relation);
  extern Oid	RelationGetOidIndex(Relation relation);
  extern List *RelationGetIndexExpressions(Relation relation);
  extern List *RelationGetIndexPredicate(Relation relation);
! extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation);
  extern void RelationGetExclusionInfo(Relation indexRelation,
  						 Oid **operators,
  						 Oid **procs,
--- 42,48 ----
  extern Oid	RelationGetOidIndex(Relation relation);
  extern List *RelationGetIndexExpressions(Relation relation);
  extern List *RelationGetIndexPredicate(Relation relation);
! extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation, bool unique);
  extern void RelationGetExclusionInfo(Relation indexRelation,
  						 Oid **operators,
  						 Oid **procs,

#10

alvherre@commandprompt.com

almost 15 years ago

In reply to: Noah Misch (#9)

Re: FOR KEY LOCK foreign keys

Excerpts from Noah Misch's message of vie feb 11 04:13:22 -0300 2011:

Hello,

First, thanks for the very thorough review.

On Thu, Jan 13, 2011 at 06:58:09PM -0300, Alvaro Herrera wrote:

Incidentally, HeapTupleSatisfiesMVCC has some bits of code like this (not new):

/* MultiXacts are currently only allowed to lock tuples */
Assert(tuple->t_infomask & HEAP_IS_LOCKED);

They're specifically only allowed for SHARE and KEY locks, right?
heap_lock_tuple seems to assume as much.

Yeah, since FOR UPDATE acquires an exclusive lock on the tuple, you
can't have a multixact there. Maybe we can make the assert more
specific; I'll have a look.

[ test case with funny visibility behavior ]

Looking into the visibility bug.

I published about this here:
http://commandprompt.com/blogs/alvaro_herrera/2010/11/fixing_foreign_key_deadlocks_part_2/

So, as a rough design,

1. Create a new SELECT locking clause. For now, we're calling it SELECT FOR KEY LOCK
2. This will acquire a new type of lock in the tuple, dubbed a "keylock".
3. This lock will conflict with DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE.

It does not conflict with SELECT FOR SHARE, does it?

It doesn't; I think I copied old text there. (I had originally thought
that they would conflict, but I had to change that due to implementation
restrictions).

The odd thing here is the checking of an outside condition to decide whether
locks conflict. Normally, to get a different conflict list, we add another lock
type. What about this?

FOR KEY SHARE conflicts with FOR KEY UPDATE
FOR SHARE conflicts with FOR KEY UPDATE, FOR UPDATE
FOR UPDATE conflicts with FOR KEY UPDATE, FOR UPDATE, FOR SHARE
FOR KEY UPDATE conflicts with FOR KEY UPDATE, FOR UPDATE, FOR SHARE, FOR KEY SHARE

Hmm, let me see about this.

3. The original tuple needs to be marked with the Cmax of the locking
command, to prevent it from being seen in the same transaction.

Could you elaborate on this requirement?

Consider an open cursor with a snapshot prior to the lock. If we leave
the old tuple as is, the cursor would see that old tuple as visible.
But the locked copy of the tuple is also visible, because the Cmax is
just a locker, not an updater.

4. A non-conflicting update to the tuple must carry forward some fields
from the original tuple into the updated copy. Those include Xmax,
XMAX_IS_MULTI, XMAX_KEY_LOCK, and the CommandId and COMBO_CID flag.

HeapTupleHeaderGetCmax() has this assertion:

/* We do not store cmax when locking a tuple */
Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED)));

Assuming that assertion is still valid, there will never be a HEAP_COMBOCID flag
to copy. Right?

Hmm, I think the assert is wrong, but I'm still paging in the details of
the patch after being away from it for so long. Let me think more about it.

[ Lots more stuff ]

I'll give careful consideration to all this.

Thanks again for the detailed review.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#11

noah@leadboat.com

almost 15 years ago

In reply to: Alvaro Herrera (#10)

Re: FOR KEY LOCK foreign keys

On Fri, Feb 11, 2011 at 02:15:20PM -0300, Alvaro Herrera wrote:

Excerpts from Noah Misch's message of vie feb 11 04:13:22 -0300 2011:

On Thu, Jan 13, 2011 at 06:58:09PM -0300, Alvaro Herrera wrote:

3. The original tuple needs to be marked with the Cmax of the locking
command, to prevent it from being seen in the same transaction.

Could you elaborate on this requirement?

Consider an open cursor with a snapshot prior to the lock. If we leave
the old tuple as is, the cursor would see that old tuple as visible.
But the locked copy of the tuple is also visible, because the Cmax is
just a locker, not an updater.

Thanks. Today, a lock operation leaves t_cid unchanged, and an update fills its
own cid into Cmax of the old tuple and Cmin of the new tuple. So, the cursor
would only see the old tuple. What will make that no longer sufficient?

#12

alvherre@commandprompt.com

almost 15 years ago

In reply to: Noah Misch (#9)

Re: FOR KEY LOCK foreign keys

Excerpts from Noah Misch's message of vie feb 11 04:13:22 -0300 2011:

I observe visibility breakage with this test case:

[ ... ]

The problem seems to be that funny t_cid (2249). Tracing through heap_update,
the new code is not setting t_cid during this test case.

So I can fix this problem by simply adding a call to
HeapTupleHeaderSetCmin when the stuff about ComboCid does not hold, but
seeing that screenful plus the subsequent call to
HeapTupleHeaderAdjustCmax feels wrong. I think this needs to be
rethought ...

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#13

Marti Raudsepp

marti@juffo.org

almost 15 years ago

In reply to: Noah Misch (#9)

1 attachment(s)

Re: FOR KEY LOCK foreign keys

On Fri, Feb 11, 2011 at 09:13, Noah Misch <noah@leadboat.com> wrote:

The patch had a trivial conflict in planner.c, plus plenty of offsets. I've
attached the rebased patch that I used for review. For anyone following along,
all the interesting hunks touch heapam.c; the rest is largely mechanical. A
"diff -w" patch is also considerably easier to follow.

Here's a simple patch for the RelationGetIndexAttrBitmap() function,
as explained in my last post. I don't know if it's any help to you,
but since I wrote it I might as well send it up. This applies on top
of Noah's rebased patch.

I did some tests and it seems to work, although I also hit the same
visibility bug as Noah.

Test case I used:

THREAD A:
create table foo (pk int primary key, ak int);
create unique index on foo (ak) where ak != 0;
create unique index on foo ((-ak));

create table bar (foo_pk int references foo (pk));
insert into foo values(1,1);
begin; insert into bar values(1);

THREAD B:
begin; update foo set ak=2 where ak=1;

Regards,
Marti

Attachments:

0001-Only-acquire-KEY-LOCK-for-colums-that-can-be-referen.patchtext/x-patch; charset=US-ASCII; name=0001-Only-acquire-KEY-LOCK-for-colums-that-can-be-referen.patchDownload

From e069cef91c686aa87e220336198267e5a5a2aeac Mon Sep 17 00:00:00 2001
From: Marti Raudsepp <marti@juffo.org>
Date: Tue, 15 Feb 2011 00:33:35 +0200
Subject: [PATCH] Only acquire KEY LOCK for colums that can be referenced by foreign keys

Don't consider columns in unique indexes that have expressions or WHERE
predicates.
---
 src/backend/utils/cache/relcache.c |   23 +++++++++++++----------
 src/include/utils/rel.h            |    2 +-
 src/include/utils/relcache.h       |    2 +-
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 4d37e8e..5119288 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3608,7 +3608,8 @@ RelationGetIndexPredicate(Relation relation)
  * simple index keys, but attributes used in expressions and partial-index
  * predicates.)
  *
- * If "unique" is true, only attributes of unique indexes are considered.
+ * If "keyAttrs" is true, only attributes that can be referenced by foreign
+ * keys are considered.
  *
  * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
  * we can include system attributes (e.g., OID) in the bitmap representation.
@@ -3617,7 +3618,7 @@ RelationGetIndexPredicate(Relation relation)
  * be bms_free'd when not needed anymore.
  */
 Bitmapset *
-RelationGetIndexAttrBitmap(Relation relation, bool unique)
+RelationGetIndexAttrBitmap(Relation relation, bool keyAttrs)
 {
 	Bitmapset  *indexattrs;
 	Bitmapset  *uindexattrs;
@@ -3627,7 +3628,7 @@ RelationGetIndexAttrBitmap(Relation relation, bool unique)
 
 	/* Quick exit if we already computed the result. */
 	if (relation->rd_indexattr != NULL)
-		return bms_copy(unique ? relation->rd_uindexattr : relation->rd_indexattr);
+		return bms_copy(keyAttrs ? relation->rd_keyattr : relation->rd_indexattr);
 
 	/* Fast path if definitely no indexes */
 	if (!RelationGetForm(relation)->relhasindex)
@@ -3653,12 +3654,18 @@ RelationGetIndexAttrBitmap(Relation relation, bool unique)
 		Relation	indexDesc;
 		IndexInfo  *indexInfo;
 		int			i;
+		bool		isKey;
 
 		indexDesc = index_open(indexOid, AccessShareLock);
 
 		/* Extract index key information from the index's pg_index row */
 		indexInfo = BuildIndexInfo(indexDesc);
 
+		/* Can this index be referenced by a foreign key? */
+		isKey = indexInfo->ii_Unique &&
+				indexInfo->ii_Expressions == NIL &&
+				indexInfo->ii_Predicate == NIL;
+
 		/* Collect simple attribute references */
 		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
 		{
@@ -3668,7 +3675,7 @@ RelationGetIndexAttrBitmap(Relation relation, bool unique)
 			{
 				indexattrs = bms_add_member(indexattrs,
 							   attrnum - FirstLowInvalidHeapAttributeNumber);
-				if (indexInfo->ii_Unique)
+				if (isKey)
 					uindexattrs = bms_add_member(uindexattrs,
 											   	 attrnum - FirstLowInvalidHeapAttributeNumber);
 			}
@@ -3676,13 +3683,9 @@ RelationGetIndexAttrBitmap(Relation relation, bool unique)
 
 		/* Collect all attributes used in expressions, too */
 		pull_varattnos((Node *) indexInfo->ii_Expressions, &indexattrs);
-		if (indexInfo->ii_Unique)
-			pull_varattnos((Node *) indexInfo->ii_Expressions, &uindexattrs);
 
 		/* Collect all attributes in the index predicate, too */
 		pull_varattnos((Node *) indexInfo->ii_Predicate, &indexattrs);
-		if (indexInfo->ii_Unique)
-			pull_varattnos((Node *) indexInfo->ii_Predicate, &uindexattrs);
 
 		index_close(indexDesc, AccessShareLock);
 	}
@@ -3692,11 +3695,11 @@ RelationGetIndexAttrBitmap(Relation relation, bool unique)
 	/* Now save a copy of the bitmap in the relcache entry. */
 	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 	relation->rd_indexattr = bms_copy(indexattrs);
-	relation->rd_uindexattr = bms_copy(uindexattrs);
+	relation->rd_keyattr = bms_copy(uindexattrs);
 	MemoryContextSwitchTo(oldcxt);
 
 	/* We return our original working copy for caller to play with */
-	return unique ? uindexattrs : indexattrs;
+	return keyAttrs ? uindexattrs : indexattrs;
 }
 
 /*
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 2251b25..9b70d81 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -156,7 +156,7 @@ typedef struct RelationData
 	Oid			rd_id;			/* relation's object id */
 	List	   *rd_indexlist;	/* list of OIDs of indexes on relation */
 	Bitmapset  *rd_indexattr;	/* identifies columns used in indexes */
-	Bitmapset  *rd_uindexattr;	/* identifies columns used in unique indexes */
+	Bitmapset  *rd_keyattr;		/* cols that can be ref'd by foreign keys */
 	Oid			rd_oidindex;	/* OID of unique index on OID, if any */
 	LockInfoData rd_lockInfo;	/* lock mgr's info for locking relation */
 	RuleLock   *rd_rules;		/* rewrite rules */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 6d1e64f..d4a09e3 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -42,7 +42,7 @@ extern List *RelationGetIndexList(Relation relation);
 extern Oid	RelationGetOidIndex(Relation relation);
 extern List *RelationGetIndexExpressions(Relation relation);
 extern List *RelationGetIndexPredicate(Relation relation);
-extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation, bool unique);
+extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation, bool keyAttrs);
 extern void RelationGetExclusionInfo(Relation indexRelation,
 						 Oid **operators,
 						 Oid **procs,
-- 
1.7.4

#14

alvherre@commandprompt.com

almost 15 years ago

In reply to: Marti Raudsepp (#13)

1 attachment(s)

Re: FOR KEY LOCK foreign keys

Excerpts from Marti Raudsepp's message of lun feb 14 19:39:25 -0300 2011:

On Fri, Feb 11, 2011 at 09:13, Noah Misch <noah@leadboat.com> wrote:

The patch had a trivial conflict in planner.c, plus plenty of offsets. I've
attached the rebased patch that I used for review. For anyone following along,
all the interesting hunks touch heapam.c; the rest is largely mechanical. A
"diff -w" patch is also considerably easier to follow.

Here's a simple patch for the RelationGetIndexAttrBitmap() function,
as explained in my last post. I don't know if it's any help to you,
but since I wrote it I might as well send it up. This applies on top
of Noah's rebased patch.

Got it, thanks.

I did some tests and it seems to work, although I also hit the same
visibility bug as Noah.

Yeah, that bug is fixed with the attached, though I am rethinking this
bit.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachments:

0001-Fix-visibility-bug-and-poorly-worded-comment.patchapplication/octet-stream; name=0001-Fix-visibility-bug-and-poorly-worded-comment.patchDownload

From 04298459b514495a8f1ef269b7a43c2ff3a50710 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Mon, 14 Feb 2011 14:48:40 -0300
Subject: [PATCH 1/2] Fix visibility bug and poorly worded comment

per Noah Misch
---
 src/backend/access/heap/heapam.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7515dc8..5d5ccbf 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2644,10 +2644,11 @@ l2:
 		newtup->t_data->t_infomask |= (oldtup.t_data->t_infomask & 
 									   (HEAP_XMAX_IS_MULTI |
 										HEAP_XMAX_KEY_LOCK));
+
 		/*
-		 * we also need to copy the combo CID stuff, but only if the original
-		 * tuple was created by us; otherwise the combocid module complains
-		 * (Alternatively we could use HeapTupleHeaderGetRawCommandId)
+		 * If the tuple was created in this transaction, and we're going to
+		 * delete it, then it must have a combo-cid, which we need to preserve.
+		 * Otherwise, just use the passed cid.
 		 */
 		if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(oldtup.t_data)))
 		{
@@ -2656,6 +2657,8 @@ l2:
 			HeapTupleHeaderSetCmin(newtup->t_data,
 								   HeapTupleHeaderGetCmin(oldtup.t_data));
 		}
+		else
+			HeapTupleHeaderSetCmin(newtup->t_data, cid);
 
 	}
 	else
-- 
1.7.2.3

#15

robertmhaas@gmail.com

almost 15 years ago

In reply to: Alvaro Herrera (#14)

Re: FOR KEY LOCK foreign keys

On Mon, Feb 14, 2011 at 6:49 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Excerpts from Marti Raudsepp's message of lun feb 14 19:39:25 -0300 2011:

On Fri, Feb 11, 2011 at 09:13, Noah Misch <noah@leadboat.com> wrote:

The patch had a trivial conflict in planner.c, plus plenty of offsets. I've
attached the rebased patch that I used for review. For anyone following along,
all the interesting hunks touch heapam.c; the rest is largely mechanical. A
"diff -w" patch is also considerably easier to follow.

Here's a simple patch for the RelationGetIndexAttrBitmap() function,
as explained in my last post. I don't know if it's any help to you,
but since I wrote it I might as well send it up. This applies on top
of Noah's rebased patch.

Got it, thanks.

I did some tests and it seems to work, although I also hit the same
visibility bug as Noah.

Yeah, that bug is fixed with the attached, though I am rethinking this
bit.

I am thinking that the statute of limitations has expired on this
patch, and that we should mark it Returned with Feedback and continue
working on it for 9.2. I know it's a valuable feature, but I think
we're out of time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16

David E. Wheeler

david@kineticode.com

almost 15 years ago

In reply to: Robert Haas (#15)

Re: FOR KEY LOCK foreign keys

On Feb 15, 2011, at 1:15 PM, Robert Haas wrote:

Yeah, that bug is fixed with the attached, though I am rethinking this
bit.

I am thinking that the statute of limitations has expired on this
patch, and that we should mark it Returned with Feedback and continue
working on it for 9.2. I know it's a valuable feature, but I think
we're out of time.

How is such a determination made, exactly?

Best,

David

#17

alvherre@commandprompt.com

almost 15 years ago

In reply to: Robert Haas (#15)

Re: FOR KEY LOCK foreign keys

Excerpts from Robert Haas's message of mar feb 15 18:15:38 -0300 2011:

I am thinking that the statute of limitations has expired on this
patch, and that we should mark it Returned with Feedback and continue
working on it for 9.2. I know it's a valuable feature, but I think
we're out of time.

Okay, I've marked it as such in the commitfest app. It'll be in 9.2's
first commitfest.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#18

Josh Berkus

josh@agliodbs.com

almost 15 years ago

In reply to: David E. Wheeler (#16)

Re: FOR KEY LOCK foreign keys

How is such a determination made, exactly?

It's Feb 15th, and portions of the patch need a rework according to the
author. I'm with Robert on this one.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#19

noah@leadboat.com

almost 15 years ago

In reply to: Noah Misch (#9)

1 attachment(s)

Re: FOR KEY LOCK foreign keys

On Fri, Feb 11, 2011 at 02:13:22AM -0500, Noah Misch wrote:

Automated tests would go a long way toward building confidence that this patch
does the right thing. Thanks to the SSI patch, we now have an in-tree test
framework for testing interleaved transactions. The only thing it needs to be
suitable for this work is a way to handle blocked commands. If you like, I can
try to whip something up for that.

[off-list ACK followed]

Here's a patch implementing that. It applies to master, with or without your
KEY LOCK patch also applied, though the expected outputs reflect the
improvements from your patch. I add three isolation test specs:

fk-contention: blocking-only test case from your blog post
fk-deadlock: the deadlocking test case I used during patch review
fk-deadlock2: Joel Jacobson's deadlocking test case

When a spec permutation would have us run a command in a currently-blocked
session, we cannot implement that permutation. Such permutations represent
impossible real-world scenarios, anyway. For now, I just explicitly name the
valid permutations in each spec file. If the test harness detects this problem,
we abort the current test spec. It might be nicer to instead cancel all
outstanding queries, issue rollbacks in all sessions, and continue with other
permutations. I hesitated to do that, because we currently leave all
transaction control in the hands of the test spec.

I only support one waiting command at a time. As long as one commands continues
to wait, I run other commands to completion synchronously. This decision has no
impact on the current test specs, which all have two sessions. It avoided a
touchy policy decision concerning deadlock detection. If two commands have
blocked, it may be that a third command needs to run before they will unblock,
or it may be that the two commands have formed a deadlock. We won't know for
sure until deadlock_timeout elapses. If it's possible to run the next step in
the permutation (i.e., it uses a different session from any blocked command), we
can either do so immediately or wait out the deadlock_timeout first. The latter
slows the test suite, but it makes the output more natural -- more like what one
would typically after running the commands by hand. If anyone can think of a
sound general policy, that would be helpful. For now, I've punted.

With a default postgresql.conf, deadlock_timeout constitutes most of the run
time. Reduce it to 20ms to accelerate things when running the tests repeatedly.

Since timing dictates which query participating in a deadlock will be chosen for
cancellation, the expected outputs bearing deadlock errors are unstable. I'm
not sure how much it will come up in practice, so I have not included expected
output variations to address this.

I think this will work on Windows as well as pgbench does, but I haven't
verified that.

Sorry for the delay on this.

Attachments:

fklocks-tests-v1.patchtext/plain; charset=us-asciiDownload

*** /dev/null
--- b/src/test/isolation/expected/fk-contention.out
***************
*** 0 ****
--- 1,16 ----
+ Parsed test spec with 2 sessions
+ 
+ starting permutation: ins com upd
+ step ins:  INSERT INTO bar VALUES (42); 
+ step com:  COMMIT; 
+ step upd:  UPDATE foo SET b = 'Hello World'; 
+ 
+ starting permutation: ins upd com
+ step ins:  INSERT INTO bar VALUES (42); 
+ step upd:  UPDATE foo SET b = 'Hello World'; 
+ step com:  COMMIT; 
+ 
+ starting permutation: upd ins com
+ step upd:  UPDATE foo SET b = 'Hello World'; 
+ step ins:  INSERT INTO bar VALUES (42); 
+ step com:  COMMIT; 
*** /dev/null
--- b/src/test/isolation/expected/fk-deadlock.out
***************
*** 0 ****
--- 1,63 ----
+ Parsed test spec with 2 sessions
+ 
+ starting permutation: s1i s1u s1c s2i s2u s2c
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s1u:  UPDATE parent SET aux = 'bar'; 
+ step s1c:  COMMIT; 
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s2u:  UPDATE parent SET aux = 'baz'; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1i s1u s2i s1c s2u s2c
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s1u:  UPDATE parent SET aux = 'bar'; 
+ step s2i:  INSERT INTO child VALUES (2, 1);  <waiting ...>
+ step s1c:  COMMIT; 
+ step s2i: <... completed>
+ step s2u:  UPDATE parent SET aux = 'baz'; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1i s2i s1u s2u s1c s2c
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s1u:  UPDATE parent SET aux = 'bar'; 
+ step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
+ step s1c:  COMMIT; 
+ step s2u: <... completed>
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1i s2i s2u s1u s2c s1c
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s2u:  UPDATE parent SET aux = 'baz'; 
+ step s1u:  UPDATE parent SET aux = 'bar';  <waiting ...>
+ step s2c:  COMMIT; 
+ step s1u: <... completed>
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2i s1i s1u s2u s1c s2c
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s1u:  UPDATE parent SET aux = 'bar'; 
+ step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
+ step s1c:  COMMIT; 
+ step s2u: <... completed>
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s2i s1i s2u s1u s2c s1c
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s2u:  UPDATE parent SET aux = 'baz'; 
+ step s1u:  UPDATE parent SET aux = 'bar';  <waiting ...>
+ step s2c:  COMMIT; 
+ step s1u: <... completed>
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2i s2u s1i s2c s1u s1c
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s2u:  UPDATE parent SET aux = 'baz'; 
+ step s1i:  INSERT INTO child VALUES (1, 1);  <waiting ...>
+ step s2c:  COMMIT; 
+ step s1i: <... completed>
+ step s1u:  UPDATE parent SET aux = 'bar'; 
+ step s1c:  COMMIT; 
*** /dev/null
--- b/src/test/isolation/expected/fk-deadlock2.out
***************
*** 0 ****
--- 1,106 ----
+ Parsed test spec with 2 sessions
+ 
+ starting permutation: s1u1 s1u2 s1c s2u1 s2u2 s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1c:  COMMIT; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s1u2 s2u1 s1c s2u2 s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1c:  COMMIT; 
+ step s2u1: <... completed>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s1u2 s2u2 s1c s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s1u2 s2u2 s2c s1c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s2u2 s1u2 s1c s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2: <... completed>
+ ERROR:  deadlock detected
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s2u2 s1u2 s2c s1c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2: <... completed>
+ ERROR:  deadlock detected
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s1u2 s2u2 s1c s2c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s1u2 s2u2 s2c s1c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s2u2 s1u2 s1c s2c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2: <... completed>
+ ERROR:  deadlock detected
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s2u2 s1u2 s2c s1c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2: <... completed>
+ ERROR:  deadlock detected
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2u1 s2u2 s1u1 s2c s1u2 s1c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2c:  COMMIT; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1c:  COMMIT; 
*** a/src/test/isolation/isolation_schedule
--- b/src/test/isolation/isolation_schedule
***************
*** 9,11 **** test: ri-trigger
--- 9,14 ----
  test: partial-index
  test: two-ids
  test: multiple-row-versions
+ test: fk-contention
+ test: fk-deadlock
+ test: fk-deadlock2
*** a/src/test/isolation/isolationtester.c
--- b/src/test/isolation/isolationtester.c
***************
*** 9,23 ****
  #include <windows.h>
  #endif
  
  #include <stddef.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
- #include "libpq-fe.h"
  
  #include "isolationtester.h"
  
  static PGconn  **conns = NULL;
  static int	   nconns = 0;
  
  static void run_all_permutations(TestSpec *testspec);
--- 9,36 ----
  #include <windows.h>
  #endif
  
+ #include <errno.h>
+ #include <unistd.h>
  #include <stddef.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  
+ #ifdef HAVE_SYS_SELECT_H
+ #include <sys/select.h>
+ #endif
+ 
+ #include "libpq-fe.h"
  #include "isolationtester.h"
  
+ #define PREP_WAITING "isolationtester_waiting"
+ 
+ /*
+  * conns[0] is the global setup, teardown, and watchdog connection.  Additional
+  * connections represent spec-defined sessions.
+  */
  static PGconn  **conns = NULL;
+ static const char **backend_ids = NULL;
  static int	   nconns = 0;
  
  static void run_all_permutations(TestSpec *testspec);
***************
*** 25,30 **** static void run_all_permutations_recurse(TestSpec *testspec, int nsteps, Step **
--- 38,47 ----
  static void run_named_permutations(TestSpec *testspec);
  static void run_permutation(TestSpec *testspec, int nsteps, Step **steps);
  
+ #define STEP_NONBLOCK	0x1	/* return 0 as soon as cmd waits for a lock */
+ #define STEP_RETRY		0x2	/* this is a retry of a previously-waiting cmd */
+ static int try_complete_step(Step *step, int flags);
+ 
  static int step_qsort_cmp(const void *a, const void *b);
  static int step_bsearch_cmp(const void *a, const void *b);
  
***************
*** 46,51 **** main(int argc, char **argv)
--- 63,69 ----
  	const char *conninfo;
  	TestSpec   *testspec;
  	int			i;
+ 	PGresult   *res;
  
  	/*
  	 * If the user supplies a parameter on the command line, use it as the
***************
*** 63,75 **** main(int argc, char **argv)
  	testspec = &parseresult;
  	printf("Parsed test spec with %d sessions\n", testspec->nsessions);
  
! 	/* Establish connections to the database, one for each session */
! 	nconns = testspec->nsessions;
  	conns = calloc(nconns, sizeof(PGconn *));
! 	for (i = 0; i < testspec->nsessions; i++)
  	{
- 		PGresult *res;
- 
  		conns[i] = PQconnectdb(conninfo);
  		if (PQstatus(conns[i]) != CONNECTION_OK)
  		{
--- 81,95 ----
  	testspec = &parseresult;
  	printf("Parsed test spec with %d sessions\n", testspec->nsessions);
  
! 	/*
! 	 * Establish connections to the database, one for each session and an extra
! 	 * for lock wait detection and global work.
! 	 */
! 	nconns = 1 + testspec->nsessions;
  	conns = calloc(nconns, sizeof(PGconn *));
! 	backend_ids = calloc(nconns, sizeof(*backend_ids));
! 	for (i = 0; i < nconns; i++)
  	{
  		conns[i] = PQconnectdb(conninfo);
  		if (PQstatus(conns[i]) != CONNECTION_OK)
  		{
***************
*** 89,94 **** main(int argc, char **argv)
--- 109,136 ----
  			exit_nicely();
  		}
  		PQclear(res);
+ 
+ 		/* Get the backend ID for lock wait checking. */
+ 		res = PQexec(conns[i], "SELECT i FROM pg_stat_get_backend_idset() t(i) "
+ 					 "WHERE pg_stat_get_backend_pid(i) = pg_backend_pid()");
+ 		if (PQresultStatus(res) == PGRES_TUPLES_OK)
+ 		{
+ 			if (PQntuples(res) == 1 && PQnfields(res) == 1)
+ 				backend_ids[i] = strdup(PQgetvalue(res, 0, 0));
+ 			else
+ 			{
+ 				fprintf(stderr, "backend id query returned %d rows and %d columns, expected 1 row and 1 column",
+ 						PQntuples(res), PQnfields(res));
+ 				exit_nicely();
+ 			}
+ 		}
+ 		else
+ 		{
+ 			fprintf(stderr, "backend id query failed: %s",
+ 					PQerrorMessage(conns[i]));
+ 			exit_nicely();
+ 		}
+ 		PQclear(res);
  	}
  
  	/* Set the session index fields in steps. */
***************
*** 100,105 **** main(int argc, char **argv)
--- 142,157 ----
  			session->steps[stepindex]->session = i;
  	}
  
+ 	res = PQprepare(conns[0], PREP_WAITING,
+ 					"SELECT 1 WHERE pg_stat_get_backend_waiting($1)", 0, NULL);
+ 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ 	{
+ 		fprintf(stderr, "prepare of lock wait query failed: %s",
+ 				PQerrorMessage(conns[0]));
+ 		exit_nicely();
+ 	}
+ 	PQclear(res);
+ 
  	/*
  	 * Run the permutations specified in the spec, or all if none were
  	 * explicitly specified.
***************
*** 254,259 **** run_permutation(TestSpec *testspec, int nsteps, Step **steps)
--- 306,312 ----
  {
  	PGresult *res;
  	int i;
+ 	Step	   *waiting = NULL;
  
  	printf("\nstarting permutation:");
  	for (i = 0; i < nsteps; i++)
***************
*** 277,288 **** run_permutation(TestSpec *testspec, int nsteps, Step **steps)
  	{
  		if (testspec->sessions[i]->setupsql)
  		{
! 			res = PQexec(conns[i], testspec->sessions[i]->setupsql);
  			if (PQresultStatus(res) != PGRES_COMMAND_OK)
  			{
  				fprintf(stderr, "setup of session %s failed: %s",
  						testspec->sessions[i]->name,
! 						PQerrorMessage(conns[0]));
  				exit_nicely();
  			}
  			PQclear(res);
--- 330,341 ----
  	{
  		if (testspec->sessions[i]->setupsql)
  		{
! 			res = PQexec(conns[i+1], testspec->sessions[i]->setupsql);
  			if (PQresultStatus(res) != PGRES_COMMAND_OK)
  			{
  				fprintf(stderr, "setup of session %s failed: %s",
  						testspec->sessions[i]->name,
! 						PQerrorMessage(conns[i+1]));
  				exit_nicely();
  			}
  			PQclear(res);
***************
*** 293,334 **** run_permutation(TestSpec *testspec, int nsteps, Step **steps)
  	for (i = 0; i < nsteps; i++)
  	{
  		Step *step = steps[i];
- 		printf("step %s: %s\n", step->name, step->sql);
- 		res = PQexec(conns[step->session], step->sql);
  
! 		switch(PQresultStatus(res))
  		{
! 			case PGRES_COMMAND_OK:
! 				break;
! 
! 			case PGRES_TUPLES_OK:
! 				printResultSet(res);
! 				break;
  
! 			case PGRES_FATAL_ERROR:
! 				/* Detail may contain xid values, so just show primary. */
! 				printf("%s:  %s\n", PQresultErrorField(res, PG_DIAG_SEVERITY),
! 					   PQresultErrorField(res, PG_DIAG_MESSAGE_PRIMARY));
! 				break;
  
! 			default:
! 				printf("unexpected result status: %s\n",
! 					   PQresStatus(PQresultStatus(res)));
  		}
! 		PQclear(res);
  	}
  
  	/* Perform per-session teardown */
  	for (i = 0; i < testspec->nsessions; i++)
  	{
  		if (testspec->sessions[i]->teardownsql)
  		{
! 			res = PQexec(conns[i], testspec->sessions[i]->teardownsql);
  			if (PQresultStatus(res) != PGRES_COMMAND_OK)
  			{
  				fprintf(stderr, "teardown of session %s failed: %s",
  						testspec->sessions[i]->name,
! 						PQerrorMessage(conns[0]));
  				/* don't exit on teardown failure */
  			}
  			PQclear(res);
--- 346,387 ----
  	for (i = 0; i < nsteps; i++)
  	{
  		Step *step = steps[i];
  
! 		if (!PQsendQuery(conns[1 + step->session], step->sql))
  		{
! 			fprintf(stdout, "failed to send query: %s\n",
! 					PQerrorMessage(conns[1 + step->session]));
! 			exit_nicely();
! 		}
  
! 		if (waiting != NULL)
! 		{
! 			/* Some other step is already waiting: just block. */
! 			try_complete_step(step, 0);
  
! 			/* See if this step unblocked the waiting step. */
! 			if (try_complete_step(waiting, STEP_NONBLOCK | STEP_RETRY))
! 				waiting = NULL;
  		}
! 		else if (!try_complete_step(step, STEP_NONBLOCK))
! 			waiting = step;
  	}
  
+ 	/* Finish any waiting query. */
+ 	if (waiting != NULL)
+ 		try_complete_step(waiting, STEP_RETRY);
+ 
  	/* Perform per-session teardown */
  	for (i = 0; i < testspec->nsessions; i++)
  	{
  		if (testspec->sessions[i]->teardownsql)
  		{
! 			res = PQexec(conns[i+1], testspec->sessions[i]->teardownsql);
  			if (PQresultStatus(res) != PGRES_COMMAND_OK)
  			{
  				fprintf(stderr, "teardown of session %s failed: %s",
  						testspec->sessions[i]->name,
! 						PQerrorMessage(conns[i+1]));
  				/* don't exit on teardown failure */
  			}
  			PQclear(res);
***************
*** 350,355 **** run_permutation(TestSpec *testspec, int nsteps, Step **steps)
--- 403,507 ----
  	}
  }
  
+ /*
+  * Our caller already sent the query associated with this step.  Wait for it to
+  * either complete or (only when given the STEP_NONBLOCK flag) to block while
+  * waiting for a lock.  We assume that any lock wait will persist until we have
+  * executed additional steps in the permutation.  This is not fully robust -- a
+  * concurrent autovacuum could briefly take a lock with which we conflict.  The
+  * risk may be low enough to discount.
+  *
+  * When calling this function on behalf of a given step for a second or later
+  * time, pass the STEP_RETRY flag.  This only affects the messages printed.
+  *
+  * If the STEP_NONBLOCK flag was specified and the query is waiting to acquire a
+  * lock, returns 0.  Otherwise, returns 1.
+  */
+ static int
+ try_complete_step(Step *step, int flags)
+ {
+ 	PGconn	   *conn = conns[1 + step->session];
+ 	fd_set		read_set;
+ 	struct timeval timeout;
+ 	int			sock = PQsocket(conn);
+ 	int			ret;
+ 	PGresult   *res;
+ 
+ 	FD_ZERO(&read_set);
+ 
+ 	while (flags & STEP_NONBLOCK && PQisBusy(conn))
+ 	{
+ 		FD_SET(sock, &read_set);
+ 		timeout.tv_sec = 0;
+ 		timeout.tv_usec = 10000;	/* Check for lock waits every 10ms. */
+ 
+ 		ret = select(sock + 1, &read_set, NULL, NULL, &timeout);
+ 		if (ret < 0)	/* error in select() */
+ 		{
+ 			fprintf(stderr, "select failed: %s\n", strerror(errno));
+ 			exit_nicely();
+ 		}
+ 		else if (ret == 0)	/* select() timeout: check for lock wait */
+ 		{
+ 			int			ntuples;
+ 
+ 			res = PQexecPrepared(conns[0], PREP_WAITING, 1,
+ 								 &backend_ids[step->session + 1],
+ 								 NULL, NULL, 0);
+ 			if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ 			{
+ 				fprintf(stderr, "lock wait query failed: %s",
+ 						PQerrorMessage(conn));
+ 				exit_nicely();
+ 			}
+ 			ntuples = PQntuples(res);
+ 			PQclear(res);
+ 
+ 			if (ntuples >= 1)	/* waiting to acquire a lock */
+ 			{
+ 				if (!(flags & STEP_RETRY))
+ 					printf("step %s: %s <waiting ...>\n",
+ 						   step->name, step->sql);
+ 				return 0;
+ 			}
+ 			/* else, not waiting: give it more time */
+ 		}
+ 		else if (!PQconsumeInput(conn)) /* select(): data available */
+ 		{
+ 			fprintf(stderr, "PQconsumeInput failed: %s", PQerrorMessage(conn));
+ 			exit_nicely();
+ 		}
+ 	}
+ 
+ 	if (flags & STEP_RETRY)
+ 		printf("step %s: <... completed>\n", step->name);
+ 	else
+ 		printf("step %s: %s\n", step->name, step->sql);
+ 
+ 	while ((res = PQgetResult(conn)))
+ 	{
+ 		switch (PQresultStatus(res))
+ 		{
+ 			case PGRES_COMMAND_OK:
+ 				break;
+ 			case PGRES_TUPLES_OK:
+ 				printResultSet(res);
+ 				break;
+ 			case PGRES_FATAL_ERROR:
+ 				/* Detail may contain xid values, so just show primary. */
+ 				printf("%s:  %s\n", PQresultErrorField(res, PG_DIAG_SEVERITY),
+ 					   PQresultErrorField(res, PG_DIAG_MESSAGE_PRIMARY));
+ 				break;
+ 			default:
+ 				printf("unexpected result status: %s\n",
+ 					   PQresStatus(PQresultStatus(res)));
+ 		}
+ 		PQclear(res);
+ 	}
+ 
+ 	return 1;
+ }
+ 
  static void
  printResultSet(PGresult *res)
  {
*** /dev/null
--- b/src/test/isolation/specs/fk-contention.spec
***************
*** 0 ****
--- 1,19 ----
+ setup
+ {
+   CREATE TABLE foo (a int PRIMARY KEY, b text);
+   CREATE TABLE bar (a int NOT NULL REFERENCES foo);
+   INSERT INTO foo VALUES (42);
+ }
+ 
+ teardown
+ {
+   DROP TABLE foo, bar;
+ }
+ 
+ session "s1"
+ setup		{ BEGIN; }
+ step "ins"	{ INSERT INTO bar VALUES (42); }
+ step "com"	{ COMMIT; }
+ 
+ session "s2"
+ step "upd"	{ UPDATE foo SET b = 'Hello World'; }
*** /dev/null
--- b/src/test/isolation/specs/fk-deadlock.spec
***************
*** 0 ****
--- 1,54 ----
+ setup
+ {
+   CREATE TABLE parent (
+ 	parent_key	int		PRIMARY KEY,
+ 	aux			text	NOT NULL
+   );
+ 
+   CREATE TABLE child (
+ 	child_key	int		PRIMARY KEY,
+ 	parent_key	int		NOT NULL REFERENCES parent
+   );
+ 
+   INSERT INTO parent VALUES (1, 'foo');
+ }
+ 
+ teardown
+ {
+   DROP TABLE parent, child;
+ }
+ 
+ session "s1"
+ setup		{ BEGIN; }
+ step "s1i"	{ INSERT INTO child VALUES (1, 1); }
+ step "s1u"	{ UPDATE parent SET aux = 'bar'; }
+ step "s1c"	{ COMMIT; }
+ 
+ session "s2"
+ setup		{ BEGIN; }
+ step "s2i"	{ INSERT INTO child VALUES (2, 1); }
+ step "s2u"	{ UPDATE parent SET aux = 'baz'; }
+ step "s2c"	{ COMMIT; }
+ 
+ ## Most theoretical permutations require that a blocked session execute a
+ ## command, making them impossible in practice.
+ permutation "s1i" "s1u" "s1c" "s2i" "s2u" "s2c"
+ permutation "s1i" "s1u" "s2i" "s1c" "s2u" "s2c"
+ #permutation "s1i" "s1u" "s2i" "s2u" "s1c" "s2c"
+ #permutation "s1i" "s1u" "s2i" "s2u" "s2c" "s1c"
+ #permutation "s1i" "s2i" "s1u" "s1c" "s2u" "s2c"
+ permutation "s1i" "s2i" "s1u" "s2u" "s1c" "s2c"
+ #permutation "s1i" "s2i" "s1u" "s2u" "s2c" "s1c"
+ #permutation "s1i" "s2i" "s2u" "s1u" "s1c" "s2c"
+ permutation "s1i" "s2i" "s2u" "s1u" "s2c" "s1c"
+ #permutation "s1i" "s2i" "s2u" "s2c" "s1u" "s1c"
+ #permutation "s2i" "s1i" "s1u" "s1c" "s2u" "s2c"
+ permutation "s2i" "s1i" "s1u" "s2u" "s1c" "s2c"
+ #permutation "s2i" "s1i" "s1u" "s2u" "s2c" "s1c"
+ #permutation "s2i" "s1i" "s2u" "s1u" "s1c" "s2c"
+ permutation "s2i" "s1i" "s2u" "s1u" "s2c" "s1c"
+ #permutation "s2i" "s1i" "s2u" "s2c" "s1u" "s1c"
+ #permutation "s2i" "s2u" "s1i" "s1u" "s1c" "s2c"
+ #permutation "s2i" "s2u" "s1i" "s1u" "s2c" "s1c"
+ permutation "s2i" "s2u" "s1i" "s2c" "s1u" "s1c"
+ #permutation "s2i" "s2u" "s2c" "s1i" "s1u" "s1c"
*** /dev/null
--- b/src/test/isolation/specs/fk-deadlock2.spec
***************
*** 0 ****
--- 1,59 ----
+ setup
+ {
+   CREATE TABLE A (
+ 	AID integer not null,
+ 	Col1 integer,
+ 	PRIMARY KEY (AID)
+   );
+ 
+   CREATE TABLE B (
+ 	BID integer not null,
+ 	AID integer not null,
+ 	Col2 integer,
+ 	PRIMARY KEY (BID),
+ 	FOREIGN KEY (AID) REFERENCES A(AID)
+   );
+ 
+   INSERT INTO A (AID) VALUES (1);
+   INSERT INTO B (BID,AID) VALUES (2,1);
+ }
+ 
+ teardown
+ {
+   DROP TABLE a, b;
+ }
+ 
+ session "s1"
+ setup		{ BEGIN; }
+ step "s1u1"	{ UPDATE A SET Col1 = 1 WHERE AID = 1; }
+ step "s1u2"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
+ step "s1c"	{ COMMIT; }
+ 
+ session "s2"
+ setup		{ BEGIN; }
+ step "s2u1"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
+ step "s2u2"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
+ step "s2c"	{ COMMIT; }
+ 
+ ## Many theoretical permutations require that a blocked session execute a
+ ## command, making them impossible in practice.
+ permutation "s1u1" "s1u2" "s1c" "s2u1" "s2u2" "s2c"
+ permutation "s1u1" "s1u2" "s2u1" "s1c" "s2u2" "s2c"
+ #permutation "s1u1" "s1u2" "s2u1" "s2u2" "s1c" "s2c"
+ #permutation "s1u1" "s1u2" "s2u1" "s2u2" "s2c" "s1c"
+ #permutation "s1u1" "s2u1" "s1u2" "s1c" "s2u2" "s2c"
+ permutation "s1u1" "s2u1" "s1u2" "s2u2" "s1c" "s2c"
+ permutation "s1u1" "s2u1" "s1u2" "s2u2" "s2c" "s1c"
+ permutation "s1u1" "s2u1" "s2u2" "s1u2" "s1c" "s2c"
+ permutation "s1u1" "s2u1" "s2u2" "s1u2" "s2c" "s1c"
+ #permutation "s1u1" "s2u1" "s2u2" "s2c" "s1u2" "s1c"
+ #permutation "s2u1" "s1u1" "s1u2" "s1c" "s2u2" "s2c"
+ permutation "s2u1" "s1u1" "s1u2" "s2u2" "s1c" "s2c"
+ permutation "s2u1" "s1u1" "s1u2" "s2u2" "s2c" "s1c"
+ permutation "s2u1" "s1u1" "s2u2" "s1u2" "s1c" "s2c"
+ permutation "s2u1" "s1u1" "s2u2" "s1u2" "s2c" "s1c"
+ #permutation "s2u1" "s1u1" "s2u2" "s2c" "s1u2" "s1c"
+ #permutation "s2u1" "s2u2" "s1u1" "s1u2" "s1c" "s2c"
+ #permutation "s2u1" "s2u2" "s1u1" "s1u2" "s2c" "s1c"
+ permutation "s2u1" "s2u2" "s1u1" "s2c" "s1u2" "s1c"
+ #permutation "s2u1" "s2u2" "s2c" "s1u1" "s1u2" "s1c"

#20

Jesper Krogh

jesper@krogh.cc

over 14 years ago

In reply to: Noah Misch (#19)

Re: FOR KEY LOCK foreign keys

I hope this hasn't been forgotten. But I cant see it has been committed
or moved
into the commitfest process?

Jesper

Show quoted text

On 2011-03-11 16:51, Noah Misch wrote:

On Fri, Feb 11, 2011 at 02:13:22AM -0500, Noah Misch wrote:

Automated tests would go a long way toward building confidence that this patch
does the right thing. Thanks to the SSI patch, we now have an in-tree test
framework for testing interleaved transactions. The only thing it needs to be
suitable for this work is a way to handle blocked commands. If you like, I can
try to whip something up for that.

[off-list ACK followed]

Here's a patch implementing that. It applies to master, with or without your
KEY LOCK patch also applied, though the expected outputs reflect the
improvements from your patch. I add three isolation test specs:

fk-contention: blocking-only test case from your blog post
fk-deadlock: the deadlocking test case I used during patch review
fk-deadlock2: Joel Jacobson's deadlocking test case

When a spec permutation would have us run a command in a currently-blocked
session, we cannot implement that permutation. Such permutations represent
impossible real-world scenarios, anyway. For now, I just explicitly name the
valid permutations in each spec file. If the test harness detects this problem,
we abort the current test spec. It might be nicer to instead cancel all
outstanding queries, issue rollbacks in all sessions, and continue with other
permutations. I hesitated to do that, because we currently leave all
transaction control in the hands of the test spec.

I only support one waiting command at a time. As long as one commands continues
to wait, I run other commands to completion synchronously. This decision has no
impact on the current test specs, which all have two sessions. It avoided a
touchy policy decision concerning deadlock detection. If two commands have
blocked, it may be that a third command needs to run before they will unblock,
or it may be that the two commands have formed a deadlock. We won't know for
sure until deadlock_timeout elapses. If it's possible to run the next step in
the permutation (i.e., it uses a different session from any blocked command), we
can either do so immediately or wait out the deadlock_timeout first. The latter
slows the test suite, but it makes the output more natural -- more like what one
would typically after running the commands by hand. If anyone can think of a
sound general policy, that would be helpful. For now, I've punted.

With a default postgresql.conf, deadlock_timeout constitutes most of the run
time. Reduce it to 20ms to accelerate things when running the tests repeatedly.

Since timing dictates which query participating in a deadlock will be chosen for
cancellation, the expected outputs bearing deadlock errors are unstable. I'm
not sure how much it will come up in practice, so I have not included expected
output variations to address this.

I think this will work on Windows as well as pgbench does, but I haven't
verified that.

Sorry for the delay on this.

#21

noah@leadboat.com

over 14 years ago

In reply to: Jesper Krogh (#20)

Re: FOR KEY LOCK foreign keys

On Sun, Jun 19, 2011 at 06:30:41PM +0200, Jesper Krogh wrote:

I hope this hasn't been forgotten. But I cant see it has been committed
or moved
into the commitfest process?

If you're asking about that main patch for $SUBJECT rather than those
isolationtester changes specifically, I can't speak to the plans for it. I
wasn't planning to move the test suite work forward independent of the core
patch it serves, but we could do that if there's another application.

Thanks,
nm

Show quoted text

On 2011-03-11 16:51, Noah Misch wrote:

On Fri, Feb 11, 2011 at 02:13:22AM -0500, Noah Misch wrote:

Automated tests would go a long way toward building confidence that this patch
does the right thing. Thanks to the SSI patch, we now have an in-tree test
framework for testing interleaved transactions. The only thing it needs to be
suitable for this work is a way to handle blocked commands. If you like, I can
try to whip something up for that.

[off-list ACK followed]

Here's a patch implementing that. It applies to master, with or without your
KEY LOCK patch also applied, though the expected outputs reflect the
improvements from your patch. I add three isolation test specs:

fk-contention: blocking-only test case from your blog post
fk-deadlock: the deadlocking test case I used during patch review
fk-deadlock2: Joel Jacobson's deadlocking test case

#22

Jesper Krogh

jesper@krogh.cc

over 14 years ago

In reply to: Noah Misch (#21)

Re: FOR KEY LOCK foreign keys

On 2011-06-20 22:11, Noah Misch wrote:

On Sun, Jun 19, 2011 at 06:30:41PM +0200, Jesper Krogh wrote:

I hope this hasn't been forgotten. But I cant see it has been committed
or moved
into the commitfest process?

If you're asking about that main patch for $SUBJECT rather than those
isolationtester changes specifically, I can't speak to the plans for it. I
wasn't planning to move the test suite work forward independent of the core
patch it serves, but we could do that if there's another application.

Yes, I was actually asking about the main patch for foreign key locks.

Jesper
--
Jesper

#23

alvherre@commandprompt.com

over 14 years ago

In reply to: Noah Misch (#19)

Re: FOR KEY LOCK foreign keys

Excerpts from Noah Misch's message of vie mar 11 12:51:14 -0300 2011:

On Fri, Feb 11, 2011 at 02:13:22AM -0500, Noah Misch wrote:

Automated tests would go a long way toward building confidence that this patch
does the right thing. Thanks to the SSI patch, we now have an in-tree test
framework for testing interleaved transactions. The only thing it needs to be
suitable for this work is a way to handle blocked commands. If you like, I can
try to whip something up for that.

[off-list ACK followed]

Here's a patch implementing that. It applies to master, with or without your
KEY LOCK patch also applied, though the expected outputs reflect the
improvements from your patch. I add three isolation test specs:

fk-contention: blocking-only test case from your blog post
fk-deadlock: the deadlocking test case I used during patch review
fk-deadlock2: Joel Jacobson's deadlocking test case

Thanks for this patch. I have applied it, adjusting the expected output
of these tests to the HEAD code. I'll adjust it when I commit the
fklocks patch, I guess, but it seemed simpler to have it out of the way;
besides it might end up benefitting other people who might be messing
with the locking code.

I only support one waiting command at a time. As long as one commands continues
to wait, I run other commands to completion synchronously.

Should be fine for now, I guess.

I think this will work on Windows as well as pgbench does, but I haven't
verified that.

We will find out shortly.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#24

noah@2ndQuadrant.com

over 14 years ago

In reply to: Alvaro Herrera (#23)

1 attachment(s)

Re: FOR KEY LOCK foreign keys

On Tue, Jul 12, 2011 at 05:59:01PM -0400, Alvaro Herrera wrote:

Excerpts from Noah Misch's message of vie mar 11 12:51:14 -0300 2011:

On Fri, Feb 11, 2011 at 02:13:22AM -0500, Noah Misch wrote:

Automated tests would go a long way toward building confidence that this patch
does the right thing. Thanks to the SSI patch, we now have an in-tree test
framework for testing interleaved transactions. The only thing it needs to be
suitable for this work is a way to handle blocked commands. If you like, I can
try to whip something up for that.

[off-list ACK followed]

Here's a patch implementing that. It applies to master, with or without your
KEY LOCK patch also applied, though the expected outputs reflect the
improvements from your patch. I add three isolation test specs:

fk-contention: blocking-only test case from your blog post
fk-deadlock: the deadlocking test case I used during patch review
fk-deadlock2: Joel Jacobson's deadlocking test case

Thanks for this patch. I have applied it, adjusting the expected output
of these tests to the HEAD code. I'll adjust it when I commit the
fklocks patch, I guess, but it seemed simpler to have it out of the way;
besides it might end up benefitting other people who might be messing
with the locking code.

Great. There have been a few recent patches where I would have used this
functionality to provide tests, so I'm glad to have it in.

I think this will work on Windows as well as pgbench does, but I haven't
verified that.

We will find out shortly.

I see you've added a fix for the MSVC animals; thanks.

coypu failed during the run of the test due to a different session being chosen
as the deadlock victim. We can now vary deadlock_timeout to prevent this; see
attached fklocks-tests-deadlock_timeout.patch. This also makes the tests much
faster on a default postgresql.conf.

crake failed when it reported waiting on the first step of an existing isolation
test ("two-ids.spec"). I will need to look into that further.

Thanks,
nm

Attachments:

fklocks-tests-deadlock_timeout.patchtext/plain; charset=us-asciiDownload

diff --git a/src/test/isolation/expected/fk-deadlock.out b/src/test/isolation/expected/fk-deadlock.out
index 6b6ee16..0d86cda 100644
*** a/src/test/isolation/expected/fk-deadlock.out
--- b/src/test/isolation/expected/fk-deadlock.out
***************
*** 32,39 **** step s1i:  INSERT INTO child VALUES (1, 1);
  step s2i:  INSERT INTO child VALUES (2, 1); 
  step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
  step s1u:  UPDATE parent SET aux = 'bar'; 
- step s2u: <... completed>
  ERROR:  deadlock detected
  step s2c:  COMMIT; 
  step s1c:  COMMIT; 
  
--- 32,39 ----
  step s2i:  INSERT INTO child VALUES (2, 1); 
  step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
  step s1u:  UPDATE parent SET aux = 'bar'; 
  ERROR:  deadlock detected
+ step s2u: <... completed>
  step s2c:  COMMIT; 
  step s1c:  COMMIT; 
  
***************
*** 52,59 **** step s2i:  INSERT INTO child VALUES (2, 1);
  step s1i:  INSERT INTO child VALUES (1, 1); 
  step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
  step s1u:  UPDATE parent SET aux = 'bar'; 
- step s2u: <... completed>
  ERROR:  deadlock detected
  step s2c:  COMMIT; 
  step s1c:  COMMIT; 
  
--- 52,59 ----
  step s1i:  INSERT INTO child VALUES (1, 1); 
  step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
  step s1u:  UPDATE parent SET aux = 'bar'; 
  ERROR:  deadlock detected
+ step s2u: <... completed>
  step s2c:  COMMIT; 
  step s1c:  COMMIT; 
  
diff --git a/src/test/isolation/expected/fk-deadloindex af3ce8e..6e7f12d 100644
*** a/src/test/isolation/expected/fk-deadlock2.out
--- b/src/test/isolation/expected/fk-deadlock2.out
***************
*** 42,49 **** step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1;
  step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
- step s2u2: <... completed>
  ERROR:  deadlock detected
  step s1c:  COMMIT; 
  step s2c:  COMMIT; 
  
--- 42,49 ----
  step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  ERROR:  deadlock detected
+ step s2u2: <... completed>
  step s1c:  COMMIT; 
  step s2c:  COMMIT; 
  
***************
*** 52,59 **** step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1;
  step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
- step s2u2: <... completed>
  ERROR:  deadlock detected
  step s2c:  COMMIT; 
  step s1c:  COMMIT; 
  
--- 52,59 ----
  step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  ERROR:  deadlock detected
+ step s2u2: <... completed>
  step s2c:  COMMIT; 
  step s1c:  COMMIT; 
  
***************
*** 82,89 **** step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2;
  step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
- step s2u2: <... completed>
  ERROR:  deadlock detected
  step s1c:  COMMIT; 
  step s2c:  COMMIT; 
  
--- 82,89 ----
  step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  ERROR:  deadlock detected
+ step s2u2: <... completed>
  step s1c:  COMMIT; 
  step s2c:  COMMIT; 
  
***************
*** 92,99 **** step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2;
  step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
- step s2u2: <... completed>
  ERROR:  deadlock detected
  step s2c:  COMMIT; 
  step s1c:  COMMIT; 
  
--- 92,99 ----
  step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  ERROR:  deadlock detected
+ step s2u2: <... completed>
  step s2c:  COMMIT; 
  step s1c:  COMMIT; 
  
diff --git a/src/test/isolation/specs/fk-deadlock.sindex 530cf10..b533d77 100644
*** a/src/test/isolation/specs/fk-deadlock.spec
--- b/src/test/isolation/specs/fk-deadlock.spec
***************
*** 19,31 **** teardown
  }
  
  session "s1"
! setup		{ BEGIN; }
  step "s1i"	{ INSERT INTO child VALUES (1, 1); }
  step "s1u"	{ UPDATE parent SET aux = 'bar'; }
  step "s1c"	{ COMMIT; }
  
  session "s2"
! setup		{ BEGIN; }
  step "s2i"	{ INSERT INTO child VALUES (2, 1); }
  step "s2u"	{ UPDATE parent SET aux = 'baz'; }
  step "s2c"	{ COMMIT; }
--- 19,31 ----
  }
  
  session "s1"
! setup		{ BEGIN; SET deadlock_timeout = '20ms'; }
  step "s1i"	{ INSERT INTO child VALUES (1, 1); }
  step "s1u"	{ UPDATE parent SET aux = 'bar'; }
  step "s1c"	{ COMMIT; }
  
  session "s2"
! setup		{ BEGIN; SET deadlock_timeout = '10s'; }
  step "s2i"	{ INSERT INTO child VALUES (2, 1); }
  step "s2u"	{ UPDATE parent SET aux = 'baz'; }
  step "s2c"	{ COMMIT; }
diff --git a/src/test/isolation/specs/fk-deadlocindex 91a87d1..5653628 100644
*** a/src/test/isolation/specs/fk-deadlock2.spec
--- b/src/test/isolation/specs/fk-deadlock2.spec
***************
*** 24,36 **** teardown
  }
  
  session "s1"
! setup		{ BEGIN; }
  step "s1u1"	{ UPDATE A SET Col1 = 1 WHERE AID = 1; }
  step "s1u2"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
  step "s1c"	{ COMMIT; }
  
  session "s2"
! setup		{ BEGIN; }
  step "s2u1"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
  step "s2u2"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
  step "s2c"	{ COMMIT; }
--- 24,36 ----
  }
  
  session "s1"
! setup		{ BEGIN; SET deadlock_timeout = '20ms'; }
  step "s1u1"	{ UPDATE A SET Col1 = 1 WHERE AID = 1; }
  step "s1u2"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
  step "s1c"	{ COMMIT; }
  
  session "s2"
! setup		{ BEGIN; SET deadlock_timeout = '10s'; }
  step "s2u1"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
  step "s2u2"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
  step "s2c"	{ COMMIT; }

#25

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2011-07-12%2022:32:02
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2011-07-14%2016:27:00
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pitta&dt=2011-07-15%2015:00:08
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2011-07-15%2018:32:02

alvherre@commandprompt.com

over 14 years ago

In reply to: Noah Misch (#24)

Re: FOR KEY LOCK foreign keys

Excerpts from Noah Misch's message of mié jul 13 01:34:10 -0400 2011:

coypu failed during the run of the test due to a different session being chosen
as the deadlock victim. We can now vary deadlock_timeout to prevent this; see
attached fklocks-tests-deadlock_timeout.patch. This also makes the tests much
faster on a default postgresql.conf.

I applied your patch, thanks. I couldn't reproduce the failures without
it, even running only the three new tests in a loop a few dozen times.

crake failed when it reported waiting on the first step of an existing isolation
test ("two-ids.spec"). I will need to look into that further.

Actually, there are four failures in tests other than the two fixed by
your patch. These are:

The last two are an identical failure in multiple-row-versions:
***************
*** 1,11 ****
Parsed test spec with 4 sessions

starting permutation: rx1 wx2 c2 wx3 ry3 wy4 rz4 c4 c3 wz1 c1
! step rx1: SELECT * FROM t WHERE id = 1000000;
id txt

  1000000                       
- step wx2:  UPDATE t SET txt = 'b' WHERE id = 1000000; 
  step c2:  COMMIT; 
  step wx3:  UPDATE t SET txt = 'c' WHERE id = 1000000; 
  step ry3:  SELECT * FROM t WHERE id = 500000; 
--- 1,12 ----
  Parsed test spec with 4 sessions

starting permutation: rx1 wx2 c2 wx3 ry3 wy4 rz4 c4 c3 wz1 c1
! step rx1: SELECT * FROM t WHERE id = 1000000; <waiting ...>
! step wx2: UPDATE t SET txt = 'b' WHERE id = 1000000;
! step rx1: <... completed>
id txt

1000000
step c2: COMMIT;
step wx3: UPDATE t SET txt = 'c' WHERE id = 1000000;
step ry3: SELECT * FROM t WHERE id = 500000;

The other failure by crake in two-ids:

***************
*** 440,447 ****
step c3: COMMIT;

starting permutation: rxwy2 wx1 ry3 c2 c3 c1
! step rxwy2: update D2 set id = (select id+1 from D1);
step wx1: update D1 set id = id + 1;
step ry3: select id from D2;
id

--- 440,448 ----
  step c3:  COMMIT;

starting permutation: rxwy2 wx1 ry3 c2 c3 c1
! step rxwy2: update D2 set id = (select id+1 from D1); <waiting ...>
step wx1: update D1 set id = id + 1;
+ step rxwy2: <... completed>
step ry3: select id from D2;
id

And the most problematic one, in nightjar, is a failure to send two
async commands, which is not supported by the new code:

--- 255,260 ----
  ERROR:  could not serialize access due to read/write dependencies among transactions

starting permutation: ry2 wx2 rx1 wy1 c2 c1
! step ry2: SELECT count(*) FROM project WHERE project_manager = 1; <waiting ...>
! failed to send query: another command is already in progress

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#26

noah@2ndQuadrant.com

over 14 years ago

In reply to: Alvaro Herrera (#25)

1 attachment(s)

Re: FOR KEY LOCK foreign keys

On Fri, Jul 15, 2011 at 07:01:26PM -0400, Alvaro Herrera wrote:

Excerpts from Noah Misch's message of miï¿½ jul 13 01:34:10 -0400 2011:

coypu failed during the run of the test due to a different session being chosen
as the deadlock victim. We can now vary deadlock_timeout to prevent this; see
attached fklocks-tests-deadlock_timeout.patch. This also makes the tests much
faster on a default postgresql.conf.

I applied your patch, thanks. I couldn't reproduce the failures without
it, even running only the three new tests in a loop a few dozen times.

It's probably more likely to crop up on a loaded system. I did not actually
reproduce it myself. However, if you swap the timeouts, the opposite session
finds the deadlock. From there, I'm convinced that the right timing
perturbations could yield the symptom coypu exhibited.

crake failed when it reported waiting on the first step of an existing isolation
test ("two-ids.spec"). I will need to look into that further.

Actually, there are four failures in tests other than the two fixed by
your patch. These are:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2011-07-12%2022:32:02
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2011-07-14%2016:27:00
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pitta&dt=2011-07-15%2015:00:08
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2011-07-15%2018:32:02

Thanks for summarizing. These all boil down to lock waits not anticipated by
the test specs. Having pondered this, I've been able to come up with just one
explanation. If autovacuum runs VACUUM during the test and finds that it can
truncate dead space from the end of a relation, it will acquire an
AccessExclusiveLock. When I decrease autovacuum_naptime to 1s, I do see
plenty of pg_type and pg_attribute truncations during a test run.

When I sought to reproduce this, what I first saw instead was an indefinite
test suite hang. That turned out to arise from an unrelated thinko -- I
assumed that backend IDs were stable for the life of the backend, but they're
only stable for the life of a pgstat snapshot. This fell down when a backend
older than one of the test backends exited during the test:

4199 2011-07-16 03:33:28.733 EDT DEBUG: forked new backend, pid=23984 socket=8
23984 2011-07-16 03:33:28.737 EDT LOG: statement: SET client_min_messages = warning;
23984 2011-07-16 03:33:28.739 EDT LOG: statement: SELECT i FROM pg_stat_get_backend_idset() t(i) WHERE pg_stat_get_backend_pid(i) = pg_backend_pid()
23985 2011-07-16 03:33:28.740 EDT DEBUG: autovacuum: processing database "postgres"
4199 2011-07-16 03:33:28.754 EDT DEBUG: forked new backend, pid=23986 socket=8
23986 2011-07-16 03:33:28.754 EDT LOG: statement: SET client_min_messages = warning;
4199 2011-07-16 03:33:28.755 EDT DEBUG: server process (PID 23985) exited with exit code 0
23986 2011-07-16 03:33:28.755 EDT LOG: statement: SELECT i FROM pg_stat_get_backend_idset() t(i) WHERE pg_stat_get_backend_pid(i) = pg_backend_pid()
4199 2011-07-16 03:33:28.766 EDT DEBUG: forked new backend, pid=23987 socket=8
23987 2011-07-16 03:33:28.766 EDT LOG: statement: SET client_min_messages = warning;
23987 2011-07-16 03:33:28.767 EDT LOG: statement: SELECT i FROM pg_stat_get_backend_idset() t(i) WHERE pg_stat_get_backend_pid(i) = pg_backend_pid()

This led isolationtester to initialize backend_ids = {1,2,2}, making us unable
to detect lock waits correctly. That's also consistent with the symptoms Rï¿½mi
Zara just reported. With that fixed, I was able to reproduce the failure due
to autovacuum-truncate-induced transient waiting using this recipe:
- autovacuum_naptime = 1s
- src/test/isolation/Makefile changed to pass --use-existing during installcheck
- Run 'make installcheck' in a loop
- A concurrent session running this in a loop:
CREATE TABLE churn (a int, b int, c int, d int, e int, f int, g int, h int);
DROP TABLE churn;

That yields a steady stream of vacuum truncations, and an associated lock wait
generally capsized the suite within 5-10 runs. Frankly, I have some
difficulty believing that this mechanic alone produced all four failures you
cite above; I suspect I'm still missing some more-frequent cause. Any other
theories on which system background activities can cause a transient lock
wait? It would have to produce a "pgstat_report_waiting(true)" call, so I
believe that excludes all LWLock and lighter contention.

In any event, I have attached a patch that fixes the problems I have described
here. To ignore autovacuum, it only recognizes a wait when one of the
backends under test holds a conflicting lock. (It occurs to me that perhaps
we should expose a pg_lock_conflicts(lockmode_held text, lockmode_req text)
function to simplify this query -- this is a fairly common monitoring need.)

With that change in place, my setup survived through about fifty suite runs at
a time. The streak would end when session 2 would unexpectedly detect a
deadlock that session 1 should have detected. The session 1 deadlock_timeout
I chose, 20ms, is too aggressive. When session 2 is to issue the command that
completes the deadlock, it must do so before session 1 runs the deadlock
detector. Since we burn 10ms just noticing that the previous statement has
blocked, that left only 10ms to issue the next statement. This patch bumps
the figure from 20s to 100ms; hopefully that will be enough for even a
decently-loaded virtual host. We should keep it as low as is reasonable,
because it contributes directly to the isolation suite runtime. Each addition
to deadlock_timeout slows the suite by 12x that amount.

With this patch in its final form, I have completed 180+ suite runs without a
failure. In the absence of better theories on the cause for the buildfarm
failures, we should give the buildfarm a whirl with this patch.

I apologize for the quantity of errata this change is entailing.

Thanks,
nm

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

fklocks-tests-harden.patchtext/plain; charset=us-asciiDownload

diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index 126e185..96d7f17 100644
*** a/src/test/isolation/isolationtester.c
--- b/src/test/isolation/isolationtester.c
***************
*** 21,26 ****
--- 21,27 ----
  #endif
  
  #include "libpq-fe.h"
+ #include "pqexpbuffer.h"
  
  #include "isolationtester.h"
  
***************
*** 31,37 ****
   * connections represent spec-defined sessions.
   */
  static PGconn **conns = NULL;
! static const char **backend_ids = NULL;
  static int	nconns = 0;
  
  static void run_all_permutations(TestSpec * testspec);
--- 32,38 ----
   * connections represent spec-defined sessions.
   */
  static PGconn **conns = NULL;
! static const char **backend_pids = NULL;
  static int	nconns = 0;
  
  static void run_all_permutations(TestSpec * testspec);
***************
*** 67,72 **** main(int argc, char **argv)
--- 68,74 ----
  	TestSpec   *testspec;
  	int			i;
  	PGresult   *res;
+ 	PQExpBufferData wait_query;
  
  	/*
  	 * If the user supplies a parameter on the command line, use it as the
***************
*** 89,95 **** main(int argc, char **argv)
  	 */
  	nconns = 1 + testspec->nsessions;
  	conns = calloc(nconns, sizeof(PGconn *));
! 	backend_ids = calloc(nconns, sizeof(*backend_ids));
  	for (i = 0; i < nconns; i++)
  	{
  		conns[i] = PQconnectdb(conninfo);
--- 91,97 ----
  	 */
  	nconns = 1 + testspec->nsessions;
  	conns = calloc(nconns, sizeof(PGconn *));
! 	backend_pids = calloc(nconns, sizeof(*backend_pids));
  	for (i = 0; i < nconns; i++)
  	{
  		conns[i] = PQconnectdb(conninfo);
***************
*** 112,134 **** main(int argc, char **argv)
  		}
  		PQclear(res);
  
! 		/* Get the backend ID for lock wait checking. */
! 		res = PQexec(conns[i], "SELECT i FROM pg_stat_get_backend_idset() t(i) "
! 					 "WHERE pg_stat_get_backend_pid(i) = pg_backend_pid()");
  		if (PQresultStatus(res) == PGRES_TUPLES_OK)
  		{
  			if (PQntuples(res) == 1 && PQnfields(res) == 1)
! 				backend_ids[i] = strdup(PQgetvalue(res, 0, 0));
  			else
  			{
! 				fprintf(stderr, "backend id query returned %d rows and %d columns, expected 1 row and 1 column",
  						PQntuples(res), PQnfields(res));
  				exit_nicely();
  			}
  		}
  		else
  		{
! 			fprintf(stderr, "backend id query failed: %s",
  					PQerrorMessage(conns[i]));
  			exit_nicely();
  		}
--- 114,135 ----
  		}
  		PQclear(res);
  
! 		/* Get the backend pid for lock wait checking. */
! 		res = PQexec(conns[i], "SELECT pg_backend_pid()");
  		if (PQresultStatus(res) == PGRES_TUPLES_OK)
  		{
  			if (PQntuples(res) == 1 && PQnfields(res) == 1)
! 				backend_pids[i] = strdup(PQgetvalue(res, 0, 0));
  			else
  			{
! 				fprintf(stderr, "backend pid query returned %d rows and %d columns, expected 1 row and 1 column",
  						PQntuples(res), PQnfields(res));
  				exit_nicely();
  			}
  		}
  		else
  		{
! 			fprintf(stderr, "backend pid query failed: %s",
  					PQerrorMessage(conns[i]));
  			exit_nicely();
  		}
***************
*** 145,152 **** main(int argc, char **argv)
  			session->steps[stepindex]->session = i;
  	}
  
! 	res = PQprepare(conns[0], PREP_WAITING,
! 					"SELECT 1 WHERE pg_stat_get_backend_waiting($1)", 0, NULL);
  	if (PQresultStatus(res) != PGRES_COMMAND_OK)
  	{
  		fprintf(stderr, "prepare of lock wait query failed: %s",
--- 146,232 ----
  			session->steps[stepindex]->session = i;
  	}
  
! 	/*
! 	 * Build the query we'll use to detect lock contention among sessions in
! 	 * the test specification.  Most of the time, we could get away with
! 	 * simply checking whether a session is waiting for *any* lock: we don't
! 	 * exactly expect concurrent use of test tables.  However, autovacuum will
! 	 * occasionally take AccessExclusiveLock to truncate a table, and we must
! 	 * ignore that transient wait.
! 	 */
! 	initPQExpBuffer(&wait_query);
! 	appendPQExpBufferStr(&wait_query,
! 						 "SELECT 1 FROM pg_locks holder, pg_locks waiter "
! 						 "WHERE NOT waiter.granted AND waiter.pid = $1 "
! 						 "AND holder.granted "
! 						 "AND holder.pid <> $1 AND holder.pid IN (");
! 	/* The spec syntax requires at least one session; assume that here. */
! 	appendPQExpBuffer(&wait_query, "%s", backend_pids[1]);
! 	for (i = 2; i < nconns; i++)
! 		appendPQExpBuffer(&wait_query, ", %s", backend_pids[i]);
! 	appendPQExpBufferStr(&wait_query,
! 						 ") "
! 
! 						 "AND holder.mode = ANY (CASE waiter.mode "
! 						 "WHEN 'AccessShareLock' THEN ARRAY["
! 						 "'AccessExclusiveLock'] "
! 						 "WHEN 'RowShareLock' THEN ARRAY["
! 						 "'ExclusiveLock',"
! 						 "'AccessExclusiveLock'] "
! 						 "WHEN 'RowExclusiveLock' THEN ARRAY["
! 						 "'ShareLock',"
! 						 "'ShareRowExclusiveLock',"
! 						 "'ExclusiveLock',"
! 						 "'AccessExclusiveLock'] "
! 						 "WHEN 'ShareUpdateExclusiveLock' THEN ARRAY["
! 						 "'ShareUpdateExclusiveLock',"
! 						 "'ShareLock',"
! 						 "'ShareRowExclusiveLock',"
! 						 "'ExclusiveLock',"
! 						 "'AccessExclusiveLock'] "
! 						 "WHEN 'ShareLock' THEN ARRAY["
! 						 "'RowExclusiveLock',"
! 						 "'ShareUpdateExclusiveLock',"
! 						 "'ShareRowExclusiveLock',"
! 						 "'ExclusiveLock',"
! 						 "'AccessExclusiveLock'] "
! 						 "WHEN 'ShareRowExclusiveLock' THEN ARRAY["
! 						 "'RowExclusiveLock',"
! 						 "'ShareUpdateExclusiveLock',"
! 						 "'ShareLock',"
! 						 "'ShareRowExclusiveLock',"
! 						 "'ExclusiveLock',"
! 						 "'AccessExclusiveLock'] "
! 						 "WHEN 'ExclusiveLock' THEN ARRAY["
! 						 "'RowShareLock',"
! 						 "'RowExclusiveLock',"
! 						 "'ShareUpdateExclusiveLock',"
! 						 "'ShareLock',"
! 						 "'ShareRowExclusiveLock',"
! 						 "'ExclusiveLock',"
! 						 "'AccessExclusiveLock'] "
! 						 "WHEN 'AccessExclusiveLock' THEN ARRAY["
! 						 "'AccessShareLock',"
! 						 "'RowShareLock',"
! 						 "'RowExclusiveLock',"
! 						 "'ShareUpdateExclusiveLock',"
! 						 "'ShareLock',"
! 						 "'ShareRowExclusiveLock',"
! 						 "'ExclusiveLock',"
! 						 "'AccessExclusiveLock'] END) "
! 
! 						 "AND holder.locktype IS NOT DISTINCT FROM waiter.locktype "
! 						 "AND holder.database IS NOT DISTINCT FROM waiter.database "
! 						 "AND holder.relation IS NOT DISTINCT FROM waiter.relation "
! 						 "AND holder.page IS NOT DISTINCT FROM waiter.page "
! 						 "AND holder.tuple IS NOT DISTINCT FROM waiter.tuple "
! 						 "AND holder.virtualxid IS NOT DISTINCT FROM waiter.virtualxid "
! 						 "AND holder.transactionid IS NOT DISTINCT FROM waiter.transactionid "
! 						 "AND holder.classid IS NOT DISTINCT FROM waiter.classid "
! 						 "AND holder.objid IS NOT DISTINCT FROM waiter.objid "
! 						 "AND holder.objsubid IS NOT DISTINCT FROM waiter.objsubid ");
! 
! 	res = PQprepare(conns[0], PREP_WAITING, wait_query.data, 0, NULL);
  	if (PQresultStatus(res) != PGRES_COMMAND_OK)
  	{
  		fprintf(stderr, "prepare of lock wait query failed: %s",
***************
*** 154,159 **** main(int argc, char **argv)
--- 234,240 ----
  		exit_nicely();
  	}
  	PQclear(res);
+ 	termPQExpBuffer(&wait_query);
  
  	/*
  	 * Run the permutations specified in the spec, or all if none were
***************
*** 411,419 **** run_permutation(TestSpec * testspec, int nsteps, Step ** steps)
   * Our caller already sent the query associated with this step.  Wait for it
   * to either complete or (if given the STEP_NONBLOCK flag) to block while
   * waiting for a lock.  We assume that any lock wait will persist until we
!  * have executed additional steps in the permutation.  This is not fully
!  * robust -- a concurrent autovacuum could briefly take a lock with which we
!  * conflict.  The risk may be low enough to discount.
   *
   * When calling this function on behalf of a given step for a second or later
   * time, pass the STEP_RETRY flag.  This only affects the messages printed.
--- 492,498 ----
   * Our caller already sent the query associated with this step.  Wait for it
   * to either complete or (if given the STEP_NONBLOCK flag) to block while
   * waiting for a lock.  We assume that any lock wait will persist until we
!  * have executed additional steps in the permutation.
   *
   * When calling this function on behalf of a given step for a second or later
   * time, pass the STEP_RETRY flag.  This only affects the messages printed.
***************
*** 450,456 **** try_complete_step(Step *step, int flags)
  			int			ntuples;
  
  			res = PQexecPrepared(conns[0], PREP_WAITING, 1,
! 								 &backend_ids[step->session + 1],
  								 NULL, NULL, 0);
  			if (PQresultStatus(res) != PGRES_TUPLES_OK)
  			{
--- 529,535 ----
  			int			ntuples;
  
  			res = PQexecPrepared(conns[0], PREP_WAITING, 1,
! 								 &backend_pids[step->session + 1],
  								 NULL, NULL, 0);
  			if (PQresultStatus(res) != PGRES_TUPLES_OK)
  			{
diff --git a/src/test/isolation/specs/fk-deindex b533d77..9f46c6b 100644
*** a/src/test/isolation/specs/fk-deadlock.spec
--- b/src/test/isolation/specs/fk-deadlock.spec
***************
*** 19,25 **** teardown
  }
  
  session "s1"
! setup		{ BEGIN; SET deadlock_timeout = '20ms'; }
  step "s1i"	{ INSERT INTO child VALUES (1, 1); }
  step "s1u"	{ UPDATE parent SET aux = 'bar'; }
  step "s1c"	{ COMMIT; }
--- 19,25 ----
  }
  
  session "s1"
! setup		{ BEGIN; SET deadlock_timeout = '100ms'; }
  step "s1i"	{ INSERT INTO child VALUES (1, 1); }
  step "s1u"	{ UPDATE parent SET aux = 'bar'; }
  step "s1c"	{ COMMIT; }
diff --git a/src/test/isolation/specs/fk-deadlocindex 5653628..a8f1516 100644
*** a/src/test/isolation/specs/fk-deadlock2.spec
--- b/src/test/isolation/specs/fk-deadlock2.spec
***************
*** 24,30 **** teardown
  }
  
  session "s1"
! setup		{ BEGIN; SET deadlock_timeout = '20ms'; }
  step "s1u1"	{ UPDATE A SET Col1 = 1 WHERE AID = 1; }
  step "s1u2"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
  step "s1c"	{ COMMIT; }
--- 24,30 ----
  }
  
  session "s1"
! setup		{ BEGIN; SET deadlock_timeout = '100ms'; }
  step "s1u1"	{ UPDATE A SET Col1 = 1 WHERE AID = 1; }
  step "s1u2"	{ UPDATE B SET Col2 = 1 WHERE BID = 2; }
  step "s1c"	{ COMMIT; }

#27

Kevin Grittner

Kevin.Grittner@wicourts.gov

over 14 years ago

In reply to: Noah Misch (#26)

1 attachment(s)

Re: FOR KEY LOCK foreign keys

Noah Misch wrote:

With this patch in its final form, I have completed 180+ suite runs
without a failure.

The attached patch allows the tests to pass when
default_transaction_isolation is stricter than 'read committed'.
This is a slight change from the previously posted version of the
files (because of a change in the order of statements, based on the
timeouts), and in patch form this time.

Since `make installcheck-world` works at all isolation level
defaults, as do all previously included isolation tests, it seems
like a good idea to keep this up. It will simplify my testing of SSI
changes, anyway.

-Kevin

Attachments:

fklocks-tests-strict-isolation.patchapplication/octet-stream; name=fklocks-tests-strict-isolation.patchDownload

*** /dev/null
--- b/src/test/isolation/expected/fk-deadlock2_1.out
***************
*** 0 ****
--- 1,110 ----
+ Parsed test spec with 2 sessions
+ 
+ starting permutation: s1u1 s1u2 s1c s2u1 s2u2 s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1c:  COMMIT; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s1u2 s2u1 s1c s2u2 s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1c:  COMMIT; 
+ step s2u1: <... completed>
+ ERROR:  could not serialize access due to concurrent update
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  current transaction is aborted, commands ignored until end of transaction block
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s1u2 s2u2 s1c s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s1u2 s2u2 s2c s1c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s2u2 s1u2 s1c s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  deadlock detected
+ step s2u2: <... completed>
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s2u2 s1u2 s2c s1c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  deadlock detected
+ step s2u2: <... completed>
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s1u2 s2u2 s1c s2c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s1u2 s2u2 s2c s1c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s2u2 s1u2 s1c s2c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  deadlock detected
+ step s2u2: <... completed>
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s2u2 s1u2 s2c s1c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  deadlock detected
+ step s2u2: <... completed>
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2u1 s2u2 s1u1 s2c s1u2 s1c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1;  <waiting ...>
+ step s2c:  COMMIT; 
+ step s1u1: <... completed>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  could not serialize access due to read/write dependencies among transactions
+ step s1c:  COMMIT; 
*** /dev/null
--- b/src/test/isolation/expected/fk-deadlock2_2.out
***************
*** 0 ****
--- 1,110 ----
+ Parsed test spec with 2 sessions
+ 
+ starting permutation: s1u1 s1u2 s1c s2u1 s2u2 s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1c:  COMMIT; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s1u2 s2u1 s1c s2u2 s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1c:  COMMIT; 
+ step s2u1: <... completed>
+ ERROR:  could not serialize access due to concurrent update
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  current transaction is aborted, commands ignored until end of transaction block
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s1u2 s2u2 s1c s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s1u2 s2u2 s2c s1c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s2u2 s1u2 s1c s2c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  deadlock detected
+ step s2u2: <... completed>
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1u1 s2u1 s2u2 s1u2 s2c s1c
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  deadlock detected
+ step s2u2: <... completed>
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s1u2 s2u2 s1c s2c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s1u2 s2u2 s2c s1c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u2: <... completed>
+ ERROR:  deadlock detected
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s2u2 s1u2 s1c s2c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  deadlock detected
+ step s2u2: <... completed>
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s2u1 s1u1 s2u2 s1u2 s2c s1c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2;  <waiting ...>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  deadlock detected
+ step s2u2: <... completed>
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2u1 s2u2 s1u1 s2c s1u2 s1c
+ step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1;  <waiting ...>
+ step s2c:  COMMIT; 
+ step s1u1: <... completed>
+ step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
+ ERROR:  could not serialize access due to concurrent update
+ step s1c:  COMMIT; 
*** /dev/null
--- b/src/test/isolation/expected/fk-deadlock_1.out
***************
*** 0 ****
--- 1,71 ----
+ Parsed test spec with 2 sessions
+ 
+ starting permutation: s1i s1u s1c s2i s2u s2c
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s1u:  UPDATE parent SET aux = 'bar'; 
+ step s1c:  COMMIT; 
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s2u:  UPDATE parent SET aux = 'baz'; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1i s1u s2i s1c s2u s2c
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s1u:  UPDATE parent SET aux = 'bar'; 
+ step s2i:  INSERT INTO child VALUES (2, 1);  <waiting ...>
+ step s1c:  COMMIT; 
+ step s2i: <... completed>
+ ERROR:  could not serialize access due to concurrent update
+ step s2u:  UPDATE parent SET aux = 'baz'; 
+ ERROR:  current transaction is aborted, commands ignored until end of transaction block
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1i s2i s1u s2u s1c s2c
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s1u:  UPDATE parent SET aux = 'bar';  <waiting ...>
+ step s2u:  UPDATE parent SET aux = 'baz'; 
+ step s1u: <... completed>
+ ERROR:  deadlock detected
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s1i s2i s2u s1u s2c s1c
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
+ step s1u:  UPDATE parent SET aux = 'bar'; 
+ ERROR:  deadlock detected
+ step s2u: <... completed>
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2i s1i s1u s2u s1c s2c
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s1u:  UPDATE parent SET aux = 'bar';  <waiting ...>
+ step s2u:  UPDATE parent SET aux = 'baz'; 
+ step s1u: <... completed>
+ ERROR:  deadlock detected
+ step s1c:  COMMIT; 
+ step s2c:  COMMIT; 
+ 
+ starting permutation: s2i s1i s2u s1u s2c s1c
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s1i:  INSERT INTO child VALUES (1, 1); 
+ step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
+ step s1u:  UPDATE parent SET aux = 'bar'; 
+ ERROR:  deadlock detected
+ step s2u: <... completed>
+ step s2c:  COMMIT; 
+ step s1c:  COMMIT; 
+ 
+ starting permutation: s2i s2u s1i s2c s1u s1c
+ step s2i:  INSERT INTO child VALUES (2, 1); 
+ step s2u:  UPDATE parent SET aux = 'baz'; 
+ step s1i:  INSERT INTO child VALUES (1, 1);  <waiting ...>
+ step s2c:  COMMIT; 
+ step s1i: <... completed>
+ ERROR:  could not serialize access due to concurrent update
+ step s1u:  UPDATE parent SET aux = 'bar'; 
+ ERROR:  current transaction is aborted, commands ignored until end of transaction block
+ step s1c:  COMMIT;

Import Notes

Resolved by subject fallback

#28

noah@2ndQuadrant.com

over 14 years ago

In reply to: Kevin Grittner (#27)

Re: FOR KEY LOCK foreign keys

On Sat, Jul 16, 2011 at 01:03:31PM -0500, Kevin Grittner wrote:

Noah Misch wrote:

With this patch in its final form, I have completed 180+ suite runs
without a failure.

The attached patch allows the tests to pass when
default_transaction_isolation is stricter than 'read committed'.
This is a slight change from the previously posted version of the
files (because of a change in the order of statements, based on the
timeouts), and in patch form this time.

Since `make installcheck-world` works at all isolation level
defaults, as do all previously included isolation tests, it seems
like a good idea to keep this up. It will simplify my testing of SSI
changes, anyway.

This does seem sensible. Thanks.

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#29

Kevin Grittner

Kevin.Grittner@wicourts.gov

over 14 years ago

In reply to: Kevin Grittner (#27)

Re: FOR KEY LOCK foreign keys

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

Noah Misch wrote:

With this patch in its final form, I have completed 180+ suite
runs without a failure.

The attached patch allows the tests to pass when
default_transaction_isolation is stricter than 'read committed'.

Without these two patches the tests fail about one time out of three
on my machine at the office at the 'read committed' transaction
isolation level, and all the time at stricter levels. On my machine
at home I haven't seen the failures at 'read committed'. I don't
know if this is Intel (at work) versus AMD (at home) or what.

With both Noah's patch and mine I haven't yet seen a failure in
either environment, with a few dozen tries..

-Kevin

#30

http://archives.postgresql.org/pgsql-hackers/2011-07/msg00867.php

alvherre@commandprompt.com

over 14 years ago

In reply to: Kevin Grittner (#27)

Re: FOR KEY LOCK foreign keys

Excerpts from Kevin Grittner's message of sáb jul 16 14:03:31 -0400 2011:

Noah Misch wrote:

With this patch in its final form, I have completed 180+ suite runs
without a failure.

The attached patch allows the tests to pass when
default_transaction_isolation is stricter than 'read committed'.
This is a slight change from the previously posted version of the
files (because of a change in the order of statements, based on the
timeouts), and in patch form this time.

Thanks, applied. Sorry for the delay.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#31

Kevin Grittner

Kevin.Grittner@wicourts.gov

over 14 years ago

In reply to: Alvaro Herrera (#30)

Re: FOR KEY LOCK foreign keys

Alvaro Herrera <alvherre@commandprompt.com> wrote:

Excerpts from Kevin Grittner's message:

Noah Misch wrote:

With this patch in its final form, I have completed 180+ suite
runs without a failure.

The attached patch allows the tests to pass when
default_transaction_isolation is stricter than 'read committed'.
This is a slight change from the previously posted version of the
files (because of a change in the order of statements, based on
the timeouts), and in patch form this time.

Thanks, applied. Sorry for the delay.

My patch was intended to supplement Noah's patch here:

Without his patch, there is still random failure on my work machine
at all transaction isolation levels.

-Kevin

#32

alvherre@commandprompt.com

over 14 years ago

In reply to: Kevin Grittner (#31)

Re: FOR KEY LOCK foreign keys

Excerpts from Kevin Grittner's message of mar jul 19 13:49:53 -0400 2011:

Alvaro Herrera <alvherre@commandprompt.com> wrote:

Excerpts from Kevin Grittner's message:

Noah Misch wrote:

With this patch in its final form, I have completed 180+ suite
runs without a failure.

The attached patch allows the tests to pass when
default_transaction_isolation is stricter than 'read committed'.
This is a slight change from the previously posted version of the
files (because of a change in the order of statements, based on
the timeouts), and in patch form this time.

Thanks, applied. Sorry for the delay.

My patch was intended to supplement Noah's patch here:

I'm aware of that, thanks. I'm getting that one in too, shortly.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#33

alvherre@commandprompt.com

over 14 years ago

In reply to: Noah Misch (#26)

Re: FOR KEY LOCK foreign keys

Excerpts from Noah Misch's message of sáb jul 16 13:11:49 -0400 2011:

In any event, I have attached a patch that fixes the problems I have described
here. To ignore autovacuum, it only recognizes a wait when one of the
backends under test holds a conflicting lock. (It occurs to me that perhaps
we should expose a pg_lock_conflicts(lockmode_held text, lockmode_req text)
function to simplify this query -- this is a fairly common monitoring need.)

Applied it. I agree that having such an utility function is worthwhile,
particularly if we're working on making pg_locks more usable as a whole.

(I wasn't able to reproduce Rémi's hangups here, so I wasn't able to
reproduce the other bits either.)

With that change in place, my setup survived through about fifty suite runs at
a time. The streak would end when session 2 would unexpectedly detect a
deadlock that session 1 should have detected. The session 1 deadlock_timeout
I chose, 20ms, is too aggressive. When session 2 is to issue the command that
completes the deadlock, it must do so before session 1 runs the deadlock
detector. Since we burn 10ms just noticing that the previous statement has
blocked, that left only 10ms to issue the next statement. This patch bumps
the figure from 20s to 100ms; hopefully that will be enough for even a
decently-loaded virtual host.

Committed this too.

With this patch in its final form, I have completed 180+ suite runs without a
failure. In the absence of better theories on the cause for the buildfarm
failures, we should give the buildfarm a whirl with this patch.

Great. If there is some other failure mechanism, we'll find out ...

I apologize for the quantity of errata this change is entailing.

No need to apologize. I might as well apologize myself because I didn't
detect these problems on review. But we don't do that -- we just fix
the problems and move on. It's great that you were able to come up with
a fix quickly.

And this is precisely why I committed this way ahead of the patch that
it was written to help: we're now not fixing problems in both
simultaneously. By the time we get that other patch in, this test
harness will be fully robust.

Thanks for all your effort in this.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#34

http://archives.postgresql.org/message-id/1294953201-sup-2099@alvh.no-ip.org

alvherre@commandprompt.com

over 14 years ago

In reply to: Alvaro Herrera (#1)

1 attachment(s)

Re: FOR KEY LOCK foreign keys

Hackers,

This is an updated version of the patch I introduced here:

Mainly, this patch addresses the numerous comments by Noah Misch here:
http://archives.postgresql.org/message-id/20110211071322.GB26971@tornado.leadboat.com
My thanks to Noah for the very exhaustive review and ideas.

I also removed the bit about copying the ComboCid to the new version of
the tuple during an update. I think that must have been the result of
very fuzzy thinking; I cannot find any reasoning that leads to it being
necessary, or even correct.

I also included Marti Raudsepp's patch to consider only indexes usable
in foreign keys.

One thing I have not addressed is Noah's idea about creating a new lock
mode, KEY UPDATE, that would let us solve the initial problem that this
patch set to resolve in the first place. I am not clear on exactly how
that is to be implemented, because currently heap_update and heap_delete
do not grab any kind of lock but instead do their own ad-hoc waiting. I
think that might need to be reshuffled a bit, to which I haven't gotten
yet, and is a radical enough idea that I would like it to be discussed
by the hackers community at large before setting sail on developing it.
In the meantime, this patch does improve the current situation quite a
lot.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachments:

fklocks-2.patchapplication/octet-stream; name=fklocks-2.patchDownload

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 2450,2455 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
--- 2450,2456 ----
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
  	Bitmapset  *hot_attrs;
+ 	Bitmapset  *keylck_attrs;
  	ItemId		lp;
  	HeapTupleData oldtup;
  	HeapTuple	heaptup;
***************
*** 2466,2471 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
--- 2467,2473 ----
  	bool		have_tuple_lock = false;
  	bool		iscombo;
  	bool		use_hot_update = false;
+ 	bool		keylocked_update = false;
  	bool		all_visible_cleared = false;
  	bool		all_visible_cleared_new = false;
  
***************
*** 2483,2489 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  	 * Note that we get a copy here, so we need not worry about relcache flush
  	 * happening midway through.
  	 */
! 	hot_attrs = RelationGetIndexAttrBitmap(relation);
  
  	block = ItemPointerGetBlockNumber(otid);
  	buffer = ReadBuffer(relation, block);
--- 2485,2492 ----
  	 * Note that we get a copy here, so we need not worry about relcache flush
  	 * happening midway through.
  	 */
! 	hot_attrs = RelationGetIndexAttrBitmap(relation, false);
! 	keylck_attrs = RelationGetIndexAttrBitmap(relation, true);
  
  	block = ItemPointerGetBlockNumber(otid);
  	buffer = ReadBuffer(relation, block);
***************
*** 2524,2614 **** l2:
  	}
  	else if (result == HeapTupleBeingUpdated && wait)
  	{
- 		TransactionId xwait;
  		uint16		infomask;
  
- 		/* must copy state data before unlocking buffer */
- 		xwait = HeapTupleHeaderGetXmax(oldtup.t_data);
  		infomask = oldtup.t_data->t_infomask;
  
- 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- 
  		/*
! 		 * Acquire tuple lock to establish our priority for the tuple (see
! 		 * heap_lock_tuple).  LockTuple will release us when we are
! 		 * next-in-line for the tuple.
! 		 *
! 		 * If we are forced to "start over" below, we keep the tuple lock;
! 		 * this arranges that we stay at the head of the line while rechecking
! 		 * tuple state.
  		 */
! 		if (!have_tuple_lock)
  		{
! 			LockTuple(relation, &(oldtup.t_self), ExclusiveLock);
! 			have_tuple_lock = true;
  		}
  
! 		/*
! 		 * Sleep until concurrent transaction ends.  Note that we don't care
! 		 * if the locker has an exclusive or shared lock, because we need
! 		 * exclusive.
! 		 */
! 
! 		if (infomask & HEAP_XMAX_IS_MULTI)
  		{
! 			/* wait for multixact */
! 			MultiXactIdWait((MultiXactId) xwait);
! 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
  
  			/*
! 			 * If xwait had just locked the tuple then some other xact could
! 			 * update this tuple before we get to this point.  Check for xmax
! 			 * change, and start over if so.
  			 */
! 			if (!(oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
! 				!TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
! 									 xwait))
! 				goto l2;
  
  			/*
! 			 * You might think the multixact is necessarily done here, but not
! 			 * so: it could have surviving members, namely our own xact or
! 			 * other subxacts of this backend.	It is legal for us to update
! 			 * the tuple in either case, however (the latter case is
! 			 * essentially a situation of upgrading our former shared lock to
! 			 * exclusive).	We don't bother changing the on-disk hint bits
! 			 * since we are about to overwrite the xmax altogether.
  			 */
! 		}
! 		else
! 		{
! 			/* wait for regular transaction to end */
! 			XactLockTableWait(xwait);
! 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
  
  			/*
! 			 * xwait is done, but if xwait had just locked the tuple then some
! 			 * other xact could update this tuple before we get to this point.
! 			 * Check for xmax change, and start over if so.
  			 */
! 			if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
! 				!TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
! 									 xwait))
! 				goto l2;
! 
! 			/* Otherwise check if it committed or aborted */
! 			UpdateXmaxHintBits(oldtup.t_data, buffer, xwait);
  		}
- 
- 		/*
- 		 * We may overwrite if previous xmax aborted, or if it committed but
- 		 * only locked the tuple without updating it.
- 		 */
- 		if (oldtup.t_data->t_infomask & (HEAP_XMAX_INVALID |
- 										 HEAP_IS_LOCKED))
- 			result = HeapTupleMayBeUpdated;
- 		else
- 			result = HeapTupleUpdated;
  	}
  
  	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
--- 2527,2636 ----
  	}
  	else if (result == HeapTupleBeingUpdated && wait)
  	{
  		uint16		infomask;
  
  		infomask = oldtup.t_data->t_infomask;
  
  		/*
! 		 * if it's only key-locked and we're not updating an indexed column,
! 		 * we can act though MayBeUpdated was returned, but the resulting tuple
! 		 * needs a bunch of fields copied from the original.
  		 */
! 		if ((infomask & HEAP_XMAX_KEY_LOCK) &&
! 			!(infomask & HEAP_XMAX_SHARED_LOCK) &&
! 			HeapSatisfiesHOTUpdate(relation, keylck_attrs,
! 								   &oldtup, newtup))
  		{
! 			result = HeapTupleMayBeUpdated;
! 			keylocked_update = true;
  		}
  
! 		if (!keylocked_update)
  		{
! 			TransactionId xwait;
! 
! 			/* must copy state data before unlocking buffer */
! 			xwait = HeapTupleHeaderGetXmax(oldtup.t_data);
! 
! 			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
  
  			/*
! 			 * Acquire tuple lock to establish our priority for the tuple (see
! 			 * heap_lock_tuple).  LockTuple will release us when we are
! 			 * next-in-line for the tuple.
! 			 *
! 			 * If we are forced to "start over" below, we keep the tuple lock;
! 			 * this arranges that we stay at the head of the line while rechecking
! 			 * tuple state.
  			 */
! 			if (!have_tuple_lock)
! 			{
! 				LockTuple(relation, &(oldtup.t_self), ExclusiveLock);
! 				have_tuple_lock = true;
! 			}
  
  			/*
! 			 * Sleep until concurrent transaction ends.  Note that we don't care
! 			 * if the locker has an exclusive or shared lock, because we need
! 			 * exclusive.
  			 */
! 
! 			if (infomask & HEAP_XMAX_IS_MULTI)
! 			{
! 				/* wait for multixact */
! 				MultiXactIdWait((MultiXactId) xwait);
! 				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
! 
! 				/*
! 				 * If xwait had just locked the tuple then some other xact could
! 				 * update this tuple before we get to this point.  Check for xmax
! 				 * change, and start over if so.
! 				 */
! 				if (!(oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
! 					!TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
! 										 xwait))
! 					goto l2;
! 
! 				/*
! 				 * You might think the multixact is necessarily done here, but not
! 				 * so: it could have surviving members, namely our own xact or
! 				 * other subxacts of this backend.	It is legal for us to update
! 				 * the tuple in either case, however (the latter case is
! 				 * essentially a situation of upgrading our former shared lock to
! 				 * exclusive).	We don't bother changing the on-disk hint bits
! 				 * since we are about to overwrite the xmax altogether.
! 				 */
! 			}
! 			else
! 			{
! 				/* wait for regular transaction to end */
! 				XactLockTableWait(xwait);
! 				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
! 
! 				/*
! 				 * xwait is done, but if xwait had just locked the tuple then some
! 				 * other xact could update this tuple before we get to this point.
! 				 * Check for xmax change, and start over if so.
! 				 */
! 				if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
! 					!TransactionIdEquals(HeapTupleHeaderGetXmax(oldtup.t_data),
! 										 xwait))
! 					goto l2;
! 
! 				/* Otherwise check if it committed or aborted */
! 				UpdateXmaxHintBits(oldtup.t_data, buffer, xwait);
! 			}
  
  			/*
! 			 * We may overwrite if previous xmax aborted, or if it committed but
! 			 * only locked the tuple without updating it.
  			 */
! 			if (oldtup.t_data->t_infomask & (HEAP_XMAX_INVALID |
! 											 HEAP_IS_LOCKED))
! 				result = HeapTupleMayBeUpdated;
! 			else
! 				result = HeapTupleUpdated;
  		}
  	}
  
  	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
***************
*** 2632,2637 **** l2:
--- 2654,2660 ----
  		if (vmbuffer != InvalidBuffer)
  			ReleaseBuffer(vmbuffer);
  		bms_free(hot_attrs);
+ 		bms_free(keylck_attrs);
  		return result;
  	}
  
***************
*** 2670,2682 **** l2:
  		Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
  	}
  
  	newtup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
  	newtup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
! 	newtup->t_data->t_infomask |= (HEAP_XMAX_INVALID | HEAP_UPDATED);
  	HeapTupleHeaderSetXmin(newtup->t_data, xid);
  	HeapTupleHeaderSetCmin(newtup->t_data, cid);
- 	HeapTupleHeaderSetXmax(newtup->t_data, 0);	/* for cleanliness */
  	newtup->t_tableOid = RelationGetRelid(relation);
  
  	/*
  	 * Replace cid with a combo cid if necessary.  Note that we already put
--- 2693,2721 ----
  		Assert(!(newtup->t_data->t_infomask & HEAP_HASOID));
  	}
  
+ 	/*
+ 	 * Prepare the new tuple with the appropriate initial values of Xmin and
+ 	 * Xmax, as well as initial infomask bits.
+ 	 */
  	newtup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
  	newtup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
! 	newtup->t_data->t_infomask |= HEAP_UPDATED;
  	HeapTupleHeaderSetXmin(newtup->t_data, xid);
  	HeapTupleHeaderSetCmin(newtup->t_data, cid);
  	newtup->t_tableOid = RelationGetRelid(relation);
+ 	if (keylocked_update)
+ 	{
+ 		HeapTupleHeaderSetXmax(newtup->t_data,
+ 							   HeapTupleHeaderGetXmax(oldtup.t_data));
+ 		newtup->t_data->t_infomask |= (oldtup.t_data->t_infomask & 
+ 									   (HEAP_XMAX_IS_MULTI |
+ 										HEAP_XMAX_KEY_LOCK));
+ 	}
+ 	else
+ 	{
+ 		newtup->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ 		HeapTupleHeaderSetXmax(newtup->t_data, 0);	/* for cleanliness */
+ 	}
  
  	/*
  	 * Replace cid with a combo cid if necessary.  Note that we already put
***************
*** 2971,2976 **** l2:
--- 3010,3016 ----
  	}
  
  	bms_free(hot_attrs);
+ 	bms_free(keylck_attrs);
  
  	return HeapTupleMayBeUpdated;
  }
***************
*** 3203,3209 **** heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
  	LOCKMODE	tuple_lock_type;
  	bool		have_tuple_lock = false;
  
! 	tuple_lock_type = (mode == LockTupleShared) ? ShareLock : ExclusiveLock;
  
  	*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
  	LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
--- 3243,3261 ----
  	LOCKMODE	tuple_lock_type;
  	bool		have_tuple_lock = false;
  
! 	switch (mode)
! 	{
! 		case LockTupleShared:
! 		case LockTupleKeylock:
! 			tuple_lock_type = ShareLock;
! 			break;
! 		case LockTupleExclusive:
! 			tuple_lock_type = ExclusiveLock;
! 			break;
! 		default:
! 			elog(ERROR, "invalid tuple lock mode");
! 			tuple_lock_type = 0;	/* keep compiler quiet */
! 	}
  
  	*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
  	LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
***************
*** 3242,3253 **** l3:
  		 * already.  We *must* succeed without trying to take the tuple lock,
  		 * else we will deadlock against anyone waiting to acquire exclusive
  		 * lock.  We don't need to make any state changes in this case.
  		 */
! 		if (mode == LockTupleShared &&
! 			(infomask & HEAP_XMAX_IS_MULTI) &&
  			MultiXactIdIsCurrent((MultiXactId) xwait))
  		{
- 			Assert(infomask & HEAP_XMAX_SHARED_LOCK);
  			/* Probably can't hold tuple lock here, but may as well check */
  			if (have_tuple_lock)
  				UnlockTuple(relation, tid, tuple_lock_type);
--- 3294,3312 ----
  		 * already.  We *must* succeed without trying to take the tuple lock,
  		 * else we will deadlock against anyone waiting to acquire exclusive
  		 * lock.  We don't need to make any state changes in this case.
+ 		 *
+ 		 * Likewise, if we wish to acquire a key lock, and the tuple is already
+ 		 * share- or key-locked by us, we effectively hold the lock already.
+ 		 *
+ 		 * Note we cannot do this if we're asking for share lock and the tuple
+ 		 * is only key-locked.
  		 */
! 		if ((infomask & HEAP_XMAX_IS_MULTI) &&
! 			(((mode == LockTupleShared) && (infomask & HEAP_XMAX_SHARED_LOCK)) ||
! 			 ((mode == LockTupleKeylock) &&
! 			  (infomask & (HEAP_XMAX_SHARED_LOCK | HEAP_XMAX_KEY_LOCK)))) &&
  			MultiXactIdIsCurrent((MultiXactId) xwait))
  		{
  			/* Probably can't hold tuple lock here, but may as well check */
  			if (have_tuple_lock)
  				UnlockTuple(relation, tid, tuple_lock_type);
***************
*** 3293,3298 **** l3:
--- 3352,3372 ----
  			if (!(tuple->t_data->t_infomask & HEAP_XMAX_SHARED_LOCK))
  				goto l3;
  		}
+ 		else if (mode == LockTupleKeylock &&
+ 				 (infomask & (HEAP_XMAX_SHARED_LOCK | HEAP_XMAX_KEY_LOCK)))
+ 		{
+ 			/*
+ 			 * As above: acquiring keylock when there's at least one share- or
+ 			 * key-locker already.  We need not wait for him/them to complete.
+ 			 */
+ 			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+ 
+ 			/*
+ 			 * Make sure it's still an appropriate lock, else start over.
+ 			 */
+ 			if (!(tuple->t_data->t_infomask & (HEAP_XMAX_SHARED_LOCK | HEAP_XMAX_KEY_LOCK)))
+ 				goto l3;
+ 		}
  		else if (infomask & HEAP_XMAX_IS_MULTI)
  		{
  			/* wait for multixact to end */
***************
*** 3400,3407 **** l3:
  	if (!(old_infomask & (HEAP_XMAX_INVALID |
  						  HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_IS_MULTI)) &&
! 		(mode == LockTupleShared ?
  		 (old_infomask & HEAP_IS_LOCKED) :
  		 (old_infomask & HEAP_XMAX_EXCL_LOCK)) &&
  		TransactionIdIsCurrentTransactionId(xmax))
  	{
--- 3474,3483 ----
  	if (!(old_infomask & (HEAP_XMAX_INVALID |
  						  HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_IS_MULTI)) &&
! 		(mode == LockTupleKeylock ?
  		 (old_infomask & HEAP_IS_LOCKED) :
+ 		 mode == LockTupleShared ?
+ 		 (old_infomask & (HEAP_XMAX_SHARED_LOCK | HEAP_XMAX_EXCL_LOCK)) :
  		 (old_infomask & HEAP_XMAX_EXCL_LOCK)) &&
  		TransactionIdIsCurrentTransactionId(xmax))
  	{
***************
*** 3425,3434 **** l3:
  									HEAP_IS_LOCKED |
  									HEAP_MOVED);
  
! 	if (mode == LockTupleShared)
  	{
  		/*
! 		 * If this is the first acquisition of a shared lock in the current
  		 * transaction, set my per-backend OldestMemberMXactId setting. We can
  		 * be certain that the transaction will never become a member of any
  		 * older MultiXactIds than that.  (We have to do this even if we end
--- 3501,3510 ----
  									HEAP_IS_LOCKED |
  									HEAP_MOVED);
  
! 	if (mode == LockTupleShared || mode == LockTupleKeylock)
  	{
  		/*
! 		 * If this is the first acquisition of a keylock or shared lock in the current
  		 * transaction, set my per-backend OldestMemberMXactId setting. We can
  		 * be certain that the transaction will never become a member of any
  		 * older MultiXactIds than that.  (We have to do this even if we end
***************
*** 3437,3443 **** l3:
  		 */
  		MultiXactIdSetOldestMember();
  
! 		new_infomask |= HEAP_XMAX_SHARED_LOCK;
  
  		/*
  		 * Check to see if we need a MultiXactId because there are multiple
--- 3513,3520 ----
  		 */
  		MultiXactIdSetOldestMember();
  
! 		new_infomask |= mode == LockTupleShared ? HEAP_XMAX_SHARED_LOCK :
! 			HEAP_XMAX_KEY_LOCK;
  
  		/*
  		 * Check to see if we need a MultiXactId because there are multiple
***************
*** 3537,3543 **** l3:
  		xlrec.target.tid = tuple->t_self;
  		xlrec.locking_xid = xid;
  		xlrec.xid_is_mxact = ((new_infomask & HEAP_XMAX_IS_MULTI) != 0);
! 		xlrec.shared_lock = (mode == LockTupleShared);
  		rdata[0].data = (char *) &xlrec;
  		rdata[0].len = SizeOfHeapLock;
  		rdata[0].buffer = InvalidBuffer;
--- 3614,3620 ----
  		xlrec.target.tid = tuple->t_self;
  		xlrec.locking_xid = xid;
  		xlrec.xid_is_mxact = ((new_infomask & HEAP_XMAX_IS_MULTI) != 0);
! 		xlrec.lock_strength = mode;
  		rdata[0].data = (char *) &xlrec;
  		rdata[0].len = SizeOfHeapLock;
  		rdata[0].buffer = InvalidBuffer;
***************
*** 4987,4996 **** heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record)
  						  HEAP_MOVED);
  	if (xlrec->xid_is_mxact)
  		htup->t_infomask |= HEAP_XMAX_IS_MULTI;
! 	if (xlrec->shared_lock)
  		htup->t_infomask |= HEAP_XMAX_SHARED_LOCK;
  	else
  		htup->t_infomask |= HEAP_XMAX_EXCL_LOCK;
  	HeapTupleHeaderClearHotUpdated(htup);
  	HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
  	HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
--- 5064,5078 ----
  						  HEAP_MOVED);
  	if (xlrec->xid_is_mxact)
  		htup->t_infomask |= HEAP_XMAX_IS_MULTI;
! 	if (xlrec->lock_strength == LockTupleShared)
  		htup->t_infomask |= HEAP_XMAX_SHARED_LOCK;
+ 	else if (xlrec->lock_strength == LockTupleKeylock)
+ 		htup->t_infomask |= HEAP_XMAX_KEY_LOCK;
  	else
+ 	{
+ 		Assert(xlrec->lock_strength == LockTupleExclusive);
  		htup->t_infomask |= HEAP_XMAX_EXCL_LOCK;
+ 	}
  	HeapTupleHeaderClearHotUpdated(htup);
  	HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);
  	HeapTupleHeaderSetCmax(htup, FirstCommandId, false);
***************
*** 5194,5203 **** heap_desc(StringInfo buf, uint8 xl_info, char *rec)
  	{
  		xl_heap_lock *xlrec = (xl_heap_lock *) rec;
  
! 		if (xlrec->shared_lock)
  			appendStringInfo(buf, "shared_lock: ");
! 		else
  			appendStringInfo(buf, "exclusive_lock: ");
  		if (xlrec->xid_is_mxact)
  			appendStringInfo(buf, "mxid ");
  		else
--- 5276,5289 ----
  	{
  		xl_heap_lock *xlrec = (xl_heap_lock *) rec;
  
! 		if (xlrec->lock_strength == LockTupleShared)
  			appendStringInfo(buf, "shared_lock: ");
! 		else if (xlrec->lock_strength == LockTupleKeylock)
! 			appendStringInfo(buf, "key_lock: ");
! 		else if (xlrec->lock_strength == LockTupleExclusive)
  			appendStringInfo(buf, "exclusive_lock: ");
+ 		else
+ 			appendStringInfo(buf, "unknown_type_lock: ");
  		if (xlrec->xid_is_mxact)
  			appendStringInfo(buf, "mxid ");
  		else
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2986,2992 **** reindex_relation(Oid relid, int flags)
  
  	/* Ensure rd_indexattr is valid; see comments for RelationSetIndexList */
  	if (is_pg_class)
! 		(void) RelationGetIndexAttrBitmap(rel);
  
  	PG_TRY();
  	{
--- 2986,2992 ----
  
  	/* Ensure rd_indexattr is valid; see comments for RelationSetIndexList */
  	if (is_pg_class)
! 		(void) RelationGetIndexAttrBitmap(rel, false);
  
  	PG_TRY();
  	{
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
***************
*** 801,807 **** InitPlan(QueryDesc *queryDesc, int eflags)
  	}
  
  	/*
! 	 * Similarly, we have to lock relations selected FOR UPDATE/FOR SHARE
  	 * before we initialize the plan tree, else we'd be risking lock upgrades.
  	 * While we are at it, build the ExecRowMark list.
  	 */
--- 801,807 ----
  	}
  
  	/*
! 	 * Similarly, we have to lock relations selected FOR UPDATE/FOR SHARE/KEY LOCK
  	 * before we initialize the plan tree, else we'd be risking lock upgrades.
  	 * While we are at it, build the ExecRowMark list.
  	 */
***************
*** 821,826 **** InitPlan(QueryDesc *queryDesc, int eflags)
--- 821,827 ----
  		{
  			case ROW_MARK_EXCLUSIVE:
  			case ROW_MARK_SHARE:
+ 			case ROW_MARK_KEYLOCK:
  				relid = getrelid(rc->rti, rangeTable);
  				relation = heap_open(relid, RowShareLock);
  				break;
*** a/src/backend/executor/nodeLockRows.c
--- b/src/backend/executor/nodeLockRows.c
***************
*** 111,120 **** lnext:
  		tuple.t_self = *((ItemPointer) DatumGetPointer(datum));
  
  		/* okay, try to lock the tuple */
! 		if (erm->markType == ROW_MARK_EXCLUSIVE)
! 			lockmode = LockTupleExclusive;
! 		else
! 			lockmode = LockTupleShared;
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
--- 111,132 ----
  		tuple.t_self = *((ItemPointer) DatumGetPointer(datum));
  
  		/* okay, try to lock the tuple */
! 		switch (erm->markType)
! 		{
! 			case ROW_MARK_EXCLUSIVE:
! 				lockmode = LockTupleExclusive;
! 				break;
! 			case ROW_MARK_SHARE:
! 				lockmode = LockTupleShared;
! 				break;
! 			case ROW_MARK_KEYLOCK:
! 				lockmode = LockTupleKeylock;
! 				break;
! 			default:
! 				elog(ERROR, "unsupported rowmark type");
! 				lockmode = LockTupleExclusive;	/* keep compiler quiet */
! 				break;
! 		}
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 2008,2014 **** _copyRowMarkClause(RowMarkClause *from)
  	RowMarkClause *newnode = makeNode(RowMarkClause);
  
  	COPY_SCALAR_FIELD(rti);
! 	COPY_SCALAR_FIELD(forUpdate);
  	COPY_SCALAR_FIELD(noWait);
  	COPY_SCALAR_FIELD(pushedDown);
  
--- 2008,2014 ----
  	RowMarkClause *newnode = makeNode(RowMarkClause);
  
  	COPY_SCALAR_FIELD(rti);
! 	COPY_SCALAR_FIELD(strength);
  	COPY_SCALAR_FIELD(noWait);
  	COPY_SCALAR_FIELD(pushedDown);
  
***************
*** 2366,2372 **** _copyLockingClause(LockingClause *from)
  	LockingClause *newnode = makeNode(LockingClause);
  
  	COPY_NODE_FIELD(lockedRels);
! 	COPY_SCALAR_FIELD(forUpdate);
  	COPY_SCALAR_FIELD(noWait);
  
  	return newnode;
--- 2366,2372 ----
  	LockingClause *newnode = makeNode(LockingClause);
  
  	COPY_NODE_FIELD(lockedRels);
! 	COPY_SCALAR_FIELD(strength);
  	COPY_SCALAR_FIELD(noWait);
  
  	return newnode;
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
***************
*** 2291,2297 **** static bool
  _equalLockingClause(LockingClause *a, LockingClause *b)
  {
  	COMPARE_NODE_FIELD(lockedRels);
! 	COMPARE_SCALAR_FIELD(forUpdate);
  	COMPARE_SCALAR_FIELD(noWait);
  
  	return true;
--- 2291,2297 ----
  _equalLockingClause(LockingClause *a, LockingClause *b)
  {
  	COMPARE_NODE_FIELD(lockedRels);
! 	COMPARE_SCALAR_FIELD(strength);
  	COMPARE_SCALAR_FIELD(noWait);
  
  	return true;
***************
*** 2362,2368 **** static bool
  _equalRowMarkClause(RowMarkClause *a, RowMarkClause *b)
  {
  	COMPARE_SCALAR_FIELD(rti);
! 	COMPARE_SCALAR_FIELD(forUpdate);
  	COMPARE_SCALAR_FIELD(noWait);
  	COMPARE_SCALAR_FIELD(pushedDown);
  
--- 2362,2368 ----
  _equalRowMarkClause(RowMarkClause *a, RowMarkClause *b)
  {
  	COMPARE_SCALAR_FIELD(rti);
! 	COMPARE_SCALAR_FIELD(strength);
  	COMPARE_SCALAR_FIELD(noWait);
  	COMPARE_SCALAR_FIELD(pushedDown);
  
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 2070,2076 **** _outLockingClause(StringInfo str, LockingClause *node)
  	WRITE_NODE_TYPE("LOCKINGCLAUSE");
  
  	WRITE_NODE_FIELD(lockedRels);
! 	WRITE_BOOL_FIELD(forUpdate);
  	WRITE_BOOL_FIELD(noWait);
  }
  
--- 2070,2076 ----
  	WRITE_NODE_TYPE("LOCKINGCLAUSE");
  
  	WRITE_NODE_FIELD(lockedRels);
! 	WRITE_ENUM_FIELD(strength, LockClauseStrength);
  	WRITE_BOOL_FIELD(noWait);
  }
  
***************
*** 2247,2253 **** _outRowMarkClause(StringInfo str, RowMarkClause *node)
  	WRITE_NODE_TYPE("ROWMARKCLAUSE");
  
  	WRITE_UINT_FIELD(rti);
! 	WRITE_BOOL_FIELD(forUpdate);
  	WRITE_BOOL_FIELD(noWait);
  	WRITE_BOOL_FIELD(pushedDown);
  }
--- 2247,2253 ----
  	WRITE_NODE_TYPE("ROWMARKCLAUSE");
  
  	WRITE_UINT_FIELD(rti);
! 	WRITE_ENUM_FIELD(strength, LockClauseStrength);
  	WRITE_BOOL_FIELD(noWait);
  	WRITE_BOOL_FIELD(pushedDown);
  }
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
***************
*** 301,307 **** _readRowMarkClause(void)
  	READ_LOCALS(RowMarkClause);
  
  	READ_UINT_FIELD(rti);
! 	READ_BOOL_FIELD(forUpdate);
  	READ_BOOL_FIELD(noWait);
  	READ_BOOL_FIELD(pushedDown);
  
--- 301,307 ----
  	READ_LOCALS(RowMarkClause);
  
  	READ_UINT_FIELD(rti);
! 	READ_ENUM_FIELD(strength, LockClauseStrength);
  	READ_BOOL_FIELD(noWait);
  	READ_BOOL_FIELD(pushedDown);
  
*** a/src/backend/optimizer/plan/initsplan.c
--- b/src/backend/optimizer/plan/initsplan.c
***************
*** 563,573 **** make_outerjoininfo(PlannerInfo *root,
  	Assert(jointype != JOIN_RIGHT);
  
  	/*
! 	 * Presently the executor cannot support FOR UPDATE/SHARE marking of rels
  	 * appearing on the nullable side of an outer join. (It's somewhat unclear
  	 * what that would mean, anyway: what should we mark when a result row is
  	 * generated from no element of the nullable relation?)  So, complain if
! 	 * any nullable rel is FOR UPDATE/SHARE.
  	 *
  	 * You might be wondering why this test isn't made far upstream in the
  	 * parser.	It's because the parser hasn't got enough info --- consider
--- 563,573 ----
  	Assert(jointype != JOIN_RIGHT);
  
  	/*
! 	 * Presently the executor cannot support FOR UPDATE/SHARE/KEY LOCK marking of rels
  	 * appearing on the nullable side of an outer join. (It's somewhat unclear
  	 * what that would mean, anyway: what should we mark when a result row is
  	 * generated from no element of the nullable relation?)  So, complain if
! 	 * any nullable rel is FOR UPDATE/SHARE/KEY LOCK.
  	 *
  	 * You might be wondering why this test isn't made far upstream in the
  	 * parser.	It's because the parser hasn't got enough info --- consider
***************
*** 585,591 **** make_outerjoininfo(PlannerInfo *root,
  			(jointype == JOIN_FULL && bms_is_member(rc->rti, left_rels)))
  			ereport(ERROR,
  					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 					 errmsg("SELECT FOR UPDATE/SHARE cannot be applied to the nullable side of an outer join")));
  	}
  
  	sjinfo->syn_lefthand = left_rels;
--- 585,591 ----
  			(jointype == JOIN_FULL && bms_is_member(rc->rti, left_rels)))
  			ereport(ERROR,
  					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 					 errmsg("SELECT FOR UPDATE/SHARE/KEY LOCK cannot be applied to the nullable side of an outer join")));
  	}
  
  	sjinfo->syn_lefthand = left_rels;
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
***************
*** 1837,1843 **** preprocess_rowmarks(PlannerInfo *root)
  	if (parse->rowMarks)
  	{
  		/*
! 		 * We've got trouble if FOR UPDATE/SHARE appears inside grouping,
  		 * since grouping renders a reference to individual tuple CTIDs
  		 * invalid.  This is also checked at parse time, but that's
  		 * insufficient because of rule substitution, query pullup, etc.
--- 1837,1843 ----
  	if (parse->rowMarks)
  	{
  		/*
! 		 * We've got trouble if FOR UPDATE/SHARE/KEY LOCK appears inside grouping,
  		 * since grouping renders a reference to individual tuple CTIDs
  		 * invalid.  This is also checked at parse time, but that's
  		 * insufficient because of rule substitution, query pullup, etc.
***************
*** 1847,1853 **** preprocess_rowmarks(PlannerInfo *root)
  	else
  	{
  		/*
! 		 * We only need rowmarks for UPDATE, DELETE, or FOR UPDATE/SHARE.
  		 */
  		if (parse->commandType != CMD_UPDATE &&
  			parse->commandType != CMD_DELETE)
--- 1847,1853 ----
  	else
  	{
  		/*
! 		 * We only need rowmarks for UPDATE, DELETE, or FOR UPDATE/SHARE/KEY LOCK.
  		 */
  		if (parse->commandType != CMD_UPDATE &&
  			parse->commandType != CMD_DELETE)
***************
*** 1857,1863 **** preprocess_rowmarks(PlannerInfo *root)
  	/*
  	 * We need to have rowmarks for all base relations except the target. We
  	 * make a bitmapset of all base rels and then remove the items we don't
! 	 * need or have FOR UPDATE/SHARE marks for.
  	 */
  	rels = get_base_rel_indexes((Node *) parse->jointree);
  	if (parse->resultRelation)
--- 1857,1863 ----
  	/*
  	 * We need to have rowmarks for all base relations except the target. We
  	 * make a bitmapset of all base rels and then remove the items we don't
! 	 * need or have FOR UPDATE/SHARE/KEY LOCK marks for.
  	 */
  	rels = get_base_rel_indexes((Node *) parse->jointree);
  	if (parse->resultRelation)
***************
*** 1894,1903 **** preprocess_rowmarks(PlannerInfo *root)
  		newrc = makeNode(PlanRowMark);
  		newrc->rti = newrc->prti = rc->rti;
  		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
! 		if (rc->forUpdate)
! 			newrc->markType = ROW_MARK_EXCLUSIVE;
! 		else
! 			newrc->markType = ROW_MARK_SHARE;
  		newrc->noWait = rc->noWait;
  		newrc->isParent = false;
  
--- 1894,1913 ----
  		newrc = makeNode(PlanRowMark);
  		newrc->rti = newrc->prti = rc->rti;
  		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
! 		switch (rc->strength)
! 		{
! 			case LCS_FORUPDATE:
! 				newrc->markType = ROW_MARK_EXCLUSIVE;
! 				break;
! 			case LCS_FORSHARE:
! 				newrc->markType = ROW_MARK_SHARE;
! 				break;
! 			case LCS_FORKEYLOCK:
! 				newrc->markType = ROW_MARK_KEYLOCK;
! 				break;
! 			default:
! 				elog(ERROR, "unsupported rowmark type %d", rc->strength);
! 		}
  		newrc->noWait = rc->noWait;
  		newrc->isParent = false;
  
*** a/src/backend/parser/analyze.c
--- b/src/backend/parser/analyze.c
***************
*** 2310,2316 **** transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc,
  	/* make a clause we can pass down to subqueries to select all rels */
  	allrels = makeNode(LockingClause);
  	allrels->lockedRels = NIL;	/* indicates all rels */
! 	allrels->forUpdate = lc->forUpdate;
  	allrels->noWait = lc->noWait;
  
  	if (lockedRels == NIL)
--- 2310,2316 ----
  	/* make a clause we can pass down to subqueries to select all rels */
  	allrels = makeNode(LockingClause);
  	allrels->lockedRels = NIL;	/* indicates all rels */
! 	allrels->strength = lc->strength;
  	allrels->noWait = lc->noWait;
  
  	if (lockedRels == NIL)
***************
*** 2329,2340 **** transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc,
  					if (rte->relkind == RELKIND_FOREIGN_TABLE)
  						break;
  					applyLockingClause(qry, i,
! 									   lc->forUpdate, lc->noWait, pushedDown);
  					rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  					break;
  				case RTE_SUBQUERY:
  					applyLockingClause(qry, i,
! 									   lc->forUpdate, lc->noWait, pushedDown);
  
  					/*
  					 * FOR UPDATE/SHARE of subquery is propagated to all of
--- 2329,2340 ----
  					if (rte->relkind == RELKIND_FOREIGN_TABLE)
  						break;
  					applyLockingClause(qry, i,
! 									   lc->strength, lc->noWait, pushedDown);
  					rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  					break;
  				case RTE_SUBQUERY:
  					applyLockingClause(qry, i,
! 									   lc->strength, lc->noWait, pushedDown);
  
  					/*
  					 * FOR UPDATE/SHARE of subquery is propagated to all of
***************
*** 2384,2396 **** transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc,
  											 rte->eref->aliasname),
  									  parser_errposition(pstate, thisrel->location)));
  							applyLockingClause(qry, i,
! 											   lc->forUpdate, lc->noWait,
  											   pushedDown);
  							rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  							break;
  						case RTE_SUBQUERY:
  							applyLockingClause(qry, i,
! 											   lc->forUpdate, lc->noWait,
  											   pushedDown);
  							/* see comment above */
  							transformLockingClause(pstate, rte->subquery,
--- 2384,2396 ----
  											 rte->eref->aliasname),
  									  parser_errposition(pstate, thisrel->location)));
  							applyLockingClause(qry, i,
! 											   lc->strength, lc->noWait,
  											   pushedDown);
  							rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  							break;
  						case RTE_SUBQUERY:
  							applyLockingClause(qry, i,
! 											   lc->strength, lc->noWait,
  											   pushedDown);
  							/* see comment above */
  							transformLockingClause(pstate, rte->subquery,
***************
*** 2443,2449 **** transformLockingClause(ParseState *pstate, Query *qry, LockingClause *lc,
   */
  void
  applyLockingClause(Query *qry, Index rtindex,
! 				   bool forUpdate, bool noWait, bool pushedDown)
  {
  	RowMarkClause *rc;
  
--- 2443,2449 ----
   */
  void
  applyLockingClause(Query *qry, Index rtindex,
! 				   LockClauseStrength strength, bool noWait, bool pushedDown)
  {
  	RowMarkClause *rc;
  
***************
*** 2455,2464 **** applyLockingClause(Query *qry, Index rtindex,
  	if ((rc = get_parse_rowmark(qry, rtindex)) != NULL)
  	{
  		/*
! 		 * If the same RTE is specified both FOR UPDATE and FOR SHARE, treat
! 		 * it as FOR UPDATE.  (Reasonable, since you can't take both a shared
! 		 * and exclusive lock at the same time; it'll end up being exclusive
! 		 * anyway.)
  		 *
  		 * We also consider that NOWAIT wins if it's specified both ways. This
  		 * is a bit more debatable but raising an error doesn't seem helpful.
--- 2455,2464 ----
  	if ((rc = get_parse_rowmark(qry, rtindex)) != NULL)
  	{
  		/*
! 		 * If the same RTE is specified for more than one locking strength,
! 		 * treat is as the strongest.  (Reasonable, since you can't take both a
! 		 * shared and exclusive lock at the same time; it'll end up being
! 		 * exclusive anyway.)
  		 *
  		 * We also consider that NOWAIT wins if it's specified both ways. This
  		 * is a bit more debatable but raising an error doesn't seem helpful.
***************
*** 2467,2473 **** applyLockingClause(Query *qry, Index rtindex,
  		 *
  		 * And of course pushedDown becomes false if any clause is explicit.
  		 */
! 		rc->forUpdate |= forUpdate;
  		rc->noWait |= noWait;
  		rc->pushedDown &= pushedDown;
  		return;
--- 2467,2473 ----
  		 *
  		 * And of course pushedDown becomes false if any clause is explicit.
  		 */
! 		rc->strength = Max(rc->strength, strength);
  		rc->noWait |= noWait;
  		rc->pushedDown &= pushedDown;
  		return;
***************
*** 2476,2482 **** applyLockingClause(Query *qry, Index rtindex,
  	/* Make a new RowMarkClause */
  	rc = makeNode(RowMarkClause);
  	rc->rti = rtindex;
! 	rc->forUpdate = forUpdate;
  	rc->noWait = noWait;
  	rc->pushedDown = pushedDown;
  	qry->rowMarks = lappend(qry->rowMarks, rc);
--- 2476,2482 ----
  	/* Make a new RowMarkClause */
  	rc = makeNode(RowMarkClause);
  	rc->rti = rtindex;
! 	rc->strength = strength;
  	rc->noWait = noWait;
  	rc->pushedDown = pushedDown;
  	qry->rowMarks = lappend(qry->rowMarks, rc);
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
***************
*** 8760,8766 **** for_locking_item:
  				{
  					LockingClause *n = makeNode(LockingClause);
  					n->lockedRels = $3;
! 					n->forUpdate = TRUE;
  					n->noWait = $4;
  					$$ = (Node *) n;
  				}
--- 8760,8766 ----
  				{
  					LockingClause *n = makeNode(LockingClause);
  					n->lockedRels = $3;
! 					n->strength = LCS_FORUPDATE;
  					n->noWait = $4;
  					$$ = (Node *) n;
  				}
***************
*** 8768,8777 **** for_locking_item:
  				{
  					LockingClause *n = makeNode(LockingClause);
  					n->lockedRels = $3;
! 					n->forUpdate = FALSE;
  					n->noWait = $4;
  					$$ = (Node *) n;
  				}
  		;
  
  locked_rels_list:
--- 8768,8785 ----
  				{
  					LockingClause *n = makeNode(LockingClause);
  					n->lockedRels = $3;
! 					n->strength = LCS_FORSHARE;
  					n->noWait = $4;
  					$$ = (Node *) n;
  				}
+ 			| FOR KEY LOCK_P locked_rels_list opt_nowait
+ 				{
+ 					LockingClause *n = makeNode(LockingClause);
+ 					n->lockedRels = $4;
+ 					n->strength = LCS_FORKEYLOCK;
+ 					n->noWait = $5;
+ 					$$ = (Node *) n;
+ 				}
  		;
  
  locked_rels_list:
*** a/src/backend/rewrite/rewriteHandler.c
--- b/src/backend/rewrite/rewriteHandler.c
***************
*** 56,62 **** static void rewriteValuesRTE(RangeTblEntry *rte, Relation target_relation,
  static void rewriteTargetListUD(Query *parsetree, RangeTblEntry *target_rte,
  					Relation target_relation);
  static void markQueryForLocking(Query *qry, Node *jtnode,
! 					bool forUpdate, bool noWait, bool pushedDown);
  static List *matchLocks(CmdType event, RuleLock *rulelocks,
  		   int varno, Query *parsetree);
  static Query *fireRIRrules(Query *parsetree, List *activeRIRs,
--- 56,62 ----
  static void rewriteTargetListUD(Query *parsetree, RangeTblEntry *target_rte,
  					Relation target_relation);
  static void markQueryForLocking(Query *qry, Node *jtnode,
! 					LockClauseStrength strength, bool noWait, bool pushedDown);
  static List *matchLocks(CmdType event, RuleLock *rulelocks,
  		   int varno, Query *parsetree);
  static Query *fireRIRrules(Query *parsetree, List *activeRIRs,
***************
*** 1402,1409 **** ApplyRetrieveRule(Query *parsetree,
  	rte->modifiedCols = NULL;
  
  	/*
! 	 * If FOR UPDATE/SHARE of view, mark all the contained tables as implicit
! 	 * FOR UPDATE/SHARE, the same as the parser would have done if the view's
  	 * subquery had been written out explicitly.
  	 *
  	 * Note: we don't consider forUpdatePushedDown here; such marks will be
--- 1402,1409 ----
  	rte->modifiedCols = NULL;
  
  	/*
! 	 * If FOR UPDATE/SHARE/KEY LOCK of view, mark all the contained tables as implicit
! 	 * FOR UPDATE/SHARE/KEY LOCK, the same as the parser would have done if the view's
  	 * subquery had been written out explicitly.
  	 *
  	 * Note: we don't consider forUpdatePushedDown here; such marks will be
***************
*** 1411,1423 **** ApplyRetrieveRule(Query *parsetree,
  	 */
  	if (rc != NULL)
  		markQueryForLocking(rule_action, (Node *) rule_action->jointree,
! 							rc->forUpdate, rc->noWait, true);
  
  	return parsetree;
  }
  
  /*
!  * Recursively mark all relations used by a view as FOR UPDATE/SHARE.
   *
   * This may generate an invalid query, eg if some sub-query uses an
   * aggregate.  We leave it to the planner to detect that.
--- 1411,1423 ----
  	 */
  	if (rc != NULL)
  		markQueryForLocking(rule_action, (Node *) rule_action->jointree,
! 							rc->strength, rc->noWait, true);
  
  	return parsetree;
  }
  
  /*
!  * Recursively mark all relations used by a view as FOR UPDATE/SHARE/KEY LOCK.
   *
   * This may generate an invalid query, eg if some sub-query uses an
   * aggregate.  We leave it to the planner to detect that.
***************
*** 1429,1435 **** ApplyRetrieveRule(Query *parsetree,
   */
  static void
  markQueryForLocking(Query *qry, Node *jtnode,
! 					bool forUpdate, bool noWait, bool pushedDown)
  {
  	if (jtnode == NULL)
  		return;
--- 1429,1435 ----
   */
  static void
  markQueryForLocking(Query *qry, Node *jtnode,
! 					LockClauseStrength strength, bool noWait, bool pushedDown)
  {
  	if (jtnode == NULL)
  		return;
***************
*** 1443,1458 **** markQueryForLocking(Query *qry, Node *jtnode,
  			/* ignore foreign tables */
  			if (rte->relkind != RELKIND_FOREIGN_TABLE)
  			{
! 				applyLockingClause(qry, rti, forUpdate, noWait, pushedDown);
  				rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  			}
  		}
  		else if (rte->rtekind == RTE_SUBQUERY)
  		{
! 			applyLockingClause(qry, rti, forUpdate, noWait, pushedDown);
! 			/* FOR UPDATE/SHARE of subquery is propagated to subquery's rels */
  			markQueryForLocking(rte->subquery, (Node *) rte->subquery->jointree,
! 								forUpdate, noWait, true);
  		}
  		/* other RTE types are unaffected by FOR UPDATE */
  	}
--- 1443,1458 ----
  			/* ignore foreign tables */
  			if (rte->relkind != RELKIND_FOREIGN_TABLE)
  			{
! 				applyLockingClause(qry, rti, strength, noWait, pushedDown);
  				rte->requiredPerms |= ACL_SELECT_FOR_UPDATE;
  			}
  		}
  		else if (rte->rtekind == RTE_SUBQUERY)
  		{
! 			applyLockingClause(qry, rti, strength, noWait, pushedDown);
! 			/* FOR UPDATE/SHARE/KEY LOCK of subquery is propagated to subquery's rels */
  			markQueryForLocking(rte->subquery, (Node *) rte->subquery->jointree,
! 								strength, noWait, true);
  		}
  		/* other RTE types are unaffected by FOR UPDATE */
  	}
***************
*** 1462,1475 **** markQueryForLocking(Query *qry, Node *jtnode,
  		ListCell   *l;
  
  		foreach(l, f->fromlist)
! 			markQueryForLocking(qry, lfirst(l), forUpdate, noWait, pushedDown);
  	}
  	else if (IsA(jtnode, JoinExpr))
  	{
  		JoinExpr   *j = (JoinExpr *) jtnode;
  
! 		markQueryForLocking(qry, j->larg, forUpdate, noWait, pushedDown);
! 		markQueryForLocking(qry, j->rarg, forUpdate, noWait, pushedDown);
  	}
  	else
  		elog(ERROR, "unrecognized node type: %d",
--- 1462,1475 ----
  		ListCell   *l;
  
  		foreach(l, f->fromlist)
! 			markQueryForLocking(qry, lfirst(l), strength, noWait, pushedDown);
  	}
  	else if (IsA(jtnode, JoinExpr))
  	{
  		JoinExpr   *j = (JoinExpr *) jtnode;
  
! 		markQueryForLocking(qry, j->larg, strength, noWait, pushedDown);
! 		markQueryForLocking(qry, j->rarg, strength, noWait, pushedDown);
  	}
  	else
  		elog(ERROR, "unrecognized node type: %d",
*** a/src/backend/tcop/utility.c
--- b/src/backend/tcop/utility.c
***************
*** 130,136 **** CommandIsReadOnly(Node *parsetree)
  				if (stmt->intoClause != NULL)
  					return false;		/* SELECT INTO */
  				else if (stmt->rowMarks != NIL)
! 					return false;		/* SELECT FOR UPDATE/SHARE */
  				else if (stmt->hasModifyingCTE)
  					return false;		/* data-modifying CTE */
  				else
--- 130,136 ----
  				if (stmt->intoClause != NULL)
  					return false;		/* SELECT INTO */
  				else if (stmt->rowMarks != NIL)
! 					return false;		/* SELECT FOR UPDATE/SHARE/KEY LOCK */
  				else if (stmt->hasModifyingCTE)
  					return false;		/* data-modifying CTE */
  				else
***************
*** 2181,2190 **** CreateCommandTag(Node *parsetree)
  						else if (stmt->rowMarks != NIL)
  						{
  							/* not 100% but probably close enough */
! 							if (((PlanRowMark *) linitial(stmt->rowMarks))->markType == ROW_MARK_EXCLUSIVE)
! 								tag = "SELECT FOR UPDATE";
! 							else
! 								tag = "SELECT FOR SHARE";
  						}
  						else
  							tag = "SELECT";
--- 2181,2201 ----
  						else if (stmt->rowMarks != NIL)
  						{
  							/* not 100% but probably close enough */
! 							switch (((RowMarkClause *) linitial(stmt->rowMarks))->strength)
! 							{
! 								case LCS_FORUPDATE:
! 									tag = "SELECT FOR UPDATE";
! 									break;
! 								case LCS_FORSHARE:
! 									tag = "SELECT FOR SHARE";
! 									break;
! 								case LCS_FORKEYLOCK:
! 									tag = "SELECT FOR KEY LOCK";
! 									break;
! 								default:
! 									tag =  "???";
! 									break;
! 							}
  						}
  						else
  							tag = "SELECT";
***************
*** 2231,2240 **** CreateCommandTag(Node *parsetree)
  						else if (stmt->rowMarks != NIL)
  						{
  							/* not 100% but probably close enough */
! 							if (((RowMarkClause *) linitial(stmt->rowMarks))->forUpdate)
! 								tag = "SELECT FOR UPDATE";
! 							else
! 								tag = "SELECT FOR SHARE";
  						}
  						else
  							tag = "SELECT";
--- 2242,2262 ----
  						else if (stmt->rowMarks != NIL)
  						{
  							/* not 100% but probably close enough */
! 							switch (((RowMarkClause *) linitial(stmt->rowMarks))->strength)
! 							{
! 								case LCS_FORUPDATE:
! 									tag = "SELECT FOR UPDATE";
! 									break;
! 								case LCS_FORSHARE:
! 									tag = "SELECT FOR SHARE";
! 									break;
! 								case LCS_FORKEYLOCK:
! 									tag = "SELECT FOR KEY LOCK";
! 									break;
! 								default:
! 									tag =  "???";
! 									break;
! 							}
  						}
  						else
  							tag = "SELECT";
*** a/src/backend/utils/adt/ri_triggers.c
--- b/src/backend/utils/adt/ri_triggers.c
***************
*** 309,315 **** RI_FKey_check(PG_FUNCTION_ARGS)
  	 * Get the relation descriptors of the FK and PK tables.
  	 *
  	 * pk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR SHARE will get on it.
  	 */
  	fk_rel = trigdata->tg_relation;
  	pk_rel = heap_open(riinfo.pk_relid, RowShareLock);
--- 309,315 ----
  	 * Get the relation descriptors of the FK and PK tables.
  	 *
  	 * pk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR KEY LOCK will get on it.
  	 */
  	fk_rel = trigdata->tg_relation;
  	pk_rel = heap_open(riinfo.pk_relid, RowShareLock);
***************
*** 339,350 **** RI_FKey_check(PG_FUNCTION_ARGS)
  
  			/* ---------
  			 * The query string built is
! 			 *	SELECT 1 FROM ONLY <pktable>
  			 * ----------
  			 */
  			quoteRelationName(pkrelname, pk_rel);
  			snprintf(querystr, sizeof(querystr),
! 					 "SELECT 1 FROM ONLY %s x FOR SHARE OF x",
  					 pkrelname);
  
  			/* Prepare and save the plan */
--- 339,350 ----
  
  			/* ---------
  			 * The query string built is
! 			 *	SELECT 1 FROM ONLY <pktable> x FOR KEY LOCK OF x
  			 * ----------
  			 */
  			quoteRelationName(pkrelname, pk_rel);
  			snprintf(querystr, sizeof(querystr),
! 					 "SELECT 1 FROM ONLY %s x FOR KEY LOCK OF x",
  					 pkrelname);
  
  			/* Prepare and save the plan */
***************
*** 464,470 **** RI_FKey_check(PG_FUNCTION_ARGS)
  
  		/* ----------
  		 * The query string built is
! 		 *	SELECT 1 FROM ONLY <pktable> WHERE pkatt1 = $1 [AND ...] FOR SHARE
  		 * The type id's for the $ parameters are those of the
  		 * corresponding FK attributes.
  		 * ----------
--- 464,471 ----
  
  		/* ----------
  		 * The query string built is
! 		 *	SELECT 1 FROM ONLY <pktable> x WHERE pkatt1 = $1 [AND ...]
! 		 *	       FOR KEY LOCK OF x
  		 * The type id's for the $ parameters are those of the
  		 * corresponding FK attributes.
  		 * ----------
***************
*** 488,494 **** RI_FKey_check(PG_FUNCTION_ARGS)
  			querysep = "AND";
  			queryoids[i] = fk_type;
  		}
! 		appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  		/* Prepare and save the plan */
  		qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
--- 489,495 ----
  			querysep = "AND";
  			queryoids[i] = fk_type;
  		}
! 		appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  		/* Prepare and save the plan */
  		qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
***************
*** 626,632 **** ri_Check_Pk_Match(Relation pk_rel, Relation fk_rel,
  
  		/* ----------
  		 * The query string built is
! 		 *	SELECT 1 FROM ONLY <pktable> WHERE pkatt1 = $1 [AND ...] FOR SHARE
  		 * The type id's for the $ parameters are those of the
  		 * PK attributes themselves.
  		 * ----------
--- 627,634 ----
  
  		/* ----------
  		 * The query string built is
! 		 *	SELECT 1 FROM ONLY <pktable> x WHERE pkatt1 = $1 [AND ...]
! 		 *	       FOR KEY LOCK OF x
  		 * The type id's for the $ parameters are those of the
  		 * PK attributes themselves.
  		 * ----------
***************
*** 649,655 **** ri_Check_Pk_Match(Relation pk_rel, Relation fk_rel,
  			querysep = "AND";
  			queryoids[i] = pk_type;
  		}
! 		appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  		/* Prepare and save the plan */
  		qplan = ri_PlanCheck(querybuf.data, riinfo->nkeys, queryoids,
--- 651,657 ----
  			querysep = "AND";
  			queryoids[i] = pk_type;
  		}
! 		appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  		/* Prepare and save the plan */
  		qplan = ri_PlanCheck(querybuf.data, riinfo->nkeys, queryoids,
***************
*** 713,719 **** RI_FKey_noaction_del(PG_FUNCTION_ARGS)
  	 * Get the relation descriptors of the FK and PK tables and the old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR SHARE will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
--- 715,721 ----
  	 * Get the relation descriptors of the FK and PK tables and the old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR KEY LOCK will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
***************
*** 781,787 **** RI_FKey_noaction_del(PG_FUNCTION_ARGS)
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> WHERE $1 = fkatt1 [AND ...]
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
--- 783,790 ----
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> x WHERE $1 = fkatt1 [AND ...]
! 				 *	       FOR KEY LOCK OF x
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
***************
*** 806,812 **** RI_FKey_noaction_del(PG_FUNCTION_ARGS)
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
--- 809,815 ----
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
***************
*** 891,897 **** RI_FKey_noaction_upd(PG_FUNCTION_ARGS)
  	 * old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR SHARE will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
--- 894,900 ----
  	 * old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR KEY LOCK will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
***************
*** 994,1000 **** RI_FKey_noaction_upd(PG_FUNCTION_ARGS)
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
--- 997,1003 ----
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
***************
*** 1432,1438 **** RI_FKey_restrict_del(PG_FUNCTION_ARGS)
  	 * Get the relation descriptors of the FK and PK tables and the old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR SHARE will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
--- 1435,1441 ----
  	 * Get the relation descriptors of the FK and PK tables and the old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR KEY LOCK will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
***************
*** 1490,1496 **** RI_FKey_restrict_del(PG_FUNCTION_ARGS)
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> WHERE $1 = fkatt1 [AND ...]
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
--- 1493,1500 ----
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> x WHERE $1 = fkatt1 [AND ...]
! 				 *	       FOR KEY LOCK OF x
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
***************
*** 1515,1521 **** RI_FKey_restrict_del(PG_FUNCTION_ARGS)
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
--- 1519,1525 ----
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
***************
*** 1605,1611 **** RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
  	 * old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR SHARE will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
--- 1609,1615 ----
  	 * old tuple.
  	 *
  	 * fk_rel is opened in RowShareLock mode since that's what our eventual
! 	 * SELECT FOR KEY LOCK will get on it.
  	 */
  	fk_rel = heap_open(riinfo.fk_relid, RowShareLock);
  	pk_rel = trigdata->tg_relation;
***************
*** 1673,1679 **** RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> WHERE $1 = fkatt1 [AND ...]
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
--- 1677,1684 ----
  
  				/* ----------
  				 * The query string built is
! 				 *	SELECT 1 FROM ONLY <fktable> x WHERE $1 = fkatt1 [AND ...]
! 				 *	       FOR KEY LOCK OF x
  				 * The type id's for the $ parameters are those of the
  				 * corresponding PK attributes.
  				 * ----------
***************
*** 1698,1704 **** RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR SHARE OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
--- 1703,1709 ----
  					querysep = "AND";
  					queryoids[i] = pk_type;
  				}
! 				appendStringInfo(&querybuf, " FOR KEY LOCK OF x");
  
  				/* Prepare and save the plan */
  				qplan = ri_PlanCheck(querybuf.data, riinfo.nkeys, queryoids,
*** a/src/backend/utils/adt/ruleutils.c
--- b/src/backend/utils/adt/ruleutils.c
***************
*** 2857,2868 **** get_select_query_def(Query *query, deparse_context *context,
  			if (rc->pushedDown)
  				continue;
  
! 			if (rc->forUpdate)
! 				appendContextKeyword(context, " FOR UPDATE",
! 									 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
! 			else
! 				appendContextKeyword(context, " FOR SHARE",
! 									 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
  			appendStringInfo(buf, " OF %s",
  							 quote_identifier(rte->eref->aliasname));
  			if (rc->noWait)
--- 2857,2880 ----
  			if (rc->pushedDown)
  				continue;
  
! 			switch (rc->strength)
! 			{
! 				case LCS_FORKEYLOCK:
! 					appendContextKeyword(context, " FOR KEY LOCK",
! 										 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
! 					break;
! 				case LCS_FORSHARE:
! 					appendContextKeyword(context, " FOR SHARE",
! 										 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
! 					break;
! 				case LCS_FORUPDATE:
! 					appendContextKeyword(context, " FOR UPDATE",
! 										 -PRETTYINDENT_STD, PRETTYINDENT_STD, 0);
! 					break;
! 				default:
! 					elog(ERROR, "unrecognized row locking clause %d", rc->strength);
! 			}
! 
  			appendStringInfo(buf, " OF %s",
  							 quote_identifier(rte->eref->aliasname));
  			if (rc->noWait)
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 3614,3619 **** RelationGetIndexPredicate(Relation relation)
--- 3614,3622 ----
   * simple index keys, but attributes used in expressions and partial-index
   * predicates.)
   *
+  * If "keyAttrs" is true, only attributes that can be referenced by foreign
+  * keys are considered.
+  *
   * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
   * we can include system attributes (e.g., OID) in the bitmap representation.
   *
***************
*** 3625,3640 **** RelationGetIndexPredicate(Relation relation)
   * be bms_free'd when not needed anymore.
   */
  Bitmapset *
! RelationGetIndexAttrBitmap(Relation relation)
  {
  	Bitmapset  *indexattrs;
  	List	   *indexoidlist;
  	ListCell   *l;
  	MemoryContext oldcxt;
  
  	/* Quick exit if we already computed the result. */
  	if (relation->rd_indexattr != NULL)
! 		return bms_copy(relation->rd_indexattr);
  
  	/* Fast path if definitely no indexes */
  	if (!RelationGetForm(relation)->relhasindex)
--- 3628,3644 ----
   * be bms_free'd when not needed anymore.
   */
  Bitmapset *
! RelationGetIndexAttrBitmap(Relation relation, bool keyAttrs)
  {
  	Bitmapset  *indexattrs;
+ 	Bitmapset  *uindexattrs;
  	List	   *indexoidlist;
  	ListCell   *l;
  	MemoryContext oldcxt;
  
  	/* Quick exit if we already computed the result. */
  	if (relation->rd_indexattr != NULL)
! 		return bms_copy(keyAttrs ? relation->rd_keyattr : relation->rd_indexattr);
  
  	/* Fast path if definitely no indexes */
  	if (!RelationGetForm(relation)->relhasindex)
***************
*** 3653,3678 **** RelationGetIndexAttrBitmap(Relation relation)
--- 3657,3694 ----
  	 * For each index, add referenced attributes to indexattrs.
  	 */
  	indexattrs = NULL;
+ 	uindexattrs = NULL;
  	foreach(l, indexoidlist)
  	{
  		Oid			indexOid = lfirst_oid(l);
  		Relation	indexDesc;
  		IndexInfo  *indexInfo;
  		int			i;
+ 		bool		isKey;
  
  		indexDesc = index_open(indexOid, AccessShareLock);
  
  		/* Extract index key information from the index's pg_index row */
  		indexInfo = BuildIndexInfo(indexDesc);
  
+ 		/* Can this index be referenced by a foreign key? */
+ 		isKey = indexInfo->ii_Unique &&
+ 				indexInfo->ii_Expressions == NIL &&
+ 				indexInfo->ii_Predicate == NIL;
+ 
  		/* Collect simple attribute references */
  		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
  		{
  			int			attrnum = indexInfo->ii_KeyAttrNumbers[i];
  
  			if (attrnum != 0)
+ 			{
  				indexattrs = bms_add_member(indexattrs,
  							   attrnum - FirstLowInvalidHeapAttributeNumber);
+ 				if (isKey)
+ 					uindexattrs = bms_add_member(uindexattrs,
+ 											   	 attrnum - FirstLowInvalidHeapAttributeNumber);
+ 			}
  		}
  
  		/* Collect all attributes used in expressions, too */
***************
*** 3689,3698 **** RelationGetIndexAttrBitmap(Relation relation)
  	/* Now save a copy of the bitmap in the relcache entry. */
  	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
  	relation->rd_indexattr = bms_copy(indexattrs);
  	MemoryContextSwitchTo(oldcxt);
  
  	/* We return our original working copy for caller to play with */
! 	return indexattrs;
  }
  
  /*
--- 3705,3715 ----
  	/* Now save a copy of the bitmap in the relcache entry. */
  	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
  	relation->rd_indexattr = bms_copy(indexattrs);
+ 	relation->rd_keyattr = bms_copy(uindexattrs);
  	MemoryContextSwitchTo(oldcxt);
  
  	/* We return our original working copy for caller to play with */
! 	return keyAttrs ? uindexattrs : indexattrs;
  }
  
  /*
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 31,38 ****
--- 31,44 ----
  
  typedef struct BulkInsertStateData *BulkInsertState;
  
+ /*
+  * This enum mirrors LockClauseStrength precisely, but we define it separately
+  * to reduce having to share otherwise unrelated headers.  To go from one to
+  * the other, we wade through the planner using a third enum, RowMarkType.
+  */
  typedef enum
  {
+ 	LockTupleKeylock,
  	LockTupleShared,
  	LockTupleExclusive
  } LockTupleMode;
*** a/src/include/access/htup.h
--- b/src/include/access/htup.h
***************
*** 163,174 **** typedef HeapTupleHeaderData *HeapTupleHeader;
  #define HEAP_HASVARWIDTH		0x0002	/* has variable-width attribute(s) */
  #define HEAP_HASEXTERNAL		0x0004	/* has external stored attribute(s) */
  #define HEAP_HASOID				0x0008	/* has an object-id field */
! /* bit 0x0010 is available */
  #define HEAP_COMBOCID			0x0020	/* t_cid is a combo cid */
  #define HEAP_XMAX_EXCL_LOCK		0x0040	/* xmax is exclusive locker */
  #define HEAP_XMAX_SHARED_LOCK	0x0080	/* xmax is shared locker */
! /* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
! #define HEAP_IS_LOCKED	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK)
  #define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
  #define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
  #define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */
--- 163,177 ----
  #define HEAP_HASVARWIDTH		0x0002	/* has variable-width attribute(s) */
  #define HEAP_HASEXTERNAL		0x0004	/* has external stored attribute(s) */
  #define HEAP_HASOID				0x0008	/* has an object-id field */
! #define HEAP_XMAX_KEY_LOCK		0x0010	/* xmax is a "key" locker */
  #define HEAP_COMBOCID			0x0020	/* t_cid is a combo cid */
  #define HEAP_XMAX_EXCL_LOCK		0x0040	/* xmax is exclusive locker */
  #define HEAP_XMAX_SHARED_LOCK	0x0080	/* xmax is shared locker */
! /* if either SHARE or KEY lock bit is set, this is a "shared" lock */
! #define HEAP_IS_SHARE_LOCKED (HEAP_XMAX_SHARED_LOCK | HEAP_XMAX_KEY_LOCK)
! /* if any LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
! #define HEAP_IS_LOCKED	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK | \
! 						 HEAP_XMAX_KEY_LOCK)
  #define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
  #define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
  #define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */
***************
*** 726,735 **** typedef struct xl_heap_lock
  	xl_heaptid	target;			/* locked tuple id */
  	TransactionId locking_xid;	/* might be a MultiXactId not xid */
  	bool		xid_is_mxact;	/* is it? */
! 	bool		shared_lock;	/* shared or exclusive row lock? */
  } xl_heap_lock;
  
! #define SizeOfHeapLock	(offsetof(xl_heap_lock, shared_lock) + sizeof(bool))
  
  /* This is what we need to know about in-place update */
  typedef struct xl_heap_inplace
--- 729,738 ----
  	xl_heaptid	target;			/* locked tuple id */
  	TransactionId locking_xid;	/* might be a MultiXactId not xid */
  	bool		xid_is_mxact;	/* is it? */
! 	int8		lock_strength;	/* keylock, shared, exclusive lock? */
  } xl_heap_lock;
  
! #define SizeOfHeapLock	(offsetof(xl_heap_lock, lock_strength) + sizeof(int8))
  
  /* This is what we need to know about in-place update */
  typedef struct xl_heap_inplace
***************
*** 767,774 **** extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
  extern CommandId HeapTupleHeaderGetCmin(HeapTupleHeader tup);
  extern CommandId HeapTupleHeaderGetCmax(HeapTupleHeader tup);
  extern void HeapTupleHeaderAdjustCmax(HeapTupleHeader tup,
! 						  CommandId *cmax,
! 						  bool *iscombo);
  
  /* ----------------
   *		fastgetattr
--- 770,776 ----
  extern CommandId HeapTupleHeaderGetCmin(HeapTupleHeader tup);
  extern CommandId HeapTupleHeaderGetCmax(HeapTupleHeader tup);
  extern void HeapTupleHeaderAdjustCmax(HeapTupleHeader tup,
! 						  CommandId *cmax, bool *iscombo);
  
  /* ----------------
   *		fastgetattr
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 71,77 **** typedef struct XLogContRecord
  /*
   * Each page of XLOG file has a header like this:
   */
! #define XLOG_PAGE_MAGIC 0xD068	/* can be used as WAL version indicator */
  
  typedef struct XLogPageHeaderData
  {
--- 71,77 ----
  /*
   * Each page of XLOG file has a header like this:
   */
! #define XLOG_PAGE_MAGIC 0xD069	/* can be used as WAL version indicator */
  
  typedef struct XLogPageHeaderData
  {
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 408,414 **** typedef struct EState
   * ExecRowMark -
   *	   runtime representation of FOR UPDATE/SHARE clauses
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, we should have an
   * ExecRowMark for each non-target relation in the query (except inheritance
   * parent RTEs, which can be ignored at runtime).  See PlanRowMark for details
   * about most of the fields.  In addition to fields directly derived from
--- 408,414 ----
   * ExecRowMark -
   *	   runtime representation of FOR UPDATE/SHARE clauses
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE/KEY LOCK, we should have an
   * ExecRowMark for each non-target relation in the query (except inheritance
   * parent RTEs, which can be ignored at runtime).  See PlanRowMark for details
   * about most of the fields.  In addition to fields directly derived from
*** a/src/include/nodes/parsenodes.h
--- b/src/include/nodes/parsenodes.h
***************
*** 119,125 **** typedef struct Query
  	bool		hasDistinctOn;	/* distinctClause is from DISTINCT ON */
  	bool		hasRecursive;	/* WITH RECURSIVE was specified */
  	bool		hasModifyingCTE;	/* has INSERT/UPDATE/DELETE in WITH */
! 	bool		hasForUpdate;	/* FOR UPDATE or FOR SHARE was specified */
  
  	List	   *cteList;		/* WITH list (of CommonTableExpr's) */
  
--- 119,125 ----
  	bool		hasDistinctOn;	/* distinctClause is from DISTINCT ON */
  	bool		hasRecursive;	/* WITH RECURSIVE was specified */
  	bool		hasModifyingCTE;	/* has INSERT/UPDATE/DELETE in WITH */
! 	bool		hasForUpdate;	/* FOR UPDATE/SHARE/KEY LOCK was specified */
  
  	List	   *cteList;		/* WITH list (of CommonTableExpr's) */
  
***************
*** 569,586 **** typedef struct DefElem
  } DefElem;
  
  /*
!  * LockingClause - raw representation of FOR UPDATE/SHARE options
   *
   * Note: lockedRels == NIL means "all relations in query".	Otherwise it
   * is a list of RangeVar nodes.  (We use RangeVar mainly because it carries
   * a location field --- currently, parse analysis insists on unqualified
   * names in LockingClause.)
   */
  typedef struct LockingClause
  {
  	NodeTag		type;
  	List	   *lockedRels;		/* FOR UPDATE or FOR SHARE relations */
! 	bool		forUpdate;		/* true = FOR UPDATE, false = FOR SHARE */
  	bool		noWait;			/* NOWAIT option */
  } LockingClause;
  
--- 569,594 ----
  } DefElem;
  
  /*
!  * LockingClause - raw representation of FOR UPDATE/SHARE/KEY LOCK options
   *
   * Note: lockedRels == NIL means "all relations in query".	Otherwise it
   * is a list of RangeVar nodes.  (We use RangeVar mainly because it carries
   * a location field --- currently, parse analysis insists on unqualified
   * names in LockingClause.)
   */
+ typedef enum LockClauseStrength
+ {
+ 	/* order is important -- see applyLockingClause */
+ 	LCS_FORKEYLOCK,
+ 	LCS_FORSHARE,
+ 	LCS_FORUPDATE
+ } LockClauseStrength;
+ 
  typedef struct LockingClause
  {
  	NodeTag		type;
  	List	   *lockedRels;		/* FOR UPDATE or FOR SHARE relations */
! 	LockClauseStrength strength;
  	bool		noWait;			/* NOWAIT option */
  } LockingClause;
  
***************
*** 863,880 **** typedef struct WindowClause
   *	   parser output representation of FOR UPDATE/SHARE clauses
   *
   * Query.rowMarks contains a separate RowMarkClause node for each relation
!  * identified as a FOR UPDATE/SHARE target.  If FOR UPDATE/SHARE is applied
!  * to a subquery, we generate RowMarkClauses for all normal and subquery rels
!  * in the subquery, but they are marked pushedDown = true to distinguish them
!  * from clauses that were explicitly written at this query level.  Also,
!  * Query.hasForUpdate tells whether there were explicit FOR UPDATE/SHARE
!  * clauses in the current query level.
   */
  typedef struct RowMarkClause
  {
  	NodeTag		type;
  	Index		rti;			/* range table index of target relation */
! 	bool		forUpdate;		/* true = FOR UPDATE, false = FOR SHARE */
  	bool		noWait;			/* NOWAIT option */
  	bool		pushedDown;		/* pushed down from higher query level? */
  } RowMarkClause;
--- 871,888 ----
   *	   parser output representation of FOR UPDATE/SHARE clauses
   *
   * Query.rowMarks contains a separate RowMarkClause node for each relation
!  * identified as a FOR UPDATE/SHARE/KEY LOCK target.  If one of these clauses
!  * is applied to a subquery, we generate RowMarkClauses for all normal and
!  * subquery rels in the subquery, but they are marked pushedDown = true to
!  * distinguish them from clauses that were explicitly written at this query
!  * level.  Also, Query.hasForUpdate tells whether there were explicit FOR
!  * UPDATE/SHARE/KEY LOCK clauses in the current query level.
   */
  typedef struct RowMarkClause
  {
  	NodeTag		type;
  	Index		rti;			/* range table index of target relation */
! 	LockClauseStrength strength;
  	bool		noWait;			/* NOWAIT option */
  	bool		pushedDown;		/* pushed down from higher query level? */
  } RowMarkClause;
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 722,728 **** typedef struct Limit
   * RowMarkType -
   *	  enums for types of row-marking operations
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, we have to uniquely
   * identify all the source rows, not only those from the target relations, so
   * that we can perform EvalPlanQual rechecking at need.  For plain tables we
   * can just fetch the TID, the same as for a target relation.  Otherwise (for
--- 722,728 ----
   * RowMarkType -
   *	  enums for types of row-marking operations
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE/KEY LOCK, we have to uniquely
   * identify all the source rows, not only those from the target relations, so
   * that we can perform EvalPlanQual rechecking at need.  For plain tables we
   * can just fetch the TID, the same as for a target relation.  Otherwise (for
***************
*** 734,752 **** typedef enum RowMarkType
  {
  	ROW_MARK_EXCLUSIVE,			/* obtain exclusive tuple lock */
  	ROW_MARK_SHARE,				/* obtain shared tuple lock */
  	ROW_MARK_REFERENCE,			/* just fetch the TID */
  	ROW_MARK_COPY				/* physically copy the row value */
  } RowMarkType;
  
! #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_SHARE)
  
  /*
   * PlanRowMark -
   *	   plan-time representation of FOR UPDATE/SHARE clauses
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, we create a separate
   * PlanRowMark node for each non-target relation in the query.	Relations that
!  * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE (if
   * real tables) or ROW_MARK_COPY (if not).
   *
   * Initially all PlanRowMarks have rti == prti and isParent == false.
--- 734,753 ----
  {
  	ROW_MARK_EXCLUSIVE,			/* obtain exclusive tuple lock */
  	ROW_MARK_SHARE,				/* obtain shared tuple lock */
+ 	ROW_MARK_KEYLOCK,			/* obtain keylock tuple lock */
  	ROW_MARK_REFERENCE,			/* just fetch the TID */
  	ROW_MARK_COPY				/* physically copy the row value */
  } RowMarkType;
  
! #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_KEYLOCK)
  
  /*
   * PlanRowMark -
   *	   plan-time representation of FOR UPDATE/SHARE clauses
   *
!  * When doing UPDATE, DELETE, or SELECT FOR UPDATE/SHARE/KEY LOCK, we create a separate
   * PlanRowMark node for each non-target relation in the query.	Relations that
!  * are not specified as FOR UPDATE/SHARE/KEY LOCK are marked ROW_MARK_REFERENCE (if
   * real tables) or ROW_MARK_COPY (if not).
   *
   * Initially all PlanRowMarks have rti == prti and isParent == false.
*** a/src/include/parser/analyze.h
--- b/src/include/parser/analyze.h
***************
*** 31,36 **** extern bool analyze_requires_snapshot(Node *parseTree);
  
  extern void CheckSelectLocking(Query *qry);
  extern void applyLockingClause(Query *qry, Index rtindex,
! 				   bool forUpdate, bool noWait, bool pushedDown);
  
  #endif   /* ANALYZE_H */
--- 31,36 ----
  
  extern void CheckSelectLocking(Query *qry);
  extern void applyLockingClause(Query *qry, Index rtindex,
! 				   LockClauseStrength strength, bool noWait, bool pushedDown);
  
  #endif   /* ANALYZE_H */
*** a/src/include/utils/rel.h
--- b/src/include/utils/rel.h
***************
*** 103,108 **** typedef struct RelationData
--- 103,109 ----
  	Oid			rd_id;			/* relation's object id */
  	List	   *rd_indexlist;	/* list of OIDs of indexes on relation */
  	Bitmapset  *rd_indexattr;	/* identifies columns used in indexes */
+ 	Bitmapset  *rd_keyattr;		/* cols that can be ref'd by foreign keys */
  	Oid			rd_oidindex;	/* OID of unique index on OID, if any */
  	LockInfoData rd_lockInfo;	/* lock mgr's info for locking relation */
  	RuleLock   *rd_rules;		/* rewrite rules */
*** a/src/include/utils/relcache.h
--- b/src/include/utils/relcache.h
***************
*** 42,48 **** extern List *RelationGetIndexList(Relation relation);
  extern Oid	RelationGetOidIndex(Relation relation);
  extern List *RelationGetIndexExpressions(Relation relation);
  extern List *RelationGetIndexPredicate(Relation relation);
! extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation);
  extern void RelationGetExclusionInfo(Relation indexRelation,
  						 Oid **operators,
  						 Oid **procs,
--- 42,48 ----
  extern Oid	RelationGetOidIndex(Relation relation);
  extern List *RelationGetIndexExpressions(Relation relation);
  extern List *RelationGetIndexPredicate(Relation relation);
! extern Bitmapset *RelationGetIndexAttrBitmap(Relation relation, bool keyAttrs);
  extern void RelationGetExclusionInfo(Relation indexRelation,
  						 Oid **operators,
  						 Oid **procs,
*** a/src/test/isolation/expected/fk-contention.out
--- b/src/test/isolation/expected/fk-contention.out
***************
*** 7,15 **** step upd:  UPDATE foo SET b = 'Hello World';
  
  starting permutation: ins upd com
  step ins:  INSERT INTO bar VALUES (42); 
! step upd:  UPDATE foo SET b = 'Hello World';  <waiting ...>
  step com:  COMMIT; 
- step upd: <... completed>
  
  starting permutation: upd ins com
  step upd:  UPDATE foo SET b = 'Hello World'; 
--- 7,14 ----
  
  starting permutation: ins upd com
  step ins:  INSERT INTO bar VALUES (42); 
! step upd:  UPDATE foo SET b = 'Hello World'; 
  step com:  COMMIT; 
  
  starting permutation: upd ins com
  step upd:  UPDATE foo SET b = 'Hello World'; 
*** a/src/test/isolation/expected/fk-deadlock.out
--- b/src/test/isolation/expected/fk-deadlock.out
***************
*** 20,60 **** step s2c:  COMMIT;
  starting permutation: s1i s2i s1u s2u s1c s2c
  step s1i:  INSERT INTO child VALUES (1, 1); 
  step s2i:  INSERT INTO child VALUES (2, 1); 
! step s1u:  UPDATE parent SET aux = 'bar';  <waiting ...>
! step s2u:  UPDATE parent SET aux = 'baz'; 
! step s1u: <... completed>
! ERROR:  deadlock detected
  step s1c:  COMMIT; 
  step s2c:  COMMIT; 
  
  starting permutation: s1i s2i s2u s1u s2c s1c
  step s1i:  INSERT INTO child VALUES (1, 1); 
  step s2i:  INSERT INTO child VALUES (2, 1); 
! step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
! step s1u:  UPDATE parent SET aux = 'bar'; 
! ERROR:  deadlock detected
! step s2u: <... completed>
  step s2c:  COMMIT; 
  step s1c:  COMMIT; 
  
  starting permutation: s2i s1i s1u s2u s1c s2c
  step s2i:  INSERT INTO child VALUES (2, 1); 
  step s1i:  INSERT INTO child VALUES (1, 1); 
! step s1u:  UPDATE parent SET aux = 'bar';  <waiting ...>
! step s2u:  UPDATE parent SET aux = 'baz'; 
! step s1u: <... completed>
! ERROR:  deadlock detected
  step s1c:  COMMIT; 
  step s2c:  COMMIT; 
  
  starting permutation: s2i s1i s2u s1u s2c s1c
  step s2i:  INSERT INTO child VALUES (2, 1); 
  step s1i:  INSERT INTO child VALUES (1, 1); 
! step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
! step s1u:  UPDATE parent SET aux = 'bar'; 
! ERROR:  deadlock detected
! step s2u: <... completed>
  step s2c:  COMMIT; 
  step s1c:  COMMIT; 
  
  starting permutation: s2i s2u s1i s2c s1u s1c
--- 20,56 ----
  starting permutation: s1i s2i s1u s2u s1c s2c
  step s1i:  INSERT INTO child VALUES (1, 1); 
  step s2i:  INSERT INTO child VALUES (2, 1); 
! step s1u:  UPDATE parent SET aux = 'bar'; 
! step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
  step s1c:  COMMIT; 
+ step s2u: <... completed>
  step s2c:  COMMIT; 
  
  starting permutation: s1i s2i s2u s1u s2c s1c
  step s1i:  INSERT INTO child VALUES (1, 1); 
  step s2i:  INSERT INTO child VALUES (2, 1); 
! step s2u:  UPDATE parent SET aux = 'baz'; 
! step s1u:  UPDATE parent SET aux = 'bar';  <waiting ...>
  step s2c:  COMMIT; 
+ step s1u: <... completed>
  step s1c:  COMMIT; 
  
  starting permutation: s2i s1i s1u s2u s1c s2c
  step s2i:  INSERT INTO child VALUES (2, 1); 
  step s1i:  INSERT INTO child VALUES (1, 1); 
! step s1u:  UPDATE parent SET aux = 'bar'; 
! step s2u:  UPDATE parent SET aux = 'baz';  <waiting ...>
  step s1c:  COMMIT; 
+ step s2u: <... completed>
  step s2c:  COMMIT; 
  
  starting permutation: s2i s1i s2u s1u s2c s1c
  step s2i:  INSERT INTO child VALUES (2, 1); 
  step s1i:  INSERT INTO child VALUES (1, 1); 
! step s2u:  UPDATE parent SET aux = 'baz'; 
! step s1u:  UPDATE parent SET aux = 'bar';  <waiting ...>
  step s2c:  COMMIT; 
+ step s1u: <... completed>
  step s1c:  COMMIT; 
  
  starting permutation: s2i s2u s1i s2c s1u s1c
*** a/src/test/isolation/expected/fk-deadlock2.out
--- b/src/test/isolation/expected/fk-deadlock2.out
***************
*** 100,107 **** step s1c:  COMMIT;
  starting permutation: s2u1 s2u2 s1u1 s2c s1u2 s1c
  step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
! step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1;  <waiting ...>
  step s2c:  COMMIT; 
- step s1u1: <... completed>
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  step s1c:  COMMIT; 
--- 100,106 ----
  starting permutation: s2u1 s2u2 s1u1 s2c s1u2 s1c
  step s2u1:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  step s2u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
! step s1u1:  UPDATE A SET Col1 = 1 WHERE AID = 1; 
  step s2c:  COMMIT; 
  step s1u2:  UPDATE B SET Col2 = 1 WHERE BID = 2; 
  step s1c:  COMMIT;

#35

robertmhaas@gmail.com

over 14 years ago

In reply to: Alvaro Herrera (#34)

Re: FOR KEY LOCK foreign keys

On Wed, Jul 27, 2011 at 7:16 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

One thing I have not addressed is Noah's idea about creating a new lock
mode, KEY UPDATE, that would let us solve the initial problem that this
patch set to resolve in the first place. I am not clear on exactly how
that is to be implemented, because currently heap_update and heap_delete
do not grab any kind of lock but instead do their own ad-hoc waiting. I
think that might need to be reshuffled a bit, to which I haven't gotten
yet, and is a radical enough idea that I would like it to be discussed
by the hackers community at large before setting sail on developing it.
In the meantime, this patch does improve the current situation quite a
lot.

I haven't looked at the patch yet, but do you have a pointer to Noah's
proposal? And/or a description of how it differs from what you
implemented here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#36