Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

Started by Florian Pflug over 15 years ago. 32 messages.
#1 Florian Pflug
fgp@phlo.org

Hi

After the recent discussion about the impossibility of efficiently implementing FK-like constraint triggers in PL/PGSQL that work correctly under SERIALIZABLE transactions, I've compared our behavior to Oracle's. As it turns out, a slight difference in Oracle's behavior makes those FK constraint triggers (which on postgres are correct only in READ COMMITTED mode) fully correct in SERIALIZABLE mode as well.

1. Summary of the previous discussion

The built-in FK constraint trigger looks for rows visible under either the transaction's snapshot *or* a freshly taken MVCC snapshot when checking for child-table rows that'd prevent an UPDATE or DELETE of a row in the parent table. This is necessary even though the parent row is SHARE-locked on INSERTs/UPDATEs to the child table, and would also be necessary if it were UPDATE-locked. The following series of commands illustrates why:

C1: BEGIN
C1: SELECT * FROM t WHERE id = 1 FOR UPDATE
C2: BEGIN
C2: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
C2: SELECT * FROM t -- Take snapshot before C1 commits
C1: COMMIT
C2: DELETE FROM t WHERE id = 1
C2: COMMIT

Since C1 commits before C2 issues its DELETE, C2 is entirely unaffected by C1's UPDATE lock. C2 has no way of detecting possible dependent rows that C1 might have inserted, since C1's changes are invisible to C2.

Note that if you swap the SELECT .. FOR UPDATE and the DELETE commands, the SELECT .. FOR UPDATE will cause a serialization error!
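
For concreteness, here's the swapped ordering spelled out (the error text is postgres' usual message for such conflicts):

C1: BEGIN
C1: DELETE FROM t WHERE id = 1
C2: BEGIN
C2: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
C2: SELECT * FROM t -- Take snapshot before C1 commits
C1: COMMIT
C2: SELECT * FROM t WHERE id = 1 FOR UPDATE
-- fails with ERROR: could not serialize access due to concurrent update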

2. The behavior of Oracle

Oracle treats a "FOR UPDATE" lock much like an actual UPDATE when checking for serialization conflicts. This causes the DELETE in the example above to raise a serialization error, and hence prevents the failure case for FK constraint triggers even without a recheck under a current snapshot.
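
For concreteness, here's the same sequence annotated with the outcome I observed on Oracle in my experiments (ORA-08177 is Oracle's serialization error):

C1: BEGIN
C1: SELECT * FROM t WHERE id = 1 FOR UPDATE
C2: BEGIN
C2: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
C2: SELECT * FROM t -- Take snapshot before C1 commits
C1: COMMIT
C2: DELETE FROM t WHERE id = 1
-- fails with ORA-08177: can't serialize access for this transaction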

One can think of a FOR UPDATE lock as a kind of read barrier on Oracle - it prevents transactions that don't consider the locking transaction to be visible from messing with the row.

3. Conclusion

While it might seem strange at first for a lock to affect other transactions even after the locking transaction has ended, it actually makes sense when viewed as a kind of read barrier. It is very common for locking primitives to use barrier instructions to ensure that one lock holder sees all changes done by the previous owner. Raising a serialization error in the example above is the transactional equivalent of such a barrier instruction for SERIALIZABLE transactions: since updating the transaction's snapshot is obviously not an option, the only remaining alternative is to restart the whole transaction under a current snapshot. This is exactly what raising a serialization error accomplishes.

Also, while Oracle's behavior has obvious use-cases (e.g. FK-like constraints), I failed to come up with a case where postgres' current behavior is useful. When would you want a (SERIALIZABLE) transaction to wait for a lock, but then continue as if the lock had never existed? What is the point of waiting then in the first place?

All in all, I believe that SHARE and UPDATE row-level locks should be changed to cause concurrent UPDATEs to fail with a serialization error. I can come up with a patch that does that, but I wanted to get some feedback on the idea before I put the work in.

best regards,
Florian Pflug

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Florian Pflug (#1)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

Florian Pflug <fgp@phlo.org> writes:

All in all, I believe that SHARE and UPDATE row-level locks should be
changed to cause concurrent UPDATEs to fail with a serialization
error.

I don't see an argument for doing that for FOR SHARE locks, and it
already happens for FOR UPDATE (at least if the row actually gets
updated). AFAICS this proposal mainly breaks things, in pursuit of
an unnecessary and probably-impossible-anyway goal of making FK locking
work with only user-level snapshots.

regards, tom lane

#3 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Florian Pflug (#1)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

Florian Pflug <fgp@phlo.org> wrote:

All in all, I believe that SHARE and UPDATE row-level locks should
be changed to cause concurrent UPDATEs to fail with a
serialization error. I can come up with a patch that does that,
but I wanted to get some feedback on the idea before I put the
work in.

Before you work on that, you might want to wait until you can review
the work that I and Dan Ports (a Ph.D. candidate from MIT) have been
doing on support for true serializable transactions. You don't need
to use FOR SHARE or FOR UPDATE or any explicit locks as long as the
concurrent transactions are SERIALIZABLE. We have it working, but
have been holding off on discussion or patch submission at Tom's
request -- he felt it would distract from the process of getting the
release out.

Whenever people are ready, I can submit a WIP patch. All issues
discuss on this thread "Just Work" with the patch applied. There's
a Wiki page and a public git repository related to this work, for
anyone who is interested and not busy with release work.

-Kevin

#4 Florian Pflug
fgp@phlo.org
In reply to: Tom Lane (#2)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On May 13, 2010, at 23:39 , Tom Lane wrote:

Florian Pflug <fgp@phlo.org> writes:

All in all, I believe that SHARE and UPDATE row-level locks should be
changed to cause concurrent UPDATEs to fail with a serialization
error.

I don't see an argument for doing that for FOR SHARE locks, and it
already happens for FOR UPDATE (at least if the row actually gets
updated).

Yes, actually updating the row is a workaround. A prohibitively expensive one, though.

The arguments are as stated

a) SHARE- or UPDATE-locking a concurrently updated row *does* cause a serialization error, making the current behavior asymmetric

b) Locking primitives usually ensure that once you obtain the lock you see the most recent version of the data. This is currently true for READ COMMITTED transactions but not for SERIALIZABLE ones, which is pretty undesirable behavior for a locking primitive.

c) I fail to see how the current behavior is useful in the presence of SERIALIZABLE transactions. Currently, they could IMHO completely ignore FOR SHARE locks, without making any previously correct algorithm incorrect.

plus a weaker one:

d) Oracle does it for FOR UPDATE locks, and actually has an example of a FK trigger in PL/SQL in their docs.

AFAICS this proposal mainly breaks things, in pursuit of
an unnecessary and probably-impossible-anyway goal of making FK locking
work with only user-level snapshots.

I don't see the breakage this'd cause. For READ COMMITTED transactions nothing changes. For SERIALIZABLE transactions the behavior of FOR UPDATE / FOR SHARE becomes much easier to grasp. In both cases a SHARE lock would then say "Only update this row if you have seen the locking transaction's changes".

Why do you think that making FK locking work with only user-level snapshots is probably-impossible-anyway? With the proposed changes, simply FOR SHARE locking the parent row on INSERT/UPDATE of the child, plus checking for child rows on UPDATE/DELETE of the parent gives a 100% correct FK trigger.

I do not have a formal proof for that last assertion, but I'm not aware of any counter-examples either. Would love to hear of any, though.

best regards,
Florian Pflug

#5 Florian Pflug
fgp@phlo.org
In reply to: Kevin Grittner (#3)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On May 13, 2010, at 23:51 , Kevin Grittner wrote:

Florian Pflug <fgp@phlo.org> wrote:

All in all, I believe that SHARE and UPDATE row-level locks should
be changed to cause concurrent UPDATEs to fail with a
serialization error. I can come up with a patch that does that,
but I wanted to get some feedback on the idea before I put the
work in.

Before you work on that, you might want to wait until you can review
the work that I and Dan Ports (a Ph.D. candidate from MIT) have been
doing on support for true serializable transactions. You don't need
to use FOR SHARE or FOR UPDATE or any explicit locks as long as the
concurrent transactions are SERIALIZABLE. We have it working, but
have been holding off on discussion or patch submission at Tom's
request -- he felt it would distract from the process of getting the
release out.

I'm very excited about the work you're doing there, and believe it'd be a great feature to have.

However, I view my proposal as pretty orthogonal to that work. True serializable transactions are much more powerful than what I proposed, but at a much higher price too, due to the necessity of SIREAD locks. My proposal allows for simple FK-like constraints to be implemented at user-level that are correct for all isolation levels.

best regards,
Florian Pflug

#6 Greg Stark
gsstark@mit.edu
In reply to: Florian Pflug (#1)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On Thu, May 13, 2010 at 10:25 PM, Florian Pflug <fgp@phlo.org> wrote:

C1: BEGIN
C1: SELECT * FROM t WHERE id = 1 FOR UPDATE
C2: BEGIN
C2: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
C2: SELECT * FROM t -- Take snapshot before C1 commits
C1: COMMIT
C2: DELETE FROM t WHERE id = 1
C2: COMMIT

Can you give an actual realistic example -- ie, not doing a select for
update and then never updating the row or with an explanation of what
the programmer is attempting to accomplish with such an unusual
sequence? The rest of the post talks about FKs but I don't see any
here...

--
greg

#7 Anssi Kääriäinen
anssi.kaariainen@thl.fi
In reply to: Greg Stark (#6)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On 05/14/2010 03:37 AM, Greg Stark wrote:

On Thu, May 13, 2010 at 10:25 PM, Florian Pflug<fgp@phlo.org> wrote:

C1: BEGIN
C1: SELECT * FROM t WHERE id = 1 FOR UPDATE
C2: BEGIN
C2: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
C2: SELECT * FROM t -- Take snapshot before C1 commits
C1: COMMIT
C2: DELETE FROM t WHERE id = 1
C2: COMMIT

Can you give an actual realistic example -- ie, not doing a select for
update and then never updating the row or with an explanation of what
the programmer is attempting to accomplish with such an unusual
sequence? The rest of the post talks about FKs but I don't see any
here...

Doing a select for update and then never updating the row is a realistic
example.

I am currently designing a database where this is an issue. The
simplified schema to illustrate the problem:

create table object (
id integer primary key
);

insert into object values(1);

create table attribute (
object_id integer not null references object,
attr_type integer not null, -- references attr_types
value text not null,
valid_from timestamp not null,
valid_until timestamp
);

Now, I want to make sure there are no pairs of (object_id, attr_type)
where the valid_from, valid_until times overlap.

A problematic sequence for this schema, both transactions in isolation
level serializable:

C1: begin;
C1: select * from object where id = 1 for update;
-- check for conflicting attr_type; realistically the where condition should
-- include an overlap check, but that's left out for simplicity...
C1: select * from attribute where object_id = 1 and attr_type = 1;
-- Ok, nothing overlapping, I am able to insert.
C1: insert into attribute values (1, 1, 'Anssi', now(), null);
C2: begin;
-- This blocks.
C2: select * from object where id = 1 for update;
C1: commit;
-- Check for conflicts. This select won't see the insert C1 did.
C2: select * from attribute where object_id = 1 and attr_type = 1;
-- C2 doesn't see anything conflicting
C2: insert into attribute values (1, 1, 'Matti', now(), null);
C2: commit;
-- Inconsistency.

Now, that same sequence does work at the read committed isolation level (C2
sees the insert of C1), and that is my solution for now: require
applications to use the read committed isolation level. This could also be
solved by issuing "update object set id = id where id = 1" instead of
using select for update, which would result in a serialization error.
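
To illustrate, a sketch of that dummy-update workaround in the same scenario (both transactions again serializable):

C1: begin;
C1: update object set id = id where id = 1; -- a real write, not just a lock
C1: insert into attribute values (1, 1, 'Anssi', now(), null);
C2: begin;
-- This blocks on C1's update.
C2: update object set id = id where id = 1;
C1: commit;
-- C2 now fails with a serialization error instead of missing C1's insert.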

I know that for this particular example the upcoming exclusion
constraints would solve the problem. But if I wanted to ensure that
if attr_value for attr_type 1 is 'Anssi' then attr_value for attr_type 2
is 'Kääriäinen', exclusion constraints could not be used.

--
Anssi Kääriäinen

#8 Nicolas Barbier
nicolas.barbier@gmail.com
In reply to: Greg Stark (#6)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

2010/5/14 Greg Stark <gsstark@mit.edu>:

On Thu, May 13, 2010 at 10:25 PM, Florian Pflug <fgp@phlo.org> wrote:

C1: BEGIN
C1: SELECT * FROM t WHERE id = 1 FOR UPDATE
C2: BEGIN
C2: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
C2: SELECT * FROM t -- Take snapshot before C1 commits
C1: COMMIT
C2: DELETE FROM t WHERE id = 1
C2: COMMIT

Can you give an actual realistic example -- ie, not doing a select for
update and then never updating the row or with an explanation of what
the programmer is attempting to accomplish with such an unusual
sequence? The rest of the post talks about FKs but I don't see any
here...

The link with FKs is as follows:

* The example does not use a real FK, because the whole purpose is to
do the same as FKs while not using the FK machinery.
* The example uses only one table, because that is enough to
illustrate the problem (see next items).
* C1 locks a row, supposedly because it wants to create a reference to
it in a non-mentioned table, and wants to prevent the row from being
deleted under it.
* C2 deletes that row (supposedly after it verified that there are no
references to it; it would indeed not be able to see the reference
that C1 created/would create), and C1 fails to detect that.
* C2 also fails to detect the problem, because the lock that C1 held
is released once C1 commits, so C2 can happily go on deleting
the row.
* The end result is that the hypothetical reference is created,
although the referent is gone.

Nicolas

#9 Florian Pflug
fgp@phlo.org
In reply to: Greg Stark (#6)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On May 14, 2010, at 2:37 , Greg Stark wrote:

On Thu, May 13, 2010 at 10:25 PM, Florian Pflug <fgp@phlo.org> wrote:

C1: BEGIN
C1: SELECT * FROM t WHERE id = 1 FOR UPDATE
C2: BEGIN
C2: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
C2: SELECT * FROM t -- Take snapshot before C1 commits
C1: COMMIT
C2: DELETE FROM t WHERE id = 1
C2: COMMIT

Can you give an actual realistic example -- ie, not doing a select for
update and then never updating the row or with an explanation of what
the programmer is attempting to accomplish with such an unusual
sequence? The rest of the post talks about FKs but I don't see any
here...

The table "t" is supposed to represent the parent table of a FK constraint. The SELECT FOR UPDATE is done upon an INSERT to the child table to protect the parent row against concurrent deletion. I've used FOR UPDATE instead of FOR SHARE because I did test this against oracle also, and oracle does not support FOR SHARE.

Here's a full example of a pair of FK triggers in PL/PGSQL that work correctly in READ COMMITTED mode but fail to enforce the constraint in SERIALIZABLE mode, as the following sequence of commands shows. With my proposal, the DELETE would again raise a serialization error and hence keep the constraint satisfied.

C1: BEGIN
C1: INSERT INTO child (parent_id) VALUES (1) -- Locks the parent row FOR UPDATE
C2: BEGIN
C2: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
C2: SELECT TRUE -- Take snapshot *before* C1 commits
C1: COMMIT
C2: DELETE FROM parent WHERE parent_id = 1 -- Succeeds
C2: COMMIT

----------
CREATE TABLE parent (parent_id SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE child (child_id SERIAL NOT NULL PRIMARY KEY, parent_id INTEGER NOT NULL);

CREATE FUNCTION ri_parent() RETURNS TRIGGER AS $body$
BEGIN
    PERFORM TRUE FROM child WHERE parent_id = OLD.parent_id;
    IF FOUND THEN
        RAISE SQLSTATE '23503' USING MESSAGE = 'Parent ' || OLD.parent_id || ' still referenced during ' || TG_OP;
    END IF;
    RETURN NULL;
END;
$body$ LANGUAGE PLPGSQL VOLATILE;
CREATE TRIGGER ri_parent AFTER UPDATE OR DELETE ON parent FOR EACH ROW EXECUTE PROCEDURE ri_parent();

CREATE FUNCTION ri_child() RETURNS TRIGGER AS $body$
BEGIN
    PERFORM TRUE FROM parent WHERE parent_id = NEW.parent_id FOR UPDATE OF parent;
    IF NOT FOUND THEN
        RAISE SQLSTATE '23503' USING MESSAGE = 'Parent ' || NEW.parent_id || ' does not exist during ' || TG_OP;
    END IF;
    RETURN NULL;
END;
$body$ LANGUAGE PLPGSQL VOLATILE;
CREATE TRIGGER ri_child AFTER INSERT OR UPDATE ON child FOR EACH ROW EXECUTE PROCEDURE ri_child();
----------

best regards,

Florian Pflug

#10 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Florian Pflug (#9)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

[slight rearrangement]

Florian Pflug wrote:

I'm very excited about the work you're doing

Always nice to hear. :-)

I view my proposal as pretty orthogonal to that work.

My proposal allows for simple FK-like constraints to be
implemented at user-level that are correct for all isolation
levels.

OK, I can see the attraction in that.

True serializable transactions are much more powerful than what I
proposed, but at a much higher price too, due to the necessity of
SIREAD locks.

I think that SIREAD locks will generally be cheaper than SELECT FOR
UPDATE, since the former don't require any disk I/O and the latter
do. I only have one benchmark so far (more on the way), but it
attempts to isolate the cost of acquiring the SIREAD locks by using
a read-only load against a fully cached database. Benchmarks so far
show the new version of the SERIALIZABLE level as supporting 1.8%
fewer TPS than REPEATABLE READ (the existing snapshot isolation
level) in that environment. That will probably disappear into the
noise for any load involving disk I/O.

Now *rollbacks*, particularly those due to false positives, might
become a more serious issue in some pessimal loads, but I'm still
working on developing meaningful benchmarks for that.

I guess what I'm suggesting is that unless you have a very small
database with a very large number of connections in a high
contention workload, or you can't require SERIALIZABLE transaction
isolation level, SSI might actually perform better than what you're
proposing. Of course, that's all conjecture until there are
benchmarks; but I'd be very interested in getting any and all
alternative solutions like this worked into a benchmark -- where I
can pull out the FOR UPDATE and FOR SHARE clauses, any redundant
updates or denormalizations added just for concurrency issues, and
all explicit locking -- and compare that under SERIALIZABLE to the
original performance.

-Kevin

#11 Florian Pflug
fgp@phlo.org
In reply to: Kevin Grittner (#10)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On May 14, 2010, at 12:56 , Kevin Grittner wrote:

True serializable transactions are much more powerful than what I
proposed, but at a much higher price too, due to the necessity of
SIREAD locks.

I think that SIREAD locks will generally be cheaper than SELECT FOR
UPDATE, since the former don't require any disk I/O and the latter
do. I only have one benchmark so far (more on the way), but it
attempts to isolate the cost of acquiring the SIREAD locks by using
a read-only load against a fully cached database. Benchmarks so far
show the new version of the SERIALIZABLE level as supporting 1.8%
fewer TPS than REPEATABLE READ (the existing snapshot isolation
level) in that environment. That will probably disappear into the
noise for any load involving disk I/O.

I can see how a single SIREAD lock can potentially be cheaper than a FOR SHARE or FOR UPDATE lock. But the number of SIREAD locks would exceed the number of FOR SHARE / FOR UPDATE locks by a few orders of magnitude, I'd think - at least if you ran every transaction under true serializable isolation.

I don't quite understand how SIREAD locks work if they don't involve any disk IO, since shared memory isn't resizable. But I guess I'll find out once you post the patch ;-)

I guess what I'm suggesting is that unless you have a very small
database with a very large number of connections in a high
contention workload, or you can't require SERIALIZABLE transaction
isolation level, SSI might actually perform better than what you're
proposing.

That is entirely possible. However, unless your patch completely removes support for snapshot isolation (what is currently called SERIALIZABLE), my proposal still eliminates the situation that user-level constraints are correct in READ COMMITTED and (true) SERIALIZABLE isolation but not in snapshot isolation.

Btw, the only users of FOR SHARE locks inside postgres proper are the RI triggers, and those do that special crosscheck when called within a SERIALIZABLE transaction. I do take this as evidence that the current behavior might not be all that useful with serializable transactions...

best regards,
Florian Pflug

#12 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Florian Pflug (#11)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

Florian Pflug <fgp@phlo.org> wrote:

On May 14, 2010, at 12:56 , Kevin Grittner wrote:

I think that SIREAD locks will generally be cheaper than SELECT
FOR UPDATE, since the former don't require any disk I/O and the
latter do.

I can see how a single SIREAD lock can potentially be cheaper than
a FOR SHARE or FOR UPDATE lock. But the number of SIREAD locks
would exceed the number of FOR SHARE / FOR UPDATE locks by a few
orders of magnitude, I'd think - at least if you ran every
transaction under true serializable isolation.

I don't quite understand how SIREAD locks work if they don't
involve any disk IO, since shared memory isn't resizable.

We use a well-worn technique used by many (most?) database products
-- granularity promotion. This is one of the things which could
cause enough false positives under some loads to cause your
technique to perform better than SSI for those loads.

unless your patch completely removes support for snapshot
isolation (what is currently called SERIALIZABLE)

Both SERIALIZABLE and REPEATABLE READ currently map to snapshot
isolation. We're leaving REPEATABLE READ alone.

my proposal still eliminates the situation that user-level
constraints are correct in READ COMMITTED and (true) SERIALIZABLE
isolation but not in snapshot isolation.

Agreed. If someone wants to enforce user-level constraints using
SSI, they will somehow need to ensure that less strict isolation
levels are never used to modify data. Your approach lifts that
burden.

By the way, if you can make this behave in a similar way to Oracle,
especially if the syntax is compatible, I'm sure it will help
promote PostgreSQL adoption. At PostgreSQL Conference U.S. East
2010, I talked briefly with a couple guys from an Oracle shop who
were looking at converting to PostgreSQL, and were very concerned
about not having what you describe. The techniques required to
ensure integrity in PostgreSQL were not, to put it mildly, appealing
to them. I suspect that they would be satisfied with *either* SSI
or the change you describe.

-Kevin

#13 Florian Pflug
fgp@phlo.org
In reply to: Kevin Grittner (#12)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On May 14, 2010, at 15:54 , Kevin Grittner wrote:

Florian Pflug <fgp@phlo.org> wrote:

On May 14, 2010, at 12:56 , Kevin Grittner wrote:
unless your patch completely removes support for snapshot
isolation (what is currently called SERIALIZABLE)

Both SERIALIZABLE and REPEATABLE READ currently map to snapshot
isolation. We're leaving REPEATABLE READ alone.

Ah, yeah, that makes a lot of sense. I kinda had forgotten about REPEATABLE READ...

my proposal still eliminates the situation that user-level
constraints are correct in READ COMMITTED and (true) SERIALIZABLE
isolation but not in snapshot isolation.

Agreed. If someone wants to enforce user-level constraints using
SSI, they will somehow need to ensure that less strict isolation
levels are never used to modify data. Your approach lifts that
burden.

By the way, if you can make this behave in a similar way to Oracle,
especially if the syntax is compatible, I'm sure it will help
promote PostgreSQL adoption. At PostgreSQL Conference U.S. East
2010, I talked briefly with a couple guys from an Oracle shop who
were looking at converting to PostgreSQL, and were very concerned
about not having what you describe. The techniques required to
ensure integrity in PostgreSQL were not, to put it mildly, appealing
to them. I suspect that they would be satisfied with *either* SSI
or the change you describe.

My proposal would make SELECT ... FOR UPDATE behave like Oracle does with regard to serialization conflicts. SELECT ... FOR SHARE doesn't seem to exist on Oracle at all - at least I couldn't find a reference to it in the docs.

The syntax isn't 100% compatible because Oracle seems to expect a list of columns after the FOR UPDATE clause, while postgres expects a list of tables.
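
To illustrate the difference with the parent/child tables from my earlier example (just a sketch of the syntax):

-- postgres: FOR UPDATE OF takes table names (or aliases)
SELECT * FROM parent p JOIN child c ON c.parent_id = p.parent_id FOR UPDATE OF p;
-- Oracle: FOR UPDATE OF takes column names, and locks the rows of the table the column belongs to
SELECT * FROM parent p JOIN child c ON c.parent_id = p.parent_id FOR UPDATE OF p.parent_id;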

I must admit that I wasn't able to find an explicit reference to Oracle's behavior in their docs, so I had to resort to experiments. They do have examples showing how to do FK-like constraints with triggers, and those don't contain any warning whatsoever about problems in SERIALIZABLE mode, though. But still, if there is word on this from Oracle somewhere, I'd love to hear about it.

best regards,
Florian Pflug

#14 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Florian Pflug (#13)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

Florian Pflug <fgp@phlo.org> wrote:

I must admit that I wasn't able to find an explicit reference to
Oracle's behavior in their docs, so I had to resort to
experiments. They do have examples showing how to do FK-like
constraints with triggers, and those don't contain any warning
whatsoever about problems in SERIALIZABLE mode, though. But
still, if there is word on this from Oracle somewhere, I'd love to
hear about it.

I suspect that in trying to emulate Oracle on this, you may run into
an issue which posed challenges for the SSI implementation which
didn't come up in the Cahill prototype implementations: Oracle, and
all other MVCC databases I've read about outside of PostgreSQL, use
an "update in place with a rollback log" technique. Access to any
version of a given row or index entry goes through a single
location, with possible backtracking through the log after that,
which simplifies management of certain concurrency issues. Do they
perhaps use an in-RAM lock table, pointing to the "base" location of
the row for these SELECT FOR UPDATE locks? (Just guessing; I've
never used Oracle, myself.)

-Kevin

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On Thu, May 13, 2010 at 5:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Florian Pflug <fgp@phlo.org> writes:

All in all, I believe that SHARE and UPDATE row-level locks should be
changed to cause concurrent UPDATEs to fail with a serialization
error.

I don't see an argument for doing that for FOR SHARE locks, and it
already happens for FOR UPDATE (at least if the row actually gets
updated).  AFAICS this proposal mainly breaks things, in pursuit of
an unnecessary and probably-impossible-anyway goal of making FK locking
work with only user-level snapshots.

After giving this considerable thought and testing the behavior at
some length, I think the OP has it right. One thing I sometimes need
to do is denormalize a copy of a field, e.g.

CREATE TABLE parent (id serial, mode integer not null, primary key (id));
CREATE TABLE child (id serial, parent_id integer not null references
parent (id), parent_mode integer not null);

The way I have typically implemented this in the past is:

1. Add a trigger to the parent table so that, whenever the mode column
gets updated, we do an update on the parent_mode of all children.
2. Add a trigger to the child table so that, when a new child is
inserted, it initializes parent_mode from its parent. I do the SELECT
with FOR UPDATE on the parent so the parent can't change under me; though
FOR SHARE ought to be enough also, since we're just trying to lock out
concurrent updates. (A sketch of both triggers follows below.)
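
Here's roughly what that looks like (a sketch only; the function and trigger names are made up):

CREATE FUNCTION propagate_parent_mode() RETURNS trigger AS $$
BEGIN
    -- Push the new mode down to all existing children.
    IF NEW.mode IS DISTINCT FROM OLD.mode THEN
        UPDATE child SET parent_mode = NEW.mode WHERE parent_id = NEW.id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER propagate_parent_mode AFTER UPDATE ON parent
    FOR EACH ROW EXECUTE PROCEDURE propagate_parent_mode();

CREATE FUNCTION init_parent_mode() RETURNS trigger AS $$
BEGIN
    -- Lock the parent row so its mode can't change while we copy it.
    SELECT mode INTO NEW.parent_mode FROM parent
        WHERE id = NEW.parent_id FOR UPDATE;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER init_parent_mode BEFORE INSERT ON child
    FOR EACH ROW EXECUTE PROCEDURE init_parent_mode();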

Suppose T1 updates the parent's mode while T2 adds a new child; then
both commit. In read committed mode, this seems to work OK regardless
of the order of T1 and T2. If T1 grabs the lock first, then T2 sees
the updated version of the row after T1 commits. If T2 grabs the lock
first, then the update on the parent blocks until the child commits.
Subsequently, when the trigger fires, it apparently uses an up-to-date
snapshot, so the new child is updated also. In serializable mode,
things are not so good. If T1 grabs the lock first, the child waits
to see whether it commits or aborts. On commit, it complains that it
can't serialize and aborts, which is reasonable - transaction aborts
are the price you pay for serializability. If T2 grabs the lock
first, the update on the parent blocks as before, but now the update
is done with the old snapshot and ignores the new child, so the new
child now has a value for parent_mode that doesn't match the parent's
actual mode. That is, you get the wrong answer due to a serialization
anomaly that didn't exist at the read committed level.
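
For concreteness, here's that bad case as a command sequence, using the sketch triggers above (the key point is that T1 takes its snapshot before T2 commits):

T2: BEGIN; SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
T2: INSERT INTO child (parent_id) VALUES (1); -- trigger FOR UPDATE locks the parent row
T1: BEGIN; SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
T1: UPDATE parent SET mode = 2 WHERE id = 1; -- blocks on T2's lock
T2: COMMIT;
-- T1's UPDATE now proceeds, but its trigger runs under T1's old snapshot
-- and doesn't see T2's new child, which keeps a stale parent_mode.
T1: COMMIT;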

Increasing the transaction isolation level is supposed to *eliminate*
serialization anomalies, not create them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#16 Rob Wultsch
wultsch@gmail.com
In reply to: Kevin Grittner (#14)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On Fri, May 14, 2010 at 7:32 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Oracle, and all other MVCC databases I've read about outside of PostgreSQL, use
an "update in place with a rollback log" technique.

Have you looked at PBXT (which is explicitly NOT SERIALIZABLE)?

--
Rob Wultsch
wultsch@gmail.com

#17 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Rob Wultsch (#16)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

Rob Wultsch wrote:

Have you looked at PBXT (which is explicitly NOT SERIALIZABLE)?

I hadn't heard of it; so I took a quick look based on your post. It
seems to be a new engine to use with MySQL which has MVCC without a
rollback log, so it presumably uses techniques at least vaguely
similar to PostgreSQL. Anything in particular you wanted me to
notice about it besides that? (Of course another MySQL engine which
doesn't provide very strong integrity guarantees isn't exciting to
me as a technology in itself.)

-Kevin

#18 Rob Wultsch
wultsch@gmail.com
In reply to: Kevin Grittner (#17)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On Sat, May 15, 2010 at 4:09 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

 Anything in particular you wanted me to notice about it besides that?

Nope. It was just a counterpoint to your previous comment.

--
Rob Wultsch
wultsch@gmail.com

#19 Florian Pflug
fgp@phlo.org
In reply to: Robert Haas (#15)
1 attachment(s)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On May 14, 2010, at 22:54 , Robert Haas wrote:

On Thu, May 13, 2010 at 5:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Florian Pflug <fgp@phlo.org> writes:

All in all, I believe that SHARE and UPDATE row-level locks should be
changed to cause concurrent UPDATEs to fail with a serialization
error.

I don't see an argument for doing that for FOR SHARE locks, and it
already happens for FOR UPDATE (at least if the row actually gets
updated). AFAICS this proposal mainly breaks things, in pursuit of
an unnecessary and probably-impossible-anyway goal of making FK locking
work with only user-level snapshots.

After giving this considerable thought and testing the behavior at
some length, I think the OP has it right. One thing I sometimes need
to do is denormalize a copy of a field, e.g.

<snipped example>

I've whipped up a quick and still rather dirty patch that implements the behavior I proposed, at least for the case of conflicts between FOR UPDATE locks and updates. With the patch, any attempt to UPDATE or FOR UPDATE lock a row that has concurrently been FOR UPDATE locked will cause a serialization error. (The same happens for an actually updated row, of course, but that was already the case before.)

While this part of the patch was fairly straightforward, making FOR SHARE conflict too seems to be much harder. The assumption that a lock becomes irrelevant after the transaction(s) that held it have completed is built deeply into the multi xact machinery that powers SHARE locks. That machinery therefore assumes that once all members of a multi xact have completed, the multi xact is dead also. But my proposal depends on a SERIALIZABLE transaction being able to find out whether any of the lockers of a row are invisible under its snapshot - for which it'd need any multi xact containing invisible xids to outlive its snapshot.

best regards,
Florian Pflug

Attachments:

serializable_share_lock.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8975191..bf6c77e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -83,6 +83,7 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
 				bool all_visible_cleared, bool new_all_visible_cleared);
 static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
 					   HeapTuple oldtup, HeapTuple newtup);
+static bool HeapSatisfiedLockCheckSnapshot(HeapTupleHeader tuple, Snapshot snapshot);
 
 
 /* ----------------------------------------------------------------
@@ -2033,8 +2034,9 @@ simple_heap_insert(Relation relation, HeapTuple tup)
  *	update_xmax - output parameter, used only for failure case (see below)
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
- *	crosscheck - if not InvalidSnapshot, also check tuple against this
  *	wait - true if should wait for any conflicting update to commit/abort
+ *	lockcheck_snapshot - if not NULL, report the tuple as updated if it
+ *		was locked by a transaction not visible under this snapshot
  *
  * Normal, successful return value is HeapTupleMayBeUpdated, which
  * actually means we did delete it.  Failure return codes are
@@ -2049,7 +2051,7 @@ simple_heap_insert(Relation relation, HeapTuple tup)
 HTSU_Result
 heap_delete(Relation relation, ItemPointer tid,
 			ItemPointer ctid, TransactionId *update_xmax,
-			CommandId cid, Snapshot crosscheck, bool wait)
+			CommandId cid, bool wait, Snapshot lockcheck_snapshot)
 {
 	HTSU_Result result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -2171,11 +2173,10 @@ l1:
 			result = HeapTupleUpdated;
 	}
 
-	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
+	if ((result == HeapTupleMayBeUpdated) &&
+		!HeapSatisfiedLockCheckSnapshot(tp.t_data, lockcheck_snapshot))
 	{
-		/* Perform additional check for serializable RI updates */
-		if (!HeapTupleSatisfiesVisibility(&tp, crosscheck, buffer))
-			result = HeapTupleUpdated;
+		result = HeapTupleUpdated;
 	}
 
 	if (result != HeapTupleMayBeUpdated)
@@ -2183,7 +2184,8 @@ l1:
 		Assert(result == HeapTupleSelfUpdated ||
 			   result == HeapTupleUpdated ||
 			   result == HeapTupleBeingUpdated);
-		Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
+		Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
+			   (tp.t_data->t_infomask & HEAP_IS_LOCKED));
 		*ctid = tp.t_data->t_ctid;
 		*update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
 		UnlockReleaseBuffer(buffer);
@@ -2313,8 +2315,9 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 
 	result = heap_delete(relation, tid,
 						 &update_ctid, &update_xmax,
-						 GetCurrentCommandId(true), InvalidSnapshot,
-						 true /* wait for commit */ );
+						 GetCurrentCommandId(true),
+						 true /* wait for commit */ ,
+						 InvalidSnapshot);
 	switch (result)
 	{
 		case HeapTupleSelfUpdated:
@@ -2349,7 +2352,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
  *	update_xmax - output parameter, used only for failure case (see below)
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
- *	crosscheck - if not InvalidSnapshot, also check old tuple against this
+ *	lockcheck_snapshot - if not NULL, report the tuple as updated if it
+ *		was locked by a transaction not visible under this snapshot
  *	wait - true if should wait for any conflicting update to commit/abort
  *
  * Normal, successful return value is HeapTupleMayBeUpdated, which
@@ -2371,7 +2375,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 HTSU_Result
 heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 			ItemPointer ctid, TransactionId *update_xmax,
-			CommandId cid, Snapshot crosscheck, bool wait)
+			CommandId cid, bool wait, Snapshot lockcheck_snapshot)
 {
 	HTSU_Result result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -2523,19 +2527,20 @@ l2:
 			result = HeapTupleUpdated;
 	}
 
-	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
+	if ((result == HeapTupleMayBeUpdated) &&
+		!HeapSatisfiedLockCheckSnapshot(oldtup.t_data, lockcheck_snapshot))
 	{
-		/* Perform additional check for serializable RI updates */
-		if (!HeapTupleSatisfiesVisibility(&oldtup, crosscheck, buffer))
-			result = HeapTupleUpdated;
+		result = HeapTupleUpdated;
 	}
 
+
 	if (result != HeapTupleMayBeUpdated)
 	{
 		Assert(result == HeapTupleSelfUpdated ||
 			   result == HeapTupleUpdated ||
 			   result == HeapTupleBeingUpdated);
-		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
+		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
+			   (oldtup.t_data->t_infomask & HEAP_IS_LOCKED));
 		*ctid = oldtup.t_data->t_ctid;
 		*update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
 		UnlockReleaseBuffer(buffer);
@@ -2961,6 +2966,56 @@ HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
 	return true;
 }
 
+
+/*
+ * Check if the tuple was locked by a transaction invisible to
+ * lockcheck_snapshot. Returns false if so.
+ */
+static bool
+HeapSatisfiedLockCheckSnapshot(HeapTupleHeader tuple, Snapshot lockcheck_snapshot)
+{
+	if (tuple->t_infomask & HEAP_IS_LOCKED)
+	{
+		/* If the tuple was locked, we now check whether the locking
+		 * transaction(s) are visible under lockcheck_snapshot. If
+		 * they aren't, we pretend the tuple was updated.
+		 */
+
+		if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+		{
+			/* XXXFGP: Do the same check as below, but for multi xacts. The
+			 *         hard part is to guarantee that the multixact will still
+			 *         be there when we need to examine it. Currently, no
+			 *         mechanism seems to exist that'd guarantee that all
+			 *         multi xacts containing members invisible to
+			 *         lockcheck_snapshot outlive lockcheck_snapshot. Especially
+			 *         if lockcheck_snapshot is a serializable snapshot, which
+			 *         unfortunately is its designated use-case.
+			 */
+			return true;
+		}
+		else
+		{
+			if ((lockcheck_snapshot != InvalidSnapshot) &&
+			    XidInMVCCSnapshot(HeapTupleHeaderGetXmax(tuple), lockcheck_snapshot))
+			{
+				/* Locking transaction is invisible to lockcheck_snapshot */
+				return false;
+			}
+			else
+			{
+				/* Locking transaction is visible to lockcheck_snapshot */
+				return true;
+			}
+		}
+	}
+	else
+	{
+		/* Tuple wasn't locked */
+		return true;
+	}
+}
+
 /*
  *	simple_heap_update - replace a tuple
  *
@@ -2978,8 +3033,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
 
 	result = heap_update(relation, otid, tup,
 						 &update_ctid, &update_xmax,
-						 GetCurrentCommandId(true), InvalidSnapshot,
-						 true /* wait for commit */ );
+						 GetCurrentCommandId(true),
+						 true /* wait for commit */, InvalidSnapshot);
 	switch (result)
 	{
 		case HeapTupleSelfUpdated:
@@ -3013,6 +3068,9 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
  *		tuple's cmax if lock is successful)
  *	mode: indicates if shared or exclusive tuple lock is desired
  *	nowait: if true, ereport rather than blocking if lock not available
+ *	lockcheck_snapshot: if not NULL, report the tuple as updated if it
+ *						was locked by a transaction not visible under
+ *						this snapshot
  *
  * Output parameters:
  *	*tuple: all fields filled in
@@ -3066,7 +3124,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
 HTSU_Result
 heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
 				ItemPointer ctid, TransactionId *update_xmax,
-				CommandId cid, LockTupleMode mode, bool nowait)
+				CommandId cid, LockTupleMode mode, bool nowait,
+				Snapshot lockcheck_snapshot)
 {
 	HTSU_Result result;
 	ItemPointer tid = &(tuple->t_self);
@@ -3247,10 +3306,17 @@ l3:
 			result = HeapTupleUpdated;
 	}
 
+	if ((result == HeapTupleMayBeUpdated) &&
+		!HeapSatisfiedLockCheckSnapshot(tuple->t_data, lockcheck_snapshot))
+	{
+		result = HeapTupleUpdated;
+	}
+
 	if (result != HeapTupleMayBeUpdated)
 	{
 		Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated);
-		Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
+		Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
+			   (tuple->t_data->t_infomask & HEAP_IS_LOCKED));
 		*ctid = tuple->t_data->t_ctid;
 		*update_xmax = HeapTupleHeaderGetXmax(tuple->t_data);
 		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 2cbc192..120ea60 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2347,7 +2347,9 @@ ltrmark:;
 		test = heap_lock_tuple(relation, &tuple, &buffer,
 							   &update_ctid, &update_xmax,
 							   estate->es_output_cid,
-							   LockTupleExclusive, false);
+							   LockTupleExclusive, false,
+							   IsXactIsoLevelSerializable ? estate->es_snapshot :
+															InvalidSnapshot);
 		switch (test)
 		{
 			case HeapTupleSelfUpdated:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index d299310..fd5ed3c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1518,10 +1518,13 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
 			/*
 			 * This is a live tuple, so now try to lock it.
 			 */
+			Assert(!IsXactIsoLevelSerializable || (estate->es_snapshot != InvalidSnapshot));
 			test = heap_lock_tuple(relation, &tuple, &buffer,
 								   &update_ctid, &update_xmax,
 								   estate->es_output_cid,
-								   lockmode, false);
+								   lockmode, false,
+								   IsXactIsoLevelSerializable ? estate->es_snapshot :
+																InvalidSnapshot);
 			/* We now have two pins on the buffer, get rid of one */
 			ReleaseBuffer(buffer);
 
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 00f0562..f8baa13 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -117,7 +117,9 @@ lnext:
 		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
 							   &update_ctid, &update_xmax,
 							   estate->es_output_cid,
-							   lockmode, erm->noWait);
+							   lockmode, erm->noWait,
+							   IsXactIsoLevelSerializable ? estate->es_snapshot :
+															InvalidSnapshot);
 		ReleaseBuffer(buffer);
 		switch (test)
 		{
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7856b66..ecc75f9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -316,8 +316,9 @@ ldelete:;
 	result = heap_delete(resultRelationDesc, tupleid,
 						 &update_ctid, &update_xmax,
 						 estate->es_output_cid,
-						 estate->es_crosscheck_snapshot,
-						 true /* wait for commit */ );
+						 true, /* wait for commit */
+						 IsXactIsoLevelSerializable ? estate->es_snapshot :
+													  InvalidSnapshot);
 	switch (result)
 	{
 		case HeapTupleSelfUpdated:
@@ -504,8 +505,9 @@ lreplace:;
 	result = heap_update(resultRelationDesc, tupleid, tuple,
 						 &update_ctid, &update_xmax,
 						 estate->es_output_cid,
-						 estate->es_crosscheck_snapshot,
-						 true /* wait for commit */ );
+						 true, /* wait for commit */ 
+						 IsXactIsoLevelSerializable ? estate->es_snapshot :
+													  InvalidSnapshot);
 	switch (result)
 	{
 		case HeapTupleSelfUpdated:
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 4f3630c..302a571 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -72,9 +72,6 @@ SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
 SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
 SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
 
-/* local functions */
-static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
-
 
 /*
  * SetHintBits()
@@ -1253,7 +1250,7 @@ HeapTupleSatisfiesVacuum(HeapTupleHeader tuple, TransactionId OldestXmin,
  * by this function.  This is OK for current uses, because we actually only
  * apply this for known-committed XIDs.
  */
-static bool
+bool
 XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 {
 	uint32		i;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f963146..a17d62c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -98,15 +98,16 @@ extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			int options, BulkInsertState bistate);
 extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
 			ItemPointer ctid, TransactionId *update_xmax,
-			CommandId cid, Snapshot crosscheck, bool wait);
+			CommandId cid, bool wait, Snapshot crosscheck);
 extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
 			HeapTuple newtup,
 			ItemPointer ctid, TransactionId *update_xmax,
-			CommandId cid, Snapshot crosscheck, bool wait);
+			CommandId cid, bool wait, Snapshot crosscheck);
 extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 				Buffer *buffer, ItemPointer ctid,
 				TransactionId *update_xmax, CommandId cid,
-				LockTupleMode mode, bool nowait);
+				LockTupleMode mode, bool nowait,
+				Snapshot lockcheck_snapshot);
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
 				  Buffer buf);
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index e85e820..fa386e7 100644
--- a/src/include/utils/tqual.h
+++ b/src/include/utils/tqual.h
@@ -41,6 +41,8 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
 #define IsMVCCSnapshot(snapshot)  \
 	((snapshot)->satisfies == HeapTupleSatisfiesMVCC)
 
+bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
+
 /*
  * HeapTupleSatisfiesVisibility
  *		True iff heap tuple satisfies a time qual.
#20 Florian Pflug
fgp@phlo.org
In reply to: Florian Pflug (#19)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On May 14, 2010, at 16:32 , Kevin Grittner wrote:

Florian Pflug <fgp@phlo.org> wrote:

I must admit that I wasn't able to find an explicit reference to
Oracle's behavior in their docs, so I had to resort to
experiments. They do have examples showing how to do FK-like
constraints with triggers, and those don't contain any warning
whatsoever about problems in SERIALIZABLE mode, though. But
still, if there is word on this from Oracle somewhere, I'd love to
hear about it.

I suspect that in trying to emulate Oracle on this, you may run into
an issue which posed challenges for the SSI implementation which
didn't come up in the Cahill prototype implementations: Oracle, and
all other MVCC databases I've read about outside of PostgreSQL, use
an "update in place with a rollback log" technique. Access to any
version of a given row or index entry goes through a single
location, with possible backtracking through the log after that,
which simplifies management of certain concurrency issues. Do they
perhaps use an in-RAM lock table, pointing to the "base" location of
the row for these SELECT FOR UPDATE locks? (Just guessing; I've
never used Oracle, myself.)

Thanks for the heads up. I think my proposal dodges this, though, since UPDATE as well as FOR SHARE and FOR UPDATE already follow the ctid chain to find the most recent tuple, and fail with a serialization error (within a >= REPEATABLE READ transaction) should this tuple be inaccessible to the transaction's snapshot.

Btw, I've just posted a quick-and-dirty patch that implements the parts of my proposal that deal with FOR UPDATE vs. UPDATE conflicts in response to Robert Haas' mail on this thread, just in case you're interested.

best regards,
Florian Pflug

#21 Robert Haas
robertmhaas@gmail.com
In reply to: Florian Pflug (#19)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On Sun, May 16, 2010 at 9:07 PM, Florian Pflug <fgp@phlo.org> wrote:

On May 14, 2010, at 22:54 , Robert Haas wrote:

On Thu, May 13, 2010 at 5:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Florian Pflug <fgp@phlo.org> writes:

All in all, I believe that SHARE and UPDATE row-level locks should be
changed to cause concurrent UPDATEs to fail with a serialization
error.

I don't see an argument for doing that for FOR SHARE locks, and it
already happens for FOR UPDATE (at least if the row actually gets
updated).  AFAICS this proposal mainly breaks things, in pursuit of
an unnecessary and probably-impossible-anyway goal of making FK locking
work with only user-level snapshots.

After giving this considerable thought and testing the behavior at
some length, I think the OP has it right.  One thing I sometimes need
to do is denormalize a copy of a field, e.g.

<snipped example>

I've whipped up a quick and still rather dirty patch that implements the behavior I proposed, at least for the case of conflicts between FOR UPDATE locks and updates. With the patch, any attempt to UPDATE or FOR UPDATE lock a row that has concurrently been FOR UPDATE locked will cause a serialization error. (The same happens for an actually updated row, of course, but that was already the case before.)

While this part of the patch was fairly straightforward, making FOR SHARE conflict too seems to be much harder. The assumption that a lock becomes irrelevant after the transaction(s) that held it have completed is built deeply into the multi xact machinery that powers SHARE locks. That machinery therefore assumes that once all members of a multi xact have completed, the multi xact is dead also. But my proposal depends on a SERIALIZABLE transaction being able to find out whether any of the lockers of a row are invisible under its snapshot - for which it'd need any multi xact containing invisible xids to outlive its snapshot.

Thanks for putting this together. I suggest adding it to the open
CommitFest - even if we decide to go forward with this, I don't
imagine anyone is going to be excited about changing it during beta.

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#22 Florian Pflug
fgp@phlo.org
In reply to: Robert Haas (#21)
1 attachment(s)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On May 17, 2010, at 3:30 , Robert Haas wrote:

On Sun, May 16, 2010 at 9:07 PM, Florian Pflug <fgp@phlo.org> wrote:

On May 14, 2010, at 22:54 , Robert Haas wrote:

On Thu, May 13, 2010 at 5:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Florian Pflug <fgp@phlo.org> writes:

All in all, I believe that SHARE and UPDATE row-level locks should be
changed to cause concurrent UPDATEs to fail with a serialization
error.

I don't see an argument for doing that for FOR SHARE locks, and it
already happens for FOR UPDATE (at least if the row actually gets
updated). AFAICS this proposal mainly breaks things, in pursuit of
an unnecessary and probably-impossible-anyway goal of making FK locking
work with only user-level snapshots.

After giving this considerable thought and testing the behavior at
some length, I think the OP has it right. One thing I sometimes need
to do is denormalize a copy of a field, e.g.

<snipped example>

I've whipped up a quick and still rather dirty patch that implements the behavior I proposed, at least for the case of conflicts between FOR UPDATE locks and updates. With the patch, any attempt to UPDATE or FOR UPDATE lock a row that has concurrently been FOR UPDATE locked will cause a serialization error. (The same happens for an actually updated row, of course, but that was already the case before.)

While this part of the patch was fairly straightforward, making FOR SHARE conflict too seems to be much harder. The assumption that a lock becomes irrelevant after the transaction(s) that held it have completed is built deeply into the multi xact machinery that powers SHARE locks. That machinery therefore assumes that once all members of a multi xact have completed, the multi xact is dead also. But my proposal depends on a SERIALIZABLE transaction being able to find out whether any of the lockers of a row are invisible under its snapshot - for which it'd need any multi xact containing invisible xids to outlive its snapshot.

Thanks for putting this together. I suggest adding it to the open
CommitFest - even if we decide to go forward with this, I don't
imagine anyone is going to be excited about changing it during beta.

https://commitfest.postgresql.org/action/commitfest_view/open

Will do. Thanks for the link.

Here is an updated version that works for SHARE locks too.

(This message mainly serves as a way to link the updated patch to the commit fest entry. Is this how I'm supposed to do that, or am I doing something wrong?)

best regards,
Florian Pflug

Attachments:

serializable_lock_consistency.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8975191..9086c60 100644
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*************** static XLogRecPtr log_heap_update(Relati
*** 83,88 ****
--- 83,89 ----
  				bool all_visible_cleared, bool new_all_visible_cleared);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
+ static bool HeapSatisfiesLockersVisible(HeapTupleHeader tuple, Snapshot snapshot);
  
  
  /* ----------------------------------------------------------------
*************** simple_heap_insert(Relation relation, He
*** 2033,2040 ****
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - delete command ID (used for visibility test, and stored into
   *		cmax if successful)
-  *	crosscheck - if not InvalidSnapshot, also check tuple against this
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
   * actually means we did delete it.  Failure return codes are
--- 2034,2042 ----
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - delete command ID (used for visibility test, and stored into
   *		cmax if successful)
   *	wait - true if should wait for any conflicting update to commit/abort
+  *	lockcheck_snapshot - if not NULL, report the tuple as updated if it
+  *		was locked by a transaction not visible under this snapshot
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
   * actually means we did delete it.  Failure return codes are
*************** simple_heap_insert(Relation relation, He
*** 2049,2055 ****
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
--- 2051,2057 ----
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
*************** l1:
*** 2170,2181 ****
  		else
  			result = HeapTupleUpdated;
  	}
! 
! 	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
  	{
! 		/* Perform additional check for serializable RI updates */
! 		if (!HeapTupleSatisfiesVisibility(&tp, crosscheck, buffer))
! 			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
--- 2172,2183 ----
  		else
  			result = HeapTupleUpdated;
  	}
! 	
! 	/* Verify visibility of locking transactions. */
! 	if ((result == HeapTupleMayBeUpdated) &&
! 		!HeapSatisfiesLockersVisible(tp.t_data, lockcheck_snapshot))
  	{
! 		result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
*************** l1:
*** 2183,2189 ****
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = tp.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
  		UnlockReleaseBuffer(buffer);
--- 2185,2192 ----
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (tp.t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = tp.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
  		UnlockReleaseBuffer(buffer);
*************** simple_heap_delete(Relation relation, It
*** 2313,2320 ****
  
  	result = heap_delete(relation, tid,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true), InvalidSnapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 2316,2324 ----
  
  	result = heap_delete(relation, tid,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true),
! 						 true /* wait for commit */ ,
! 						 InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** simple_heap_delete(Relation relation, It
*** 2349,2355 ****
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - update command ID (used for visibility test, and stored into
   *		cmax/cmin if successful)
!  *	crosscheck - if not InvalidSnapshot, also check old tuple against this
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
--- 2353,2360 ----
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - update command ID (used for visibility test, and stored into
   *		cmax/cmin if successful)
!  *	lockcheck_snapshot - if not NULL, report the tuple as updated if it
!  *		was locked by a transaction not visible under this snapshot
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
*************** simple_heap_delete(Relation relation, It
*** 2371,2377 ****
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
--- 2376,2382 ----
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
*************** l2:
*** 2523,2541 ****
  			result = HeapTupleUpdated;
  	}
  
! 	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
  	{
! 		/* Perform additional check for serializable RI updates */
! 		if (!HeapTupleSatisfiesVisibility(&oldtup, crosscheck, buffer))
! 			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = oldtup.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
  		UnlockReleaseBuffer(buffer);
--- 2528,2548 ----
  			result = HeapTupleUpdated;
  	}
  
! 	/* Verify visibility of locking transactions. */
! 	if ((result == HeapTupleMayBeUpdated) &&
! 		!HeapSatisfiesLockersVisible(oldtup.t_data, lockcheck_snapshot))
  	{
! 		result = HeapTupleUpdated;
  	}
  
+ 
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (oldtup.t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = oldtup.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
  		UnlockReleaseBuffer(buffer);
*************** HeapSatisfiesHOTUpdate(Relation relation
*** 2961,2966 ****
--- 2968,3045 ----
  	return true;
  }
  
+ 
+ /*
+  * Returns false if one of the tuple's lockers committed but
+  * isn't visible according to lockcheck_snapshot.
+  * Assumes that all locking transactions have either committed or
+  * aborted, and that none are still in progress.
+  */
+ static bool
+ HeapSatisfiesLockersVisible(HeapTupleHeader tuple, Snapshot lockcheck_snapshot)
+ {
+ 	if (lockcheck_snapshot == InvalidSnapshot)
+ 		return true;
+ 	
+ 	if (tuple->t_infomask & HEAP_IS_LOCKED)
+ 	{
+ 		/* If the tuple was locked, we now check whether the locking
+ 		 * transaction(s) are visible under lockcheck_snapshot. If
+ 		 * they aren't, we pretend the tuple was updated.
+ 		 */
+ 
+ 		if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ 		{
+ 			TransactionId* xids;
+ 			int xids_l = GetMultiXactIdMembers(HeapTupleHeaderGetXmax(tuple), &xids);
+ 
+ 			if (xids_l < 1) {
+ 				/* The multi xact either is too old to be inspected or doesn't contain members.
+ 				 * The second case is probably impossible, but even if not it doesn't pose
+ 				 * any problem.
+ 				 * In the first case, we have to trust that all xids that were contained in
+ 				 * the xact are in fact visible under lockcheck_snapshot. Currently this
+ 				 * is always the case, since lockcheck_snapshot is always the transaction's
+ 				 * serializable snapshot, and we call MultiXactIdSetOldestVisible() before
+ 				 * acquireing that snapshot.
+ 				 * acquiring that snapshot.
+ 				return true;
+ 			}
+ 			else
+ 			{
+ 				int i;
+ 				for (i = 0; i < xids_l; ++i)
+ 				{
+ 					/* We expect to be called after the locking transactions' fates have been decided. */
+ 					Assert(!TransactionIdIsInProgress(xids[i]));
+ 					
+ 					if (!TransactionIdDidAbort(xids[i]) &&
+ 					    XidInMVCCSnapshot(xids[i], lockcheck_snapshot))
+ 					{
+ 						/* Oops. Non-aborted, invisible locker */
+ 						return false;
+ 					}
+ 				}
+ 				return true;
+ 			}
+ 		}
+ 		else
+ 		{
+ 			/* We expect to be called after the locking transaction's fate has been decided. */
+ 			Assert(!TransactionIdIsInProgress(HeapTupleHeaderGetXmax(tuple)));
+ 			
+ 			/* Locker must either be visible or have aborted */
+ 			return TransactionIdDidAbort(HeapTupleHeaderGetXmax(tuple)) ||
+ 				   !XidInMVCCSnapshot(HeapTupleHeaderGetXmax(tuple), lockcheck_snapshot);
+ 		}
+ 	}
+ 	else
+ 	{
+ 		/* Tuple wasn't locked */
+ 		return true;
+ 	}
+ }
+ 
  /*
   *	simple_heap_update - replace a tuple
   *
*************** simple_heap_update(Relation relation, It
*** 2978,2985 ****
  
  	result = heap_update(relation, otid, tup,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true), InvalidSnapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 3057,3064 ----
  
  	result = heap_update(relation, otid, tup,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true),
! 						 true /* wait for commit */, InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** simple_heap_update(Relation relation, It
*** 3013,3018 ****
--- 3092,3100 ----
   *		tuple's cmax if lock is successful)
   *	mode: indicates if shared or exclusive tuple lock is desired
   *	nowait: if true, ereport rather than blocking if lock not available
+  *	lockcheck_snapshot: if not NULL, report the tuple as updated if it
+  *						was locked by a transaction not visible under
+  *						this snapshot
   *
   * Output parameters:
   *	*tuple: all fields filled in
*************** simple_heap_update(Relation relation, It
*** 3066,3072 ****
  HTSU_Result
  heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
  				ItemPointer ctid, TransactionId *update_xmax,
! 				CommandId cid, LockTupleMode mode, bool nowait)
  {
  	HTSU_Result result;
  	ItemPointer tid = &(tuple->t_self);
--- 3148,3155 ----
  HTSU_Result
  heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
  				ItemPointer ctid, TransactionId *update_xmax,
! 				CommandId cid, LockTupleMode mode, bool nowait,
! 				Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	ItemPointer tid = &(tuple->t_self);
*************** l3:
*** 3246,3256 ****
  		else
  			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated);
! 		Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = tuple->t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tuple->t_data);
  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
--- 3329,3347 ----
  		else
  			result = HeapTupleUpdated;
  	}
+ 	
+ 	/* Verify visibility of locking transactions */
+ 	if ((result == HeapTupleMayBeUpdated) &&
+ 		!HeapSatisfiesLockersVisible(tuple->t_data, lockcheck_snapshot))
+ 	{
+ 		result = HeapTupleUpdated;
+ 	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated);
! 		Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (tuple->t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = tuple->t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tuple->t_data);
  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b5bb311..47ff5de 100644
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
*************** static MemoryContext MXactContext = NULL
*** 211,217 ****
  #endif
  
  /* internal MultiXactId management */
- static void MultiXactIdSetOldestVisible(void);
  static MultiXactId CreateMultiXactId(int nxids, TransactionId *xids);
  static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
  				   int nxids, TransactionId *xids);
--- 211,216 ----
*************** MultiXactIdSetOldestMember(void)
*** 531,537 ****
   * there is no live transaction, now or later, that can be a member of any
   * MultiXactId older than the OldestVisibleMXactId we compute here.
   */
! static void
  MultiXactIdSetOldestVisible(void)
  {
  	if (!MultiXactIdIsValid(OldestVisibleMXactId[MyBackendId]))
--- 530,536 ----
   * there is no live transaction, now or later, that can be a member of any
   * MultiXactId older than the OldestVisibleMXactId we compute here.
   */
! void
  MultiXactIdSetOldestVisible(void)
  {
  	if (!MultiXactIdIsValid(OldestVisibleMXactId[MyBackendId]))
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 2cbc192..120ea60 100644
*** a/src/backend/commands/trigger.c
--- b/src/backend/commands/trigger.c
*************** ltrmark:;
*** 2347,2353 ****
  		test = heap_lock_tuple(relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   LockTupleExclusive, false);
  		switch (test)
  		{
  			case HeapTupleSelfUpdated:
--- 2347,2355 ----
  		test = heap_lock_tuple(relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   LockTupleExclusive, false,
! 							   IsXactIsoLevelSerializable ? estate->es_snapshot :
! 															InvalidSnapshot);
  		switch (test)
  		{
  			case HeapTupleSelfUpdated:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index d299310..fd5ed3c 100644
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
*************** EvalPlanQualFetch(EState *estate, Relati
*** 1518,1527 ****
  			/*
  			 * This is a live tuple, so now try to lock it.
  			 */
  			test = heap_lock_tuple(relation, &tuple, &buffer,
  								   &update_ctid, &update_xmax,
  								   estate->es_output_cid,
! 								   lockmode, false);
  			/* We now have two pins on the buffer, get rid of one */
  			ReleaseBuffer(buffer);
  
--- 1518,1530 ----
  			/*
  			 * This is a live tuple, so now try to lock it.
  			 */
+ 			Assert(!IsXactIsoLevelSerializable || (estate->es_snapshot != InvalidSnapshot));
  			test = heap_lock_tuple(relation, &tuple, &buffer,
  								   &update_ctid, &update_xmax,
  								   estate->es_output_cid,
! 								   lockmode, false,
! 								   IsXactIsoLevelSerializable ? estate->es_snapshot :
! 																InvalidSnapshot);
  			/* We now have two pins on the buffer, get rid of one */
  			ReleaseBuffer(buffer);
  
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 00f0562..b771b19 100644
*** a/src/backend/executor/nodeLockRows.c
--- b/src/backend/executor/nodeLockRows.c
*************** lnext:
*** 71,76 ****
--- 71,77 ----
  		ItemPointerData update_ctid;
  		TransactionId update_xmax;
  		LockTupleMode lockmode;
+ 		Snapshot lockcheck_snapshot = InvalidSnapshot;
  		HTSU_Result test;
  		HeapTuple	copyTuple;
  
*************** lnext:
*** 110,123 ****
  
  		/* okay, try to lock the tuple */
  		if (erm->markType == ROW_MARK_EXCLUSIVE)
  			lockmode = LockTupleExclusive;
  		else
  			lockmode = LockTupleShared;
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   lockmode, erm->noWait);
  		ReleaseBuffer(buffer);
  		switch (test)
  		{
--- 111,134 ----
  
  		/* okay, try to lock the tuple */
  		if (erm->markType == ROW_MARK_EXCLUSIVE)
+ 		{
  			lockmode = LockTupleExclusive;
+ 			/* In serializable mode, acquiring an exclusive lock
+ 			 * requires seeing all other lockers.
+ 			 */
+ 			if (IsXactIsoLevelSerializable)
+ 				lockcheck_snapshot = estate->es_snapshot;
+ 		}
  		else
+ 		{
  			lockmode = LockTupleShared;
+ 		}
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   lockmode, erm->noWait,
! 							   lockcheck_snapshot);
  		ReleaseBuffer(buffer);
  		switch (test)
  		{
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7856b66..ecc75f9 100644
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
*************** ldelete:;
*** 316,323 ****
  	result = heap_delete(resultRelationDesc, tupleid,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 estate->es_crosscheck_snapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 316,324 ----
  	result = heap_delete(resultRelationDesc, tupleid,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 true, /* wait for commit */
! 						 IsXactIsoLevelSerializable ? estate->es_snapshot :
! 													  InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** lreplace:;
*** 504,511 ****
  	result = heap_update(resultRelationDesc, tupleid, tuple,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 estate->es_crosscheck_snapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 505,513 ----
  	result = heap_update(resultRelationDesc, tupleid, tuple,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 true, /* wait for commit */ 
! 						 IsXactIsoLevelSerializable ? estate->es_snapshot :
! 													  InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 59a27de..d4dc5ca 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 27,32 ****
--- 27,33 ----
  
  #include "access/transam.h"
  #include "access/xact.h"
+ #include "access/multixact.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
  #include "utils/memutils.h"
*************** GetTransactionSnapshot(void)
*** 125,130 ****
--- 126,139 ----
  	if (!FirstSnapshotSet)
  	{
  		Assert(RegisteredSnapshots == 0);
+ 		
+ 		/* Must store the oldest visible multi xact *before* taking the
+ 		 * serializable snapshot. Otherwise HeapSatisfiesLockersVisible in
+ 		 * heapam.c will fail, since it'll be unable to inspect a multi xact
+ 		 * that might contain xids invisible to the serializable snapshot.
+ 		 */
+ 		if (IsXactIsoLevelSerializable)
+ 			MultiXactIdSetOldestVisible();
  
  		CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  		FirstSnapshotSet = true;
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 4f3630c..302a571 100644
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
*************** SnapshotData SnapshotSelfData = {HeapTup
*** 72,80 ****
  SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
  SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
  
- /* local functions */
- static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
- 
  
  /*
   * SetHintBits()
--- 72,77 ----
*************** HeapTupleSatisfiesVacuum(HeapTupleHeader
*** 1253,1259 ****
   * by this function.  This is OK for current uses, because we actually only
   * apply this for known-committed XIDs.
   */
! static bool
  XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
  {
  	uint32		i;
--- 1250,1256 ----
   * by this function.  This is OK for current uses, because we actually only
   * apply this for known-committed XIDs.
   */
! bool
  XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
  {
  	uint32		i;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f963146..a17d62c 100644
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
*************** extern Oid heap_insert(Relation relation
*** 98,112 ****
  			int options, BulkInsertState bistate);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
  			HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  				Buffer *buffer, ItemPointer ctid,
  				TransactionId *update_xmax, CommandId cid,
! 				LockTupleMode mode, bool nowait);
  extern void heap_inplace_update(Relation relation, HeapTuple tuple);
  extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
  				  Buffer buf);
--- 98,113 ----
  			int options, BulkInsertState bistate);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
  			HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot);
  extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  				Buffer *buffer, ItemPointer ctid,
  				TransactionId *update_xmax, CommandId cid,
! 				LockTupleMode mode, bool nowait,
! 				Snapshot lockcheck_snapshot);
  extern void heap_inplace_update(Relation relation, HeapTuple tuple);
  extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
  				  Buffer buf);
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 5910465..4769bca 100644
*** a/src/include/access/multixact.h
--- b/src/include/access/multixact.h
*************** extern bool MultiXactIdIsCurrent(MultiXa
*** 49,54 ****
--- 49,55 ----
  extern void MultiXactIdWait(MultiXactId multi);
  extern bool ConditionalMultiXactIdWait(MultiXactId multi);
  extern void MultiXactIdSetOldestMember(void);
+ extern void MultiXactIdSetOldestVisible(void);
  extern int	GetMultiXactIdMembers(MultiXactId multi, TransactionId **xids);
  
  extern void AtEOXact_MultiXact(void);
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index e85e820..fa386e7 100644
*** a/src/include/utils/tqual.h
--- b/src/include/utils/tqual.h
*************** extern PGDLLIMPORT SnapshotData Snapshot
*** 41,46 ****
--- 41,48 ----
  #define IsMVCCSnapshot(snapshot)  \
  	((snapshot)->satisfies == HeapTupleSatisfiesMVCC)
  
+ bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
+ 
  /*
   * HeapTupleSatisfiesVisibility
   *		True iff heap tuple satisfies a time qual.
#23Robert Haas
robertmhaas@gmail.com
In reply to: Florian Pflug (#22)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On Tue, May 18, 2010 at 8:15 PM, Florian Pflug <fgp@phlo.org> wrote:

Will do. Thanks for the link.

Here is an updated version that works for SHARE locks too.

(This message mainly serves as a way to link the updated patch to the commit fest entry. Is this how I'm supposed to do that, or am I doing something wrong?)

Yeah - just go to the existing CF entry and say "New Comment" then
select type "Patch".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#24Florian Pflug
fgp@phlo.org
In reply to: Florian Pflug (#22)
3 attachment(s)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On May 19, 2010, at 2:15 , Florian Pflug wrote:

On May 17, 2010, at 3:30 , Robert Haas wrote:

On Sun, May 16, 2010 at 9:07 PM, Florian Pflug <fgp@phlo.org> wrote:

On May 14, 2010, at 22:54 , Robert Haas wrote:

On Thu, May 13, 2010 at 5:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Florian Pflug <fgp@phlo.org> writes:

All in all, I believe that SHARE and UPDATE row-level locks should be
changed to cause concurrent UPDATEs to fail with a serialization
error.

I don't see an argument for doing that for FOR SHARE locks, and it
already happens for FOR UPDATE (at least if the row actually gets
updated). AFAICS this proposal mainly breaks things, in pursuit of
an unnecessary and probably-impossible-anyway goal of making FK locking
work with only user-level snapshots.

After giving this considerable thought and testing the behavior at
some length, I think the OP has it right. One thing I sometimes need
to do is denormalize a copy of a field, e.g.

<snipped example>

I've whipped up a quick and still rather dirty patch that implements the behavior I proposed, at least for the case of conflicts between FOR UPDATE locks and updates. With the patch, any attempt to UPDATE or FOR UPDATE lock a row that has concurrently been FOR UPDATE locked will cause a serialization error. (The same for an actually updated row of course, but that happened before too).

While this part of the patch was fairly straightforward, making FOR SHARE locks conflict too seems to be much harder. The assumption that a lock becomes irrelevant after the transaction(s) that held it have completed is built deeply into the multi xact machinery that powers SHARE locks. That machinery therefore assumes that once all members of a multi xact have completed, the multi xact is dead as well. But my proposal depends on a SERIALIZABLE transaction being able to find out whether any of the lockers of a row are invisible under its snapshot - for which it'd need any multi xact containing invisible xids to outlive its snapshot.

Thanks for putting this together. I suggest adding it to the open
CommitFest - even if we decide to go forward with this, I don't
imagine anyone is going to be excited about changing it during beta.

https://commitfest.postgresql.org/action/commitfest_view/open

Will do. Thanks for the link.

Here is an updated version that works for SHARE locks too.

Forgetting to run "make check" before sending a patch is bad, as I just proved :-(

For the archives' and the commitfest app's sake, here is a version that actually passes the regression tests.

To make up for it, I also did some testing with a custom pgbench script & schema and proved the effectiveness of this patch. I ran this with "pgbench -s 10 -j 10 -c 10 -t 1000 -n -f fkbench.pgbench" on both HEAD and HEAD+patch. The former errors out quickly with "database inconsistent" while the latter completes the pgbench run without errors.
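To sketch what the workload exercises (illustrative only - the attached fkbench.pgbench and fkbench.init.sql are the authoritative versions, and the table and column names here are invented): each client runs serializable transactions that emulate a trigger-based FK along these lines

\setrandom pid1 1 10000
\setrandom pid2 1 10000
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- insert path: SHARE-lock the parent, then add a referencing child
SELECT 1 FROM parent WHERE id = :pid1 FOR SHARE;
INSERT INTO child (parent_id) VALUES (:pid1);
-- delete path: remove a parent only after checking for children
DELETE FROM parent p WHERE p.id = :pid2
  AND NOT EXISTS (SELECT 1 FROM child c WHERE c.parent_id = p.id);
COMMIT;

plus a check that bails out with "database inconsistent" as soon as a child row without a matching parent turns up.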

The patch still needs more work, at least on the comments & documentation side of things, but I'm going to let this rest now while we're in beta.

Patch, pgbench script and schema attached.

Attachments:

serializable_lock_consistency.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8975191..175c656 100644
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*************** static XLogRecPtr log_heap_update(Relati
*** 83,88 ****
--- 83,89 ----
  				bool all_visible_cleared, bool new_all_visible_cleared);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
+ static bool HeapSatisfiesLockersVisible(HeapTupleHeader tuple, Snapshot snapshot);
  
  
  /* ----------------------------------------------------------------
*************** simple_heap_insert(Relation relation, He
*** 2033,2040 ****
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - delete command ID (used for visibility test, and stored into
   *		cmax if successful)
-  *	crosscheck - if not InvalidSnapshot, also check tuple against this
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
   * actually means we did delete it.  Failure return codes are
--- 2034,2042 ----
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - delete command ID (used for visibility test, and stored into
   *		cmax if successful)
   *	wait - true if should wait for any conflicting update to commit/abort
+  *	lockcheck_snapshot - if not NULL, report the tuple as updated if it
+  *		was locked by a transaction not visible under this snapshot
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
   * actually means we did delete it.  Failure return codes are
*************** simple_heap_insert(Relation relation, He
*** 2049,2055 ****
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
--- 2051,2057 ----
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
*************** l1:
*** 2170,2181 ****
  		else
  			result = HeapTupleUpdated;
  	}
! 
! 	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
  	{
! 		/* Perform additional check for serializable RI updates */
! 		if (!HeapTupleSatisfiesVisibility(&tp, crosscheck, buffer))
! 			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
--- 2172,2183 ----
  		else
  			result = HeapTupleUpdated;
  	}
! 	
! 	/* Verify visibility of locking transactions. */
! 	if ((result == HeapTupleMayBeUpdated) &&
! 		!HeapSatisfiesLockersVisible(tp.t_data, lockcheck_snapshot))
  	{
! 		result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
*************** l1:
*** 2183,2189 ****
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = tp.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
  		UnlockReleaseBuffer(buffer);
--- 2185,2192 ----
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (tp.t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = tp.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
  		UnlockReleaseBuffer(buffer);
*************** simple_heap_delete(Relation relation, It
*** 2313,2320 ****
  
  	result = heap_delete(relation, tid,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true), InvalidSnapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 2316,2324 ----
  
  	result = heap_delete(relation, tid,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true),
! 						 true /* wait for commit */ ,
! 						 InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** simple_heap_delete(Relation relation, It
*** 2349,2355 ****
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - update command ID (used for visibility test, and stored into
   *		cmax/cmin if successful)
!  *	crosscheck - if not InvalidSnapshot, also check old tuple against this
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
--- 2353,2360 ----
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - update command ID (used for visibility test, and stored into
   *		cmax/cmin if successful)
!  *	lockcheck_snapshot - if not NULL, report the tuple as updated if it
!  *		was locked by a transaction not visible under this snapshot
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
*************** simple_heap_delete(Relation relation, It
*** 2371,2377 ****
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
--- 2376,2382 ----
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
*************** l2:
*** 2523,2541 ****
  			result = HeapTupleUpdated;
  	}
  
! 	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
  	{
! 		/* Perform additional check for serializable RI updates */
! 		if (!HeapTupleSatisfiesVisibility(&oldtup, crosscheck, buffer))
! 			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = oldtup.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
  		UnlockReleaseBuffer(buffer);
--- 2528,2548 ----
  			result = HeapTupleUpdated;
  	}
  
! 	/* Verify visibility of locking transactions. */
! 	if ((result == HeapTupleMayBeUpdated) &&
! 		!HeapSatisfiesLockersVisible(oldtup.t_data, lockcheck_snapshot))
  	{
! 		result = HeapTupleUpdated;
  	}
  
+ 
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (oldtup.t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = oldtup.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
  		UnlockReleaseBuffer(buffer);
*************** HeapSatisfiesHOTUpdate(Relation relation
*** 2961,2966 ****
--- 2968,3057 ----
  	return true;
  }
  
+ 
+ /*
+  * Returns false if one of the tuple's lockers committed but
+  * isn't visible according to lockcheck_snapshot, excluding
+  * subtransactions of the current transaction.
+  * Assumes that all locking transactions have either committed or
+  * aborted, and that none are still in progress.
+  */
+ static bool
+ HeapSatisfiesLockersVisible(HeapTupleHeader tuple, Snapshot lockcheck_snapshot)
+ {
+ 	if (lockcheck_snapshot == InvalidSnapshot)
+ 		return true;
+ 	
+ 	if (tuple->t_infomask & HEAP_IS_LOCKED)
+ 	{
+ 		/* If the tuple was locked, we now check whether the locking
+ 		 * transaction(s) are visible under lockcheck_snapshot. If
+ 		 * they aren't, we pretend the tuple was updated.
+ 		 */
+ 
+ 		if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ 		{
+ 			TransactionId* xids;
+ 			int xids_l = GetMultiXactIdMembers(HeapTupleHeaderGetXmax(tuple), &xids);
+ 
+ 			if (xids_l < 1) {
+ 				/* The multi xact either is too old to be inspected or doesn't contain members.
+ 				 * The second case is probably impossible, but even if not it doesn't pose
+ 				 * any problem.
+ 				 * In the first case, we have to trust that all xids that were contained in
+ 				 * the xact are in fact visible under lockcheck_snapshot. Currently this
+ 				 * is always the case, since lockcheck_snapshot is always the transaction's
+ 				 * serializable snapshot, and we call MultiXactIdSetOldestVisible() before
+ 				 * acquiring that snapshot.
+ 				 */
+ 				return true;
+ 			}
+ 			else
+ 			{
+ 				int i;
+ 				for (i = 0; i < xids_l; ++i)
+ 				{
+ 					/* Ignore our own subtransactions */
+ 					if (TransactionIdIsCurrentTransactionId(xids[i]))
+ 						continue;
+ 					
+ 					/* We expect to be called after the locking transactions' fates have been decided */
+ 					Assert(!TransactionIdIsInProgress(xids[i]));
+ 					
+ 					if (!TransactionIdDidAbort(xids[i]) &&
+ 					    XidInMVCCSnapshot(xids[i], lockcheck_snapshot))
+ 					{
+ 						/* Oops. Non-aborted, invisible locker */
+ 						return false;
+ 					}
+ 				}
+ 				return true;
+ 			}
+ 		}
+ 		else
+ 		{
+ 			TransactionId xid = HeapTupleHeaderGetXmax(tuple);
+ 			
+ 			/* Ignore our own subtransactions */
+ 			if (TransactionIdIsCurrentTransactionId(xid))
+ 				return true;
+ 			
+ 			/* We expect to be called after the locking transaction's fate has been decided, excluding
+ 			   ourselves. */
+ 			Assert(!TransactionIdIsInProgress(xid));
+ 			
+ 			/* Locker must either be visible or have aborted */
+ 			return TransactionIdDidAbort(xid) ||
+ 				   !XidInMVCCSnapshot(xid, lockcheck_snapshot);
+ 		}
+ 	}
+ 	else
+ 	{
+ 		/* Tuple wasn't locked */
+ 		return true;
+ 	}
+ }
+ 
  /*
   *	simple_heap_update - replace a tuple
   *
*************** simple_heap_update(Relation relation, It
*** 2978,2985 ****
  
  	result = heap_update(relation, otid, tup,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true), InvalidSnapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 3069,3076 ----
  
  	result = heap_update(relation, otid, tup,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true),
! 						 true /* wait for commit */, InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** simple_heap_update(Relation relation, It
*** 3013,3018 ****
--- 3104,3112 ----
   *		tuple's cmax if lock is successful)
   *	mode: indicates if shared or exclusive tuple lock is desired
   *	nowait: if true, ereport rather than blocking if lock not available
+  *	lockcheck_snapshot: if not NULL, report the tuple as updated if it
+  *						was locked by a transaction not visible under
+  *						this snapshot
   *
   * Output parameters:
   *	*tuple: all fields filled in
*************** simple_heap_update(Relation relation, It
*** 3066,3072 ****
  HTSU_Result
  heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
  				ItemPointer ctid, TransactionId *update_xmax,
! 				CommandId cid, LockTupleMode mode, bool nowait)
  {
  	HTSU_Result result;
  	ItemPointer tid = &(tuple->t_self);
--- 3160,3167 ----
  HTSU_Result
  heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
  				ItemPointer ctid, TransactionId *update_xmax,
! 				CommandId cid, LockTupleMode mode, bool nowait,
! 				Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	ItemPointer tid = &(tuple->t_self);
*************** l3:
*** 3246,3256 ****
  		else
  			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated);
! 		Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = tuple->t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tuple->t_data);
  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
--- 3341,3359 ----
  		else
  			result = HeapTupleUpdated;
  	}
+ 	
+ 	/* Verify visibility of locking transactions */
+ 	if ((result == HeapTupleMayBeUpdated) &&
+ 		!HeapSatisfiesLockersVisible(tuple->t_data, lockcheck_snapshot))
+ 	{
+ 		result = HeapTupleUpdated;
+ 	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated);
! 		Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (tuple->t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = tuple->t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tuple->t_data);
  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b5bb311..47ff5de 100644
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
*************** static MemoryContext MXactContext = NULL
*** 211,217 ****
  #endif
  
  /* internal MultiXactId management */
- static void MultiXactIdSetOldestVisible(void);
  static MultiXactId CreateMultiXactId(int nxids, TransactionId *xids);
  static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
  				   int nxids, TransactionId *xids);
--- 211,216 ----
*************** MultiXactIdSetOldestMember(void)
*** 531,537 ****
   * there is no live transaction, now or later, that can be a member of any
   * MultiXactId older than the OldestVisibleMXactId we compute here.
   */
! static void
  MultiXactIdSetOldestVisible(void)
  {
  	if (!MultiXactIdIsValid(OldestVisibleMXactId[MyBackendId]))
--- 530,536 ----
   * there is no live transaction, now or later, that can be a member of any
   * MultiXactId older than the OldestVisibleMXactId we compute here.
   */
! void
  MultiXactIdSetOldestVisible(void)
  {
  	if (!MultiXactIdIsValid(OldestVisibleMXactId[MyBackendId]))
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 2cbc192..120ea60 100644
*** a/src/backend/commands/trigger.c
--- b/src/backend/commands/trigger.c
*************** ltrmark:;
*** 2347,2353 ****
  		test = heap_lock_tuple(relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   LockTupleExclusive, false);
  		switch (test)
  		{
  			case HeapTupleSelfUpdated:
--- 2347,2355 ----
  		test = heap_lock_tuple(relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   LockTupleExclusive, false,
! 							   IsXactIsoLevelSerializable ? estate->es_snapshot :
! 															InvalidSnapshot);
  		switch (test)
  		{
  			case HeapTupleSelfUpdated:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index d299310..fd5ed3c 100644
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
*************** EvalPlanQualFetch(EState *estate, Relati
*** 1518,1527 ****
  			/*
  			 * This is a live tuple, so now try to lock it.
  			 */
  			test = heap_lock_tuple(relation, &tuple, &buffer,
  								   &update_ctid, &update_xmax,
  								   estate->es_output_cid,
! 								   lockmode, false);
  			/* We now have two pins on the buffer, get rid of one */
  			ReleaseBuffer(buffer);
  
--- 1518,1530 ----
  			/*
  			 * This is a live tuple, so now try to lock it.
  			 */
+ 			Assert(!IsXactIsoLevelSerializable || (estate->es_snapshot != InvalidSnapshot));
  			test = heap_lock_tuple(relation, &tuple, &buffer,
  								   &update_ctid, &update_xmax,
  								   estate->es_output_cid,
! 								   lockmode, false,
! 								   IsXactIsoLevelSerializable ? estate->es_snapshot :
! 																InvalidSnapshot);
  			/* We now have two pins on the buffer, get rid of one */
  			ReleaseBuffer(buffer);
  
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 00f0562..b771b19 100644
*** a/src/backend/executor/nodeLockRows.c
--- b/src/backend/executor/nodeLockRows.c
*************** lnext:
*** 71,76 ****
--- 71,77 ----
  		ItemPointerData update_ctid;
  		TransactionId update_xmax;
  		LockTupleMode lockmode;
+ 		Snapshot lockcheck_snapshot = InvalidSnapshot;
  		HTSU_Result test;
  		HeapTuple	copyTuple;
  
*************** lnext:
*** 110,123 ****
  
  		/* okay, try to lock the tuple */
  		if (erm->markType == ROW_MARK_EXCLUSIVE)
  			lockmode = LockTupleExclusive;
  		else
  			lockmode = LockTupleShared;
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   lockmode, erm->noWait);
  		ReleaseBuffer(buffer);
  		switch (test)
  		{
--- 111,134 ----
  
  		/* okay, try to lock the tuple */
  		if (erm->markType == ROW_MARK_EXCLUSIVE)
+ 		{
  			lockmode = LockTupleExclusive;
+ 			/* In serializable mode, acquiring an exclusive lock
+ 			 * requires seeing all other lockers.
+ 			 */
+ 			if (IsXactIsoLevelSerializable)
+ 				lockcheck_snapshot = estate->es_snapshot;
+ 		}
  		else
+ 		{
  			lockmode = LockTupleShared;
+ 		}
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   lockmode, erm->noWait,
! 							   lockcheck_snapshot);
  		ReleaseBuffer(buffer);
  		switch (test)
  		{
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7856b66..ecc75f9 100644
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
*************** ldelete:;
*** 316,323 ****
  	result = heap_delete(resultRelationDesc, tupleid,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 estate->es_crosscheck_snapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 316,324 ----
  	result = heap_delete(resultRelationDesc, tupleid,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 true, /* wait for commit */
! 						 IsXactIsoLevelSerializable ? estate->es_snapshot :
! 													  InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** lreplace:;
*** 504,511 ****
  	result = heap_update(resultRelationDesc, tupleid, tuple,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 estate->es_crosscheck_snapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 505,513 ----
  	result = heap_update(resultRelationDesc, tupleid, tuple,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 true, /* wait for commit */ 
! 						 IsXactIsoLevelSerializable ? estate->es_snapshot :
! 													  InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 59a27de..d4dc5ca 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 27,32 ****
--- 27,33 ----
  
  #include "access/transam.h"
  #include "access/xact.h"
+ #include "access/multixact.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
  #include "utils/memutils.h"
*************** GetTransactionSnapshot(void)
*** 125,130 ****
--- 126,139 ----
  	if (!FirstSnapshotSet)
  	{
  		Assert(RegisteredSnapshots == 0);
+ 		
+ 		/* Must store the oldest visible multi xact *before* taking the
+ 		 * serializable snapshot. Otherwise HeapSatisfiesLockersVisible in
+ 		 * heapam.c will fail, since it'll be unable to inspect a multi xact
+ 		 * that might contain xids invisible to the serializable snapshot.
+ 		 */
+ 		if (IsXactIsoLevelSerializable)
+ 			MultiXactIdSetOldestVisible();
  
  		CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  		FirstSnapshotSet = true;
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 4f3630c..302a571 100644
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
*************** SnapshotData SnapshotSelfData = {HeapTup
*** 72,80 ****
  SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
  SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
  
- /* local functions */
- static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
- 
  
  /*
   * SetHintBits()
--- 72,77 ----
*************** HeapTupleSatisfiesVacuum(HeapTupleHeader
*** 1253,1259 ****
   * by this function.  This is OK for current uses, because we actually only
   * apply this for known-committed XIDs.
   */
! static bool
  XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
  {
  	uint32		i;
--- 1250,1256 ----
   * by this function.  This is OK for current uses, because we actually only
   * apply this for known-committed XIDs.
   */
! bool
  XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
  {
  	uint32		i;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f963146..a17d62c 100644
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
*************** extern Oid heap_insert(Relation relation
*** 98,112 ****
  			int options, BulkInsertState bistate);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
  			HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  				Buffer *buffer, ItemPointer ctid,
  				TransactionId *update_xmax, CommandId cid,
! 				LockTupleMode mode, bool nowait);
  extern void heap_inplace_update(Relation relation, HeapTuple tuple);
  extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
  				  Buffer buf);
--- 98,113 ----
  			int options, BulkInsertState bistate);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
  			HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot);
  extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  				Buffer *buffer, ItemPointer ctid,
  				TransactionId *update_xmax, CommandId cid,
! 				LockTupleMode mode, bool nowait,
! 				Snapshot lockcheck_snapshot);
  extern void heap_inplace_update(Relation relation, HeapTuple tuple);
  extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
  				  Buffer buf);
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 5910465..4769bca 100644
*** a/src/include/access/multixact.h
--- b/src/include/access/multixact.h
*************** extern bool MultiXactIdIsCurrent(MultiXa
*** 49,54 ****
--- 49,55 ----
  extern void MultiXactIdWait(MultiXactId multi);
  extern bool ConditionalMultiXactIdWait(MultiXactId multi);
  extern void MultiXactIdSetOldestMember(void);
+ extern void MultiXactIdSetOldestVisible(void);
  extern int	GetMultiXactIdMembers(MultiXactId multi, TransactionId **xids);
  
  extern void AtEOXact_MultiXact(void);
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index e85e820..fa386e7 100644
*** a/src/include/utils/tqual.h
--- b/src/include/utils/tqual.h
*************** extern PGDLLIMPORT SnapshotData Snapshot
*** 41,46 ****
--- 41,48 ----
  #define IsMVCCSnapshot(snapshot)  \
  	((snapshot)->satisfies == HeapTupleSatisfiesMVCC)
  
+ bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
+ 
  /*
   * HeapTupleSatisfiesVisibility
   *		True iff heap tuple satisfies a time qual.
fkbench.init.sql (application/octet-stream)
fkbench.pgbench (application/octet-stream)
#25Florian Pflug
fgp@phlo.org
In reply to: Florian Pflug (#24)
3 attachment(s)
Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On May 21, 2010, at 4:20 , Florian Pflug wrote:

On May 19, 2010, at 2:15 , Florian Pflug wrote:

On May 17, 2010, at 3:30 , Robert Haas wrote:

On Sun, May 16, 2010 at 9:07 PM, Florian Pflug <fgp@phlo.org> wrote:

On May 14, 2010, at 22:54 , Robert Haas wrote:

On Thu, May 13, 2010 at 5:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Florian Pflug <fgp@phlo.org> writes:

All in all, I believe that SHARE and UPDATE row-level locks should be
changed to cause concurrent UPDATEs to fail with a serialization
error.

I don't see an argument for doing that for FOR SHARE locks, and it
already happens for FOR UPDATE (at least if the row actually gets
updated). AFAICS this proposal mainly breaks things, in pursuit of
an unnecessary and probably-impossible-anyway goal of making FK locking
work with only user-level snapshots.

After giving this considerable thought and testing the behavior at
some length, I think the OP has it right. One thing I sometimes need
to do is denormalize a copy of a field, e.g.

<snipped example>

I've whipped up a quick and still rather dirty patch that implements the behavior I proposed, at least for the case of conflicts between FOR UPDATE locks and updates. With the patch, any attempt to UPDATE or FOR UPDATE lock a row that has concurrently been FOR UPDATE locked will cause a serialization error. (The same for an actually updated row of course, but that happened before too).

While this part of the patch was fairly straightforward, making FOR SHARE locks conflict too seems to be much harder. The assumption that a lock becomes irrelevant after the transaction(s) that held it have completed is built deeply into the multi xact machinery that powers SHARE locks. That machinery therefore assumes that once all members of a multi xact have completed, the multi xact is dead as well. But my proposal depends on a SERIALIZABLE transaction being able to find out whether any of the lockers of a row are invisible under its snapshot - for which it'd need any multi xact containing invisible xids to outlive its snapshot.

Thanks for putting this together. I suggest adding it to the open
CommitFest - even if we decide to go forward with this, I don't
imagine anyone is going to be excited about changing it during beta.

https://commitfest.postgresql.org/action/commitfest_view/open

Will do. Thanks for the link.

Here is an updated version that works for SHARE locks too.

Forgetting to run "make check" before sending a patch is bad, as I just proved :-(

For the archives' and the commitfest app's sake, here is a version that actually passes the regression tests.

To make up for it, I also did some testing with a custom pgbench script & schema and proved the effectiveness of this patch. I ran this with "pgbench -s 10 -j 10 -c 10 -t 1000 -n -f fkbench.pgbench" on both HEAD and HEAD+patch. The former errors out quickly with "database inconsistent" while the latter completes the pgbench run without errors.

The patch still needs more work, at least on the comments & documentation side of things, but I'm going to let this rest now while we're in beta.

Patch, pgbench script and schema attached.

Great, now my mail client decided to encode those attachments with MacBinary instead of sending them as plain text :-(

Not sure if MUAs other than Mail.app can open those, so I'm resending this. Really sorry for the noise, guys

best regards,
Florian Pflug

Attachments:

serializable_lock_consistency.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8975191..175c656 100644
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*************** static XLogRecPtr log_heap_update(Relati
*** 83,88 ****
--- 83,89 ----
  				bool all_visible_cleared, bool new_all_visible_cleared);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
+ static bool HeapSatisfiesLockersVisible(HeapTupleHeader tuple, Snapshot snapshot);
  
  
  /* ----------------------------------------------------------------
*************** simple_heap_insert(Relation relation, He
*** 2033,2040 ****
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - delete command ID (used for visibility test, and stored into
   *		cmax if successful)
-  *	crosscheck - if not InvalidSnapshot, also check tuple against this
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
   * actually means we did delete it.  Failure return codes are
--- 2034,2042 ----
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - delete command ID (used for visibility test, and stored into
   *		cmax if successful)
   *	wait - true if should wait for any conflicting update to commit/abort
+  *	lockcheck_snapshot - if not NULL, report the tuple as updated if it
+  *		was locked by a transaction not visible under this snapshot
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
   * actually means we did delete it.  Failure return codes are
*************** simple_heap_insert(Relation relation, He
*** 2049,2055 ****
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
--- 2051,2057 ----
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
*************** l1:
*** 2170,2181 ****
  		else
  			result = HeapTupleUpdated;
  	}
! 
! 	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
  	{
! 		/* Perform additional check for serializable RI updates */
! 		if (!HeapTupleSatisfiesVisibility(&tp, crosscheck, buffer))
! 			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
--- 2172,2183 ----
  		else
  			result = HeapTupleUpdated;
  	}
! 	
! 	/* Verify visibility of locking transactions. */
! 	if ((result == HeapTupleMayBeUpdated) &&
! 		!HeapSatisfiesLockersVisible(tp.t_data, lockcheck_snapshot))
  	{
! 		result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
*************** l1:
*** 2183,2189 ****
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = tp.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
  		UnlockReleaseBuffer(buffer);
--- 2185,2192 ----
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (tp.t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = tp.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
  		UnlockReleaseBuffer(buffer);
*************** simple_heap_delete(Relation relation, It
*** 2313,2320 ****
  
  	result = heap_delete(relation, tid,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true), InvalidSnapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 2316,2324 ----
  
  	result = heap_delete(relation, tid,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true),
! 						 true /* wait for commit */ ,
! 						 InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** simple_heap_delete(Relation relation, It
*** 2349,2355 ****
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - update command ID (used for visibility test, and stored into
   *		cmax/cmin if successful)
!  *	crosscheck - if not InvalidSnapshot, also check old tuple against this
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
--- 2353,2360 ----
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - update command ID (used for visibility test, and stored into
   *		cmax/cmin if successful)
!  *	lockcheck_snapshot - if not NULL, report the tuple as updated if it
!  *		was locked by a transaction not visible under this snapshot
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
*************** simple_heap_delete(Relation relation, It
*** 2371,2377 ****
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
--- 2376,2382 ----
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
*************** l2:
*** 2523,2541 ****
  			result = HeapTupleUpdated;
  	}
  
! 	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
  	{
! 		/* Perform additional check for serializable RI updates */
! 		if (!HeapTupleSatisfiesVisibility(&oldtup, crosscheck, buffer))
! 			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = oldtup.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
  		UnlockReleaseBuffer(buffer);
--- 2528,2548 ----
  			result = HeapTupleUpdated;
  	}
  
! 	/* Verify visibility of locking transactions. */
! 	if ((result == HeapTupleMayBeUpdated) &&
! 		!HeapSatisfiesLockersVisible(oldtup.t_data, lockcheck_snapshot))
  	{
! 		result = HeapTupleUpdated;
  	}
  
+ 
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (oldtup.t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = oldtup.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
  		UnlockReleaseBuffer(buffer);
*************** HeapSatisfiesHOTUpdate(Relation relation
*** 2961,2966 ****
--- 2968,3057 ----
  	return true;
  }
  
+ 
+ /*
+  * Returns false if one of the tuple's lockers committed but isn't
+  * visible according to lockcheck_snapshot, excluding subtransactions
+  * of the current transaction.
+  * Assumes that all locking transactions have either committed or
+  * aborted, i.e. aren't still in progress.
+  */
+ static bool
+ HeapSatisfiesLockersVisible(HeapTupleHeader tuple, Snapshot lockcheck_snapshot)
+ {
+ 	if (lockcheck_snapshot == InvalidSnapshot)
+ 		return true;
+ 	
+ 	if (tuple->t_infomask & HEAP_IS_LOCKED)
+ 	{
+ 		/* If the tuple was locked, we now check whether the locking
+ 		 * transaction(s) are visible under lockcheck_snapshot. If
+ 		 * they aren't, we pretend the tuple was updated.
+ 		 */
+ 
+ 		if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ 		{
+ 			TransactionId* xids;
+ 			int xids_l = GetMultiXactIdMembers(HeapTupleHeaderGetXmax(tuple), &xids);
+ 
+ 			if (xids_l < 1) {
+ 				/* The multi xact either is too old to be inspected or doesn't contain members.
+ 				 * The second case is probably impossible, but even if not it doesn't pose
+ 				 * any problem.
+ 				 * In the first case, we have to trust that all xids that were contained in
+ 				 * the xact are in fact visible under lockcheck_snapshot. Currently this
+ 				 * is always the case, since lockcheck_snapshot is always the transaction's
+ 				 * serializable snapshot, and we call MultiXactIdSetOldestVisible() before
+ 				 * acquiring that snapshot.
+ 				 */
+ 				return true;
+ 			}
+ 			else
+ 			{
+ 				int i;
+ 				for (i = 0; i < xids_l; ++i)
+ 				{
+ 					/* Ignore our own subtransactions */
+ 					if (TransactionIdIsCurrentTransactionId(xids[i]))
+ 						continue;
+ 					
+ 					/* We expect to be called after the locking transactions' fates have been decided */
+ 					Assert(!TransactionIdIsInProgress(xids[i]));
+ 					
+ 					if (!TransactionIdDidAbort(xids[i]) &&
+ 					    XidInMVCCSnapshot(xids[i], lockcheck_snapshot))
+ 					{
+ 						/* Oops: non-aborted, invisible locker */
+ 						return false;
+ 					}
+ 				}
+ 				return true;
+ 			}
+ 		}
+ 		else
+ 		{
+ 			TransactionId xid = HeapTupleHeaderGetXmax(tuple);
+ 			
+ 			/* Ignore our own subtransactions */
+ 			if (TransactionIdIsCurrentTransactionId(xid))
+ 				return true;
+ 			
+ 			/* We expect to be called after the locking transaction's fate has been decided, excluding
+ 			   ourself. */
+ 			Assert(!TransactionIdIsInProgress(xid));
+ 			
+ 			/* Locker must either be visible or have aborted */
+ 			return TransactionIdDidAbort(xid) ||
+ 				   !XidInMVCCSnapshot(xid, lockcheck_snapshot);
+ 		}
+ 	}
+ 	else
+ 	{
+ 		/* Tuple wasn't locked */
+ 		return true;
+ 	}
+ }
+ 
  /*
   *	simple_heap_update - replace a tuple
   *
*************** simple_heap_update(Relation relation, It
*** 2978,2985 ****
  
  	result = heap_update(relation, otid, tup,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true), InvalidSnapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 3069,3076 ----
  
  	result = heap_update(relation, otid, tup,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true),
! 						 true /* wait for commit */, InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** simple_heap_update(Relation relation, It
*** 3013,3018 ****
--- 3104,3112 ----
   *		tuple's cmax if lock is successful)
   *	mode: indicates if shared or exclusive tuple lock is desired
   *	nowait: if true, ereport rather than blocking if lock not available
+  *	lockcheck_snapshot: if not NULL, report the tuple as updated if it
+  *						was locked by a transaction not visible under
+  *						this snapshot
   *
   * Output parameters:
   *	*tuple: all fields filled in
*************** simple_heap_update(Relation relation, It
*** 3066,3072 ****
  HTSU_Result
  heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
  				ItemPointer ctid, TransactionId *update_xmax,
! 				CommandId cid, LockTupleMode mode, bool nowait)
  {
  	HTSU_Result result;
  	ItemPointer tid = &(tuple->t_self);
--- 3160,3167 ----
  HTSU_Result
  heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
  				ItemPointer ctid, TransactionId *update_xmax,
! 				CommandId cid, LockTupleMode mode, bool nowait,
! 				Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	ItemPointer tid = &(tuple->t_self);
*************** l3:
*** 3246,3256 ****
  		else
  			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated);
! 		Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = tuple->t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tuple->t_data);
  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
--- 3341,3359 ----
  		else
  			result = HeapTupleUpdated;
  	}
+ 	
+ 	/* Verify visibility of locking transactions */
+ 	if ((result == HeapTupleMayBeUpdated) &&
+ 		!HeapSatisfiesLockersVisible(tuple->t_data, lockcheck_snapshot))
+ 	{
+ 		result = HeapTupleUpdated;
+ 	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated);
! 		Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (tuple->t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = tuple->t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tuple->t_data);
  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b5bb311..47ff5de 100644
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
*************** static MemoryContext MXactContext = NULL
*** 211,217 ****
  #endif
  
  /* internal MultiXactId management */
- static void MultiXactIdSetOldestVisible(void);
  static MultiXactId CreateMultiXactId(int nxids, TransactionId *xids);
  static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
  				   int nxids, TransactionId *xids);
--- 211,216 ----
*************** MultiXactIdSetOldestMember(void)
*** 531,537 ****
   * there is no live transaction, now or later, that can be a member of any
   * MultiXactId older than the OldestVisibleMXactId we compute here.
   */
! static void
  MultiXactIdSetOldestVisible(void)
  {
  	if (!MultiXactIdIsValid(OldestVisibleMXactId[MyBackendId]))
--- 530,536 ----
   * there is no live transaction, now or later, that can be a member of any
   * MultiXactId older than the OldestVisibleMXactId we compute here.
   */
! void
  MultiXactIdSetOldestVisible(void)
  {
  	if (!MultiXactIdIsValid(OldestVisibleMXactId[MyBackendId]))
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 2cbc192..120ea60 100644
*** a/src/backend/commands/trigger.c
--- b/src/backend/commands/trigger.c
*************** ltrmark:;
*** 2347,2353 ****
  		test = heap_lock_tuple(relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   LockTupleExclusive, false);
  		switch (test)
  		{
  			case HeapTupleSelfUpdated:
--- 2347,2355 ----
  		test = heap_lock_tuple(relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   LockTupleExclusive, false,
! 							   IsXactIsoLevelSerializable ? estate->es_snapshot :
! 															InvalidSnapshot);
  		switch (test)
  		{
  			case HeapTupleSelfUpdated:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index d299310..fd5ed3c 100644
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
*************** EvalPlanQualFetch(EState *estate, Relati
*** 1518,1527 ****
  			/*
  			 * This is a live tuple, so now try to lock it.
  			 */
  			test = heap_lock_tuple(relation, &tuple, &buffer,
  								   &update_ctid, &update_xmax,
  								   estate->es_output_cid,
! 								   lockmode, false);
  			/* We now have two pins on the buffer, get rid of one */
  			ReleaseBuffer(buffer);
  
--- 1518,1530 ----
  			/*
  			 * This is a live tuple, so now try to lock it.
  			 */
+ 			Assert(!IsXactIsoLevelSerializable || (estate->es_snapshot != InvalidSnapshot));
  			test = heap_lock_tuple(relation, &tuple, &buffer,
  								   &update_ctid, &update_xmax,
  								   estate->es_output_cid,
! 								   lockmode, false,
! 								   IsXactIsoLevelSerializable ? estate->es_snapshot :
! 																InvalidSnapshot);
  			/* We now have two pins on the buffer, get rid of one */
  			ReleaseBuffer(buffer);
  
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 00f0562..b771b19 100644
*** a/src/backend/executor/nodeLockRows.c
--- b/src/backend/executor/nodeLockRows.c
*************** lnext:
*** 71,76 ****
--- 71,77 ----
  		ItemPointerData update_ctid;
  		TransactionId update_xmax;
  		LockTupleMode lockmode;
+ 		Snapshot lockcheck_snapshot = InvalidSnapshot;
  		HTSU_Result test;
  		HeapTuple	copyTuple;
  
*************** lnext:
*** 110,123 ****
  
  		/* okay, try to lock the tuple */
  		if (erm->markType == ROW_MARK_EXCLUSIVE)
  			lockmode = LockTupleExclusive;
  		else
  			lockmode = LockTupleShared;
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   lockmode, erm->noWait);
  		ReleaseBuffer(buffer);
  		switch (test)
  		{
--- 111,134 ----
  
  		/* okay, try to lock the tuple */
  		if (erm->markType == ROW_MARK_EXCLUSIVE)
+ 		{
  			lockmode = LockTupleExclusive;
+ 			/* In serializable mode, acquiring an exclusive lock
+ 			 * requires seeing all other lockers.
+ 			 */
+ 			if (IsXactIsoLevelSerializable)
+ 				lockcheck_snapshot = estate->es_snapshot;
+ 		}
  		else
+ 		{
  			lockmode = LockTupleShared;
+ 		}
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   lockmode, erm->noWait,
! 							   lockcheck_snapshot);
  		ReleaseBuffer(buffer);
  		switch (test)
  		{
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7856b66..ecc75f9 100644
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
*************** ldelete:;
*** 316,323 ****
  	result = heap_delete(resultRelationDesc, tupleid,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 estate->es_crosscheck_snapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 316,324 ----
  	result = heap_delete(resultRelationDesc, tupleid,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 true, /* wait for commit */
! 						 IsXactIsoLevelSerializable ? estate->es_snapshot :
! 													  InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** lreplace:;
*** 504,511 ****
  	result = heap_update(resultRelationDesc, tupleid, tuple,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 estate->es_crosscheck_snapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 505,513 ----
  	result = heap_update(resultRelationDesc, tupleid, tuple,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 true, /* wait for commit */ 
! 						 IsXactIsoLevelSerializable ? estate->es_snapshot :
! 													  InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 59a27de..d4dc5ca 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 27,32 ****
--- 27,33 ----
  
  #include "access/transam.h"
  #include "access/xact.h"
+ #include "access/multixact.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
  #include "utils/memutils.h"
*************** GetTransactionSnapshot(void)
*** 125,130 ****
--- 126,139 ----
  	if (!FirstSnapshotSet)
  	{
  		Assert(RegisteredSnapshots == 0);
+ 		
+ 		/* Must store the oldest visible multi xact *before* taking the
+ 		 * serializable snapshot. Otherwise HeapSatisfiesLockersVisible in
+ 		 * heapam.c will fail, since it'll be unable to inspect a multi xact
+ 		 * that might contain xids invisible to the serializable snapshot.
+ 		 */
+ 		if (IsXactIsoLevelSerializable)
+ 			MultiXactIdSetOldestVisible();
  
  		CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  		FirstSnapshotSet = true;
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 4f3630c..302a571 100644
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
*************** SnapshotData SnapshotSelfData = {HeapTup
*** 72,80 ****
  SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
  SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
  
- /* local functions */
- static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
- 
  
  /*
   * SetHintBits()
--- 72,77 ----
*************** HeapTupleSatisfiesVacuum(HeapTupleHeader
*** 1253,1259 ****
   * by this function.  This is OK for current uses, because we actually only
   * apply this for known-committed XIDs.
   */
! static bool
  XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
  {
  	uint32		i;
--- 1250,1256 ----
   * by this function.  This is OK for current uses, because we actually only
   * apply this for known-committed XIDs.
   */
! bool
  XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
  {
  	uint32		i;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f963146..a17d62c 100644
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
*************** extern Oid heap_insert(Relation relation
*** 98,112 ****
  			int options, BulkInsertState bistate);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
  			HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  				Buffer *buffer, ItemPointer ctid,
  				TransactionId *update_xmax, CommandId cid,
! 				LockTupleMode mode, bool nowait);
  extern void heap_inplace_update(Relation relation, HeapTuple tuple);
  extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
  				  Buffer buf);
--- 98,113 ----
  			int options, BulkInsertState bistate);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
  			HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot);
  extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  				Buffer *buffer, ItemPointer ctid,
  				TransactionId *update_xmax, CommandId cid,
! 				LockTupleMode mode, bool nowait,
! 				Snapshot lockcheck_snapshot);
  extern void heap_inplace_update(Relation relation, HeapTuple tuple);
  extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
  				  Buffer buf);
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 5910465..4769bca 100644
*** a/src/include/access/multixact.h
--- b/src/include/access/multixact.h
*************** extern bool MultiXactIdIsCurrent(MultiXa
*** 49,54 ****
--- 49,55 ----
  extern void MultiXactIdWait(MultiXactId multi);
  extern bool ConditionalMultiXactIdWait(MultiXactId multi);
  extern void MultiXactIdSetOldestMember(void);
+ extern void MultiXactIdSetOldestVisible(void);
  extern int	GetMultiXactIdMembers(MultiXactId multi, TransactionId **xids);
  
  extern void AtEOXact_MultiXact(void);
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index e85e820..fa386e7 100644
*** a/src/include/utils/tqual.h
--- b/src/include/utils/tqual.h
*************** extern PGDLLIMPORT SnapshotData Snapshot
*** 41,46 ****
--- 41,48 ----
  #define IsMVCCSnapshot(snapshot)  \
  	((snapshot)->satisfies == HeapTupleSatisfiesMVCC)
  
+ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
+ 
  /*
   * HeapTupleSatisfiesVisibility
   *		True iff heap tuple satisfies a time qual.
Attachments:

fkbench.init.sql (application/octet-stream)
fkbench.pgbench (application/octet-stream)
#26Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Florian Pflug (#25)
Re: Review: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

=================
Submission review
=================

* Is the patch in context diff format?

Yes.

* Does it apply cleanly to the current CVS HEAD?

Yes.

* Does it include reasonable tests, necessary doc patches, etc?

There is one pgbench test which shows incorrect behavior without the
patch and correct behavior with the patch for a significant use
case.

Documentation changes are needed in the "Concurrency Control"
chapter.

================
Usability review
================

* Does the patch actually implement that?

Yes.

* Do we want that?

Yes. We seem to have reached consensus on the -hackers list to that
effect. On a personal note, I heard some current Oracle users who
were considering a switch to PostgreSQL grumbling after Robert's
trigger presentation at PostgreSQL Conference U.S. East about how
they didn't need to use such complex coding techniques to ensure
data integrity in Oracle as is required in PostgreSQL. I was
surprised, since I know that they also get snapshot isolation when
they request serializable, but they explained that SELECT FOR UPDATE
provides stronger guarantees in Oracle. This patch should provide
equivalent behavior, which should ease migration from Oracle and
allow simpler coding techniques in snapshot isolation to protect
data.

* Do we already have it?

No.

* Does it follow SQL spec, or the community-agreed behavior?

It's not in the spec, but it moves to safer behavior which is
consistent with the current Oracle implementation.

* Does it include pg_dump support (if applicable)?

Not applicable.

* Are there dangers?

Some code which continues after blocking will now get a
serialization failure. It's possible that this could cause problems
for some existing code, although that code was likely either using
SELECT FOR UPDATE unnecessarily or was unsafe without this patch.
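
For illustration, a minimal sketch of code that changes behavior (the
table t and its columns id and val are assumed here purely for the
example, following the session notation used earlier in this thread):

C1: BEGIN
C1: SELECT * FROM t WHERE id = 1 FOR SHARE
C2: BEGIN
C2: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
C2: SELECT * FROM t -- snapshot is taken here
C2: UPDATE t SET val = val + 1 WHERE id = 1 -- blocks on C1's lock
C1: COMMIT

Previously, C2's UPDATE would continue and succeed once C1 committed,
as if the FOR SHARE lock had never existed. With the patch it instead
fails with "could not serialize access due to concurrent update",
since the row was locked by a transaction that isn't visible under
C2's snapshot.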

* Have all the bases been covered?

As far as I can see.

============
Feature test
============

* Does the feature work as advertised?

Yes.

* Are there corner cases the author has failed to consider?

Not that I found.

* Are there any assertion failures or crashes?

No.

==================
Performance review
==================

* Does the patch slow down simple tests?

No.

* If it claims to improve performance, does it?

It makes no such claim.

* Does it slow down other things?

No.

=============
Coding review
=============

* Does it follow the project coding guidelines?

Comments are not all in standard style.

In some cases there are unnecessary braces around a single statement
for an *if* or *else*.

There are functions where the last two parameters were changed from:

Snapshot crosscheck, bool wait

to:

bool wait, Snapshot lockcheck_snapshot

It appears this was done so that the semantic change to the use of the
snapshot can't silently break code at runtime: any caller not updated
for the change is forced to notice it through compile errors, which
seems like a Good Thing.

* Are there portability issues?

No.

* Will it work on Windows/BSD etc?

Yes.

* Are the comments sufficient and accurate?

Given that there is a significant behavioral change, it seems as
though there could be a sentence or two somewhere concisely stating
how things behave, but I'm not sure quite where it would go.
Perhaps something in the README file in the access/transam
directory?

* Does it do what it says, correctly?

Yes.

* Does it produce compiler warnings?

No.

* Can you make it crash?

No.

===================
Architecture review
===================

* Is everything done in a way that fits together coherently with
other features/modules?

I think so.

* Are there interdependencies that can cause problems?

Not that I could identify with committed code.

I was concerned about its interaction with the other serializable
patch (by myself and Dan Ports), so I also combined the patches and
tested. Florian's pgbench test did expose bugs in the *other*
patch, which I then fixed in the combined setting. There was still
some breakage in the other patch when Florian's patch was backed
out, so at the moment, this patch would appear to be a
*prerequisite* for the other. (That can hopefully be corrected
by changes to the other patch.)

Also, I'm attempting to adapt the dcheck tests for the other patch
to work with this patch. If successful, I'll post the results
of that additional testing.

#27Joe Conway
mail@joeconway.com
In reply to: Kevin Grittner (#26)
Re: Review: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On 07/17/2010 09:25 AM, Kevin Grittner wrote:

I was concerned about its interaction with the other serializable
patch (by myself and Dan Ports), so I also combined the patches and
tested. Florian's pgbench test did expose bugs in the *other*
patch, which I then fixed in the combined setting. There was still
some breakage in the other patch when Florian's patch was backed
out, so at the moment, this patch would appear to be a
*prerequisite* for the other. (That can hopefully be corrected
by changes to the other patch.)

I am currently reading the Cahill paper [1] in preparation for
reviewing your patch [2]. Should I be installing Florian's patch in
addition to yours when I start testing? Also, where can I get the
latest and greatest version of your patch?

Thanks,

Joe

[1]: http://ses.library.usyd.edu.au/bitstream/2123/5353/1/michael-cahill-2009-thesis.pdf

[2]: https://commitfest.postgresql.org/action/patch_view?id=310

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & Support

#28Florian Pflug
fgp@phlo.org
In reply to: Kevin Grittner (#26)
Re: Review: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On Jul17, 2010, at 18:25 , Kevin Grittner wrote:

* Does it include reasonable tests, necessary doc patches, etc?

Documentation changes are needed in the "Concurrency Control"
chapter.

<...>

* Do we want that?

Yes. We seem to have reached consensus on the -hackers list to that
effect.

I've kinda waited for the latter before putting work into the former, so I guess I should get going.

* Does the patch slow down simple tests?

The part that I'm slightly nervous about regarding performance regressions is the call of MultiXactIdSetOldestVisible() I had to add to GetTransactionSnapshot() (though only for serializable transactions). That could increase the pressure on the multi xact machinery for some workloads, since it will increase the time a multi xact stays alive. I don't see a way around that, though, since the patch depends on being able to find multi xact members which are invisible to a serializable transaction's snapshot.
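
To make the dependency concrete, here is a sketch of the case that
needs the early call (the table t and its columns are, again, assumed
purely for illustration):

C1: BEGIN
C1: SELECT * FROM t WHERE id = 1 FOR SHARE
C2: BEGIN
C2: SELECT * FROM t WHERE id = 1 FOR SHARE -- xmax becomes a multi xact {C1, C2}
C3: BEGIN
C3: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
C3: SELECT * FROM t -- snapshot taken; C1 and C2 are invisible
C1: COMMIT
C2: COMMIT
C3: UPDATE t SET val = val + 1 WHERE id = 1

At the UPDATE, HeapSatisfiesLockersVisible() has to call
GetMultiXactIdMembers() to find C1's and C2's xids. Calling
MultiXactIdSetOldestVisible() before C3's snapshot is taken
guarantees that the multi xact is still inspectable at that point.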

* Does it follow the project coding guidelines?

Comments are not all in standard style.

Does that refer to the language used, or to the formatting?

In some cases there are unnecessary braces around a single statement
for an *if* or *else*.

Will fix.

There are function where the last two parameters were changed from:

Snapshot crosscheck, bool wait

to:

bool wait, Snapshot lockcheck_snapshot

It appears to be so that the semantic change to the use of the
snapshot doesn't break code at runtime without forcing the
programmer to notice the change based on compile errors, which seems
like a Good Thing.

Yeah, that was the idea.

Btw, while the patch obsoletes the crosscheck snapshot, it currently doesn't remove the traces of it throughout the executor and the RI triggers, mainly because I felt doing so would make forward-porting and reviewing harder without any gain. But ultimately, those traces should probably all go, unless someone feels that for some of it #ifdef NOT_USED is preferable.

* Are the comments sufficient and accurate?

Given that there is a significant behavioral change, it seems as
though there could be a sentence or two somewhere concisely stating
how things behave, but I'm not sure quite where it would go.
Perhaps something in the README file in the access/transam
directory?

Hm, I'll try to come up with something.

* Are there interdependencies that can cause problems?

Also, I'm attempting to adapt the dcheck tests for the other patch
to work with this patch. If successful, I'll post the results
of that additional testing.

Cool! And thanks for the work you put into reviewing this! I'll update the documentation and comments ASAP. Probably not within the next few days though, since I'm unfortunately quite busy at the moment, trying to finally complete my master's thesis.

best regards,
Florian Pflug

#29Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Florian Pflug (#28)
Re: Review: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

Florian Pflug <fgp@phlo.org> wrote:

On Jul17, 2010, at 18:25 , Kevin Grittner wrote:

* Does it follow the project coding guidelines?

Comments are not all in standard style.

Does that refer to the language used, or to the formatting?

Formatting. Comment style seems to be defined here:

http://developer.postgresql.org/pgdocs/postgres/source-format.html

as being:

/*
* comment text begins here
* and continues here
*/

You have these formats in your patch:

/* comment text begins here
* and continues here
*/

/* comment text begins here
and continues here */

/* One line comment like this. */

That last one is actually pretty common in PostgreSQL source, so I'm
not sure that its omission from the style page isn't accidental.

Btw, while the patch obsoletes the crosscheck snapshot, it
currently doesn't remove the traces of it throughout the executor
and the RI triggers, mainly because I felt doing so would make
forward-porting and reviewing harder without any gain. But
ultimately, those traces should probably all go, unless someone
feels that for some of it #ifdef NOT_USED is preferable.

My view is that we have a revision control system which makes the
code easy to restore, should it be found to be useful again. If it
has no use at the point of applying this patch, my inclination would
be to delete it. If you're particularly concerned that it could
later be useful, you might want to do that deletion in as a separate
patch to facilitate later resurrection of the code.

-Kevin

#30Andrew Dunstan
andrew@dunslane.net
In reply to: Kevin Grittner (#29)
Re: Review: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

Kevin Grittner wrote:

Comment style seems to be defined here:

http://developer.postgresql.org/pgdocs/postgres/source-format.html

as being:

/*
* comment text begins here
* and continues here
*/

You have these formats in your patch:

/* comment text begins here
* and continues here
*/

/* comment text begins here
and continues here */

/* One line comment like this. */

That last one is actually pretty common in PostgreSQL source, so I'm
not sure that its omission from the style page isn't accidental.

The style doc talks about a standard for multi-line comments - it
doesn't forbid single line comments.

cheers

andrew

#31Florian Pflug
fgp@phlo.org
In reply to: Kevin Grittner (#29)
1 attachment(s)
Re: Review: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

Hi

I've updated mvcc.sgml to explain the new serialization conflict rules for row-level locks, and added a paragraph to backend/executor/README that explains the implementation of those. I've chosen backend/executor/README because it already contains a description of UPDATE handling in READ COMMITTED mode, which seemed at least somewhat related.

The patch now removes all code dealing with the crosscheck_snapshot, since the RI triggers should now be safe without that.

I've also fixed the formatting of multi-line comments to match the coding style. I kept the braces around single-line "if" or "else" statements wherever both "if" and "else" blocks were present and one of them wasn't a single-line block.

I think, in addition to the documentation changes this patch contains, that a section on how to write RI triggers in pl/pgsql would be nice to have. It's not strictly part of the documentation of this feature though, more a potential use-case, so I didn't tackle it for the moment. I do hope to find the time to write such a section, though, and if that happens I'll post it as a separate documentation-only patch.
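
To sketch the pattern such a section would describe (the parent/child
tables and column names below are made up for the example, and this is
only a sketch, not the final documentation text): the child-side
trigger takes a FOR SHARE lock on the referenced parent row, and with
this patch a concurrent serializable DELETE of that row then fails
with a serialization error instead of missing the new child row.

CREATE FUNCTION child_check_parent() RETURNS trigger AS $$
BEGIN
    -- Lock the parent row. Under SERIALIZABLE, a concurrent DELETE of
    -- this row now reports "could not serialize access due to
    -- concurrent update" instead of silently missing our child row.
    PERFORM 1 FROM parent WHERE id = NEW.parent_id FOR SHARE;
    IF NOT FOUND THEN
        RAISE EXCEPTION 'parent row % does not exist', NEW.parent_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER child_check_parent BEFORE INSERT OR UPDATE ON child
    FOR EACH ROW EXECUTE PROCEDURE child_check_parent();

The parent side would symmetrically check for remaining child rows in
a BEFORE DELETE trigger; with the patch, that check can no longer miss
children inserted by transactions that are invisible to a serializable
deleter.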

I've pushed the changes to the branch serializable_row_locks on git://github.com/fgp/postgres.git and also attached a patch.

best regards,
Florian Pflug

Attachments:

serializable_row_locks.patch (application/octet-stream)
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 14614c5..3ee1e38 100644
*** a/doc/src/sgml/mvcc.sgml
--- b/doc/src/sgml/mvcc.sgml
*************** COMMIT;
*** 386,409 ****
      behave the same as <command>SELECT</command>
      in terms of searching for target rows: they will only find target rows
      that were committed as of the transaction start time.  However, such a
!     target
!     row might have already been updated (or deleted or locked) by
!     another concurrent transaction by the time it is found.  In this case, the
!     serializable transaction will wait for the first updating transaction to commit or
!     roll back (if it is still in progress).  If the first updater rolls back,
!     then its effects are negated and the serializable transaction can proceed
!     with updating the originally found row.  But if the first updater commits
!     (and actually updated or deleted the row, not just locked it)
!     then the serializable transaction will be rolled back with the message
! 
! <screen>
! ERROR:  could not serialize access due to concurrent update
! </screen>
  
!     because a serializable transaction cannot modify or lock rows changed by
!     other transactions after the serializable transaction began.
     </para>
! 
     <para>
      When an application receives this error message, it should abort
      the current transaction and retry the whole transaction from
--- 386,409 ----
      behave the same as <command>SELECT</command>
      in terms of searching for target rows: they will only find target rows
      that were committed as of the transaction start time.  However, such a
!     target row might have already been subject to a concurrent
!     <command>UPDATE</command>, <command>DELETE</command>, <command>SELECT
!     FOR UPDATE</command>, or <command>SELECT FOR SHARE</command>. In this case,
!     the serializable transaction will wait for the other transaction to commit
!     or roll back (if it is still in progress). If it rolls back then its effects
!     are negated and the serializable transaction can proceed with modifying
!     or locking the originally found row. If it commits, and the two commands
!     conflict according to <xref linkend="mvcc-serialization-conflicts">,
!     the serializable transaction is rolled back with the message
  
! <screen>
! ERROR:  could not serialize access due to concurrent update
! </screen>
! 
!     since a serializable transaction cannot simply proceed with the newer row
!     version like read committed ones do.
     </para>
! 	
     <para>
      When an application receives this error message, it should abort
      the current transaction and retry the whole transaction from
*************** ERROR:  could not serialize access due t
*** 418,423 ****
--- 418,463 ----
      transactions will never have serialization conflicts.
     </para>
  
+    <table tocentry="1" id="mvcc-serialization-conflicts">
+     <title>Serialization Conflicts</title>
+     <tgroup cols="4">
+      <colspec colnum="2" colname="lockst">
+      <colspec colnum="4" colname="lockend">
+      <spanspec namest="lockst" nameend="lockend" spanname="lockreq">
+      <thead>
+       <row>
+        <entry morerows="1">Serializable Transaction</entry>
+        <entry spanname="lockreq">Concurrent Transaction</entry>
+       </row>
+       <row>
+        <entry><command>UPDATE</command>, <command>DELETE</command></entry>
+        <entry><command>SELECT FOR UPDATE</command></entry>
+        <entry><command>SELECT FOR SHARE</command></entry>
+       </row>
+      </thead>
+      <tbody>
+       <row>
+        <entry><command>UPDATE</command>, <command>DELETE</command></entry>
+        <entry align="center">X</entry>
+        <entry align="center">X</entry>
+        <entry align="center">X</entry>
+       </row>
+       <row>
+        <entry><command>SELECT FOR UPDATE</command></entry>
+        <entry align="center">X</entry>
+        <entry align="center">X</entry>
+        <entry align="center">X</entry>
+       </row>
+       <row>
+        <entry><command>SELECT FOR SHARE</command></entry>
+        <entry align="center">X</entry>
+        <entry align="center"></entry>
+        <entry align="center"></entry>
+       </row>
+      </tbody>
+     </tgroup>
+    </table>
+ 
     <para>
      The Serializable mode provides a rigorous guarantee that each
      transaction sees a wholly consistent view of the database.  However,
*************** SELECT SUM(value) FROM mytab WHERE class
*** 921,926 ****
--- 961,974 ----
      </para>
  
      <para>
+       Serializable transactions are affected by concurrent
+       <command>SELECT FOR SHARE</command> and <command>SELECT FOR UPDATE</command>
+       for longer than those locks are actually held, and may be aborted
+       when trying to obtain a conflicting lock. For details,
+       see <xref linkend="xact-serializable">.
+     </para>
+ 
+     <para>
       <productname>PostgreSQL</productname> doesn't remember any
       information about modified rows in memory, so there is no limit on
       the number of rows locked at one time.  However, locking a row
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 80d2351..b711b58 100644
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*************** static XLogRecPtr log_heap_update(Relati
*** 83,88 ****
--- 83,89 ----
  				bool all_visible_cleared, bool new_all_visible_cleared);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
+ static bool HeapSatisfiesLockersVisible(HeapTupleHeader tuple, Snapshot snapshot);
  
  
  /* ----------------------------------------------------------------
*************** simple_heap_insert(Relation relation, He
*** 2033,2040 ****
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - delete command ID (used for visibility test, and stored into
   *		cmax if successful)
-  *	crosscheck - if not InvalidSnapshot, also check tuple against this
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
   * actually means we did delete it.  Failure return codes are
--- 2034,2042 ----
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - delete command ID (used for visibility test, and stored into
   *		cmax if successful)
   *	wait - true if should wait for any conflicting update to commit/abort
+  *	lockcheck_snapshot - if not NULL, report the tuple as updated if it
+  *		was locked by a transaction not visible under this snapshot
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
   * actually means we did delete it.  Failure return codes are
*************** simple_heap_insert(Relation relation, He
*** 2049,2055 ****
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
--- 2051,2057 ----
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
*************** l1:
*** 2170,2181 ****
  		else
  			result = HeapTupleUpdated;
  	}
! 
! 	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
  	{
! 		/* Perform additional check for serializable RI updates */
! 		if (!HeapTupleSatisfiesVisibility(&tp, crosscheck, buffer))
! 			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
--- 2172,2183 ----
  		else
  			result = HeapTupleUpdated;
  	}
! 	
! 	/* Verify visibility of locking transactions. */
! 	if ((result == HeapTupleMayBeUpdated) &&
! 		!HeapSatisfiesLockersVisible(tp.t_data, lockcheck_snapshot))
  	{
! 		result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
*************** l1:
*** 2183,2189 ****
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = tp.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
  		UnlockReleaseBuffer(buffer);
--- 2185,2192 ----
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (tp.t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = tp.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
  		UnlockReleaseBuffer(buffer);
*************** simple_heap_delete(Relation relation, It
*** 2313,2320 ****
  
  	result = heap_delete(relation, tid,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true), InvalidSnapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 2316,2324 ----
  
  	result = heap_delete(relation, tid,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true),
! 						 true /* wait for commit */ ,
! 						 InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** simple_heap_delete(Relation relation, It
*** 2349,2355 ****
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - update command ID (used for visibility test, and stored into
   *		cmax/cmin if successful)
!  *	crosscheck - if not InvalidSnapshot, also check old tuple against this
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
--- 2353,2360 ----
   *	update_xmax - output parameter, used only for failure case (see below)
   *	cid - update command ID (used for visibility test, and stored into
   *		cmax/cmin if successful)
!  *	lockcheck_snapshot - if not NULL, report the tuple as updated if it
!  *		was locked by a transaction not visible under this snapshot
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
*************** simple_heap_delete(Relation relation, It
*** 2371,2377 ****
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
--- 2376,2382 ----
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
*************** l2:
*** 2523,2541 ****
  			result = HeapTupleUpdated;
  	}
  
! 	if (crosscheck != InvalidSnapshot && result == HeapTupleMayBeUpdated)
  	{
! 		/* Perform additional check for serializable RI updates */
! 		if (!HeapTupleSatisfiesVisibility(&oldtup, crosscheck, buffer))
! 			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = oldtup.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
  		UnlockReleaseBuffer(buffer);
--- 2528,2548 ----
  			result = HeapTupleUpdated;
  	}
  
! 	/* Verify visibility of locking transactions. */
! 	if ((result == HeapTupleMayBeUpdated) &&
! 		!HeapSatisfiesLockersVisible(oldtup.t_data, lockcheck_snapshot))
  	{
! 		result = HeapTupleUpdated;
  	}
  
+ 
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated ||
  			   result == HeapTupleUpdated ||
  			   result == HeapTupleBeingUpdated);
! 		Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (oldtup.t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = oldtup.t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
  		UnlockReleaseBuffer(buffer);
*************** HeapSatisfiesHOTUpdate(Relation relation
*** 2961,2966 ****
--- 2968,3058 ----
  	return true;
  }
  
+ 
+ /*
+  * Returns false if one of the tuple's lockers committed but isn't
+  * visible according to lockcheck_snapshot, excluding subtransactions
+  * of the current transaction.
+  * Assumes that all locking transactions have either committed or
+  * aborted, i.e. aren't still in progress.
+  */
+ static bool
+ HeapSatisfiesLockersVisible(HeapTupleHeader tuple, Snapshot lockcheck_snapshot)
+ {
+ 	if (lockcheck_snapshot == InvalidSnapshot)
+ 		return true;
+ 	
+ 	if (tuple->t_infomask & HEAP_IS_LOCKED)
+ 	{
+ 		/*
+ 		 * If the tuple was locked, we now check whether the locking
+ 		 * transaction(s) are visible under lockcheck_snapshot. If
+ 		 * they aren't, we pretend that the tuple was updated.
+ 		 */
+ 
+ 		if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ 		{
+ 			TransactionId* xids;
+ 			int xids_l = GetMultiXactIdMembers(HeapTupleHeaderGetXmax(tuple), &xids);
+ 
+ 			if (xids_l < 1) {
+ 				/*
+ 				 * The multi xact either is too old to be inspected or doesn't contain members.
+ 				 * The second case is probably impossible, but even if not it doesn't pose
+ 				 * any problem.
+ 				 * In the first case, we have to trust that all xids that were contained in
+ 				 * the xact are in fact visible under lockcheck_snapshot. Currently this
+ 				 * is always the case, since lockcheck_snapshot is always the transaction's
+ 				 * serializable snapshot, and we call MultiXactIdSetOldestVisible() before
+ 				 * acquiring that snapshot.
+ 				 */
+ 				return true;
+ 			}
+ 			else
+ 			{
+ 				int i;
+ 				for (i = 0; i < xids_l; ++i)
+ 				{
+ 					/* Ignore our own subtransactions */
+ 					if (TransactionIdIsCurrentTransactionId(xids[i]))
+ 						continue;
+ 					
+ 					/* We expect to be called after the locking transactions' fates have been decided */
+ 					Assert(!TransactionIdIsInProgress(xids[i]));
+ 					
+ 					if (!TransactionIdDidAbort(xids[i]) &&
+ 					    XidInMVCCSnapshot(xids[i], lockcheck_snapshot))
+ 					{
+ 						/* Non-aborted, invisible locker */
+ 						return false;
+ 					}
+ 				}
+ 				return true;
+ 			}
+ 		}
+ 		else
+ 		{
+ 			TransactionId xid = HeapTupleHeaderGetXmax(tuple);
+ 			
+ 			/* Ignore our own subtransactions */
+ 			if (TransactionIdIsCurrentTransactionId(xid))
+ 				return true;
+ 			
+ 			/* We expect to be called after the locking transaction's fate has been decided */
+ 			Assert(!TransactionIdIsInProgress(xid));
+ 			
+ 			/* Locker must either be visible or have aborted */
+ 			return TransactionIdDidAbort(xid) ||
+ 				   !XidInMVCCSnapshot(xid, lockcheck_snapshot);
+ 		}
+ 	}
+ 	else
+ 	{
+ 		/* Tuple wasn't locked */
+ 		return true;
+ 	}
+ }
+ 
  /*
   *	simple_heap_update - replace a tuple
   *
*************** simple_heap_update(Relation relation, It
*** 2978,2985 ****
  
  	result = heap_update(relation, otid, tup,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true), InvalidSnapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 3070,3077 ----
  
  	result = heap_update(relation, otid, tup,
  						 &update_ctid, &update_xmax,
! 						 GetCurrentCommandId(true),
! 						 true /* wait for commit */, InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** simple_heap_update(Relation relation, It
*** 3013,3018 ****
--- 3105,3113 ----
   *		tuple's cmax if lock is successful)
   *	mode: indicates if shared or exclusive tuple lock is desired
   *	nowait: if true, ereport rather than blocking if lock not available
+  *	lockcheck_snapshot: if not NULL, report the tuple as updated if it
+  *						was locked by a transaction not visible under
+  *						this snapshot
   *
   * Output parameters:
   *	*tuple: all fields filled in
*************** simple_heap_update(Relation relation, It
*** 3066,3072 ****
  HTSU_Result
  heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
  				ItemPointer ctid, TransactionId *update_xmax,
! 				CommandId cid, LockTupleMode mode, bool nowait)
  {
  	HTSU_Result result;
  	ItemPointer tid = &(tuple->t_self);
--- 3161,3168 ----
  HTSU_Result
  heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer,
  				ItemPointer ctid, TransactionId *update_xmax,
! 				CommandId cid, LockTupleMode mode, bool nowait,
! 				Snapshot lockcheck_snapshot)
  {
  	HTSU_Result result;
  	ItemPointer tid = &(tuple->t_self);
*************** l3:
*** 3246,3256 ****
  		else
  			result = HeapTupleUpdated;
  	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated);
! 		Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID));
  		*ctid = tuple->t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tuple->t_data);
  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
--- 3342,3360 ----
  		else
  			result = HeapTupleUpdated;
  	}
+ 	
+ 	/* Verify visibility of locking transactions */
+ 	if ((result == HeapTupleMayBeUpdated) &&
+ 		!HeapSatisfiesLockersVisible(tuple->t_data, lockcheck_snapshot))
+ 	{
+ 		result = HeapTupleUpdated;
+ 	}
  
  	if (result != HeapTupleMayBeUpdated)
  	{
  		Assert(result == HeapTupleSelfUpdated || result == HeapTupleUpdated);
! 		Assert(!(tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
! 			   (tuple->t_data->t_infomask & HEAP_IS_LOCKED));
  		*ctid = tuple->t_data->t_ctid;
  		*update_xmax = HeapTupleHeaderGetXmax(tuple->t_data);
  		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b5bb311..47ff5de 100644
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
*************** static MemoryContext MXactContext = NULL
*** 211,217 ****
  #endif
  
  /* internal MultiXactId management */
- static void MultiXactIdSetOldestVisible(void);
  static MultiXactId CreateMultiXactId(int nxids, TransactionId *xids);
  static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
  				   int nxids, TransactionId *xids);
--- 211,216 ----
*************** MultiXactIdSetOldestMember(void)
*** 531,537 ****
   * there is no live transaction, now or later, that can be a member of any
   * MultiXactId older than the OldestVisibleMXactId we compute here.
   */
! static void
  MultiXactIdSetOldestVisible(void)
  {
  	if (!MultiXactIdIsValid(OldestVisibleMXactId[MyBackendId]))
--- 530,536 ----
   * there is no live transaction, now or later, that can be a member of any
   * MultiXactId older than the OldestVisibleMXactId we compute here.
   */
! void
  MultiXactIdSetOldestVisible(void)
  {
  	if (!MultiXactIdIsValid(OldestVisibleMXactId[MyBackendId]))
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 906d547..5ae3c62 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*************** DoCopy(const CopyStmt *stmt, const char 
*** 1088,1094 ****
  		/* Create a QueryDesc requesting no output */
  		cstate->queryDesc = CreateQueryDesc(plan, queryString,
  											GetActiveSnapshot(),
- 											InvalidSnapshot,
  											dest, NULL, 0);
  
  		/*
--- 1088,1093 ----
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b776ad1..643c777 100644
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** ExplainOnePlan(PlannedStmt *plannedstmt,
*** 369,375 ****
  
  	/* Create a QueryDesc requesting no output */
  	queryDesc = CreateQueryDesc(plannedstmt, queryString,
! 								GetActiveSnapshot(), InvalidSnapshot,
  								None_Receiver, params, instrument_option);
  
  	INSTR_TIME_SET_CURRENT(starttime);
--- 369,375 ----
  
  	/* Create a QueryDesc requesting no output */
  	queryDesc = CreateQueryDesc(plannedstmt, queryString,
! 								GetActiveSnapshot(),
  								None_Receiver, params, instrument_option);
  
  	INSTR_TIME_SET_CURRENT(starttime);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 210f6b8..3816efc 100644
*** a/src/backend/commands/trigger.c
--- b/src/backend/commands/trigger.c
*************** GetTupleForTrigger(EState *estate,
*** 2350,2363 ****
  		Assert(epqstate != NULL);
  
  		/*
! 		 * lock tuple for update
  		 */
  ltrmark:;
  		tuple.t_self = *tid;
  		test = heap_lock_tuple(relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   LockTupleExclusive, false);
  		switch (test)
  		{
  			case HeapTupleSelfUpdated:
--- 2350,2369 ----
  		Assert(epqstate != NULL);
  
  		/*
! 		 * lock tuple for update.
! 		 *
! 		 * Serializable transactions pass their snapshot as the lockcheck_snapshot.
! 		 * This lets heap_lock_tuple report concurrently FOR SHARE or FOR UPDATE
! 		 * locked tuples as HeapTupleUpdated.
  		 */
  ltrmark:;
  		tuple.t_self = *tid;
  		test = heap_lock_tuple(relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   LockTupleExclusive, false,
! 							   IsXactIsoLevelSerializable ? estate->es_snapshot :
! 															InvalidSnapshot);
  		switch (test)
  		{
  			case HeapTupleSelfUpdated:
diff --git a/src/backend/executor/README b/src/backend/executor/README
index dc86822..3798a48 100644
*** a/src/backend/executor/README
--- b/src/backend/executor/README
*************** is no explicit prohibition on SRFs in UP
*** 195,197 ****
--- 195,226 ----
  that only the first result row of an SRF counts, because all subsequent
  rows will result in attempts to re-update an already updated target row.
  This is historical behavior and seems not worth changing.)
+ 
+ Row Locks and Serializable Transactions
+ ---------------------------------------
+ 
+ In READ COMMITTED mode, a transaction that encounters a locked row during
+ an UPDATE, DELETE, SELECT FOR UPDATE or SELECT FOR SHARE simply blocks
+ until the locking transaction commits or rolls back, and in the former case
+ then re-executes the statement using the new row version, as described above.
+ 
+ For SERIALIZABLE transactions this is not satisfactory. The RI triggers,
+ for example, take a FOR SHARE lock on a parent row before allowing a child
+ row to be inserted and verify that deleting a parent row leaves no orphaned
+ children behind before allowing the delete to occur. From within READ COMMITTED
+ transactions, blocking upon a delete of a parent row until all lockers have
+ finished is sufficient to guarantee that this check finds any potential orphan,
+ since the check will be executed with an up-to-date snapshot to which the locking
+ transactions' changes are visible. This, however, is not true for SERIALIZABLE
+ transactions since these will continue to use their old snapshot and hence miss
+ newly inserted rows.
+ 
+ Serializable transactions therefore treat a FOR SHARE or FOR UPDATE lock on a
+ tuple the same as an actual update during UPDATE, DELETE and SELECT FOR UPDATE. They are
+ thus aborted when trying to UPDATE or FOR UPDATE lock a row that was FOR SHARE
+ or FOR UPDATE locked by a concurrent transaction.
+ 
+ This is implemented by the lockcheck_snapshot parameter of heap_update, heap_delete
+ and heap_lock_tuple. If such a snapshot is provided to one of these functions,
+ they return HeapTupleUpdated if the tuple was locked (but not necessarily updated)
+ by any transaction invisible to the snapshot.
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 8cd3ae9..049ebd8 100644
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
*************** standard_ExecutorStart(QueryDesc *queryD
*** 183,189 ****
  	 * Copy other important information into the EState
  	 */
  	estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
- 	estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
  	estate->es_instrument = queryDesc->instrument_options;
  
  	/*
--- 183,188 ----
*************** standard_ExecutorEnd(QueryDesc *queryDes
*** 348,354 ****
  
  	/* do away with our snapshots */
  	UnregisterSnapshot(estate->es_snapshot);
- 	UnregisterSnapshot(estate->es_crosscheck_snapshot);
  
  	/*
  	 * Must switch out of context before destroying it
--- 347,352 ----
*************** EvalPlanQualFetch(EState *estate, Relati
*** 1533,1543 ****
  
  			/*
  			 * This is a live tuple, so now try to lock it.
  			 */
  			test = heap_lock_tuple(relation, &tuple, &buffer,
  								   &update_ctid, &update_xmax,
  								   estate->es_output_cid,
! 								   lockmode, false);
  			/* We now have two pins on the buffer, get rid of one */
  			ReleaseBuffer(buffer);
  
--- 1531,1548 ----
  
  			/*
  			 * This is a live tuple, so now try to lock it.
+ 			 *
+ 			 * Serializable transactions pass their snapshot as the lockcheck_snapshot.
+ 			 * This lets heap_lock_tuple report concurrently FOR SHARE or FOR UPDATE
+ 			 * locked tuples as HeapTupleUpdated.
  			 */
+ 			Assert(!IsXactIsoLevelSerializable || (estate->es_snapshot != InvalidSnapshot));
  			test = heap_lock_tuple(relation, &tuple, &buffer,
  								   &update_ctid, &update_xmax,
  								   estate->es_output_cid,
! 								   lockmode, false,
! 								   IsXactIsoLevelSerializable ? estate->es_snapshot :
! 																InvalidSnapshot);
  			/* We now have two pins on the buffer, get rid of one */
  			ReleaseBuffer(buffer);
  
*************** EvalPlanQualStart(EPQState *epqstate, ES
*** 1906,1912 ****
  	 */
  	estate->es_direction = ForwardScanDirection;
  	estate->es_snapshot = parentestate->es_snapshot;
- 	estate->es_crosscheck_snapshot = parentestate->es_crosscheck_snapshot;
  	estate->es_range_table = parentestate->es_range_table;
  	estate->es_plannedstmt = parentestate->es_plannedstmt;
  	estate->es_junkFilter = parentestate->es_junkFilter;
--- 1911,1916 ----
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 877abf4..a70221a 100644
*** a/src/backend/executor/execUtils.c
--- b/src/backend/executor/execUtils.c
*************** CreateExecutorState(void)
*** 109,115 ****
  	 */
  	estate->es_direction = ForwardScanDirection;
  	estate->es_snapshot = SnapshotNow;
- 	estate->es_crosscheck_snapshot = InvalidSnapshot;	/* no crosscheck */
  	estate->es_range_table = NIL;
  	estate->es_plannedstmt = NULL;
  
--- 109,114 ----
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index bacfe9a..0f77b58 100644
*** a/src/backend/executor/functions.c
--- b/src/backend/executor/functions.c
*************** postquel_start(execution_state *es, SQLF
*** 415,421 ****
  	if (IsA(es->stmt, PlannedStmt))
  		es->qd = CreateQueryDesc((PlannedStmt *) es->stmt,
  								 fcache->src,
! 								 snapshot, InvalidSnapshot,
  								 dest,
  								 fcache->paramLI, 0);
  	else
--- 415,421 ----
  	if (IsA(es->stmt, PlannedStmt))
  		es->qd = CreateQueryDesc((PlannedStmt *) es->stmt,
  								 fcache->src,
! 								 snapshot,
  								 dest,
  								 fcache->paramLI, 0);
  	else
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 562ee6e..d809a62 100644
*** a/src/backend/executor/nodeLockRows.c
--- b/src/backend/executor/nodeLockRows.c
*************** lnext:
*** 71,76 ****
--- 71,77 ----
  		ItemPointerData update_ctid;
  		TransactionId update_xmax;
  		LockTupleMode lockmode;
+ 		Snapshot lockcheck_snapshot = InvalidSnapshot;
  		HTSU_Result test;
  		HeapTuple	copyTuple;
  
*************** lnext:
*** 110,123 ****
  
  		/* okay, try to lock the tuple */
  		if (erm->markType == ROW_MARK_EXCLUSIVE)
  			lockmode = LockTupleExclusive;
  		else
  			lockmode = LockTupleShared;
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   lockmode, erm->noWait);
  		ReleaseBuffer(buffer);
  		switch (test)
  		{
--- 111,137 ----
  
  		/* okay, try to lock the tuple */
  		if (erm->markType == ROW_MARK_EXCLUSIVE)
+ 		{
  			lockmode = LockTupleExclusive;
+ 
+ 			/*
+ 			 * Serializable transactions pass their snapshot as the lockcheck_snapshot.
+ 			 * This lets heap_lock_tuple report concurrently FOR SHARE or FOR UPDATE
+ 			 * locked tuples as HeapTupleUpdated.
+ 			 */
+ 			if (IsXactIsoLevelSerializable)
+ 				lockcheck_snapshot = estate->es_snapshot;
+ 		}
  		else
+ 		{
  			lockmode = LockTupleShared;
+ 		}
  
  		test = heap_lock_tuple(erm->relation, &tuple, &buffer,
  							   &update_ctid, &update_xmax,
  							   estate->es_output_cid,
! 							   lockmode, erm->noWait,
! 							   lockcheck_snapshot);
  		ReleaseBuffer(buffer);
  		switch (test)
  		{
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 8619ce3..2dad85d 100644
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
*************** ExecDelete(ItemPointer tupleid,
*** 307,323 ****
  	/*
  	 * delete the tuple
  	 *
! 	 * Note: if es_crosscheck_snapshot isn't InvalidSnapshot, we check that
! 	 * the row to be deleted is visible to that snapshot, and throw a can't-
! 	 * serialize error if not.	This is a special-case behavior needed for
! 	 * referential integrity updates in serializable transactions.
  	 */
  ldelete:;
  	result = heap_delete(resultRelationDesc, tupleid,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 estate->es_crosscheck_snapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 307,323 ----
  	/*
  	 * delete the tuple
  	 *
! 	 * Serializable transactions pass their snapshot as the lockcheck_snapshot.
! 	 * This lets heap_delete report concurrently FOR SHARE or FOR UPDATE
! 	 * locked tuples as HeapTupleUpdated.
  	 */
  ldelete:;
  	result = heap_delete(resultRelationDesc, tupleid,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 true, /* wait for commit */
! 						 IsXactIsoLevelSerializable ? estate->es_snapshot :
! 													  InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
*************** lreplace:;
*** 496,511 ****
  	/*
  	 * replace the heap tuple
  	 *
! 	 * Note: if es_crosscheck_snapshot isn't InvalidSnapshot, we check that
! 	 * the row to be updated is visible to that snapshot, and throw a can't-
! 	 * serialize error if not.	This is a special-case behavior needed for
! 	 * referential integrity updates in serializable transactions.
  	 */
  	result = heap_update(resultRelationDesc, tupleid, tuple,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 estate->es_crosscheck_snapshot,
! 						 true /* wait for commit */ );
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
--- 496,511 ----
  	/*
  	 * replace the heap tuple
  	 *
! 	 * Serializable transactions pass their snapshot as the lockcheck_snapshot.
! 	 * This lets heap_update report concurrently FOR SHARE or FOR UPDATE
! 	 * locked tuples as HeapTupleUpdated.
  	 */
  	result = heap_update(resultRelationDesc, tupleid, tuple,
  						 &update_ctid, &update_xmax,
  						 estate->es_output_cid,
! 						 true, /* wait for commit */ 
! 						 IsXactIsoLevelSerializable ? estate->es_snapshot :
! 													  InvalidSnapshot);
  	switch (result)
  	{
  		case HeapTupleSelfUpdated:
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 7f0b5e4..c53a3e1 100644
*** a/src/backend/executor/spi.c
--- b/src/backend/executor/spi.c
*************** static void _SPI_prepare_plan(const char
*** 51,57 ****
  				  ParamListInfo boundParams);
  
  static int _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
! 				  Snapshot snapshot, Snapshot crosscheck_snapshot,
  				  bool read_only, bool fire_triggers, long tcount);
  
  static ParamListInfo _SPI_convert_params(int nargs, Oid *argtypes,
--- 51,57 ----
  				  ParamListInfo boundParams);
  
  static int _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
! 				  Snapshot snapshot,
  				  bool read_only, bool fire_triggers, long tcount);
  
  static ParamListInfo _SPI_convert_params(int nargs, Oid *argtypes,
*************** SPI_execute(const char *src, bool read_o
*** 357,363 ****
  	_SPI_prepare_plan(src, &plan, NULL);
  
  	res = _SPI_execute_plan(&plan, NULL,
! 							InvalidSnapshot, InvalidSnapshot,
  							read_only, true, tcount);
  
  	_SPI_end_call(true);
--- 357,363 ----
  	_SPI_prepare_plan(src, &plan, NULL);
  
  	res = _SPI_execute_plan(&plan, NULL,
! 							InvalidSnapshot,
  							read_only, true, tcount);
  
  	_SPI_end_call(true);
*************** SPI_execute_plan(SPIPlanPtr plan, Datum 
*** 392,398 ****
  							_SPI_convert_params(plan->nargs, plan->argtypes,
  												Values, Nulls,
  												0),
! 							InvalidSnapshot, InvalidSnapshot,
  							read_only, true, tcount);
  
  	_SPI_end_call(true);
--- 392,398 ----
  							_SPI_convert_params(plan->nargs, plan->argtypes,
  												Values, Nulls,
  												0),
! 							InvalidSnapshot,
  							read_only, true, tcount);
  
  	_SPI_end_call(true);
*************** SPI_execute_plan_with_paramlist(SPIPlanP
*** 421,427 ****
  		return res;
  
  	res = _SPI_execute_plan(plan, params,
! 							InvalidSnapshot, InvalidSnapshot,
  							read_only, true, tcount);
  
  	_SPI_end_call(true);
--- 421,427 ----
  		return res;
  
  	res = _SPI_execute_plan(plan, params,
! 							InvalidSnapshot,
  							read_only, true, tcount);
  
  	_SPI_end_call(true);
*************** SPI_execute_plan_with_paramlist(SPIPlanP
*** 444,450 ****
  int
  SPI_execute_snapshot(SPIPlanPtr plan,
  					 Datum *Values, const char *Nulls,
! 					 Snapshot snapshot, Snapshot crosscheck_snapshot,
  					 bool read_only, bool fire_triggers, long tcount)
  {
  	int			res;
--- 444,450 ----
  int
  SPI_execute_snapshot(SPIPlanPtr plan,
  					 Datum *Values, const char *Nulls,
! 					 Snapshot snapshot,
  					 bool read_only, bool fire_triggers, long tcount)
  {
  	int			res;
*************** SPI_execute_snapshot(SPIPlanPtr plan,
*** 463,469 ****
  							_SPI_convert_params(plan->nargs, plan->argtypes,
  												Values, Nulls,
  												0),
! 							snapshot, crosscheck_snapshot,
  							read_only, fire_triggers, tcount);
  
  	_SPI_end_call(true);
--- 463,469 ----
  							_SPI_convert_params(plan->nargs, plan->argtypes,
  												Values, Nulls,
  												0),
! 							snapshot,
  							read_only, fire_triggers, tcount);
  
  	_SPI_end_call(true);
*************** SPI_execute_with_args(const char *src,
*** 516,522 ****
  	/* We don't need to copy the plan since it will be thrown away anyway */
  
  	res = _SPI_execute_plan(&plan, paramLI,
! 							InvalidSnapshot, InvalidSnapshot,
  							read_only, true, tcount);
  
  	_SPI_end_call(true);
--- 516,522 ----
  	/* We don't need to copy the plan since it will be thrown away anyway */
  
  	res = _SPI_execute_plan(&plan, paramLI,
! 							InvalidSnapshot,
  							read_only, true, tcount);
  
  	_SPI_end_call(true);
*************** _SPI_prepare_plan(const char *src, SPIPl
*** 1752,1758 ****
   *
   * snapshot: query snapshot to use, or InvalidSnapshot for the normal
   *		behavior of taking a new snapshot for each query.
-  * crosscheck_snapshot: for RI use, all others pass InvalidSnapshot
   * read_only: TRUE for read-only execution (no CommandCounterIncrement)
   * fire_triggers: TRUE to fire AFTER triggers at end of query (normal case);
   *		FALSE means any AFTER triggers are postponed to end of outer query
--- 1752,1757 ----
*************** _SPI_prepare_plan(const char *src, SPIPl
*** 1760,1766 ****
   */
  static int
  _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
! 				  Snapshot snapshot, Snapshot crosscheck_snapshot,
  				  bool read_only, bool fire_triggers, long tcount)
  {
  	int			my_res = 0;
--- 1759,1765 ----
   */
  static int
  _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
! 				  Snapshot snapshot,
  				  bool read_only, bool fire_triggers, long tcount)
  {
  	int			my_res = 0;
*************** _SPI_execute_plan(SPIPlanPtr plan, Param
*** 1903,1909 ****
  
  				qdesc = CreateQueryDesc((PlannedStmt *) stmt,
  										plansource->query_string,
! 										snap, crosscheck_snapshot,
  										dest,
  										paramLI, 0);
  				res = _SPI_pquery(qdesc, fire_triggers,
--- 1902,1908 ----
  
  				qdesc = CreateQueryDesc((PlannedStmt *) stmt,
  										plansource->query_string,
! 										snap,
  										dest,
  										paramLI, 0);
  				res = _SPI_pquery(qdesc, fire_triggers,
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 8ad4915..6c05510 100644
*** a/src/backend/tcop/pquery.c
--- b/src/backend/tcop/pquery.c
*************** QueryDesc *
*** 64,70 ****
  CreateQueryDesc(PlannedStmt *plannedstmt,
  				const char *sourceText,
  				Snapshot snapshot,
- 				Snapshot crosscheck_snapshot,
  				DestReceiver *dest,
  				ParamListInfo params,
  				int instrument_options)
--- 64,69 ----
*************** CreateQueryDesc(PlannedStmt *plannedstmt
*** 76,83 ****
  	qd->utilitystmt = plannedstmt->utilityStmt; /* in case DECLARE CURSOR */
  	qd->sourceText = sourceText;	/* query text */
  	qd->snapshot = RegisterSnapshot(snapshot);	/* snapshot */
- 	/* RI check snapshot */
- 	qd->crosscheck_snapshot = RegisterSnapshot(crosscheck_snapshot);
  	qd->dest = dest;			/* output dest */
  	qd->params = params;		/* parameter values passed into query */
  	qd->instrument_options = instrument_options;		/* instrumentation
--- 75,80 ----
*************** CreateUtilityQueryDesc(Node *utilitystmt
*** 109,115 ****
  	qd->utilitystmt = utilitystmt;		/* utility command */
  	qd->sourceText = sourceText;	/* query text */
  	qd->snapshot = RegisterSnapshot(snapshot);	/* snapshot */
- 	qd->crosscheck_snapshot = InvalidSnapshot;	/* RI check snapshot */
  	qd->dest = dest;			/* output dest */
  	qd->params = params;		/* parameter values passed into query */
  	qd->instrument_options = false;		/* uninteresting for utilities */
--- 106,111 ----
*************** FreeQueryDesc(QueryDesc *qdesc)
*** 134,140 ****
  
  	/* forget our snapshots */
  	UnregisterSnapshot(qdesc->snapshot);
- 	UnregisterSnapshot(qdesc->crosscheck_snapshot);
  
  	/* Only the QueryDesc itself need be freed */
  	pfree(qdesc);
--- 130,135 ----
*************** ProcessQuery(PlannedStmt *plan,
*** 178,184 ****
  	 * Create the QueryDesc object
  	 */
  	queryDesc = CreateQueryDesc(plan, sourceText,
! 								GetActiveSnapshot(), InvalidSnapshot,
  								dest, params, 0);
  
  	/*
--- 173,179 ----
  	 * Create the QueryDesc object
  	 */
  	queryDesc = CreateQueryDesc(plan, sourceText,
! 								GetActiveSnapshot(),
  								dest, params, 0);
  
  	/*
*************** PortalStart(Portal portal, ParamListInfo
*** 514,520 ****
  				queryDesc = CreateQueryDesc((PlannedStmt *) linitial(portal->stmts),
  											portal->sourceText,
  											GetActiveSnapshot(),
- 											InvalidSnapshot,
  											None_Receiver,
  											params,
  											0);
--- 509,514 ----
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index aa06302..02946fa 100644
*** a/src/backend/utils/adt/ri_triggers.c
--- b/src/backend/utils/adt/ri_triggers.c
*************** static SPIPlanPtr ri_PlanCheck(const cha
*** 230,236 ****
  static bool ri_PerformCheck(RI_QueryKey *qkey, SPIPlanPtr qplan,
  				Relation fk_rel, Relation pk_rel,
  				HeapTuple old_tuple, HeapTuple new_tuple,
- 				bool detectNewRows,
  				int expect_OK, const char *constrname);
  static void ri_ExtractValues(RI_QueryKey *qkey, int key_idx,
  				 Relation rel, HeapTuple tuple,
--- 230,235 ----
*************** RI_FKey_check(PG_FUNCTION_ARGS)
*** 357,363 ****
  		ri_PerformCheck(&qkey, qplan,
  						fk_rel, pk_rel,
  						NULL, NULL,
- 						false,
  						SPI_OK_SELECT,
  						NameStr(riinfo.conname));
  
--- 356,361 ----
*************** RI_FKey_check(PG_FUNCTION_ARGS)
*** 500,506 ****
  	ri_PerformCheck(&qkey, qplan,
  					fk_rel, pk_rel,
  					NULL, new_row,
- 					false,
  					SPI_OK_SELECT,
  					NameStr(riinfo.conname));
  
--- 498,503 ----
*************** ri_Check_Pk_Match(Relation pk_rel, Relat
*** 661,667 ****
  	result = ri_PerformCheck(&qkey, qplan,
  							 fk_rel, pk_rel,
  							 old_row, NULL,
- 							 true,		/* treat like update */
  							 SPI_OK_SELECT, NULL);
  
  	if (SPI_finish() != SPI_OK_FINISH)
--- 658,663 ----
*************** RI_FKey_noaction_del(PG_FUNCTION_ARGS)
*** 818,824 ****
  			ri_PerformCheck(&qkey, qplan,
  							fk_rel, pk_rel,
  							old_row, NULL,
- 							true,		/* must detect new rows */
  							SPI_OK_SELECT,
  							NameStr(riinfo.conname));
  
--- 814,819 ----
*************** RI_FKey_noaction_upd(PG_FUNCTION_ARGS)
*** 1006,1012 ****
  			ri_PerformCheck(&qkey, qplan,
  							fk_rel, pk_rel,
  							old_row, NULL,
- 							true,		/* must detect new rows */
  							SPI_OK_SELECT,
  							NameStr(riinfo.conname));
  
--- 1001,1006 ----
*************** RI_FKey_cascade_del(PG_FUNCTION_ARGS)
*** 1168,1174 ****
  			ri_PerformCheck(&qkey, qplan,
  							fk_rel, pk_rel,
  							old_row, NULL,
- 							true,		/* must detect new rows */
  							SPI_OK_DELETE,
  							NameStr(riinfo.conname));
  
--- 1162,1167 ----
*************** RI_FKey_cascade_upd(PG_FUNCTION_ARGS)
*** 1356,1362 ****
  			ri_PerformCheck(&qkey, qplan,
  							fk_rel, pk_rel,
  							old_row, new_row,
- 							true,		/* must detect new rows */
  							SPI_OK_UPDATE,
  							NameStr(riinfo.conname));
  
--- 1349,1354 ----
*************** RI_FKey_restrict_del(PG_FUNCTION_ARGS)
*** 1527,1533 ****
  			ri_PerformCheck(&qkey, qplan,
  							fk_rel, pk_rel,
  							old_row, NULL,
- 							true,		/* must detect new rows */
  							SPI_OK_SELECT,
  							NameStr(riinfo.conname));
  
--- 1519,1524 ----
*************** RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
*** 1710,1716 ****
  			ri_PerformCheck(&qkey, qplan,
  							fk_rel, pk_rel,
  							old_row, NULL,
- 							true,		/* must detect new rows */
  							SPI_OK_SELECT,
  							NameStr(riinfo.conname));
  
--- 1701,1706 ----
*************** RI_FKey_setnull_del(PG_FUNCTION_ARGS)
*** 1881,1887 ****
  			ri_PerformCheck(&qkey, qplan,
  							fk_rel, pk_rel,
  							old_row, NULL,
- 							true,		/* must detect new rows */
  							SPI_OK_UPDATE,
  							NameStr(riinfo.conname));
  
--- 1871,1876 ----
*************** RI_FKey_setnull_upd(PG_FUNCTION_ARGS)
*** 2097,2103 ****
  			ri_PerformCheck(&qkey, qplan,
  							fk_rel, pk_rel,
  							old_row, NULL,
- 							true,		/* must detect new rows */
  							SPI_OK_UPDATE,
  							NameStr(riinfo.conname));
  
--- 2086,2091 ----
*************** RI_FKey_setdefault_del(PG_FUNCTION_ARGS)
*** 2269,2275 ****
  			ri_PerformCheck(&qkey, qplan,
  							fk_rel, pk_rel,
  							old_row, NULL,
- 							true,		/* must detect new rows */
  							SPI_OK_UPDATE,
  							NameStr(riinfo.conname));
  
--- 2257,2262 ----
*************** RI_FKey_setdefault_upd(PG_FUNCTION_ARGS)
*** 2472,2478 ****
  			ri_PerformCheck(&qkey, qplan,
  							fk_rel, pk_rel,
  							old_row, NULL,
- 							true,		/* must detect new rows */
  							SPI_OK_UPDATE,
  							NameStr(riinfo.conname));
  
--- 2459,2464 ----
*************** RI_Initial_Check(Trigger *trigger, Relat
*** 2792,2798 ****
  	spi_result = SPI_execute_snapshot(qplan,
  									  NULL, NULL,
  									  GetLatestSnapshot(),
- 									  InvalidSnapshot,
  									  true, false, 1);
  
  	/* Check result */
--- 2778,2783 ----
*************** static bool
*** 3271,3284 ****
  ri_PerformCheck(RI_QueryKey *qkey, SPIPlanPtr qplan,
  				Relation fk_rel, Relation pk_rel,
  				HeapTuple old_tuple, HeapTuple new_tuple,
- 				bool detectNewRows,
  				int expect_OK, const char *constrname)
  {
  	Relation	query_rel,
  				source_rel;
  	int			key_idx;
  	Snapshot	test_snapshot;
- 	Snapshot	crosscheck_snapshot;
  	int			limit;
  	int			spi_result;
  	Oid			save_userid;
--- 3256,3267 ----
*************** ri_PerformCheck(RI_QueryKey *qkey, SPIPl
*** 3330,3359 ****
  	}
  
  	/*
- 	 * In READ COMMITTED mode, we just need to use an up-to-date regular
- 	 * snapshot, and we will see all rows that could be interesting. But in
- 	 * SERIALIZABLE mode, we can't change the transaction snapshot. If the
- 	 * caller passes detectNewRows == false then it's okay to do the query
- 	 * with the transaction snapshot; otherwise we use a current snapshot, and
- 	 * tell the executor to error out if it finds any rows under the current
- 	 * snapshot that wouldn't be visible per the transaction snapshot.  Note
- 	 * that SPI_execute_snapshot will register the snapshots, so we don't need
- 	 * to bother here.
- 	 */
- 	if (IsXactIsoLevelSerializable && detectNewRows)
- 	{
- 		CommandCounterIncrement();		/* be sure all my own work is visible */
- 		test_snapshot = GetLatestSnapshot();
- 		crosscheck_snapshot = GetTransactionSnapshot();
- 	}
- 	else
- 	{
- 		/* the default SPI behavior is okay */
- 		test_snapshot = InvalidSnapshot;
- 		crosscheck_snapshot = InvalidSnapshot;
- 	}
- 
- 	/*
  	 * If this is a select query (e.g., for a 'no action' or 'restrict'
  	 * trigger), we only need to see if there is a single row in the table,
  	 * matching the key.  Otherwise, limit = 0 - because we want the query to
--- 3313,3318 ----
*************** ri_PerformCheck(RI_QueryKey *qkey, SPIPl
*** 3369,3375 ****
  	/* Finally we can run the query. */
  	spi_result = SPI_execute_snapshot(qplan,
  									  vals, nulls,
! 									  test_snapshot, crosscheck_snapshot,
  									  false, false, limit);
  
  	/* Restore UID and security context */
--- 3328,3334 ----
  	/* Finally we can run the query. */
  	spi_result = SPI_execute_snapshot(qplan,
  									  vals, nulls,
! 									  InvalidSnapshot,
  									  false, false, limit);
  
  	/* Restore UID and security context */
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 59a27de..099e67b 100644
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 27,32 ****
--- 27,33 ----
  
  #include "access/transam.h"
  #include "access/xact.h"
+ #include "access/multixact.h"
  #include "storage/proc.h"
  #include "storage/procarray.h"
  #include "utils/memutils.h"
*************** GetTransactionSnapshot(void)
*** 125,130 ****
--- 126,141 ----
  	if (!FirstSnapshotSet)
  	{
  		Assert(RegisteredSnapshots == 0);
+ 		
+ 		/*
+ 		 * We must store the oldest visible multi xact *before* taking the
+ 		 * serializable snapshot. Otherwise HeapSatisfiesLockersVisible in
+ 		 * heapam.c will be in trouble, since it depends on being able to
+ 		 * inspect all multi xact ids which might contain xids invisible to
+ 		 * the serializable snapshot.
+ 		 */
+ 		if (IsXactIsoLevelSerializable)
+ 			MultiXactIdSetOldestVisible();
  
  		CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
  		FirstSnapshotSet = true;
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 4f3630c..302a571 100644
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
*************** SnapshotData SnapshotSelfData = {HeapTup
*** 72,80 ****
  SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
  SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
  
- /* local functions */
- static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
- 
  
  /*
   * SetHintBits()
--- 72,77 ----
*************** HeapTupleSatisfiesVacuum(HeapTupleHeader
*** 1253,1259 ****
   * by this function.  This is OK for current uses, because we actually only
   * apply this for known-committed XIDs.
   */
! static bool
  XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
  {
  	uint32		i;
--- 1250,1256 ----
   * by this function.  This is OK for current uses, because we actually only
   * apply this for known-committed XIDs.
   */
! bool
  XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
  {
  	uint32		i;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f963146..1d51dfb 100644
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
*************** extern Oid heap_insert(Relation relation
*** 98,112 ****
  			int options, BulkInsertState bistate);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
  			HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  				Buffer *buffer, ItemPointer ctid,
  				TransactionId *update_xmax, CommandId cid,
! 				LockTupleMode mode, bool nowait);
  extern void heap_inplace_update(Relation relation, HeapTuple tuple);
  extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
  				  Buffer buf);
--- 98,113 ----
  			int options, BulkInsertState bistate);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
  			HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, bool wait, Snapshot lockcheck_snapshot);
  extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  				Buffer *buffer, ItemPointer ctid,
  				TransactionId *update_xmax, CommandId cid,
! 				LockTupleMode mode, bool nowait,
! 				Snapshot lockcheck_snapshot);
  extern void heap_inplace_update(Relation relation, HeapTuple tuple);
  extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
  				  Buffer buf);
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 5910465..4769bca 100644
*** a/src/include/access/multixact.h
--- b/src/include/access/multixact.h
*************** extern bool MultiXactIdIsCurrent(MultiXa
*** 49,54 ****
--- 49,55 ----
  extern void MultiXactIdWait(MultiXactId multi);
  extern bool ConditionalMultiXactIdWait(MultiXactId multi);
  extern void MultiXactIdSetOldestMember(void);
+ extern void MultiXactIdSetOldestVisible(void);
  extern int	GetMultiXactIdMembers(MultiXactId multi, TransactionId **xids);
  
  extern void AtEOXact_MultiXact(void);
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index 749b83b..2cccf38 100644
*** a/src/include/executor/execdesc.h
--- b/src/include/executor/execdesc.h
*************** typedef struct QueryDesc
*** 39,45 ****
  	Node	   *utilitystmt;	/* utility statement, or null */
  	const char *sourceText;		/* source text of the query */
  	Snapshot	snapshot;		/* snapshot to use for query */
- 	Snapshot	crosscheck_snapshot;	/* crosscheck for RI update/delete */
  	DestReceiver *dest;			/* the destination for tuple output */
  	ParamListInfo params;		/* param values being passed in */
  	int			instrument_options;		/* OR of InstrumentOption flags */
--- 39,44 ----
*************** typedef struct QueryDesc
*** 57,63 ****
  extern QueryDesc *CreateQueryDesc(PlannedStmt *plannedstmt,
  				const char *sourceText,
  				Snapshot snapshot,
- 				Snapshot crosscheck_snapshot,
  				DestReceiver *dest,
  				ParamListInfo params,
  				int instrument_options);
--- 56,61 ----
diff --git a/src/include/executor/spi.h b/src/include/executor/spi.h
index 2b05564..541de92 100644
*** a/src/include/executor/spi.h
--- b/src/include/executor/spi.h
*************** extern int SPI_execp(SPIPlanPtr plan, Da
*** 82,88 ****
  extern int SPI_execute_snapshot(SPIPlanPtr plan,
  					 Datum *Values, const char *Nulls,
  					 Snapshot snapshot,
- 					 Snapshot crosscheck_snapshot,
  					 bool read_only, bool fire_triggers, long tcount);
  extern int SPI_execute_with_args(const char *src,
  					  int nargs, Oid *argtypes,
--- 82,87 ----
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7442d2d..86721c8 100644
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct EState
*** 337,343 ****
  	/* Basic state for all query types: */
  	ScanDirection es_direction; /* current scan direction */
  	Snapshot	es_snapshot;	/* time qual to use */
- 	Snapshot	es_crosscheck_snapshot; /* crosscheck time qual for RI */
  	List	   *es_range_table; /* List of RangeTblEntry */
  	PlannedStmt *es_plannedstmt;	/* link to top of plan tree */
  
--- 337,342 ----
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index e85e820..fa386e7 100644
*** a/src/include/utils/tqual.h
--- b/src/include/utils/tqual.h
*************** extern PGDLLIMPORT SnapshotData Snapshot
*** 41,46 ****
--- 41,48 ----
  #define IsMVCCSnapshot(snapshot)  \
  	((snapshot)->satisfies == HeapTupleSatisfiesMVCC)
  
+ extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
+ 
  /*
   * HeapTupleSatisfiesVisibility
   *		True iff heap tuple satisfies a time qual.
#32Florian Pflug
fgp@phlo.org
In reply to: Florian Pflug (#1)
Re: Review: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle

On Aug 3, 2010, at 00:43, Florian Pflug wrote:

Sounds good. That'll also give me some time to test the RI trigger
infrastructure now that I've removed the crosscheck code.

Ok, I've found some time to run some additional tests.

I've created a small test suite that compares the behaviour of native (cascading) FK constraints to a PL/PGSQL implementation. There is a dedicated child table (native_child and plpgsql_child, respectively) for each of the two FK implementations. Child rows referencing random parents are inserted into both, creating the parent if it does not already exist. Concurrently, random parent rows are deleted.
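To give an idea of what the PL/PGSQL implementation does, the insert path boils down to SHARE-locking the parent row before a child row is allowed in. The following is only a sketch - the table, column and trigger names are made up and need not match what the test suite actually uses:

CREATE FUNCTION plpgsql_child_ins() RETURNS trigger AS $$
BEGIN
    -- Lock the parent row so that a concurrent delete has to wait
    -- for us, and bail out if no parent row exists at all.
    PERFORM 1 FROM parent WHERE id = NEW.parent_id FOR SHARE;
    IF NOT FOUND THEN
        RAISE EXCEPTION 'parent % does not exist', NEW.parent_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER plpgsql_child_ins BEFORE INSERT ON plpgsql_child
    FOR EACH ROW EXECUTE PROCEDURE plpgsql_child_ins();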

The number of successful inserts into native_child and plpgsql_child, respectively, and the number of successful deletes are counted, as well as the number of transactions that failed due to either a serialization failure or a race condition during the insert (unique_violation or foreign_key_violation).
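Failed attempts are told apart by SQLSTATE. Wrapped into a PL/PGSQL helper, the classification would look roughly like this (a sketch, not the suite's actual code - the real driver presumably retries serialization failures at the transaction level, since the whole transaction has to be restarted):

CREATE FUNCTION classify_child_insert(pid int) RETURNS text AS $$
BEGIN
    INSERT INTO plpgsql_child (parent_id) VALUES (pid);
    RETURN 'ok';
EXCEPTION
    WHEN serialization_failure THEN
        RETURN 'serialization_failure';
    WHEN unique_violation OR foreign_key_violation THEN
        RETURN 'lost_race';
END;
$$ LANGUAGE plpgsql;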

To verify that things behave as expected, the test suite reports a histogram of the number of parents per child count after the test completes. The expected distribution, computed under the assumption that the number of parents with N children follows an exponential distribution, is output alongside that histogram.
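With the sketched table names from above, such a histogram can be computed by something like:

SELECT child_count, count(*) AS parents
FROM (SELECT p.id, count(c.parent_id) AS child_count
      FROM parent p
      LEFT JOIN plpgsql_child c ON c.parent_id = p.id
      GROUP BY p.id) AS per_parent
GROUP BY child_count
ORDER BY child_count;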

I've pushed the test suite to
git@github.com:fgp/fk_concurrency.git
and put a summary of the results of my test runs on
http://wiki.github.com/fgp/fk_concurrency/results.

The results look good. There is no significant change in the behaviour of FKs between HEAD and HEAD+serializable_row_locks, and no significant difference between the native and the PL/PGSQL implementation. As expected, the PL/PGSQL implementation fails on an unpatched HEAD, aborting pretty quickly with the error "database inconsistent".
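To see why the unpatched HEAD fails, consider the delete side of the sketched PL/PGSQL implementation, which cascades to the child table from a trigger roughly like this:

CREATE FUNCTION plpgsql_parent_del() RETURNS trigger AS $$
BEGIN
    -- Cascade: remove all children of the parent row being deleted.
    DELETE FROM plpgsql_child WHERE parent_id = OLD.id;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER plpgsql_parent_del BEFORE DELETE ON parent
    FOR EACH ROW EXECUTE PROCEDURE plpgsql_parent_del();

On an unpatched HEAD, a SERIALIZABLE deleter executes that DELETE under its transaction snapshot and therefore misses child rows committed after the snapshot was taken, even though their inserters FOR SHARE-locked the parent row. The orphans left behind are what the consistency check then reports. With the patch applied, the delete of the parent row raises a serialization error instead.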

It'd be interesting to run the PL/PGSQL implementation with the SHARE locking removed against HEAD+true_serializability, to see whether that changes the number of serialization_failures that occur. I'll do that, but I'll have to wait a bit for another spare-time window to open.

best regards,
Florian Pflug