Serializable Snapshot Isolation
Attached is the latest Serializable Snapshot Isolation (SSI) patch.
With Joe's testing and review, and with stress tests adapted from
those used by Florian for his patch, we were able to identify and
fix several bugs. Stability seems good now. We have many tests for
correct behavior which are all looking good. The only solid
benchmarks we have so far show no impact on isolation levels other
than SERIALIZABLE, and a 1.8% increase in run time for a saturation
run of small, read-only SERIALIZABLE transactions against a fully
cached database. Dan has been working on setting up some benchmarks
using DBT-2, but doesn't yet have results to publish. If we can get
more eyes on the code during this CF, I'm hoping we can get this
patch committed this round.
This patch is basically an implementation of the techniques
described in the 2008 paper by Cahill et al. and further developed
in Cahill's 2009 PhD thesis. Techniques needed to be
adapted somewhat because of differences between PostgreSQL and the
two databases used for prototype implementations for those papers
(Oracle Berkeley DB and InnoDB), and there are a few original ideas
from Dan and myself used to optimize the implementation. One reason
for hoping that this patch gets committed in this CF is that it will
leave time to try out some other, more speculative optimizations
before release.
Documentation is not included in this patch; I plan on submitting
that to a later CF as a separate patch. Changes should be almost
entirely within the Concurrency Control chapter. The current patch
has one new GUC which (if kept) will need to be documented, and one
of the potential optimizations could involve adding a new
transaction property which would then need documentation.
The premise of the patch is simple: that snapshot isolation comes so
close to supporting fully serializable transactions that S2PL is not
necessary -- the database engine can watch for rw-dependencies among
transactions, without introducing any blocking, and roll back
transactions as required to prevent serialization anomalies. This
eliminates the need for using the SELECT FOR SHARE or SELECT FOR
UPDATE clauses, the need for explicit locking, and the need for
additional updates to introduce conflict points.
While block-level locking is included in this patch for btree and
GiST indexes, an index relation lock is still used for predicate
locks when a search is made through a GIN or hash index. Finer
granularity for these index types can be implemented separately.
Dan is
looking at bringing btree indexes to finer granularity, but wants to
have good benchmarks first, to confirm that the net impact is a gain
in performance.
Most of the work is in the new predicate.h and predicate.c files,
which total 2,599 lines, over 39% of which are comment lines. There
are 1,626 lines in the new pg_dtester.py.in file, which uses Markus
Wanner's dtester software to implement a large number of correctness
tests. We added 79 lines to lockfuncs.c to include the new
SIReadLock entries in the pg_locks view. The rest of the patch
affects 286 lines (counting an updated line twice) across 25
existing PostgreSQL source files to implement the actual feature.
The code organization and naming issues mentioned here remain:
http://archives.postgresql.org/pgsql-hackers/2010-07/msg00383.php
-Kevin
Attachments:
serializable-5.patch (text/plain), +4584 -105
On 14/09/10 19:34, Kevin Grittner wrote:
Attached is the latest Serializable Snapshot Isolation (SSI) patch.
Great work! A year ago I thought it would be impossible to have a true
serializable mode in PostgreSQL because of the way we do MVCC, and now
we have a patch.
At a quick read-through, the code looks very tidy and clear now. Some
comments:
Should add a citation to Cahill's work this is based on. Preferably with
a hyperlink. A short description of how the predicate locks help to
implement serializable mode would be nice too. I haven't read Cahill's
papers, and I'm left wondering what the RW conflicts and dependencies
are, when you're supposed to grab predicate locks etc.
If a page- or relation-level SILOCK is taken, is it possible to get
false conflicts? I.e. a transaction is rolled back because it modified a
tuple on a page where some other transaction modified another tuple,
even though there's no dependency between the two.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Great work! A year ago I thought it would be impossible to have a
true serializable mode in PostgreSQL because of the way we do
MVCC, and now we have a patch.

At a quick read-through, the code looks very tidy and clear now.
Some comments:

Should add a citation to Cahill's work this is based on.
Preferably with a hyperlink.
I'm planning on drawing from the current Wiki page:
http://wiki.postgresql.org/wiki/Serializable
to put together a README file; do you think the references should go
in the README file, the source code, or both?
A short description of how the predicate locks help to implement
serializable mode would be nice too. I haven't read Cahill's
papers, and I'm left wondering what the RW conflicts and
dependencies are, when you're supposed to grab predicate locks
etc.
Again, I summarize that in the Wiki page, and was planning on
putting it into the README. If you've read the Wiki page and it's
not clear, then I definitely have some work to do there.
If a page- or relation-level SILOCK is taken, is it possible to
get false conflicts?
Yes. This technique will generate some false positive rollbacks.
Software will need to be prepared to retry any database transaction
which fails with a serialization failure SQLSTATE. I expect that
proper connection pooling will be particularly important when using
SSI, and flagging transactions which don't write to permanent tables
as READ ONLY transactions will help reduce the rollback rate, too.
Some of the optimizations we have sketched out will definitely
reduce the rate of false positives; however, we don't want to
implement them without a better performance baseline because the
cost of tracking the required information and the contention for LW
locks to maintain the information may hurt performance more than the
restart of transactions which experience serialization failure.
I don't want to steal Dan's thunder after all the hard work he's
done to get good numbers from the DBT-2 benchmark, but suffice it to
say that I've been quite pleased with the performance on that
benchmark. He's pulling together the data for a post on the topic.
I.e. a transaction is rolled back because it modified a tuple on a
page where some other transaction modified another tuple, even
though there's no dependency between the two.
Well, no, because this patch doesn't do anything new with write
conflicts. It's all about the apparent order of execution, based on
one transaction not being able to read what was written by a
concurrent transaction. The reading transaction must be considered
to have run first in that case (Hey, now you know what a rw-conflict
is!) -- but such conflicts can create a cycle in the apparent order
of execution -- which is the source of all serialization anomalies
in snapshot isolation.
-Kevin
I've been thinking about these points, and reconsidered somewhat.
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Should add a citation to Cahill's work this is based on.
Preferably with a hyperlink.
I've been thinking that this should be mentioned in both the README
and the source code.
A short description of how the predicate locks help to implement
serializable mode would be nice too. I haven't read Cahill's
papers, and I'm left wondering what the RW conflicts and
dependencies are, when you're supposed to grab predicate locks
etc.
Again -- why be stingy? Given a more complete README file, how
about something like this?:
/*
* A rw-conflict occurs when a read by one serializable transaction
* does not see the write of a concurrent serializable transaction
* when that write would have been visible had the writing
* transaction committed before the start of the reading
* transaction. When the write occurs first, the read can detect
* this conflict by examining the MVCC information. When the read
* occurs first, it must record this somewhere so that writes can
* check for a conflict. Predicate locks are used for this.
* Detection of such a conflict does not cause blocking, and does
* not, in itself, cause a transaction rollback.
*
* Transaction rollback is required when one transaction (called a
* "pivot") has a rw-conflict *in* (a concurrent transaction
* couldn't see its write) as well as *out* (it couldn't see the
* write of another transaction). In addition, the transaction on
* the "out" side of the pivot must commit first, and if the
* transaction on the "in" side of the pivot is read-only, it must
* acquire its snapshot after the successful commit of the
* transaction on the "out" side of the pivot.
*/
Would something like that have helped?
-Kevin
On 15/09/10 00:49, Kevin Grittner wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
A short description of how the predicate locks help to implement
serializable mode would be nice too. I haven't read Cahill's
papers, and I'm left wondering what the RW conflicts and
dependencies are, when you're supposed to grab predicate locks
etc.

Again -- why be stingy? Given a more complete README file, how
about something like this?:
Well, if it's explained in the readme, that's probably enough.
/*
* A rw-conflict occurs when a read by one serializable transaction
* does not see the write of a concurrent serializable transaction
* when that write would have been visible had the writing
* transaction committed before the start of the reading
* transaction. When the write occurs first, the read can detect
* this conflict by examining the MVCC information. When the read
* occurs first, it must record this somewhere so that writes can
* check for a conflict. Predicate locks are used for this.
* Detection of such a conflict does not cause blocking, and does
* not, in itself, cause a transaction rollback.
*
* Transaction rollback is required when one transaction (called a
* "pivot") has a rw-conflict *in* (a concurrent transaction
* couldn't see its write) as well as *out* (it couldn't see the
* write of another transaction). In addition, the transaction on
* the "out" side of the pivot must commit first, and if the
* transaction on the "in" side of the pivot is read-only, it must
* acquire its snapshot after the successful commit of the
* transaction on the "out" side of the pivot.
*/

Would something like that have helped?
Yes. An example would be very nice too; that description alone is
pretty hard to grasp. Having read the Wiki page, and the slides from
your presentation at PG East 2010, I think I understand it now.
Now that I understand what the predicate locks are for, I'm now trying
to get my head around all the data structures in predicate.c. The
functions are well commented, but an overview at the top of the file of
all the hash tables and other data structures would be nice. What is
stored in each, when are they updated, etc.
I've been meaning to look at this patch for some time, but now I'm
actually glad I haven't because I'm now getting a virgin point of view
on the code, seeing the problems that anyone who's not familiar with the
approach will run into. :-)
BTW, does the patch handle prepared transactions yet? It introduces a
call to PreCommit_CheckForSerializationFailure() in CommitTransaction;
I think you'll need that in PrepareTransaction as well.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Now that I understand what the predicate locks are for, I'm now
trying to get my head around all the data structures in
predicate.c. The functions are well commented, but an overview at
the top of the file of all the hash tables and other data
structures would be nice. What is stored in each, when are they
updated, etc.
It probably doesn't help that they're split between predicate.c and
predicate.h. (They were originally all in predicate.c because
nobody else needed to see them, but we moved some to the .h file to
expose them to lockfuncs.c to support listing the locks.)
I'm inclined to move everything except the function prototypes out
of predicate.h to a new predicate_internal.h, and move the structures
defined in predicate.c there, too. And, of course, add the overview
comments in the new file. If that sounds good, I can probably
post a new patch with those changes today -- would that be a good
idea, or should I wait for more feedback before doing that? (It
will be in the git repo either way.)
BTW, does the patch handle prepared transactions yet? It
introduces a call to PreCommit_CheckForSerializationFailure() in
CommitTransaction; I think you'll need that in PrepareTransaction
as well.
Good point. In spite of the NB comment, I did not notice that.
Will fix.
Thanks for the feedback!
-Kevin
Excerpts from Kevin Grittner's message of Wed Sep 15 09:15:53 -0400 2010:
I'm inclined to move everything except the function prototypes out
of predicate.h to a new predicate_internal.h, and move the structures
defined in predicate.c there, too.
I think that would also solve a concern that I had, which is that we
were starting to include relcache.h (and perhaps other headers as well,
but that's the one that triggered it for me) a bit too liberally, so +1
from me.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> wrote:
I think that would also solve a concern that I had, which is that
we were starting to include relcache.h (and perhaps other headers
as well, but that's the one that triggered it for me) a bit too
liberally, so +1 from me.
Unfortunately, what I proposed doesn't solve that for relcache.h,
although it does eliminate lock.h from almost everywhere and htup.h
from everywhere. (The latter seemed to be left over from an
abandoned approach, and was no longer needed in predicate.h in any
event.)
Most of the functions in predicate.c take a Relation as a parameter.
I could split out the function prototypes for those which *don't*
use it to a separate .h file if you think it is worthwhile. The
functions would be:
void InitPredicateLocks(void);
Size PredicateLockShmemSize(void);
void RegisterSerializableTransaction(const Snapshot snapshot);
void ReleasePredicateLocks(const bool isCommit);
void PreCommit_CheckForSerializationFailure(void);
The files where these are used are:
src/backend/storage/ipc/ipci.c
src/backend/utils/time/snapmgr.c
src/backend/utils/resowner/resowner.c
src/backend/access/transam/xact.c
So any of these files which don't already include relcache.h could
remain without it if we make this split. Is there an easy way to
check which might already include it? Is it worth adding one more
.h file to avoid including relcache.h and snapshot.h in these four
files?
Let me know -- I'm happy to arrange this any way people feel is most
appropriate. I have a profound appreciation for the organization of
this code, and want to maintain it, even if I don't possess the
perspective to know how to best do so. The respect comes from
developing this patch -- every time I gave my manager an estimate of
how long it would take to do something, I found it actually took
about one-third of that time -- and it was entirely due to the
organization and documentation of the code.
-Kevin
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
The functions are well commented, but an overview at the top of
the file of all the hash tables and other data structures would be
nice. What is stored in each, when are they updated, etc.
I moved all the structures from predicate.h and predicate.c to a new
predicate_internal.h file and added comments. You can view its
current contents here:
Does this work for you?
That leaves the predicate.h file with just this:
-Kevin
Excerpts from Kevin Grittner's message of Wed Sep 15 14:52:36 -0400 2010:
Alvaro Herrera <alvherre@commandprompt.com> wrote:
I think that would also solve a concern that I had, which is that
we were starting to include relcache.h (and perhaps other headers
as well, but that's the one that triggered it for me) a bit too
liberally, so +1 from me.

Unfortunately, what I proposed doesn't solve that for relcache.h,
although it does eliminate lock.h from almost everywhere and htup.h
from everywhere.
Now that I look at your new patch, I noticed that I was actually
confusing relcache.h with rel.h. The latter includes a big chunk of our
headers, but relcache.h is pretty thin. Including relcache.h in another
header is not much of a problem.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> wrote:
Now that I look at your new patch, I noticed that I was actually
confusing relcache.h with rel.h. The latter includes a big chunk
of our headers, but relcache.h is pretty thin. Including
relcache.h in another header is not much of a problem.
OK, thanks for the clarification.
With the structures all brought back together in a logical order,
and the new comments in front of the structure declarations, do you
think a summary at the top of the file is still needed in that
header file?
-Kevin
On 17/09/10 01:35, Kevin Grittner wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
The functions are well commented, but an overview at the top of
the file of all the hash tables and other data structures would be
nice. What is stored in each, when are they updated, etc.

I moved all the structures from predicate.h and predicate.c to a new
predicate_internal.h file and added comments. You can view its
current contents here:

Does this work for you?
Yes, thank you, that helps a lot.
So, the purpose of SerializableXidHash is to provide quick access to the
SERIALIZABLEXACT struct of a top-level transaction, when you know its
transaction id or any of its subtransaction ids. To implement the "or
any of its subtransaction ids" part, you need to have a SERIALIZABLEXID
struct for each subtransaction in shared memory. That sounds like it can
eat through your shared memory very quickly if you have a lot of
subtransactions.
Why not use SubTransGetTopmostTransaction() ?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
So, the purpose of SerializableXidHash is to provide quick access
to the SERIALIZABLEXACT struct of a top-level transaction, when you
know its transaction id or any of its subtransaction ids.
Right.
To implement the "or any of its subtransaction ids" part, you need
to have a SERIALIZABLEXID struct for each subtransaction in shared
memory.
Close -- each subtransaction which writes any tuples.
That sounds like it can eat through your shared memory very quickly
if you have a lot of subtransactions.
Hmmm.... I've never explicitly used subtransactions, so I don't tend
to think of them routinely going too deep. And the struct is pretty
small.
Why not use SubTransGetTopmostTransaction() ?
This needs to work when the xid of a transaction is found in the MVCC
data of a tuple for any overlapping serializable transaction -- even
if that transaction has completed and its connection has been
closed. It didn't look to me like SubTransGetTopmostTransaction()
would work after the transaction was gone.
I guess that's something I should mention in the comments....
-Kevin
On 17/09/10 14:56, Kevin Grittner wrote:
Heikki Linnakangas wrote:
Why not use SubTransGetTopmostTransaction() ?
This needs to work when the xid of a transaction is found in the MVCC
data of a tuple for any overlapping serializable transaction -- even
if that transaction has completed and its connection has been
closed. It didn't look to me like SubTransGetTopmostTransaction()
would work after the transaction was gone.
You're right, it doesn't retain those old transactions. But it could
easily be modified to do so.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
On 17/09/10 14:56, Kevin Grittner wrote:
Heikki Linnakangas wrote:
Why not use SubTransGetTopmostTransaction() ?
This needs to work when the xid of a transaction is found in the
MVCC data of a tuple for any overlapping serializable transaction
-- even if that transaction has completed and its connection has
been closed. It didn't look to me like
SubTransGetTopmostTransaction() would work after the transaction
was gone.

You're right, it doesn't retain those old transactions. But it could
easily be modified to do so.
I shall look into it.
-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Heikki Linnakangas wrote:
That sounds like it can eat through your shared memory very quickly
if you have a lot of subtransactions.
Hmmm.... I've never explicitly used subtransactions, so I don't tend
to think of them routinely going too deep. And the struct is pretty
small.
That assumption is absolutely, totally not going to fly.
Why not use SubTransGetTopmostTransaction() ?
This needs to work when the xid of a transaction is found in the MVCC
data of a tuple for any overlapping serializable transaction -- even
if that transaction has completed and its connection has been
closed. It didn't look to me like SubTransGetTopmostTransaction()
would work after the transaction was gone.
Yes, it should work. If it doesn't, you are failing to manage the
TransactionXmin horizon correctly.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
That assumption is absolutely, totally not going to fly.
Understood; I'm already working on it based on Heikki's input.
This needs to work when the xid of a transaction is found in the
MVCC data of a tuple for any overlapping serializable transaction
-- even if that transaction has completed and its connection has
been closed. It didn't look to me like
SubTransGetTopmostTransaction() would work after the transaction
was gone.

Yes, it should work. If it doesn't, you are failing to manage the
TransactionXmin horizon correctly.
So far I haven't wanted to mess with the global xmin values for fear
of the possible impact on other transactions. It actually hasn't
been that hard to maintain a SerializableGlobalXmin value, which is
more efficient than the existing ones for predicate lock cleanup
purposes. That still isn't exactly what I need to modify cleanup of
the subtransaction information, though. Once I've got my head
around the subtrans.c code, I think I'll need to maintain a minimum
that includes the xids for serializable transactions which *overlap*
SerializableGlobalXmin. That doesn't seem very hard to do; I just
haven't needed it until now. Then I'll modify the subtransaction
cleanup to only remove entries before the earlier of the global xmin
of all transactions and the xmin of serializable transactions which
overlap active serializable transactions.
Does all that sound reasonable?
-Kevin
[Apologies for not reply-linking this; work email is down so I'm
sending from gmail.]
Based on feedback from Heikki and Tom I've reworked how I find the
top-level transaction. This is in the git repo, and the changes can
be viewed at:
-Kevin
On 18/09/10 21:52, Kevin Grittner wrote:
[Apologies for not reply-linking this; work email is down so I'm
sending from gmail.]

Based on feedback from Heikki and Tom I've reworked how I find the
top-level transaction. This is in the git repo, and the changes can
be viewed at:
Thanks, much simpler. Now let's simplify it some more ;-)
ISTM you never search the SerializableXactHash table using a hash key,
except the one call in CheckForSerializableConflictOut, but there you
already have a pointer to the SERIALIZABLEXACT struct. You only re-find
it to make sure it hasn't gone away while you trade the shared lock for
an exclusive one. If we find another way to ensure that, ISTM we don't
need SerializableXactHash at all. My first thought was to forget about
VirtualTransactionId and use TransactionId directly as the hash key for
SERIALIZABLEXACT. The problem is that a transaction doesn't have a
transaction ID when RegisterSerializableTransaction is called. We could
leave the TransactionId blank and only add the SERIALIZABLEXACT struct
to the hash table when an XID is assigned, but there's no provision to
insert an existing struct into a hash table in the current hash table API.
So, I'm not sure of the details yet, but it seems like it could be made
simpler somehow..
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
ISTM you never search the SerializableXactHash table using a hash
key, except the one call in CheckForSerializableConflictOut, but
there you already have a pointer to the SERIALIZABLEXACT struct.
You only re-find it to make sure it hasn't gone away while you
trade the shared lock for an exclusive one. If we find another way
to ensure that, ISTM we don't need SerializableXactHash at all. My
first thought was to forget about VirtualTransactionId and use
TransactionId directly as the hash key for SERIALIZABLEXACT. The
problem is that a transaction doesn't have a transaction ID when
RegisterSerializableTransaction is called. We could leave the
TransactionId blank and only add the SERIALIZABLEXACT struct to the
hash table when an XID is assigned, but there's no provision to
insert an existing struct into a hash table in the current hash table
API.

So, I'm not sure of the details yet, but it seems like it could be
made simpler somehow..
After tossing it around in my head for a bit, the only thing that I
see (so far) which might work is to maintain a *list* of
SERIALIZABLEXACT objects in memory rather than using a hash table.
The recheck after releasing the shared lock and acquiring an
exclusive lock would then go through SerializableXidHash. I think
that can work, although I'm not 100% sure that it's an improvement.
I'll look it over in more detail. I'd be happy to hear your thoughts
on this or any other suggestions.
-Kevin