A few beginner's questions concerning concurrency control

Started by Yoram Bibermanalmost 22 years ago4 messagesgeneral
Jump to latest
#1Yoram Biberman
yoramb@hadassah-col.ac.il

I have a few questions concerning concurrency control. I shall thank whoever
can help me.

Question #1
=========
Assume the following (concurrent) schedule, in which both transactions run
in a serializable isolation level:
T1 begin
T2 begin
T1 modifies an item A
T2 reads (selects) A
T2 modifies an item B
T1 reads (selects) B
T1 commits
T2 commits
If I understand correctly then both transactions would be able to commit (as
they modified different items). Each would see a snapshot of the database as
if it ran alone,
and each would read the initial value of the item it reads (and not its
value after the update). But we cannot say that the schedule is equivalent
nor to the serial schedule that run T1 first (as in this schedule T2 would
read the value of B after it was modified by T1), neither to the schedule
that run T2 first (from a symmetric argument concerning item B). So the
schedule is not serializable in the sense of the theory of database systems
(e.g. Ullman’s Principles of Database Systems book). Am I right?

Question #2
=========
I was not able to understand the difference between all the lock modes, when
would a transaction (or the db system) use each lock, and which data
structures each lock locks. For example: what is the difference between a
ROW SHARE lock mode, and a ROW EXCLUSIVE lock mode. I understand that the
former is acquired by a select … for update, while the latter is acquired,
for example, by an UPDATE command; but after a transaction issues select …
for update it has the opportunity to modify the row, so why do we need two
different lock modes? Or what is the difference between SHARE and ACCESS
SHARE? Which data structures are being locked by each lock? Why is EXCLUSIVE
congruent with ACCESS SHARE? If I may say so, I find
The documentation a bit partial concerning the intuition behind the lock
modes,
and examples of using them (beyond the Star War example).

Question #3
=========
In some places it is said that a transaction that only reads does not lock
any table or row, and is never blocked. But if a transaction T1 modifies a
row r, and at the same time transaction T2 selects r, then T2 need to wait
until T1 finishes (as T1 might have deleted the row, or modified it in a way
that would cause T1 not to need it, as the row does not satisfy T2’s WHERE
clause). Am I right? On the other hand in order to read a table T2 gets an
ACCESS SHARE lock on the table, so it blocks transactions that want to drop
the table (and I do not understand why it does not block transactions that
want to add/delete/update rows of the table).

I would thank you if you could find the time to help me with those
questions.

Thanking you in advance

Yoram Biberamn

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Yoram Biberman (#1)
Re: A few beginner's questions concerning concurrency control

Yoram Biberman <yoramb@hadassah-col.ac.il> writes:

... But we cannot say that the schedule is equivalent
nor to the serial schedule that run T1 first (as in this schedule T2 would
read the value of B after it was modified by T1), neither to the schedule
that run T2 first (from a symmetric argument concerning item B). So the
schedule is not serializable in the sense of the theory of database systems
(e.g. Ullman=92s Principles of Database Systems book). Am I right?

Right. To make the behavior mathematically serializable we would have
to add predicate locking --- that is, when T2 reads A it would have to
take out a read-lock on A (in fact, on all rows potentially matching
the WHERE condition it used for its SELECT). This would be amazingly
expensive and it wouldn't actually improve functionality for most
applications :-(

I was not able to understand the difference between all the lock modes, when
would a transaction (or the db system) use each lock, and which data
structures each lock locks.

All the locks are table locks, and the only differences between them are
which other lock types are blocked by a given lock type. The reason for
having so many is to allow fine-grained control of what sorts of things
can be done to a table concurrently. For instance, we can allow reads
and writes to occur concurrently, but we can't allow reads or writes to
occur concurrently with a table schema modification (such as DROP INDEX).

In some places it is said that a transaction that only reads does not lock
any table or row, and is never blocked. But if a transaction T1 modifies a
row r, and at the same time transaction T2 selects r, then T2 need to wait
until T1 finishes (as T1 might have deleted the row, or modified it in a way
that would cause T1 not to need it, as the row does not satisfy T2=92s WHERE
clause). Am I right?

No, because T2 is selecting the row or not based on its state at T2's
snapshot time. What T1 did to it immediately after that time is not
interesting.

On the other hand in order to read a table T2 gets an
ACCESS SHARE lock on the table, so it blocks transactions that want to drop
the table (and I do not understand why it does not block transactions that
want to add/delete/update rows of the table).

We do that mainly because the physical act of dropping the table (ie,
removing the storage file) isn't transactional. We could make this work
if we found a way to postpone the file unlink until after the table is
no longer visible to any running transaction --- but that seems like
more trouble than it's really worth. In practice, concurrent read and
write is what you want for real applications; concurrent schema changes
are not important enough to install a large amount of mechanism to
support. (IMHO anyway.)

regards, tom lane

#3Bruno Wolff III
bruno@wolff.to
In reply to: Yoram Biberman (#1)
Re: A few beginner's questions concerning concurrency control

On Tue, Jun 29, 2004 at 11:55:50 +0200,
Yoram Biberman <yoramb@hadassah-col.ac.il> wrote:

that run T2 first (from a symmetric argument concerning item B). So the
schedule is not serializable in the sense of the theory of database systems
(e.g. Ullman�s Principles of Database Systems book). Am I right?

Your example is not serializable. The main problem case for this is when
people try to select the maximum value in a table and use a higher value
in a newly inserted record. You need to lock the table to make sure this
works.

The documentation a bit partial concerning the intuition behind the lock
modes,
and examples of using them (beyond the Star War example).

I think the best way to understand the documentation is to note which
locks conflict and which statements acquire them. That way if you
want to block an operation you can figure out which locks block that
operation and choose the one that has the fewest side effects.

In some places it is said that a transaction that only reads does not lock
any table or row, and is never blocked. But if a transaction T1 modifies a
row r, and at the same time transaction T2 selects r, then T2 need to wait
until T1 finishes (as T1 might have deleted the row, or modified it in a way
that would cause T1 not to need it, as the row does not satisfy T2�s WHERE
clause). Am I right? On the other hand in order to read a table T2 gets an
ACCESS SHARE lock on the table, so it blocks transactions that want to drop
the table (and I do not understand why it does not block transactions that
want to add/delete/update rows of the table).

No. T2 sees the row in the state before T1 touched it. Only if T2 also needs
to update that row, does it need to wait.

#4Scott Marlowe
smarlowe@qwest.net
In reply to: Yoram Biberman (#1)
Re: A few beginner's questions concerning concurrency

On Tue, 2004-06-29 at 03:55, Yoram Biberman wrote:

I have a few questions concerning concurrency control. I shall thank
whoever can help me.

Question #1
=========
Assume the following (concurrent) schedule, in which both transactions
run in a serializable isolation level:
T1 begin
T2 begin
T1 modifies an item A
T2 reads (selects) A
T2 modifies an item B
T1 reads (selects) B
T1 commits
T2 commits
If I understand correctly then both transactions would be able to
commit (as they modified different items). Each would see a snapshot
of the database as if it ran alone,

If T2 is going to be using information from A to update B, then the
select for A should include a "for update" clause. I know one would
think that "for update" should be used on the destination row, but in
this case, the for update clause on T2's select means the transaction
will fail, and the client should now detect that, rollback, and attempt
the transaction again. Here's what happens:

T1 begin;
T1 set transaction_isolation = serializable;
T2 begin;
T2 set transaction_isolation = serializable;
T1 update test set info='ghi' where id='A';
T2 select * from test where id=1 for update;
(T2 now waits for T1 to complete or rollback)

If T1 now commits, we get this error on T2:

ERROR: could not serialize access due to concurrent update

If T1 rolls back, T2 will complete. This provides the user with the
equivalent to a predicate locking model. However, with such locking
rollbacks become more common, and the client has to know to try again
should a transaction fail.

Question #3
=========
In some places it is said that a transaction that only reads does not
lock any table or row, and is never blocked. But if a transaction T1
modifies a row r, and at the same time transaction T2 selects r, then
T2 need to wait until T1 finishes (as T1 might have deleted the row,
or modified it in a way that would cause T1 not to need it, as the row
does not satisfy T2’s WHERE clause). Am I right?

Yes. Because that's very useful for when people reading the database
aren't writing to it. It allows the database to handle a very large
read load while still updating underneath it.

However, it requires some forethought to make update procedures that
operate on the dataset run properly. While the readers can run in
serializable or read committed depening on their need to see coherent
data between rows and all, the writers would by necessity have to run
not just serializable, but also have to mark the rows they're reading as
for update to make sure another writer didn't get the wrong number.
Since writers, by definition, are usually written by wizards, they can
be expected to know to use for update et. al. to ensure things work out
ok.

Basically, data just read with no write lock shouldn't be used to update
other data in a serializable transaction.

On the other hand in order to read a table T2 gets an ACCESS SHARE
lock on the table, so it blocks transactions that want to drop the
table (and I do not understand why it does not block transactions that
want to add/delete/update rows of the table).

OK, you have 10,000,000,000 rows in the table. You have 1,000 people
connected to the database. Every time one of the 1,000 people read a
dozen to a thousand rows, you lock each one. These locks have to be
shared with all the other backends. How much does that cost?

OTOH, you have the same number as above, every time they read the table,
you throw one single, cheap, almost never failing lock on the table.
How much does that cost?

Finally, you have the same data set. Out of the 1,000 people connected
to the database, three are modifiers / writers. The rest are report
generators or sales / marketing folks doing data mining. The three
modifiers, working on small data sets each time, lock each table they
use for select with for update. Those individual rows get a row level
lock, that has to be shared out with all the other backends. How much
does that cost?

In general, the third scenario is the most useful. It allows for truly
huge read loads to exist on top of complex, slow modifiers and still get
their job done. But it requires the writers to know what they are
doing, semantically.