Shared row locking

Started by Alvaro Herrera · over 21 years ago · 35 messages · hackers
#1Alvaro Herrera
alvherre@dcc.uchile.cl

Hi,

I've been thinking about how to do shared row locking. Here are some very
preliminary ideas on this issue. Please comment, particularly if any
part of it sounds unworkable or too incomplete.

There are several problems to be solved here: the grammar, the internal
SelectStmt representation, how to store and share the info between
backends, how to clean up at transaction end, and how to clean up at
backend crash.

The Grammar
===========

The SQL spec does not say anything in this respect (that I can find).
It only talks of "FOR UPDATE" and "FOR READ ONLY". However, because the
FK code uses SPI to do the locking, we definitely have to expose the
functionality through SQL. So I think we need a new clause, which I
propose to be "FOR SHARE".

The Parser and SelectStmt
=========================

The parser uses for_update_clause and opt_for_update_clause
nonterminals. I assume it's best to change them to (new)
locking_clause, which can in turn be for_update_clause or (new)
for_share_clause.

SelectStmt currently has a forUpdate field (a List to the to-be-locked
tables, or an empty list meaning all of them). We could simply add
another list, say forShare, or use a common list and a flag saying that
it's one or the other. I prefer adding a new list. (Same with the
Query node.)
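To make the proposed node change concrete, here is a minimal Python stand-in for the C SelectStmt node, assuming the two-list design described above (the forShare field is the proposal, not existing code):

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sketch of the SelectStmt change: the existing forUpdate
# list plus the proposed forShare list.  An empty list still means
# "lock all tables in the query"; None means no locking clause at all.
@dataclass
class SelectStmt:
    forUpdate: Optional[List[str]] = None   # tables locked FOR UPDATE
    forShare: Optional[List[str]] = None    # tables locked FOR SHARE (proposed)

# SELECT ... FROM a, b FOR SHARE OF a
stmt = SelectStmt(forShare=["a"])
assert stmt.forUpdate is None and stmt.forShare == ["a"]

# SELECT ... FOR UPDATE  (empty list = all tables)
stmt2 = SelectStmt(forUpdate=[])
assert stmt2.forUpdate == []
```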

How to Store the Info
=====================

This is the really interesting part. I have two ideas: one mine (btree)
and the other Bruce's (bitmap).

Using a B-tree
--------------
When a backend wants to lock a tuple, it sets a bit in its infomask.
Then it inserts an entry into a btree in a special tablespace, using
RelationId-BlockNumber-OffsetNumber as the key, and BackendId-TransactionId
as the value; actually, an array with a single element containing those two
values.

When a backend wants to lock a tuple that is already locked, it goes to
the btree and inserts itself into the array. To do this, it inserts a
new index item (the enlarged array) and deletes the previous one. No
other backend can try to insert simultaneously (which would cause an ugly
race condition), because we hold an exclusive lock on the tuple's heap
page's buffer.

At transaction end, nothing special happens (tuples are not unlocked
explicitly).

When someone wants to know if the tuple is locked (to mark it FOR
UPDATE, or to delete it), it checks the infomask. If it says it's
locked, it goes to check the btree. If the array contains only
BackendId-TransactionId pairs that are no longer running, then the
tuple is not locked and can be deleted/marked (and the btree can be
cleaned up). Else, it will have to wait, using XactLockTableWait, for
the first transaction in the array that is still running. We can be
sure that no one will try to share-lock the tuple while we check the
btree because we hold an exclusive lock on the tuple's heap page's
buffer.
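The acquire and check steps above can be sketched with a toy in-memory model; the names here (lock_btree, share_lock, first_blocking_xact) are invented for illustration, and the dict stands in for the real on-disk btree:

```python
# Key: (relid, blkno, offnum).  Value: the array of (BackendId, XID)
# lockers.  The caller is assumed to hold exclusive lock on the tuple's
# heap page buffer, so nobody else can grow the array concurrently.
lock_btree = {}

def share_lock(tid, backendid, xid):
    # "Insert a new index item (the enlarged array) and delete the
    # previous one" collapses to a dict update in this toy model.
    lockers = lock_btree.get(tid, [])
    lock_btree[tid] = lockers + [(backendid, xid)]

def first_blocking_xact(tid, running_xids):
    # Return the first still-running locker to XactLockTableWait on,
    # or None if every locker has ended (tuple is effectively unlocked
    # and the btree entry can be lazily cleaned up).
    for backendid, xid in lock_btree.get(tid, []):
        if xid in running_xids:
            return xid
    lock_btree.pop(tid, None)   # lazy cleanup
    return None

tid = (16384, 0, 1)             # (relid, blkno, offnum)
share_lock(tid, backendid=3, xid=100)
share_lock(tid, backendid=7, xid=101)
assert first_blocking_xact(tid, running_xids={101}) == 101  # wait on 101
assert first_blocking_xact(tid, running_xids=set()) is None  # all ended
assert tid not in lock_btree                                 # cleaned up
```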

Note that to check whether a transaction is running we need to lock
SInvalLock. To minimize the time we hold it, we save the BackendId so
we don't have to scan the whole shmInvalBuffer->procState array, only
the item that we need to look at. Another possibility would be to use
stock TransactionIdIsInProgress and save the extra 4 bytes of storage.

At server restart, the btree is created empty (or just deleted). There
is one btree per database.

Using a Bitmap
--------------
First we create a counter called the shared lock row counter. Then we
create a file like pg_clog, in which each counter slot has a bit for every
backend. When we want to share-lock a row, we increment the counter, put
that counter value on the row, and set our backend's bit in the new
file. We also store that counter value in our backend-local memory. On
commit we go through that local memory list and reset all our bits for
those counters. When a row has all zeros, it can be recycled like we do
with pg_clog.
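A toy model of this bitmap scheme, with all names invented for illustration (the dict stands in for the pg_clog-like file, and MAX_BACKENDS is arbitrary):

```python
# One counter slot per share-locked row; each slot holds a bit per backend.
slots = {}          # counter value -> bitmap (int); stands in for the file
next_counter = 0
local_locks = {}    # backendid -> list of counters it set (backend-local)

def share_lock_row(backendid):
    global next_counter
    next_counter += 1                    # counter value stamped on the row
    slots[next_counter] = 1 << backendid
    local_locks.setdefault(backendid, []).append(next_counter)
    return next_counter

def commit(backendid):
    # Walk the backend-local list and clear our bits; an all-zero slot
    # can be recycled like a pg_clog page.
    for c in local_locks.pop(backendid, []):
        slots[c] &= ~(1 << backendid)
        if slots[c] == 0:
            del slots[c]

c = share_lock_row(backendid=2)      # backend 2 share-locks a row
slots[c] |= 1 << 5                   # backend 5 joins the same shared lock
commit(backendid=2)                  # backend 2 commits: clears only its bit
assert slots[c] == 1 << 5            # backend 5 still holds the lock
```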

Problems and random comments
============================

There is a possibility of starvation if somebody wants to lock a tuple
exclusively and shared lockers keep arriving. I'm not sure how to solve
this.

The wakeup mechanism is not discussed ... is there a special need for
something beyond what we can do with XactLockTable{Insert,Wait} ?

Thanks to tablespaces, it's very easy to create special Relations that
can be dealt with by standard buffer and storage manager, etc.

The btree idea:
- does not need crash recovery. Maybe we could use a stripped-down
version of nbtree. This could cause a maintainability nightmare.

- can't hold more than 300 or so simultaneous lockers (because of value
length, limited to 1/3 of a page). I doubt this is a real problem.

- could have problems (excessive storage requirements) in the long run
because of empty or almost-empty pages.
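The "300 or so" figure can be sanity-checked with a back-of-the-envelope calculation, assuming the default 8 kB block size and 8 bytes per (BackendId, TransactionId) pair:

```python
# Btree index items are limited to roughly 1/3 of a page, so the locker
# array caps out at about:
BLCKSZ = 8192                       # default PostgreSQL block size
max_item_size = BLCKSZ // 3         # ~1/3 of a page
pair_size = 4 + 4                   # 4-byte BackendId + 4-byte TransactionId

max_lockers = max_item_size // pair_size
assert max_lockers == 341           # consistent with "300 or so"
```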

The bitmap idea:
- seems to have lower overhead

- can use the same lazy cleanup mechanism exposed for the btree idea (in
which case we don't need the list in local memory).

- What can happen in presence of large max_connections settings? Is
this a real problem?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
Oh-h-h, the Galacian girls will do it for pearls,
And the Arrakeen for water! But if you desire dames
Like consuming flames, try a Caladanin daughter! (Gurney Halleck)

#2Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#1)
Re: Shared row locking

Alvaro Herrera wrote:

The btree idea:
- does not need crash recovery. Maybe we could use a stripped down
version of nbtree. This could cause a maintanibility nightmare.

Are you saying the btree is an index with no heap? If so, what about
the xid's? Are they just in the btree?

How does the btree get cleaned up over time?

The bitmap idea:
- seems to have lower overhead

- can use the same lazy cleanup mechanism exposed for the btree idea (in
which case we don't need the list in local memory).

You mean all empty/zero rows can be removed? Can we guarantee that on
commit we can clean up the bitmap? If not the idea doesn't work.

- What can happen in presence of large max_connections settings? Is
this a real problem?

I thought about that. 50 backends is 7 bytes, 1000 backends is 128
bytes. For a large number of backends you could just allow X concurrent
locks and use space X*4 bytes.
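Bruce's size estimate is one bit per backend per counter slot, rounded up to whole bytes (a sketch; 1000 backends comes out to 125 bytes, so his 128 presumably allows for alignment or padding):

```python
import math

# Bitmap bytes needed per counter slot: one bit per backend.
def bitmap_bytes(max_connections):
    return math.ceil(max_connections / 8)

assert bitmap_bytes(50) == 7       # matches "50 backends is 7 bytes"
assert bitmap_bytes(1000) == 125   # ~128 bytes with padding/alignment
```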

I think the basic issue is that the btree can be of variable length
while the bitmap has to be of a fixed length.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#3Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Alvaro Herrera (#1)
Re: Shared row locking

The SQL spec does not say anything on this respect (that I can find).
It only talks of "FOR UPDATE" and "FOR READ ONLY". However, because the
FK code uses SPI to do the locking, we definitely have to expose the
functionality through SQL. So I think we need a new clause, which I
propose to be "FOR SHARE".

MySQL uses LOCK IN SHARE MODE:

http://dev.mysql.com/doc/mysql/en/InnoDB_locking_reads.html

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#1)
Re: Shared row locking

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

Using a B-tree

At transaction end, nothing special happens (tuples are not unlocked
explicitly).

I don't think that works, because there is no guarantee that an entry
will get cleaned out before the XID counter wraps around. Worst case,
you might think that a tuple is locked when the XID is left over from
the previous cycle. (Possibly this could be avoided by cleaning out old
XIDs in this table whenever we truncate pg_clog, but that seems a tad
messy.) I'm also a bit concerned about how we avoid table bloat if
there's no proactive cleanup at transaction end.

I think I like the pg_clog-modeled structure a bit better. However it
could be objected that that puts a hard limit of 4G share-locked tuples
at any one time.

In the clog-modeled idea, it wasn't real clear how you decide whether to
assign a new counter value to a previously locked row, or reuse its
previous counter. You must *not* assign a new value when the existing
entry still has bits set, but you probably do want to be aggressive
about assigning new values when you can; else it gets tough to be sure
that the log can be truncated in a reasonable time.
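The reuse-vs-new-counter rule Tom describes can be sketched as follows (names invented for illustration; `slots` stands in for the clog-like bitmap file):

```python
# Reuse the row's existing counter only while it still has bits set;
# otherwise aggressively assign a fresh value so old slots fall behind
# the truncation point and the log stays short.
def counter_for_row(existing_counter, slots, next_counter):
    if existing_counter is not None and slots.get(existing_counter, 0) != 0:
        return existing_counter, next_counter      # must reuse: lockers remain
    return next_counter + 1, next_counter + 1      # all clear: new counter

slots = {7: 0b0100}
assert counter_for_row(7, slots, 10) == (7, 10)    # bits still set: reuse
slots[7] = 0
assert counter_for_row(7, slots, 10) == (11, 11)   # all zero: assign new
```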

ISTM that your description is conflating several orthogonal issues:
how do we identify entries in this data structure (by CTID, or a shared
counter that increments each time a new lock is acquired); how do we
index the data structure (btree or linear array); and what is stored in
each entry (array of XIDs, or bitmap indexed by BackendId). Not all of
the eight combinations work, but we do have more alternatives than the
two offered, even without coming up with any new ideas ;-)
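The 2x2x2 design space Tom identifies can be enumerated directly; the two concrete proposals are just two of the eight points:

```python
from itertools import product

# Tom's three orthogonal axes for the lock data structure.
identify = ["CTID", "shared counter"]        # how entries are identified
index_by = ["btree", "linear array"]         # how the structure is indexed
payload  = ["XID array", "backend bitmap"]   # what each entry stores

combos = list(product(identify, index_by, payload))
assert len(combos) == 8
assert ("CTID", "btree", "XID array") in combos                         # Alvaro's
assert ("shared counter", "linear array", "backend bitmap") in combos   # Bruce's
```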

Note that to check whether a transaction is running we need to lock
SInvalLock. To minimize the time we hold it, we save the BackendId so
we don't have to scan the whole shmInvalBuffer->procState array, only
the item that we need to look at. Another possibility would be to use
stock TransactionIdIsInProgress and save the extra 4 bytes of storage.

I'm a bit worried about deadlocks and race conditions associated with
the conflict between locking a page of this data structure and locking
SInvalLock.

At server restart, the btree is created empty (or just deleted). There
is one btree per database.

One per cluster you meant, right? (Else we can't do locking of rows in
shared tables.)

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#2)
Re: Shared row locking

Bruce Momjian <pgman@candle.pha.pa.us> writes:

You mean all empty/zero rows can be removed? Can we guarantee that on
commit we can clean up the bitmap? If not the idea doesn't work.

For whatever data structure we use, we may reset the structure to empty
during backend-crash recovery. So your objection boils down to "what if
a backend exits normally but forgets to clean up its locks?" Assuming
that doesn't happen isn't any worse than assuming a backend will clean
up its shared memory state on non-crash exit, so I don't think it's a
serious concern.

That brings another thought: really what this is all about is working
around the fact that the standard lock manager can only cope with a
finite number of coexisting locks, because it's working in a fixed-size
shared memory arena. Maybe we should instead think about ways to allow
the existing lock table to spill to disk when it gets too big. That
would eliminate max_locks_per_transaction as a source of hard failures,
which would be a nice benefit.

regards, tom lane

#6Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#4)
Re: Shared row locking

Tom Lane wrote:

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

Using a B-tree

At transaction end, nothing special happens (tuples are not unlocked
explicitly).

I don't think that works, because there is no guarantee that an entry
will get cleaned out before the XID counter wraps around. Worst case,
you might think that a tuple is locked when the XID is left over from
the previous cycle. (Possibly this could be avoided by cleaning out old
XIDs in this table whenever we truncate pg_clog, but that seems a tad
messy.) I'm also a bit concerned about how we avoid table bloat if
there's no proactive cleanup at transaction end.

I think I like the pg_clog-modeled structure a bit better. However it
could be objected that that puts a hard limit of 4G share-locked tuples
at any one time.

In the clog-modeled idea, it wasn't real clear how you decide whether to
assign a new counter value to a previously locked row, or reuse its
previous counter. You must *not* assign a new value when the existing
entry still has bits set, but you probably do want to be aggressive
about assigning new values when you can; else it gets tough to be sure
that the log can be truncated in a reasonable time.

I assume you check and if all the bits are zero, you don't reuse it and
get a new counter. In fact you shouldn't reuse it in case the log is
being truncated while you are looking. :-)

ISTM that your description is conflating several orthogonal issues:
how do we identify entries in this data structure (by CTID, or a shared
counter that increments each time a new lock is acquired); how do we
index the data structure (btree or linear array); and what is stored in
each entry (array of XIDs, or bitmap indexed by BackendId). Not all of
the eight combinations work, but we do have more alternatives than the
two offered, even without coming up with any new ideas ;-)

True. The only advantage of a bitmap vs. just a counter of locking
backends is that you can clean out your own backend's bits from the table
without having to record them in your own memory. However, because
recording your own counters in local memory doesn't require fixed shared
memory, we might be better off just recording the shared lock indexes in
local backend memory and using an int4 counter in the pg_clog-like
file that we can decrement on backend commit. However, I am unclear
whether we can guarantee that an exiting backend will do that. Certainly
it is cleared on server start.

Note that to check whether a transaction is running we need to lock
SInvalLock. To minimize the time we hold it, we save the BackendId so
we don't have to scan the whole shmInvalBuffer->procState array, only
the item that we need to look at. Another possibility would be to use
stock TransactionIdIsInProgress and save the extra 4 bytes of storage.

I'm a bit worried about deadlocks and race conditions associated with
the conflict between locking a page of this data structure and locking
SInvalLock.

At server restart, the btree is created empty (or just deleted). There
is one btree per database.

One per cluster you meant, right? (Else we can't do locking of rows in
shared tables.)

He meant one per database, I think. I suppose we would need another one
for global tables or disallow shared locking of them.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#7Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#5)
Re: Shared row locking

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

You mean all empty/zero rows can be removed? Can we guarantee that on
commit we can clean up the bitmap? If not the idea doesn't work.

For whatever data structure we use, we may reset the structure to empty
during backend-crash recovery. So your objection boils down to "what if
a backend exits normally but forgets to clean up its locks?" Assuming
that doesn't happen isn't any worse than assuming a backend will clean
up its shared memory state on non-crash exit, so I don't think it's a
serious concern.

That brings another thought: really what this is all about is working
around the fact that the standard lock manager can only cope with a
finite number of coexisting locks, because it's working in a fixed-size
shared memory arena. Maybe we should instead think about ways to allow
the existing lock table to spill to disk when it gets too big. That
would eliminate max_locks_per_transaction as a source of hard failures,
which would be a nice benefit.

Agreed. One concern I have about allowing the lock table to spill to
disk is that a large number of FOR UPDATE locks could push out lock
entries used by other backends, causing very poor performance.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#8Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#7)
Re: Shared row locking

On Sun, 2004-12-19 at 04:04, Bruce Momjian wrote:

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

You mean all empty/zero rows can be removed? Can we guarantee that on
commit we can clean up the bitmap? If not the idea doesn't work.

For whatever data structure we use, we may reset the structure to empty
during backend-crash recovery. So your objection boils down to "what if
a backend exits normally but forgets to clean up its locks?" Assuming
that doesn't happen isn't any worse than assuming a backend will clean
up its shared memory state on non-crash exit, so I don't think it's a
serious concern.

That brings another thought: really what this is all about is working
around the fact that the standard lock manager can only cope with a
finite number of coexisting locks, because it's working in a fixed-size
shared memory arena. Maybe we should instead think about ways to allow
the existing lock table to spill to disk when it gets too big. That
would eliminate max_locks_per_transaction as a source of hard failures,
which would be a nice benefit.

Agreed. One concern I have about allowing the lock table to spill to
disk is that a large number of FOR UPDATE locks could push out lock
entries used by other backends, causing very poor performance.

In similar circumstances, DB2 uses these techniques:

- when locktable X % full, then escalate locks to full table locks: both
locktable memory and threshold% are instance parameters

- use a lock mode called Cursor Stability that locks only those rows
currently being examined by a cursor, thus maintaining the lock usage
of a cursor at a constant level as the cursor moves. The lock mode of
Repeatable Read *does* lock all rows read

(these are not actually mutually exclusive)

The first one is a real pain, but the idea might be of use somewhere.

The second is a usable, practical alternative that should be considered
and might avoid the need to write the spill-to-disk code and then
discover it performs very badly.

--
Best Regards, Simon Riggs

#9Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Simon Riggs (#8)
Re: Shared row locking

On Sun, Dec 19, 2004 at 09:52:01AM +0000, Simon Riggs wrote:

Simon,

In similar circumstances, DB2 uses these techniques:

- when locktable X % full, then escalate locks to full table locks: both
locktable memory and threshold% are instance parameters

This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking. Keep in mind that this is for foreign key locking,
which is one area where deadlocks are frequently encountered because we
use too strong a lock.

- use a lock mode called Cursor Stability that locks only those rows
currently being examined by a cursor, thus maintaining the lock usage
of a cursor at a constant level as the cursor moves. The lock mode of
Repeatable Read *does* lock all rows read

We can't release the locks until the end of the transaction.

The second is a usable, practical alternative that should be considered
and might avoid the need to write the spill-to-disk code and then
discover it performs very badly.

I don't think any of them serves the objective I'm trying to accomplish
:-(

Thanks for your ideas anyway. And keep having them!

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"Uno puede defenderse de los ataques; contra los elogios se esta indefenso"

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#9)
Re: Shared row locking

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

On Sun, Dec 19, 2004 at 09:52:01AM +0000, Simon Riggs wrote:

- use a lock mode called Cursor Stability that locks only those rows
currently being examined by a cursor, thus maintaining the lock usage
of a cursor at a constant level as the cursor moves. The lock mode of
Repeatable Read *does* lock all rows read

We can't release the locks until the end of the transaction.

Well, what Simon is essentially proposing is that we work around a
potential performance issue by exposing a less-clean semantic model
to users. I'd prefer to adopt such a thing as a last resort, not a
first one.

regards, tom lane

#11Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#9)
Re: Shared row locking

On Sun, 19 Dec 2004, Alvaro Herrera wrote:

On Sun, Dec 19, 2004 at 09:52:01AM +0000, Simon Riggs wrote:

Simon,

In similar circumstances, DB2 uses these techniques:

- when locktable X % full, then escalate locks to full table locks: both
locktable memory and threshold% are instance parameters

This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking. Keep in mind that this is for foreign key locking,
which is one area where deadlocks are frequently encountered because we
use too strong a lock.

Actually it might help in some scenarios. Remember, we're not talking
about upgrading shared locks to exclusive locks. We're only talking about
locking more rows than necessary (all rows).

I believe DB2 also does the escalation for performance. Getting a full
table lock is cheaper than locking every row individually.

- Heikki

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#11)
Re: Shared row locking

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On Sun, 19 Dec 2004, Alvaro Herrera wrote:

This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking.

Actually it might help in some scenarios. Remember, we're not talking
about upgrading shared locks to exclusive locks. We're only talking about
locking more rows than necessary (all rows).

Nonetheless, it would mean that locks would be taken depending on
implementation-dependent, not-visible-to-the-user considerations.
Shared locks can still cause deadlocks, and so you would have an
unreliable application, which would only be unreliable under load.

As I said in connection with the other proposal, weird user-visible
semantics should be the last resort not the first.

regards, tom lane

#13Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#12)
Re: Shared row locking

On Sun, 19 Dec 2004, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On Sun, 19 Dec 2004, Alvaro Herrera wrote:

This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking.

Actually it might help in some scenarios. Remember, we're not talking
about upgrading shared locks to exclusive locks. We're only talking about
locking more rows than necessary (all rows).

Nonetheless, it would mean that locks would be taken depending on
implementation-dependent, not-visible-to-the-user considerations.
Shared locks can still cause deadlocks, and so you would have an
unreliable application, which would only be unreliable under load.

As I said in connection with the other proposal, weird user-visible
semantics should be the last resort not the first.

I agree that lock escalation is not a good solution; we run into problems
with DB2 lock escalation at work all the time.

- Heikki

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#9)
Re: Shared row locking

On Sun, 2004-12-19 at 18:31, Alvaro Herrera wrote:

Thanks for your ideas anyway. And keep having them!

No problem. Just giving some info on what works and doesn't work in
other implementations.

--
Best Regards, Simon Riggs

#15Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#12)
Re: Shared row locking

Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On Sun, 19 Dec 2004, Alvaro Herrera wrote:

This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking.

Actually it might help in some scenarios. Remember, we're not talking
about upgrading shared locks to exclusive locks. We're only talking about
locking more rows than necessary (all rows).

Nonetheless, it would mean that locks would be taken depending on
implementation-dependent, not-visible-to-the-user considerations.
Shared locks can still cause deadlocks, and so you would have an
unreliable application, which would only be unreliable under load.

As I said in connection with the other proposal, weird user-visible
semantics should be the last resort not the first.

One idea is to stamp the same shared lock counter on multiple rows and
just prevent another application from also sharing that lock if too many
rows are locked. It would wait for the lock to be released. This
prevents the problem of having to store thousands of shared lock
counters. There would have to be some flag to say which shared counters
are only for a single backend.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#16Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Heikki Linnakangas (#13)
Re: Shared row locking

On Sun, Dec 19, 2004 at 11:35:02PM +0200, Heikki Linnakangas wrote:

On Sun, 19 Dec 2004, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On Sun, 19 Dec 2004, Alvaro Herrera wrote:

This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking.

Actually it might help in some scenarios. Remember, we're not talking
about upgrading shared locks to exclusive locks. We're only talking about
locking more rows than necessary (all rows).

Nonetheless, it would mean that locks would be taken depending on
implementation-dependent, not-visible-to-the-user considerations.
Shared locks can still cause deadlocks, and so you would have an
unreliable application, which would only be unreliable under load.

As I said in connection with the other proposal, weird user-visible
semantics should be the last resort not the first.

I agree that lock escalation is not a good solution; we run into problems
with DB2 lock escalation at work all the time.

Does anyone know how Oracle deals with this? They use MVCC like
PostgreSQL, so they'd be a better source for inspiration.
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#17Gavin Sherry
swm@linuxworld.com.au
In reply to: Bruce Momjian (#7)
Re: Shared row locking

On Sat, 18 Dec 2004, Bruce Momjian wrote:

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

You mean all empty/zero rows can be removed? Can we guarantee that on
commit we can clean up the bitmap? If not the idea doesn't work.

For whatever data structure we use, we may reset the structure to empty
during backend-crash recovery. So your objection boils down to "what if
a backend exits normally but forgets to clean up its locks?" Assuming
that doesn't happen isn't any worse than assuming a backend will clean
up its shared memory state on non-crash exit, so I don't think it's a
serious concern.

That brings another thought: really what this is all about is working
around the fact that the standard lock manager can only cope with a
finite number of coexisting locks, because it's working in a fixed-size
shared memory arena. Maybe we should instead think about ways to allow
the existing lock table to spill to disk when it gets too big. That
would eliminate max_locks_per_transaction as a source of hard failures,
which would be a nice benefit.

Agreed. One concern I have about allowing the lock table to spill to
disk is that a large number of FOR UPDATE locks could push out lock
entries used by other backends, causing very poor performance.

I think if we allow the lock manager to spill to disk (and I think we do
need to allow it) then we should also be able to control the amount of
shared memory allocated. There's little point spilling the lock table to
disk if we have huge amounts of memory.

Gavin

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Jim Nasby (#16)
Re: Shared row locking

On Mon, 2004-12-20 at 06:34, Jim C. Nasby wrote:

On Sun, Dec 19, 2004 at 11:35:02PM +0200, Heikki Linnakangas wrote:

On Sun, 19 Dec 2004, Tom Lane wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On Sun, 19 Dec 2004, Alvaro Herrera wrote:

This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking.

Actually it might help in some scenarios. Remember, we're not talking
about upgrading shared locks to exclusive locks. We're only talking about
locking more rows than necessary (all rows).

Nonetheless, it would mean that locks would be taken depending on
implementation-dependent, not-visible-to-the-user considerations.
Shared locks can still cause deadlocks, and so you would have an
unreliable application, which would only be unreliable under load.

As I said in connection with the other proposal, weird user-visible
semantics should be the last resort not the first.

I agree that lock escalation is not a good solution; we run into problems
with DB2 lock escalation at work all the time.

Does anyone know how Oracle deals with this? They use MVCC like
PostgreSQL, so they'd be a better source for inspiration.

Oracle only uses MVCC in its widest sense - versioning info is stored in
UNDO tablespaces (rollback segments). That implementation is covered by
aggressive patent attorneys.

Oracle implements locking at row level within each data block. The block
header expands dynamically to accommodate a list of the transactions
accessing it, with minimum and maximum sizes settable by the DBA. This
works reasonably well.

Each SELECT FOR UPDATE is actually a block write, whether or not the
rows are modified, and requires some additional code to recover from this
without crashing/redo. Later transactions end up cleaning up the lock
header info (which later became a problem in Parallel Server).

https://cwisdb.cc.kuleuven.ac.be/ora10doc/server.101/b10743/consist.htm

--
Best Regards, Simon Riggs

#19Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Gavin Sherry (#17)
Re: Shared row locking

On Mon, Dec 20, 2004 at 06:23:24PM +1100, Gavin Sherry wrote:

On Sat, 18 Dec 2004, Bruce Momjian wrote:

Agreed. One concern I have about allowing the lock table to spill to
disk is that a large number of FOR UPDATE locks could push out lock
entries used by other backends, causing very poor performance.

I think if we allow the lock manager to spill to disk (and I think we do
need to allow it) then we should also be able to control the amount of
shared memory allocated. There's little point spilling the lock table to
disk if we have huge amounts of memory.

This is an interesting idea.

Gavin also mentioned to me that we should also control the amount of
memory the shared inval queue uses. Causing all backends to refill their
caches is (I assume) a big performance hit.

Maybe we should expose these via new knobs in postgresql.conf, to ease
implementation; or maybe not, to spare users from configuring them.

As a start, we could have WARNINGs when the lock table is spilled and
when a SInvalReset occurs. So the user can know whether he should
increase memory use for those settings.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end." (2nd Commandment for C programmers)

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Sherry (#17)
Re: Shared row locking

Gavin Sherry <swm@linuxworld.com.au> writes:

I think if we allow the lock manager to spill to disk (and I think we do
need to allow it) then we should also be able to control the amount of
shared memory allocated.

You mean like max_locks_per_transaction?

regards, tom lane

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#19)
#22Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Tom Lane (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#22)
#24Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#24)
#26Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#25)
#27Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Simon Riggs (#26)
#28Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Alvaro Herrera (#24)
#29Manfred Koizar
mkoi-pg@aon.at
In reply to: Alvaro Herrera (#1)
#30Manfred Koizar
mkoi-pg@aon.at
In reply to: Simon Riggs (#26)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#30)
#32Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#32)
#34Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#34)