'tuple concurrently updated' error for alter role ... set
Hello,
We have recently observed a problem with concurrent execution of ALTER ROLE ... SET in several sessions. It's similar to the one addressed by
http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=fbcf4b92aa64d4577bcf25925b055316b978744a
The result is a 'tuple concurrently updated' error message, and the problem is easily reproducible:
CREATE SCHEMA test;
CREATE SCHEMA test2;
CREATE ROLE testrole;
session 1:
while [ 1 ]; do psql postgres -c 'alter role testrole set search_path=test2';done
session 2:
while [ 1 ]; do psql postgres -c 'alter role testrole set search_path=test';done
The error message appears almost immediately on my system.
After digging in the code, I've found that a RowExclusiveLock is acquired on the pg_db_role_setting table in AlterSetting(). While the name of the lock suggests that it should conflict with itself, it doesn't. After I replaced the lock in question with ShareUpdateExclusiveLock, the problem disappeared. Attached is a simple patch with these changes.
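For reference, the heart of the change is just raising the lock level passed to heap_open() for that catalog; a minimal sketch of the idea (hypothetical context, assuming the call sites in src/backend/catalog/pg_db_role_setting.c look like this, rather than quoting the patch itself):

-    rel = heap_open(DbRoleSettingRelationId, RowExclusiveLock);
+    rel = heap_open(DbRoleSettingRelationId, ShareUpdateExclusiveLock);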
Regards,
--
Alexey Klyukin
The PostgreSQL Company - Command Prompt, Inc.
Attachments:
Attachment: db_role_setting.diff (application/octet-stream, +6/-6)
Alexey Klyukin <alexk@commandprompt.com> writes:
After digging in the code, I've found that a RowExclusiveLock is acquired on the pg_db_role_setting table in AlterSetting(). While the name of the lock suggests that it should conflict with itself, it doesn't. After I replaced the lock in question with ShareUpdateExclusiveLock, the problem disappeared. Attached is a simple patch with these changes.
We're not likely to do that, first because it's randomly different from
the handling of every other system catalog update, and second because it
would serialize all updates on this catalog, and probably create
deadlock cases that don't exist now. (BTW, as the patch is given I'd
expect it to still fail, though perhaps with lower probability than
before. For this to actually stop all such cases, you'd have to hold
the lock till commit, which greatly increases the risks of deadlock.)
I see no particular reason why conflicting updates like those *shouldn't*
be expected to fail occasionally.
regards, tom lane
On Thu, May 12, 2011 at 6:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexey Klyukin <alexk@commandprompt.com> writes:
After digging in the code, I've found that a RowExclusiveLock is acquired on the pg_db_role_setting table in AlterSetting(). While the name of the lock suggests that it should conflict with itself, it doesn't. After I replaced the lock in question with ShareUpdateExclusiveLock, the problem disappeared. Attached is a simple patch with these changes.
We're not likely to do that, first because it's randomly different from
the handling of every other system catalog update,
We have very robust locking of this type for table-related DDL
operations and just about none for anything else. I don't consider
the latter to be a feature.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, May 12, 2011 at 6:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
We're not likely to do that, first because it's randomly different from
the handling of every other system catalog update,
We have very robust locking of this type for table-related DDL
operations and just about none for anything else. I don't consider
the latter to be a feature.
I didn't say it was ;-). What I *am* saying is that if we're going to
do anything about this sort of problem, there needs to be a
well-considered system-wide plan. Arbitrarily changing the locking
rules for individual operations is not going to make things better,
and taking exclusive locks on whole catalogs is definitely not going to
make things better.
regards, tom lane
On May 13, 2011, at 1:28 AM, Tom Lane wrote:
Alexey Klyukin <alexk@commandprompt.com> writes:
After digging in the code, I've found that a RowExclusiveLock is acquired on the pg_db_role_setting table in AlterSetting(). While the name of the lock suggests that it should conflict with itself, it doesn't. After I replaced the lock in question with ShareUpdateExclusiveLock, the problem disappeared. Attached is a simple patch with these changes.
We're not likely to do that, first because it's randomly different from
the handling of every other system catalog update, and second because it
would serialize all updates on this catalog, and probably create
deadlock cases that don't exist now. (BTW, as the patch is given I'd
expect it to still fail, though perhaps with lower probability than
before. For this to actually stop all such cases, you'd have to hold
the lock till commit, which greatly increases the risks of deadlock.)
Fair enough. I think AlterSetting() holds the lock till commit (it does
heap_close with NoLock). DropSetting() doesn't do this, though.
I see no particular reason why conflicting updates like those *shouldn't*
be expected to fail occasionally.
Excellent question; I don't have enough context to properly answer that (other
than a guess that an unexpected transaction rollback is too unexpected :))
Let me ask the customer first.
--
Alexey Klyukin
The PostgreSQL Company - Command Prompt, Inc.
On Thu, May 12, 2011 at 6:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, May 12, 2011 at 6:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
We're not likely to do that, first because it's randomly different from
the handling of every other system catalog update,
We have very robust locking of this type for table-related DDL
operations and just about none for anything else. I don't consider
the latter to be a feature.
I didn't say it was ;-). What I *am* saying is that if we're going to
do anything about this sort of problem, there needs to be a
well-considered system-wide plan. Arbitrarily changing the locking
rules for individual operations is not going to make things better,
and taking exclusive locks on whole catalogs is definitely not going to
make things better.
Yes; true. I'm inclined to say that this is a bug, but not one we're
going to fix before 9.2. I think it might be about time to get
serious about making an effort to sprinkle the code with a few more
LockDatabaseObject() and LockSharedObject() calls.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, May 12, 2011 at 6:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I didn't say it was ;-). What I *am* saying is that if we're going to
do anything about this sort of problem, there needs to be a
well-considered system-wide plan. Arbitrarily changing the locking
rules for individual operations is not going to make things better,
and taking exclusive locks on whole catalogs is definitely not going to
make things better.
Yes; true. I'm inclined to say that this is a bug, but not one we're
going to fix before 9.2. I think it might be about time to get
serious about making an effort to sprinkle the code with a few more
LockDatabaseObject() and LockSharedObject() calls.
Yeah. That doesn't rise to the level of a "well-considered plan", but
I believe that we could develop a plan around that concept, ie, take a
lock associated with the individual object we are about to operate on.
BTW, I thought a bit more about why I didn't like the initial proposal
in this thread, and the basic objection is this: the AccessShareLock or
RowExclusiveLock we take on the catalog is not meant to provide any
serialization of operations on individual objects within the catalog.
What it's there for is to interlock against operations that are
operating on the catalog as a table, such as VACUUM FULL (which has to
lock out all accesses to the catalog) or REINDEX (which has to lock out
updates). So the catalog-level lock is the right thing and shouldn't be
changed. If we want to interlock updates of individual objects then we
need a different locking concept for that.
regards, tom lane
On Fri, May 13, 2011 at 12:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, I thought a bit more about why I didn't like the initial proposal
in this thread, and the basic objection is this: the AccessShareLock or
RowExclusiveLock we take on the catalog is not meant to provide any
serialization of operations on individual objects within the catalog.
What it's there for is to interlock against operations that are
operating on the catalog as a table, such as VACUUM FULL (which has to
lock out all accesses to the catalog) or REINDEX (which has to lock out
updates). So the catalog-level lock is the right thing and shouldn't be
changed. If we want to interlock updates of individual objects then we
need a different locking concept for that.
Right, I agree. Fortunately, we don't have to invent a new one.
There is already locking being done exactly along these lines for
DROP, COMMENT, and SECURITY LABEL (which is important, because
otherwise we could leave behind orphaned security labels that would be
inherited by a later object with the same OID, leading to a security
problem). I think it would be sensible, and quite simple, to extend
that to other DDL operations.
I think that we probably *don't* want to lock non-table objects when
they are just being *used*. We do that for tables (to lock against
concurrent drop operations) and in some workloads it becomes a severe
bottleneck. Doing it for functions and operators would make the
problem far worse, for no particular benefit. Unlike tables, there is
no underlying relation file to worry about, so the worst thing that
happens is someone continues to use a dropped object slightly after
it's gone, or the old definition of an object that's been modified.
Actually, it's occurred to me from time to time that it would be nice
to eliminate ACCESS SHARE (and while I'm dreaming, maybe ROW SHARE and
ROW EXCLUSIVE) locks for tables as well. Under normal operating
conditions (i.e. no DDL running), these locks generate a huge amount
of lock manager traffic even though none of the locks conflict with
each other. Unfortunately, I don't really see a way to make this
work. But maybe it would at least be possible to create some sort of
fast path. For example, suppose every backend opens a file and uses
that file to record lock tags for the objects on which it is taking
"weak" (ACCESS SHARE/ROW SHARE/ROW EXCLUSIVE) locks on. Before taking
a "strong" lock (anything that conflicts with one of those lock
types), the exclusive locker is required to open all of those files
and transfer the locks into the lock manager proper. Of course, it's
also necessary to nail down the other direction: you have to have some
way of making sure that the backend can't record in its local file a
lock that would have conflicted had it been taken in the actual lock
manager. But maybe there's some lightweight way we could detect that,
as well. For example, we could keep, say, a 1K array in shared
memory, representing a 1024-way partitioning of the locktag space.
Each byte is 1 if there are any "strong" locks on objects with that
locktag in the lock manager, and 0 if there are none (or maybe you
need a 4K array with exact counts, for bookkeeping). When a backend
wants to take a "weak" lock, it checks the array: if it finds a 0 then
it just records the lock in its file; otherwise, it goes through the
lock manager. When a backend wants a "strong" lock, it first sets the
byte (or bumps the count) in the array, then transfers any existing
weak locks from individual backends to the lock manager, then tries to
get its own lock. Possibly the array operations could be done with
memory synchronization primitives rather than spinlocks, especially on
architectures that support an atomic fetch-and-add. Of course I don't
know quite how we recover if we try to do one of these "lock
transfers" and run out of shared memory... and overall I'm hand-waving
here quite a bit, but in theory it seems like we ought to be able to
rejigger this locking so that we reduce the cost of obtaining a "weak"
lock, perhaps at the expense of making it more expensive to obtain a
"strong" lock, which are relatively rare by comparison.
<end of rambling digression>
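To make the shape of that concrete, here is a minimal standalone C sketch of the partitioned strong-lock indicator. Everything in it is an illustrative assumption rather than a description of an actual implementation: the names are made up, counts are used instead of flags, and process-local C11 atomics stand in for an array that would really live in shared memory.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define STRONG_LOCK_PARTITIONS 1024     /* 1024-way split of the locktag space */

/* One counter per partition; in the real thing this would be in shared
 * memory, not process-local storage. */
static _Atomic uint32_t strong_lock_counts[STRONG_LOCK_PARTITIONS];

static inline uint32_t
strong_lock_partition(uint32_t locktag_hash)
{
    return locktag_hash & (STRONG_LOCK_PARTITIONS - 1);
}

/* Weak-lock path: if no strong lock exists anywhere in our partition,
 * record the lock privately and skip the shared lock manager entirely. */
static bool
fast_path_try_weak_lock(uint32_t locktag_hash)
{
    if (atomic_load(&strong_lock_counts[strong_lock_partition(locktag_hash)]) == 0)
    {
        /* record the locktag in this backend's private fast-path table
         * (not shown) */
        return true;
    }
    return false;               /* caller falls back to the regular path */
}

/* Strong-lock path: advertise our intent first, then transfer any
 * fast-path entries other backends hold for this partition into the
 * main lock table before acquiring our own lock normally (both steps
 * not shown). */
static void
strong_lock_begin(uint32_t locktag_hash)
{
    atomic_fetch_add(&strong_lock_counts[strong_lock_partition(locktag_hash)], 1);
}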
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Is this a TODO? I don't see it on the TODO list.
---------------------------------------------------------------------------
Robert Haas wrote:
[...]
Actually, it's occurred to me from time to time that it would be nice
to eliminate ACCESS SHARE (and while I'm dreaming, maybe ROW SHARE and
ROW EXCLUSIVE) locks for tables as well. [...] in theory it seems like we
ought to be able to rejigger this locking so that we reduce the cost of
obtaining a "weak" lock, perhaps at the expense of making it more expensive
to obtain a "strong" lock, which are relatively rare by comparison.
<end of rambling digression>
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Fri, May 13, 2011 at 09:07:34AM -0400, Robert Haas wrote:
Actually, it's occurred to me from time to time that it would be nice
to eliminate ACCESS SHARE (and while I'm dreaming, maybe ROW SHARE and
ROW EXCLUSIVE) locks for tables as well. [...] in theory it seems like we
ought to be able to rejigger this locking so that we reduce the cost of
obtaining a "weak" lock, perhaps at the expense of making it more expensive
to obtain a "strong" lock, which are relatively rare by comparison.
<end of rambling digression>
The key is putting a rapid hard stop to all fast-path lock acquisitions and
then reconstructing a valid global picture of the affected lock table regions.
Your 1024-way table of strong lock counts sounds promising. (Offhand, I do
think they would need to be counts, not just flags: with flags, the first of
two overlapping strong lockers to release would clear the indicator while the
other still held its lock.)
If I'm understanding correctly, your pseudocode would look roughly like this:
if (level >= ShareUpdateExclusiveLock)
    ++strong_lock_counts[my_strong_lock_count_partition]
    sfence
    if (strong_lock_counts[my_strong_lock_count_partition] == 1)
        /* marker 1 */
        import_all_local_locks
    normal_LockAcquireEx
else if (level <= RowExclusiveLock)
    lfence
    if (strong_lock_counts[my_strong_lock_count_partition] == 0)
        /* marker 2 */
        local_only
        /* marker 3 */
    else
        normal_LockAcquireEx
else
    normal_LockAcquireEx
At marker 1, we need to block until no code is running between markers two and
three. You could do that with a per-backend lock (LW_SHARED by the strong
locker, LW_EXCLUSIVE by the backend). That would probably still be a win over
the current situation, but it would be nice to have something even cheaper.
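A tiny sketch of that handshake, using a POSIX rwlock purely as a stand-in for a per-backend LWLock (all names here are made up for illustration):

#include <pthread.h>

/* One of these per backend; the real thing would be an LWLock in
 * shared memory. */
static pthread_rwlock_t backend_fp_lock = PTHREAD_RWLOCK_INITIALIZER;

/* The backend wraps markers 2..3 in its own lock, taken exclusive. */
void
fast_path_window(void)
{
    pthread_rwlock_wrlock(&backend_fp_lock);    /* LW_EXCLUSIVE */
    /* marker 2: check strong_lock_counts, record local-only lock */
    pthread_rwlock_unlock(&backend_fp_lock);    /* marker 3 */
}

/* The strong locker, at marker 1, takes each backend's lock shared;
 * acquiring and releasing it proves that backend is not currently
 * between markers 2 and 3. */
void
strong_locker_wait_for_backend(void)
{
    pthread_rwlock_rdlock(&backend_fp_lock);    /* LW_SHARED */
    pthread_rwlock_unlock(&backend_fp_lock);
}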
Then you have the actual procedure for transfer of local locks to the global
lock manager. Using record locks in a file could work, but that's a system call
per lock acquisition regardless of the presence of strong locks. Is that cost
sufficiently trivial? I wonder if, instead, we could signal all backends at
marker 1 to dump the applicable parts of their local (memory) lock tables to
files. Or to another shared memory region, if that didn't mean statically
allocating the largest possible required amount. If we were willing to wait
until all backends reach a CHECK_FOR_INTERRUPTS, they could instead make the
global insertions directly. That might yield a decent amount of bug swatting to
fill in missing CHECK_FOR_INTERRUPTS, though.
Improvements in this area would also have good synergy with efforts to reduce
the global impact of temporary table usage. CREATE TEMP TABLE can be the
major source of AccessExclusiveLock acquisitions. However, with the strong
lock indicator partitioned 1024 ways or more, that shouldn't be a deal killer.
nm
On May 13, 2011, at 2:07 AM, Alexey Klyukin wrote:
On May 13, 2011, at 1:28 AM, Tom Lane wrote:
We're not likely to do that, first because it's randomly different from
the handling of every other system catalog update, and second because it
would serialize all updates on this catalog, and probably create
deadlock cases that don't exist now. (BTW, as the patch is given I'd
expect it to still fail, though perhaps with lower probability than
before. For this to actually stop all such cases, you'd have to hold
the lock till commit, which greatly increases the risks of deadlock.)
....
I see no particular reason why conflicting updates like those *shouldn't*
be expected to fail occasionally.
Excellent question; I don't have enough context to properly answer that (other
than a guess that an unexpected transaction rollback is too unexpected :))
Let me ask the customer first.
The original use case is sporadic failures of some internal unit tests due
to the error message in the subject.
--
Alexey Klyukin
The PostgreSQL Company - Command Prompt, Inc.
On Fri, May 13, 2011 at 4:16 PM, Noah Misch <noah@leadboat.com> wrote:
The key is putting a rapid hard stop to all fast-path lock acquisitions and
then reconstructing a valid global picture of the affected lock table regions.
Your 1024-way table of strong lock counts sounds promising. (Offhand, I do
think they would need to be counts, not just flags.)
If I'm understanding correctly, your pseudocode would look roughly like this:
if (level >= ShareUpdateExclusiveLock)
    ++strong_lock_counts[my_strong_lock_count_partition]
    sfence
    if (strong_lock_counts[my_strong_lock_count_partition] == 1)
        /* marker 1 */
        import_all_local_locks
    normal_LockAcquireEx
else if (level <= RowExclusiveLock)
    lfence
    if (strong_lock_counts[my_strong_lock_count_partition] == 0)
        /* marker 2 */
        local_only
        /* marker 3 */
    else
        normal_LockAcquireEx
else
    normal_LockAcquireEx
I think ShareUpdateExclusiveLock should be treated as neither weak nor
strong. It certainly can't be treated as weak - i.e. use the fast
path - because it's self-conflicting. It could be treated as strong,
but since it doesn't conflict with any of the weak lock types, that
would only serve to prevent fast-path lock acquisitions that otherwise
could have succeeded. In particular, it would unnecessarily disable
fast-path lock acquisition for any relation being vacuumed, which
could be really ugly considering that one of the main workloads that
would benefit from something like this is the case where lots of
backends are fighting over a lock manager partition lock on a table
they all want to read and/or modify. I think it's best for
ShareUpdateExclusiveLock to always use the regular lock-acquisition
path, but it need not worry about incrementing strong_lock_counts[] or
importing local locks in so doing.
Also, I think in the step just after marker one, we'd import only
local locks whose lock tags were equal to the lock tag on which we
were attempting to acquire a strong lock. The downside of this whole
approach is that acquiring a strong lock becomes, at least
potentially, a lot slower, because you have to scan through the whole
backend array looking for fast-path locks to import (let's not use the
term "local lock", which is already in use within the lock manager
code). But maybe that can be optimized enough not to matter. After
all, if the lock manager scaled perfectly at high concurrency, we
wouldn't be thinking about this in the first place.
At marker 1, we need to block until no code is running between markers two and
three. You could do that with a per-backend lock (LW_SHARED by the strong
locker, LW_EXCLUSIVE by the backend). That would probably still be a win over
the current situation, but it would be nice to have something even cheaper.
I don't have a better idea than to use an LWLock. I have a patch
floating around to speed up our LWLock implementation, but I haven't
got a workload where the bottleneck is the actual speed of operation
of the LWLock rather than the fact that it's contended in the first
place. And the whole point of this would be to arrange things so that
the LWLocks are uncontended nearly all the time.
Then you have the actual procedure for transfer of local locks to the global
lock manager. Using record locks in a file could work, but that's a system call
per lock acquisition regardless of the presence of strong locks. Is that cost
sufficiently trivial?
No, I don't think we want to go into kernel space. When I spoke of
using a file, I was imagining it as an mmap'd region that one backend
could write to which, at need, another backend could mmap and grovel
through. Another (perhaps simpler) option would be to just put it in
shared memory. That doesn't give you as much flexibility in terms of
expanding the segment, but it would be reasonable to allow space for
only, dunno, say 32 locks per backend in shared memory; if you need
more than that, you flush them all to the main lock table and start
over. You could possibly even just make this a hack for the
particular special case where we're taking a relation lock on a
non-shared relation; then you'd need only 128 bytes for a 32-entry
array, plus the LWLock (I think the database ID is already visible in
shared memory).
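A rough sketch of what such a per-backend shared-memory structure could look like for the non-shared-relation special case (struct and field names are hypothetical; the per-backend LWLock and database OID would come from existing backend state):

#include <stdint.h>

typedef uint32_t Oid;                   /* stand-in for PostgreSQL's Oid */

#define FP_LOCK_SLOTS_PER_BACKEND 32    /* beyond this, flush to the main table */

/* One of these per backend, in shared memory.  With 4-byte OIDs, the
 * relid array alone is the 128 bytes mentioned above; this plausible
 * layout also records the weak lock mode held for each slot. */
typedef struct FastPathLockSlots
{
    Oid     relid[FP_LOCK_SLOTS_PER_BACKEND];       /* 0 (InvalidOid) = free slot */
    uint8_t lockmode[FP_LOCK_SLOTS_PER_BACKEND];    /* AccessShare..RowExclusive */
} FastPathLockSlots;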
I wonder if, instead, we could signal all backends at
marker 1 to dump the applicable parts of their local (memory) lock tables to
files. Or to another shared memory region, if that didn't mean statically
allocating the largest possible required amount. If we were willing to wait
until all backends reach a CHECK_FOR_INTERRUPTS, they could instead make the
global insertions directly. That might yield a decent amount of bug swatting to
fill in missing CHECK_FOR_INTERRUPTS, though.
I've thought about this; I believe it's unworkable. If one backend
goes into the tank (think: SIGSTOP, or blocking on I/O to an
unreadable disk sector) this could lead to cascading failure.
Improvements in this area would also have good synergy with efforts to reduce
the global impact of temporary table usage. CREATE TEMP TABLE can be the
major source of AccessExclusiveLock acquisitions. However, with the strong
lock indicator partitioned 1024 ways or more, that shouldn't be a deal killer.
If that particular case is a problem for you, it seems like optimizing
away the exclusive lock altogether might be possible. No other
backend should be able to touch the table until the transaction
commits anyhow, and at that point we're going to release the lock.
There are possibly some sticky wickets here but it seems at least
worth thinking about.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, May 13, 2011 at 08:55:34PM -0400, Robert Haas wrote:
On Fri, May 13, 2011 at 4:16 PM, Noah Misch <noah@leadboat.com> wrote:
If I'm understanding correctly, your pseudocode would look roughly like this:
if (level >= ShareUpdateExclusiveLock)
I think ShareUpdateExclusiveLock should be treated as neither weak nor
strong.
Indeed; that should be ShareLock.
It certainly can't be treated as weak - i.e. use the fast
path - because it's self-conflicting. It could be treated as strong,
but since it doesn't conflict with any of the weak lock types, that
would only serve to prevent fast-path lock acquisitions that otherwise
could have succeeded. In particular, it would unnecessarily disable
fast-path lock acquisition for any relation being vacuumed, which
could be really ugly considering that one of the main workloads that
would benefit from something like this is the case where lots of
backends are fighting over a lock manager partition lock on a table
they all want to run read and/or modify. I think it's best for
ShareUpdateExclusiveLock to always use the regular lock-acquisition
path, but it need not worry about incrementing strong_lock_counts[] or
importing local locks in so doing.
Agreed.
Also, I think in the step just after marker one, we'd import only
local locks whose lock tags were equal to the lock tag on which we
were attempting to acquire a strong lock. The downside of this whole
approach is that acquiring a strong lock becomes, at least
potentially, a lot slower, because you have to scan through the whole
backend array looking for fast-path locks to import (let's not use the
term "local lock", which is already in use within the lock manager
code). But maybe that can be optimized enough not to matter. After
all, if the lock manager scaled perfectly at high concurrency, we
wouldn't be thinking about this in the first place.
Incidentally, I used the term "local lock" because I assumed fast-path locks
would still go through the lock manager far enough to populate the local lock
table. But there may be no reason to do so.
I wonder if, instead, we could signal all backends at
marker 1 to dump the applicable parts of their local (memory) lock tables to
files. Or to another shared memory region, if that didn't mean statically
allocating the largest possible required amount. If we were willing to wait
until all backends reach a CHECK_FOR_INTERRUPTS, they could instead make the
global insertions directly. That might yield a decent amount of bug swatting to
fill in missing CHECK_FOR_INTERRUPTS, though.
I've thought about this; I believe it's unworkable. If one backend
goes into the tank (think: SIGSTOP, or blocking on I/O to an
unreadable disk sector) this could lead to cascading failure.
True. It would need some fairly major advantages to justify that risk, and I
don't see any.
Overall, looks like a promising design sketch to me. Thanks.
nm
On Fri, May 13, 2011 at 11:05 PM, Noah Misch <noah@leadboat.com> wrote:
Incidentally, I used the term "local lock" because I assumed fast-path locks
would still go through the lock manager far enough to populate the local lock
table. But there may be no reason to do so.
Oh, good point. I think we probably WOULD need to update the local
lock hash table.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, May 13, 2011 at 11:05 PM, Noah Misch <noah@leadboat.com> wrote:
Incidentally, I used the term "local lock" because I assumed fast-path locks
would still go through the lock manager far enough to populate the local lock
table. But there may be no reason to do so.
Oh, good point. I think we probably WOULD need to update the local
lock hash table.
I haven't read this thread closely, but the general behavior of the
backend assumes that it's very very cheap to re-acquire a lock that's
already held by the current transaction. It's probably worth
maintaining a local counter just so you can keep that being true, even
if there were no other need for it. (Since I've not read the thread,
I'll refrain from asking how you're gonna clean up at transaction end
if there's no local memory of what locks you hold.)
regards, tom lane
On Fri, May 13, 2011 at 5:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, May 13, 2011 at 4:16 PM, Noah Misch <noah@leadboat.com> wrote:
I wonder if, instead, we could signal all backends at
marker 1 to dump the applicable parts of their local (memory) lock tables to
files. Or to another shared memory region, if that didn't mean statically
allocating the largest possible required amount. If we were willing to wait
until all backends reach a CHECK_FOR_INTERRUPTS, they could instead make the
global insertions directly. That might yield a decent amount of bug swatting to
fill in missing CHECK_FOR_INTERRUPTS, though.
I've thought about this; I believe it's unworkable. If one backend
goes into the tank (think: SIGSTOP, or blocking on I/O to an
unreadable disk sector) this could lead to cascading failure.
Would that risk be substantially worse than it currently is? If a
backend goes into the tank while holding access shared locks, it will
still block access exclusive locks until it recovers. And those
queued access exclusive locks will block new access shared locks from
other backends. How much is risk magnified by the new approach,
going from "any backend holding the lock is tanked" to "any process at
all is tanked"?
What I'd considered playing with in the past is having
LockMethodLocalHash hang on to an Access Shared lock even after
locallock->nLocks == 0, so that re-granting the lock would be a purely
local operation. Anyone wanting an Access Exclusive lock and not
immediately getting it would have to send out a plea (via sinval?) for
other processes to release their locallock->nLocks == 0 locks. But
this would suffer from the same problem of tanked processes.
Cheers,
Jeff
On Sat, May 14, 2011 at 1:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Would that risk be substantially worse than it currently is? If a
backend goes into the tank while holding access shared locks, it will
still block access exclusive locks until it recovers. And those
queued access exclusive locks will block new access shared locks from
other backends. How much is risk magnified by the new approach,
going from "any backend holding the lock is tanked" to "any process at
all is tanked"?
I think that's a pretty substantial increase in risk. Consider that
there may be 100 backends out there, one of which holds a relevant
lock. Needing to wait for all of them to do something instead of just
one is quite different.
Also, quite apart from the possibility of hanging altogether, the
latency would probably be increased quite a bit, and not in a very
predictable fashion.
I have the impression that most of the problem comes from fighting
over CPU cache lines. If that's correct, it may not be important to
avoid shared memory access per se; it may be good enough to arrange
things so that the shared memory which is accessed is *typically* not
being accessed by other backends.
What I'd considered playing with in the past is having
LockMethodLocalHash hang on to an Access Shared lock even after
locallock->nLocks == 0, so that re-granting the lock would be a purely
local operation. Anyone wanting an Access Exclusive lock and not
immediately getting it would have to send out a plea (via sinval?) for
other processes to release their locallock->nLocks == 0 locks. But
this would suffer from the same problem of tanked processes.
Yeah. I have thought about this, too, but as with Noah's suggestion,
I think this would make the risk of things hanging up substantially
worse than it is now. A backend that, under the present code,
wouldn't be holding an AccessShareLock at all, would now be holding
one that you'd have to convince it to release.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, May 13, 2011 at 4:16 PM, Noah Misch <noah@leadboat.com> wrote:
if (level >= ShareUpdateExclusiveLock)
    ++strong_lock_counts[my_strong_lock_count_partition]
    sfence
    if (strong_lock_counts[my_strong_lock_count_partition] == 1)
        /* marker 1 */
        import_all_local_locks
    normal_LockAcquireEx
else if (level <= RowExclusiveLock)
    lfence
    if (strong_lock_counts[my_strong_lock_count_partition] == 0)
        /* marker 2 */
        local_only
        /* marker 3 */
    else
        normal_LockAcquireEx
else
    normal_LockAcquireEx
At marker 1, we need to block until no code is running between markers two and
three. You could do that with a per-backend lock (LW_SHARED by the strong
locker, LW_EXCLUSIVE by the backend). That would probably still be a win over
the current situation, but it would be nice to have something even cheaper.
Barring some brilliant idea, or anyway for a first cut, it seems to me
that we can adjust the above pseudocode by assuming the use of a
LWLock. In addition, two other adjustments: first, the first line
should test level > ShareUpdateExclusiveLock, rather than >=, per
previous discussion. Second, import_all_local_locks needn't really
move everything; just those locks with a matching locktag. Thus:
if (level > ShareUpdateExclusiveLock)
    ++strong_lock_counts[my_strong_lock_count_partition]
    sfence
    for each backend
        take per-backend lwlock for target backend
        transfer fast-path entries with matching locktag
        release per-backend lwlock for target backend
    normal_LockAcquireEx
else if (level <= RowExclusiveLock)
    lfence
    if (strong_lock_counts[my_strong_lock_count_partition] == 0)
        take per-backend lwlock for own backend
        fast-path lock acquisition
        release per-backend lwlock for own backend
    else
        normal_LockAcquireEx
else
    normal_LockAcquireEx
Now, a small fly in the ointment is that we haven't got, with
PostgreSQL, a portable library of memory primitives. So there isn't
an obvious way of doing that sfence/lfence business. Now, it seems to
me that in the "strong lock" case, the sfence isn't really needed
anyway, because we're about to start acquiring and releasing an lwlock
for every backend, and that had better act as a full memory barrier
anyhow, or we're doomed. The "weak lock" case is more interesting,
because we need the fence before we've taken any LWLock.
But perhaps it'd be sufficient to just acquire the per-backend lwlock
before checking strong_lock_counts[]. If, as we hope, we get back a
zero, then we do the fast-path lock acquisition, release the lwlock,
and away we go. If we get back any other value, then we've wasted an
lwlock acquisition cycle. Or actually maybe not: it seems to me that
in that case we'd better transfer all of our fast-path entries into
the main hash table before trying to acquire any lock the slow way, at
least if we don't want the deadlock detector to have to know about the
fast-path. So then we get this:
if (level > ShareUpdateExclusiveLock)
    ++strong_lock_counts[my_strong_lock_count_partition]
    for each backend
        take per-backend lwlock for target backend
        transfer fastpath entries with matching locktag
        release per-backend lwlock for target backend
else if (level <= RowExclusiveLock)
    take per-backend lwlock for own backend
    if (strong_lock_counts[my_strong_lock_count_partition] == 0)
        fast-path lock acquisition
        done = true
    else
        transfer all fastpath entries
    release per-backend lwlock for own backend
if (!done)
    normal_LockAcquireEx
That seems like it ought to work, at least assuming the position of
your fencing instructions was correct in the first place. But there's
one big problem to worry about: what happens if the lock transfer
fails due to shared memory exhaustion? It's not so bad in the "weak
lock" case; it'll feel just like the already-existing case where you
try to push another lock into the shared-memory hash table and there's
no room. Essentially you've been living on borrowed time anyway. On
the other hand, the "strong lock" case is a real problem, because a
large number of granted fast-path locks can effectively DOS any strong
locker, even one that wouldn't have conflicted with them. That's
clearly not going to fly, but it's not clear to me what the best way
is to patch around it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 23, 2011 at 09:15:27PM -0400, Robert Haas wrote:
On Fri, May 13, 2011 at 4:16 PM, Noah Misch <noah@leadboat.com> wrote:
if (level >= ShareUpdateExclusiveLock)
    ++strong_lock_counts[my_strong_lock_count_partition]
    sfence
    if (strong_lock_counts[my_strong_lock_count_partition] == 1)
        /* marker 1 */
        import_all_local_locks
    normal_LockAcquireEx
else if (level <= RowExclusiveLock)
    lfence
    if (strong_lock_counts[my_strong_lock_count_partition] == 0)
        /* marker 2 */
        local_only
        /* marker 3 */
    else
        normal_LockAcquireEx
else
    normal_LockAcquireEx
At marker 1, we need to block until no code is running between markers two and
three. You could do that with a per-backend lock (LW_SHARED by the strong
locker, LW_EXCLUSIVE by the backend). That would probably still be a win over
the current situation, but it would be nice to have something even cheaper.
Barring some brilliant idea, or anyway for a first cut, it seems to me
that we can adjust the above pseudocode by assuming the use of a
LWLock. In addition, two other adjustments: first, the first line
should test level > ShareUpdateExclusiveLock, rather than >=, per
previous discussion. Second, import_all_local_locks needn't really
move everything; just those locks with a matching locktag. Thus:
if (level > ShareUpdateExclusiveLock)
    ++strong_lock_counts[my_strong_lock_count_partition]
    sfence
    for each backend
        take per-backend lwlock for target backend
        transfer fast-path entries with matching locktag
        release per-backend lwlock for target backend
    normal_LockAcquireEx
else if (level <= RowExclusiveLock)
    lfence
    if (strong_lock_counts[my_strong_lock_count_partition] == 0)
        take per-backend lwlock for own backend
        fast-path lock acquisition
        release per-backend lwlock for own backend
    else
        normal_LockAcquireEx
else
    normal_LockAcquireEx
This drops the part about only transferring fast-path entries once when a
strong_lock_counts cell transitions from zero to one. Granted, that itself
requires some yet-undiscussed locking. For that matter, we can't have
multiple strong lockers completing transfers on the same cell in parallel.
Perhaps add a FastPathTransferLock, or an array of per-cell locks, that each
strong locker holds for that entire "if" body and while decrementing the
strong_lock_counts cell at lock release.
As far as the level of detail of this pseudocode goes, there's no need to hold
the per-backend LWLock while transferring the fast-path entries. You just
need to hold it sometime between bumping strong_lock_counts and transferring
the backend's locks. This ensures that, for example, the backend is not
sleeping in the middle of a fast-path lock acquisition for the whole duration
of this code.
Now, a small fly in the ointment is that we haven't got, with
PostgreSQL, a portable library of memory primitives. So there isn't
an obvious way of doing that sfence/lfence business.
I was thinking that, if the final implementation could benefit from memory
barrier interfaces, we should create those interfaces now. Start with only a
platform-independent dummy implementation that runs a lock/unlock cycle on a
spinlock residing in backend-local memory. I'm 75% sure that would be
sufficient on all architectures for which we support spinlocks. It may turn
out that we can't benefit from such interfaces at this time ...
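As a sketch, the dummy implementation could be as simple as this, assuming PostgreSQL's existing spinlock primitives from storage/spin.h; the macro name is illustrative, and initialization of the dummy lock at backend start is omitted:

#include "storage/spin.h"    /* slock_t, SpinLockAcquire, SpinLockRelease */

/* Backend-local dummy spinlock; a lock/unlock cycle on it is presumed to
 * act as a full memory barrier on any platform where spinlocks work. */
static slock_t dummy_barrier_spinlock;

#define pg_memory_barrier() \
    do { \
        SpinLockAcquire(&dummy_barrier_spinlock); \
        SpinLockRelease(&dummy_barrier_spinlock); \
    } while (0)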
Now, it seems to
me that in the "strong lock" case, the sfence isn't really needed
anyway, because we're about to start acquiring and releasing an lwlock
for every backend, and that had better act as a full memory barrier
anyhow, or we're doomed. The "weak lock" case is more interesting,
because we need the fence before we've taken any LWLock.
Agreed.
But perhaps it'd be sufficient to just acquire the per-backend lwlock
before checking strong_lock_counts[]. If, as we hope, we get back a
zero, then we do the fast-path lock acquisition, release the lwlock,
and away we go. If we get back any other value, then we've wasted an
lwlock acquisition cycle. Or actually maybe not: it seems to me that
in that case we'd better transfer all of our fast-path entries into
the main hash table before trying to acquire any lock the slow way, at
least if we don't want the deadlock detector to have to know about the
fast-path. So then we get this:
if (level > ShareUpdateExclusiveLock)
    ++strong_lock_counts[my_strong_lock_count_partition]
    for each backend
        take per-backend lwlock for target backend
        transfer fastpath entries with matching locktag
        release per-backend lwlock for target backend
else if (level <= RowExclusiveLock)
    take per-backend lwlock for own backend
    if (strong_lock_counts[my_strong_lock_count_partition] == 0)
        fast-path lock acquisition
        done = true
    else
        transfer all fastpath entries
    release per-backend lwlock for own backend
if (!done)
    normal_LockAcquireEx
Could you elaborate on the last part (the need for "else transfer all fastpath
entries") and, specifically, how it aids deadlock avoidance? I didn't think
this change would have any impact on deadlocks, because all relevant locks
will be in the global lock table before any call to normal_LockAcquireEx.
To validate the locking at this level of detail, I think we need to sketch the
unlock protocol, too. On each strong lock release, we'll decrement the
strong_lock_counts cell. No particular interlock with fast-path lockers
should be needed; a stray AccessShareLock needlessly making it into the global
lock table is no problem. As mentioned above, we _will_ need an interlock
with lock transfer operations. How will transferred fast-path locks get
removed from the global lock table? Presumably, the original fast-path locker
should do so at transaction end; anything else would contort the life cycle.
Then add a way for the backend to know which locks had been transferred as
well as an interlock against concurrent transfer operations. Maybe that's
all.
That seems like it ought to work, at least assuming the position of
your fencing instructions was correct in the first place. But there's
one big problem to worry about: what happens if the lock transfer
fails due to shared memory exhaustion? It's not so bad in the "weak
lock" case; it'll feel just like the already-existing case where you
try to push another lock into the shared-memory hash table and there's
no room. Essentially you've been living on borrowed time anyway. On
the other hand, the "strong lock" case is a real problem, because a
large number of granted fast-path locks can effectively DOS any strong
locker, even one that wouldn't have conflicted with them. That's
clearly not going to fly, but it's not clear to me what the best way
is to patch around it.
To put it another way: the current system is fair; the chance of hitting lock
exhaustion is independent of lock level. The new system would be unfair; lock
exhaustion is much more likely to appear for a > ShareUpdateExclusiveLock
acquisition, through no fault of that transaction. I agree this isn't ideal,
but it doesn't look to me like an unacceptable weakness. Making lock slots
first-come, first-served is inherently unfair; we're not at all set up to
justly arbitrate between mutually-hostile lockers competing for slots. The
overall situation will get better, not worse, for the admin who wishes to
defend against hostile unprivileged users attempting a lock table DOS.
Thanks,
nm
On Tue, May 24, 2011 at 5:07 AM, Noah Misch <noah@leadboat.com> wrote:
This drops the part about only transferring fast-path entries once when a
strong_lock_counts cell transitions from zero to one.
Right: that's because I don't think that's what we want to do. I
don't think we want to transfer all per-backend locks to the shared
hash table as soon as anyone attempts to acquire a strong lock;
instead, I think we want to transfer only those fast-path locks which
have the same locktag as the strong lock someone is attempting to
acquire. If we do that, then it doesn't matter whether the
strong_lock_counts[] cell is transitioning from 0 to 1 or from 6 to 7:
we still have to check for strong locks with that particular locktag.
Granted, that itself
requires some yet-undiscussed locking. For that matter, we can't have
multiple strong lockers completing transfers on the same cell in parallel.
Perhaps add a FastPathTransferLock, or an array of per-cell locks, that each
strong locker holds for that entire "if" body and while decrementing the
strong_lock_counts cell at lock release.
I was imagining that the per-backend LWLock would protect the list of
fast-path locks. So to transfer locks, you would acquire the
per-backend LWLock for the backend which has the lock, and then the
lock manager partition LWLock, and then perform the transfer.
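In outline, with every function name here being hypothetical, that transfer might look like:

#include <stdint.h>

extern void per_backend_lwlock_acquire(int backend_id);
extern void per_backend_lwlock_release(int backend_id);
extern void lock_partition_lwlock_acquire(uint32_t locktag_hash);
extern void lock_partition_lwlock_release(uint32_t locktag_hash);
extern void move_matching_entries_to_main_table(int backend_id,
                                                uint32_t locktag_hash);

/* Move one backend's matching fast-path entries into the main lock table,
 * taking its per-backend LWLock first and the lock manager partition
 * LWLock second, per the ordering described above. */
void
transfer_fast_path_entries(int backend_id, uint32_t locktag_hash)
{
    per_backend_lwlock_acquire(backend_id);
    lock_partition_lwlock_acquire(locktag_hash);
    move_matching_entries_to_main_table(backend_id, locktag_hash);
    lock_partition_lwlock_release(locktag_hash);
    per_backend_lwlock_release(backend_id);
}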
As far as the level of detail of this pseudocode goes, there's no need to hold
the per-backend LWLock while transferring the fast-path entries. You just
need to hold it sometime between bumping strong_lock_counts and transferring
the backend's locks. This ensures that, for example, the backend is not
sleeping in the middle of a fast-path lock acquisition for the whole duration
of this code.
See above; I'm lost.
Now, a small fly in the ointment is that we haven't got, with
PostgreSQL, a portable library of memory primitives. So there isn't
an obvious way of doing that sfence/lfence business.
I was thinking that, if the final implementation could benefit from memory
barrier interfaces, we should create those interfaces now. Start with only a
platform-independent dummy implementation that runs a lock/unlock cycle on a
spinlock residing in backend-local memory. I'm 75% sure that would be
sufficient on all architectures for which we support spinlocks. It may turn
out that we can't benefit from such interfaces at this time ...
OK.
Now, it seems to
me that in the "strong lock" case, the sfence isn't really needed
anyway, because we're about to start acquiring and releasing an lwlock
for every backend, and that had better act as a full memory barrier
anyhow, or we're doomed. The "weak lock" case is more interesting,
because we need the fence before we've taken any LWLock.
Agreed.
But perhaps it'd be sufficient to just acquire the per-backend lwlock
before checking strong_lock_counts[]. If, as we hope, we get back a
zero, then we do the fast-path lock acquisition, release the lwlock,
and away we go. If we get back any other value, then we've wasted an
lwlock acquisition cycle. Or actually maybe not: it seems to me that
in that case we'd better transfer all of our fast-path entries into
the main hash table before trying to acquire any lock the slow way, at
least if we don't want the deadlock detector to have to know about the
fast-path. So then we get this:
if (level > ShareUpdateExclusiveLock)
    ++strong_lock_counts[my_strong_lock_count_partition]
    for each backend
        take per-backend lwlock for target backend
        transfer fastpath entries with matching locktag
        release per-backend lwlock for target backend
else if (level <= RowExclusiveLock)
    take per-backend lwlock for own backend
    if (strong_lock_counts[my_strong_lock_count_partition] == 0)
        fast-path lock acquisition
        done = true
    else
        transfer all fastpath entries
    release per-backend lwlock for own backend
if (!done)
    normal_LockAcquireEx
Could you elaborate on the last part (the need for "else transfer all fastpath
entries") and, specifically, how it aids deadlock avoidance? I didn't think
this change would have any impact on deadlocks, because all relevant locks
will be in the global lock table before any call to normal_LockAcquireEx.
Oh, hmm, maybe you're right. I was concerned about the possibility
of a backend which already holds locks going to sleep on a lock
wait, and maybe running the deadlock detector, and failing to notice a
deadlock. But I guess that can't happen: if any of the locks it holds
are relevant to the deadlock detector, the backend attempting to
acquire those locks will transfer them before attempting to acquire
the lock itself, so it should be OK.
To validate the locking at this level of detail, I think we need to sketch the
unlock protocol, too. On each strong lock release, we'll decrement the
strong_lock_counts cell. No particular interlock with fast-path lockers
should be needed; a stray AccessShareLock needlessly making it into the global
lock table is no problem. As mentioned above, we _will_ need an interlock
with lock transfer operations. How will transferred fast-path locks get
removed from the global lock table? Presumably, the original fast-path locker
should do so at transaction end; anything else would contort the life cycle.
Then add a way for the backend to know which locks had been transferred as
well as an interlock against concurrent transfer operations. Maybe that's
all.
I'm thinking that the backend can note, in its local-lock table,
whether it originally acquired a lock via the fast-path or not. Any
lock not originally acquired via the fast-path will be released just
as now. For any lock that WAS originally acquired via the fast-path,
we'll take our own per-backend lwlock, which protects the fast-path
queue, and scan the fast-path queue for a matching entry. If none is
found, then we know the lock was transferred, so release the
per-backend lwlock and do it the regular way (take lock manager
partition lock, etc.).
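As a sketch of that release path (names hypothetical, continuing the sketches above):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;

extern bool fast_path_remove(Oid relid, uint8_t lockmode);      /* scan own slots */
extern void regular_lock_release(Oid relid, uint8_t lockmode);  /* partition-locked path */
extern void per_backend_lwlock_acquire_self(void);
extern void per_backend_lwlock_release_self(void);

void
release_lock(Oid relid, uint8_t lockmode, bool was_fast_path)
{
    bool still_fast_path = false;

    if (was_fast_path)
    {
        per_backend_lwlock_acquire_self();   /* protects our fast-path queue */
        still_fast_path = fast_path_remove(relid, lockmode);
        per_backend_lwlock_release_self();
    }
    /* If no fast-path entry was found, the lock was transferred to the
     * main hash table by some strong locker; release it the regular way. */
    if (!still_fast_path)
        regular_lock_release(relid, lockmode);
}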
At transaction end, we need to release all non-session locks, so we
can consolidate a bit to avoid excess locking and unlocking. Take the
per-backend lwlock just once and scan through the queue. Any locks we
find there (that are not session locks) can be nuked from the
local-lock table and we're done. Release the per-backend lwlock. At
this point, any remaining locks that need to be released are in the
shared hash tables and we can proceed as now (see LockReleaseAll -
basically, we iterate over the lock partitions).
To put it another way: the current system is fair; the chance of hitting lock
exhaustion is independent of lock level. The new system would be unfair; lock
exhaustion is much more likely to appear for a > ShareUpdateExclusiveLock
acquisition, through no fault of that transaction. I agree this isn't ideal,
but it doesn't look to me like an unacceptable weakness. Making lock slots
first-come, first-served is inherently unfair; we're not at all set up to
justly arbitrate between mutually-hostile lockers competing for slots. The
overall situation will get better, not worse, for the admin who wishes to
defend against hostile unprivileged users attempting a lock table DOS.
Well, it's certainly true that the proposed system is far less likely
to bomb out trying to acquire an AccessShareLock than what we have
today, since in the common case the AccessShareLock doesn't use up any
shared resources. And that should make a lot of people happy. But as
to the bad scenario, one needn't presume that the lockers are hostile
- it may just be that the system is running on the edge of a full lock
table. In the worst case, someone wanting a strong lock on a table
may end up transferring a hundred or more locks (up to one per
backend) into that table. Even if they all fit and the strong locker
gets his lock, it may now happen that the space is just about
exhausted and other transactions start rolling back, apparently at
random.
Or, even more pathologically, one backend grabs a strong lock on table
X, but it just so happens that there is another table Y in the same
lock partition which is highly trafficked but, as all of the locks
involved are weak, uncontended. Now that strong_lock_counts[] is
non-zero for that partition, all those locks start going into the main
lock manager table. Performance will drop, which isn't great but we
can live with it. But maybe all the locks don't fit. Now we have a
situation in which, due to one backend acquiring a strong lock on
table A, a bunch of other backends are unable to obtain weak locks on
table B, and transactions start rolling back all over the place.
Now maybe you can argue that this scenario is sufficiently unlikely in
practice that we shouldn't worry about it, but if it does happen the
DBA will be incredibly confused, because an apparently innocuous
operation on one table will have resulted in rollbacks of transactions acquiring locks
on an apparently unrelated table. Maybe you want to argue that's
sufficiently rare that we shouldn't worry about it, but the
unpleasantness factor seems pretty high to me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company