Getting rid of "tuple concurrently updated" elog()s with concurrent DDLs (at least ALTER TABLE)
Hi all,
Triggering "tuple concurrently updated" from heapam.c, which is an elog(),
is not that difficult for some DDL commands. For example with ALTER ROLE,
just create a role:
psql --no-psqlrc -c 'create role popo'
And then run that in two sessions and the error will show up, triggered
from CatalogTupleUpdate():
while true; do psql --no-psqlrc -c "alter role popo password 'foo'"; done
In this case, upgrading the lock taken on pg_authid from RowExclusiveLock
to ShareRowExclusiveLock prevents concurrent sessions from updating each
other's tuples, which is what the attached patch does.
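For reference, the shape of that change is a one-line switch of the lock level where ALTER ROLE opens pg_authid (a hypothetical hunk against src/backend/commands/user.c; the attached patch is authoritative):

```diff
-	pg_authid_rel = heap_open(AuthIdRelationId, RowExclusiveLock);
+	pg_authid_rel = heap_open(AuthIdRelationId, ShareRowExclusiveLock);
```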
I think that we should get rid of all those errors: for many applications
doing concurrent DDLs, getting this error is surprising and could be mistaken
for a sign of a corrupted instance. I am only touching the tip of the iceberg
here, as I have the gut feeling that there are other objects for which it is
possible to trigger the failure, and an analysis effort is going to be costly.
So I'd like to take the temperature about such an analysis first.
So, opinions?
--
Michael
Attachments:
alter-role-concurrent-v1.patch (text/plain; charset=us-ascii, +5/-2)
On December 26, 2017 5:55:03 AM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote:
You're proposing to lock the entire relation against many forms of concurrent DDL, just to get rid of that error? That seems unacceptable.
Isn't the canonical way to solve this to take object locks?
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Tue, Dec 26, 2017 at 4:30 PM, Andres Freund <andres@anarazel.de> wrote:
You're proposing to lock the entire relation against many forms of concurrent DDL, just to get rid of that error? That seems unacceptable.
Isn't the canonical way to solve this to take object locks?
Sure. That's where things in lmgr.c come into play, like LockSharedObject():
you could hold an exclusive lock on a given object until the end of the
transaction before opening the catalog relation with heap_open(). However,
with those you need to know the object OID before taking a lock on the
parent relation, right? So you face problems with lock upgrades, or race
conditions show up more easily. I have to admit that I have not dug much
into the problem yet. It is easy enough to write isolation tests, by the
way, and I just noticed that ALTER DATABASE SET can equally trigger the
error.
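To make the ordering concrete, here is a rough sketch of what locking the object first could look like inside ALTER ROLE (LockSharedObject(), get_role_oid() and heap_open() are real internal APIs, but the surrounding code is hypothetical and not a complete function):

```c
/* Sketch only: lock the role object itself before touching pg_authid. */
Oid			roleid = get_role_oid("popo", false);	/* name lookup, no lock yet */

/* Serialize concurrent ALTER ROLE on this role only, not on all of pg_authid. */
LockSharedObject(AuthIdRelationId, roleid, 0, AccessExclusiveLock);

/*
 * Problem: between the name lookup and the lock, the role may have been
 * dropped and recreated with another OID, hence the race mentioned above.
 */
Relation	rel = heap_open(AuthIdRelationId, RowExclusiveLock);
```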
--
Michael
On Tue, Dec 26, 2017 at 4:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I don't understand what you mean by "you need to know the object OID
before taking a lock on the parent relation, right?". That seems
wrong.
I think you might need something like what was done in
b3ad5d02c9cd8a4c884cd78480f221afe8ce5590; if, after we look up the
name and before we acquire a lock on the OID, we accept any
invalidation messages, recheck that the object we've locked is still
the one associated with that name.
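A sketch of that retry pattern, modeled on what RangeVarGetRelid() does for relations (the lookup helper here is hypothetical; LockSharedObject(), UnlockSharedObject() and AcceptInvalidationMessages() are real internal APIs):

```c
/*
 * Sketch: look up an object by name, lock its OID, then recheck that the
 * name still maps to the same OID after absorbing invalidation messages.
 */
for (;;)
{
	/* hypothetical name -> OID lookup for the object class at hand */
	Oid			objid = lookup_object_oid(classId, objname);

	if (OidIsValid(objid))
		LockSharedObject(classId, objid, 0, lockmode);

	/* Absorb any sinval messages sent by concurrent DDL. */
	AcceptInvalidationMessages();

	/* If the name still resolves to the OID we locked, we are done. */
	if (objid == lookup_object_oid(classId, objname))
		return objid;

	/* Otherwise release the now-stale lock and retry from scratch. */
	if (OidIsValid(objid))
		UnlockSharedObject(classId, objid, 0, lockmode);
}
```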
I think Andres is certainly right that locking all of pg_authid is a nonstarter.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 26, 2017 at 10:47:59PM -0800, Robert Haas wrote:
I don't understand what you mean by "you need to know the object OID
before taking a lock on the parent relation, right?". That seems
wrong.
Well, the point I am trying to outline here is that it is essential to
take a lock on the object's OID before opening pg_authid with heap_open()
and a low-level lock, so as to avoid any concurrency issues when working
on the opened catalog relation.
I think you might need something like what was done in
b3ad5d02c9cd8a4c884cd78480f221afe8ce5590; if, after we look up the
name and before we acquire a lock on the OID, we accept any
invalidation messages, recheck that the object we've locked is still
the one associated with that name.
Indeed, this bit is important, and RangeVarGetRelid() does so. Looking at
get_object_address(), it also handles locking of the object. Using that
as an interface to address all the concurrency problems could be appealing,
and less invasive as a final result, as many DDLs are visibly
impacted (I still need to make a list here). What do you think?
--
Michael
On Wed, Dec 27, 2017 at 04:53:42PM +0900, Michael Paquier wrote:
Indeed, this bit is important, and RangeVarGet does so. Looking at
get_object_address(), it also handles locking of the object. Using that
as interface to address all the concurrency problems could be appealing,
and less invasive for a final result as many DDLs are visibly
impacted (still need to make a list here). What do you think?
So, I have looked at all the object types to spot concurrency problems,
and I have found some more. And the winners are:
- database
- role
- domain
- event trigger
- FDW
- server
- user mapping
- type
- tablespace
Most of the failures show "tuple concurrently updated" with a given set
of ALTER commands running concurrently, but I have also been able to
trigger a "cache lookup failed" error for domains, and a duplicate key
value violation for databases.
I have done some extra checks with parallel ALTER commands for the
following objects as well, without spotting issues:
- access method
- cast
- attribute
- extension
- view
- policy
- publication
- subscription
- rule
- transform
- text search stuff
- statistics
- operator [family], etc.
As looking for those failures manually is no fun, I am attaching a patch
which includes a set of isolation regression tests able to trigger
them. You just need to look for unwelcome ERROR messages and you are
good to go. The goal, moving forward, will be to take care of all those
failures. Please note that isolation tests don't allow dynamic input, so
tablespaces have no tests, but you can easily reproduce an error with
concurrent ALTER TABLESPACE commands between two sessions. Note as well
that the domain test will fail because the cache lookup error on domains
exposes the domain OID, but that is nothing to worry about.
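For illustration, the kind of spec included looks like this (a sketch in the isolation tester's format, with a role name of my choosing; the attached patch is authoritative):

```
setup { CREATE ROLE regress_popo; }
teardown { DROP ROLE regress_popo; }

session "s1"
step "s1_begin"  { BEGIN; }
step "s1_alter"  { ALTER ROLE regress_popo PASSWORD 'foo'; }
step "s1_commit" { COMMIT; }

session "s2"
step "s2_alter"  { ALTER ROLE regress_popo PASSWORD 'bar'; }

# s2 updates the same pg_authid tuple while s1's update is uncommitted,
# so s2 errors out with "tuple concurrently updated" once s1 commits.
permutation "s1_begin" "s1_alter" "s2_alter" "s1_commit"
```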
Opinions or thoughts?
--
Michael