Proper object locking for GRANT/REVOKE

Started by Peter Eisentrautabout 1 year ago13 messages
#1Peter Eisentraut
peter@eisentraut.org
1 attachment(s)

This patch started out as a refactoring, thinking that
objectNamesToOids() in aclchk.c should really mostly be a loop around
get_object_address(). This is mostly true, with a few special cases
because the node representations are a bit different in some cases, and
OBJECT_PARAMETER_ACL, which is obviously very different. This saves a
bunch of duplicative code, which is nice.

Additionally, get_object_address() handles locking, which
objectNamesToOids() somewhat famously does not do, and there is a code
comment about it. With this refactoring, we get the locking pretty much
for free.

Interestingly, this changes the output of the intra-grant-inplace
isolation test, which is recent work. It might be good to get some
review from those who worked on that, to make sure that the new behavior
is still correct and/or to check whether those test cases are still
applicable.

Also, it would be worth thinking about what the correct lock mode should
be here.

Attachments:

0001-Proper-object-locking-for-GRANT-REVOKE.patchtext/plain; charset=UTF-8; name=0001-Proper-object-locking-for-GRANT-REVOKE.patchDownload
From fb910c6af0fb0691448680ec4a4cc1eb1857cd5b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 28 Oct 2024 16:07:01 +0100
Subject: [PATCH] Proper object locking for GRANT/REVOKE

Refactor objectNamesToOids() to use get_object_address() internally if
possible.  Not only does this save a lot of code, it also allows us to
use the object locking provided by get_object_address() for
GRANT/REVOKE.  There was previously a code comment that complained
about the lack of locking in objectNamesToOids(), which is now fixed.

The check in ExecGrant_Type_check() is obsolete because
get_object_address_type() already does the same check.
---
 src/backend/catalog/aclchk.c                  | 162 +++++-------------
 .../expected/intra-grant-inplace.out          |   2 +-
 2 files changed, 41 insertions(+), 123 deletions(-)

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 95eb0b12277..2f91aa865a3 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -659,147 +659,76 @@ ExecGrantStmt_oids(InternalGrant *istmt)
  * objectNamesToOids
  *
  * Turn a list of object names of a given type into an Oid list.
- *
- * XXX: This function doesn't take any sort of locks on the objects whose
- * names it looks up.  In the face of concurrent DDL, we might easily latch
- * onto an old version of an object, causing the GRANT or REVOKE statement
- * to fail.
  */
 static List *
 objectNamesToOids(ObjectType objtype, List *objnames, bool is_grant)
 {
 	List	   *objects = NIL;
 	ListCell   *cell;
+	const LOCKMODE lockmode = AccessShareLock;
 
 	Assert(objnames != NIL);
 
 	switch (objtype)
 	{
+		default:
+
+			/*
+			 * For most object types, we use get_object_address() directly.
+			 */
+			foreach(cell, objnames)
+			{
+				ObjectAddress address;
+				Relation	relation;
+
+				address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false);
+				Assert(relation == NULL);
+				objects = lappend_oid(objects, address.objectId);
+			}
+			break;
 		case OBJECT_TABLE:
 		case OBJECT_SEQUENCE:
+
+			/*
+			 * Here, we don't use get_object_address().  It requires that the
+			 * specified object type match the actual type of the object, but
+			 * in GRANT/REVOKE, all table-like things are addressed as TABLE.
+			 */
 			foreach(cell, objnames)
 			{
 				RangeVar   *relvar = (RangeVar *) lfirst(cell);
 				Oid			relOid;
 
-				relOid = RangeVarGetRelid(relvar, NoLock, false);
+				relOid = RangeVarGetRelid(relvar, lockmode, false);
 				objects = lappend_oid(objects, relOid);
 			}
 			break;
-		case OBJECT_DATABASE:
-			foreach(cell, objnames)
-			{
-				char	   *dbname = strVal(lfirst(cell));
-				Oid			dbid;
-
-				dbid = get_database_oid(dbname, false);
-				objects = lappend_oid(objects, dbid);
-			}
-			break;
 		case OBJECT_DOMAIN:
 		case OBJECT_TYPE:
+
+			/*
+			 * The parse representation of types and domains in privilege
+			 * targets is different from that expected by get_object_address()
+			 * (for parse conflict reasons), so we have to do a bit of
+			 * conversion here.
+			 */
 			foreach(cell, objnames)
 			{
 				List	   *typname = (List *) lfirst(cell);
-				Oid			oid;
+				TypeName   *tn = makeTypeNameFromNameList(typname);
+				ObjectAddress address;
+				Relation	relation;
 
-				oid = typenameTypeId(NULL, makeTypeNameFromNameList(typname));
-				objects = lappend_oid(objects, oid);
-			}
-			break;
-		case OBJECT_FUNCTION:
-			foreach(cell, objnames)
-			{
-				ObjectWithArgs *func = (ObjectWithArgs *) lfirst(cell);
-				Oid			funcid;
-
-				funcid = LookupFuncWithArgs(OBJECT_FUNCTION, func, false);
-				objects = lappend_oid(objects, funcid);
-			}
-			break;
-		case OBJECT_LANGUAGE:
-			foreach(cell, objnames)
-			{
-				char	   *langname = strVal(lfirst(cell));
-				Oid			oid;
-
-				oid = get_language_oid(langname, false);
-				objects = lappend_oid(objects, oid);
-			}
-			break;
-		case OBJECT_LARGEOBJECT:
-			foreach(cell, objnames)
-			{
-				Oid			lobjOid = oidparse(lfirst(cell));
-
-				if (!LargeObjectExists(lobjOid))
-					ereport(ERROR,
-							(errcode(ERRCODE_UNDEFINED_OBJECT),
-							 errmsg("large object %u does not exist",
-									lobjOid)));
-
-				objects = lappend_oid(objects, lobjOid);
-			}
-			break;
-		case OBJECT_SCHEMA:
-			foreach(cell, objnames)
-			{
-				char	   *nspname = strVal(lfirst(cell));
-				Oid			oid;
-
-				oid = get_namespace_oid(nspname, false);
-				objects = lappend_oid(objects, oid);
-			}
-			break;
-		case OBJECT_PROCEDURE:
-			foreach(cell, objnames)
-			{
-				ObjectWithArgs *func = (ObjectWithArgs *) lfirst(cell);
-				Oid			procid;
-
-				procid = LookupFuncWithArgs(OBJECT_PROCEDURE, func, false);
-				objects = lappend_oid(objects, procid);
-			}
-			break;
-		case OBJECT_ROUTINE:
-			foreach(cell, objnames)
-			{
-				ObjectWithArgs *func = (ObjectWithArgs *) lfirst(cell);
-				Oid			routid;
-
-				routid = LookupFuncWithArgs(OBJECT_ROUTINE, func, false);
-				objects = lappend_oid(objects, routid);
-			}
-			break;
-		case OBJECT_TABLESPACE:
-			foreach(cell, objnames)
-			{
-				char	   *spcname = strVal(lfirst(cell));
-				Oid			spcoid;
-
-				spcoid = get_tablespace_oid(spcname, false);
-				objects = lappend_oid(objects, spcoid);
-			}
-			break;
-		case OBJECT_FDW:
-			foreach(cell, objnames)
-			{
-				char	   *fdwname = strVal(lfirst(cell));
-				Oid			fdwid = get_foreign_data_wrapper_oid(fdwname, false);
-
-				objects = lappend_oid(objects, fdwid);
-			}
-			break;
-		case OBJECT_FOREIGN_SERVER:
-			foreach(cell, objnames)
-			{
-				char	   *srvname = strVal(lfirst(cell));
-				Oid			srvid = get_foreign_server_oid(srvname, false);
-
-				objects = lappend_oid(objects, srvid);
+				address = get_object_address(objtype, (Node *) tn, &relation, lockmode, false);
+				Assert(relation == NULL);
+				objects = lappend_oid(objects, address.objectId);
 			}
 			break;
 		case OBJECT_PARAMETER_ACL:
+
+			/*
+			 * Parameters are handled completely differently.
+			 */
 			foreach(cell, objnames)
 			{
 				/*
@@ -830,9 +759,6 @@ objectNamesToOids(ObjectType objtype, List *objnames, bool is_grant)
 					objects = lappend_oid(objects, parameterId);
 			}
 			break;
-		default:
-			elog(ERROR, "unrecognized GrantStmt.objtype: %d",
-				 (int) objtype);
 	}
 
 	return objects;
@@ -2458,14 +2384,6 @@ ExecGrant_Type_check(InternalGrant *istmt, HeapTuple tuple)
 				(errcode(ERRCODE_INVALID_GRANT_OPERATION),
 				 errmsg("cannot set privileges of multirange types"),
 				 errhint("Set the privileges of the range type instead.")));
-
-	/* Used GRANT DOMAIN on a non-domain? */
-	if (istmt->objtype == OBJECT_DOMAIN &&
-		pg_type_tuple->typtype != TYPTYPE_DOMAIN)
-		ereport(ERROR,
-				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-				 errmsg("\"%s\" is not a domain",
-						NameStr(pg_type_tuple->typname))));
 }
 
 static void
diff --git a/src/test/isolation/expected/intra-grant-inplace.out b/src/test/isolation/expected/intra-grant-inplace.out
index 4e9695a0214..1aa9da622da 100644
--- a/src/test/isolation/expected/intra-grant-inplace.out
+++ b/src/test/isolation/expected/intra-grant-inplace.out
@@ -248,6 +248,6 @@ relhasindex
 -----------
 (0 rows)
 
-s4: WARNING:  got: cache lookup failed for relation REDACTED
+s4: WARNING:  got: relation "intra_grant_inplace" does not exist
 step revoke4: <... completed>
 step r3: ROLLBACK;
-- 
2.47.0

#2Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Peter Eisentraut (#1)
Re: Proper object locking for GRANT/REVOKE

Hi,

On Mon, Oct 28, 2024 at 04:20:42PM +0100, Peter Eisentraut wrote:

This patch started out as a refactoring, thinking that objectNamesToOids()
in aclchk.c should really mostly be a loop around get_object_address().
This is mostly true, with a few special cases because the node
representations are a bit different in some cases, and OBJECT_PARAMETER_ACL,
which is obviously very different. This saves a bunch of duplicative code,
which is nice.

Additionally, get_object_address() handles locking, which
objectNamesToOids() somewhat famously does not do, and there is a code
comment about it. With this refactoring, we get the locking pretty much for
free.

Thanks for the patch, this refactoring makes sense to me.

A few random comments:

1 ===

+ default:

I like the idea of using default as the first "case" as a way to emphasize that
this is the most "common" behavior.

2 === Nit

+  address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false);
+  Assert(relation == NULL);

Worth to explain why we do expect relation to be NULL here? (the comment on top
of get_object_address() says it all, but maybe a few words here could be worth
it).

3 ===

-
- /* Used GRANT DOMAIN on a non-domain? */
- if (istmt->objtype == OBJECT_DOMAIN &&
- pg_type_tuple->typtype != TYPTYPE_DOMAIN)
- ereport(ERROR,
- (errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("\"%s\" is not a domain",
- NameStr(pg_type_tuple->typname))));

Yeah, get_object_address()->get_object_address_type() does take care of it.

4 ===

Interestingly, this changes the output of the intra-grant-inplace isolation
test, which is recent work. It might be good to get some review from those
who worked on that, to make sure that the new behavior is still correct
and/or to check whether those test cases are still applicable.

-s4: WARNING:  got: cache lookup failed for relation REDACTED
+s4: WARNING:  got: relation "intra_grant_inplace" does not exist

I did not work on this but out of curiosity I looked at it, and IIUC this change
comes from:

-  relOid = RangeVarGetRelid(relvar, NoLock, false);
+  relOid = RangeVarGetRelid(relvar, lockmode, false);

So without the lock, the error is coming from ExecGrant_Relation() that way:

[local] DO(+0x22529b) [0x5c56afdb629b]
[local] DO(+0x2220b0) [0x5c56afdb30b0]
[local] DO(ExecuteGrantStmt+0x696) [0x5c56afdb3047]
[local] DO(+0x6dc8cb) [0x5c56b026d8cb]
[local] DO(standard_ProcessUtility+0xd38) [0x5c56b026b9da]
[local] DO(ProcessUtility+0x13a) [0x5c56b026ac9b]
[local] DO(+0x460073) [0x5c56afff1073]

With the lock, the error comes from RangeVarGetRelidExtended() that way:

[local] DO(RangeVarGetRelidExtended+0x4e6) [0x5a1c3ac49d36]
[local] DO(+0x2223fe) [0x5a1c3ac2a3fe]
[local] DO(ExecuteGrantStmt+0x111) [0x5a1c3ac29ac2]
[local] DO(+0x6dc1d2) [0x5a1c3b0e41d2]
[local] DO(standard_ProcessUtility+0xd38) [0x5a1c3b0e22e1]
[local] DO(ProcessUtility+0x13a) [0x5a1c3b0e15a2]

That's due to (in RangeVarGetRelidExtended()):

"
* But if lockmode = NoLock, then we assume that either the caller is OK
* with the answer changing under them, or that they already hold some
* appropriate lock, and therefore return the first answer we get without
* checking for invalidation messages.
"

So, in the RangeVarGetRelid(relvar, NoLock, false) case (without the patch) then
if we are able to reach "relId = RelnameGetRelid(relation->relname);" before the
drop gets committed then we get the "cache lookup failed" error.

But if we are slow enough so that we don't reach "relId = RelnameGetRelid(relation->relname);"
before the drop get committed then we would also get the "relation "intra_grant_inplace" does not exist"
error (tested manually, attaching gdb on s4 and breakpoint before the RelnameGetRelid(relation->relname)
call in RangeVarGetRelidExtended()).

The fact that we use lockmode != NoLock in the patch produces a lock followed
by a "retry" in RangeVarGetRelidExtended() and so we get the "relation "intra_grant_inplace" does not exist"
error.

I think that the new behavior is still correct and in fact is more "appropriate" (
I mean that's the kind of error I expect to see from a user point of view).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#3Peter Eisentraut
peter@eisentraut.org
In reply to: Bertrand Drouvot (#2)
Re: Proper object locking for GRANT/REVOKE

On 31.10.24 15:26, Bertrand Drouvot wrote:

+  address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false);
+  Assert(relation == NULL);

Worth to explain why we do expect relation to be NULL here? (the comment on top
of get_object_address() says it all, but maybe a few words here could be worth
it).

There are several other callers with this pattern.

Maybe it would be better to push the assertion into
get_object_address(), something like

Assert(!relation || relp)

near the end. Meaning, if you pass NULL for the relp argument, then you
don't expect a relation. This is kind of what will happen now anyway,
except with a segfault instead of an assertion.

#4Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Peter Eisentraut (#3)
1 attachment(s)
Re: Proper object locking for GRANT/REVOKE

Hi,

On Sat, Nov 09, 2024 at 01:43:13PM +0100, Peter Eisentraut wrote:

On 31.10.24 15:26, Bertrand Drouvot wrote:

+  address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false);
+  Assert(relation == NULL);

Worth to explain why we do expect relation to be NULL here? (the comment on top
of get_object_address() says it all, but maybe a few words here could be worth
it).

There are several other callers with this pattern.

Right. And most of those places declare a Relation prior calling get_object_address()
_only_ for a following assertion.

Maybe it would be better to push the assertion into get_object_address(),
something like

Assert(!relation || relp)

near the end.

That looks like a good idea to me, that would make the code cleaner and easier
to understand.

Meaning, if you pass NULL for the relp argument, then you
don't expect a relation. This is kind of what will happen now anyway,
except with a segfault instead of an assertion.

Yeah, I like it.

So, something like the attached (provided as .txt file to no mess up the CF bot
entry related to this thread) could be applied before?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Add-an-assertion-in-get_object_address.txttext/plain; charset=us-asciiDownload
From e0766ce3403eba54f28df140dd787b049d5eaf86 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Date: Mon, 11 Nov 2024 07:07:02 +0000
Subject: [PATCH v1] Add an assertion in get_object_address()

Some places declared a Relation before calling get_object_address() only to
assert that the relation is NULL after the call.

The new assertion allows passing NULL as the relation argument at those places
making the code cleaner and easier to understand.
---
 src/backend/catalog/objectaddress.c |  8 +++++---
 src/backend/commands/alter.c        | 15 ++++-----------
 2 files changed, 9 insertions(+), 14 deletions(-)
  48.8% src/backend/catalog/
  51.1% src/backend/commands/

diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 85a7b7e641..e089582285 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -896,8 +896,8 @@ static void getRelationIdentity(StringInfo buffer, Oid relid, List **object,
  *
  * If the object is a relation or a child object of a relation (e.g. an
  * attribute or constraint), the relation is also opened and *relp receives
- * the open relcache entry pointer; otherwise, *relp is set to NULL.  This
- * is a bit grotty but it makes life simpler, since the caller will
+ * the open relcache entry pointer (if *relp is not passed as a NULL argument).
+ * This is a bit grotty but it makes life simpler, since the caller will
  * typically need the relcache entry too.  Caller must close the relcache
  * entry when done with it.  The relation is locked with the specified lockmode
  * if the target object is the relation itself or an attribute, but for other
@@ -1204,8 +1204,10 @@ get_object_address(ObjectType objtype, Node *object,
 		old_address = address;
 	}
 
+	Assert(!relation || relp);
 	/* Return the object address and the relation. */
-	*relp = relation;
+	if (relp)
+		*relp = relation;
 	return address;
 }
 
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 4f99ebb447..a45f3bb6b8 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -421,13 +421,11 @@ ExecRenameStmt(RenameStmt *stmt)
 			{
 				ObjectAddress address;
 				Relation	catalog;
-				Relation	relation;
 
 				address = get_object_address(stmt->renameType,
 											 stmt->object,
-											 &relation,
+											 NULL,
 											 AccessExclusiveLock, false);
-				Assert(relation == NULL);
 
 				catalog = table_open(address.classId, RowExclusiveLock);
 				AlterObjectRename_internal(catalog,
@@ -482,8 +480,7 @@ ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt, ObjectAddress *refAddre
 		table_close(rel, NoLock);
 
 	refAddr = get_object_address(OBJECT_EXTENSION, (Node *) stmt->extname,
-								 &rel, AccessExclusiveLock, false);
-	Assert(rel == NULL);
+								 NULL, AccessExclusiveLock, false);
 	if (refAddress)
 		*refAddress = refAddr;
 
@@ -563,16 +560,14 @@ ExecAlterObjectSchemaStmt(AlterObjectSchemaStmt *stmt,
 		case OBJECT_TSTEMPLATE:
 			{
 				Relation	catalog;
-				Relation	relation;
 				Oid			classId;
 				Oid			nspOid;
 
 				address = get_object_address(stmt->objectType,
 											 stmt->object,
-											 &relation,
+											 NULL,
 											 AccessExclusiveLock,
 											 false);
-				Assert(relation == NULL);
 				classId = address.classId;
 				catalog = table_open(classId, RowExclusiveLock);
 				nspOid = LookupCreationNamespace(stmt->newschema);
@@ -876,15 +871,13 @@ ExecAlterOwnerStmt(AlterOwnerStmt *stmt)
 		case OBJECT_TSDICTIONARY:
 		case OBJECT_TSCONFIGURATION:
 			{
-				Relation	relation;
 				ObjectAddress address;
 
 				address = get_object_address(stmt->objectType,
 											 stmt->object,
-											 &relation,
+											 NULL,
 											 AccessExclusiveLock,
 											 false);
-				Assert(relation == NULL);
 
 				AlterObjectOwner_internal(address.classId, address.objectId,
 										  newowner);
-- 
2.34.1

#5Peter Eisentraut
peter@eisentraut.org
In reply to: Bertrand Drouvot (#4)
Re: Proper object locking for GRANT/REVOKE

On 11.11.24 08:53, Bertrand Drouvot wrote:

Maybe it would be better to push the assertion into get_object_address(),
something like

Assert(!relation || relp)

near the end.

That looks like a good idea to me, that would make the code cleaner and easier
to understand.

Meaning, if you pass NULL for the relp argument, then you
don't expect a relation. This is kind of what will happen now anyway,
except with a segfault instead of an assertion.

Yeah, I like it.

So, something like the attached (provided as .txt file to no mess up the CF bot
entry related to this thread) could be applied before?

Thanks. I have applied your patch and then also mine with the
appropriate adjustments.

#6Noah Misch
noah@leadboat.com
In reply to: Peter Eisentraut (#5)
Re: Proper object locking for GRANT/REVOKE

On Fri, Nov 15, 2024 at 11:19:50AM +0100, Peter Eisentraut wrote:

On 11.11.24 08:53, Bertrand Drouvot wrote:

Thanks. I have applied your patch and then also mine with the appropriate
adjustments.

commit d31bbfb wrote:

--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt)
* objectNamesToOids
*
* Turn a list of object names of a given type into an Oid list.
- *
- * XXX: This function doesn't take any sort of locks on the objects whose
- * names it looks up.  In the face of concurrent DDL, we might easily latch
- * onto an old version of an object, causing the GRANT or REVOKE statement
- * to fail.

To prevent "latch onto an old version" and remove the last sentence of the
comment, we'd need two more things:

- Use a self-exclusive lock here, not AccessShareLock. With two sessions
doing GRANT under AccessShareLock, one will "latch onto an old version".

- Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that
GRANT/REVOKE can affect. For example, today's locking in ALTER FUNCTION is
the xmax stamped on the old tuple. If GRANT switched to
ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT
to "latch onto an old version".

I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA
terminate every autovacuum running in the schema and consume a shared lock
table entry per table in the schema. I think the user-visible benefit of
commit d31bbfb plus this additional work is just changing "ERROR: tuple
concurrently updated" to blocking. That's not nothing, but I don't see it
outweighing autovacuum termination and lock table consumption spikes. What
other benefits and drawbacks should we weigh?

--- a/src/test/isolation/expected/intra-grant-inplace.out
+++ b/src/test/isolation/expected/intra-grant-inplace.out
@@ -248,6 +248,6 @@ relhasindex
-----------
(0 rows)
-s4: WARNING:  got: cache lookup failed for relation REDACTED
+s4: WARNING:  got: relation "intra_grant_inplace" does not exist

The affected permutation existed to cover the first LockRelease() in
SearchSysCacheLocked1(). Since this commit, that line no longer has coverage.

#7Peter Eisentraut
peter@eisentraut.org
In reply to: Noah Misch (#6)
Re: Proper object locking for GRANT/REVOKE

On 25.11.24 02:24, Noah Misch wrote:

commit d31bbfb wrote:

--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt)
* objectNamesToOids
*
* Turn a list of object names of a given type into an Oid list.
- *
- * XXX: This function doesn't take any sort of locks on the objects whose
- * names it looks up.  In the face of concurrent DDL, we might easily latch
- * onto an old version of an object, causing the GRANT or REVOKE statement
- * to fail.

To prevent "latch onto an old version" and remove the last sentence of the
comment, we'd need two more things:

- Use a self-exclusive lock here, not AccessShareLock. With two sessions
doing GRANT under AccessShareLock, one will "latch onto an old version".

- Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that
GRANT/REVOKE can affect. For example, today's locking in ALTER FUNCTION is
the xmax stamped on the old tuple. If GRANT switched to
ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT
to "latch onto an old version".

Ok, we should probably put that comment back in slightly altered form, like

"XXX This function intentionally takes only an AccessShareLock ...
$REASON. In the face of concurrent DDL, we might easily latch
onto an old version of an object, causing the GRANT or REVOKE statement
to fail."

I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA
terminate every autovacuum running in the schema and consume a shared lock
table entry per table in the schema. I think the user-visible benefit of
commit d31bbfb plus this additional work is just changing "ERROR: tuple
concurrently updated" to blocking. That's not nothing, but I don't see it
outweighing autovacuum termination and lock table consumption spikes. What
other benefits and drawbacks should we weigh?

I think what are describing is a reasonable tradeoff. The user
experience is tolerable: "tuple concurrently updated" is a mildly useful
error message, and it's probably the table owner executing both commands.

The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about
swapping in a completely different object concurrently (even if we
currently think this is not an actual problem).

--- a/src/test/isolation/expected/intra-grant-inplace.out
+++ b/src/test/isolation/expected/intra-grant-inplace.out
@@ -248,6 +248,6 @@ relhasindex
-----------
(0 rows)
-s4: WARNING:  got: cache lookup failed for relation REDACTED
+s4: WARNING:  got: relation "intra_grant_inplace" does not exist

The affected permutation existed to cover the first LockRelease() in
SearchSysCacheLocked1(). Since this commit, that line no longer has coverage.

Do you have an idea how such a test case could be constructed now?

#8Noah Misch
noah@leadboat.com
In reply to: Peter Eisentraut (#7)
Re: Proper object locking for GRANT/REVOKE

On Mon, Dec 02, 2024 at 12:13:56PM +0100, Peter Eisentraut wrote:

On 25.11.24 02:24, Noah Misch wrote:

commit d31bbfb wrote:

--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt)
* objectNamesToOids
*
* Turn a list of object names of a given type into an Oid list.
- *
- * XXX: This function doesn't take any sort of locks on the objects whose
- * names it looks up.  In the face of concurrent DDL, we might easily latch
- * onto an old version of an object, causing the GRANT or REVOKE statement
- * to fail.

To prevent "latch onto an old version" and remove the last sentence of the
comment, we'd need two more things:

- Use a self-exclusive lock here, not AccessShareLock. With two sessions
doing GRANT under AccessShareLock, one will "latch onto an old version".

- Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that
GRANT/REVOKE can affect. For example, today's locking in ALTER FUNCTION is
the xmax stamped on the old tuple. If GRANT switched to
ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT
to "latch onto an old version".

Ok, we should probably put that comment back in slightly altered form, like

"XXX This function intentionally takes only an AccessShareLock ... $REASON.
In the face of concurrent DDL, we might easily latch
onto an old version of an object, causing the GRANT or REVOKE statement
to fail."

Yep.

I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA
terminate every autovacuum running in the schema and consume a shared lock
table entry per table in the schema. I think the user-visible benefit of
commit d31bbfb plus this additional work is just changing "ERROR: tuple
concurrently updated" to blocking. That's not nothing, but I don't see it
outweighing autovacuum termination and lock table consumption spikes. What
other benefits and drawbacks should we weigh?

I think what are describing is a reasonable tradeoff. The user experience
is tolerable: "tuple concurrently updated" is a mildly useful error message,
and it's probably the table owner executing both commands.

The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about swapping
in a completely different object concurrently (even if we currently think
this is not an actual problem).

Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.

--- a/src/test/isolation/expected/intra-grant-inplace.out
+++ b/src/test/isolation/expected/intra-grant-inplace.out
@@ -248,6 +248,6 @@ relhasindex
-----------
(0 rows)
-s4: WARNING:  got: cache lookup failed for relation REDACTED
+s4: WARNING:  got: relation "intra_grant_inplace" does not exist

The affected permutation existed to cover the first LockRelease() in
SearchSysCacheLocked1(). Since this commit, that line no longer has coverage.

Do you have an idea how such a test case could be constructed now?

A rough idea. The test worked because REVOKE used only LOCKTAG_TUPLE, which
didn't mind the LOCKTAG_RELATION from DROP TABLE.

One route might be to find another SearchSysCacheLocked1() caller that takes
no locks beyond the LOCKTAG_TUPLE taken right in SearchSysCacheLocked1(). I'd
add a temporary elog to report if that's happening.
check_lock_if_inplace_updateable_rel() is an example of reporting the absence
of a lock. If check-world w/ that elog finds some operation reaching that
circumstance, this test could replace REVOKE with that operation.

Another route would be to remove the catalog row without locking the
underlying object, e.g. by replacing DROP TABLE with DELETE FROM pg_class.
That's more artificial.

#9Peter Eisentraut
peter@eisentraut.org
In reply to: Noah Misch (#8)
2 attachment(s)
Re: Proper object locking for GRANT/REVOKE

On 09.12.24 02:25, Noah Misch wrote:

Ok, we should probably put that comment back in slightly altered form, like

"XXX This function intentionally takes only an AccessShareLock ... $REASON.
In the face of concurrent DDL, we might easily latch
onto an old version of an object, causing the GRANT or REVOKE statement
to fail."

Yep.

There is an open item for PG18 for this. So here is a patch that adds a
comment back, mostly from your descriptions in this thread.

The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about swapping
in a completely different object concurrently (even if we currently think
this is not an actual problem).

Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.

Hmm. I think there has been a general effort to get rid of internal
errors like "cache lookup failed ..." and replace them with proper
user-facing errors. This change seems in line with that.

An alternative, if we wanted to go back to the old behavior (other than
reverting altogether, since I think the refactoring is still valuable),
would be to allow get_object_address() to work with lockmode == NoLock.
That would require a bit of work, but nothing magical.

--- a/src/test/isolation/expected/intra-grant-inplace.out
+++ b/src/test/isolation/expected/intra-grant-inplace.out
@@ -248,6 +248,6 @@ relhasindex
-----------
(0 rows)
-s4: WARNING:  got: cache lookup failed for relation REDACTED
+s4: WARNING:  got: relation "intra_grant_inplace" does not exist

The affected permutation existed to cover the first LockRelease() in
SearchSysCacheLocked1(). Since this commit, that line no longer has coverage.

Do you have an idea how such a test case could be constructed now?

A rough idea. The test worked because REVOKE used only LOCKTAG_TUPLE, which
didn't mind the LOCKTAG_RELATION from DROP TABLE.

One route might be to find another SearchSysCacheLocked1() caller that takes
no locks beyond the LOCKTAG_TUPLE taken right in SearchSysCacheLocked1(). I'd
add a temporary elog to report if that's happening.
check_lock_if_inplace_updateable_rel() is an example of reporting the absence
of a lock. If check-world w/ that elog finds some operation reaching that
circumstance, this test could replace REVOKE with that operation.

Another route would be to remove the catalog row without locking the
underlying object, e.g. by replacing DROP TABLE with DELETE FROM pg_class.
That's more artificial.

I have attached here a patch that shows that the last idea does work.

I don't know what the best solution here is. It seems weird to leave
some higher-level code faulty in order to be able to test lower-level
code that attempts to cope with the faults of the higher-level code. I
know that backstops have value, though.

Attachments:

0001-Improve-objectNamesToOids-comment.patchtext/plain; charset=UTF-8; name=0001-Improve-objectNamesToOids-comment.patchDownload
From 1cac7fecf80a7ed3dc4a714f7c984fd8934fc0ea Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Wed, 11 Jun 2025 17:08:23 +0200
Subject: [PATCH 1/2] Improve objectNamesToOids() comment

Commit d31bbfb6590 removed the comment at objectNamesToOids() that
there is no locking, because that commit added locking.  But to fix
all the problems, we'd still need a stronger lock.  So put the comment
back with more a detailed explanation.
---
 src/backend/catalog/aclchk.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 9ca8a88dc91..24948c1f05e 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -659,6 +659,20 @@ ExecGrantStmt_oids(InternalGrant *istmt)
  * objectNamesToOids
  *
  * Turn a list of object names of a given type into an Oid list.
+ *
+ * XXX This function intentionally takes only an AccessShareLock.  In the face
+ * of concurrent DDL, we might easily latch onto an old version of an object,
+ * causing the GRANT or REVOKE statement to fail.  But it does prevent the
+ * object from disappearing altogether.  To do better, we would need to use a
+ * self-exclusive lock, perhaps ShareUpdateExclusiveLock, here and before
+ * *every* CatalogTupleUpdate() of a row that GRANT/REVOKE can affect.
+ * Besides that additional work, this could have operational costs.  For
+ * example, it would make GRANT ALL TABLES IN SCHEMA terminate every
+ * autovacuum running in the schema and consume a shared lock table entry per
+ * table in the schema.  The user-visible benefit of that additional work is
+ * just changing "ERROR: tuple concurrently updated" to blocking.  That's not
+ * nothing, but it might not outweigh autovacuum termination and lock table
+ * consumption spikes.
  */
 static List *
 objectNamesToOids(ObjectType objtype, List *objnames, bool is_grant)

base-commit: 361499538c9d3640e1ed5522e08fdf81b08e76ae
-- 
2.49.0

0002-WIP-Attempt-to-put-back-intra-grant-inplace.spec-cov.patchtext/plain; charset=UTF-8; name=0002-WIP-Attempt-to-put-back-intra-grant-inplace.spec-cov.patchDownload
From e1ff7134fca04b9909f30eb8d71a126de92bc2a1 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Wed, 11 Jun 2025 17:11:19 +0200
Subject: [PATCH 2/2] WIP: Attempt to put back intra-grant-inplace.spec
 coverage

---
 src/test/isolation/expected/intra-grant-inplace.out | 4 ++--
 src/test/isolation/specs/intra-grant-inplace.spec   | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/test/isolation/expected/intra-grant-inplace.out b/src/test/isolation/expected/intra-grant-inplace.out
index 1aa9da622da..23c34d0ca09 100644
--- a/src/test/isolation/expected/intra-grant-inplace.out
+++ b/src/test/isolation/expected/intra-grant-inplace.out
@@ -226,7 +226,7 @@ step revoke4: <... completed>
 starting permutation: b1 drop1 b3 sfu3 revoke4 c1 r3
 step b1: BEGIN;
 step drop1: 
-	DROP TABLE intra_grant_inplace;
+	DELETE FROM pg_class WHERE relname = 'intra_grant_inplace';
 
 step b3: BEGIN ISOLATION LEVEL READ COMMITTED;
 step sfu3: 
@@ -248,6 +248,6 @@ relhasindex
 -----------
 (0 rows)
 
-s4: WARNING:  got: relation "intra_grant_inplace" does not exist
+s4: WARNING:  got: cache lookup failed for relation REDACTED
 step revoke4: <... completed>
 step r3: ROLLBACK;
diff --git a/src/test/isolation/specs/intra-grant-inplace.spec b/src/test/isolation/specs/intra-grant-inplace.spec
index 9936d389359..e9c7848624c 100644
--- a/src/test/isolation/specs/intra-grant-inplace.spec
+++ b/src/test/isolation/specs/intra-grant-inplace.spec
@@ -20,7 +20,7 @@ step grant1	{
 	GRANT SELECT ON intra_grant_inplace TO PUBLIC;
 }
 step drop1	{
-	DROP TABLE intra_grant_inplace;
+	DELETE FROM pg_class WHERE relname = 'intra_grant_inplace';
 }
 step c1	{ COMMIT; }
 
-- 
2.49.0

#10Noah Misch
noah@leadboat.com
In reply to: Peter Eisentraut (#9)
Re: Proper object locking for GRANT/REVOKE

On Wed, Jun 11, 2025 at 05:22:53PM +0200, Peter Eisentraut wrote:

On 09.12.24 02:25, Noah Misch wrote:

Ok, we should probably put that comment back in slightly altered form, like

"XXX This function intentionally takes only an AccessShareLock ... $REASON.
In the face of concurrent DDL, we might easily latch
onto an old version of an object, causing the GRANT or REVOKE statement
to fail."

Yep.

There is an open item for PG18 for this. So here is a patch that adds a
comment back, mostly from your descriptions in this thread.

The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about swapping
in a completely different object concurrently (even if we currently think
this is not an actual problem).

Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.

Hmm. I think there has been a general effort to get rid of internal errors
like "cache lookup failed ..." and replace them with proper user-facing
errors. This change seems in line with that.

I have seen some commits along those lines, e.g. d8a0993. They weren't adding
lock acquisitions, though. If we've deliberately incurred lock acquisitions
just to get rid of "cache lookup failed ...", I don't remember those commits.

An alternative, if we wanted to go back to the old behavior (other than
reverting altogether, since I think the refactoring is still valuable),
would be to allow get_object_address() to work with lockmode == NoLock. That
would require a bit of work, but nothing magical.

That seems a bit better to me than your comment-only proposal, but either
could be okay.

#11Nathan Bossart
nathandbossart@gmail.com
In reply to: Noah Misch (#10)
Re: Proper object locking for GRANT/REVOKE

On Mon, Jun 23, 2025 at 04:16:09PM -0700, Noah Misch wrote:

On Wed, Jun 11, 2025 at 05:22:53PM +0200, Peter Eisentraut wrote:

There is an open item for PG18 for this. So here is a patch that adds a
comment back, mostly from your descriptions in this thread.

The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about swapping
in a completely different object concurrently (even if we currently think
this is not an actual problem).

Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.

Hmm. I think there has been a general effort to get rid of internal errors
like "cache lookup failed ..." and replace them with proper user-facing
errors. This change seems in line with that.

I have seen some commits along those lines, e.g. d8a0993. They weren't adding
lock acquisitions, though. If we've deliberately incurred lock acquisitions
just to get rid of "cache lookup failed ...", I don't remember those commits.

An alternative, if we wanted to go back to the old behavior (other than
reverting altogether, since I think the refactoring is still valuable),
would be to allow get_object_address() to work with lockmode == NoLock. That
would require a bit of work, but nothing magical.

That seems a bit better to me than your comment-only proposal, but either
could be okay.

There is still an open item for this. After a read-through, I tend to
agree with Noah that it might not be worth incurring extra lock
acquisitions solely in the name of avoiding cache lookup errors. This
seems like something that could perhaps be revisited for v19, but it's
pretty late in the game for v18...

--
nathan

#12Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Nathan Bossart (#11)
Re: Proper object locking for GRANT/REVOKE

On 19/08/2025 21:21, Nathan Bossart wrote:

On Mon, Jun 23, 2025 at 04:16:09PM -0700, Noah Misch wrote:

On Wed, Jun 11, 2025 at 05:22:53PM +0200, Peter Eisentraut wrote:

There is an open item for PG18 for this. So here is a patch that adds a
comment back, mostly from your descriptions in this thread.

The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about swapping
in a completely different object concurrently (even if we currently think
this is not an actual problem).

Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.

Hmm. I think there has been a general effort to get rid of internal errors
like "cache lookup failed ..." and replace them with proper user-facing
errors. This change seems in line with that.

I have seen some commits along those lines, e.g. d8a0993. They weren't adding
lock acquisitions, though. If we've deliberately incurred lock acquisitions
just to get rid of "cache lookup failed ...", I don't remember those commits.

An alternative, if we wanted to go back to the old behavior (other than
reverting altogether, since I think the refactoring is still valuable),
would be to allow get_object_address() to work with lockmode == NoLock. That
would require a bit of work, but nothing magical.

That seems a bit better to me than your comment-only proposal, but either
could be okay.

There is still an open item for this. After a read-through, I tend to
agree with Noah that it might not be worth incurring extra lock
acquisitions solely in the name of avoiding cache lookup errors. This
seems like something that could perhaps be revisited for v19, but it's
pretty late in the game for v18...

+1 for the comment patch (0001-Improve-objectNamesToOids-comment.patch).

+1 for also doing something like
(0002-WIP-Attempt-to-put-back-intra-grant-inplace.spec-cov.patch), to
keep the test coverage for the "cache lookup failed" error. Maybe add a
new test for that, given that it's pretty hacky.

Those are just comment and test updates, however, so they should not
block the release. I will remove this from the Open Items list.

- Heikki

#13Peter Eisentraut
peter@eisentraut.org
In reply to: Heikki Linnakangas (#12)
Re: Proper object locking for GRANT/REVOKE

On 26.08.25 18:21, Heikki Linnakangas wrote:

+1 for the comment patch (0001-Improve-objectNamesToOids-comment.patch).

+1 for also doing something like (0002-WIP-Attempt-to-put-back-intra-
grant-inplace.spec-cov.patch), to keep the test coverage for the "cache
lookup failed" error. Maybe add a new test for that, given that it's
pretty hacky.

Those are just comment and test updates, however, so they should not
block the release. I will remove this from the Open Items list.

I have committed these two patches now.