Proper object locking for GRANT/REVOKE
This patch started out as a refactoring, thinking that
objectNamesToOids() in aclchk.c should really mostly be a loop around
get_object_address(). This is mostly true, with a few special cases
because the node representations are a bit different in some cases, and
OBJECT_PARAMETER_ACL, which is obviously very different. This saves a
bunch of duplicative code, which is nice.
Additionally, get_object_address() handles locking, which
objectNamesToOids() somewhat famously does not do, and there is a code
comment about it. With this refactoring, we get the locking pretty much
for free.
Interestingly, this changes the output of the intra-grant-inplace
isolation test, which is recent work. It might be good to get some
review from those who worked on that, to make sure that the new behavior
is still correct and/or to check whether those test cases are still
applicable.
Also, it would be worth thinking about what the correct lock mode should
be here.
Attachments:
0001-Proper-object-locking-for-GRANT-REVOKE.patchtext/plain; charset=UTF-8; name=0001-Proper-object-locking-for-GRANT-REVOKE.patchDownload+41-124
Hi,
On Mon, Oct 28, 2024 at 04:20:42PM +0100, Peter Eisentraut wrote:
This patch started out as a refactoring, thinking that objectNamesToOids()
in aclchk.c should really mostly be a loop around get_object_address().
This is mostly true, with a few special cases because the node
representations are a bit different in some cases, and OBJECT_PARAMETER_ACL,
which is obviously very different. This saves a bunch of duplicative code,
which is nice.Additionally, get_object_address() handles locking, which
objectNamesToOids() somewhat famously does not do, and there is a code
comment about it. With this refactoring, we get the locking pretty much for
free.
Thanks for the patch, this refactoring makes sense to me.
A few random comments:
1 ===
+ default:
I like the idea of using default as the first "case" as a way to emphasize that
this is the most "common" behavior.
2 === Nit
+ address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false);
+ Assert(relation == NULL);
Worth to explain why we do expect relation to be NULL here? (the comment on top
of get_object_address() says it all, but maybe a few words here could be worth
it).
3 ===
-
- /* Used GRANT DOMAIN on a non-domain? */
- if (istmt->objtype == OBJECT_DOMAIN &&
- pg_type_tuple->typtype != TYPTYPE_DOMAIN)
- ereport(ERROR,
- (errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("\"%s\" is not a domain",
- NameStr(pg_type_tuple->typname))));
Yeah, get_object_address()->get_object_address_type() does take care of it.
4 ===
Interestingly, this changes the output of the intra-grant-inplace isolation
test, which is recent work. It might be good to get some review from those
who worked on that, to make sure that the new behavior is still correct
and/or to check whether those test cases are still applicable.
-s4: WARNING: got: cache lookup failed for relation REDACTED
+s4: WARNING: got: relation "intra_grant_inplace" does not exist
I did not work on this but out of curiosity I looked at it, and IIUC this change
comes from:
- relOid = RangeVarGetRelid(relvar, NoLock, false);
+ relOid = RangeVarGetRelid(relvar, lockmode, false);
So without the lock, the error is coming from ExecGrant_Relation() that way:
[local] DO(+0x22529b) [0x5c56afdb629b]
[local] DO(+0x2220b0) [0x5c56afdb30b0]
[local] DO(ExecuteGrantStmt+0x696) [0x5c56afdb3047]
[local] DO(+0x6dc8cb) [0x5c56b026d8cb]
[local] DO(standard_ProcessUtility+0xd38) [0x5c56b026b9da]
[local] DO(ProcessUtility+0x13a) [0x5c56b026ac9b]
[local] DO(+0x460073) [0x5c56afff1073]
With the lock, the error comes from RangeVarGetRelidExtended() that way:
[local] DO(RangeVarGetRelidExtended+0x4e6) [0x5a1c3ac49d36]
[local] DO(+0x2223fe) [0x5a1c3ac2a3fe]
[local] DO(ExecuteGrantStmt+0x111) [0x5a1c3ac29ac2]
[local] DO(+0x6dc1d2) [0x5a1c3b0e41d2]
[local] DO(standard_ProcessUtility+0xd38) [0x5a1c3b0e22e1]
[local] DO(ProcessUtility+0x13a) [0x5a1c3b0e15a2]
That's due to (in RangeVarGetRelidExtended()):
"
* But if lockmode = NoLock, then we assume that either the caller is OK
* with the answer changing under them, or that they already hold some
* appropriate lock, and therefore return the first answer we get without
* checking for invalidation messages.
"
So, in the RangeVarGetRelid(relvar, NoLock, false) case (without the patch) then
if we are able to reach "relId = RelnameGetRelid(relation->relname);" before the
drop gets committed then we get the "cache lookup failed" error.
But if we are slow enough so that we don't reach "relId = RelnameGetRelid(relation->relname);"
before the drop get committed then we would also get the "relation "intra_grant_inplace" does not exist"
error (tested manually, attaching gdb on s4 and breakpoint before the RelnameGetRelid(relation->relname)
call in RangeVarGetRelidExtended()).
The fact that we use lockmode != NoLock in the patch produces a lock followed
by a "retry" in RangeVarGetRelidExtended() and so we get the "relation "intra_grant_inplace" does not exist"
error.
I think that the new behavior is still correct and in fact is more "appropriate" (
I mean that's the kind of error I expect to see from a user point of view).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 31.10.24 15:26, Bertrand Drouvot wrote:
+ address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false); + Assert(relation == NULL);Worth to explain why we do expect relation to be NULL here? (the comment on top
of get_object_address() says it all, but maybe a few words here could be worth
it).
There are several other callers with this pattern.
Maybe it would be better to push the assertion into
get_object_address(), something like
Assert(!relation || relp)
near the end. Meaning, if you pass NULL for the relp argument, then you
don't expect a relation. This is kind of what will happen now anyway,
except with a segfault instead of an assertion.
Hi,
On Sat, Nov 09, 2024 at 01:43:13PM +0100, Peter Eisentraut wrote:
On 31.10.24 15:26, Bertrand Drouvot wrote:
+ address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false); + Assert(relation == NULL);Worth to explain why we do expect relation to be NULL here? (the comment on top
of get_object_address() says it all, but maybe a few words here could be worth
it).There are several other callers with this pattern.
Right. And most of those places declare a Relation prior calling get_object_address()
_only_ for a following assertion.
Maybe it would be better to push the assertion into get_object_address(),
something likeAssert(!relation || relp)
near the end.
That looks like a good idea to me, that would make the code cleaner and easier
to understand.
Meaning, if you pass NULL for the relp argument, then you
don't expect a relation. This is kind of what will happen now anyway,
except with a segfault instead of an assertion.
Yeah, I like it.
So, something like the attached (provided as .txt file to no mess up the CF bot
entry related to this thread) could be applied before?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Add-an-assertion-in-get_object_address.txttext/plain; charset=us-asciiDownload+9-15
On 11.11.24 08:53, Bertrand Drouvot wrote:
Maybe it would be better to push the assertion into get_object_address(),
something likeAssert(!relation || relp)
near the end.
That looks like a good idea to me, that would make the code cleaner and easier
to understand.Meaning, if you pass NULL for the relp argument, then you
don't expect a relation. This is kind of what will happen now anyway,
except with a segfault instead of an assertion.Yeah, I like it.
So, something like the attached (provided as .txt file to no mess up the CF bot
entry related to this thread) could be applied before?
Thanks. I have applied your patch and then also mine with the
appropriate adjustments.
On Fri, Nov 15, 2024 at 11:19:50AM +0100, Peter Eisentraut wrote:
On 11.11.24 08:53, Bertrand Drouvot wrote:
Thanks. I have applied your patch and then also mine with the appropriate
adjustments.
commit d31bbfb wrote:
--- a/src/backend/catalog/aclchk.c +++ b/src/backend/catalog/aclchk.c @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt) * objectNamesToOids * * Turn a list of object names of a given type into an Oid list. - * - * XXX: This function doesn't take any sort of locks on the objects whose - * names it looks up. In the face of concurrent DDL, we might easily latch - * onto an old version of an object, causing the GRANT or REVOKE statement - * to fail.
To prevent "latch onto an old version" and remove the last sentence of the
comment, we'd need two more things:
- Use a self-exclusive lock here, not AccessShareLock. With two sessions
doing GRANT under AccessShareLock, one will "latch onto an old version".
- Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that
GRANT/REVOKE can affect. For example, today's locking in ALTER FUNCTION is
the xmax stamped on the old tuple. If GRANT switched to
ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT
to "latch onto an old version".
I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA
terminate every autovacuum running in the schema and consume a shared lock
table entry per table in the schema. I think the user-visible benefit of
commit d31bbfb plus this additional work is just changing "ERROR: tuple
concurrently updated" to blocking. That's not nothing, but I don't see it
outweighing autovacuum termination and lock table consumption spikes. What
other benefits and drawbacks should we weigh?
--- a/src/test/isolation/expected/intra-grant-inplace.out +++ b/src/test/isolation/expected/intra-grant-inplace.out @@ -248,6 +248,6 @@ relhasindex ----------- (0 rows)-s4: WARNING: got: cache lookup failed for relation REDACTED +s4: WARNING: got: relation "intra_grant_inplace" does not exist
The affected permutation existed to cover the first LockRelease() in
SearchSysCacheLocked1(). Since this commit, that line no longer has coverage.
On 25.11.24 02:24, Noah Misch wrote:
commit d31bbfb wrote:
--- a/src/backend/catalog/aclchk.c +++ b/src/backend/catalog/aclchk.c @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt) * objectNamesToOids * * Turn a list of object names of a given type into an Oid list. - * - * XXX: This function doesn't take any sort of locks on the objects whose - * names it looks up. In the face of concurrent DDL, we might easily latch - * onto an old version of an object, causing the GRANT or REVOKE statement - * to fail.To prevent "latch onto an old version" and remove the last sentence of the
comment, we'd need two more things:- Use a self-exclusive lock here, not AccessShareLock. With two sessions
doing GRANT under AccessShareLock, one will "latch onto an old version".- Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that
GRANT/REVOKE can affect. For example, today's locking in ALTER FUNCTION is
the xmax stamped on the old tuple. If GRANT switched to
ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT
to "latch onto an old version".
Ok, we should probably put that comment back in slightly altered form, like
"XXX This function intentionally takes only an AccessShareLock ...
$REASON. In the face of concurrent DDL, we might easily latch
onto an old version of an object, causing the GRANT or REVOKE statement
to fail."
I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA
terminate every autovacuum running in the schema and consume a shared lock
table entry per table in the schema. I think the user-visible benefit of
commit d31bbfb plus this additional work is just changing "ERROR: tuple
concurrently updated" to blocking. That's not nothing, but I don't see it
outweighing autovacuum termination and lock table consumption spikes. What
other benefits and drawbacks should we weigh?
I think what are describing is a reasonable tradeoff. The user
experience is tolerable: "tuple concurrently updated" is a mildly useful
error message, and it's probably the table owner executing both commands.
The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about
swapping in a completely different object concurrently (even if we
currently think this is not an actual problem).
--- a/src/test/isolation/expected/intra-grant-inplace.out +++ b/src/test/isolation/expected/intra-grant-inplace.out @@ -248,6 +248,6 @@ relhasindex ----------- (0 rows)-s4: WARNING: got: cache lookup failed for relation REDACTED +s4: WARNING: got: relation "intra_grant_inplace" does not existThe affected permutation existed to cover the first LockRelease() in
SearchSysCacheLocked1(). Since this commit, that line no longer has coverage.
Do you have an idea how such a test case could be constructed now?
On Mon, Dec 02, 2024 at 12:13:56PM +0100, Peter Eisentraut wrote:
On 25.11.24 02:24, Noah Misch wrote:
commit d31bbfb wrote:
--- a/src/backend/catalog/aclchk.c +++ b/src/backend/catalog/aclchk.c @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt) * objectNamesToOids * * Turn a list of object names of a given type into an Oid list. - * - * XXX: This function doesn't take any sort of locks on the objects whose - * names it looks up. In the face of concurrent DDL, we might easily latch - * onto an old version of an object, causing the GRANT or REVOKE statement - * to fail.To prevent "latch onto an old version" and remove the last sentence of the
comment, we'd need two more things:- Use a self-exclusive lock here, not AccessShareLock. With two sessions
doing GRANT under AccessShareLock, one will "latch onto an old version".- Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that
GRANT/REVOKE can affect. For example, today's locking in ALTER FUNCTION is
the xmax stamped on the old tuple. If GRANT switched to
ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT
to "latch onto an old version".Ok, we should probably put that comment back in slightly altered form, like
"XXX This function intentionally takes only an AccessShareLock ... $REASON.
In the face of concurrent DDL, we might easily latch
onto an old version of an object, causing the GRANT or REVOKE statement
to fail."
Yep.
I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA
terminate every autovacuum running in the schema and consume a shared lock
table entry per table in the schema. I think the user-visible benefit of
commit d31bbfb plus this additional work is just changing "ERROR: tuple
concurrently updated" to blocking. That's not nothing, but I don't see it
outweighing autovacuum termination and lock table consumption spikes. What
other benefits and drawbacks should we weigh?I think what are describing is a reasonable tradeoff. The user experience
is tolerable: "tuple concurrently updated" is a mildly useful error message,
and it's probably the table owner executing both commands.The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about swapping
in a completely different object concurrently (even if we currently think
this is not an actual problem).
Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.
--- a/src/test/isolation/expected/intra-grant-inplace.out +++ b/src/test/isolation/expected/intra-grant-inplace.out @@ -248,6 +248,6 @@ relhasindex ----------- (0 rows) -s4: WARNING: got: cache lookup failed for relation REDACTED +s4: WARNING: got: relation "intra_grant_inplace" does not existThe affected permutation existed to cover the first LockRelease() in
SearchSysCacheLocked1(). Since this commit, that line no longer has coverage.Do you have an idea how such a test case could be constructed now?
A rough idea. The test worked because REVOKE used only LOCKTAG_TUPLE, which
didn't mind the LOCKTAG_RELATION from DROP TABLE.
One route might be to find another SearchSysCacheLocked1() caller that takes
no locks beyond the LOCKTAG_TUPLE taken right in SearchSysCacheLocked1(). I'd
add a temporary elog to report if that's happening.
check_lock_if_inplace_updateable_rel() is an example of reporting the absence
of a lock. If check-world w/ that elog finds some operation reaching that
circumstance, this test could replace REVOKE with that operation.
Another route would be to remove the catalog row without locking the
underlying object, e.g. by replacing DROP TABLE with DELETE FROM pg_class.
That's more artificial.
On 09.12.24 02:25, Noah Misch wrote:
Ok, we should probably put that comment back in slightly altered form, like
"XXX This function intentionally takes only an AccessShareLock ... $REASON.
In the face of concurrent DDL, we might easily latch
onto an old version of an object, causing the GRANT or REVOKE statement
to fail."Yep.
There is an open item for PG18 for this. So here is a patch that adds a
comment back, mostly from your descriptions in this thread.
The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about swapping
in a completely different object concurrently (even if we currently think
this is not an actual problem).Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.
Hmm. I think there has been a general effort to get rid of internal
errors like "cache lookup failed ..." and replace them with proper
user-facing errors. This change seems in line with that.
An alternative, if we wanted to go back to the old behavior (other than
reverting altogether, since I think the refactoring is still valuable),
would be to allow get_object_address() to work with lockmode == NoLock.
That would require a bit of work, but nothing magical.
--- a/src/test/isolation/expected/intra-grant-inplace.out +++ b/src/test/isolation/expected/intra-grant-inplace.out @@ -248,6 +248,6 @@ relhasindex ----------- (0 rows) -s4: WARNING: got: cache lookup failed for relation REDACTED +s4: WARNING: got: relation "intra_grant_inplace" does not existThe affected permutation existed to cover the first LockRelease() in
SearchSysCacheLocked1(). Since this commit, that line no longer has coverage.Do you have an idea how such a test case could be constructed now?
A rough idea. The test worked because REVOKE used only LOCKTAG_TUPLE, which
didn't mind the LOCKTAG_RELATION from DROP TABLE.One route might be to find another SearchSysCacheLocked1() caller that takes
no locks beyond the LOCKTAG_TUPLE taken right in SearchSysCacheLocked1(). I'd
add a temporary elog to report if that's happening.
check_lock_if_inplace_updateable_rel() is an example of reporting the absence
of a lock. If check-world w/ that elog finds some operation reaching that
circumstance, this test could replace REVOKE with that operation.Another route would be to remove the catalog row without locking the
underlying object, e.g. by replacing DROP TABLE with DELETE FROM pg_class.
That's more artificial.
I have attached here a patch that shows that the last idea does work.
I don't know what the best solution here is. It seems weird to leave
some higher-level code faulty in order to be able to test lower-level
code that attempts to cope with the faults of the higher-level code. I
know that backstops have value, though.
Attachments:
0001-Improve-objectNamesToOids-comment.patchtext/plain; charset=UTF-8; name=0001-Improve-objectNamesToOids-comment.patchDownload+14-1
0002-WIP-Attempt-to-put-back-intra-grant-inplace.spec-cov.patchtext/plain; charset=UTF-8; name=0002-WIP-Attempt-to-put-back-intra-grant-inplace.spec-cov.patchDownload+3-4
On Wed, Jun 11, 2025 at 05:22:53PM +0200, Peter Eisentraut wrote:
On 09.12.24 02:25, Noah Misch wrote:
Ok, we should probably put that comment back in slightly altered form, like
"XXX This function intentionally takes only an AccessShareLock ... $REASON.
In the face of concurrent DDL, we might easily latch
onto an old version of an object, causing the GRANT or REVOKE statement
to fail."Yep.
There is an open item for PG18 for this. So here is a patch that adds a
comment back, mostly from your descriptions in this thread.The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about swapping
in a completely different object concurrently (even if we currently think
this is not an actual problem).Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.
Hmm. I think there has been a general effort to get rid of internal errors
like "cache lookup failed ..." and replace them with proper user-facing
errors. This change seems in line with that.
I have seen some commits along those lines, e.g. d8a0993. They weren't adding
lock acquisitions, though. If we've deliberately incurred lock acquisitions
just to get rid of "cache lookup failed ...", I don't remember those commits.
An alternative, if we wanted to go back to the old behavior (other than
reverting altogether, since I think the refactoring is still valuable),
would be to allow get_object_address() to work with lockmode == NoLock. That
would require a bit of work, but nothing magical.
That seems a bit better to me than your comment-only proposal, but either
could be okay.
On Mon, Jun 23, 2025 at 04:16:09PM -0700, Noah Misch wrote:
On Wed, Jun 11, 2025 at 05:22:53PM +0200, Peter Eisentraut wrote:
There is an open item for PG18 for this. So here is a patch that adds a
comment back, mostly from your descriptions in this thread.The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about swapping
in a completely different object concurrently (even if we currently think
this is not an actual problem).Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.
Hmm. I think there has been a general effort to get rid of internal errors
like "cache lookup failed ..." and replace them with proper user-facing
errors. This change seems in line with that.I have seen some commits along those lines, e.g. d8a0993. They weren't adding
lock acquisitions, though. If we've deliberately incurred lock acquisitions
just to get rid of "cache lookup failed ...", I don't remember those commits.An alternative, if we wanted to go back to the old behavior (other than
reverting altogether, since I think the refactoring is still valuable),
would be to allow get_object_address() to work with lockmode == NoLock. That
would require a bit of work, but nothing magical.That seems a bit better to me than your comment-only proposal, but either
could be okay.
There is still an open item for this. After a read-through, I tend to
agree with Noah that it might not be worth incurring extra lock
acquisitions solely in the name of avoiding cache lookup errors. This
seems like something that could perhaps be revisited for v19, but it's
pretty late in the game for v18...
--
nathan
On 19/08/2025 21:21, Nathan Bossart wrote:
On Mon, Jun 23, 2025 at 04:16:09PM -0700, Noah Misch wrote:
On Wed, Jun 11, 2025 at 05:22:53PM +0200, Peter Eisentraut wrote:
There is an open item for PG18 for this. So here is a patch that adds a
comment back, mostly from your descriptions in this thread.The change to AccessShareLock at least prevents confusing "cache lookup
failed" messages, and might alleviate some security concerns about swapping
in a completely different object concurrently (even if we currently think
this is not an actual problem).Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.
Hmm. I think there has been a general effort to get rid of internal errors
like "cache lookup failed ..." and replace them with proper user-facing
errors. This change seems in line with that.I have seen some commits along those lines, e.g. d8a0993. They weren't adding
lock acquisitions, though. If we've deliberately incurred lock acquisitions
just to get rid of "cache lookup failed ...", I don't remember those commits.An alternative, if we wanted to go back to the old behavior (other than
reverting altogether, since I think the refactoring is still valuable),
would be to allow get_object_address() to work with lockmode == NoLock. That
would require a bit of work, but nothing magical.That seems a bit better to me than your comment-only proposal, but either
could be okay.There is still an open item for this. After a read-through, I tend to
agree with Noah that it might not be worth incurring extra lock
acquisitions solely in the name of avoiding cache lookup errors. This
seems like something that could perhaps be revisited for v19, but it's
pretty late in the game for v18...
+1 for the comment patch (0001-Improve-objectNamesToOids-comment.patch).
+1 for also doing something like
(0002-WIP-Attempt-to-put-back-intra-grant-inplace.spec-cov.patch), to
keep the test coverage for the "cache lookup failed" error. Maybe add a
new test for that, given that it's pretty hacky.
Those are just comment and test updates, however, so they should not
block the release. I will remove this from the Open Items list.
- Heikki
On 26.08.25 18:21, Heikki Linnakangas wrote:
+1 for the comment patch (0001-Improve-objectNamesToOids-comment.patch).
+1 for also doing something like (0002-WIP-Attempt-to-put-back-intra-
grant-inplace.spec-cov.patch), to keep the test coverage for the "cache
lookup failed" error. Maybe add a new test for that, given that it's
pretty hacky.Those are just comment and test updates, however, so they should not
block the release. I will remove this from the Open Items list.
I have committed these two patches now.