[BUG] orphaned function
Hi hackers,
here is a scenario that produces an orphaned function (means it does not
belong to any namespace):
Session 1:
postgres=# create schema tobeorph;
CREATE SCHEMA
postgres=# BEGIN;
BEGIN
postgres=*# CREATE OR REPLACE FUNCTION tobeorph.bdttime() RETURNS TIMESTAMP AS $$
DECLARE
outTS TIMESTAMP;
BEGIN
perform pg_sleep(10);
RETURN CURRENT_TIMESTAMP;
END;
$$ LANGUAGE plpgsql volatile;
CREATE FUNCTION
=> Don't commit
Session 2:
postgres=# drop schema tobeorph;
DROP SCHEMA
Session 1:
postgres=*# END;
COMMIT
Function is orphaned:
postgres=# \df *.bdttime
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+---------+-----------------------------+---------------------+------
| bdttime | timestamp without time zone | | func
(1 row)
It appears that an AccessShareLock on the target namespace is not
acquired when the function is being created (while, for example, it is
acquired when creating a relation (through
RangeVarGetAndCheckCreationNamespace())).
While CreateFunction() could lock the object the same way
RangeVarGetAndCheckCreationNamespace() does, that would leave other
objects that belong to namespaces unprotected. Locking the schema in
QualifiedNameGetCreationNamespace() will protect those objects.
Please find attached a patch that adds a LockDatabaseObject() call in
QualifiedNameGetCreationNamespace() to prevent such orphaned situations.
I will add this patch to the next commitfest. I look forward to your
feedback.
Bertrand
Attachments:
v1-001-orphaned_function.patchtext/plain; charset=UTF-8; name=v1-001-orphaned_function.patch; x-mac-creator=0; x-mac-type=0Download
src/backend/catalog/namespace.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 740570c566..a7f38baf6e 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -2989,6 +2989,11 @@ CheckSetNamespace(Oid oldNspOid, Oid nspOid)
*
* Note: calling this may result in a CommandCounterIncrement operation,
* if we have to create or clean out the temp namespace.
+ *
+ * This function acquires AccessShareLock on the target
+ * namespace. Without this, the namespace could be dropped before our
+ * transaction commits, leaving behind objects with relnamespace pointing
+ * to a no-longer-existent namespace.
*/
Oid
QualifiedNameGetCreationNamespace(List *names, char **objname_p)
@@ -3029,6 +3034,9 @@ QualifiedNameGetCreationNamespace(List *names, char **objname_p)
errmsg("no schema has been selected to create in")));
}
+ /* Lock namespace. */
+ LockDatabaseObject(NamespaceRelationId, namespaceId, 0, AccessShareLock);
+
return namespaceId;
}
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
here is a scenario that produces an orphaned function (means it does not
belong to any namespace):
[ drop schema before committing function creation ]
Hm. Historically we've not been too fussed about preventing such race
conditions, and I wonder just how far is sane to take it. Should we
acquire lock on the function's argument/result data types? Its language?
Given the precedent of RangeVarGetAndCheckCreationNamespace, I'm
willing to accept this patch's goals as stated. But it feels like
we need some clearer shared understanding of which things we are
willing to bother with preventing races for, and which we aren't.
Please find attached a patch that adds a LockDatabaseObject() call in
QualifiedNameGetCreationNamespace() to prevent such orphaned situations.
I don't think that actually succeeds in preventing the race, it'll
just delay your process till the deletion is committed. But you
already have the namespaceId. Note the complex retry logic in
RangeVarGetAndCheckCreationNamespace; without something on the same
order, you're not closing the hole in any meaningful way. Likely
what this patch should do is refactor that function so that its guts
can be used for other object-creation scenarios. (The fact that
this is so far from trivial is one reason I'm not in a hurry to
extend this sort of protection to other dependencies.)
regards, tom lane
Hi,
On 11/30/20 11:29 PM, Tom Lane wrote:
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
here is a scenario that produces an orphaned function (means it does not
belong to any namespace):
[ drop schema before committing function creation ]Hm. Historically we've not been too fussed about preventing such race
conditions, and I wonder just how far is sane to take it. Should we
acquire lock on the function's argument/result data types? Its language?Given the precedent of RangeVarGetAndCheckCreationNamespace, I'm
willing to accept this patch's goals as stated.
Thanks!
Forgot to mention an observed side effect: pg_upgrade failing during the
"Creating dump of database schemas" step, with things like:
pg_dump: last built-in OID is 16383
pg_dump: reading extensions
pg_dump: identifying extension members
pg_dump: reading schemas
pg_dump: reading user-defined tables
pg_dump: reading user-defined functions
pg_dump: error: schema with OID 16385 does not exist
But it feels like
we need some clearer shared understanding of which things we are
willing to bother with preventing races for, and which we aren't.Please find attached a patch that adds a LockDatabaseObject() call in
QualifiedNameGetCreationNamespace() to prevent such orphaned situations.I don't think that actually succeeds in preventing the race, it'll
just delay your process till the deletion is committed.
oooh, i was thinking the other way around: it was protecting the other
situation (when the drop schema is waiting for the create function to
finish).
The other scenario you are referring to needs to be addressed as well,
thanks!
But you
already have the namespaceId. Note the complex retry logic in
RangeVarGetAndCheckCreationNamespace; without something on the same
order, you're not closing the hole in any meaningful way.
understood with the scenario you were referring to previously (which was
not the one I focused on).
Likely
what this patch should do is refactor that function so that its guts
can be used for other object-creation scenarios.
Will have a look.
(The fact that
this is so far from trivial is one reason I'm not in a hurry to
extend this sort of protection to other dependencies.)regards, tom lane
Thanks!
Bertrand
Hi,
On 12/1/20 10:32 AM, Drouvot, Bertrand wrote:
Hi,
On 11/30/20 11:29 PM, Tom Lane wrote:
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
here is a scenario that produces an orphaned function (means it does
not
belong to any namespace):
[ drop schema before committing function creation ]Hm. Historically we've not been too fussed about preventing such race
conditions, and I wonder just how far is sane to take it. Should we
acquire lock on the function's argument/result data types? Its
language?Given the precedent of RangeVarGetAndCheckCreationNamespace, I'm
willing to accept this patch's goals as stated.Thanks!
Forgot to mention an observed side effect: pg_upgrade failing during
the "Creating dump of database schemas" step, with things like:pg_dump: last built-in OID is 16383
pg_dump: reading extensions
pg_dump: identifying extension members
pg_dump: reading schemas
pg_dump: reading user-defined tables
pg_dump: reading user-defined functions
pg_dump: error: schema with OID 16385 does not existBut it feels like
we need some clearer shared understanding of which things we are
willing to bother with preventing races for, and which we aren't.Please find attached a patch that adds a LockDatabaseObject() call in
QualifiedNameGetCreationNamespace() to prevent such orphaned
situations.I don't think that actually succeeds in preventing the race, it'll
just delay your process till the deletion is committed.oooh, i was thinking the other way around: it was protecting the other
situation (when the drop schema is waiting for the create function to
finish).The other scenario you are referring to needs to be addressed as well,
thanks!But you
already have the namespaceId. Note the complex retry logic in
RangeVarGetAndCheckCreationNamespace; without something on the same
order, you're not closing the hole in any meaningful way.understood with the scenario you were referring to previously (which
was not the one I focused on).Likely
what this patch should do is refactor that function so that its guts
can be used for other object-creation scenarios.Will have a look.
I ended up making more changes in QualifiedNameGetCreationNamespace
<https://doxygen.postgresql.org/namespace_8c.html#a2cde9c8897e89ae47a99c103c4f96d31>()
to mimic the retry() logic that is in
RangeVarGetAndCheckCreationNamespace().
The attached v2 patch, now protects the function to be orphaned in those
2 scenarios:
Scenario 1: first, session 1 begin create function, then session 2 drops
the schema: drop schema is locked and would produce (once the create
function finishes):
bdt=# 2020-12-02 10:12:46.364 UTC [15810] ERROR: cannot drop schema
tobeorph because other objects depend on it
2020-12-02 10:12:46.364 UTC [15810] DETAIL: function tobeorph.bdttime()
depends on schema tobeorph
2020-12-02 10:12:46.364 UTC [15810] HINT: Use DROP ... CASCADE to drop
the dependent objects too.
2020-12-02 10:12:46.364 UTC [15810] STATEMENT: drop schema tobeorph;
Scenario 2: first, session 1 begin drop schema, then session 2 creates
the function: create function is locked and would produce (once the drop
schema finishes):
2020-12-02 10:14:29.468 UTC [15822] ERROR: schema "tobeorph" does not exist
Thanks
Bertrand
Attachments:
v1-002-orphaned_function.patchtext/plain; charset=UTF-8; name=v1-002-orphaned_function.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 740570c566..f4e0663ad5 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -2989,44 +2989,81 @@ CheckSetNamespace(Oid oldNspOid, Oid nspOid)
*
* Note: calling this may result in a CommandCounterIncrement operation,
* if we have to create or clean out the temp namespace.
+ *
+ * This function acquires AccessShareLock on the target
+ * namespace. Without this, the namespace could be dropped before our
+ * transaction commits, leaving behind objects with relnamespace pointing
+ * to a no-longer-existent namespace.
*/
Oid
QualifiedNameGetCreationNamespace(List *names, char **objname_p)
{
char *schemaname;
Oid namespaceId;
+ Oid oldnspid = InvalidOid;
+ bool retry = false;
+ uint64 inval_count;
/* deconstruct the name list */
DeconstructQualifiedName(names, &schemaname, objname_p);
- if (schemaname)
+ for (;;)
{
- /* check for pg_temp alias */
- if (strcmp(schemaname, "pg_temp") == 0)
+ inval_count = SharedInvalidMessageCounter;
+
+ if (schemaname)
{
- /* Initialize temp namespace */
- AccessTempTableNamespace(false);
- return myTempNamespace;
+ /* check for pg_temp alias */
+ if (strcmp(schemaname, "pg_temp") == 0)
+ {
+ /* Initialize temp namespace */
+ AccessTempTableNamespace(false);
+ return myTempNamespace;
+ }
+ /* use exact schema given */
+ namespaceId = get_namespace_oid(schemaname, false);
+ /* we do not check for USAGE rights here! */
}
- /* use exact schema given */
- namespaceId = get_namespace_oid(schemaname, false);
- /* we do not check for USAGE rights here! */
- }
- else
- {
- /* use the default creation namespace */
- recomputeNamespacePath();
- if (activeTempCreationPending)
+ else
{
- /* Need to initialize temp namespace */
- AccessTempTableNamespace(true);
- return myTempNamespace;
+ /* use the default creation namespace */
+ recomputeNamespacePath();
+ if (activeTempCreationPending)
+ {
+ /* Need to initialize temp namespace */
+ AccessTempTableNamespace(true);
+ return myTempNamespace;
+ }
+ namespaceId = activeCreationNamespace;
+ if (!OidIsValid(namespaceId))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_SCHEMA),
+ errmsg("no schema has been selected to create in")));
}
- namespaceId = activeCreationNamespace;
- if (!OidIsValid(namespaceId))
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_SCHEMA),
- errmsg("no schema has been selected to create in")));
+
+ if (retry)
+ {
+ /* If nothing changed, we're done. */
+ if (namespaceId == oldnspid)
+ break;
+
+ /* If creation namespace has changed, give up old lock. */
+ if (namespaceId != oldnspid)
+ UnlockDatabaseObject(NamespaceRelationId, oldnspid, 0,
+ AccessShareLock);
+ }
+
+ /* Lock namespace. */
+ if (namespaceId != oldnspid)
+ LockDatabaseObject(NamespaceRelationId, namespaceId, 0, AccessShareLock);
+
+ /* If no invalidation message were processed, we're done! */
+ if (inval_count == SharedInvalidMessageCounter)
+ break;
+
+ /* Something may have changed, so recheck our work. */
+ retry = true;
+ oldnspid = namespaceId;
}
return namespaceId;
Le 02/12/2020 à 11:46, Drouvot, Bertrand a écrit :
I ended up making more changes in QualifiedNameGetCreationNamespace
<https://doxygen.postgresql.org/namespace_8c.html#a2cde9c8897e89ae47a99c103c4f96d31>()
to mimic the retry() logic that is in
RangeVarGetAndCheckCreationNamespace().The attached v2 patch, now protects the function to be orphaned in
those 2 scenarios:Scenario 1: first, session 1 begin create function, then session 2
drops the schema: drop schema is locked and would produce (once the
create function finishes):bdt=# 2020-12-02 10:12:46.364 UTC [15810] ERROR: cannot drop schema
tobeorph because other objects depend on it
2020-12-02 10:12:46.364 UTC [15810] DETAIL: function
tobeorph.bdttime() depends on schema tobeorph
2020-12-02 10:12:46.364 UTC [15810] HINT: Use DROP ... CASCADE to
drop the dependent objects too.
2020-12-02 10:12:46.364 UTC [15810] STATEMENT: drop schema tobeorph;Scenario 2: first, session 1 begin drop schema, then session 2 creates
the function: create function is locked and would produce (once the
drop schema finishes):2020-12-02 10:14:29.468 UTC [15822] ERROR: schema "tobeorph" does not
existThanks
Bertrand
Hi,
This is a review for the orphaned function bug fix
/messages/by-id/5a9daaae-5538-b209-6279-e903c3ea2157@amazon.com
Contents & Purpose
==================
This patch fixes an historical race condition in PostgreSQL that leave a
function orphan of a namespace. It happens when a function is created in
a namespace inside a transaction not yet committed and that another
session drop the namespace. The function become orphan and can not be
used anymore.
postgres=# \df *.bdttime
List of functions
Schema | Name | Result data type | Argument data types
| Type
--------+---------+-----------------------------+---------------------+------
| bdttime | timestamp without time zone |
| func
Initial Run
===========
The patch applies cleanly to HEAD. The regression tests all pass
successfully with the new behavior. Given the example of the case to
reproduce, the second session now waits until the first session is
committed and when it is done it thrown error:
postgres=# drop schema tobeorph;
ERROR: cannot drop schema tobeorph because other objects depend on it
DETAIL: function tobeorph.bdttime() depends on schema tobeorph
HINT: Use DROP ... CASCADE to drop the dependent objects too.
the function is well defined in the catalog.
Comments
========
As Tom mention there is still pending DDL concurrency problems. For
example if session 1 execute the following:
CREATE TYPE public.foo as enum ('one', 'two');
BEGIN;
CREATE OR REPLACE FUNCTION tobeorph.bdttime(num foo) RETURNS
TIMESTAMP AS $$
DECLARE
outTS TIMESTAMP;
BEGIN
perform pg_sleep(10);
RETURN CURRENT_TIMESTAMP;
END;
$$ LANGUAGE plpgsql volatile;
if session 2 drop the data type before the session 1 is committed, the
function will be declared with invalid parameter data type.
The same problem applies if the returned type or the procedural language
is dropped. I have tried to fix that in ProcedureCreate() by using an
AccessSharedLock for the data types and languages involved in the
function declaration. With this patch the race condition with parameters
types, return types and PL languages are fixed. Only data types and
procedural languages with Oid greater than FirstBootstrapObjectId will
be locked locked. But this is probably more complex than this fix so it
is proposed as a separate patch
(v1-003-orphaned_function_types_language.patch) to not interfere with
the applying of Bertran's patch.
Conclusion
==========
I tag this patch (v1-002-orphaned_function.patch) as ready for
committers review as it fixes the bug reported.
--
Gilles Darold
LzLabs GmbH
https://www.lzlabs.com/
Attachments:
v1-003-orphaned_function_types_language.patchtext/x-patch; charset=UTF-8; name=v1-003-orphaned_function_types_language.patchDownload
diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index 1dd9ecc063..9b305896a3 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -34,6 +34,7 @@
#include "nodes/nodeFuncs.h"
#include "parser/parse_coerce.h"
#include "parser/parse_type.h"
+#include "storage/lmgr.h"
#include "tcop/pquery.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
@@ -55,7 +56,7 @@ static int match_prosrc_to_query(const char *prosrc, const char *queryText,
int cursorpos);
static bool match_prosrc_to_literal(const char *prosrc, const char *literal,
int cursorpos, int *newcursorpos);
-
+static ObjectAddress ObjectAddressSetWithShareLock(Oid classId, Oid objectId);
/* ----------------------------------------------------------------
* ProcedureCreate
@@ -115,6 +116,7 @@ ProcedureCreate(const char *procedureName,
int i;
Oid trfid;
ObjectAddresses *addrs;
+ List *types_locked = NIL;
/*
* sanity checks
@@ -600,11 +602,28 @@ ProcedureCreate(const char *procedureName,
add_exact_object_address(&referenced, addrs);
/* dependency on implementation language */
- ObjectAddressSet(referenced, LanguageRelationId, languageObjectId);
+ if (languageObjectId > FirstBootstrapObjectId)
+ /*
+ * at the same time apply an AccessSharedLock as the language
+ * could be dropped before our transaction commits
+ */
+ referenced = ObjectAddressSetWithShareLock(LanguageRelationId, languageObjectId);
+ else
+ ObjectAddressSet(referenced, LanguageRelationId, languageObjectId);
add_exact_object_address(&referenced, addrs);
/* dependency on return type */
- ObjectAddressSet(referenced, TypeRelationId, returnType);
+ if (returnType > FirstBootstrapObjectId)
+ {
+ /*
+ * at the same time apply an AccessSharedLock as the returned
+ * data type could be dropped before our transaction commits
+ */
+ referenced = ObjectAddressSetWithShareLock(TypeRelationId, returnType);
+ types_locked = lappend(types_locked, (Oid *) &returnType);
+ }
+ else
+ ObjectAddressSet(referenced, TypeRelationId, returnType);
add_exact_object_address(&referenced, addrs);
/* dependency on transform used by return type, if any */
@@ -617,7 +636,30 @@ ProcedureCreate(const char *procedureName,
/* dependency on parameter types */
for (i = 0; i < allParamCount; i++)
{
- ObjectAddressSet(referenced, TypeRelationId, allParams[i]);
+ ListCell *lc;
+ bool nolocked = true;
+ /* do not lock twice a parameter type or if it below FirstBootstrapObjectId */
+ foreach(lc, types_locked)
+ {
+ Oid *savedId = (Oid *) lfirst(lc);
+
+ if (*savedId < FirstBootstrapObjectId || *savedId == allParams[i])
+ {
+ nolocked = false;
+ break;
+ }
+ }
+ if (nolocked)
+ {
+ /*
+ * at the same time apply an AccessSharedLock as the parameter
+ * data type could be dropped before our transaction commits
+ */
+ referenced = ObjectAddressSetWithShareLock(TypeRelationId, allParams[i]);
+ types_locked = lappend(types_locked, (Oid *) &allParams[i]);
+ }
+ else
+ ObjectAddressSet(referenced, TypeRelationId, allParams[i]);
add_exact_object_address(&referenced, addrs);
/* dependency on transform used by parameter type, if any */
@@ -626,7 +668,11 @@ ProcedureCreate(const char *procedureName,
ObjectAddressSet(referenced, TransformRelationId, trfid);
add_exact_object_address(&referenced, addrs);
}
+ /* lock parameter type to avoid concurrent drop */
+ if (allParams[i] != returnType)
+ LockDatabaseObject(TypeRelationId, allParams[i], 0, AccessShareLock);
}
+ list_free(types_locked);
/* dependency on support function, if any */
if (OidIsValid(prosupport))
@@ -1150,3 +1196,36 @@ oid_array_to_list(Datum datum)
result = lappend_oid(result, values[i]);
return result;
}
+
+/*
+ * Return an ObjectAddress for the object after we take an AccessShareLock
+ * on the object to guard against concurrent DDL operations. Without this,
+ * the object could be dropped before our transaction commits leaving behind
+ * relations with object id pointing to a no-longer-existent object.
+ *
+ * Used in ProcedureCreate() to protect from concuttent drop of type used
+ * in function parameters or returned type, as well as language.
+ */
+static ObjectAddress
+ObjectAddressSetWithShareLock(Oid classId, Oid objectId)
+{
+ ObjectAddress referenced;
+
+ switch (classId)
+ {
+ case LanguageRelationId:
+ case TypeRelationId:
+ /* lock object */
+ LockDatabaseObject(classId, objectId, 0, AccessShareLock);
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("this class of object %d can not be locked here.", classId)));
+ break;
+ }
+
+ ObjectAddressSet(referenced, classId, objectId);
+
+ return referenced;
+}
Gilles Darold <gilles@darold.net> writes:
The same problem applies if the returned type or the procedural language
is dropped. I have tried to fix that in ProcedureCreate() by using an
AccessSharedLock for the data types and languages involved in the
function declaration. With this patch the race condition with parameters
types, return types and PL languages are fixed. Only data types and
procedural languages with Oid greater than FirstBootstrapObjectId will
be locked locked. But this is probably more complex than this fix so it
is proposed as a separate patch
(v1-003-orphaned_function_types_language.patch) to not interfere with
the applying of Bertran's patch.
Indeed. This points up one of the things that is standing in the way
of any serious consideration of applying this patch. To my mind there
are two fundamental, somewhat interrelated considerations that haven't
been meaningfully addressed:
1. What set of objects is it worth trying to remove this race condition
for, and with respect to what dependencies? Bertrand gave no
justification for worrying about function-to-namespace dependencies
specifically, and you've likewise given none for expanding the patch
to consider function-to-datatype dependencies. There are dozens more
cases that could be considered; but I sure don't want to be processing
another ad-hoc fix every time someone thinks they're worried about
another specific case.
Taking a practical viewpoint, I'm far more worried about dependencies
of tables than those of any other kind of object. A messed-up function
definition doesn't cost you much: you can just drop and recreate the
function. A table full of data is way more trouble to recreate, and
indeed the data might be irreplaceable. So it seems pretty silly to
propose that we try to remove race conditions for functions' dependencies
on datatypes without removing the same race condition for tables'
dependencies on their column datatypes.
But any of these options lead to the same question: why stop there?
An approach that would actually be defensible, perhaps, is to incorporate
this functionality into the dependency mechanism: any time we're about to
create a pg_depend or pg_shdepend entry, take out an AccessShareLock on
the referenced object, and then check to make sure it still exists.
This would improve the dependency mechanism so it prevents creation-time
race conditions, not just attempts to drop dependencies of
already-committed objects. It would mean that the one patch would be
the end of the problem, rather than looking forward to a steady drip of
new fixes.
2. How much are we willing to pay for this added protection? The fact
that we've gotten along fine without it for years suggests that the
answer needs to be "not much". But I'm not sure that that actually
is the answer, especially if we don't have a policy that says "we'll
protect against these race conditions but no other ones". I think
there are possibly-serious costs in three different areas:
* Speed. How much do all these additional lock acquisitions slow
down a typical DDL command?
* Number of locks taken per transaction. This'd be particularly an
issue for pg_restore runs using single-transaction mode: they'd take
not only locks on the objects they create, but also a bunch of
reference-protection locks. It's not very hard to believe that this'd
make a difference in whether restoring a large database is possible
without increasing max_locks_per_transaction.
* Risk of deadlock. The reference locks themselves should be sharable,
so maybe there isn't much of a problem, but I want to see this question
seriously analyzed not just ignored.
Obviously, the magnitude of these costs depends a lot on how many
dependencies we want to remove the race condition for. But that's
exactly the reason why I don't want a piecemeal approach of fixing
some problems now and some other problems later. That's way too
much like the old recipe for boiling a frog: we could gradually get
into serious performance problems without anyone ever having stopped
to consider the issue.
In short, I think we should either go for a 100% solution if careful
analysis shows we can afford it, or else have a reasoned policy
why we are going to close these specific race conditions and no others
(implying that we'll reject future patches in the same area). We
haven't got either thing in this thread as it stands, so I do not
think it's anywhere near being ready to commit.
regards, tom lane
Le 18/12/2020 à 00:26, Tom Lane a écrit :
Gilles Darold <gilles@darold.net> writes:
The same problem applies if the returned type or the procedural language
is dropped. I have tried to fix that in ProcedureCreate() by using an
AccessSharedLock for the data types and languages involved in the
function declaration. With this patch the race condition with parameters
types, return types and PL languages are fixed. Only data types and
procedural languages with Oid greater than FirstBootstrapObjectId will
be locked locked. But this is probably more complex than this fix so it
is proposed as a separate patch
(v1-003-orphaned_function_types_language.patch) to not interfere with
the applying of Bertran's patch.Indeed. This points up one of the things that is standing in the way
of any serious consideration of applying this patch. To my mind there
are two fundamental, somewhat interrelated considerations that haven't
been meaningfully addressed:1. What set of objects is it worth trying to remove this race condition
for, and with respect to what dependencies? Bertrand gave no
justification for worrying about function-to-namespace dependencies
specifically, and you've likewise given none for expanding the patch
to consider function-to-datatype dependencies. There are dozens more
cases that could be considered; but I sure don't want to be processing
another ad-hoc fix every time someone thinks they're worried about
another specific case.Taking a practical viewpoint, I'm far more worried about dependencies
of tables than those of any other kind of object. A messed-up function
definition doesn't cost you much: you can just drop and recreate the
function. A table full of data is way more trouble to recreate, and
indeed the data might be irreplaceable. So it seems pretty silly to
propose that we try to remove race conditions for functions' dependencies
on datatypes without removing the same race condition for tables'
dependencies on their column datatypes.But any of these options lead to the same question: why stop there?
An approach that would actually be defensible, perhaps, is to incorporate
this functionality into the dependency mechanism: any time we're about to
create a pg_depend or pg_shdepend entry, take out an AccessShareLock on
the referenced object, and then check to make sure it still exists.
This would improve the dependency mechanism so it prevents creation-time
race conditions, not just attempts to drop dependencies of
already-committed objects. It would mean that the one patch would be
the end of the problem, rather than looking forward to a steady drip of
new fixes.2. How much are we willing to pay for this added protection? The fact
that we've gotten along fine without it for years suggests that the
answer needs to be "not much". But I'm not sure that that actually
is the answer, especially if we don't have a policy that says "we'll
protect against these race conditions but no other ones". I think
there are possibly-serious costs in three different areas:* Speed. How much do all these additional lock acquisitions slow
down a typical DDL command?* Number of locks taken per transaction. This'd be particularly an
issue for pg_restore runs using single-transaction mode: they'd take
not only locks on the objects they create, but also a bunch of
reference-protection locks. It's not very hard to believe that this'd
make a difference in whether restoring a large database is possible
without increasing max_locks_per_transaction.* Risk of deadlock. The reference locks themselves should be sharable,
so maybe there isn't much of a problem, but I want to see this question
seriously analyzed not just ignored.Obviously, the magnitude of these costs depends a lot on how many
dependencies we want to remove the race condition for. But that's
exactly the reason why I don't want a piecemeal approach of fixing
some problems now and some other problems later. That's way too
much like the old recipe for boiling a frog: we could gradually get
into serious performance problems without anyone ever having stopped
to consider the issue.In short, I think we should either go for a 100% solution if careful
analysis shows we can afford it, or else have a reasoned policy
why we are going to close these specific race conditions and no others
(implying that we'll reject future patches in the same area). We
haven't got either thing in this thread as it stands, so I do not
think it's anywhere near being ready to commit.regards, tom lane
Thanks for the review,
Honestly I have never met these races condition in 25 years of
PostgreSQL and will probably never but clearly I don't have the Amazon
database services workload. I let Bertrand decide what to do with the
patch and address the races condition with a more global approach
following your advices if more work is done on this topic. I tag the
patch as "Waiting on author".
Best regards,
--
Gilles Darold
LzLabs GmbH
http://www.lzlabs.com/
Hi Bertrand,
On Fri, Dec 18, 2020 at 6:52 PM Gilles Darold <gilles@darold.net> wrote:
Le 18/12/2020 à 00:26, Tom Lane a écrit :
Gilles Darold <gilles@darold.net> writes:
The same problem applies if the returned type or the procedural language
is dropped. I have tried to fix that in ProcedureCreate() by using an
AccessSharedLock for the data types and languages involved in the
function declaration. With this patch the race condition with parameters
types, return types and PL languages are fixed. Only data types and
procedural languages with Oid greater than FirstBootstrapObjectId will
be locked locked. But this is probably more complex than this fix so it
is proposed as a separate patch
(v1-003-orphaned_function_types_language.patch) to not interfere with
the applying of Bertran's patch.Indeed. This points up one of the things that is standing in the way
of any serious consideration of applying this patch. To my mind there
are two fundamental, somewhat interrelated considerations that haven't
been meaningfully addressed:1. What set of objects is it worth trying to remove this race condition
for, and with respect to what dependencies? Bertrand gave no
justification for worrying about function-to-namespace dependencies
specifically, and you've likewise given none for expanding the patch
to consider function-to-datatype dependencies. There are dozens more
cases that could be considered; but I sure don't want to be processing
another ad-hoc fix every time someone thinks they're worried about
another specific case.Taking a practical viewpoint, I'm far more worried about dependencies
of tables than those of any other kind of object. A messed-up function
definition doesn't cost you much: you can just drop and recreate the
function. A table full of data is way more trouble to recreate, and
indeed the data might be irreplaceable. So it seems pretty silly to
propose that we try to remove race conditions for functions' dependencies
on datatypes without removing the same race condition for tables'
dependencies on their column datatypes.But any of these options lead to the same question: why stop there?
An approach that would actually be defensible, perhaps, is to incorporate
this functionality into the dependency mechanism: any time we're about to
create a pg_depend or pg_shdepend entry, take out an AccessShareLock on
the referenced object, and then check to make sure it still exists.
This would improve the dependency mechanism so it prevents creation-time
race conditions, not just attempts to drop dependencies of
already-committed objects. It would mean that the one patch would be
the end of the problem, rather than looking forward to a steady drip of
new fixes.2. How much are we willing to pay for this added protection? The fact
that we've gotten along fine without it for years suggests that the
answer needs to be "not much". But I'm not sure that that actually
is the answer, especially if we don't have a policy that says "we'll
protect against these race conditions but no other ones". I think
there are possibly-serious costs in three different areas:* Speed. How much do all these additional lock acquisitions slow
down a typical DDL command?* Number of locks taken per transaction. This'd be particularly an
issue for pg_restore runs using single-transaction mode: they'd take
not only locks on the objects they create, but also a bunch of
reference-protection locks. It's not very hard to believe that this'd
make a difference in whether restoring a large database is possible
without increasing max_locks_per_transaction.* Risk of deadlock. The reference locks themselves should be sharable,
so maybe there isn't much of a problem, but I want to see this question
seriously analyzed not just ignored.Obviously, the magnitude of these costs depends a lot on how many
dependencies we want to remove the race condition for. But that's
exactly the reason why I don't want a piecemeal approach of fixing
some problems now and some other problems later. That's way too
much like the old recipe for boiling a frog: we could gradually get
into serious performance problems without anyone ever having stopped
to consider the issue.In short, I think we should either go for a 100% solution if careful
analysis shows we can afford it, or else have a reasoned policy
why we are going to close these specific race conditions and no others
(implying that we'll reject future patches in the same area). We
haven't got either thing in this thread as it stands, so I do not
think it's anywhere near being ready to commit.regards, tom lane
Thanks for the review,
Honestly I have never met these races condition in 25 years of
PostgreSQL and will probably never but clearly I don't have the Amazon
database services workload. I let Bertrand decide what to do with the
patch and address the races condition with a more global approach
following your advices if more work is done on this topic. I tag the
patch as "Waiting on author".
Status update for a commitfest entry.
This patch has been "Waiting on Author" and inactive for almost 2
months. Are you planning to work on address the review comments Tom
sent? This is categorized as bug fixes on CF app but I'm going to set
it to "Returned with Feedback" barring objections as the patch got
feedback but there has been no activity for a long time.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Feb 1, 2021 at 11:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Hi Bertrand,
On Fri, Dec 18, 2020 at 6:52 PM Gilles Darold <gilles@darold.net> wrote:
Le 18/12/2020 à 00:26, Tom Lane a écrit :
Gilles Darold <gilles@darold.net> writes:
The same problem applies if the returned type or the procedural language
is dropped. I have tried to fix that in ProcedureCreate() by using an
AccessSharedLock for the data types and languages involved in the
function declaration. With this patch the race condition with parameters
types, return types and PL languages are fixed. Only data types and
procedural languages with Oid greater than FirstBootstrapObjectId will
be locked locked. But this is probably more complex than this fix so it
is proposed as a separate patch
(v1-003-orphaned_function_types_language.patch) to not interfere with
the applying of Bertran's patch.Indeed. This points up one of the things that is standing in the way
of any serious consideration of applying this patch. To my mind there
are two fundamental, somewhat interrelated considerations that haven't
been meaningfully addressed:1. What set of objects is it worth trying to remove this race condition
for, and with respect to what dependencies? Bertrand gave no
justification for worrying about function-to-namespace dependencies
specifically, and you've likewise given none for expanding the patch
to consider function-to-datatype dependencies. There are dozens more
cases that could be considered; but I sure don't want to be processing
another ad-hoc fix every time someone thinks they're worried about
another specific case.Taking a practical viewpoint, I'm far more worried about dependencies
of tables than those of any other kind of object. A messed-up function
definition doesn't cost you much: you can just drop and recreate the
function. A table full of data is way more trouble to recreate, and
indeed the data might be irreplaceable. So it seems pretty silly to
propose that we try to remove race conditions for functions' dependencies
on datatypes without removing the same race condition for tables'
dependencies on their column datatypes.But any of these options lead to the same question: why stop there?
An approach that would actually be defensible, perhaps, is to incorporate
this functionality into the dependency mechanism: any time we're about to
create a pg_depend or pg_shdepend entry, take out an AccessShareLock on
the referenced object, and then check to make sure it still exists.
This would improve the dependency mechanism so it prevents creation-time
race conditions, not just attempts to drop dependencies of
already-committed objects. It would mean that the one patch would be
the end of the problem, rather than looking forward to a steady drip of
new fixes.2. How much are we willing to pay for this added protection? The fact
that we've gotten along fine without it for years suggests that the
answer needs to be "not much". But I'm not sure that that actually
is the answer, especially if we don't have a policy that says "we'll
protect against these race conditions but no other ones". I think
there are possibly-serious costs in three different areas:* Speed. How much do all these additional lock acquisitions slow
down a typical DDL command?* Number of locks taken per transaction. This'd be particularly an
issue for pg_restore runs using single-transaction mode: they'd take
not only locks on the objects they create, but also a bunch of
reference-protection locks. It's not very hard to believe that this'd
make a difference in whether restoring a large database is possible
without increasing max_locks_per_transaction.* Risk of deadlock. The reference locks themselves should be sharable,
so maybe there isn't much of a problem, but I want to see this question
seriously analyzed not just ignored.Obviously, the magnitude of these costs depends a lot on how many
dependencies we want to remove the race condition for. But that's
exactly the reason why I don't want a piecemeal approach of fixing
some problems now and some other problems later. That's way too
much like the old recipe for boiling a frog: we could gradually get
into serious performance problems without anyone ever having stopped
to consider the issue.In short, I think we should either go for a 100% solution if careful
analysis shows we can afford it, or else have a reasoned policy
why we are going to close these specific race conditions and no others
(implying that we'll reject future patches in the same area). We
haven't got either thing in this thread as it stands, so I do not
think it's anywhere near being ready to commit.regards, tom lane
Thanks for the review,
Honestly I have never met these races condition in 25 years of
PostgreSQL and will probably never but clearly I don't have the Amazon
database services workload. I let Bertrand decide what to do with the
patch and address the races condition with a more global approach
following your advices if more work is done on this topic. I tag the
patch as "Waiting on author".Status update for a commitfest entry.
This patch has been "Waiting on Author" and inactive for almost 2
months. Are you planning to work on address the review comments Tom
sent? This is categorized as bug fixes on CF app but I'm going to set
it to "Returned with Feedback" barring objections as the patch got
feedback but there has been no activity for a long time.
This patch, which you submitted to this CommitFest, has been awaiting
your attention for more than one month. As such, we have moved it to
"Returned with Feedback" and removed it from the reviewing queue.
Depending on timing, this may be reversable, so let us know if there
are extenuating circumstances. In any case, you are welcome to address
the feedback you have received, and resubmit the patch to the next
CommitFest.
Thank you for contributing to PostgreSQL.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi Sawada,
On 2/1/21 2:37 PM, Masahiko Sawada wrote:
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
On Mon, Feb 1, 2021 at 11:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Status update for a commitfest entry.
This patch has been "Waiting on Author" and inactive for almost 2
months. Are you planning to work on address the review comments Tom
sent? This is categorized as bug fixes on CF app but I'm going to set
it to "Returned with Feedback" barring objections as the patch got
feedback but there has been no activity for a long time.This patch, which you submitted to this CommitFest, has been awaiting
your attention for more than one month. As such, we have moved it to
"Returned with Feedback" and removed it from the reviewing queue.
Depending on timing, this may be reversable, so let us know if there
are extenuating circumstances. In any case, you are welcome to address
the feedback you have received, and resubmit the patch to the next
CommitFest.
I had in mind to reply to Tom's comments later this week (sorry for the
delay).
Please let me know if it's possible to put it back to the reviewing
queue (if not I'll open a new thread).
Thanks
Bertrand
On Mon, Feb 1, 2021 at 11:06 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
Hi Sawada,
On 2/1/21 2:37 PM, Masahiko Sawada wrote:
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
On Mon, Feb 1, 2021 at 11:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Status update for a commitfest entry.
This patch has been "Waiting on Author" and inactive for almost 2
months. Are you planning to work on address the review comments Tom
sent? This is categorized as bug fixes on CF app but I'm going to set
it to "Returned with Feedback" barring objections as the patch got
feedback but there has been no activity for a long time.This patch, which you submitted to this CommitFest, has been awaiting
your attention for more than one month. As such, we have moved it to
"Returned with Feedback" and removed it from the reviewing queue.
Depending on timing, this may be reversable, so let us know if there
are extenuating circumstances. In any case, you are welcome to address
the feedback you have received, and resubmit the patch to the next
CommitFest.I had in mind to reply to Tom's comments later this week (sorry for the
delay).Please let me know if it's possible to put it back to the reviewing
queue (if not I'll open a new thread).
Thank you for letting me know.
I've moved this patch to the next commitfest. Please check the
status on the CF app.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
+ Jim and Nathan.
On 12/18/20 12:26 AM, Tom Lane wrote:
Gilles Darold <gilles@darold.net> writes:
The same problem applies if the returned type or the procedural language
is dropped. I have tried to fix that in ProcedureCreate() by using an
AccessSharedLock for the data types and languages involved in the
function declaration. With this patch the race condition with parameters
types, return types and PL languages are fixed. Only data types and
procedural languages with Oid greater than FirstBootstrapObjectId will
be locked locked. But this is probably more complex than this fix so it
is proposed as a separate patch
(v1-003-orphaned_function_types_language.patch) to not interfere with
the applying of Bertran's patch.Indeed. This points up one of the things that is standing in the way
of any serious consideration of applying this patch. To my mind there
are two fundamental, somewhat interrelated considerations that haven't
been meaningfully addressed:1. What set of objects is it worth trying to remove this race condition
for, and with respect to what dependencies? Bertrand gave no
justification for worrying about function-to-namespace dependencies
specifically,
I focused on this one because this is the most (if not the unique) one
we have observed so far.
Those orphaned functions are breaking stuff like pg_dump, pg_upgrade and
that's how we discovered them.
That being said, I agree that there is no reason to focus only on this
ones and we should better try to "fix" them all.
and you've likewise given none for expanding the patch
to consider function-to-datatype dependencies. There are dozens more
cases that could be considered; but I sure don't want to be processing
another ad-hoc fix every time someone thinks they're worried about
another specific case.Taking a practical viewpoint, I'm far more worried about dependencies
of tables than those of any other kind of object. A messed-up function
definition doesn't cost you much: you can just drop and recreate the
function. A table full of data is way more trouble to recreate, and
indeed the data might be irreplaceable. So it seems pretty silly to
propose that we try to remove race conditions for functions' dependencies
on datatypes without removing the same race condition for tables'
dependencies on their column datatypes.But any of these options lead to the same question: why stop there?
An approach that would actually be defensible, perhaps, is to incorporate
this functionality into the dependency mechanism: any time we're about to
create a pg_depend or pg_shdepend entry, take out an AccessShareLock on
the referenced object, and then check to make sure it still exists.
This would improve the dependency mechanism so it prevents creation-time
race conditions, not just attempts to drop dependencies of
already-committed objects. It would mean that the one patch would be
the end of the problem, rather than looking forward to a steady drip of
new fixes.
Agree that working with pg_depend and pg_shdepend is the way to go.
Instead of using the locking machinery (and then make the one that would
currently produce the orphaned object waiting), Jim proposed another
approach: make use of special snapshot (like a dirty one depending of
the case).
That way instead of locking we could instead report an error, something
like this:
postgres=# drop schema tobeorph;
ERROR: cannot drop schema tobeorph because other objects that are
currently under creation depend on it
DETAIL: function under creation tobeorph.bdttime() depends on schema
tobeorph
2. How much are we willing to pay for this added protection? The fact
that we've gotten along fine without it for years suggests that the
answer needs to be "not much". But I'm not sure that that actually
is the answer, especially if we don't have a policy that says "we'll
protect against these race conditions but no other ones". I think
there are possibly-serious costs in three different areas:* Speed. How much do all these additional lock acquisitions slow
down a typical DDL command?* Number of locks taken per transaction. This'd be particularly an
issue for pg_restore runs using single-transaction mode: they'd take
not only locks on the objects they create, but also a bunch of
reference-protection locks. It's not very hard to believe that this'd
make a difference in whether restoring a large database is possible
without increasing max_locks_per_transaction.* Risk of deadlock. The reference locks themselves should be sharable,
so maybe there isn't much of a problem, but I want to see this question
seriously analyzed not just ignored.
All of this should be mitigated with the "special snapshot" approach.
Obviously, the magnitude of these costs depends a lot on how many
dependencies we want to remove the race condition for. But that's
exactly the reason why I don't want a piecemeal approach of fixing
some problems now and some other problems later. That's way too
much like the old recipe for boiling a frog: we could gradually get
into serious performance problems without anyone ever having stopped
to consider the issue.In short, I think we should either go for a 100% solution if careful
analysis shows we can afford it, or else have a reasoned policy
why we are going to close these specific race conditions and no others
(implying that we'll reject future patches in the same area).
Agree.
What about taking the 100% solution way with the "special snapshot"
approach?
If you think that could make sense then we (Jim, Nathan and I) can work
on a patch with this approach and come back with a proposal.
Bertrand
On 2021-Feb-02, Drouvot, Bertrand wrote:
On 12/18/20 12:26 AM, Tom Lane wrote:
But any of these options lead to the same question: why stop there?
An approach that would actually be defensible, perhaps, is to incorporate
this functionality into the dependency mechanism: any time we're about to
create a pg_depend or pg_shdepend entry, take out an AccessShareLock on
the referenced object, and then check to make sure it still exists.
This would improve the dependency mechanism so it prevents creation-time
race conditions, not just attempts to drop dependencies of
already-committed objects.Agree that working with pg_depend and pg_shdepend is the way to go.
Instead of using the locking machinery (and then make the one that
would currently produce the orphaned object waiting), Jim proposed
another approach: make use of special snapshot (like a dirty one
depending of the case).That way instead of locking we could instead report an error,
something like this:postgres=# drop schema tobeorph;
ERROR:� cannot drop schema tobeorph because other objects that are currently under creation depend on it
DETAIL:� function under creation tobeorph.bdttime() depends on schema tobeorph
Sounds like an idea worth trying. What are the semantics of that
special snapshot? Why do we need a special snapshot at all -- doesn't
CatalogSnapshot serve?
This item is classified as a bug-fix in the commitfest, but it doesn't
sound like something we can actually back-patch. In that light, maybe
it's not an actual bugfix, but a new safety feature.
--
�lvaro Herrera Valdivia, Chile
Hi Alvaro,
Thanks for the feedback!
On 3/27/21 3:13 PM, Alvaro Herrera wrote:
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
On 2021-Feb-02, Drouvot, Bertrand wrote:
On 12/18/20 12:26 AM, Tom Lane wrote:
But any of these options lead to the same question: why stop there?
An approach that would actually be defensible, perhaps, is to incorporate
this functionality into the dependency mechanism: any time we're about to
create a pg_depend or pg_shdepend entry, take out an AccessShareLock on
the referenced object, and then check to make sure it still exists.
This would improve the dependency mechanism so it prevents creation-time
race conditions, not just attempts to drop dependencies of
already-committed objects.Agree that working with pg_depend and pg_shdepend is the way to go.
Instead of using the locking machinery (and then make the one that
would currently produce the orphaned object waiting), Jim proposed
another approach: make use of special snapshot (like a dirty one
depending of the case).That way instead of locking we could instead report an error,
something like this:postgres=# drop schema tobeorph;
ERROR: cannot drop schema tobeorph because other objects that are currently under creation depend on it
DETAIL: function under creation tobeorph.bdttime() depends on schema tobeorphSounds like an idea worth trying.
Great, if no objections is coming then I'll work on a patch proposal.
What are the semantics of that
special snapshot? Why do we need a special snapshot at all -- doesn't
CatalogSnapshot serve?
Yes, the CatalogSnapshot should serve the need.
By "special" I meant "dirty" for example.
This item is classified as a bug-fix in the commitfest, but it doesn't
sound like something we can actually back-patch.
Why couldn't it be back-patchable?
Thanks
Bertrand
On 3/29/21 7:20 AM, Drouvot, Bertrand wrote:
This item is classified as a bug-fix in the commitfest, but it doesn't
sound like something we can actually back-patch.
Since it looks like a new patch is required and this probably should not
be classified as a bug fix, marking Returned with Feedback.
Please resubmit to the next CF when you have a new patch.
Regards,
--
-David
david@pgmasters.net