pg_get_expr locking

Started by Peter Eisentrautalmost 2 years ago5 messages
#1Peter Eisentraut
peter@eisentraut.org

The function pg_get_expr(), which is used in various system views and
information schema views, does not appear to lock the table passed as
the second argument, and so appears to be liable to fail if there is a
concurrent drop of the table. There is a (probable) case of this being
discussed at [0]/messages/by-id/9ec24d7b-633d-463a-84c6-7acff769c9e8@eisentraut.org. I also see various mentions of this issue in the
commit logs, mostly related to pg_dump.

Is there a reason there is no locking? Performance?

What workaround should we use if there are conflicts created by
concurrent regression tests? Just move the tests around a bit until the
issue goes away?

[0]: /messages/by-id/9ec24d7b-633d-463a-84c6-7acff769c9e8@eisentraut.org
/messages/by-id/9ec24d7b-633d-463a-84c6-7acff769c9e8@eisentraut.org

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#1)
Re: pg_get_expr locking

Peter Eisentraut <peter@eisentraut.org> writes:

The function pg_get_expr(), which is used in various system views and
information schema views, does not appear to lock the table passed as
the second argument, and so appears to be liable to fail if there is a
concurrent drop of the table. There is a (probable) case of this being
discussed at [0]. I also see various mentions of this issue in the
commit logs, mostly related to pg_dump.

Is there a reason there is no locking? Performance?

I think we have a general rule that you shouldn't be able to take
locks on relations you have no privileges for, so pg_get_expr would
need to verify privileges if it locked the rel. At least for
pg_dump's purposes, that cure would be about as bad as the disease.

What workaround should we use if there are conflicts created by
concurrent regression tests? Just move the tests around a bit until the
issue goes away?

Why would a test be applying pg_get_expr to a table it doesn't
control?

regards, tom lane

#3Peter Eisentraut
peter@eisentraut.org
In reply to: Tom Lane (#2)
Re: pg_get_expr locking

On 07.02.24 16:26, Tom Lane wrote:

What workaround should we use if there are conflicts created by
concurrent regression tests? Just move the tests around a bit until the
issue goes away?

Why would a test be applying pg_get_expr to a table it doesn't
control?

I think the situation is that one test (domain) runs pg_get_expr as part
of an information_schema view, while at the same time another test
(alter_table) drops a table that the pg_get_expr is just processing.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#3)
Re: pg_get_expr locking

Peter Eisentraut <peter@eisentraut.org> writes:

On 07.02.24 16:26, Tom Lane wrote:

Why would a test be applying pg_get_expr to a table it doesn't
control?

I think the situation is that one test (domain) runs pg_get_expr as part
of an information_schema view, while at the same time another test
(alter_table) drops a table that the pg_get_expr is just processing.

The test case that's failing is, IIUC,

+SELECT * FROM information_schema.domain_constraints
+  WHERE domain_name IN ('con', 'dom', 'pos_int', 'things')
+  ORDER BY constraint_name;

I see no use of pg_get_expr() in the domain_constraints view:

CREATE VIEW domain_constraints AS
SELECT CAST(current_database() AS sql_identifier) AS constraint_catalog,
CAST(rs.nspname AS sql_identifier) AS constraint_schema,
CAST(con.conname AS sql_identifier) AS constraint_name,
CAST(current_database() AS sql_identifier) AS domain_catalog,
CAST(n.nspname AS sql_identifier) AS domain_schema,
CAST(t.typname AS sql_identifier) AS domain_name,
CAST(CASE WHEN condeferrable THEN 'YES' ELSE 'NO' END
AS yes_or_no) AS is_deferrable,
CAST(CASE WHEN condeferred THEN 'YES' ELSE 'NO' END
AS yes_or_no) AS initially_deferred
FROM pg_namespace rs, pg_namespace n, pg_constraint con, pg_type t
WHERE rs.oid = con.connamespace
AND n.oid = t.typnamespace
AND t.oid = con.contypid
AND (pg_has_role(t.typowner, 'USAGE')
OR has_type_privilege(t.oid, 'USAGE'));

I'm a little suspicious that the failure is actually coming from
somewhere down inside has_type_privilege(), but I traced through
that quickly and don't see how it could reach such an error.
In any case I thought we'd hardened all those functions in 403ac226d.
So I'm still pretty mystified. Have you had any luck in making
the failure reproducible?

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
1 attachment(s)
Re: pg_get_expr locking

I wrote:

Peter Eisentraut <peter@eisentraut.org> writes:

I think the situation is that one test (domain) runs pg_get_expr as part
of an information_schema view, while at the same time another test
(alter_table) drops a table that the pg_get_expr is just processing.

The test case that's failing is, IIUC,

+SELECT * FROM information_schema.domain_constraints
+  WHERE domain_name IN ('con', 'dom', 'pos_int', 'things')
+  ORDER BY constraint_name;

Oh, scratch that: there are two confusingly lookalike queries
in the patch. The one that is failing is

SELECT * FROM information_schema.check_constraints
WHERE (constraint_schema, constraint_name)
IN (SELECT constraint_schema, constraint_name
FROM information_schema.domain_constraints
WHERE domain_name IN ('con', 'dom', 'pos_int', 'things'))
ORDER BY constraint_name;

and we have trouble because the evaluation of pg_get_expr in
check_constraints is done before the semijoin that would restrict
it to just the desired objects.

After looking at the code I'm less worried about the
permissions-checking angle than I was before, because I see
that pg_get_expr already takes a transient AccessShareLock
on the rel, down inside set_relation_column_names. This is
not ideal from a permissions standpoint perhaps, but it's
probably OK considering we've done that for a long time.
We just need to hold that lock a little while longer.

I propose the attached as a reasonably localized fix.
We could imagine a more aggressive refactoring that would
allow passing down the Relation instead of re-opening it
in set_relation_column_names, but I doubt it's worth the
trouble.

regards, tom lane

Attachments:

remove-pg_get_expr-race-condition.patchtext/x-diff; charset=us-ascii; name=remove-pg_get_expr-race-condition.patchDownload
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index b625f471a8..644966721f 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -352,8 +352,7 @@ static char *pg_get_partkeydef_worker(Oid relid, int prettyFlags,
 									  bool attrsOnly, bool missing_ok);
 static char *pg_get_constraintdef_worker(Oid constraintId, bool fullCommand,
 										 int prettyFlags, bool missing_ok);
-static text *pg_get_expr_worker(text *expr, Oid relid, const char *relname,
-								int prettyFlags);
+static text *pg_get_expr_worker(text *expr, Oid relid, int prettyFlags);
 static int	print_function_arguments(StringInfo buf, HeapTuple proctup,
 									 bool print_table_args, bool print_defaults);
 static void print_function_rettype(StringInfo buf, HeapTuple proctup);
@@ -2615,6 +2622,11 @@ decompile_column_index_array(Datum column_index_array, Oid relId,
  * partial indexes, column default expressions, etc.  We also support
  * Var-free expressions, for which the OID can be InvalidOid.
  *
+ * If the OID is nonzero but not actually valid, don't throw an error,
+ * just return NULL.  This is a bit questionable, but it's what we've
+ * done historically, and it can help avoid unwanted failures when
+ * examining catalog entries for just-deleted relations.
+ *
  * We expect this function to work, or throw a reasonably clean error,
  * for any node tree that can appear in a catalog pg_node_tree column.
  * Query trees, such as those appearing in pg_rewrite.ev_action, are
@@ -2627,29 +2639,16 @@ pg_get_expr(PG_FUNCTION_ARGS)
 {
 	text	   *expr = PG_GETARG_TEXT_PP(0);
 	Oid			relid = PG_GETARG_OID(1);
+	text	   *result;
 	int			prettyFlags;
-	char	   *relname;
 
 	prettyFlags = PRETTYFLAG_INDENT;
 
-	if (OidIsValid(relid))
-	{
-		/* Get the name for the relation */
-		relname = get_rel_name(relid);
-
-		/*
-		 * If the OID isn't actually valid, don't throw an error, just return
-		 * NULL.  This is a bit questionable, but it's what we've done
-		 * historically, and it can help avoid unwanted failures when
-		 * examining catalog entries for just-deleted relations.
-		 */
-		if (relname == NULL)
-			PG_RETURN_NULL();
-	}
+	result = pg_get_expr_worker(expr, relid, prettyFlags);
+	if (result)
+		PG_RETURN_TEXT_P(result);
 	else
-		relname = NULL;
-
-	PG_RETURN_TEXT_P(pg_get_expr_worker(expr, relid, relname, prettyFlags));
+		PG_RETURN_NULL();
 }
 
 Datum
@@ -2658,33 +2657,27 @@ pg_get_expr_ext(PG_FUNCTION_ARGS)
 	text	   *expr = PG_GETARG_TEXT_PP(0);
 	Oid			relid = PG_GETARG_OID(1);
 	bool		pretty = PG_GETARG_BOOL(2);
+	text	   *result;
 	int			prettyFlags;
-	char	   *relname;
 
 	prettyFlags = GET_PRETTY_FLAGS(pretty);
 
-	if (OidIsValid(relid))
-	{
-		/* Get the name for the relation */
-		relname = get_rel_name(relid);
-		/* See notes above */
-		if (relname == NULL)
-			PG_RETURN_NULL();
-	}
+	result = pg_get_expr_worker(expr, relid, prettyFlags);
+	if (result)
+		PG_RETURN_TEXT_P(result);
 	else
-		relname = NULL;
-
-	PG_RETURN_TEXT_P(pg_get_expr_worker(expr, relid, relname, prettyFlags));
+		PG_RETURN_NULL();
 }
 
 static text *
-pg_get_expr_worker(text *expr, Oid relid, const char *relname, int prettyFlags)
+pg_get_expr_worker(text *expr, Oid relid, int prettyFlags)
 {
 	Node	   *node;
 	Node	   *tst;
 	Relids		relids;
 	List	   *context;
 	char	   *exprstr;
+	Relation	rel = NULL;
 	char	   *str;
 
 	/* Convert input pg_node_tree (really TEXT) object to C string */
@@ -2729,9 +2722,19 @@ pg_get_expr_worker(text *expr, Oid relid, const char *relname, int prettyFlags)
 					 errmsg("expression contains variables")));
 	}
 
-	/* Prepare deparse context if needed */
+	/*
+	 * Prepare deparse context if needed.  If we are deparsing with a relid,
+	 * we need to transiently open and lock the rel, to make sure it won't go
+	 * away underneath us.  (set_relation_column_names would lock it anyway,
+	 * so this isn't really introducing any new behavior.)
+	 */
 	if (OidIsValid(relid))
-		context = deparse_context_for(relname, relid);
+	{
+		rel = try_relation_open(relid, AccessShareLock);
+		if (rel == NULL)
+			return NULL;
+		context = deparse_context_for(RelationGetRelationName(rel), relid);
+	}
 	else
 		context = NIL;
 
@@ -2739,6 +2742,9 @@ pg_get_expr_worker(text *expr, Oid relid, const char *relname, int prettyFlags)
 	str = deparse_expression_pretty(node, context, false, false,
 									prettyFlags, 0);
 
+	if (rel != NULL)
+		relation_close(rel, AccessShareLock);
+
 	return string_to_text(str);
 }