pg_upgrade check for invalid databases

Started by Thomas Krennwallnerover 1 year ago16 messages
#1Thomas Krennwallner
tk@postsubmeta.net
2 attachment(s)

Hi,

if a cluster contains invalid databases that we cannot connect to
anymore, pg_upgrade would currently fail when trying to connect to the
first encountered invalid database with

Performing Consistency Checks
-----------------------------
Checking cluster versions ok

connection failure: connection to server on socket "/tmp/.s.PGSQL.50432"
failed: FATAL: cannot connect to invalid database "foo"
HINT: Use DROP DATABASE to drop invalid databases.

Failure, exiting

If there is more than one invalid database, we need to run pg_upgrade
more than once (unless the user inspects pg_database).

I attached two small patches for PG 17 and PG 18 (can be easily
backported to all previous versions upon request). Instead of just
failing to connect with an error, we collect all invalid databases in a
report file invalid_databases.txt:

Performing Consistency Checks
-----------------------------
Checking cluster versions ok
Checking for invalid databases fatal

Your installation contains invalid databases as a consequence of
interrupted DROP DATABASE. They are now marked as corrupted databases
that cannot be connected to anymore. Consider removing them using
DROP DATABASE ...;
A list of invalid databases is in the file:

/usr/local/pgsql/data/18/pg_upgrade_output.d/20240929T200559.707/invalid_databases.txt
Failure, exiting

Any thoughts on the proposed patches?

Attachments:

pg18-v1-0001-pg_upgrade-Add-check-for-invalid-databases.patchtext/x-patch; charset=UTF-8; name=pg18-v1-0001-pg_upgrade-Add-check-for-invalid-databases.patchDownload
From c44f723c033573fa5da72721595bc135a00e3cbf Mon Sep 17 00:00:00 2001
From: Thomas Krennwallner <teakay@aiven.io>
Date: Fri, 20 Sep 2024 17:33:05 -0400
Subject: [PATCH] pg_upgrade: Add check for invalid databases.

Currently, pg_upgrade fails to connect to the first invalid database
it encounters in get_loadable_libraries() and then aborts.  While we
print a fatal message with a hint what should be done, the output is
terse and does not report further invalid databases present in an
installation.

Instead of just exiting on first error, we collect all
pg_database.datconnlimit values in get_db_infos(), ignore invalid
databases in process_slot(), and then perform check_for_invalid_dbs(),
which collects all invalid databases in a report file
invalid_databases.txt.
---
 src/bin/pg_upgrade/check.c             | 62 ++++++++++++++++++++++++++
 src/bin/pg_upgrade/info.c              |  5 ++-
 src/bin/pg_upgrade/pg_upgrade.h        |  2 +
 src/bin/pg_upgrade/t/002_pg_upgrade.pl |  2 +-
 src/bin/pg_upgrade/task.c              | 32 +++++++++++--
 5 files changed, 98 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 12735a4268..b5b0f7cd6f 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -12,6 +12,7 @@
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_collation.h"
+#include "catalog/pg_database.h"
 #include "fe_utils/string_utils.h"
 #include "mb/pg_wchar.h"
 #include "pg_upgrade.h"
@@ -19,6 +20,7 @@
 static void check_new_cluster_is_empty(void);
 static void check_is_install_user(ClusterInfo *cluster);
 static void check_proper_datallowconn(ClusterInfo *cluster);
+static void check_for_invalid_dbs(ClusterInfo *cluster);
 static void check_for_prepared_transactions(ClusterInfo *cluster);
 static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
 static void check_for_user_defined_postfix_ops(ClusterInfo *cluster);
@@ -598,6 +600,13 @@ check_and_dump_old_cluster(void)
 	 */
 	get_db_rel_and_slot_infos(&old_cluster);
 
+	/*
+	 * Verify that the old cluster does not reference invalid databases.
+	 * Connections to such databases are not allowed and break the following
+	 * checks.
+	 */
+	check_for_invalid_dbs(&old_cluster);
+
 	init_tablespaces();
 
 	get_loadable_libraries();
@@ -691,6 +700,13 @@ check_new_cluster(void)
 {
 	get_db_rel_and_slot_infos(&new_cluster);
 
+	/*
+	 * Verify that the new cluster does not reference invalid databases.
+	 * Connections to such databases are not allowed and break the following
+	 * checks.
+	 */
+	check_for_invalid_dbs(&new_cluster);
+
 	check_new_cluster_is_empty();
 
 	check_loadable_libraries();
@@ -1087,6 +1103,52 @@ check_is_install_user(ClusterInfo *cluster)
 	check_ok();
 }
 
+/*
+ *	check_for_invalid_dbs
+ *
+ *	Ensure that all databases are valid as connections to invalid databases
+ *	are not allowed.
+ */
+static void
+check_for_invalid_dbs(ClusterInfo *cluster)
+{
+	int			dbnum;
+	FILE	   *script = NULL;
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for invalid databases");
+
+	snprintf(output_path, sizeof(output_path), "%s/%s",
+			 log_opts.basedir,
+			 "invalid_databases.txt");
+
+	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+	{
+		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
+
+		if (active_db->db_connlimit == DATCONNLIMIT_INVALID_DB)
+		{
+			if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+				pg_fatal("could not open file \"%s\": %m", output_path);
+			fprintf(script, "%s\n", active_db->db_name);
+		}
+	}
+
+	if (script)
+	{
+		fclose(script);
+		pg_log(PG_REPORT, "fatal");
+		pg_fatal("Your installation contains invalid databases as a consequence of\n"
+				 "interrupted DROP DATABASE.  They are now marked as corrupted databases\n"
+				 "that cannot be connected to anymore.  Consider removing them using\n"
+				 "    DROP DATABASE ...;\n"
+				 "A list of invalid databases is in the file:\n"
+				 "    %s", output_path);
+	}
+	else
+		check_ok();
+}
+
 
 /*
  *	check_proper_datallowconn
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index f83ded89cb..0e2bd40d1a 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -409,12 +409,13 @@ get_db_infos(ClusterInfo *cluster)
 	int			tupnum;
 	DbInfo	   *dbinfos;
 	int			i_datname,
+				i_datconnlimit,
 				i_oid,
 				i_spclocation;
 	char		query[QUERY_ALLOC];
 
 	snprintf(query, sizeof(query),
-			 "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, ");
+			 "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, d.datconnlimit, ");
 	if (GET_MAJOR_VERSION(cluster->major_version) >= 1700)
 		snprintf(query + strlen(query), sizeof(query) - strlen(query),
 				 "datlocprovider, datlocale, ");
@@ -436,6 +437,7 @@ get_db_infos(ClusterInfo *cluster)
 
 	i_oid = PQfnumber(res, "oid");
 	i_datname = PQfnumber(res, "datname");
+	i_datconnlimit = PQfnumber(res, "datconnlimit");
 	i_spclocation = PQfnumber(res, "spclocation");
 
 	ntups = PQntuples(res);
@@ -445,6 +447,7 @@ get_db_infos(ClusterInfo *cluster)
 	{
 		dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
 		dbinfos[tupnum].db_name = pg_strdup(PQgetvalue(res, tupnum, i_datname));
+		dbinfos[tupnum].db_connlimit = atoi(PQgetvalue(res, tupnum, i_datconnlimit));
 		snprintf(dbinfos[tupnum].db_tablespace, sizeof(dbinfos[tupnum].db_tablespace), "%s",
 				 PQgetvalue(res, tupnum, i_spclocation));
 	}
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..b1c77ca6de 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -194,6 +194,8 @@ typedef struct
 	char	   *db_name;		/* database name */
 	char		db_tablespace[MAXPGPATH];	/* database default tablespace
 											 * path */
+	int			db_connlimit;	/* database invalid if set to
+								 * DATCONNLIMIT_INVALID_DB */
 	RelInfoArr	rel_arr;		/* array of all user relinfos */
 	LogicalSlotInfoArr slot_arr;	/* array of all LogicalSlotInfo */
 } DbInfo;
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 17af2ce61e..c7da6d27b2 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -424,7 +424,7 @@ SKIP:
 			$mode, '--check',
 		],
 		1,
-		[qr/invalid/],    # pg_upgrade prints errors on stdout :(
+		[qr/invalid_databases\.txt/],    # pg_upgrade prints errors on stdout :(
 		[qr/^$/],
 		'invalid database causes failure');
 	rmtree($newnode->data_dir . "/pg_upgrade_output.d");
diff --git a/src/bin/pg_upgrade/task.c b/src/bin/pg_upgrade/task.c
index ba1726c25e..90de2070a1 100644
--- a/src/bin/pg_upgrade/task.c
+++ b/src/bin/pg_upgrade/task.c
@@ -44,6 +44,7 @@
 
 #include "postgres_fe.h"
 
+#include "catalog/pg_database.h"
 #include "common/connect.h"
 #include "fe_utils/string_utils.h"
 #include "pg_upgrade.h"
@@ -228,6 +229,18 @@ process_query_result(const ClusterInfo *cluster, UpgradeTaskSlot *slot,
 		PQclear(res);
 }
 
+/*
+ * Closes the connection managed by the slot and frees the slot for immediate reuse.
+ */
+static void
+free_slot(UpgradeTaskSlot *slot)
+{
+	if (slot->conn)
+		PQfinish(slot->conn);
+	memset(slot, 0, sizeof(UpgradeTaskSlot));
+	slot->ready = true;
+}
+
 /*
  * Advances the state machine for a given slot as necessary.
  */
@@ -255,6 +268,21 @@ process_slot(const ClusterInfo *cluster, UpgradeTaskSlot *slot, const UpgradeTas
 			 * new connection.
 			 */
 			slot->db_idx = dbs_processing++;
+
+			/*
+			 * Do not try to connect to invalid databases here,
+			 * check_for_invalid_dbs() will report them later. Skip this
+			 * database instead, free the slot and start initiating the next
+			 * connection.
+			 */
+			if (cluster->dbarr.dbs[slot->db_idx].db_connlimit == DATCONNLIMIT_INVALID_DB)
+			{
+				dbs_complete++;
+				free_slot(slot);
+				process_slot(cluster, slot, task);
+				return;
+			}
+
 			slot->state = CONNECTING;
 			start_conn(cluster, slot);
 
@@ -314,9 +342,7 @@ process_slot(const ClusterInfo *cluster, UpgradeTaskSlot *slot, const UpgradeTas
 			 * through the slots.
 			 */
 			dbs_complete++;
-			PQfinish(slot->conn);
-			memset(slot, 0, sizeof(UpgradeTaskSlot));
-			slot->ready = true;
+			free_slot(slot);
 
 			process_slot(cluster, slot, task);
 
-- 
2.46.1

pg17-v1-0001-pg_upgrade-Add-check-for-invalid-databases.patchtext/x-patch; charset=UTF-8; name=pg17-v1-0001-pg_upgrade-Add-check-for-invalid-databases.patchDownload
From bd6da1bbbfe207e91c7ad4fff008da70a99d684c Mon Sep 17 00:00:00 2001
From: Thomas Krennwallner <teakay@aiven.io>
Date: Thu, 12 Sep 2024 16:29:06 -0400
Subject: [PATCH] pg_upgrade: Add check for invalid databases.

Currently, pg_upgrade fails to connect to the first invalid database
it encounters in get_loadable_libraries() and then aborts.  While we
print a fatal message with a hint what should be done, the output is
terse and does not report further invalid databases present in an
installation.

Instead of just exiting on first error, we collect all
pg_database.datconnlimit values in get_db_infos(), ignore invalid
databases in get_db_rel_and_slot_infos(), and then perform
check_for_invalid_dbs(), which collects all invalid databases in a
report file invalid_databases.txt.
---
 src/bin/pg_upgrade/check.c             | 62 ++++++++++++++++++++++++++
 src/bin/pg_upgrade/info.c              | 13 +++++-
 src/bin/pg_upgrade/pg_upgrade.h        |  2 +
 src/bin/pg_upgrade/t/002_pg_upgrade.pl |  2 +-
 4 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 1cf84cc1bb..d56ea52c27 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -12,6 +12,7 @@
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_collation.h"
+#include "catalog/pg_database.h"
 #include "fe_utils/string_utils.h"
 #include "mb/pg_wchar.h"
 #include "pg_upgrade.h"
@@ -19,6 +20,7 @@
 static void check_new_cluster_is_empty(void);
 static void check_is_install_user(ClusterInfo *cluster);
 static void check_proper_datallowconn(ClusterInfo *cluster);
+static void check_for_invalid_dbs(ClusterInfo *cluster);
 static void check_for_prepared_transactions(ClusterInfo *cluster);
 static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
 static void check_for_user_defined_postfix_ops(ClusterInfo *cluster);
@@ -586,6 +588,13 @@ check_and_dump_old_cluster(bool live_check)
 	 */
 	get_db_rel_and_slot_infos(&old_cluster, live_check);
 
+	/*
+	 * Verify that the old cluster does not reference invalid databases.
+	 * Connections to such databases are not allowed and break the following
+	 * checks.
+	 */
+	check_for_invalid_dbs(&old_cluster);
+
 	init_tablespaces();
 
 	get_loadable_libraries();
@@ -679,6 +688,13 @@ check_new_cluster(void)
 {
 	get_db_rel_and_slot_infos(&new_cluster, false);
 
+	/*
+	 * Verify that the new cluster does not reference invalid databases.
+	 * Connections to such databases are not allowed and break the following
+	 * checks.
+	 */
+	check_for_invalid_dbs(&new_cluster);
+
 	check_new_cluster_is_empty();
 
 	check_loadable_libraries();
@@ -1075,6 +1091,52 @@ check_is_install_user(ClusterInfo *cluster)
 	check_ok();
 }
 
+/*
+ *	check_for_invalid_dbs
+ *
+ *	Ensure that all databases are valid as connections to invalid databases
+ *	are not allowed.
+ */
+static void
+check_for_invalid_dbs(ClusterInfo *cluster)
+{
+	int			dbnum;
+	FILE	   *script = NULL;
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for invalid databases");
+
+	snprintf(output_path, sizeof(output_path), "%s/%s",
+			 log_opts.basedir,
+			 "invalid_databases.txt");
+
+	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+	{
+		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
+
+		if (active_db->db_connlimit == DATCONNLIMIT_INVALID_DB)
+		{
+			if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+				pg_fatal("could not open file \"%s\": %m", output_path);
+			fprintf(script, "%s\n", active_db->db_name);
+		}
+	}
+
+	if (script)
+	{
+		fclose(script);
+		pg_log(PG_REPORT, "fatal");
+		pg_fatal("Your installation contains invalid databases as a consequence of\n"
+				 "interrupted DROP DATABASE.  They are now marked as corrupted databases\n"
+				 "that cannot be connected to anymore.  Consider removing them using\n"
+				 "    DROP DATABASE ...;\n"
+				 "A list of invalid databases is in the file:\n"
+				 "    %s", output_path);
+	}
+	else
+		check_ok();
+}
+
 
 /*
  *	check_proper_datallowconn
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index 5c041fa06e..871b1fc30a 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -11,6 +11,7 @@
 
 #include "access/transam.h"
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_database.h"
 #include "pg_upgrade.h"
 
 static void create_rel_filename_map(const char *old_data, const char *new_data,
@@ -290,6 +291,13 @@ get_db_rel_and_slot_infos(ClusterInfo *cluster, bool live_check)
 	{
 		DbInfo	   *pDbInfo = &cluster->dbarr.dbs[dbnum];
 
+		/*
+		 * Do not try to connect to invalid databases here,
+		 * check_for_invalid_dbs() will report them later.
+		 */
+		if (pDbInfo->db_connlimit == DATCONNLIMIT_INVALID_DB)
+			continue;
+
 		get_rel_infos(cluster, pDbInfo);
 
 		if (cluster == &old_cluster)
@@ -384,12 +392,13 @@ get_db_infos(ClusterInfo *cluster)
 	int			tupnum;
 	DbInfo	   *dbinfos;
 	int			i_datname,
+				i_datconnlimit,
 				i_oid,
 				i_spclocation;
 	char		query[QUERY_ALLOC];
 
 	snprintf(query, sizeof(query),
-			 "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, ");
+			 "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, d.datconnlimit, ");
 	if (GET_MAJOR_VERSION(cluster->major_version) >= 1700)
 		snprintf(query + strlen(query), sizeof(query) - strlen(query),
 				 "datlocprovider, datlocale, ");
@@ -411,6 +420,7 @@ get_db_infos(ClusterInfo *cluster)
 
 	i_oid = PQfnumber(res, "oid");
 	i_datname = PQfnumber(res, "datname");
+	i_datconnlimit = PQfnumber(res, "datconnlimit");
 	i_spclocation = PQfnumber(res, "spclocation");
 
 	ntups = PQntuples(res);
@@ -420,6 +430,7 @@ get_db_infos(ClusterInfo *cluster)
 	{
 		dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
 		dbinfos[tupnum].db_name = pg_strdup(PQgetvalue(res, tupnum, i_datname));
+		dbinfos[tupnum].db_connlimit = atoi(PQgetvalue(res, tupnum, i_datconnlimit));
 		snprintf(dbinfos[tupnum].db_tablespace, sizeof(dbinfos[tupnum].db_tablespace), "%s",
 				 PQgetvalue(res, tupnum, i_spclocation));
 	}
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index e2b99b49fa..ed5d45fa06 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -195,6 +195,8 @@ typedef struct
 	char	   *db_name;		/* database name */
 	char		db_tablespace[MAXPGPATH];	/* database default tablespace
 											 * path */
+	int			db_connlimit;	/* database invalid if set to
+								 * DATCONNLIMIT_INVALID_DB */
 	RelInfoArr	rel_arr;		/* array of all user relinfos */
 	LogicalSlotInfoArr slot_arr;	/* array of all LogicalSlotInfo */
 } DbInfo;
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 17af2ce61e..c7da6d27b2 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -424,7 +424,7 @@ SKIP:
 			$mode, '--check',
 		],
 		1,
-		[qr/invalid/],    # pg_upgrade prints errors on stdout :(
+		[qr/invalid_databases\.txt/],    # pg_upgrade prints errors on stdout :(
 		[qr/^$/],
 		'invalid database causes failure');
 	rmtree($newnode->data_dir . "/pg_upgrade_output.d");
-- 
2.46.1

#2Nathan Bossart
nathandbossart@gmail.com
In reply to: Thomas Krennwallner (#1)
Re: pg_upgrade check for invalid databases

On Sun, Sep 29, 2024 at 08:45:50PM -0400, Thomas Krennwallner wrote:

if a cluster contains invalid databases that we cannot connect to anymore,
pg_upgrade would currently fail when trying to connect to the first
encountered invalid database with

[...]

If there is more than one invalid database, we need to run pg_upgrade more
than once (unless the user inspects pg_database).

I attached two small patches for PG 17 and PG 18 (can be easily backported
to all previous versions upon request). Instead of just failing to connect
with an error, we collect all invalid databases in a report file
invalid_databases.txt:

Should we have pg_upgrade skip invalid databases? If the only valid action
is to drop them, IMHO it seems unnecessary to require users to manually
drop them before retrying pg_upgrade.

--
nathan

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#2)
Re: pg_upgrade check for invalid databases

Nathan Bossart <nathandbossart@gmail.com> writes:

Should we have pg_upgrade skip invalid databases? If the only valid action
is to drop them, IMHO it seems unnecessary to require users to manually
drop them before retrying pg_upgrade.

I was thinking the same. But I wonder if there is any chance of
losing data that could be recoverable. It feels like this should
not be a default behavior.

TBH I'm not finding anything very much wrong with the current
behavior... this has to be a rare situation, do we need to add
debatable behavior to make it easier?

regards, tom lane

#4Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#3)
Re: pg_upgrade check for invalid databases

On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:

TBH I'm not finding anything very much wrong with the current
behavior... this has to be a rare situation, do we need to add
debatable behavior to make it easier?

One argument would be to make the checks consistent, pg_upgrade generally tries
to report all the offending entries to help the user when fixing the source
database. Not sure if it's a strong enough argument for carrying code which
really shouldn't see much use though.

--
Daniel Gustafsson

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#4)
Re: pg_upgrade check for invalid databases

Daniel Gustafsson <daniel@yesql.se> writes:

On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
TBH I'm not finding anything very much wrong with the current
behavior... this has to be a rare situation, do we need to add
debatable behavior to make it easier?

One argument would be to make the checks consistent, pg_upgrade generally tries
to report all the offending entries to help the user when fixing the source
database. Not sure if it's a strong enough argument for carrying code which
really shouldn't see much use though.

OK, but the consistency argument would be to just report and fail.
I don't think there's a precedent in other pg_upgrade checks for
trying to fix problems automatically.

regards, tom lane

#6Thomas Krennwallner
tk@postsubmeta.net
In reply to: Daniel Gustafsson (#4)
Re: pg_upgrade check for invalid databases

On 30/09/2024 17.29, Daniel Gustafsson wrote:

On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:

TBH I'm not finding anything very much wrong with the current
behavior... this has to be a rare situation, do we need to add
debatable behavior to make it easier?

One argument would be to make the checks consistent, pg_upgrade generally tries
to report all the offending entries to help the user when fixing the source
database. Not sure if it's a strong enough argument for carrying code which
really shouldn't see much use though.

In general, I agree that this situation should be rare for deliberate
DROP DATABASE interrupted in interactive sessions.

Unfortunately, for (popular) tools that perform automatic "temporary
database" cleanup, we could recently see an increase in invalid databases.

The additional check for pg_upgrade was made necessary due to several
unrelated customers having invalid databases that stem from left-over
Prisma Migrate "shadow databases" [1]https://www.prisma.io/docs/orm/prisma-migrate/understanding-prisma-migrate/shadow-database. We could not reproduce this
Prisma Migrate issue yet, as those migrations happened some time ago.
Maybe this bug really stems from a much older Prisma Migrate version and
we only see the fallout now. This is still a TODO item.

But it appears that this tool can get interrupted "at the wrong time"
while it is deleting temporary databases (probably a manual Ctrl-C), and
clients are unaware that this can then leave behind invalid databases.

Those temporary databases do not cause any harm as they are not used
anymore. But eventually, PG installations will be upgraded to the next
major version, and it is only then when those invalid databases
resurface after pg_upgrade fails to run the checks.

Long story short: interactive DROP DATABASE interrupts are rare (they do
exist, but customers are usually aware). Automation tools on the other
hand may run DROP DATABASE and when they get interrupted at the wrong
time they will then produce several left-over invalid databases.
pg_upgrade will then fail to run the checks.

[1]: https://www.prisma.io/docs/orm/prisma-migrate/understanding-prisma-migrate/shadow-database
https://www.prisma.io/docs/orm/prisma-migrate/understanding-prisma-migrate/shadow-database

#7Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#5)
Re: pg_upgrade check for invalid databases

On 1 Oct 2024, at 00:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Daniel Gustafsson <daniel@yesql.se> writes:

On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
TBH I'm not finding anything very much wrong with the current
behavior... this has to be a rare situation, do we need to add
debatable behavior to make it easier?

One argument would be to make the checks consistent, pg_upgrade generally tries
to report all the offending entries to help the user when fixing the source
database. Not sure if it's a strong enough argument for carrying code which
really shouldn't see much use though.

OK, but the consistency argument would be to just report and fail.
I don't think there's a precedent in other pg_upgrade checks for
trying to fix problems automatically.

Correct, sorry for being unclear. The consistency argument would be to expand
pg_upgrade to report all invalid databases rather than just the first found;
attempting to fix problems would be a new behavior.

--
Daniel Gustafsson

#8Daniel Gustafsson
daniel@yesql.se
In reply to: Thomas Krennwallner (#6)
Re: pg_upgrade check for invalid databases

On 1 Oct 2024, at 02:35, Thomas Krennwallner <tk@postsubmeta.net> wrote:

On 30/09/2024 17.29, Daniel Gustafsson wrote:

On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
TBH I'm not finding anything very much wrong with the current
behavior... this has to be a rare situation, do we need to add
debatable behavior to make it easier?

One argument would be to make the checks consistent, pg_upgrade generally tries
to report all the offending entries to help the user when fixing the source
database. Not sure if it's a strong enough argument for carrying code which
really shouldn't see much use though.

In general, I agree that this situation should be rare for deliberate DROP DATABASE interrupted in interactive sessions.

Unfortunately, for (popular) tools that perform automatic "temporary database" cleanup, we could recently see an increase in invalid databases.

The additional check for pg_upgrade was made necessary due to several unrelated customers having invalid databases that stem from left-over Prisma Migrate "shadow databases" [1]. We could not reproduce this Prisma Migrate issue yet, as those migrations happened some time ago. Maybe this bug really stems from a much older Prisma Migrate version and we only see the fallout now. This is still a TODO item.

But it appears that this tool can get interrupted "at the wrong time" while it is deleting temporary databases (probably a manual Ctrl-C), and clients are unaware that this can then leave behind invalid databases.

Those temporary databases do not cause any harm as they are not used anymore. But eventually, PG installations will be upgraded to the next major version, and it is only then when those invalid databases resurface after pg_upgrade fails to run the checks.

Databases containing transient data no longer needed left by buggy tools is one
thing, but pg_upgrade won't be able to differentiate between those and invalid
databases of legitimate interest. Allowing pg_upgrade to skip invalid
databases expose the risk of (potentially) valuable data being dropped during
the upgrade due to the user not having realized a rarely-used production
database was invalid.

Long story short: interactive DROP DATABASE interrupts are rare (they do exist, but customers are usually aware). Automation tools on the other hand may run DROP DATABASE and when they get interrupted at the wrong time they will then produce several left-over invalid databases. pg_upgrade will then fail to run the checks.

Checking and reporting all invalid databases during the check phase seems like
a user-friendly option here, I can agree that the current behaviour isn't great
for users experiencing this issue.

--
Daniel Gustafsson

#9Bruce Momjian
bruce@momjian.us
In reply to: Daniel Gustafsson (#7)
Re: pg_upgrade check for invalid databases

On Tue, Oct 1, 2024 at 09:28:54AM +0200, Daniel Gustafsson wrote:

On 1 Oct 2024, at 00:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Daniel Gustafsson <daniel@yesql.se> writes:

On 30 Sep 2024, at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
TBH I'm not finding anything very much wrong with the current
behavior... this has to be a rare situation, do we need to add
debatable behavior to make it easier?

One argument would be to make the checks consistent, pg_upgrade generally tries
to report all the offending entries to help the user when fixing the source
database. Not sure if it's a strong enough argument for carrying code which
really shouldn't see much use though.

OK, but the consistency argument would be to just report and fail.
I don't think there's a precedent in other pg_upgrade checks for
trying to fix problems automatically.

Correct, sorry for being unclear. The consistency argument would be to expand
pg_upgrade to report all invalid databases rather than just the first found;
attempting to fix problems would be a new behavior.

Yes, historically pg_upgrade will fail if it finds anything unusual,
mostly because what it does normally is already scary enough. If users
what pg_upgrade to do cleanups, it would be enabled by a separate flag,
or even a new command-line app.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

#10Nathan Bossart
nathandbossart@gmail.com
In reply to: Bruce Momjian (#9)
Re: pg_upgrade check for invalid databases

On Mon, Oct 07, 2024 at 03:37:35PM -0400, Bruce Momjian wrote:

On Tue, Oct 1, 2024 at 09:28:54AM +0200, Daniel Gustafsson wrote:

Correct, sorry for being unclear. The consistency argument would be to expand
pg_upgrade to report all invalid databases rather than just the first found;
attempting to fix problems would be a new behavior.

Yes, historically pg_upgrade will fail if it finds anything unusual,
mostly because what it does normally is already scary enough. If users
what pg_upgrade to do cleanups, it would be enabled by a separate flag,
or even a new command-line app.

While I suspect it's rare that someone CTRL-C's out of an accidental DROP
DATABASE and then runs pg_upgrade before trying to recover the data, I
agree with the principle of having pg_upgrade fail by default for things
like this. If we did add a new flag, the new invalid database report that
Daniel mentions could say something like "try again with
--skip-invalid-databases to have pg_upgrade automatically drop invalid
databases."

--
nathan

#11Daniel Gustafsson
daniel@yesql.se
In reply to: Nathan Bossart (#10)
Re: pg_upgrade check for invalid databases

On 7 Oct 2024, at 22:04, Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Oct 07, 2024 at 03:37:35PM -0400, Bruce Momjian wrote:

On Tue, Oct 1, 2024 at 09:28:54AM +0200, Daniel Gustafsson wrote:

Correct, sorry for being unclear. The consistency argument would be to expand
pg_upgrade to report all invalid databases rather than just the first found;
attempting to fix problems would be a new behavior.

Yes, historically pg_upgrade will fail if it finds anything unusual,
mostly because what it does normally is already scary enough. If users
what pg_upgrade to do cleanups, it would be enabled by a separate flag,
or even a new command-line app.

While I suspect it's rare that someone CTRL-C's out of an accidental DROP
DATABASE and then runs pg_upgrade before trying to recover the data, I
agree with the principle of having pg_upgrade fail by default for things
like this. If we did add a new flag, the new invalid database report that
Daniel mentions could say something like "try again with
--skip-invalid-databases to have pg_upgrade automatically drop invalid
databases."

If we are teaching pg_upgrade to handle errors, either by skipping or by
fixing, then I believe this is the right way to go about it. A successful run
should probably also create a report of the databases which were skipped.

--
Daniel Gustafsson

#12Thomas Krennwallner
tk@postsubmeta.net
In reply to: Daniel Gustafsson (#11)
1 attachment(s)
Re: pg_upgrade check for invalid databases

On Fri, 11 Oct 2024 at 04:01, Daniel Gustafsson <daniel@yesql.se> wrote:

On 7 Oct 2024, at 22:04, Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Oct 07, 2024 at 03:37:35PM -0400, Bruce Momjian wrote:

On Tue, Oct 1, 2024 at 09:28:54AM +0200, Daniel Gustafsson wrote:

Correct, sorry for being unclear. The consistency argument would be to expand
pg_upgrade to report all invalid databases rather than just the first found;
attempting to fix problems would be a new behavior.

Yes, historically pg_upgrade will fail if it finds anything unusual,
mostly because what it does normally is already scary enough. If users
what pg_upgrade to do cleanups, it would be enabled by a separate flag,
or even a new command-line app.

While I suspect it's rare that someone CTRL-C's out of an accidental DROP
DATABASE and then runs pg_upgrade before trying to recover the data, I
agree with the principle of having pg_upgrade fail by default for things
like this. If we did add a new flag, the new invalid database report that
Daniel mentions could say something like "try again with
--skip-invalid-databases to have pg_upgrade automatically drop invalid
databases."

If we are teaching pg_upgrade to handle errors, either by skipping or by
fixing, then I believe this is the right way to go about it. A successful run
should probably also create a report of the databases which were skipped.

In v2 I've made changes to the patch incorporating the suggestions here:

* Default behaviour is to just fail with a report of all invalid databases

* A new option --skip-invalid-databases will then skip the checks, and
would not transfer any invalid database to the new cluster. A warning
with a report file will then follow after a successful run.

Dropping invalid databases in the old cluster will make invalid
databases unrecoverable, so I opted for a skip over invalid databases
approach that would leave invalid databases in the old cluster.

Apart from a missing --skip-invalid-databases test, does this attempt look OK?

Attachments:

pg18-v2-0001-pg_upgrade-Add-check-for-invalid-databases.patchtext/x-patch; charset=US-ASCII; name=pg18-v2-0001-pg_upgrade-Add-check-for-invalid-databases.patchDownload
From e51f581ddc1158a9fb2840dfc04863618a390887 Mon Sep 17 00:00:00 2001
From: Thomas Krennwallner <teakay@aiven.io>
Date: Fri, 20 Sep 2024 17:33:05 -0400
Subject: [PATCH] pg_upgrade: Add check for invalid databases.

Currently, pg_upgrade fails to connect to the first invalid database
it encounters in get_loadable_libraries() and then aborts.  While we
print a fatal message with a hint what should be done, the output is
terse and does not report further invalid databases present in an
installation.

Instead of just exiting on first error, we collect all
pg_database.datconnlimit values in get_db_infos() and order invalid
databases after valid ones in cluster.dbarr. This allows for skipping
over invalid databases in all checks with --skip-invalid-databases, a
new option where we do not transfer such databases from the old
cluster.

Unless we use --skip-invalid-databases, check_for_invalid_dbs()
collects all invalid databases in a report file invalid_databases.txt.
After a successful pg_upgrade run with skipped invalid databases,
create_invalid_databases_report() will collect them in a report file
skipped_invalid_databases.txt for the completion banner output.
---
 doc/src/sgml/ref/pgupgrade.sgml        |  19 ++++
 src/bin/pg_upgrade/check.c             | 131 +++++++++++++++++++++++--
 src/bin/pg_upgrade/info.c              |  20 +++-
 src/bin/pg_upgrade/option.c            |   7 ++
 src/bin/pg_upgrade/pg_upgrade.c        |  18 +++-
 src/bin/pg_upgrade/pg_upgrade.h        |   8 +-
 src/bin/pg_upgrade/t/002_pg_upgrade.pl |   2 +-
 7 files changed, 187 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 4777381dac..a9a6b62e7a 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -253,6 +253,25 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--skip-invalid-databases</option></term>
+      <listitem>
+       <para>
+        If a cluster contains invalid databases, the default behaviour of
+        <application>pg_upgrade</application> is to fail with a report of all
+        such databases found in an installation. Invalid databases, which cannot
+        be connected to anymore, are a consequence of interrupted
+        <command>DROP DATABASE</command> and marked as corrupted databases in
+        <link linkend="catalog-pg-database"><structname>pg_database</structname></link>.<structfield>datconnlimit</structfield>.
+       </para>
+       <para>
+        When using option <option>--skip-invalid-databases</option>,
+        <application>pg_upgrade</application> instead will skip all invalid
+        databases and do not transfer them over to the new cluster.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 12735a4268..e372feff1a 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -12,6 +12,7 @@
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_collation.h"
+#include "catalog/pg_database.h"
 #include "fe_utils/string_utils.h"
 #include "mb/pg_wchar.h"
 #include "pg_upgrade.h"
@@ -19,6 +20,7 @@
 static void check_new_cluster_is_empty(void);
 static void check_is_install_user(ClusterInfo *cluster);
 static void check_proper_datallowconn(ClusterInfo *cluster);
+static void check_for_invalid_dbs(ClusterInfo *cluster);
 static void check_for_prepared_transactions(ClusterInfo *cluster);
 static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
 static void check_for_user_defined_postfix_ops(ClusterInfo *cluster);
@@ -598,6 +600,11 @@ check_and_dump_old_cluster(void)
 	 */
 	get_db_rel_and_slot_infos(&old_cluster);
 
+	/*
+	 * Verify that the old cluster does not contain invalid databases.
+	 */
+	check_for_invalid_dbs(&old_cluster);
+
 	init_tablespaces();
 
 	get_loadable_libraries();
@@ -691,6 +698,11 @@ check_new_cluster(void)
 {
 	get_db_rel_and_slot_infos(&new_cluster);
 
+	/*
+	 * Verify that the new cluster does not contain invalid databases.
+	 */
+	check_for_invalid_dbs(&new_cluster);
+
 	check_new_cluster_is_empty();
 
 	check_loadable_libraries();
@@ -763,7 +775,7 @@ issue_warnings_and_set_wal_level(void)
 
 
 void
-output_completion_banner(char *deletion_script_file_name)
+output_completion_banner(char *deletion_script_file_name, char *skipped_invalid_databases_file_name)
 {
 	PQExpBufferData user_specification;
 
@@ -775,6 +787,14 @@ output_completion_banner(char *deletion_script_file_name)
 		appendPQExpBufferChar(&user_specification, ' ');
 	}
 
+	if (skipped_invalid_databases_file_name)
+	{
+		pg_log(PG_REPORT,
+			   "The following invalid databases were not transferred by pg_upgrade:\n"
+			   "    %s",
+			   skipped_invalid_databases_file_name);
+	}
+
 	pg_log(PG_REPORT,
 		   "Optimizer statistics are not transferred by pg_upgrade.\n"
 		   "Once you start the new server, consider running:\n"
@@ -998,7 +1018,7 @@ create_script_for_old_cluster_deletion(char **deletion_script_file_name)
 
 			fprintf(script, "\n");
 
-			for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
+			for (dbnum = 0; dbnum < old_cluster.dbarr.ntotaldbs; dbnum++)
 				fprintf(script, RMDIR_CMD " %c%s%c%u%c\n", PATH_QUOTE,
 						fix_path_separator(os_info.old_tablespaces[tblnum]),
 						PATH_SEPARATOR, old_cluster.dbarr.dbs[dbnum].db_oid,
@@ -1087,6 +1107,96 @@ check_is_install_user(ClusterInfo *cluster)
 	check_ok();
 }
 
+/*
+ *	check_for_invalid_dbs
+ *
+ *	Ensure that all databases are valid as connections to invalid databases
+ *	are not allowed.
+ */
+static void
+check_for_invalid_dbs(ClusterInfo *cluster)
+{
+	int			dbnum;
+	FILE	   *script = NULL;
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for invalid databases");
+
+	if (user_opts.skip_invalid_dbs)
+	{
+		pg_log(PG_REPORT, "skipped");
+		return;
+	}
+
+	snprintf(output_path, sizeof(output_path), "%s/%s",
+			 log_opts.basedir,
+			 "invalid_databases.txt");
+
+	/*
+	 * If invalid databases exist in cluster they are ordered after the valid
+	 * ones in dbarr.
+	 */
+	for (dbnum = cluster->dbarr.ndbs; dbnum < cluster->dbarr.ntotaldbs; dbnum++)
+	{
+		if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+			pg_fatal("could not open file \"%s\": %m", output_path);
+		fprintf(script, "%s\n", cluster->dbarr.dbs[dbnum].db_name);
+	}
+
+	if (script)
+	{
+		fclose(script);
+		pg_log(PG_REPORT, "fatal");
+		pg_fatal("Your installation contains invalid databases as a consequence of\n"
+				 "interrupted DROP DATABASE.  They are now marked as corrupted databases\n"
+				 "that cannot be connected to anymore.  Consider removing them using\n"
+				 "    DROP DATABASE ...;\n"
+				 "or run pg_upgrade with --skip-invalid-databases to skip transferring\n"
+				 "invalid databases to the new cluster.\n"
+				 "A list of invalid databases is in the file:\n"
+				 "    %s", output_path);
+	}
+	else
+		check_ok();
+}
+
+
+/*
+ * create_invalid_databases_report()
+ *
+ *	This is particularly useful for listing skipped invalid databases.
+ */
+void
+create_invalid_databases_report(char **skipped_invalid_databases_file_name)
+{
+	int			dbnum;
+	FILE	   *script = NULL;
+
+	if (!user_opts.skip_invalid_dbs || old_cluster.dbarr.ndbs == old_cluster.dbarr.ntotaldbs)
+		return;
+
+	*skipped_invalid_databases_file_name = psprintf("%sskipped_invalid_databases.txt",
+													SCRIPT_PREFIX);
+
+	prep_status("Creating report for skipped invalid databases");
+
+	for (dbnum = old_cluster.dbarr.ndbs; dbnum < old_cluster.dbarr.ntotaldbs; dbnum++)
+	{
+		if (script == NULL && (script = fopen_priv(*skipped_invalid_databases_file_name, "w")) == NULL)
+			pg_fatal("could not open file \"%s\": %m", *skipped_invalid_databases_file_name);
+		fprintf(script, "%s\n", old_cluster.dbarr.dbs[dbnum].db_name);
+	}
+
+	if (script != NULL)
+	{
+		report_status(PG_WARNING, "warning");
+		pg_log(PG_WARNING,
+			   "\nWARNING:  old cluster contains invalid databases which have been skipped");
+		fclose(script);
+	}
+	else
+		check_ok();
+}
 
 /*
  *	check_proper_datallowconn
@@ -1115,10 +1225,15 @@ check_proper_datallowconn(ClusterInfo *cluster)
 
 	conn_template1 = connectToServer(cluster, "template1");
 
-	/* get database names */
+	/*
+	 * get database names; skip invalid databases as they are handled in
+	 * check_for_invalid_dbs()
+	 */
 	dbres = executeQueryOrDie(conn_template1,
 							  "SELECT	datname, datallowconn "
-							  "FROM	pg_catalog.pg_database");
+							  "FROM	pg_catalog.pg_database "
+							  "WHERE datconnlimit != %d",
+							  DATCONNLIMIT_INVALID_DB);
 
 	i_datname = PQfnumber(dbres, "datname");
 	i_datallowconn = PQfnumber(dbres, "datallowconn");
@@ -1982,7 +2097,9 @@ check_old_cluster_subscription_state(void)
 
 	/*
 	 * Check that all the subscriptions have their respective replication
-	 * origin.  This check only needs to run once.
+	 * origin.  This check only needs to run once.  Skip invalid databases,
+	 * this check will only be performed when user_opts.skip_invalid_dbs is
+	 * true.
 	 */
 	conn = connectToServer(&old_cluster, old_cluster.dbarr.dbs[0].db_name);
 	res = executeQueryOrDie(conn,
@@ -1992,7 +2109,9 @@ check_old_cluster_subscription_state(void)
 							"	ON o.roname = 'pg_' || s.oid "
 							"INNER JOIN pg_catalog.pg_database d "
 							"	ON d.oid = s.subdbid "
-							"WHERE o.roname IS NULL;");
+							"WHERE o.roname IS NULL "
+							"   AND d.datconnlimit != %d",
+							DATCONNLIMIT_INVALID_DB);
 	ntup = PQntuples(res);
 	for (int i = 0; i < ntup; i++)
 	{
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index f83ded89cb..514ed2d612 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -11,6 +11,7 @@
 
 #include "access/transam.h"
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_database.h"
 #include "pg_upgrade.h"
 #include "pqexpbuffer.h"
 
@@ -407,14 +408,16 @@ get_db_infos(ClusterInfo *cluster)
 	PGresult   *res;
 	int			ntups;
 	int			tupnum;
+	int			valid_tups;
 	DbInfo	   *dbinfos;
 	int			i_datname,
+				i_datconnlimit,
 				i_oid,
 				i_spclocation;
 	char		query[QUERY_ALLOC];
 
 	snprintf(query, sizeof(query),
-			 "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, ");
+			 "SELECT d.oid, d.datname, d.encoding, d.datcollate, d.datctype, d.datconnlimit, ");
 	if (GET_MAJOR_VERSION(cluster->major_version) >= 1700)
 		snprintf(query + strlen(query), sizeof(query) - strlen(query),
 				 "datlocprovider, datlocale, ");
@@ -430,30 +433,36 @@ get_db_infos(ClusterInfo *cluster)
 			 " LEFT OUTER JOIN pg_catalog.pg_tablespace t "
 			 " ON d.dattablespace = t.oid "
 			 "WHERE d.datallowconn = true "
-			 "ORDER BY 1");
+			 "ORDER BY "
+			 " CASE WHEN d.datconnlimit != %d THEN 0 ELSE 1 END, "
+			 " d.oid", DATCONNLIMIT_INVALID_DB);
 
 	res = executeQueryOrDie(conn, "%s", query);
 
 	i_oid = PQfnumber(res, "oid");
 	i_datname = PQfnumber(res, "datname");
+	i_datconnlimit = PQfnumber(res, "datconnlimit");
 	i_spclocation = PQfnumber(res, "spclocation");
 
 	ntups = PQntuples(res);
 	dbinfos = (DbInfo *) pg_malloc0(sizeof(DbInfo) * ntups);
 
-	for (tupnum = 0; tupnum < ntups; tupnum++)
+	for (tupnum = 0, valid_tups = 0; tupnum < ntups; tupnum++)
 	{
 		dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
 		dbinfos[tupnum].db_name = pg_strdup(PQgetvalue(res, tupnum, i_datname));
 		snprintf(dbinfos[tupnum].db_tablespace, sizeof(dbinfos[tupnum].db_tablespace), "%s",
 				 PQgetvalue(res, tupnum, i_spclocation));
+		if (atoi(PQgetvalue(res, tupnum, i_datconnlimit)) != DATCONNLIMIT_INVALID_DB)
+			valid_tups++;
 	}
 	PQclear(res);
 
 	PQfinish(conn);
 
 	cluster->dbarr.dbs = dbinfos;
-	cluster->dbarr.ndbs = ntups;
+	cluster->dbarr.ndbs = valid_tups;
+	cluster->dbarr.ntotaldbs = ntups;
 }
 
 
@@ -774,7 +783,7 @@ free_db_and_rel_infos(DbInfoArr *db_arr)
 {
 	int			dbnum;
 
-	for (dbnum = 0; dbnum < db_arr->ndbs; dbnum++)
+	for (dbnum = 0; dbnum < db_arr->ntotaldbs; dbnum++)
 	{
 		free_rel_infos(&db_arr->dbs[dbnum].rel_arr);
 		pg_free(db_arr->dbs[dbnum].db_name);
@@ -782,6 +791,7 @@ free_db_and_rel_infos(DbInfoArr *db_arr)
 	pg_free(db_arr->dbs);
 	db_arr->dbs = NULL;
 	db_arr->ndbs = 0;
+	db_arr->ntotaldbs = 0;
 }
 
 
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 6f41d63eed..1937796a72 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -60,6 +60,7 @@ parseCommandLine(int argc, char *argv[])
 		{"copy", no_argument, NULL, 2},
 		{"copy-file-range", no_argument, NULL, 3},
 		{"sync-method", required_argument, NULL, 4},
+		{"skip-invalid-databases", no_argument, NULL, 5},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -69,6 +70,7 @@ parseCommandLine(int argc, char *argv[])
 	DataDirSyncMethod unused;
 
 	user_opts.do_sync = true;
+	user_opts.skip_invalid_dbs = false;
 	user_opts.transfer_mode = TRANSFER_MODE_COPY;
 
 	os_info.progname = get_progname(argv[0]);
@@ -212,6 +214,10 @@ parseCommandLine(int argc, char *argv[])
 				user_opts.sync_method = pg_strdup(optarg);
 				break;
 
+			case 5:
+				user_opts.skip_invalid_dbs = true;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -306,6 +312,7 @@ usage(void)
 	printf(_("  --clone                       clone instead of copying files to new cluster\n"));
 	printf(_("  --copy                        copy files to new cluster (default)\n"));
 	printf(_("  --copy-file-range             copy files to new cluster with copy_file_range\n"));
+	printf(_("  --skip-invalid-databases      skip transferring invalid databases to the new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..b0353327b9 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -41,6 +41,7 @@
 #include <time.h>
 
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_database.h"
 #include "common/file_perm.h"
 #include "common/logging.h"
 #include "common/restricted_token.h"
@@ -83,7 +84,8 @@ char	   *output_files[] = {
 int
 main(int argc, char **argv)
 {
-	char	   *deletion_script_file_name = NULL;
+	char	   *deletion_script_file_name = NULL,
+			   *skipped_invalid_databases_file_name = NULL;
 
 	/*
 	 * pg_upgrade doesn't currently use common/logging.c, but initialize it
@@ -219,6 +221,8 @@ main(int argc, char **argv)
 
 	create_script_for_old_cluster_deletion(&deletion_script_file_name);
 
+	create_invalid_databases_report(&skipped_invalid_databases_file_name);
+
 	issue_warnings_and_set_wal_level();
 
 	pg_log(PG_REPORT,
@@ -226,9 +230,10 @@ main(int argc, char **argv)
 		   "Upgrade Complete\n"
 		   "----------------");
 
-	output_completion_banner(deletion_script_file_name);
+	output_completion_banner(deletion_script_file_name, skipped_invalid_databases_file_name);
 
 	pg_free(deletion_script_file_name);
+	pg_free(skipped_invalid_databases_file_name);
 
 	cleanup_output_dirs();
 
@@ -859,10 +864,15 @@ set_frozenxids(bool minmxid_only)
 							  "SET	datminmxid = '%u'",
 							  old_cluster.controldata.chkpnt_nxtmulti));
 
-	/* get database names */
+	/*
+	 * get database names; skip invalid databases as they are handled in
+	 * check_for_invalid_dbs()
+	 */
 	dbres = executeQueryOrDie(conn_template1,
 							  "SELECT	datname, datallowconn "
-							  "FROM	pg_catalog.pg_database");
+							  "FROM	pg_catalog.pg_database "
+							  "WHERE datconnlimit != %d",
+							  DATCONNLIMIT_INVALID_DB);
 
 	i_datname = PQfnumber(dbres, "datname");
 	i_datallowconn = PQfnumber(dbres, "datallowconn");
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..8abb1688ca 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -213,7 +213,8 @@ typedef struct
 typedef struct
 {
 	DbInfo	   *dbs;			/* array of db infos */
-	int			ndbs;			/* number of db infos */
+	int			ndbs;			/* number of valid db infos */
+	int			ntotaldbs;		/* total number of db infos */
 } DbInfoArr;
 
 /*
@@ -323,6 +324,8 @@ typedef struct
 	bool		check;			/* check clusters only, don't change any data */
 	bool		live_check;		/* check clusters only, old server is running */
 	bool		do_sync;		/* flush changes to disk */
+	bool		skip_invalid_dbs;	/* skip transferring invalid databases to
+									 * the new cluster */
 	transferMode transfer_mode; /* copy files or link them? */
 	int			jobs;			/* number of processes/threads to use */
 	char	   *socketdir;		/* directory to use for Unix sockets */
@@ -371,10 +374,11 @@ void		check_and_dump_old_cluster(void);
 void		check_new_cluster(void);
 void		report_clusters_compatible(void);
 void		issue_warnings_and_set_wal_level(void);
-void		output_completion_banner(char *deletion_script_file_name);
+void		output_completion_banner(char *deletion_script_file_name, char *skipped_invalid_databases_file_name);
 void		check_cluster_versions(void);
 void		check_cluster_compatibility(void);
 void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
+void		create_invalid_databases_report(char **skipped_invalid_databases_file_name);
 
 
 /* controldata.c */
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 17af2ce61e..c7da6d27b2 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -424,7 +424,7 @@ SKIP:
 			$mode, '--check',
 		],
 		1,
-		[qr/invalid/],    # pg_upgrade prints errors on stdout :(
+		[qr/invalid_databases\.txt/],    # pg_upgrade prints errors on stdout :(
 		[qr/^$/],
 		'invalid database causes failure');
 	rmtree($newnode->data_dir . "/pg_upgrade_output.d");
-- 
2.46.2

#13Bruce Momjian
bruce@momjian.us
In reply to: Thomas Krennwallner (#12)
Re: pg_upgrade check for invalid databases

On Sun, Oct 13, 2024 at 08:28:57AM -0400, Thomas Krennwallner wrote:

In v2 I've made changes to the patch incorporating the suggestions here:

* Default behaviour is to just fail with a report of all invalid databases

* A new option --skip-invalid-databases will then skip the checks, and
would not transfer any invalid database to the new cluster. A warning
with a report file will then follow after a successful run.

Dropping invalid databases in the old cluster will make invalid
databases unrecoverable, so I opted for a skip over invalid databases
approach that would leave invalid databases in the old cluster.

Apart from a missing --skip-invalid-databases test, does this attempt look OK?

I don't think there is enough demand for the feature of skipping invalid
databases because we have gotten few reports of such problems, and also
because your case is related to an external tool causing this problem.

What might be acceptable would be to add an option that would make
pg_upgrade more tolerant of problems in many areas, that is a lot more
research and discussion.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

#14Daniel Gustafsson
daniel@yesql.se
In reply to: Bruce Momjian (#13)
1 attachment(s)
Re: pg_upgrade check for invalid databases

On 14 Oct 2024, at 18:57, Bruce Momjian <bruce@momjian.us> wrote:

What might be acceptable would be to add an option that would make
pg_upgrade more tolerant of problems in many areas, that is a lot more
research and discussion.

I agree that the concept of having pg_upgrade perform (opt-in) skipping and/or
repairs of the old cluster warrants a larger discussion in its own thread.
There has been significant amount of energy spent recently to add structure to
the checks, any new feature should be properly designed for the get-go.

In the meantime, the OP has a good point that it's a tad silly that pg_upgrade
fails hard on invalid databases instead of detecting and reporting like how
other errors are handled. The attached adds this check and expands the report
wording to cover it.

--
Daniel Gustafsson

Attachments:

0001-Find-invalid-databases-during-upgrade-check-stage.patchapplication/octet-stream; name=0001-Find-invalid-databases-during-upgrade-check-stage.patch; x-unix-mode=0644Download
From 3608e477cd5d03074f8d5b0fa9d2890cb6944ca5 Mon Sep 17 00:00:00 2001
From: Daniel Gustafsson <dgustafsson@postgresql.org>
Date: Fri, 25 Oct 2024 13:45:28 +0200
Subject: [PATCH] Find invalid databases during upgrade check stage

Before continuing with the check start by checking that all databases
allow connections to avoid a hard fail without proper error reporting.

Inspired by a larger patch by Thomas Krennwallner.

Discussion: https://postgr.es/m/f9315bf0-e03e-4490-9f0d-5b6f7a6d9908@postsubmeta.net
---
 src/bin/pg_upgrade/check.c             | 36 ++++++++++++++++----------
 src/bin/pg_upgrade/t/002_pg_upgrade.pl |  2 +-
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 12735a4268..2fce686530 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -18,7 +18,7 @@
 
 static void check_new_cluster_is_empty(void);
 static void check_is_install_user(ClusterInfo *cluster);
-static void check_proper_datallowconn(ClusterInfo *cluster);
+static void check_for_connection_status(ClusterInfo *cluster);
 static void check_for_prepared_transactions(ClusterInfo *cluster);
 static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
 static void check_for_user_defined_postfix_ops(ClusterInfo *cluster);
@@ -592,6 +592,12 @@ check_and_dump_old_cluster(void)
 	if (!user_opts.live_check)
 		start_postmaster(&old_cluster, true);
 
+	/*
+	 * First check that all databases allow connections since we'll otherwise
+	 * fail in later stages.
+	 */
+	check_for_connection_status(&old_cluster);
+
 	/*
 	 * Extract a list of databases, tables, and logical replication slots from
 	 * the old cluster.
@@ -607,7 +613,6 @@ check_and_dump_old_cluster(void)
 	 * Check for various failure cases
 	 */
 	check_is_install_user(&old_cluster);
-	check_proper_datallowconn(&old_cluster);
 	check_for_prepared_transactions(&old_cluster);
 	check_for_isn_and_int8_passing_mismatch(&old_cluster);
 
@@ -1089,14 +1094,14 @@ check_is_install_user(ClusterInfo *cluster)
 
 
 /*
- *	check_proper_datallowconn
+ *	check_for_connection_status
  *
  *	Ensure that all non-template0 databases allow connections since they
  *	otherwise won't be restored; and that template0 explicitly doesn't allow
  *	connections since it would make pg_dumpall --globals restore fail.
  */
 static void
-check_proper_datallowconn(ClusterInfo *cluster)
+check_for_connection_status(ClusterInfo *cluster)
 {
 	int			dbnum;
 	PGconn	   *conn_template1;
@@ -1104,6 +1109,7 @@ check_proper_datallowconn(ClusterInfo *cluster)
 	int			ntups;
 	int			i_datname;
 	int			i_datallowconn;
+	int			i_datconnlimit;
 	FILE	   *script = NULL;
 	char		output_path[MAXPGPATH];
 
@@ -1111,23 +1117,25 @@ check_proper_datallowconn(ClusterInfo *cluster)
 
 	snprintf(output_path, sizeof(output_path), "%s/%s",
 			 log_opts.basedir,
-			 "databases_with_datallowconn_false.txt");
+			 "databases_cannot_connect_to.txt");
 
 	conn_template1 = connectToServer(cluster, "template1");
 
 	/* get database names */
 	dbres = executeQueryOrDie(conn_template1,
-							  "SELECT	datname, datallowconn "
+							  "SELECT	datname, datallowconn, datconnlimit "
 							  "FROM	pg_catalog.pg_database");
 
 	i_datname = PQfnumber(dbres, "datname");
 	i_datallowconn = PQfnumber(dbres, "datallowconn");
+	i_datconnlimit = PQfnumber(dbres, "datconnlimit");
 
 	ntups = PQntuples(dbres);
 	for (dbnum = 0; dbnum < ntups; dbnum++)
 	{
 		char	   *datname = PQgetvalue(dbres, dbnum, i_datname);
 		char	   *datallowconn = PQgetvalue(dbres, dbnum, i_datallowconn);
+		char	   *datconnlimit = PQgetvalue(dbres, dbnum, i_datconnlimit);
 
 		if (strcmp(datname, "template0") == 0)
 		{
@@ -1139,10 +1147,10 @@ check_proper_datallowconn(ClusterInfo *cluster)
 		else
 		{
 			/*
-			 * avoid datallowconn == false databases from being skipped on
-			 * restore
+			 * Avoid datallowconn == false databases from being skipped on
+			 * restore, and ensure that no databases are marked invalid (-2).
 			 */
-			if (strcmp(datallowconn, "f") == 0)
+			if ((strcmp(datallowconn, "f") == 0) || strcmp(datconnlimit, "-2") == 0)
 			{
 				if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
 					pg_fatal("could not open file \"%s\": %m", output_path);
@@ -1161,11 +1169,11 @@ check_proper_datallowconn(ClusterInfo *cluster)
 		fclose(script);
 		pg_log(PG_REPORT, "fatal");
 		pg_fatal("All non-template0 databases must allow connections, i.e. their\n"
-				 "pg_database.datallowconn must be true.  Your installation contains\n"
-				 "non-template0 databases with their pg_database.datallowconn set to\n"
-				 "false.  Consider allowing connection for all non-template0 databases\n"
-				 "or drop the databases which do not allow connections.  A list of\n"
-				 "databases with the problem is in the file:\n"
+				 "pg_database.datallowconn must be true and pg_database.datconnlimit\n"
+				 "must not be -2.  Your installation contains non-template0 databases\n"
+				 "which cannot be connected to.  Consider allowing connection for all\n"
+				 "non-template0 databases or drop the databases which do not allow\n"
+				 "connections.  A list of databases with the problem is in the file:\n"
 				 "    %s", output_path);
 	}
 	else
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 17af2ce61e..9b51f9e666 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -424,7 +424,7 @@ SKIP:
 			$mode, '--check',
 		],
 		1,
-		[qr/invalid/],    # pg_upgrade prints errors on stdout :(
+		[qr/datconnlimit/],
 		[qr/^$/],
 		'invalid database causes failure');
 	rmtree($newnode->data_dir . "/pg_upgrade_output.d");
-- 
2.39.3 (Apple Git-146)

#15Bruce Momjian
bruce@momjian.us
In reply to: Daniel Gustafsson (#14)
Re: pg_upgrade check for invalid databases

On Fri, Oct 25, 2024 at 01:55:57PM +0200, Daniel Gustafsson wrote:

On 14 Oct 2024, at 18:57, Bruce Momjian <bruce@momjian.us> wrote:

What might be acceptable would be to add an option that would make
pg_upgrade more tolerant of problems in many areas, that is a lot more
research and discussion.

I agree that the concept of having pg_upgrade perform (opt-in) skipping and/or
repairs of the old cluster warrants a larger discussion in its own thread.
There has been significant amount of energy spent recently to add structure to
the checks, any new feature should be properly designed for the get-go.

In the meantime, the OP has a good point that it's a tad silly that pg_upgrade
fails hard on invalid databases instead of detecting and reporting like how
other errors are handled. The attached adds this check and expands the report
wording to cover it.

Agreed.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

#16Daniel Gustafsson
daniel@yesql.se
In reply to: Bruce Momjian (#15)
Re: pg_upgrade check for invalid databases

On 1 Nov 2024, at 01:36, Bruce Momjian <bruce@momjian.us> wrote:
On Fri, Oct 25, 2024 at 01:55:57PM +0200, Daniel Gustafsson wrote:

In the meantime, the OP has a good point that it's a tad silly that pg_upgrade
fails hard on invalid databases instead of detecting and reporting like how
other errors are handled. The attached adds this check and expands the report
wording to cover it.

Agreed.

I've applied this part, the discussion on whether or not pg_upgrade should gain
capabilities to skip and/or fix issues should probably be carried over in a new
thread.

--
Daniel Gustafsson