pg15b2: large objects lost on upgrade

Started by Justin Pryzby over 3 years ago, 97 messages
#1Justin Pryzby
pryzby@telsasoft.com

I noticed this during beta1, but dismissed the issue when it wasn't easily
reproducible. Now, I saw the same problem while upgrading from beta1 to beta2,
so couldn't dismiss it. It turns out that LOs are lost if VACUUM FULL was run.

| /usr/pgsql-15b1/bin/initdb --no-sync -D pg15b1.dat -k
| /usr/pgsql-15b1/bin/postgres -D pg15b1.dat -c logging_collector=no -p 5678 -k /tmp&
| psql -h /tmp postgres -p 5678 -c '\lo_import /etc/shells' -c 'VACUUM FULL pg_largeobject'
| rm -fr pg15b2.dat && /usr/pgsql-15b2/bin/initdb --no-sync -k -D pg15b2.dat && /usr/pgsql-15b2/bin/pg_upgrade -d pg15b1.dat -D pg15b2.dat -b /usr/pgsql-15b1/bin
| /usr/pgsql-15b2/bin/postgres -D pg15b2.dat -c logging_collector=no -p 5678 -k /tmp&

Or, for your convenience, with paths in tmp_install:
| ./tmp_install/usr/local/pgsql/bin/initdb --no-sync -D pg15b1.dat -k
| ./tmp_install/usr/local/pgsql/bin/postgres -D pg15b1.dat -c logging_collector=no -p 5678 -k /tmp&
| psql -h /tmp postgres -p 5678 -c '\lo_import /etc/shells' -c 'VACUUM FULL pg_largeobject'
| rm -fr pg15b2.dat && ./tmp_install/usr/local/pgsql/bin/initdb --no-sync -k -D pg15b2.dat && ./tmp_install/usr/local/pgsql/bin/pg_upgrade -d pg15b1.dat -D pg15b2.dat -b ./tmp_install/usr/local/pgsql/bin
| ./tmp_install/usr/local/pgsql/bin/postgres -D pg15b2.dat -c logging_collector=no -p 5678 -k /tmp&

postgres=# table pg_largeobject_metadata ;
  oid  | lomowner | lomacl
-------+----------+--------
 16384 |       10 |
(1 row)

postgres=# \lo_list
      Large objects
  ID   |  Owner  | Description
-------+---------+-------------
 16384 | pryzbyj |
(1 row)

postgres=# \lo_export 16384 /dev/stdout
lo_export
postgres=# table pg_largeobject;
 loid | pageno | data
------+--------+------
(0 rows)

postgres=# \dt+ pg_largeobject
   Schema   |      Name      | Type  |  Owner  | Persistence | Access method |  Size   | Description
------------+----------------+-------+---------+-------------+---------------+---------+-------------
 pg_catalog | pg_largeobject | table | pryzbyj | permanent   | heap          | 0 bytes |
(1 row)

I reproduced the problem at 9a974cbcba but not its parent commit.

commit 9a974cbcba005256a19991203583a94b4f9a21a9
Author: Robert Haas <rhaas@postgresql.org>
Date: Mon Jan 17 13:32:44 2022 -0500

pg_upgrade: Preserve relfilenodes and tablespace OIDs.

--
Justin

#2Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#1)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 01, 2022 at 06:14:13PM -0500, Justin Pryzby wrote:

I reproduced the problem at 9a974cbcba but not its parent commit.

commit 9a974cbcba005256a19991203583a94b4f9a21a9
Author: Robert Haas <rhaas@postgresql.org>
Date: Mon Jan 17 13:32:44 2022 -0500

pg_upgrade: Preserve relfilenodes and tablespace OIDs.

Oops. Robert?

This reproduces as well when using pg_upgrade with the same version as
origin and target, meaning that this extra step in the TAP test is
able to reproduce the issue (extra VACUUM FULL before chdir'ing):
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -208,6 +208,8 @@ if (defined($ENV{oldinstall}))
    }
 }
+$oldnode->safe_psql("regression", "VACUUM FULL pg_largeobject;");
+
# In a VPATH build, we'll be started in the source directory, but we want
--
Michael
#3Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#1)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

I noticed this during beta1, but dismissed the issue when it wasn't easily
reproducible. Now, I saw the same problem while upgrading from beta1 to beta2,
so couldn't dismiss it. It turns out that LOs are lost if VACUUM FULL was run.

Yikes. That's really bad, and I have no idea what might be causing it,
either. I'll plan to investigate this on Tuesday unless someone gets
to it before then.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#3)
1 attachment(s)
Re: pg15b2: large objects lost on upgrade

On Sat, Jul 02, 2022 at 08:34:04AM -0400, Robert Haas wrote:

On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

I noticed this during beta1, but dismissed the issue when it wasn't easily
reproducible. Now, I saw the same problem while upgrading from beta1 to beta2,
so couldn't dismiss it. It turns out that LOs are lost if VACUUM FULL was run.

Yikes. That's really bad, and I have no idea what might be causing it,
either. I'll plan to investigate this on Tuesday unless someone gets
to it before then.

I suppose it's like Bruce said, here.

/messages/by-id/20210601140949.GC22012@momjian.us

|One tricky case is pg_largeobject, which is copied from the old to new
|cluster since it has user data. To preserve that relfilenode, you would
|need to have pg_upgrade perform cluster surgery in each database to
|renumber its relfilenode to match since it is created by initdb. I
|can't think of a case where pg_upgrade already does something like that.

Rather than setting the filenode to use for the next created object, as it
does for user tables, pg_upgrade needs to UPDATE the relfilenode.
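
For illustration, the binary-upgrade dump would then emit statements shaped
like this, one for pg_largeobject (OID 2613) and one for its index (the
relfrozenxid/relminmxid/relfilenode values here are invented):

UPDATE pg_catalog.pg_class
SET relfrozenxid = '717', relminmxid = '1', relfilenode = '39909'
WHERE oid = 2613;  -- pg_largeobject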

This patch "works for me" but feel free to improve on it.

Attachments:

0001-wip-preserve-relfilenode-of-pg_largeobject-and-its-i.patch (text/x-diff; charset=us-ascii)
From 88cbe118d4eb4dcf9b6d2831d81f723587d80942 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sat, 2 Jul 2022 00:48:47 -0500
Subject: [PATCH] wip: preserve relfilenode of pg_largeobject and its indexes

An empty table is created by initdb, but its filenode was not preserved, so
pg_largeobject was empty after upgrades.

See also:
9a974cbcba005256a19991203583a94b4f9a21a9
https://www.postgresql.org/message-id/20210601140949.GC22012%40momjian.us
https://www.postgresql.org/message-id/20220701231413.GI13040%40telsasoft.com
---
 src/bin/pg_dump/pg_dump.c              | 37 +++++++++++++++-----------
 src/bin/pg_upgrade/t/002_pg_upgrade.pl |  2 ++
 2 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d9c5bcafd20..c629f4154fc 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3135,7 +3135,7 @@ dumpDatabase(Archive *fout)
 
 	/*
 	 * pg_largeobject comes from the old system intact, so set its
-	 * relfrozenxids and relminmxids.
+	 * relfrozenxids, relminmxids and relfilenode.
 	 */
 	if (dopt->binary_upgrade)
 	{
@@ -3143,34 +3143,41 @@ dumpDatabase(Archive *fout)
 		PQExpBuffer loFrozenQry = createPQExpBuffer();
 		PQExpBuffer loOutQry = createPQExpBuffer();
 		int			i_relfrozenxid,
+					i_relfilenode,
+					i_oid,
 					i_relminmxid;
 
 		/*
 		 * pg_largeobject
 		 */
 		if (fout->remoteVersion >= 90300)
-			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, relminmxid\n"
+			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, relminmxid, relfilenode, oid\n"
 							  "FROM pg_catalog.pg_class\n"
-							  "WHERE oid = %u;\n",
-							  LargeObjectRelationId);
+							  "WHERE oid IN (%u, %u);\n",
+							  LargeObjectRelationId, LargeObjectLOidPNIndexId);
 		else
-			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, 0 AS relminmxid\n"
+			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, 0 AS relminmxid, relfilenode, oid\n"
 							  "FROM pg_catalog.pg_class\n"
-							  "WHERE oid = %u;\n",
-							  LargeObjectRelationId);
+							  "WHERE oid IN (%u, %u);\n",
+							  LargeObjectRelationId, LargeObjectLOidPNIndexId);
 
-		lo_res = ExecuteSqlQueryForSingleRow(fout, loFrozenQry->data);
+		lo_res = ExecuteSqlQuery(fout, loFrozenQry->data, PGRES_TUPLES_OK);
 
 		i_relfrozenxid = PQfnumber(lo_res, "relfrozenxid");
 		i_relminmxid = PQfnumber(lo_res, "relminmxid");
+		i_relfilenode = PQfnumber(lo_res, "relfilenode");
+		i_oid = PQfnumber(lo_res, "oid");
+
+		appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, set pg_largeobject relfrozenxid, relminmxid and relfilenode\n");
+		for (int i = 0; i < PQntuples(lo_res); ++i)
+			appendPQExpBuffer(loOutQry, "UPDATE pg_catalog.pg_class\n"
+							  "SET relfrozenxid = '%u', relminmxid = '%u', relfilenode = '%u'\n"
+							  "WHERE oid = %u;\n",
+							  atooid(PQgetvalue(lo_res, i, i_relfrozenxid)),
+							  atooid(PQgetvalue(lo_res, i, i_relminmxid)),
+							  atooid(PQgetvalue(lo_res, i, i_relfilenode)),
+							  atooid(PQgetvalue(lo_res, i, i_oid)));
 
-		appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, set pg_largeobject relfrozenxid and relminmxid\n");
-		appendPQExpBuffer(loOutQry, "UPDATE pg_catalog.pg_class\n"
-						  "SET relfrozenxid = '%u', relminmxid = '%u'\n"
-						  "WHERE oid = %u;\n",
-						  atooid(PQgetvalue(lo_res, 0, i_relfrozenxid)),
-						  atooid(PQgetvalue(lo_res, 0, i_relminmxid)),
-						  LargeObjectRelationId);
 		ArchiveEntry(fout, nilCatalogId, createDumpId(),
 					 ARCHIVE_OPTS(.tag = "pg_largeobject",
 								  .description = "pg_largeobject",
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 67e0be68568..a22f0f8c885 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -208,6 +208,8 @@ if (defined($ENV{oldinstall}))
 	}
 }
 
+$oldnode->safe_psql("regression", "VACUUM FULL pg_largeobject;");
+
 # In a VPATH build, we'll be started in the source directory, but we want
 # to run pg_upgrade in the build directory so that any files generated finish
 # in it, like delete_old_cluster.{sh,bat}.
-- 
2.17.1

#5Julien Rouhaud
rjuju123@gmail.com
In reply to: Robert Haas (#3)
Re: pg15b2: large objects lost on upgrade

On Sat, Jul 02, 2022 at 08:34:04AM -0400, Robert Haas wrote:

On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

I noticed this during beta1, but dismissed the issue when it wasn't easily
reproducible. Now, I saw the same problem while upgrading from beta1 to beta2,
so couldn't dismiss it. It turns out that LOs are lost if VACUUM FULL was run.

Yikes. That's really bad, and I have no idea what might be causing it,
either. I'll plan to investigate this on Tuesday unless someone gets
to it before then.

As far as I can see the data is still there, it's just that the new cluster
keeps its default relfilenode instead of preserving the old cluster's value:

regression=# table pg_largeobject;
loid | pageno | data
------+--------+------
(0 rows)

regression=# select oid, relfilenode from pg_class where relname = 'pg_largeobject';
oid | relfilenode
------+-------------
2613 | 2613
(1 row)

-- using the value from the old cluster
regression=# update pg_class set relfilenode = 39909 where oid = 2613;
UPDATE 1

regression=# table pg_largeobject;
loid | pageno |
-------+--------+-----------------
33211 | 0 | \x0a4920776[...]
34356 | 0 | \xdeadbeef
(2 rows)

#6Shruthi Gowda
gowdashru@gmail.com
In reply to: Justin Pryzby (#4)
Re: pg15b2: large objects lost on upgrade

I was able to reproduce the issue. Also, the issue does not occur with code
prior to the preserve-relfilenode commit.
I tested your patch and it fixes the problem.
I am currently analyzing a few things related to the issue. I will come
back once my analysis is complete.

On Sat, Jul 2, 2022 at 9:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Sat, Jul 02, 2022 at 08:34:04AM -0400, Robert Haas wrote:

On Fri, Jul 1, 2022 at 7:14 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

I noticed this during beta1, but dismissed the issue when it wasn't easily
reproducible. Now, I saw the same problem while upgrading from beta1 to beta2,
so couldn't dismiss it. It turns out that LOs are lost if VACUUM FULL was run.

Yikes. That's really bad, and I have no idea what might be causing it,
either. I'll plan to investigate this on Tuesday unless someone gets
to it before then.

I suppose it's like Bruce said, here.

/messages/by-id/20210601140949.GC22012@momjian.us

|One tricky case is pg_largeobject, which is copied from the old to new
|cluster since it has user data. To preserve that relfilenode, you would
|need to have pg_upgrade perform cluster surgery in each database to
|renumber its relfilenode to match since it is created by initdb. I
|can't think of a case where pg_upgrade already does something like that.

Rather than setting the filenode to use for the next created object, as it
does for user tables, pg_upgrade needs to UPDATE the relfilenode.

This patch "works for me" but feel free to improve on it.

#7Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#4)
Re: pg15b2: large objects lost on upgrade

On Sat, Jul 2, 2022 at 11:49 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I suppose it's like Bruce said, here.

/messages/by-id/20210601140949.GC22012@momjian.us

Well, I feel dumb. I remember reading that email back when Bruce sent
it, but it seems that it slipped out of my head between then and when
I committed. I think your patch is fine, except that I think maybe we
should adjust this dump comment:

-- For binary upgrade, set pg_largeobject relfrozenxid, relminmxid and relfilenode

Perhaps:

-- For binary upgrade, preserve values for pg_largeobject and its index

Listing the exact properties preserved seems less important to me than
mentioning that the second UPDATE statement is for its index --
because if you look at the SQL that's generated, you can see what's
being preserved, but you don't automatically know why there are two
UPDATE statements or what the rows with those OIDs are.

I had a moment of panic this morning where I thought maybe the whole
patch needed to be reverted. I was worried that we might need to
preserve the OID of every system table and index. Otherwise, what
happens if the OID counter in the old cluster wraps around and some
user-created object gets an OID that the system tables are using in
the new cluster? However, I think this can't happen, because when the
OID counter wraps around, it wraps around to 16384, and the
relfilenode values for newly created system tables and indexes are all
less than 16384. So I believe we only need to fix pg_largeobject and
its index, and I think your patch does that.
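
For what it's worth, the invariant is easy to see with a catalog query (a
sketch, not something pg_upgrade actually runs): on a freshly-initdb'd
cluster, no relation's relfilenode is at or above FirstNormalObjectId.

SELECT relname, relfilenode
FROM pg_catalog.pg_class
WHERE relfilenode >= 16384;
-- expect zero rows on a pristine cluster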

Anyone else see it differently?

--
Robert Haas
EDB: http://www.enterprisedb.com

#8Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#7)
Re: pg15b2: large objects lost on upgrade

On Tue, Jul 05, 2022 at 12:43:54PM -0400, Robert Haas wrote:

On Sat, Jul 2, 2022 at 11:49 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I suppose it's like Bruce said, here.

/messages/by-id/20210601140949.GC22012@momjian.us

Well, I feel dumb. I remember reading that email back when Bruce sent
it, but it seems that it slipped out of my head between then and when
I committed. I think your patch is fine, except that I think maybe we

My patch also leaves a 0-byte file around from initdb, which is harmless, but
dirty.

I've seen before where a bunch of 0-byte files are abandoned in an
otherwise-empty tablespace, with no associated relation, and I have to "rm"
them to be able to drop the tablespace. Maybe that's a known issue, maybe it's
due to crashes or some other edge case, maybe it's of no consequence, and maybe
it's already been fixed or is being fixed. But it'd be nice to avoid another
way to end up with 0-byte files - especially ones named with system OIDs.

Listing the exact properties preserved seems less important to me than
mentioning that the second UPDATE statement is for its index --

+1

--
Justin

#9Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#8)
Re: pg15b2: large objects lost on upgrade

On Tue, Jul 5, 2022 at 12:56 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

My patch also leaves a 0-byte file around from initdb, which is harmless, but
dirty.

I've seen before where a bunch of 0-byte files are abandoned in an
otherwise-empty tablespace, with no associated relation, and I have to "rm"
them to be able to drop the tablespace. Maybe that's a known issue, maybe it's
due to crashes or some other edge case, maybe it's of no consequence, and maybe
it's already been fixed or is being fixed. But it'd be nice to avoid another
way to end up with 0-byte files - especially ones named with system OIDs.

Do you want to add something to the patch to have pg_upgrade remove
the stray file?

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#9)
Re: pg15b2: large objects lost on upgrade

On Tue, Jul 05, 2022 at 02:40:21PM -0400, Robert Haas wrote:

On Tue, Jul 5, 2022 at 12:56 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

My patch also leaves a 0-byte file around from initdb, which is harmless, but
dirty.

I've seen before where a bunch of 0-byte files are abandoned in an
otherwise-empty tablespace, with no associated relation, and I have to "rm"
them to be able to drop the tablespace. Maybe that's a known issue, maybe it's
due to crashes or some other edge case, maybe it's of no consequence, and maybe
it's already been fixed or is being fixed. But it'd be nice to avoid another
way to end up with 0-byte files - especially ones named with system OIDs.

Do you want to add something to the patch to have pg_upgrade remove
the stray file?

I'm looking into it, but it'd help to hear suggestions about where to put it.
My current ideas aren't very good.

--
Justin

#11Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#10)
Re: pg15b2: large objects lost on upgrade

On Wed, Jul 6, 2022 at 7:56 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I'm looking into it, but it'd help to hear suggestions about where to put it.
My current ideas aren't very good.

In main() there is a comment that begins "Most failures happen in
create_new_objects(), which has just completed at this point." I am
thinking you might want to insert a new function call just before that
comment, like remove_orphaned_files() or tidy_up_new_cluster().

Another option could be to do something at the beginning of
transfer_all_new_tablespaces().

--
Robert Haas
EDB: http://www.enterprisedb.com

#12Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#11)
1 attachment(s)
Re: pg15b2: large objects lost on upgrade

On Wed, Jul 06, 2022 at 08:25:04AM -0400, Robert Haas wrote:

On Wed, Jul 6, 2022 at 7:56 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I'm looking into it, but it'd help to hear suggestions about where to put it.
My current ideas aren't very good.

In main() there is a comment that begins "Most failures happen in
create_new_objects(), which has just completed at this point." I am
thinking you might want to insert a new function call just before that
comment, like remove_orphaned_files() or tidy_up_new_cluster().

Another option could be to do something at the beginning of
transfer_all_new_tablespaces().

That seems like the better option, since it has access to the cluster's
filenodes.

I checked upgrades from 9.2, upgrades with and without VACUUM FULL, and
upgrades with a DB tablespace.

Maybe it's a good idea to check that the file is empty before unlinking...

--
Justin

Attachments:

0001-fix-Preserve-relfilenodes-for-pg_largeobject-and-its.patch (text/x-diff; charset=us-ascii)
From dc4795c66916949493d51ff245c8e1509aaac05e Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sat, 2 Jul 2022 00:48:47 -0500
Subject: [PATCH] fix "Preserve relfilenodes" for pg_largeobject and its index

An empty table is created by initdb, and the filenode was still pointing to
that, so pg_largeobject was empty after binary upgrades.

Test like:
| ./tmp_install/usr/local/pgsql/bin/initdb --no-sync -D pg15b1.dat -k
| ./tmp_install/usr/local/pgsql/bin/postgres -D pg15b1.dat -c logging_collector=no -p 5678 -k /tmp&
| psql -h /tmp postgres -p 5678 -c '\lo_import /etc/shells' -c 'VACUUM FULL pg_largeobject'
| rm -fr pg15b2.dat && ./tmp_install/usr/local/pgsql/bin/initdb --no-sync -k -D pg15b2.dat && ./tmp_install/usr/local/pgsql/bin/pg_upgrade -d pg15b1.dat -D pg15b2.dat -b ./tmp_install/usr/local/pgsql/bin
| ./tmp_install/usr/local/pgsql/bin/postgres -D pg15b2.dat -c logging_collector=no -p 5678 -k /tmp&

See also:
9a974cbcba005256a19991203583a94b4f9a21a9
https://www.postgresql.org/message-id/20210601140949.GC22012%40momjian.us
https://www.postgresql.org/message-id/20220701231413.GI13040%40telsasoft.com
---
 src/bin/pg_dump/pg_dump.c              | 37 ++++++++------
 src/bin/pg_upgrade/relfilenumber.c     | 68 +++++++++++++++++++++++---
 src/bin/pg_upgrade/t/002_pg_upgrade.pl |  2 +
 3 files changed, 86 insertions(+), 21 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 227cdca2e2d..06d436f2f34 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3135,7 +3135,7 @@ dumpDatabase(Archive *fout)
 
 	/*
 	 * pg_largeobject comes from the old system intact, so set its
-	 * relfrozenxids and relminmxids.
+	 * relfrozenxids, relminmxids and relfilenode.
 	 */
 	if (dopt->binary_upgrade)
 	{
@@ -3143,34 +3143,41 @@ dumpDatabase(Archive *fout)
 		PQExpBuffer loFrozenQry = createPQExpBuffer();
 		PQExpBuffer loOutQry = createPQExpBuffer();
 		int			i_relfrozenxid,
+					i_relfilenode,
+					i_oid,
 					i_relminmxid;
 
 		/*
 		 * pg_largeobject
 		 */
 		if (fout->remoteVersion >= 90300)
-			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, relminmxid\n"
+			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, relminmxid, relfilenode, oid\n"
 							  "FROM pg_catalog.pg_class\n"
-							  "WHERE oid = %u;\n",
-							  LargeObjectRelationId);
+							  "WHERE oid IN (%u, %u);\n",
+							  LargeObjectRelationId, LargeObjectLOidPNIndexId);
 		else
-			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, 0 AS relminmxid\n"
+			appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, 0 AS relminmxid, relfilenode, oid\n"
 							  "FROM pg_catalog.pg_class\n"
-							  "WHERE oid = %u;\n",
-							  LargeObjectRelationId);
+							  "WHERE oid IN (%u, %u);\n",
+							  LargeObjectRelationId, LargeObjectLOidPNIndexId);
 
-		lo_res = ExecuteSqlQueryForSingleRow(fout, loFrozenQry->data);
+		lo_res = ExecuteSqlQuery(fout, loFrozenQry->data, PGRES_TUPLES_OK);
 
 		i_relfrozenxid = PQfnumber(lo_res, "relfrozenxid");
 		i_relminmxid = PQfnumber(lo_res, "relminmxid");
+		i_relfilenode = PQfnumber(lo_res, "relfilenode");
+		i_oid = PQfnumber(lo_res, "oid");
+
+		appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, preserve values for pg_largeobject and its index\n");
+		for (int i = 0; i < PQntuples(lo_res); ++i)
+			appendPQExpBuffer(loOutQry, "UPDATE pg_catalog.pg_class\n"
+							  "SET relfrozenxid = '%u', relminmxid = '%u', relfilenode = '%u'\n"
+							  "WHERE oid = %u;\n",
+							  atooid(PQgetvalue(lo_res, i, i_relfrozenxid)),
+							  atooid(PQgetvalue(lo_res, i, i_relminmxid)),
+							  atooid(PQgetvalue(lo_res, i, i_relfilenode)),
+							  atooid(PQgetvalue(lo_res, i, i_oid)));
 
-		appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, set pg_largeobject relfrozenxid and relminmxid\n");
-		appendPQExpBuffer(loOutQry, "UPDATE pg_catalog.pg_class\n"
-						  "SET relfrozenxid = '%u', relminmxid = '%u'\n"
-						  "WHERE oid = %u;\n",
-						  atooid(PQgetvalue(lo_res, 0, i_relfrozenxid)),
-						  atooid(PQgetvalue(lo_res, 0, i_relminmxid)),
-						  LargeObjectRelationId);
 		ArchiveEntry(fout, nilCatalogId, createDumpId(),
 					 ARCHIVE_OPTS(.tag = "pg_largeobject",
 								  .description = "pg_largeobject",
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index b3ad8209eca..0dd649aa0f5 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -13,9 +13,13 @@
 
 #include "access/transam.h"
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_largeobject_d.h"
 #include "pg_upgrade.h"
 
-static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
+static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace,
+	bool *has_lotable, bool *has_loindex);
+static void remove_stray_largeobject_files(DbInfo *new_db, bool has_lotable,
+		bool has_loindex);
 static void transfer_relfile(FileNameMap *map, const char *suffix, bool vm_must_add_frozenbit);
 
 
@@ -98,6 +102,8 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 				   *new_db = NULL;
 		FileNameMap *mappings;
 		int			n_maps;
+		bool		has_lotable = false;
+		bool		has_loindex = false;
 
 		/*
 		 * Advance past any databases that exist in the new cluster but not in
@@ -117,22 +123,66 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 
 		mappings = gen_db_file_maps(old_db, new_db, &n_maps, old_pgdata,
 									new_pgdata);
-		if (n_maps)
-		{
-			transfer_single_new_db(mappings, n_maps, old_tablespace);
-		}
+		transfer_single_new_db(mappings, n_maps, old_tablespace,
+			&has_lotable, &has_loindex);
+
+		/*
+		 * Remove the original pg_largeobject files created by initdb if they
+		 * exist in the new cluster with different relfilenodes.  This avoids
+		 * leaving behind stray files unassociated with any relation.
+		 */
+		remove_stray_largeobject_files(new_db, has_lotable, has_loindex);
+
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
 }
 
+/*
+ * If there's no relation in the cluster whose relfilenode is the OID of the
+ * largeobject table/index, then remove the stray file.  It ought to have a
+ * single segment and fork.
+ */
+static void
+remove_stray_largeobject_files(DbInfo *new_db, bool has_lotable,
+		bool has_loindex)
+{
+	char		prefix[MAXPGPATH];
+	char		largeobj_file[MAXPGPATH];
+
+	if (new_db->db_tablespace[0] != '\0')
+		snprintf(prefix, sizeof(prefix), "%s/%s/%u",
+				new_db->db_tablespace, new_cluster.tablespace_suffix,
+				new_db->db_oid);
+	else
+		snprintf(prefix, sizeof(prefix), "%s/base/%u",
+				new_cluster.pgdata, new_db->db_oid);
+
+	if (!has_lotable)
+	{
+		snprintf(largeobj_file, sizeof(largeobj_file), "%s/%u",
+				prefix, LargeObjectRelationId);
+		if (unlink(largeobj_file) != 0)
+			pg_fatal("Unable to remove %s.\n", largeobj_file);
+	}
+
+	if (!has_loindex)
+	{
+		snprintf(largeobj_file, sizeof(largeobj_file), "%s/%u",
+				prefix, LargeObjectLOidPNIndexId);
+		if (unlink(largeobj_file) != 0)
+			pg_fatal("Unable to remove %s.\n", largeobj_file);
+	}
+}
+
 /*
  * transfer_single_new_db()
  *
  * create links for mappings stored in "maps" array.
  */
 static void
-transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
+transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace,
+	bool *has_lotable, bool *has_loindex)
 {
 	int			mapnum;
 	bool		vm_must_add_frozenbit = false;
@@ -146,6 +196,12 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
+		/* Keep track of whether a filenode matches the OID */
+		if (maps[mapnum].relfilenumber == LargeObjectRelationId)
+			*has_lotable = true;
+		if (maps[mapnum].relfilenumber == LargeObjectLOidPNIndexId)
+			*has_loindex = true;
+
 		if (old_tablespace == NULL ||
 			strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
 		{
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 2f9b13bf0ae..9ed48c4e06a 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -208,6 +208,8 @@ if (defined($ENV{oldinstall}))
 	}
 }
 
+$oldnode->safe_psql("regression", "VACUUM FULL pg_largeobject;");
+
 # In a VPATH build, we'll be started in the source directory, but we want
 # to run pg_upgrade in the build directory so that any files generated finish
 # in it, like delete_old_cluster.{sh,bat}.
-- 
2.17.1

#13Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#12)
Re: pg15b2: large objects lost on upgrade

On Thu, Jul 7, 2022 at 1:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Maybe it's a good idea to check that the file is empty before unlinking...

If we want to verify that there are no large objects in the cluster,
we could do that in check_new_cluster_is_empty(). However, even if
there aren't, the length of the file could still be more than 0, if
there were some large objects previously and then they were removed.
So it's not entirely obvious to me that we should refuse to remove a
non-empty file.
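
The catalog-level part of that check could be as simple as this (a sketch),
run against each database of the new cluster:

SELECT count(*) FROM pg_catalog.pg_largeobject_metadata;
-- 0 means no large objects exist now, but as noted above, the underlying
-- file can still be nonzero-length if LOs were created and then removed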

--
Robert Haas
EDB: http://www.enterprisedb.com

#14Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#13)
Re: pg15b2: large objects lost on upgrade

On Thu, Jul 7, 2022 at 01:38:44PM -0400, Robert Haas wrote:

On Thu, Jul 7, 2022 at 1:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Maybe it's a good idea to check that the file is empty before unlinking...

If we want to verify that there are no large objects in the cluster,
we could do that in check_new_cluster_is_empty(). However, even if
there aren't, the length of the file could still be more than 0, if
there were some large objects previously and then they were removed.
So it's not entirely obvious to me that we should refuse to remove a
non-empty file.

Uh, that initdb-created pg_largeobject file should not have any data in
it ever, as far as I know at that point in pg_upgrade. How would values
have gotten in there? Via pg_dump?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#15Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#14)
Re: pg15b2: large objects lost on upgrade

On Thu, Jul 7, 2022 at 2:24 PM Bruce Momjian <bruce@momjian.us> wrote:

On Thu, Jul 7, 2022 at 01:38:44PM -0400, Robert Haas wrote:

On Thu, Jul 7, 2022 at 1:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Maybe it's a good idea to check that the file is empty before unlinking...

If we want to verify that there are no large objects in the cluster,
we could do that in check_new_cluster_is_empty(). However, even if
there aren't, the length of the file could still be more than 0, if
there were some large objects previously and then they were removed.
So it's not entirely obvious to me that we should refuse to remove a
non-empty file.

Uh, that initdb-created pg_largeobject file should not have any data in
it ever, as far as I know at that point in pg_upgrade. How would values
have gotten in there? Via pg_dump?

I was thinking if the user had done it manually before running pg_upgrade.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#15)
Re: pg15b2: large objects lost on upgrade

On Thu, Jul 07, 2022 at 02:38:44PM -0400, Robert Haas wrote:

On Thu, Jul 7, 2022 at 2:24 PM Bruce Momjian <bruce@momjian.us> wrote:

On Thu, Jul 7, 2022 at 01:38:44PM -0400, Robert Haas wrote:

On Thu, Jul 7, 2022 at 1:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Maybe it's a good idea to check that the file is empty before unlinking...

If we want to verify that there are no large objects in the cluster,
we could do that in check_new_cluster_is_empty(). However, even if
there aren't, the length of the file could still be more than 0, if
there were some large objects previously and then they were removed.
So it's not entirely obvious to me that we should refuse to remove a
non-empty file.

Uh, that initdb-created pg_largeobject file should not have any data in
it ever, as far as I know at that point in pg_upgrade. How would values
have gotten in there? Via pg_dump?

I was thinking if the user had done it manually before running pg_upgrade.

We're referring to the new cluster which should have been initdb'd more or less
immediately before running pg_upgrade [0].

It'd be weird to me if someone were to initdb a new cluster, then create some
large objects, and then maybe delete them, and then run pg_upgrade. Why
wouldn't they do their work on the old cluster before upgrading, or on the new
cluster afterwards ?

Does anybody actually do anything significant between initdb and pg_upgrade?
Is that considered to be supported? It seems like pg_upgrade could itself run
initdb, with the correct options for locale, checksum, block size, etc.
(although it probably has to support the existing way to handle "compatible
encodings").

Actually, I think check_new_cluster_is_empty() ought to prohibit doing work
between initdb and pg_upgrade by checking that no objects have *ever* been
created in the new cluster, by checking that NextOid == 16384. But I have a
separate thread about "pg_upgrade allows itself to be re-run", and this has
more to do with that than about whether to check that the file is empty before
removing it.
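
One possible form of that check (a sketch; pg_upgrade already reads the
control data, so it wouldn't need SQL for this): on a cluster where nothing
has ever been created, the OID counter still reports FirstNormalObjectId.

SELECT next_oid FROM pg_control_checkpoint();
-- 16384 on a pristine cluster; note the counter is advanced in chunks and
-- written at checkpoints, so this is only approximate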

--
Justin

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#16)
Re: pg15b2: large objects lost on upgrade

Justin Pryzby <pryzby@telsasoft.com> writes:

On Thu, Jul 07, 2022 at 02:38:44PM -0400, Robert Haas wrote:

On Thu, Jul 7, 2022 at 2:24 PM Bruce Momjian <bruce@momjian.us> wrote:

Uh, that initdb-created pg_largeobject file should not have any data in
it ever, as far as I know at that point in pg_upgrade. How would values
have gotten in there? Via pg_dump?

I was thinking if the user had done it manually before running pg_upgrade.

We're referring to the new cluster which should have been initdb'd more or less
immediately before running pg_upgrade [0].

It'd be weird to me if someone were to initdb a new cluster, then create some
large objects, and then maybe delete them, and then run pg_upgrade.

AFAIK you're voiding the warranty if you make any changes at all in the
destination cluster before pg_upgrade'ing. As an example, if you created
a table there you'd be risking an OID and/or relfilenode collision with
something due to be imported from the source cluster.

Actually, I think check_new_cluster_is_empty() ought to prohibit doing work
between initdb and pg_upgrade by checking that no objects have *ever* been
created in the new cluster, by checking that NextOid == 16384.

It would be good to have some such check; I'm not sure that that one in
particular is the best option.

But I have a
separate thread about "pg_upgrade allows itself to be re-run", and this has
more to do with that than about whether to check that the file is empty before
removing it.

Yeah, that's another foot-gun in the same area.

regards, tom lane

#18Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#16)
Re: pg15b2: large objects lost on upgrade
On Thu, Jul 7, 2022 at 2:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

We're referring to the new cluster which should have been initdb'd more or less
immediately before running pg_upgrade [0].

It'd be weird to me if someone were to initdb a new cluster, then create some
large objects, and then maybe delete them, and then run pg_upgrade. Why
wouldn't they do their work on the old cluster before upgrading, or on the new
cluster afterwards ?

Does anybody actually do anything significant between initdb and pg_upgrade?
Is that considered to be supported? It seems like pg_upgrade could itself run
initdb, with the correct options for locale, checksum, block size, etc.
(although it probably has to support the existing way to handle "compatible
encodings").

Actually, I think check_new_cluster_is_empty() ought to prohibit doing work
between initdb and pg_upgrade by checking that no objects have *ever* been
created in the new cluster, by checking that NextOid == 16384. But I have a
separate thread about "pg_upgrade allows itself to be re-run", and this has
more to do with that than about whether to check that the file is empty before
removing it.

I think you're getting at a really good point here which is also my
point: we assume that nothing significant has happened between when
the cluster was created and when pg_upgrade is run, but we don't check
it. Either we shouldn't assume it, or we should check it.

So, is such activity ever legitimate? I think there are people doing
it. The motivation is that maybe you have a dump from the old database
that doesn't quite restore on the new version, but by doing something
to the new cluster, you can make it restore. For instance, maybe there
are some functions that used to be part of core and are now only
available in an extension. That's going to make pg_upgrade's
dump-and-restore workflow fail, but if you install that extension onto
the new cluster, perhaps you can work around the problem. It doesn't
have to be an extension, even. Maybe some function in core just got an
extra argument, and you're using it, so the calls to that function
cause dump-and-restore to fail. You might try overloading it in the
new database with the old calling sequence to get things working.

Now, are these kinds of things considered to be supported? Well, I
don't know that we've made any statement about that one way or the
other. Perhaps they are not. But I can see why people want to use
workarounds like this. The alternative is having to dump-and-restore
instead of an in-place upgrade, and that's painfully slow by
comparison. pg_upgrade itself doesn't give you any tools to deal with
this kind of situation, but the process is just loose enough to allow
people to insert their own workarounds, so they do. I'm sure I'd do
the same, in their shoes.

My view on this is that, while we probably don't want to make such
things officially supported, I don't think we should ban it outright,
either. We probably can't support an upgrade after the next cluster
has been subjected to arbitrary amounts of tinkering, but we're making
a mistake if we write code that has fragile assumptions for no really
good reason. I think we can do better than this excerpt from your
patch, for example:

+               /* Keep track of whether a filenode matches the OID */
+               if (maps[mapnum].relfilenumber == LargeObjectRelationId)
+                       *has_lotable = true;
+               if (maps[mapnum].relfilenumber == LargeObjectLOidPNIndexId)
+                       *has_loindex = true;

I spent a while struggling to understand this because it seems to me
that every database has an LO table and an LO index, so what's up with
these names? I think what these names are really tracking is whether
the relfilenumber of pg_largeobject and its index in the old database
had their default values. But this makes the assumption that the LO
table and LO index in the new database have never been subjected to
VACUUM FULL or CLUSTER and, while there's no real reason to do that, I
can't quite see what the point of such an unnecessary and fragile
assumption might be. If Dilip's patch to make relfilenodes 56 bits
gets committed, this is going to break anyway, because with that
patch, the relfilenode for a table or index doesn't start out being
equal to its OID.

Perhaps a better solution to this particular problem is to remove the
backing files for the large object table and index *before* restoring
the dump, deciding what files to remove by asking the running server
for the file path. It might seem funny to allow for dangling pg_class
entries, but we're going to create that situation for all other user
rels anyway, and pg_upgrade treats pg_largeobject as a user rel.
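
"Asking the running server for the file path" could be as simple as this
(a sketch):

SELECT pg_relation_filepath('pg_catalog.pg_largeobject'),
       pg_relation_filepath('pg_catalog.pg_largeobject_loid_pn_index');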

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#7)
Re: pg15b2: large objects lost on upgrade

On Tue, Jul 5, 2022 at 12:43:54PM -0400, Robert Haas wrote:

On Sat, Jul 2, 2022 at 11:49 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I suppose it's like Bruce said, here.

/messages/by-id/20210601140949.GC22012@momjian.us

Well, I feel dumb. I remember reading that email back when Bruce sent
it, but it seems that it slipped out of my head between then and when
I committed. I think your patch is fine, except that I think maybe we

It happens to us all.

I had a moment of panic this morning where I thought maybe the whole

Yes, I have had those panics too.

patch needed to be reverted. I was worried that we might need to
preserve the OID of every system table and index. Otherwise, what
happens if the OID counter in the old cluster wraps around and some
user-created object gets an OID that the system tables are using in
the new cluster? However, I think this can't happen, because when the
OID counter wraps around, it wraps around to 16384, and the
relfilenode values for newly created system tables and indexes are all
less than 16384. So I believe we only need to fix pg_largeobject and
its index, and I think your patch does that.

So, let me explain how I look at this. There are two number-spaces, oid
and relfilenode. In each number-space, there are system-assigned ones
less than 16384, and higher ones for post-initdb use.

What we did in pre-PG15 was to preserve only oids, and have the
relfilenode match the oid, and we have discussed the negatives of this.

For PG 15+, we preserve relfilenodes too. These number assignment cases
only work if we handle _all_ numbering, except for non-pg_largeobject
system tables.

In pre-PG15, pg_largeobject was easily handled because initdb already
assigned the oid and relfilenode to be the same for pg_largeobject, so a
simple copy worked fine. pg_largeobject is an anomaly in PG 15 because
it is assigned a relfilenode in the system number space by initdb, but
then it needs to be potentially renamed into the relfilenode user number
space. This is the basis for my email as already posted:

/messages/by-id/20210601140949.GC22012@momjian.us

You are right to be concerned since you are spanning number spaces, but
I think you are fine because the relfilenode in the user-space cannot
have been used since it already was being used in each database. It is
true we never had a per-database rename like this before.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#20Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#19)
Re: pg15b2: large objects lost on upgrade

On Thu, Jul 7, 2022 at 4:16 PM Bruce Momjian <bruce@momjian.us> wrote:

You are right to be concerned since you are spanning number spaces, but
I think you are fine because the relfilenode in the user-space cannot
have been used since it already was being used in each database. It is
true we never had a per-database rename like this before.

Thanks for checking over the reasoning, and the kind words in general.
I just committed Justin's fix for the bug, without fixing the fact
that the new cluster's original pg_largeobject files will be left
orphaned afterward. That's a relatively minor problem by comparison,
and it seemed best to me not to wait too long to get the main issue
addressed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#18)
Re: pg15b2: large objects lost on upgrade

On Thu, Jul 07, 2022 at 03:11:38PM -0400, Robert Haas wrote:

point: we assume that nothing significant has happened between when
the cluster was created and when pg_upgrade is run, but we don't check
it. Either we shouldn't assume it, or we should check it.

So, is such activity ever legitimate? I think there are people doing
it. The motivation is that maybe you have a dump from the old database
that doesn't quite restore on the new version, but by doing something
to the new cluster, you can make it restore. For instance, maybe there
are some functions that used to be part of core and are now only
available in an extension. That's going to make pg_upgrade's
dump-and-restore workflow fail, but if you install that extension onto
the new cluster, perhaps you can work around the problem. It doesn't
have to be an extension, even. Maybe some function in core just got an
extra argument, and you're using it, so the calls to that function
cause dump-and-restore to fail. You might try overloading it in the
new database with the old calling sequence to get things working.

I don't think that's even possible.

pg_upgrade drops template1 and postgres before upgrading:

* template1 database will already exist in the target installation,
* so tell pg_restore to drop and recreate it; otherwise we would fail
* to propagate its database-level properties.

* postgres database will already exist in the target installation, so
* tell pg_restore to drop and recreate it; otherwise we would fail to
* propagate its database-level properties.

For any other DBs, you'd hit an error if, after initdb'ing, you started the new
cluster, connected to it, created a DB (?!) and then tried to upgrade:

pg_restore: error: could not execute query: ERROR: database "pryzbyj" already exists

So if people start, connect, and then futz with a cluster before upgrading it,
it must be for global stuff (roles, tablespaces), and not per-DB stuff.
Also, pg_upgrade refuses to run if additional roles are defined...
So I'm not seeing what someone could do on the new cluster.

That supports the idea that it'd be okay to refuse to upgrade anything other
than a pristine cluster.

Now, are these kinds of things considered to be supported? Well, I
don't know that we've made any statement about that one way or the
other. Perhaps they are not. But I can see why people want to use
workarounds like this. The alternative is having to dump-and-restore
instead of an in-place upgrade, and that's painfully slow by
comparison.

The alternative in cases that I know about is to fix the old DB to allow it to
be upgraded. check.c has a list of the things that aren't upgradable, and the
fixes are things like ALTER TABLE DROP OIDs. We just added another one to
handle v14 aggregates (09878cdd4).

My view on this is that, while we probably don't want to make such
things officially supported, I don't think we should ban it outright,
either. We probably can't support an upgrade after the next cluster
has been subjected to arbitrary amounts of tinkering, but we're making
a mistake if we write code that has fragile assumptions for no really
good reason. I think we can do better than this excerpt from your
patch, for example:

+               /* Keep track of whether a filenode matches the OID */
+               if (maps[mapnum].relfilenumber == LargeObjectRelationId)
+                       *has_lotable = true;
+               if (maps[mapnum].relfilenumber == LargeObjectLOidPNIndexId)
+                       *has_loindex = true;

I spent a while struggling to understand this because it seems to me
that every database has an LO table and an LO index, so what's up with
these names? I think what these names are really tracking is whether
the relfilenumber of pg_largeobject and its index in the old database
had their default values.

Yes, has_lotable means "has a LO table whose filenode matches the OID".
I will solicit suggestions for a better name.

But this makes the assumption that the LO
table and LO index in the new database have never been subjected to
VACUUM FULL or CLUSTER and, while there's no real reason to do that, I
can't quite see what the point of such an unnecessary and fragile
assumption might be.

The idea of running initdb, starting the cluster, and connecting to it to run
VACUUM FULL scares me. Now that I think about it, it might be almost
inconsequential, since the initial DBs are dropped, and the upgrade will fail
if any non-template DB exists. But ... maybe something exciting happens if you
vacuum full a shared catalog... Yup.

With my patch:

./tmp_install/usr/local/pgsql/bin/initdb --no-sync -D pg15b1.dat -k
./tmp_install/usr/local/pgsql/bin/postgres -D pg15b1.dat -c logging_collector=no -p 5678 -k /tmp&
postgres=# \lo_import /etc/shells
postgres=# VACUUM FULL pg_largeobject;
./tmp_install/usr/local/pgsql/bin/initdb --no-sync -D pg15b2.dat -k
./tmp_install/usr/local/pgsql/bin/postgres -D pg15b2.dat -c logging_collector=no -p 5678 -k /tmp&
postgres=# VACUUM FULL pg_database;
tmp_install/usr/local/pgsql/bin/pg_upgrade -D pg15b2.dat -b tmp_install/usr/local/pgsql/bin -d pg15b1.dat

postgres=# SELECT COUNT(1), pg_relation_filenode(oid), array_agg(relname) FROM pg_class WHERE pg_relation_filenode(oid) IS NOT NULL GROUP BY 2 HAVING COUNT(1)>1 ORDER BY 1 DESC ;
 count | pg_relation_filenode |                     array_agg
-------+----------------------+----------------------------------------------------
     2 |                16388 | {pg_toast_1262_index,pg_largeobject_loid_pn_index}
(1 row)

I don't have a deep understanding of why the DB hasn't imploded at this point,
maybe related to the filenode map file, but it seems very close to being
catastrophic.

It seems like pg_upgrade should at least check that the new cluster has no
objects with either OIDs or relfilenodes in the user range.
You could blame my patch, since the issue is limited to pg_largeobject.

Perhaps a better solution to this particular problem is to remove the
backing files for the large object table and index *before* restoring
the dump, deciding what files to remove by asking the running server
for the file path. It might seem funny to allow for dangling pg_class
entries, but we're going to create that situation for all other user
rels anyway, and pg_upgrade treats pg_largeobject as a user rel.

I'll think about it more later.

--
Justin

#22Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#21)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 8, 2022 at 11:53 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

pg_upgrade drops template1 and postgres before upgrading:

Hmm, but I bet you could fiddle with template0. Indeed what's the
difference between a user fiddling with template0 and me committing a
patch that bumps catversion? If the latter doesn't prevent pg_upgrade
from working when the release comes out, why should the former?

I don't have a deep understanding why the DB hasn't imploded at this point,
maybe related to the filenode map file, but it seems very close to being
catastrophic.

Yeah, good example.

It seems like pg_upgrade should at least check that the new cluster has no
objects with either OIDs or relfilenodes in the user range.

Well... I think it's not quite that simple. There's an argument for
that rule, to be sure, but in some sense it's far too strict. We only
preserve the OIDs of tablespaces, types, enums, roles, and now
relations. So if you create any other type of object in the new
cluster, like say a function, it's totally fine. You could still fail
if the old cluster happens to contain a function with the same
signature, but that's kind of a different issue. An OID collision for
any of the many object types for which OIDs are not preserved is no
problem at all.

But even if we restrict ourselves to talking about an object type for
which OIDs are preserved, it's still not quite that simple. For
example, if I create a relation in the new cluster, its OID or
relfilenode might be the same as a relation that exists in the old
cluster. In such a case, a failure is inevitable. We're definitely in
big trouble, and the question is only whether pg_upgrade will notice.
But it's also possible that, either by planning or by pure dumb luck,
neither the OID nor the relfilenode that I've created in the new cluster
is in use in the old cluster. In such a case, the upgrade can succeed
without breaking anything, or at least nothing other than our sense of
order in the universe.

Without a doubt, there are holes in pg_upgrade's error checking that
need to be plugged, but I think there is room to debate exactly what
size plug we want to use. I can't really say that it's definitely
stupid to use a plug that's definitely big enough to catch all the
scenarios that might break stuff, but I think my own preference would
be to try to craft it so that it isn't too much larger than necessary.
That's partly because I do think there are some scenarios in which
modifying the new cluster might be the easiest way of working around
some problem, but also because, as a matter of principle, I like the
idea of making rules that correspond to the real dangers. If we write
a rule that says essentially "it's no good if there are two relations
sharing a relfilenode," nobody with any understanding of how the
system works can argue that bypassing it is a sensible thing to do,
and probably nobody will even try, because it's so obviously bonkers
to do so. It's a lot less obviously bonkers to try to bypass the
broader prohibition which you suggest should never be bypassed, so
someone may do it, and get themselves in trouble.

Now I'm not saying such a person will get any sympathy from this list.
If for example you #if out the sanity check and hose yourself, people
here, including me, are going to suggest that you've hosed
yourself and it's not our problem. But ... the world is full of
warnings about problems that aren't really that serious, and sometimes
those have the effect of discouraging people from taking warnings
about very serious problems as seriously as they should. I know that I
no longer panic when the national weather service texts me to say that
there's a tornado or a flash flood in my area. They've just done that
too many times when there was no real issue with which I needed to be
concerned. If I get caught out by a tornado at some point, they're
probably going to say "well that's why you should always take our
warnings seriously," but I'm going to say "well that's why you
shouldn't send spurious warnings."

Perhaps a better solution to this particular problem is to remove the
backing files for the large object table and index *before* restoring
the dump, deciding what files to remove by asking the running server
for the file path. It might seem funny to allow for dangling pg_class
entries, but we're going to create that situation for all other user
rels anyway, and pg_upgrade treats pg_largeobject as a user rel.

I'll think about it more later.

Sounds good.

--
Robert Haas
EDB: http://www.enterprisedb.com

#23Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#20)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 08, 2022 at 10:44:07AM -0400, Robert Haas wrote:

Thanks for checking over the reasoning, and the kind words in general.

Thanks for fixing the main issue.

I just committed Justin's fix for the bug, without fixing the fact
that the new cluster's original pg_largeobject files will be left
orphaned afterward. That's a relatively minor problem by comparison,
and it seemed best to me not to wait too long to get the main issue
addressed.

Hmm. That would mean that the more LOs a cluster has, the more bloat
there will be in the new cluster once the upgrade is done. That could
be quite a few gigs worth of data lying around depending on the data
inserted in the source cluster, and we don't have a way to know which
files to remove post-upgrade, do we?
--
Michael

#24Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#23)
Re: pg15b2: large objects lost on upgrade

On Sun, Jul 10, 2022 at 9:31 PM Michael Paquier <michael@paquier.xyz> wrote:

Hmm. That would mean that the more LOs a cluster has, the more bloat
there will be in the new cluster once the upgrade is done. That could
be quite a few gigs worth of data laying around depending on the data
inserted in the source cluster, and we don't have a way to know which
files to remove post-upgrade, do we?

The files that are being leaked here are the files backing the
pg_largeobject table and the corresponding index as they existed in
the new cluster just prior to the upgrade. Hopefully, the table is a
zero-length file and the index is just one block, because you're
supposed to use a newly-initdb'd cluster as the target for a
pg_upgrade operation. Now, you don't actually have to do that: as
we've been discussing, there aren't as many sanity checks in this code
as there probably should be. But it would still be surprising to
initdb a new cluster, load gigabytes of large objects into it, and
then use it as the target cluster for a pg_upgrade.

As for whether it's possible to know which files to remove
post-upgrade, that's the same problem as trying to figure out whether
there are orphaned files in any other PostgreSQL cluster. We don't
have a tool for it, but if you're sure that the system is more or less
quiescent - no uncommitted DDL, in particular - you can compare
pg_class.relfilenode to the actual filesystem contents and figure out
what extra stuff is present on the filesystem level.
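
Roughly, per database, the pg_class side of that comparison is (a sketch;
diff the result against an ls of base/<dboid>/, ignoring fork suffixes like
_fsm/_vm and segment numbers):

SELECT pg_relation_filenode(oid) AS filenode, relname
FROM pg_class
WHERE pg_relation_filenode(oid) IS NOT NULL
ORDER BY 1;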

I am not saying we shouldn't try to fix this up more thoroughly, just
that I think you are overestimating the consequences.

--
Robert Haas
EDB: http://www.enterprisedb.com

#25Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#24)
Re: pg15b2: large objects lost on upgrade

On Mon, Jul 11, 2022 at 9:16 AM Robert Haas <robertmhaas@gmail.com> wrote:

I am not saying we shouldn't try to fix this up more thoroughly, just
that I think you are overestimating the consequences.

I spent a bunch of time looking at this today and I have more sympathy
for Justin's previous proposal now. I found it somewhat hacky that he
was relying on the hard-coded value of LargeObjectRelationId and
LargeObjectLOidPNIndexId, but I discovered that it's harder to do
better than I had assumed. Suppose we don't want to compare against a
hard-coded constant but against the value that is actually present
before the dump overwrites the pg_class row's relfilenode. Well, we
can't get that value from the database in question before restoring
the dump, because restoring either the dump creates or recreates the
database in all cases. The CREATE DATABASE command that will be part
of the dump always specifies TEMPLATE template0, so if we want
something other than a hard-coded constant, we need the
pg_class.relfilenode values from template0 for pg_largeobject and
pg_largeobject_loid_pn_index. But we can't connect to that database to
query those values, because it has datallowconn = false. Oops.

I have a few more ideas to try here. It occurs to me that we could fix
this more cleanly if we could get the dump itself to set the
relfilenode for pg_largeobject to the desired value. Right now, it's
just overwriting the relfilenode stored in the catalog without
actually doing anything that would cause a change on disk. But if we
could make it change the relfilenode in a more principled way that
would actually cause an on-disk change, then the orphaned-file problem
would be fixed, because we'd always be installing the new file over
top of the old file. I'm going to investigate how hard it would be to
make that work.

--
Robert Haas
EDB: http://www.enterprisedb.com

#26Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#25)
Re: pg15b2: large objects lost on upgrade

On Tue, Jul 12, 2022 at 04:51:44PM -0400, Robert Haas wrote:

I spent a bunch of time looking at this today and I have more sympathy
for Justin's previous proposal now. I found it somewhat hacky that he
was relying on the hard-coded value of LargeObjectRelationId and
LargeObjectLOidPNIndexId, but I discovered that it's harder to do
better than I had assumed. Suppose we don't want to compare against a
hard-coded constant but against the value that is actually present
before the dump overwrites the pg_class row's relfilenode. Well, we
can't get that value from the database in question before restoring
the dump, because restoring the dump either creates or recreates the
database in all cases. The CREATE DATABASE command that will be part
of the dump always specifies TEMPLATE template0, so if we want
something other than a hard-coded constant, we need the
pg_class.relfilenode values from template0 for pg_largeobject and
pg_largeobject_loid_pn_index. But we can't connect to that database to
query those values, because it has datallowconn = false. Oops.

I have a few more ideas to try here. It occurs to me that we could fix
this more cleanly if we could get the dump itself to set the
relfilenode for pg_largeobject to the desired value. Right now, it's
just overwriting the relfilenode stored in the catalog without
actually doing anything that would cause a change on disk. But if we
could make it change the relfilenode in a more principled way that
would actually cause an on-disk change, then the orphaned-file problem
would be fixed, because we'd always be installing the new file over
top of the old file. I'm going to investigate how hard it would be to
make that work.

Thanks for all the details here. This originally sounded like the new
cluster was keeping around some orphaned relation files with the old
LOs still stored in it. But as that's just the freshly initdb'd
relfilenodes of pg_largeobject, that does not strike me as something
absolutely critical to fix for v15 as orphaned relfilenodes are an
existing problem. If we finish with a solution rather simple in
design, I'd be fine to stick a fix in REL_15_STABLE, but playing with
this stable branch more than necessary may be risky after beta2. At
the end, I would be fine to drop the open item now that the main issue
has been fixed.
--
Michael

#27Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#25)
2 attachment(s)
Re: pg15b2: large objects lost on upgrade

On Tue, Jul 12, 2022 at 4:51 PM Robert Haas <robertmhaas@gmail.com> wrote:

I have a few more ideas to try here. It occurs to me that we could fix
this more cleanly if we could get the dump itself to set the
relfilenode for pg_largeobject to the desired value. Right now, it's
just overwriting the relfilenode stored in the catalog without
actually doing anything that would cause a change on disk. But if we
could make it change the relfilenode in a more principled way that
would actually cause an on-disk change, then the orphaned-file problem
would be fixed, because we'd always be installing the new file over
top of the old file. I'm going to investigate how hard it would be to
make that work.

Well, it took a while to figure out how to make that work, but I
believe I've got it now. Attached please find a couple of patches that
should get the job done. They might need a bit of polish, but I think
the basic concepts are sound.

My first thought was to have the dump issue VACUUM FULL pg_largeobject
after first calling binary_upgrade_set_next_heap_relfilenode() and
binary_upgrade_set_next_index_relfilenode(), and have the VACUUM FULL
use the values configured by those calls for the new heap and index
OID. I got this working in standalone testing, only to find that this
doesn't work inside pg_upgrade. The complaint is "ERROR: VACUUM
cannot run inside a transaction block", and I think that happens
because pg_restore sends the entire TOC entry for a single object to
the server as a single query, and here it contains multiple SQL
commands. That multi-command string ends up being treated like an
implicit transaction block.
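
For concreteness, the TOC entry amounted to something like the
following, shipped to the server as one string (the relfilenumber
values here are placeholders):

SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('12362'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('12364'::pg_catalog.oid);
-- The whole string runs as an implicit transaction block, so this
-- fails with "VACUUM cannot run inside a transaction block":
VACUUM FULL pg_catalog.pg_largeobject;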

So my second thought was to have the dump still call
binary_upgrade_set_next_heap_relfilenode() and
binary_upgrade_set_next_index_relfilenode(), but then afterwards call
TRUNCATE rather than VACUUM FULL. I found that a simple change to
RelationSetNewRelfilenumber() did the trick: I could then easily
change the heap and index relfilenodes for pg_largeobject to any new
values I liked. However, I realized that this approach had a problem:
what if the pg_largeobject relation had never been rewritten in the
old cluster? Then the heap and index relfilenodes would be unchanged.
It's also possible that someone might have run REINDEX in the old
cluster, in which case it might happen that the heap relfilenode was
unchanged, but the index relfilenode had changed. I spent some time
fumbling around with trying to get the non-transactional truncate path
to cover these cases, but the fact that we might need to change the
relfilenode for the index but not the heap makes this approach seem
pretty awful.

But it occurred to me that in the case of a pg_upgrade, we don't
really need to keep the old storage around until commit time. We can
unlink it first, before creating the new storage, and then if the old
and new storage happen to be the same, it still works. I can think of
two possible objections to this way forward. First, it means that the
operation isn't properly transactional. However, if the upgrade fails
at this stage, the new cluster is going to have to be nuked and
recreated anyway, so the fact that things might be in an unclean state
after an ERROR is irrelevant. Second, one might wonder whether such a
fix is really sufficient. For example, what happens if the relfilenode
allocated to pg_largeobject in the old cluster is assigned to its index
in the new cluster, and vice versa? To make that work, we would need
to remove all storage for all relfilenodes first, and then recreate
them all afterward. However, I don't think we need to make that work.
If an object in the old cluster has a relfilenode < 16384, then that
should mean it's never been rewritten and therefore its relfilenode in
the new cluster should be the same. The only way this wouldn't be true
is if we shuffled around the initial relfilenode assignments in a new
version of PG so that the same values were used but now for different
objects, which would be a really dumb idea. And on the other hand, if
the object in the old cluster has a relfilenode > 16384, then that
relfilenode value should be unused in the new cluster. If not, the
user has tinkered with the new cluster more than they ought.
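
As a quick sanity check of that reasoning, something like this run
against the old cluster shows which case applies, i.e. whether
pg_largeobject or its index was ever rewritten there:

SELECT oid, relname, relfilenode,
       relfilenode >= 16384 AS rewritten  -- 16384 = FirstNormalObjectId
  FROM pg_catalog.pg_class
 WHERE relname IN ('pg_largeobject', 'pg_largeobject_loid_pn_index');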

So I tried implementing this but I didn't get it quite right the first
time. It's not enough to call smgrdounlinkall() instead of
RelationDropStorage(), because just as RelationDropStorage() does not
actually drop the storage but only schedules it to be dropped later,
so also smgrdounlinkall() does not in fact unlink all, but only some.
It leaves the first segment of the main relation fork around to guard
against the hazards discussed in the header comments for mdunlink().
But those hazards don't really matter here either, because, again, any
failure will necessarily require that the entire new cluster be
destroyed, and also because there shouldn't be any concurrent activity
in the new cluster while pg_upgrade is running. So I adjusted
smgrdounlinkall() to actually remove everything when IsBinaryUpgrade =
true. And then it all seems to work: pg_upgrade of a cluster that has
had a rewrite of pg_largeobject works, and pg_upgrade of a cluster
that has not had such a rewrite works too. Wa-hoo.

As to whether this is a good fix, I think someone could certainly
argue otherwise. This is all a bit grotty. However, I don't find it
all that bad. As long as we're moving files between one PG
cluster and another using an external tool rather than logic inside
the server itself, I think we're bound to have some hacks someplace to
make it all work. To me, extending them to a few more places to avoid
leaving files behind on disk seems like a good trade-off. Your mileage
may vary.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

0002-Have-pg_dump-use-TRUNCATE-to-fix-up-pg_largeobject-r.patch (application/octet-stream)
From c52c83268eb853f40df3c023e4f36da1553f06d2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 14 Jul 2022 15:31:45 -0400
Subject: [PATCH 2/2] Have pg_dump use TRUNCATE to fix up pg_largeobject
 relfilenode.

---
 src/bin/pg_dump/pg_dump.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e4fdb6b75b..b399b2de1a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3141,6 +3141,7 @@ dumpDatabase(Archive *fout)
 		PGresult   *lo_res;
 		PQExpBuffer loFrozenQry = createPQExpBuffer();
 		PQExpBuffer loOutQry = createPQExpBuffer();
+		PQExpBuffer loVacQry = createPQExpBuffer();
 		int			i_relfrozenxid,
 					i_relfilenode,
 					i_oid,
@@ -3167,15 +3168,36 @@ dumpDatabase(Archive *fout)
 		i_relfilenode = PQfnumber(lo_res, "relfilenode");
 		i_oid = PQfnumber(lo_res, "oid");
 
-		appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, preserve values for pg_largeobject and its index\n");
+		appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, set pg_largeobject relfrozenxid and relminmxid\n");
+		appendPQExpBufferStr(loVacQry, "\n-- For binary upgrade, preserve pg_largeobject and index relfilenodes\n");
 		for (int i = 0; i < PQntuples(lo_res); ++i)
+		{
+			Oid		oid;
+			RelFileNumber	relfilenumber;
+
 			appendPQExpBuffer(loOutQry, "UPDATE pg_catalog.pg_class\n"
-							  "SET relfrozenxid = '%u', relminmxid = '%u', relfilenode = '%u'\n"
+							  "SET relfrozenxid = '%u', relminmxid = '%u'\n"
 							  "WHERE oid = %u;\n",
 							  atooid(PQgetvalue(lo_res, i, i_relfrozenxid)),
 							  atooid(PQgetvalue(lo_res, i, i_relminmxid)),
-							  atooid(PQgetvalue(lo_res, i, i_relfilenode)),
-							  atooid(PQgetvalue(lo_res, i, i_oid)));
+							  atooid(PQgetvalue(lo_res, i, i_relfilenode)));
+
+			oid = atooid(PQgetvalue(lo_res, i, i_oid));
+			relfilenumber = atooid(PQgetvalue(lo_res, i, i_relfilenode));
+
+			if (oid == LargeObjectRelationId)
+				appendPQExpBuffer(loVacQry,
+								  "SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('%u'::pg_catalog.oid);\n",
+								  relfilenumber);
+			else if (oid == LargeObjectLOidPNIndexId)
+				appendPQExpBuffer(loVacQry,
+								  "SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('%u'::pg_catalog.oid);\n",
+								  relfilenumber);
+		}
+
+		appendPQExpBufferStr(loVacQry,
+							 "TRUNCATE pg_catalog.pg_largeobject;\n");
+		appendPQExpBufferStr(loOutQry, loVacQry->data);
 
 		ArchiveEntry(fout, nilCatalogId, createDumpId(),
 					 ARCHIVE_OPTS(.tag = "pg_largeobject",
@@ -3187,6 +3209,7 @@ dumpDatabase(Archive *fout)
 
 		destroyPQExpBuffer(loFrozenQry);
 		destroyPQExpBuffer(loOutQry);
+		destroyPQExpBuffer(loVacQry);
 	}
 
 	PQclear(res);
-- 
2.24.3 (Apple Git-128)

0001-Allow-TRUNCATE-pg_largeobject-in-binary-upgrade-mode.patch (application/octet-stream)
From 0cedbc2e826af3f5917626f2ea52e4a7668049b5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 14 Jul 2022 15:14:56 -0400
Subject: [PATCH 1/2] Allow TRUNCATE pg_largeobject in binary upgrade mode.

Because the goal here is to be able to change the relfilenode of
this system table and its index, we need to use the code path that
is normally used for transactional rewrites. However, because we
want to be able to use this code path even when the relfilenode
hasn't changed, we need to deviate from ordinary transactional
semantics by nuking the old storage in its entirety before creating
the new storage.
---
 src/backend/commands/tablecmds.c   | 11 ++++-
 src/backend/storage/smgr/md.c      | 27 +++++++++---
 src/backend/utils/cache/relcache.c | 66 +++++++++++++++++++++++++++---
 3 files changed, 93 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index a2f577024a..35f9f6d8d9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_depend.h"
 #include "catalog/pg_foreign_table.h"
 #include "catalog/pg_inherits.h"
+#include "catalog/pg_largeobject.h"
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_statistic_ext.h"
@@ -2185,7 +2186,15 @@ truncate_check_rel(Oid relid, Form_pg_class reltuple)
 				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 				 errmsg("\"%s\" is not a table", relname)));
 
-	if (!allowSystemTableMods && IsSystemClass(relid, reltuple))
+	/*
+	 * Most system catalogs can't be truncated at all, or at least not unless
+	 * allow_system_table_mods=on. As an exception, however, we allow
+	 * pg_largeobject to be truncated as part of pg_upgrade, because we need
+	 * to change its relfilenode to match the old cluster, and allowing a
+	 * TRUNCATE command to be executed is the easiest way of doing that.
+	 */
+	if (!allowSystemTableMods && IsSystemClass(relid, reltuple)
+		&& (!IsBinaryUpgrade || relid != LargeObjectRelationId))
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 errmsg("permission denied: \"%s\" is a system catalog",
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3998296a62..3deac496ee 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -319,6 +319,7 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forkNum, bool isRedo)
 {
 	char	   *path;
 	int			ret;
+	BlockNumber segno = 0;
 
 	path = relpath(rlocator, forkNum);
 
@@ -353,8 +354,22 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forkNum, bool isRedo)
 		/* Prevent other backends' fds from holding on to the disk space */
 		ret = do_truncate(path);
 
-		/* Register request to unlink first segment later */
-		register_unlink_segment(rlocator, forkNum, 0 /* first seg */ );
+		/*
+		 * Except during a binary upgrade, register request to unlink first
+		 * segment later, rather than now.
+		 *
+		 * If we're performing a binary upgrade, the dangers described in the
+		 * header comments for mdunlink() do not exist, since after a crash
+		 * or even a simple ERROR, the upgrade fails and the whole new cluster
+		 * must be recreated from scratch. And, on the other hand, it is
+		 * important to remove the files from disk immediately, because we
+		 * may be about to reuse the same relfilenumber.
+		 */
+		if (!IsBinaryUpgrade)
+		{
+			register_unlink_segment(rlocator, forkNum, 0 /* first seg */ );
+			++segno;
+		}
 	}
 
 	/*
@@ -363,15 +378,17 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forkNum, bool isRedo)
 	if (ret >= 0)
 	{
 		char	   *segpath = (char *) palloc(strlen(path) + 12);
-		BlockNumber segno;
 
 		/*
 		 * Note that because we loop until getting ENOENT, we will correctly
 		 * remove all inactive segments as well as active ones.
 		 */
-		for (segno = 1;; segno++)
+		for (;; segno++)
 		{
-			sprintf(segpath, "%s.%u", path, segno);
+			if (segno == 0)
+				strcpy(segpath, path);
+			else
+				sprintf(segpath, "%s.%u", path, segno);
 
 			if (!RelFileLocatorBackendIsTemp(rlocator))
 			{
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index bdb771d278..5658e48777 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -41,6 +41,7 @@
 #include "access/tupdesc_details.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "catalog/binary_upgrade.h"
 #include "catalog/catalog.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
@@ -3707,9 +3708,36 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
 	TransactionId freezeXid = InvalidTransactionId;
 	RelFileLocator newrlocator;
 
-	/* Allocate a new relfilenumber */
-	newrelfilenumber = GetNewRelFileNumber(relation->rd_rel->reltablespace,
-										   NULL, persistence);
+	if (!IsBinaryUpgrade)
+	{
+		/* Allocate a new relfilenumber */
+		newrelfilenumber = GetNewRelFileNumber(relation->rd_rel->reltablespace,
+											   NULL, persistence);
+	}
+	else if (relation->rd_rel->relkind == RELKIND_INDEX)
+	{
+		if (!OidIsValid(binary_upgrade_next_index_pg_class_relfilenumber))
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("index relfilenumber value not set when in binary upgrade mode")));
+
+		newrelfilenumber = binary_upgrade_next_index_pg_class_relfilenumber;
+		binary_upgrade_next_index_pg_class_relfilenumber = InvalidOid;
+	}
+	else if (relation->rd_rel->relkind == RELKIND_RELATION)
+	{
+		if (!OidIsValid(binary_upgrade_next_heap_pg_class_relfilenumber))
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("heap relfilenumber value not set when in binary upgrade mode")));
+
+		newrelfilenumber = binary_upgrade_next_heap_pg_class_relfilenumber;
+		binary_upgrade_next_heap_pg_class_relfilenumber = InvalidOid;
+	}
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("unexpected request for new relfilenumber in binary upgrade mode")));
 
 	/*
 	 * Get a writable copy of the pg_class tuple for the given relation.
@@ -3724,9 +3752,37 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
 	classform = (Form_pg_class) GETSTRUCT(tuple);
 
 	/*
-	 * Schedule unlinking of the old storage at transaction commit.
+	 * Schedule unlinking of the old storage at transaction commit, except
+	 * when performing a binary upgrade, when we must do it immediately.
 	 */
-	RelationDropStorage(relation);
+	if (IsBinaryUpgrade)
+	{
+		SMgrRelation	srel;
+
+		/*
+		 * During a binary upgrade, we use this code path to ensure that
+		 * pg_largeobject and its index have the same relfilenumbers as in
+		 * the old cluster. This is necessary because pg_upgrade treats
+		 * pg_largeobject like a user table, not a system table. It is however
+		 * possible that a table or index may need to end up with the same
+		 * relfilenumber in the new cluster as what it had in the old cluster.
+		 * Hence, we can't wait until commit time to remove the old storage.
+		 *
+		 * In general, this function needs to have transactional semantics,
+		 * and removing the old storage before commit time surely isn't.
+		 * However, it doesn't really matter, because if a binary upgrade
+		 * fails at this stage, the new cluster will need to be recreated
+		 * anyway.
+		 */
+		srel = smgropen(relation->rd_locator, relation->rd_backend);
+		smgrdounlinkall(&srel, 1, false);
+		smgrclose(srel);
+	}
+	else
+	{
+		/* Not a binary upgrade, so just schedule it to happen later. */
+		RelationDropStorage(relation);
+	}
 
 	/*
 	 * Create storage for the main fork of the new relfilenumber.  If it's a
-- 
2.24.3 (Apple Git-128)

#28Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#27)
Re: pg15b2: large objects lost on upgrade

Hi,

On 2022-07-18 14:57:40 -0400, Robert Haas wrote:

As to whether this is a good fix, I think someone could certainly
argue otherwise. This is all a bit grotty. However, I don't find it
all that bad. As long as we're moving files between one PG
cluster and another using an external tool rather than logic inside
the server itself, I think we're bound to have some hacks someplace to
make it all work. To me, extending them to a few more places to avoid
leaving files behind on disk seems like a good trade-off. Your mileage
may vary.

How about adding a new binary_upgrade_* helper function for this purpose
instead, rather than tying it into truncate?

Greetings,

Andres Freund

#29Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#27)
Re: pg15b2: large objects lost on upgrade

On Mon, Jul 18, 2022 at 02:57:40PM -0400, Robert Haas wrote:

So I tried implementing this but I didn't get it quite right the first
time. It's not enough to call smgrdounlinkall() instead of
RelationDropStorage(), because just as RelationDropStorage() does not
actually drop the storage but only schedules it to be dropped later,
so also smgrdounlinkall() does not in fact unlink all, but only some.
It leaves the first segment of the main relation fork around to guard
against the hazards discussed in the header comments for mdunlink().
But those hazards don't really matter here either, because, again, any
failure will necessarily require that the entire new cluster be
destroyed, and also because there shouldn't be any concurrent activity
in the new cluster while pg_upgrade is running. So I adjusted
smgrdounlinkall() to actually remove everything when IsBinaryUpgrade =
true. And then it all seems to work: pg_upgrade of a cluster that has
had a rewrite of pg_largeobject works, and pg_upgrade of a cluster
that has not had such a rewrite works too. Wa-hoo.

Using the IsBinaryUpgrade flag makes sense to me.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#30Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#28)
Re: pg15b2: large objects lost on upgrade

On Mon, Jul 18, 2022 at 4:06 PM Andres Freund <andres@anarazel.de> wrote:

How about adding a new binary_upgrade_* helper function for this purpose
instead, rather than tying it into truncate?

I considered that briefly, but it would need to do a lot of things
that TRUNCATE already knows how to do, so it does not seem like a good
idea.

--
Robert Haas
EDB: http://www.enterprisedb.com

#31Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#27)
Re: pg15b2: large objects lost on upgrade

On Mon, Jul 18, 2022 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:

Well, it took a while to figure out how to make that work, but I
believe I've got it now. Attached please find a couple of patches that
should get the job done. They might need a bit of polish, but I think
the basic concepts are sound.

So, would people like these patches (1) committed to master only, (2)
committed to master and back-patched into v15, or (3) not committed at
all? Michael argued upthread that it was too risky to be tinkering
with things at this stage in the release cycle and, certainly, the
more time goes by, the more true that gets. But I'm not convinced that
these patches involve an inordinate degree of risk, and using beta as
a time to fix things that turned out to be buggy is part of the point
of the whole thing. On the other hand, the underlying issue isn't that
serious either, and nobody seems to have reviewed the patches in
detail, either. I don't mind committing them on my own recognizance if
that's what people would prefer; I can take responsibility for fixing
anything that is further broken, as I suppose would be expected even
if someone else did review. But, I don't want to do something that
other people feel is the wrong thing to have done.

--
Robert Haas
EDB: http://www.enterprisedb.com

#32Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#31)
Re: pg15b2: large objects lost on upgrade

On Tue, Jul 26, 2022 at 03:45:11PM -0400, Robert Haas wrote:

On Mon, Jul 18, 2022 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:

Well, it took a while to figure out how to make that work, but I
believe I've got it now. Attached please find a couple of patches that
should get the job done. They might need a bit of polish, but I think
the basic concepts are sound.

So, would people like these patches (1) committed to master only, (2)
committed to master and back-patched into v15, or (3) not committed at
all? Michael argued upthread that it was too risky to be tinkering
with things at this stage in the release cycle and, certainly, the
more time goes by, the more true that gets. But I'm not convinced that
these patches involve an inordinate degree of risk, and using beta as
a time to fix things that turned out to be buggy is part of the point
of the whole thing. On the other hand, the underlying issue isn't that
serious either, and nobody seems to have reviewed the patches in
detail, either. I don't mind committing them on my own recognizance if
that's what people would prefer; I can take responsibility for fixing
anything that is further broken, as I suppose would be expected even
if someone else did review. But, I don't want to do something that
other people feel is the wrong thing to have done.

This behavior is new in PG 15, and I would be worried to have one new
behavior in PG 15 and another one in PG 16. Therefore, I would like to
see it in PG 15 and master. I also think not doing anything and leaving
these zero-length files around would also be risky.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#33Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#32)
Re: pg15b2: large objects lost on upgrade

On Tue, Jul 26, 2022 at 9:09 PM Bruce Momjian <bruce@momjian.us> wrote:

This behavior is new in PG 15, and I would be worried to have one new
behavior in PG 15 and another one in PG 16. Therefore, I would like to
see it in PG 15 and master.

That's also my preference, so committed and back-patched to v15.

--
Robert Haas
EDB: http://www.enterprisedb.com

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#33)
Re: pg15b2: large objects lost on upgrade

Robert Haas <robertmhaas@gmail.com> writes:

That's also my preference, so committed and back-patched to v15.

crake has been failing its cross-version-upgrade tests [1] since
this went in:

log files for step xversion-upgrade-REL9_4_STABLE-HEAD:
==~_~===-=-===~_~== /home/andrew/bf/root/upgrade.crake/HEAD/REL9_4_STABLE-amcheck-1.log ==~_~===-=-===~_~==
heap table "regression.pg_catalog.pg_largeobject", block 0, offset 7:
xmin 7707 precedes relation freeze threshold 0:14779
heap table "regression.pg_catalog.pg_largeobject", block 201, offset 5:
xmin 8633 precedes relation freeze threshold 0:14779

I'm not very sure what to make of that, but it's failed identically
four times in four attempts.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-07-28%2020%3A33%3A20

#35Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#34)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 29, 2022 at 1:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

crake has been failing its cross-version-upgrade tests [1] since
this went in:

log files for step xversion-upgrade-REL9_4_STABLE-HEAD:
==~_~===-=-===~_~== /home/andrew/bf/root/upgrade.crake/HEAD/REL9_4_STABLE-amcheck-1.log ==~_~===-=-===~_~==
heap table "regression.pg_catalog.pg_largeobject", block 0, offset 7:
xmin 7707 precedes relation freeze threshold 0:14779
heap table "regression.pg_catalog.pg_largeobject", block 201, offset 5:
xmin 8633 precedes relation freeze threshold 0:14779

I'm not very sure what to make of that, but it's failed identically
four times in four attempts.

That's complaining about two tuples in the pg_largeobject table with
xmin values that precede relfrozenxid -- which suggests that even
after 80d6907219, relfrozenxid isn't being correctly preserved in this
test case, since the last run still failed the same way.

But what exactly is this test case testing? I've previously complained
about buildfarm outputs not being as labelled as well as they need to
be in order to be easily understood by, well, me anyway. It'd sure
help if the commands that led up to this problem were included in the
output. I downloaded latest-client.tgz from the build farm server and
am looking at TestUpgradeXversion.pm, but there's no mention of
-amcheck-1.log in there, just -analyse.log, -copy.log, and following.
So I suppose this is running some different code or special
configuration...

--
Robert Haas
EDB: http://www.enterprisedb.com

#36Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#35)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 29, 2022 at 2:35 PM Robert Haas <robertmhaas@gmail.com> wrote:

But what exactly is this test case testing? I've previously complained
about buildfarm outputs not being labelled as well as they need to
be in order to be easily understood by, well, me anyway. It'd sure
help if the commands that led up to this problem were included in the
output. I downloaded latest-client.tgz from the build farm server and
am looking at TestUpgradeXversion.pm, but there's no mention of
-amcheck-1.log in there, just -analyse.log, -copy.log, and following.
So I suppose this is running some different code or special
configuration...

I was able to reproduce the problem by running 'make installcheck'
against a 9.4 instance and then doing a pg_upgrade to 16devel (which
took many tries because it told me about many different kinds of
things that it didn't like one at a time; I just dropped objects from
the regression DB until it worked). The dump output looks like this:

-- For binary upgrade, set pg_largeobject relfrozenxid and relminmxid
UPDATE pg_catalog.pg_class
SET relfrozenxid = '0', relminmxid = '0'
WHERE oid = 2683;
UPDATE pg_catalog.pg_class
SET relfrozenxid = '990', relminmxid = '1'
WHERE oid = 2613;

-- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('12364'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('12362'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;

However, the catalogs show the relfilenode being correct, and the
relfrozenxid set to a larger value. I suspect the problem here is that
this needs to be done in the other order, with the TRUNCATE first and
the update to the pg_class columns afterward.
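
In other words, the dump should emit roughly this instead, reusing the
values from the output above (and likewise for the index row, oid 2683):

-- Preserve the relfilenodes first; the TRUNCATE installs them.
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('12364'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('12362'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;

-- Only now set the horizons, so the rewrite can't clobber them.
UPDATE pg_catalog.pg_class
SET relfrozenxid = '990', relminmxid = '1'
WHERE oid = 2613;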

I think I better look into improving the TAP tests for this, too.

--
Robert Haas
EDB: http://www.enterprisedb.com

#37Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#36)
1 attachment(s)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 29, 2022 at 3:10 PM Robert Haas <robertmhaas@gmail.com> wrote:

However, the catalogs show the relfilenode being correct, and the
relfrozenxid set to a larger value. I suspect the problem here is that
this needs to be done in the other order, with the TRUNCATE first and
the update to the pg_class columns afterward.

That fix appears to be correct. Patch attached.

I think I better look into improving the TAP tests for this, too.

TAP test enhancement also included in the attached patch.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v1-0001-Fix-brown-paper-bag-bug-in-bbe08b8869bd29d587f24e.patch (application/octet-stream)
From 6c40142eae94721db8e32c3ab70963a332e9dea0 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 29 Jul 2022 15:47:24 -0400
Subject: [PATCH v1] Fix brown paper bag bug in
 bbe08b8869bd29d587f24ef18eb45c7d4d14afca.

We must issue the TRUNCATE command first and update relfrozenxid
and relminmxid afterward; otherwise, TRUNCATE overwrites the
previously-set values.

Add a test case like I should have done the first time.

Per buildfarm report from TestUpgradeXversion.pm, by way of Tom
Lane.
---
 src/bin/pg_dump/pg_dump.c              | 18 +++++-----
 src/bin/pg_upgrade/t/002_pg_upgrade.pl | 46 ++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 320bf51e4d..da6605175a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3141,7 +3141,7 @@ dumpDatabase(Archive *fout)
 		PGresult   *lo_res;
 		PQExpBuffer loFrozenQry = createPQExpBuffer();
 		PQExpBuffer loOutQry = createPQExpBuffer();
-		PQExpBuffer loVacQry = createPQExpBuffer();
+		PQExpBuffer loHorizonQry = createPQExpBuffer();
 		int			i_relfrozenxid,
 					i_relfilenode,
 					i_oid,
@@ -3168,14 +3168,14 @@ dumpDatabase(Archive *fout)
 		i_relfilenode = PQfnumber(lo_res, "relfilenode");
 		i_oid = PQfnumber(lo_res, "oid");
 
-		appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, set pg_largeobject relfrozenxid and relminmxid\n");
-		appendPQExpBufferStr(loVacQry, "\n-- For binary upgrade, preserve pg_largeobject and index relfilenodes\n");
+		appendPQExpBufferStr(loHorizonQry, "\n-- For binary upgrade, set pg_largeobject relfrozenxid and relminmxid\n");
+		appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, preserve pg_largeobject and index relfilenodes\n");
 		for (int i = 0; i < PQntuples(lo_res); ++i)
 		{
 			Oid		oid;
 			RelFileNumber	relfilenumber;
 
-			appendPQExpBuffer(loOutQry, "UPDATE pg_catalog.pg_class\n"
+			appendPQExpBuffer(loHorizonQry, "UPDATE pg_catalog.pg_class\n"
 							  "SET relfrozenxid = '%u', relminmxid = '%u'\n"
 							  "WHERE oid = %u;\n",
 							  atooid(PQgetvalue(lo_res, i, i_relfrozenxid)),
@@ -3186,18 +3186,18 @@ dumpDatabase(Archive *fout)
 			relfilenumber = atooid(PQgetvalue(lo_res, i, i_relfilenode));
 
 			if (oid == LargeObjectRelationId)
-				appendPQExpBuffer(loVacQry,
+				appendPQExpBuffer(loOutQry,
 								  "SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('%u'::pg_catalog.oid);\n",
 								  relfilenumber);
 			else if (oid == LargeObjectLOidPNIndexId)
-				appendPQExpBuffer(loVacQry,
+				appendPQExpBuffer(loOutQry,
 								  "SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('%u'::pg_catalog.oid);\n",
 								  relfilenumber);
 		}
 
-		appendPQExpBufferStr(loVacQry,
+		appendPQExpBufferStr(loOutQry,
 							 "TRUNCATE pg_catalog.pg_largeobject;\n");
-		appendPQExpBufferStr(loOutQry, loVacQry->data);
+		appendPQExpBufferStr(loOutQry, loHorizonQry->data);
 
 		ArchiveEntry(fout, nilCatalogId, createDumpId(),
 					 ARCHIVE_OPTS(.tag = "pg_largeobject",
@@ -3208,8 +3208,8 @@ dumpDatabase(Archive *fout)
 		PQclear(lo_res);
 
 		destroyPQExpBuffer(loFrozenQry);
+		destroyPQExpBuffer(loHorizonQry);
 		destroyPQExpBuffer(loOutQry);
-		destroyPQExpBuffer(loVacQry);
 	}
 
 	PQclear(res);
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 9ed48c4e06..09af8157d0 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -161,6 +161,27 @@ $newnode->command_ok(
 	],
 	'dump before running pg_upgrade');
 
+# Also record the relfrozenxid and relminmxid horizons.
+my $horizon_query = <<EOM;
+SELECT
+  c.oid::regclass, c.relfrozenxid, c.relminmxid
+FROM
+  pg_class c, pg_namespace n
+WHERE
+  c.relnamespace = n.oid AND
+  ((n.nspname !~ '^pg_temp_' AND n.nspname !~ '^pg_toast_temp_' AND
+    n.nspname NOT IN ('pg_catalog', 'information_schema', 'binary_upgrade',
+                      'pg_toast'))
+  OR (n.nspname = 'pg_catalog' AND relname IN ('pg_largeobject')))
+EOM
+$horizon_query =~ s/\s+/ /g; # run it together on one line
+$newnode->command_ok(
+	[
+		'psql', '-At', '-d', $oldnode->connstr('postgres'),
+		'-o', "$tempdir/horizon1.txt", '-c', $horizon_query,
+	],
+	'horizons before running pg_upgrade');
+
 # After dumping, update references to the old source tree's regress.so
 # to point to the new tree.
 if (defined($ENV{oldinstall}))
@@ -294,6 +315,14 @@ $newnode->command_ok(
 	],
 	'dump after running pg_upgrade');
 
+# And second record of horizons as well.
+$newnode->command_ok(
+	[
+		'psql', '-At', '-d', $newnode->connstr('postgres'),
+		'-o', "$tempdir/horizon2.txt", '-c', $horizon_query,
+	],
+	'horizons after running pg_upgrade');
+
 # Compare the two dumps, there should be no differences.
 my $compare_res = compare("$tempdir/dump1.sql", "$tempdir/dump2.sql");
 is($compare_res, 0, 'old and new dumps match after pg_upgrade');
@@ -311,4 +340,21 @@ if ($compare_res != 0)
 	print "=== EOF ===\n";
 }
 
+# Compare the horizons, there should be no differences.
+$compare_res = compare("$tempdir/horizon1.txt", "$tempdir/horizon2.txt");
+is($compare_res, 0, 'old and new horizons match after pg_upgrade');
+
+# Provide more context if the horizons do not match.
+if ($compare_res != 0)
+{
+	my ($stdout, $stderr) =
+	  run_command([ 'diff', "$tempdir/horizon1.txt", "$tempdir/horizon2.txt" ]);
+	print "=== diff of $tempdir/horizon1.txt and $tempdir/horizon2.txt\n";
+	print "=== stdout ===\n";
+	print $stdout;
+	print "=== stderr ===\n";
+	print $stderr;
+	print "=== EOF ===\n";
+}
+
 done_testing();
-- 
2.24.3 (Apple Git-128)

#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#37)
Re: pg15b2: large objects lost on upgrade

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Jul 29, 2022 at 3:10 PM Robert Haas <robertmhaas@gmail.com> wrote:

However, the catalogs show the relfilenode being correct, and the
relfrozenxid set to a larger value. I suspect the problem here is that
this needs to be done in the other order, with the TRUNCATE first and
the update to the pg_class columns afterward.

That fix appears to be correct. Patch attached.

Looks plausible.

regards, tom lane

#39Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#38)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 29, 2022 at 4:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Looks plausible.

Committed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#40Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#35)
Re: pg15b2: large objects lost on upgrade

On 2022-07-29 Fr 14:35, Robert Haas wrote:

On Fri, Jul 29, 2022 at 1:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

crake has been failing its cross-version-upgrade tests [1] since
this went in:

log files for step xversion-upgrade-REL9_4_STABLE-HEAD:
==~_~===-=-===~_~== /home/andrew/bf/root/upgrade.crake/HEAD/REL9_4_STABLE-amcheck-1.log ==~_~===-=-===~_~==
heap table "regression.pg_catalog.pg_largeobject", block 0, offset 7:
xmin 7707 precedes relation freeze threshold 0:14779
heap table "regression.pg_catalog.pg_largeobject", block 201, offset 5:
xmin 8633 precedes relation freeze threshold 0:14779

I'm not very sure what to make of that, but it's failed identically
four times in four attempts.

That's complaining about two tuples in the pg_largeobject table with
xmin values that precede relfrozenxid -- which suggests that even
after 80d6907219, relfrozenxid isn't being correctly preserved in this
test case, since the last run still failed the same way.

But what exactly is this test case testing? I've previously complained
about buildfarm outputs not being labelled as well as they need to
be in order to be easily understood by, well, me anyway. It'd sure
help if the commands that led up to this problem were included in the
output. I downloaded latest-client.tgz from the build farm server and
am looking at TestUpgradeXversion.pm, but there's no mention of
-amcheck-1.log in there, just -analyse.log, -copy.log, and following.
So I suppose this is running some different code or special
configuration...

Not really, but it is running git bleeding edge. This code comes from
<https://github.com/PGBuildFarm/client-code/commit/191df23bd25eb5546b0989d71ae92747151f9f39>
at lines 704-705

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#41Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#39)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 29, 2022 at 5:13 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 29, 2022 at 4:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Looks plausible.

Committed.

wrasse just failed the new test:

[00:09:44.167](0.001s) not ok 16 - old and new horizons match after pg_upgrade
[00:09:44.167](0.001s)
[00:09:44.167](0.000s) # Failed test 'old and new horizons match
after pg_upgrade'
# at t/002_pg_upgrade.pl line 345.
[00:09:44.168](0.000s) # got: '1'
# expected: '0'
=== diff of /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon1.txt
and /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon2.txt
=== stdout ===
1c1
< pg_backend_pid|21767
---
> pg_backend_pid|22045
=== stderr ===

=== EOF ===

I'm slightly befuddled as to how we're ending up with a table named
pg_backend_pid. That said, perhaps this is just a case of needing to
prevent autovacuum from running on the new cluster before we've had a
chance to record the horizons? But I'm not very confident in that
explanation at this point.

--
Robert Haas
EDB: http://www.enterprisedb.com

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#41)
Re: pg15b2: large objects lost on upgrade

Robert Haas <robertmhaas@gmail.com> writes:

wrasse just failed the new test:

[00:09:44.167](0.001s) not ok 16 - old and new horizons match after pg_upgrade
[00:09:44.167](0.001s)
[00:09:44.167](0.000s) # Failed test 'old and new horizons match
after pg_upgrade'
# at t/002_pg_upgrade.pl line 345.
[00:09:44.168](0.000s) # got: '1'
# expected: '0'
=== diff of /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon1.txt
and /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon2.txt
=== stdout ===
1c1
< pg_backend_pid|21767
---
> pg_backend_pid|22045
=== stderr ===

=== EOF ===

I'm slightly befuddled as to how we're ending up with a table named
pg_backend_pid.

That's not the only thing weird about this printout: there should be
three columns not two in that query's output, and what happened to
the trailing newline? I don't think we're looking at desired
output at all.

I am suspicious that the problem stems from the nonstandard
way you've invoked psql to collect the horizon data. At the very
least this code is duplicating bits of Cluster::psql that it'd be
better not to; and I wonder if the issue is that it's not duplicating
enough. The lack of -X and the lack of use of installed_command()
are red flags. Do you really need to do it like this?

regards, tom lane

#43Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#42)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 29, 2022 at 7:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That's not the only thing weird about this printout: there should be
three columns not two in that query's output, and what happened to
the trailing newline? I don't think we're looking at desired
output at all.

Well, that's an awfully good point.

I am suspicious that the problem stems from the nonstandard
way you've invoked psql to collect the horizon data. At the very
least this code is duplicating bits of Cluster::psql that it'd be
better not to; and I wonder if the issue is that it's not duplicating
enough. The lack of -X and the lack of use of installed_command()
are red flags. Do you really need to do it like this?

Well, I just copied the pg_dump block which occurs directly beforehand
and modified it. I think that must take care of setting the path
properly, else we'd have things blowing up all over the place. But the
lack of -X could be an issue.

The missing newline thing happens for me locally too, if I revert the
bug fix portion of that commit, but I do seem to get the right columns
in the output. It looks like this:

[19:24:16.057](0.000s) not ok 16 - old and new horizons match after pg_upgrade
[19:24:16.058](0.000s)
[19:24:16.058](0.000s) # Failed test 'old and new horizons match
after pg_upgrade'
# at t/002_pg_upgrade.pl line 345.
[19:24:16.058](0.000s) # got: '1'
# expected: '0'
=== diff of /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/tmp_test_K8Fs/horizon1.txt
and /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/tmp_test_K8Fs/horizon2.txt
=== stdout ===
1c1
< pg_largeobject|718|1
---
> pg_largeobject|17518|3
=== stderr ===

=== EOF ===
[19:24:16.066](0.008s) 1..16

I can't account for the absence of a newline there, because hexdump
says that both horizon1.txt and horizon2.txt end with one, and if I
run diff on those two files and pipe the output into hexdump, it sees
a newline at the end of that output too.

--
Robert Haas
EDB: http://www.enterprisedb.com

#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#43)
Re: pg15b2: large objects lost on upgrade

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Jul 29, 2022 at 7:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I am suspicious that the problem stems from the nonstandard
way you've invoked psql to collect the horizon data.

Well, I just copied the pg_dump block which occurs directly beforehand
and modified it. I think that must take care of setting the path
properly, else we'd have things blowing up all over the place. But the
lack of -X could be an issue.

Hmm. Now that I look, I do see two pre-existing "naked" invocations
of psql in 002_pg_upgrade.pl, ie

$oldnode->command_ok([ 'psql', '-X', '-f', $olddumpfile, 'postgres' ],
'loaded old dump file');

$oldnode->command_ok(
[
'psql', '-X',
'-f', "$srcdir/src/bin/pg_upgrade/upgrade_adapt.sql",
'regression'
],
'ran adapt script');

Those suggest that maybe all you need is -X. However, I don't think
either of those calls is reached by the majority of buildfarm animals,
only ones that are doing cross-version-upgrade tests. So there
could be more secret sauce needed to get this to pass everywhere.

Personally I'd try to replace the two horizon-collection steps with
$newnode->psql calls, using extra_params to inject the '-o' and target
filename command line words. But if you want to try adding -X as
a quicker answer, maybe that will be enough.

regards, tom lane

#45Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#44)
1 attachment(s)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 29, 2022 at 8:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Personally I'd try to replace the two horizon-collection steps with
$newnode->psql calls, using extra_params to inject the '-o' and target
filename command line words. But if you want to try adding -X as
a quicker answer, maybe that will be enough.

Here's a patch that uses a variant of that approach: it just runs
safe_psql straight up and gets the output, then writes it out to temp
files if the output doesn't match and we need to run diff. Let me know
what you think of this.

While working on this, I noticed a few other problems. One is that the
query doesn't have an ORDER BY clause, which it really should, or the
output won't be stable. And the other is that I think we should be
testing against the regression database, not the postgres database,
because it's got a bunch of user tables in it, not just
pg_largeobject.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

better-psql-invocation.txt (text/plain; charset=US-ASCII)
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 09af8157d0..a8df6440a4 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -173,14 +173,10 @@ WHERE
     n.nspname NOT IN ('pg_catalog', 'information_schema', 'binary_upgrade',
                       'pg_toast'))
   OR (n.nspname = 'pg_catalog' AND relname IN ('pg_largeobject')))
+ORDER BY c.oid::regclass::text
 EOM
 $horizon_query =~ s/\s+/ /g; # run it together on one line
-$newnode->command_ok(
-	[
-		'psql', '-At', '-d', $oldnode->connstr('postgres'),
-		'-o', "$tempdir/horizon1.txt", '-c', $horizon_query,
-	],
-	'horizons before running pg_upgrade');
+my $horizon1 = $oldnode->safe_psql('regression', $horizon_query);
 
 # After dumping, update references to the old source tree's regress.so
 # to point to the new tree.
@@ -316,12 +312,7 @@ $newnode->command_ok(
 	'dump after running pg_upgrade');
 
 # And second record of horizons as well.
-$newnode->command_ok(
-	[
-		'psql', '-At', '-d', $newnode->connstr('postgres'),
-		'-o', "$tempdir/horizon2.txt", '-c', $horizon_query,
-	],
-	'horizons after running pg_upgrade');
+my $horizon2 = $newnode->safe_psql('regression', $horizon_query);
 
 # Compare the two dumps, there should be no differences.
 my $compare_res = compare("$tempdir/dump1.sql", "$tempdir/dump2.sql");
@@ -341,14 +332,21 @@ if ($compare_res != 0)
 }
 
 # Compare the horizons, there should be no differences.
-$compare_res = compare("$tempdir/horizon1.txt", "$tempdir/horizon2.txt");
-is($compare_res, 0, 'old and new horizons match after pg_upgrade');
+my $horizons_ok = $horizon1 eq $horizon2;
+ok($horizons_ok, 'old and new horizons match after pg_upgrade');
 
 # Provide more context if the horizons do not match.
-if ($compare_res != 0)
+if (! $horizons_ok)
 {
+	# output is long, so use diff to compare
+	open my $fh, ">", "$tempdir/horizon1.txt" or die "could not open file: $!";
+	print $fh $horizon1;
+	close $fh;
+	open $fh, ">", "$tempdir/horizon2.txt" or die "could not open file: $!";
+	print $fh $horizon2;
+	close $fh;
 	my ($stdout, $stderr) =
-	  run_command([ 'diff', "$tempdir/horizon1.txt", "$tempdir/horizon2.txt" ]);
+		run_command([ 'diff', "$tempdir/horizon1.txt", "$tempdir/horizon2.txt" ]);
 	print "=== diff of $tempdir/horizon1.txt and $tempdir/horizon2.txt\n";
 	print "=== stdout ===\n";
 	print $stdout;
#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#45)
Re: pg15b2: large objects lost on upgrade

Robert Haas <robertmhaas@gmail.com> writes:

Here's a patch that uses a variant of that approach: it just runs
safe_psql straight up and gets the output, then writes it out to temp
files if the output doesn't match and we need to run diff. Let me know
what you think of this.

That looks good to me, although obviously I don't know for sure
if it will make wrasse happy.

While working on this, I noticed a few other problems. One is that the
query doesn't have an ORDER BY clause, which it really should, or the
output won't be stable. And the other is that I think we should be
testing against the regression database, not the postgres database,
because it's got a bunch of user tables in it, not just
pg_largeobject.

Both of those sound like "d'oh" observations to me. +1

regards, tom lane

#47Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#42)
Re: pg15b2: large objects lost on upgrade

On Fri, Jul 29, 2022 at 07:16:34PM -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

wrasse just failed the new test:

[00:09:44.167](0.001s) not ok 16 - old and new horizons match after pg_upgrade
[00:09:44.167](0.001s)
[00:09:44.167](0.000s) # Failed test 'old and new horizons match
after pg_upgrade'
# at t/002_pg_upgrade.pl line 345.
[00:09:44.168](0.000s) # got: '1'
# expected: '0'
=== diff of /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon1.txt
and /export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_D3cJ/horizon2.txt
=== stdout ===
1c1
< pg_backend_pid|21767
---
> pg_backend_pid|22045
=== stderr ===

=== EOF ===

I'm slightly befuddled as to how we're ending up with a table named
pg_backend_pid.

The lack of -X and the lack of use of installed_command()
are red flags.

The pg_backend_pid is from "SELECT pg_catalog.pg_backend_pid();" in ~/.psqlrc,
so the lack of -X caused that. The latest commit fixes things on a normal
GNU/Linux box, so I bet it will fix wrasse. (thorntail managed not to fail
that way. For unrelated reasons, I override thorntail's $HOME to a
mostly-empty directory.)

#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#47)
Re: pg15b2: large objects lost on upgrade

Noah Misch <noah@leadboat.com> writes:

The pg_backend_pid is from "SELECT pg_catalog.pg_backend_pid();" in ~/.psqlrc,
so the lack of -X caused that. The latest commit fixes things on a normal
GNU/Linux box, so I bet it will fix wrasse.

Yup, looks like we're all good now. Thanks!

regards, tom lane

#49Jonathan S. Katz
jkatz@postgresql.org
In reply to: Tom Lane (#48)
Re: pg15b2: large objects lost on upgrade

On 7/30/22 10:40 AM, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

The pg_backend_pid is from "SELECT pg_catalog.pg_backend_pid();" in ~/.psqlrc,
so the lack of -X caused that. The latest commit fixes things on a normal
GNU/Linux box, so I bet it will fix wrasse.

Yup, looks like we're all good now. Thanks!

Given this appears to be resolved, I have removed this from "Open
Items". Thanks!

Jonathan

#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonathan S. Katz (#49)
Re: pg15b2: large objects lost on upgrade

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

Given this appears to be resolved, I have removed this from "Open
Items". Thanks!

Sadly, we're still not out of the woods. I see three buildfarm
failures in this test since Robert resolved the "-X" problem [1][2][3]:

grassquit reported

[19:34:15.249](0.001s) not ok 14 - old and new horizons match after pg_upgrade
[19:34:15.249](0.001s)
[19:34:15.249](0.000s) # Failed test 'old and new horizons match after pg_upgrade'
# at t/002_pg_upgrade.pl line 336.
=== diff of /mnt/resource/bf/build/grassquit/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_z3zV/horizon1.txt and /mnt/resource/bf/build/grassquit/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_z3zV/horizon2.txt
=== stdout ===
785c785
< spgist_point_tbl|7213|1
---
> spgist_point_tbl|7356|3
787c787
< spgist_text_tbl|7327|1
---
> spgist_text_tbl|8311|3
=== stderr ===

=== EOF ===

wrasse reported

[06:36:35.834](0.001s) not ok 14 - old and new horizons match after pg_upgrade
[06:36:35.835](0.001s)
[06:36:35.835](0.000s) # Failed test 'old and new horizons match after pg_upgrade'
# at t/002_pg_upgrade.pl line 336.
=== diff of /export/home/nm/farm/studio64v12_6/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_mLle/horizon1.txt and /export/home/nm/farm/studio64v12_6/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_mLle/horizon2.txt
=== stdout ===
138c138
< delete_test_table|7171|1
---
> delete_test_table|7171|3
=== stderr ===

Warning: missing newline at end of file /export/home/nm/farm/studio64v12_6/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_mLle/horizon1.txt
Warning: missing newline at end of file /export/home/nm/farm/studio64v12_6/REL_15_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_mLle/horizon2.txt
=== EOF ===

conchuela doesn't seem to have preserved the detailed log, but it
failed at the same place:

# Failed test 'old and new horizons match after pg_upgrade'
# at t/002_pg_upgrade.pl line 336.

Not sure what to make of this, except that maybe the test is telling
us about an actual bug of exactly the kind it's designed to expose.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2022-08-01%2019%3A25%3A43
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-08-02%2004%3A18%3A18
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-08-02%2014%3A56%3A49

#51Jonathan S. Katz
jkatz@postgresql.org
In reply to: Tom Lane (#50)
Re: pg15b2: large objects lost on upgrade

On 8/2/22 1:12 PM, Tom Lane wrote:

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

Given this appears to be resolved, I have removed this from "Open
Items". Thanks!

Sadly, we're still not out of the woods. I see three buildfarm
failures in this test since Robert resolved the "-X" problem [1][2][3]:

Not sure what to make of this, except that maybe the test is telling
us about an actual bug of exactly the kind it's designed to expose.

Looking at the test code, is there anything that could have changed the
relfrozenxid or relminmxid independently of the test on these systems?

That said, I do think we should reopen the item to figure out what's
going on. Doing so now.

Jonathan

#52Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#50)
Re: pg15b2: large objects lost on upgrade

On Tue, Aug 2, 2022 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Not sure what to make of this, except that maybe the test is telling
us about an actual bug of exactly the kind it's designed to expose.

That could be, but what would the bug be exactly? It's hard to think
of a more direct way of setting relminmxid and relfrozenxid than
updating pg_class. It doesn't seem realistic to suppose that we have a
bug where setting a column in a system table to an integer value
sometimes sets it to a slightly larger integer instead. If the values
on the new cluster seemed like they had never been set, or if it
seemed like they had been set to completely random values, then I'd
suspect a bug in the mechanism, but here it seems more believable to
me to think that we're actually setting the correct values and then
something - maybe autovacuum - bumps them again before we have a
chance to look at them.

I'm not quite sure how to rule that theory in or out, though.

--
Robert Haas
EDB: http://www.enterprisedb.com

#53Jonathan S. Katz
jkatz@postgresql.org
In reply to: Robert Haas (#52)
Re: pg15b2: large objects lost on upgrade

On 8/2/22 3:23 PM, Robert Haas wrote:

On Tue, Aug 2, 2022 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Not sure what to make of this, except that maybe the test is telling
us about an actual bug of exactly the kind it's designed to expose.

That could be, but what would the bug be exactly? It's hard to think
of a more direct way of setting relminmxid and relfrozenxid than
updating pg_class. It doesn't seem realistic to suppose that we have a
bug where setting a column in a system table to an integer value
sometimes sets it to a slightly larger integer instead. If the values
on the new cluster seemed like they had never been set, or if it
seemed like they had been set to completely random values, then I'd
suspect a bug in the mechanism, but here it seems more believable to
me to think that we're actually setting the correct values and then
something - maybe autovacuum - bumps them again before we have a
chance to look at them.

FWIW (and I have not looked deeply at the code), I was thinking it could
be something along those lines, given 1. the randomness of the
underlying systems of the impacted farm animals and 2. that it was only
the three mentioned.

I'm not quite sure how to rule that theory in or out, though.

Without overcomplicating this, are we able to check to see if autovacuum
ran during the course of the test?
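
(For illustration, one way to check — a sketch, not part of the test today,
querying the cumulative statistics right after the horizons are captured:

my $av_activity = $oldnode->safe_psql('regression',
	"SELECT relname, last_autovacuum, autovacuum_count
	   FROM pg_stat_all_tables WHERE autovacuum_count > 0;");
diag("autovacuum activity:\n$av_activity") if $av_activity ne '';

where diag() is Test::More's diagnostic output and $oldnode is the
existing cluster handle from the script.)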

Jonathan

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonathan S. Katz (#51)
Re: pg15b2: large objects lost on upgrade

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 8/2/22 1:12 PM, Tom Lane wrote:

Sadly, we're still not out of the woods. I see three buildfarm
failures in this test since Robert resolved the "-X" problem [1][2][3]:

Looking at the test code, is there anything that could have changed the
relfrozenxid or relminmxid independently of the test on these systems?

Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
that attempts to turn off autovacuum on either the source server or
the destination. So one plausible theory is that autovac moved the
numbers since we checked.

If that is the explanation, then it leaves us with few good options.
I am not in favor of disabling autovacuum in the test: ordinary
users are not going to do that while pg_upgrade'ing, so it'd make
the test less representative of real-world usage, which seems like
a bad idea. We could either drop this particular check again, or
weaken it to allow new relfrozenxid >= old relfrozenxid, likewise
relminmxid.

regards, tom lane

#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonathan S. Katz (#53)
Re: pg15b2: large objects lost on upgrade

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 8/2/22 3:23 PM, Robert Haas wrote:

I'm not quite sure how to rule that theory in or out, though.

Without overcomplicating this, are we able to check to see if autovacuum
ran during the course of the test?

Looks like we're all thinking along the same lines.

grassquit shows this at the end of the old server's log,
immediately after the query to retrieve the old horizons:

2022-08-01 19:33:41.608 UTC [1487114][postmaster][:0] LOG: received fast shutdown request
2022-08-01 19:33:41.611 UTC [1487114][postmaster][:0] LOG: aborting any active transactions
2022-08-01 19:33:41.643 UTC [1487114][postmaster][:0] LOG: background worker "logical replication launcher" (PID 1487132) exited with exit code 1
2022-08-01 19:33:41.643 UTC [1493875][autovacuum worker][5/6398:0] FATAL: terminating autovacuum process due to administrator command
2022-08-01 19:33:41.932 UTC [1487121][checkpointer][:0] LOG: checkpoint complete: wrote 1568 buffers (9.6%); 0 WAL file(s) added, 0 removed, 33 recycled; write=31.470 s, sync=0.156 s, total=31.711 s; sync files=893, longest=0.002 s, average=0.001 s; distance=33792 kB, estimate=34986 kB
2022-08-01 19:33:41.933 UTC [1487121][checkpointer][:0] LOG: shutting down

and wrasse shows this:

2022-08-02 06:35:01.974 CEST [5606:6] LOG: received fast shutdown request
2022-08-02 06:35:01.974 CEST [5606:7] LOG: aborting any active transactions
2022-08-02 06:35:01.975 CEST [6758:1] FATAL: terminating autovacuum process due to administrator command
2022-08-02 06:35:01.975 CEST [6758:2] CONTEXT: while vacuuming index "spgist_point_idx" of relation "public.spgist_point_tbl"
2022-08-02 06:35:01.981 CEST [5606:8] LOG: background worker "logical replication launcher" (PID 5612) exited with exit code 1
2022-08-02 06:35:01.995 CEST [5607:42] LOG: shutting down

While not smoking guns, these definitely prove that autovac was active.

regards, tom lane

#56Jonathan S. Katz
jkatz@postgresql.org
In reply to: Tom Lane (#55)
Re: pg15b2: large objects lost on upgrade

On 8/2/22 3:39 PM, Tom Lane wrote:

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 8/2/22 3:23 PM, Robert Haas wrote:

I'm not quite sure how to rule that theory in or out, though.

Without overcomplicating this, are we able to check to see if autovacuum
ran during the course of the test?

Looks like we're all thinking along the same lines.

While not smoking guns, these definitely prove that autovac was active.

If that is the explanation, then it leaves us with few good options.
I am not in favor of disabling autovacuum in the test: ordinary
users are not going to do that while pg_upgrade'ing, so it'd make
the test less representative of real-world usage, which seems like
a bad idea. We could either drop this particular check again, or
weaken it to allow new relfrozenxid >= old relfrozenxid, likewise
relminmxid.

The test does look helpful and it would catch regressions. Loosely
quoting Robert on a different point upthread, we don't want to turn off
the alarm just because it's spuriously going off.

I think the weakened check is OK (and possibly mimics the real-world
where autovacuum runs), unless you see a major drawback to it?

Jonathan

#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonathan S. Katz (#56)
Re: pg15b2: large objects lost on upgrade

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 8/2/22 3:39 PM, Tom Lane wrote:

I am not in favor of disabling autovacuum in the test: ordinary
users are not going to do that while pg_upgrade'ing, so it'd make
the test less representative of real-world usage, which seems like
a bad idea. We could either drop this particular check again, or
weaken it to allow new relfrozenxid >= old relfrozenxid, likewise
relminmxid.

The test does look helpful and it would catch regressions. Loosely
quoting Robert on a different point upthread, we don't want to turn off
the alarm just because it's spuriously going off.
I think the weakened check is OK (and possibly mimics the real-world
where autovacuum runs), unless you see a major drawback to it?

I also think that ">=" is a sufficient requirement. It'd be a
bit painful to test if we had to cope with potential XID wraparound,
but we know that these installations haven't been around nearly
long enough for that, so a plain ">=" test ought to be good enough.
(Replacing the simple "eq" code with something that can handle that
doesn't look like much fun, though.)
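
(For what it's worth, a wraparound-aware comparison need not be much code —
a sketch, treating the 32-bit XIDs as a circular space:

sub xid_ge
{
	my ($new, $old) = @_;
	# true when $new equals $old or is less than 2^31 steps ahead of it
	return (($new - $old) % 4294967296) < 2147483648;
}

though, per the above, a plain integer ">=" suffices for these short-lived
installations.)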

regards, tom lane

#58Jonathan S. Katz
jkatz@postgresql.org
In reply to: Tom Lane (#57)
Re: pg15b2: large objects lost on upgrade

On 8/2/22 3:51 PM, Tom Lane wrote:

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 8/2/22 3:39 PM, Tom Lane wrote:

I am not in favor of disabling autovacuum in the test: ordinary
users are not going to do that while pg_upgrade'ing, so it'd make
the test less representative of real-world usage, which seems like
a bad idea. We could either drop this particular check again, or
weaken it to allow new relfrozenxid >= old relfrozenxid, likewise
relminmxid.

The test does look helpful and it would catch regressions. Loosely
quoting Robert on a different point upthread, we don't want to turn off
the alarm just because it's spuriously going off.
I think the weakened check is OK (and possibly mimics the real-world
where autovacuum runs), unless you see a major drawback to it?

I also think that ">=" is a sufficient requirement. It'd be a
bit painful to test if we had to cope with potential XID wraparound,
but we know that these installations haven't been around nearly
long enough for that, so a plain ">=" test ought to be good enough.
(Replacing the simple "eq" code with something that can handle that
doesn't look like much fun, though.)

...if these systems are hitting XID wraparound, we have another issue to
worry about.

I started modifying the test to support this behavior, but thought that
because 1. we want to ensure the OID is still equal and 2. in the
examples you showed, both relfrozenxid and relminmxid could increment, we
may want to have the individual checks on each column.

I may be able to conjure something up that does the above, but it's been
a minute since I wrote anything in Perl.

Jonathan

#59Jonathan S. Katz
jkatz@postgresql.org
In reply to: Jonathan S. Katz (#58)
1 attachment(s)
Re: pg15b2: large objects lost on upgrade

On 8/2/22 4:20 PM, Jonathan S. Katz wrote:

On 8/2/22 3:51 PM, Tom Lane wrote:

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 8/2/22 3:39 PM, Tom Lane wrote:

I am not in favor of disabling autovacuum in the test: ordinary
users are not going to do that while pg_upgrade'ing, so it'd make
the test less representative of real-world usage, which seems like
a bad idea.  We could either drop this particular check again, or
weaken it to allow new relfrozenxid >= old relfrozenxid, likewise
relminmxid.

The test does look helpful and it would catch regressions. Loosely
quoting Robert on a different point upthread, we don't want to turn off
the alarm just because it's spuriously going off.
I think the weakened check is OK (and possibly mimics the real-world
where autovacuum runs), unless you see a major drawback to it?

I also think that ">=" is a sufficient requirement.  It'd be a
bit painful to test if we had to cope with potential XID wraparound,
but we know that these installations haven't been around nearly
long enough for that, so a plain ">=" test ought to be good enough.
(Replacing the simple "eq" code with something that can handle that
doesn't look like much fun, though.)

...if these systems are hitting XID wraparound, we have another issue to
worry about.

I started modifying the test to support this behavior, but thought that
because 1. we want to ensure the OID is still equal and 2. in the
examples you showed, both relfrozenxid and relminmxid could increment, we
may want to have the individual checks on each column.

I may be able to conjure something up that does the above, but it's been
a minute since I wrote anything in Perl.

Please see attached patch that does the above. Tests pass on my local
environment (though I did not trigger autovacuum).

Jonathan

Attachments:

pg_upgrade-test-gte-xid.patch (text/plain)
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 4cbc75644c..e2b739beff 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -226,6 +226,8 @@ ORDER BY c.oid::regclass::text
 EOM
 $horizon_query =~ s/\s+/ /g; # run it together on one line
 my $horizon1 = $oldnode->safe_psql('regression', $horizon_query);
+chomp($horizon1);
+my @horizon1_values = split(/\|/, $horizon1);
 
 # In a VPATH build, we'll be started in the source directory, but we want
 # to run pg_upgrade in the build directory so that any files generated finish
@@ -313,6 +315,8 @@ $newnode->command_ok(
 
 # And record the horizons from the upgraded cluster as well.
 my $horizon2 = $newnode->safe_psql('regression', $horizon_query);
+chomp($horizon2);
+my @horizon2_values = split(/\|/, $horizon2);
 
 # Compare the two dumps, there should be no differences.
 my $compare_res = compare("$tempdir/dump1.sql", "$tempdir/dump2.sql");
@@ -331,8 +335,13 @@ if ($compare_res != 0)
 	print "=== EOF ===\n";
 }
 
-# Compare the horizons, there should be no differences.
-my $horizons_ok = $horizon1 eq $horizon2;
+# Compare the horizons. There should be no change in the OID, so we will test
+# for equality. However, because autovacuum may have run between the two
+# horizon captures, we will check if the updated horizon XIDs are greater than or
+# equal to the originals.
+my $horizons_ok = $horizon2_values[0] eq $horizon1_values[0] &&
+	int($horizon2_values[1]) >= int($horizon1_values[1]) &&
+	int($horizon2_values[2]) >= int($horizon1_values[2]);
 ok($horizons_ok, 'old and new horizons match after pg_upgrade');
 
 # Provide more context if the horizons do not match.
#60Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#57)
Re: pg15b2: large objects lost on upgrade

On Tue, Aug 2, 2022 at 3:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The test does look helpful and it would catch regressions. Loosely
quoting Robert on a different point upthread, we don't want to turn off
the alarm just because it's spuriously going off.
I think the weakened check is OK (and possibly mimics the real-world
where autovacuum runs), unless you see a major drawback to it?

I also think that ">=" is a sufficient requirement. It'd be a
bit painful to test if we had to cope with potential XID wraparound,
but we know that these installations haven't been around nearly
long enough for that, so a plain ">=" test ought to be good enough.
(Replacing the simple "eq" code with something that can handle that
doesn't look like much fun, though.)

I don't really like this approach. Imagine that the code got broken in
such a way that relfrozenxid and relminmxid were set to a value chosen
at random - say, the contents of 4 bytes of unallocated memory that
contained random garbage. Well, right now, the chances that this would
cause a test failure are nearly 100%. With this change, they'd be
nearly 0%.

--
Robert Haas
EDB: http://www.enterprisedb.com

#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#60)
Re: pg15b2: large objects lost on upgrade

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Aug 2, 2022 at 3:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I also think that ">=" is a sufficient requirement.

I don't really like this approach. Imagine that the code got broken in
such a way that relfrozenxid and relminmxid were set to a value chosen
at random - say, the contents of 4 bytes of unallocated memory that
contained random garbage. Well, right now, the chances that this would
cause a test failure are nearly 100%. With this change, they'd be
nearly 0%.

If you have a different solution that you can implement by, say,
tomorrow, then go for it. But I want to see some fix in there
within about 24 hours, because 15beta3 wraps on Monday and we
will need at least a few days to see if the buildfarm is actually
stable with whatever solution is applied.

A possible compromise is to allow new values that are between
old value and old-value-plus-a-few-dozen.

regards, tom lane

#62Jonathan S. Katz
jkatz@postgresql.org
In reply to: Tom Lane (#61)
Re: pg15b2: large objects lost on upgrade

On Aug 3, 2022, at 10:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Aug 2, 2022 at 3:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I also think that ">=" is a sufficient requirement.

I don't really like this approach. Imagine that the code got broken in
such a way that relfrozenxid and relminmxid were set to a value chosen
at random - say, the contents of 4 bytes of unallocated memory that
contained random garbage. Well, right now, the chances that this would
cause a test failure are nearly 100%. With this change, they'd be
nearly 0%.

If you have a different solution that you can implement by, say,
tomorrow, then go for it. But I want to see some fix in there
within about 24 hours, because 15beta3 wraps on Monday and we
will need at least a few days to see if the buildfarm is actually
stable with whatever solution is applied.

Yeah, I would argue that the current proposal
guards against the false positives as they currently stand.

I do think Robert raises a fair point, but I wonder
if another test would catch that? I don’t want to
say “this would never happen” because, well,
it could happen. But AIUI this would probably
manifest itself in other places too?

A possible compromise is to allow new values that are between
old value and old-value-plus-a-few-dozen.

Well, that’s kind of deterministic :-) I’m OK
with that tweak, where “OK” means not thrilled,
but I don’t see a better way to get more granular
details (at least through my phone searches).

I can probably have a tweak for this in a couple
of hours if and when I’m on plane wifi.

Jonathan

#63Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#61)
Re: pg15b2: large objects lost on upgrade

On Wed, Aug 3, 2022 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

If you have a different solution that you can implement by, say,
tomorrow, then go for it. But I want to see some fix in there
within about 24 hours, because 15beta3 wraps on Monday and we
will need at least a few days to see if the buildfarm is actually
stable with whatever solution is applied.

I doubt that I can come up with something that quickly, so I guess we
need some stopgap for now.

--
Robert Haas
EDB: http://www.enterprisedb.com

#64Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#54)
Re: pg15b2: large objects lost on upgrade

On Tue, Aug 2, 2022 at 12:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
that attempts to turn off autovacuum on either the source server or
the destination. So one plausible theory is that autovac moved the
numbers since we checked.

It's very easy to believe that my work in commit 0b018fab could make
that happen, which is only a few months old. It's now completely
routine for non-aggressive autovacuums to advance relfrozenxid by at
least a small amount.

For example, when autovacuum runs against either the tellers table or
the branches table during a pgbench run, it will now advance
relfrozenxid, every single time. And to a very recent value. This will
happen in spite of the fact that no freezing ever takes place -- it's
just a consequence of the oldest extant XID consistently being quite
young each time, due to workload characteristics.

--
Peter Geoghegan

#65Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#60)
Re: pg15b2: large objects lost on upgrade

On Wed, Aug 3, 2022 at 6:59 AM Robert Haas <robertmhaas@gmail.com> wrote:

I don't really like this approach. Imagine that the code got broken in
such a way that relfrozenxid and relminmxid were set to a value chosen
at random - say, the contents of 4 bytes of unallocated memory that
contained random garbage. Well, right now, the chances that this would
cause a test failure are nearly 100%. With this change, they'd be
nearly 0%.

If that kind of speculative bug existed, and somehow triggered before
the concurrent autovacuum ran (which seems very likely to be the
source of the test flappiness), then it would still be caught, most
likely. VACUUM itself has the following defenses:

* The defensive "can't happen" errors added to
heap_prepare_freeze_tuple() and related freezing routines by commit
699bf7d0 in 2017, as hardening following the "freeze the dead" bug.
That'll catch XIDs that are before the relfrozenxid at the start of
the VACUUM (ditto for MXIDs/relminmxid).

* The assertion added in my recent commit 0b018fab, which verifies
that we're about to set relfrozenxid to something sane.

* VACUUM now warns when it sees a *previous* relfrozenxid that's
apparently "in the future", following recent commit e83ebfe6. This
problem scenario is associated with several historic bugs in
pg_upgrade, where for one reason or another it failed to carry forward
correct relfrozenxid and/or relminmxid values for a table (see the
commit message for references to those old pg_upgrade bugs).

It might make sense to run a manual VACUUM right at the end of the
test, so that you reliably get this kind of coverage, even without
autovacuum.
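
(In the TAP script that would be a one-liner — a sketch, assuming the
existing $newnode handle:

$newnode->safe_psql('regression', 'VACUUM;');

so the freezing sanity checks fire deterministically instead of only when
autovacuum happens to run.)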

--
Peter Geoghegan

#66Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#60)
Re: pg15b2: large objects lost on upgrade

Hi,

On 2022-08-03 09:59:40 -0400, Robert Haas wrote:

On Tue, Aug 2, 2022 at 3:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The test does look helpful and it would catch regressions. Loosely
quoting Robert on a different point upthread, we don't want to turn off
the alarm just because it's spuriously going off.
I think the weakened check is OK (and possibly mimics the real-world
where autovacuum runs), unless you see a major drawback to it?

I also think that ">=" is a sufficient requirement. It'd be a
bit painful to test if we had to cope with potential XID wraparound,
but we know that these installations haven't been around nearly
long enough for that, so a plain ">=" test ought to be good enough.
(Replacing the simple "eq" code with something that can handle that
doesn't look like much fun, though.)

I don't really like this approach. Imagine that the code got broken in
such a way that relfrozenxid and relminmxid were set to a value chosen
at random - say, the contents of 4 bytes of unallocated memory that
contained random garbage. Well, right now, the chances that this would
cause a test failure are nearly 100%. With this change, they'd be
nearly 0%.

Can't that pretty easily be addressed by subsequently querying txid_current(),
and checking that the value isn't newer than that?
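
(Roughly, in the test's terms — a sketch with illustrative variable names,
bounding the new value by the new cluster's XID counter:

my $next_xid = $newnode->safe_psql('regression',
	"SELECT txid_current() % (2^32)::bigint;");
my $ok = int($new_relfrozenxid) >= int($old_relfrozenxid)
	&& int($new_relfrozenxid) <= int($next_xid);

txid_current() returns an epoch-qualified 64-bit value, hence the modulo
to recover the 32-bit counter.)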

Greetings,

Andres Freund

#67Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#66)
Re: pg15b2: large objects lost on upgrade

On Wed, Aug 3, 2022 at 1:20 PM Andres Freund <andres@anarazel.de> wrote:

I don't really like this approach. Imagine that the code got broken in
such a way that relfrozenxid and relminmxid were set to a value chosen
at random - say, the contents of 4 bytes of unallocated memory that
contained random garbage. Well, right now, the chances that this would
cause a test failure are nearly 100%. With this change, they'd be
nearly 0%.

Can't that pretty easily be addressed by subsequently querying txid_current(),
and checking that the value isn't newer than that?

It couldn't hurt to do that as well, in passing (at the same time as
testing that newrelfrozenxid >= oldrelfrozenxid directly). But
deliberately running VACUUM afterwards seems like a good idea. We
really ought to expect VACUUM to catch cases where
relfrozenxid/relminmxid is faulty, at least in cases where it can be
proven wrong by noticing some kind of inconsistency.

--
Peter Geoghegan

#68Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#67)
Re: pg15b2: large objects lost on upgrade

Peter Geoghegan <pg@bowt.ie> writes:

It couldn't hurt to do that as well, in passing (at the same time as
testing that newrelfrozenxid >= oldrelfrozenxid directly). But
deliberately running VACUUM afterwards seems like a good idea. We
really ought to expect VACUUM to catch cases where
relfrozenxid/relminmxid is faulty, at least in cases where it can be
proven wrong by noticing some kind of inconsistency.

That doesn't seem like it'd be all that thorough: we expect VACUUM
to skip pages whenever possible. I'm also a bit concerned about
the expense, though admittedly this test is ridiculously expensive
already.

regards, tom lane

#69Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#66)
Re: pg15b2: large objects lost on upgrade

On Wed, Aug 3, 2022 at 4:20 PM Andres Freund <andres@anarazel.de> wrote:

I don't really like this approach. Imagine that the code got broken in
such a way that relfrozenxid and relminmxid were set to a value chosen
at random - say, the contents of 4 bytes of unallocated memory that
contained random garbage. Well, right now, the chances that this would
cause a test failure are nearly 100%. With this change, they'd be
nearly 0%.

Can't that pretty easily be addressed by subsequently querying txid_current(),
and checking that the value isn't newer than that?

Hmm, maybe. The old cluster shouldn't have wrapped around ever, since
we just created it. So the value in the new cluster should be >= that
value and <= the result of txid_current() ignoring wraparound.

Or we could disable autovacuum on the new cluster, which I think is a
better solution. I like it when things match exactly; it makes me feel
that the universe is well-ordered.

--
Robert Haas
EDB: http://www.enterprisedb.com

#70Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#68)
Re: pg15b2: large objects lost on upgrade

On Wed, Aug 3, 2022 at 1:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That doesn't seem like it'd be all that thorough: we expect VACUUM
to skip pages whenever possible. I'm also a bit concerned about
the expense, though admittedly this test is ridiculously expensive
already.

I bet the SKIP_PAGES_THRESHOLD stuff will be enough to make VACUUM
visit every heap page in practice for a test case like this. That is
all it takes to be able to safely advance relfrozenxid to whatever the
oldest extant XID happened to be. However, I'm no fan of the
SKIP_PAGES_THRESHOLD behavior, and already have plans to get rid of it
-- so I wouldn't rely on that continuing to be true forever.

It's probably not really necessary to have that kind of coverage in
this particular test case. VACUUM will complain about weird
relfrozenxid values in a large variety of contexts, even without
assertions enabled. Mostly I was just saying: if we really do need
test coverage of relfrozenxid in this context, then VACUUM is probably
the way to go.

--
Peter Geoghegan

#71Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#69)
Re: pg15b2: large objects lost on upgrade

Robert Haas <robertmhaas@gmail.com> writes:

Or we could disable autovacuum on the new cluster, which I think is a
better solution. I like it when things match exactly; it makes me feel
that the universe is well-ordered.

Again, this seems to me to be breaking the test's real-world applicability
for a (false?) sense of stability.

regards, tom lane

#72Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#71)
Re: pg15b2: large objects lost on upgrade

On 2022-08-03 16:46:57 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Or we could disable autovacuum on the new cluster, which I think is a
better solution. I like it when things match exactly; it makes me feel
that the universe is well-ordered.

Again, this seems to me to be breaking the test's real-world applicability
for a (false?) sense of stability.

Yea, that doesn't seem like an improvement. For example, I found the
issues around relfilenode reuse in 15 due to autovacuum running in the
pg_upgrade target cluster. And I recall other bugs in the area...

Greetings,

Andres Freund

#73Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#71)
Re: pg15b2: large objects lost on upgrade

On Wed, Aug 3, 2022 at 1:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Again, this seems to me to be breaking the test's real-world applicability
for a (false?) sense of stability.

I agree.

A lot of the VACUUM test flappiness issues we've had to deal with in
the past now seem like problems with VACUUM itself, the test's design,
or both. For example, why should we get a totally different
pg_class.reltuples because we couldn't get a cleanup lock on some
page? Why not just make sure to give the same answer either way,
which happens to be the most useful behavior to the user? That way
the test isn't just targeting implementation details.

--
Peter Geoghegan

#74Jonathan S. Katz
jkatz@postgresql.org
In reply to: Peter Geoghegan (#73)
Re: pg15b2: large objects lost on upgrade

On 8/3/22 2:08 PM, Peter Geoghegan wrote:

On Wed, Aug 3, 2022 at 1:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Again, this seems to me to be breaking the test's real-world applicability
for a (false?) sense of stability.

I agree.

A lot of the VACUUM test flappiness issues we've had to deal with in
the past now seem like problems with VACUUM itself, the test's design,
or both. For example, why should we get a totally different
pg_class.reltuples because we couldn't get a cleanup lock on some
page? Why not just make sure to give the same answer either way,
which happens to be the most useful behavior to the user? That way
the test isn't just targeting implementation details.

After catching up (and reviewing approaches that could work while on
poor wifi), it does make me wonder if we can have a useful test ready
before beta 3.

I did rule out wanting to do the "xid + $X" check after reviewing some
of the output. I think that both $X could end up varying, and it really
feels like a bandaid.

Andres suggested upthread using "txid_current()" -- for the comparison,
that's one thing I looked at. Would any of the XID info from
"pg_control_checkpoint()" also serve for this test?

If yes to the above, I should be able to modify this fairly quickly.

Jonathan

#75Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonathan S. Katz (#74)
Re: pg15b2: large objects lost on upgrade

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

I did rule out wanting to do the "xid + $X" check after reviewing some
of the output. I think that both $X could end up varying, and it really
feels like a bandaid.

It is that. I wouldn't feel comfortable with $X less than 100 or so,
which is probably sloppy enough to draw Robert's ire. Still, realizing
that what we want right now is a band-aid for 15beta3, I don't think
it's an unreasonable short-term option.

Andres suggested upthread using "txid_current()" -- for the comparison,
that's one thing I looked at. Would any of the XID info from
"pg_control_checkpoint()" also serve for this test?

I like the idea of txid_current(), but we have no comparable
function for mxid do we? While you could get both numbers from
pg_control_checkpoint(), I doubt that's sufficiently up-to-date.

regards, tom lane

#76Jonathan S. Katz
jkatz@postgresql.org
In reply to: Tom Lane (#75)
1 attachment(s)
Re: pg15b2: large objects lost on upgrade

On 8/3/22 4:19 PM, Tom Lane wrote:

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

I did rule out wanting to do the "xid + $X" check after reviewing some
of the output. I think that both $X could end up varying, and it really
feels like a bandaid.

It is that. I wouldn't feel comfortable with $X less than 100 or so,
which is probably sloppy enough to draw Robert's ire. Still, realizing
that what we want right now is a band-aid for 15beta3, I don't think
it's an unreasonable short-term option.

Attached is the "band-aid / sloppy" version of the patch. Given that in
the test examples I kept seeing deltas over 100 for relfrozenxid, I chose
1000. The mxid delta was less, but I kept it at 1000 for consistency
(and because I hope this test is short-lived in this state), but can be
talked into otherwise.

Andres suggested upthread using "txid_current()" -- for the comparison,
that's one thing I looked at. Would any of the XID info from
"pg_control_checkpoint()" also serve for this test?

I like the idea of txid_current(), but we have no comparable
function for mxid do we? While you could get both numbers from
pg_control_checkpoint(), I doubt that's sufficiently up-to-date.

...unless we force a checkpoint in the test?

Jonathan

Attachments:

pg_upgrade-test-gte-xid-v2.patch (text/plain)
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 4cbc75644c..ced889601f 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -226,6 +226,8 @@ ORDER BY c.oid::regclass::text
 EOM
 $horizon_query =~ s/\s+/ /g; # run it together on one line
 my $horizon1 = $oldnode->safe_psql('regression', $horizon_query);
+chomp($horizon1);
+my @horizon1_values = split(/\|/, $horizon1);
 
 # In a VPATH build, we'll be started in the source directory, but we want
 # to run pg_upgrade in the build directory so that any files generated finish
@@ -313,6 +315,8 @@ $newnode->command_ok(
 
 # And record the horizons from the upgraded cluster as well.
 my $horizon2 = $newnode->safe_psql('regression', $horizon_query);
+chomp($horizon2);
+my @horizon2_values = split(/\|/, $horizon2);
 
 # Compare the two dumps, there should be no differences.
 my $compare_res = compare("$tempdir/dump1.sql", "$tempdir/dump2.sql");
@@ -331,8 +335,13 @@ if ($compare_res != 0)
 	print "=== EOF ===\n";
 }
 
-# Compare the horizons, there should be no differences.
-my $horizons_ok = $horizon1 eq $horizon2;
+# Compare the horizons. There should be no change in the OID, so we will test
+# for equality. However, because autovacuum may have run between the two
+# horizon captures, we allow the updated horizon XIDs to exceed the
+# originals by up to 1000.
+my $horizons_ok = $horizon2_values[0] eq $horizon1_values[0] &&
+	int($horizon2_values[1]) <= int($horizon1_values[1]) + 1000 &&
+	int($horizon2_values[2]) <= int($horizon1_values[2]) + 1000;
 ok($horizons_ok, 'old and new horizons match after pg_upgrade');
 
 # Provide more context if the horizons do not match.
#77Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#75)
1 attachment(s)
Re: pg15b2: large objects lost on upgrade

On Wed, Aug 3, 2022 at 7:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

I did rule out wanting to do the "xid + $X" check after reviewing some
of the output. I think that both $X could end up varying, and it really
feels like a bandaid.

It is that. I wouldn't feel comfortable with $X less than 100 or so,
which is probably sloppy enough to draw Robert's ire. Still, realizing
that what we want right now is a band-aid for 15beta3, I don't think
it's an unreasonable short-term option.

100 << 2^32, so it's not terrible, but I'm honestly coming around to
the view that we ought to just nuke this test case.

From my point of view, the assertion that disabling autovacuum during
this test case would make the test case useless seems to be incorrect.
The original purpose of the test was to make sure that the pre-upgrade
schema matched the post-upgrade schema. If having autovacuum running
or not running affects that, we have a serious problem, but this test
case isn't especially likely to find it, because whether autovacuum
runs or not during the brief window where the test is running is
totally unpredictable. Furthermore, if we do have such a problem, it
would probably indicate that vacuum is using the wrong horizons to
prune or test the visibility of the tuples. To find that out, we might
want to compare values upon which the behavior of vacuum might depend,
like relfrozenxid. But to do that, we have to disable autovacuum, so
that the value can't change under us. From my point of view, that's
making test coverage better, not worse, because any bugs in this area
that can be found without explicit testing of relevant horizons are
dependent on low-probability race conditions materializing in the
buildfarm. If we disable autovacuum and then compare relfrozenxid and
whatever else we care about explicitly, we can find bugs in that
category reliably.

However, if people don't accept that argument, then this new test case
is kind of silly. It's not the worst idea in the world to use a
threshold of 100 XIDs or something, but without disabling autovacuum,
we're basically comparing two things that can't be expected to be
equal, so we test and see if they're approximately equal and then call
that good enough. I don't know that I believe we'll ever find a bug
that way, though.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

enable-autovacuum-later.patch (application/octet-stream)
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 4cbc75644c..755b26b605 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -148,6 +148,7 @@ if (defined($ENV{oldinstall}))
 # Initialize a new node for the upgrade.
 my $newnode = PostgreSQL::Test::Cluster->new('new_node');
 $newnode->init(extra => [ '--wal-segsize', '1', '--allow-group-access' ]);
+$newnode->append_conf('postgresql.conf', 'autovacuum = off');
 my $newbindir = $newnode->config_data('--bindir');
 my $oldbindir = $oldnode->config_data('--bindir');
 
@@ -302,6 +303,11 @@ if (-d $log_path)
 	}
 }
 
+# Record the horizons from the upgraded cluster, then enable autovacuum.
+my $horizon2 = $newnode->safe_psql('regression', $horizon_query);
+$newnode->safe_psql('regression', "ALTER SYSTEM SET autovacuum = 'on'");
+$newnode->safe_psql('regression', "SELECT pg_reload_conf();");
+
 # Second dump from the upgraded instance.
 $newnode->command_ok(
 	[
@@ -311,9 +317,6 @@ $newnode->command_ok(
 	],
 	'dump after running pg_upgrade');
 
-# And record the horizons from the upgraded cluster as well.
-my $horizon2 = $newnode->safe_psql('regression', $horizon_query);
-
 # Compare the two dumps, there should be no differences.
 my $compare_res = compare("$tempdir/dump1.sql", "$tempdir/dump2.sql");
 is($compare_res, 0, 'old and new dumps match after pg_upgrade');
#78Robert Haas
robertmhaas@gmail.com
In reply to: Jonathan S. Katz (#76)
Re: pg15b2: large objects lost on upgrade

On Thu, Aug 4, 2022 at 10:02 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:

Attached is the "band-aid / sloppy" version of the patch. Given that in
the test examples I kept seeing deltas over 100 for relfrozenxid, I chose
1000. The mxid delta was less, but I kept it at 1000 for consistency
(and because I hope this test is short-lived in this state), but can be
talked into otherwise.

ISTM that you'd need to loop over the rows and do this for each row.
Otherwise I think you're just comparing results for the first relation
and ignoring all the rest.
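
(Concretely, a sketch of the per-row version — split the psql output into
lines first, then split each line on '|':

my @rows1 = split(/\n/, $horizon1);
my @rows2 = split(/\n/, $horizon2);
my $horizons_ok = scalar(@rows1) == scalar(@rows2);
for my $i (0 .. $#rows1)
{
	my @old = split(/\|/, $rows1[$i]);
	my @new = split(/\|/, $rows2[$i]);
	$horizons_ok &&= $new[0] eq $old[0]
		&& int($new[1]) >= int($old[1])
		&& int($new[2]) >= int($old[2]);
}

The proposed patch splits the whole multi-row string on '|' in one go, so
only the first relation's fields land in the first three array slots.)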

--
Robert Haas
EDB: http://www.enterprisedb.com

#79Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonathan S. Katz (#76)
Re: pg15b2: large objects lost on upgrade

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 8/3/22 4:19 PM, Tom Lane wrote:

I like the idea of txid_current(), but we have no comparable
function for mxid do we? While you could get both numbers from
pg_control_checkpoint(), I doubt that's sufficiently up-to-date.

...unless we force a checkpoint in the test?

Hmm ... maybe if you take a snapshot and hold that open while
forcing the checkpoint and doing the subsequent checks. That
seems messy though. Also, while that should serve to hold back
global xmin, I'm not at all clear on whether that has a similar
effect on minmxid.

regards, tom lane

#80Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#77)
Re: pg15b2: large objects lost on upgrade

Robert Haas <robertmhaas@gmail.com> writes:

100 << 2^32, so it's not terrible, but I'm honestly coming around to
the view that we ought to just nuke this test case.

I'd hesitated to suggest that, but I think that's a fine solution.
Especially since we can always put it back in later if we think
of a more robust way.

regards, tom lane

#81Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#80)
Re: pg15b2: large objects lost on upgrade

On Thu, Aug 4, 2022 at 10:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

100 << 2^32, so it's not terrible, but I'm honestly coming around to
the view that we ought to just nuke this test case.

I'd hesitated to suggest that, but I think that's a fine solution.
Especially since we can always put it back in later if we think
of a more robust way.

IMHO it's 100% clear how to make it robust. If you want to check that
two values are the same, you can't let one of them be overwritten by
an unrelated event in the middle of the check. There are many specific
things we could do here, a few of which I proposed in my previous
email, but they all boil down to "don't let autovacuum screw up the
results".

But if you don't want to do that, and you also don't want to have
random failures, the only alternatives are weakening the check and
removing the test. It's kind of hard to say which is better, but I'm
inclined to think that if we just weaken the test we're going to think
we've got coverage for this kind of problem when we really don't.

--
Robert Haas
EDB: http://www.enterprisedb.com

#82Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#81)
Re: pg15b2: large objects lost on upgrade

Hi,

On 2022-08-04 12:43:49 -0400, Robert Haas wrote:

On Thu, Aug 4, 2022 at 10:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

100 << 2^32, so it's not terrible, but I'm honestly coming around to
the view that we ought to just nuke this test case.

I'd hesitated to suggest that, but I think that's a fine solution.
Especially since we can always put it back in later if we think
of a more robust way.

IMHO it's 100% clear how to make it robust. If you want to check that
two values are the same, you can't let one of them be overwritten by
an unrelated event in the middle of the check. There are many specific
things we could do here, a few of which I proposed in my previous
email, but they all boil down to "don't let autovacuum screw up the
results".

But if you don't want to do that, and you also don't want to have
random failures, the only alternatives are weakening the check and
removing the test. It's kind of hard to say which is better, but I'm
inclined to think that if we just weaken the test we're going to think
we've got coverage for this kind of problem when we really don't.

Why do you think it's better to not have the test than to have a very limited
amount of fuzziness (by using the next xid as an upper limit)? What's the bug
that will reliably pass the nextxid fuzzy comparison, but not an exact
comparison?

Greetings,

Andres Freund

#83Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#81)
Re: pg15b2: large objects lost on upgrade

Robert Haas <robertmhaas@gmail.com> writes:

IMHO it's 100% clear how to make it robust. If you want to check that
two values are the same, you can't let one of them be overwritten by
an unrelated event in the middle of the check. There are many specific
things we could do here, a few of which I proposed in my previous
email, but they all boil down to "don't let autovacuum screw up the
results".

It doesn't really matter how robust a test case is, if it isn't testing
the thing you need to have tested. So I remain unwilling to disable
autovac in a way that won't match real-world usage.

Note that the patch you proposed at [1] will not fix anything.
It turns off autovac in the new node, but the buildfarm failures
we've seen appear to be due to autovac running on the old node.
(I believe that autovac in the new node is *also* a hazard, but
it seems to be a lot less of one, presumably because of timing
considerations.) To make it work, we'd have to shut off autovac
in the old node before starting pg_upgrade, and that would make it
unacceptably (IMHO) different from what real users will do.

Conceivably, we could move all of this processing into pg_upgrade
itself --- autovac disable/re-enable and capturing of the horizon
data --- and that would address my complaint. I don't really want
to go there though, especially when in the final analysis IT IS
NOT A BUG if a rel's horizons advance a bit during pg_upgrade.
It's only a bug if they become inconsistent with the rel's data,
which is not what this test is testing for.

regards, tom lane

[1]: /messages/by-id/CA+TgmoZkBcMi+Nikxfc54dgkWj41Q=Z4nuyHpheTcxA-qfS5Qg@mail.gmail.com

#84Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#81)
Re: pg15b2: large objects lost on upgrade

On Thu, Aug 4, 2022 at 9:44 AM Robert Haas <robertmhaas@gmail.com> wrote:

But if you don't want to do that, and you also don't want to have
random failures, the only alternatives are weakening the check and
removing the test. It's kind of hard to say which is better, but I'm
inclined to think that if we just weaken the test we're going to think
we've got coverage for this kind of problem when we really don't.

Perhaps amcheck's verify_heapam() function can be used here. What
could be better than exhaustively verifying that the relfrozenxid (and
relminmxid) invariants hold for every single tuple in the table? Those
are the exact conditions that we care about, as far as
relfrozenxid/relminmxid goes.
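
(A sketch of what that could look like in the test, assuming amcheck is
available in the regression database:

$newnode->safe_psql('regression',
	'CREATE EXTENSION IF NOT EXISTS amcheck;');
my $corruption = $newnode->safe_psql('regression',
	"SELECT * FROM verify_heapam('pg_largeobject');");
is($corruption, '', 'no tuple-level horizon violations after upgrade');

where an empty result means verify_heapam() found nothing to complain
about.)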

My sense is that that has a much better chance of detecting a real bug
at some point. This approach is arguably an example of property-based
testing.

--
Peter Geoghegan

#85Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#84)
Re: pg15b2: large objects lost on upgrade

Peter Geoghegan <pg@bowt.ie> writes:

Perhaps amcheck's verify_heapam() function can be used here. What
could be better than exhaustively verifying that the relfrozenxid (and
relminmxid) invariants hold for every single tuple in the table?

How much will that add to the test's runtime? I could get behind this
idea if it's not exorbitantly expensive.

regards, tom lane

#86Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#85)
Re: pg15b2: large objects lost on upgrade

On Thu, Aug 4, 2022 at 11:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

How much will that add to the test's runtime? I could get behind this
idea if it's not exorbitantly expensive.

I'm not sure offhand, but I suspect it wouldn't be too bad.

--
Peter Geoghegan

#87Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#83)
Re: pg15b2: large objects lost on upgrade

On Thu, Aug 4, 2022 at 1:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Note that the patch you proposed at [1] will not fix anything.
It turns off autovac in the new node, but the buildfarm failures
we've seen appear to be due to autovac running on the old node.
(I believe that autovac in the new node is *also* a hazard, but
it seems to be a lot less of one, presumably because of timing
considerations.) To make it work, we'd have to shut off autovac
in the old node before starting pg_upgrade,

Yeah, that's a fair point.

and that would make it
unacceptably (IMHO) different from what real users will do.

I don't agree with that, but as you say, it is a matter of opinion. In
any case, what exactly do you want to do now?

Jonathan Katz has proposed a patch to do the fuzzy comparison which I
believe to be incorrect because I think it compares, at most, the
horizons for one table in the database.

I could go work on a better version of that, or he could, or you
could, but it seems like we're running out of time awfully quick here,
given that you wanted to have this resolved today and it's almost the
end of today.

I think the most practical alternative is to put this file back to the
way it was before I started tinkering with it, and revisit this issue
after the release. If you want to do something else, that's fine, but
I'm not going to be available to work on this issue over the weekend,
so if you want to do something else, you or someone else is going to
have to take responsibility for whatever further stabilization that
other approach may require between now and the release.

--
Robert Haas
EDB: http://www.enterprisedb.com

#88Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#87)
Re: pg15b2: large objects lost on upgrade

Robert Haas <robertmhaas@gmail.com> writes:

I think the most practical alternative is to put this file back to the
way it was before I started tinkering with it, and revisit this issue
after the release.

Yeah, that seems like the right thing. We are running too low on time
to have any confidence that a modified version of the test will be
reliable.

regards, tom lane

#89Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#82)
Re: pg15b2: large objects lost on upgrade

On Thu, Aug 4, 2022 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:

Why you think it's better to not have the test than to have a very limited
amount of fuzziness (by using the next xid as an upper limit). What's the bug
that will reliably pass the nextxid fuzzy comparison, but not an exact
comparison?

I don't know. I mean, I guess one possibility is that the nextXid
value could be wrong too, because I doubt we have any separate test
that would catch that. But more generally, I don't have a lot of
confidence in fuzzy tests. It's too easy for things to look like
they're working when they really aren't.

Let's say that the value in the old cluster was 100 and the nextXid in
the new cluster is 1000. Well, it's not like every value between 100
and 1000 is OK. The overwhelming majority of those values could never
be produced, neither from the old cluster nor from any subsequent
vacuum. Given that the old cluster is suffering no new write
transactions, there's probably exactly two values that are legal: one
being the value from the old cluster, which we know, and the other
being whatever a vacuum of that table would produce, which we don't
know, although we do know that it's somewhere in that range. Let's
flip the question on its head: why should some hypothetical future bug
that we have in this area produce a value outside that range?

If it's a failure to set the value at all, or if it generates a value
at random, we'd likely still catch it. And those are pretty likely, so
maybe the value of such a test is not zero. On the other hand, subtle
breakage might be more likely to survive developer testing than
complete breakage.

--
Robert Haas
EDB: http://www.enterprisedb.com

#90Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#89)
Re: pg15b2: large objects lost on upgrade

On Thu, Aug 4, 2022 at 12:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

Given that the old cluster is suffering no new write
transactions, there's probably exactly two values that are legal: one
being the value from the old cluster, which we know, and the other
being whatever a vacuum of that table would produce, which we don't
know, although we do know that it's somewhere in that range.

What about autoanalyze?

--
Peter Geoghegan

#91Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#90)
Re: pg15b2: large objects lost on upgrade

On Thu, Aug 4, 2022 at 3:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Aug 4, 2022 at 12:15 PM Robert Haas <robertmhaas@gmail.com> wrote:

Given that the old cluster is suffering no new write
transactions, there's probably exactly two values that are legal: one
being the value from the old cluster, which we know, and the other
being whatever a vacuum of that table would produce, which we don't
know, although we do know that it's somewhere in that range.

What about autoanalyze?

What about it?

--
Robert Haas
EDB: http://www.enterprisedb.com

#92Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#88)
Re: pg15b2: large objects lost on upgrade

On Thu, Aug 4, 2022 at 3:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I think the most practical alternative is to put this file back to the
way it was before I started tinkering with it, and revisit this issue
after the release.

Yeah, that seems like the right thing. We are running too low on time
to have any confidence that a modified version of the test will be
reliable.

Done.

--
Robert Haas
EDB: http://www.enterprisedb.com

#93Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#91)
Re: pg15b2: large objects lost on upgrade

On Thu, Aug 4, 2022 at 12:31 PM Robert Haas <robertmhaas@gmail.com> wrote:

What about autoanalyze?

What about it?

It has a tendency to consume an XID, here or there, quite
unpredictably. I've noticed that this often involves an analyze of
pg_statistic. Have you accounted for that?

You said upthread that you don't like "fuzzy" tests, because it's too
easy for things to look like they're working when they really aren't.
I suppose that there may be some truth to that, but ISTM that there is
also a lot to be said for a test that can catch failures that weren't
specifically anticipated. Users won't be running pg_upgrade with
autovacuum disabled. And so ISTM that just testing that relfrozenxid
has been carried forward is more precise about one particular detail
(more precise than alternative approaches to testing), but less
precise about the thing that we actually care about.

--
Peter Geoghegan

#94Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#54)
Re: pg15b2: large objects lost on upgrade

On Tue, Aug 2, 2022 at 03:32:05PM -0400, Tom Lane wrote:

"Jonathan S. Katz" <jkatz@postgresql.org> writes:

On 8/2/22 1:12 PM, Tom Lane wrote:

Sadly, we're still not out of the woods. I see three buildfarm
failures in this test since Robert resolved the "-X" problem [1][2][3]:

Looking at the test code, is there anything that could have changed the
relfrozenxid or relminmxid independently of the test on these systems?

Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
that attempts to turn off autovacuum on either the source server or
the destination. So one plausible theory is that autovac moved the
numbers since we checked.

Uh, pg_upgrade assumes autovacuum is not running, and tries to enforce
this:

start_postmaster()
...
	/*
	 * Use -b to disable autovacuum.
	 *
	 * Turn off durability requirements to improve object creation speed, and
	 * we only modify the new cluster, so only use it there. If there is a
	 * crash, the new cluster has to be recreated anyway. fsync=off is a big
	 * win on ext4.
	 *
	 * Force vacuum_defer_cleanup_age to 0 on the new cluster, so that
	 * vacuumdb --freeze actually freezes the tuples.
	 */
	snprintf(cmd, sizeof(cmd),
			 "\"%s/pg_ctl\" -w -l \"%s/%s\" -D \"%s\" -o \"-p %d -b%s %s%s\" start",
			 cluster->bindir,
			 log_opts.logdir,
			 SERVER_LOG_FILE, cluster->pgconfig, cluster->port,
			 (cluster == &new_cluster) ?
			 " -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c vacuum_defer_cleanup_age=0" : "",
			 cluster->pgopts ? cluster->pgopts : "", socket_string);

Perhaps the test script should do something similar, or this method
doesn't work anymore.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#95Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#94)
Re: pg15b2: large objects lost on upgrade

Bruce Momjian <bruce@momjian.us> writes:

Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
that attempts to turn off autovacuum on either the source server or
the destination. So one plausible theory is that autovac moved the
numbers since we checked.

Uh, pg_upgrade assumes autovacuum is not running, and tries to enforce
this:

The problems come from autovac running before or after pg_upgrade.

Perhaps the test script should do something similar,

I'm not on board with that, for the reasons I gave upthread.

regards, tom lane

#96Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#95)
Re: pg15b2: large objects lost on upgrade

On Mon, Aug 8, 2022 at 09:51:46PM -0400, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
that attempts to turn off autovacuum on either the source server or
the destination. So one plausible theory is that autovac moved the
numbers since we checked.

Uh, pg_upgrade assumes autovacuum is not running, and tries to enforce
this:

The problems come from autovac running before or after pg_upgrade.

Perhaps the test script should do something similar,

I'm not on board with that, for the reasons I gave upthread.

Uh, I assume it is this paragraph:

If that is the explanation, then it leaves us with few good options.
I am not in favor of disabling autovacuum in the test: ordinary
users are not going to do that while pg_upgrade'ing, so it'd make
the test less representative of real-world usage, which seems like
a bad idea. We could either drop this particular check again, or
weaken it to allow new relfrozenxid >= old relfrozenxid, likewise
relminmxid.

I thought the test was setting up a configuration that would never be
used by normal servers. Is that false?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#97Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#96)
Re: pg15b2: large objects lost on upgrade

Bruce Momjian <bruce@momjian.us> writes:

I thought the test was setting up a configuration that would never be
used by normal servers. Is that false?

*If* we made it disable autovac before starting pg_upgrade,
then that would be a process not used by normal users.
I don't care whether pg_upgrade disables autovac during its
run; that's not what's at issue here.

regards, tom lane