pg_upgrade failing for 200+ million Large Objects

Started by Tharakan, Robins · 2 messages
#1 Tharakan, Robins
tharar@amazon.com

Hi,

While reviewing a failed upgrade from Postgres v9.5 (to v9.6), I saw that the
instance had ~200 million (in-use) Large Objects. I was able to reproduce
this on a test instance, which fails with a similar error.

pg_restore: executing BLOB 4980622
pg_restore: WARNING: database with OID 0 must be vacuumed within 1000001
transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that
database.
You might also need to commit or roll back old prepared transactions.
pg_restore: executing BLOB 4980623
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2565; 2613 4980623 BLOB
4980623 postgres
pg_restore: [archiver (db)] could not execute query: ERROR: database is not
accepting commands to avoid wraparound data loss in database with OID 0
HINT: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.
Command was: SELECT pg_catalog.lo_create('4980623');

To rule out the obvious possibilities: these Large Objects are still
in use (so vacuumlo wouldn't help), giving the instance more system resources
doesn't help, moving Large Objects to another database doesn't help (since
this is a cluster-wide restriction), the source instance is nowhere close to
wraparound, and the most recent minor versions don't help either (I tried
compiling 9_6_STABLE and upgrading a database with 150 million LOs and still
hit the same issue).
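
(For anyone wanting to verify the same on their own instance, something along
these lines shows how far each database is from wraparound; on the source
these ages were nowhere near the ~2 billion hard limit:)

psql -d postgres -c "SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY 2 DESC;"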

Do let me know if I am missing something obvious but it appears that this is
happening owing to 2 things coming together:

* Each Large Object is migrated in its own transaction during pg_upgrade
* pg_resetxlog appears to narrow the window available to pg_upgrade to ~146
million XIDs: 2^31, minus the 1 million XID wraparound safety margin, minus a
hard-coded 2 billion (see [1]), in what appears to be an attempt to force an
anti-wraparound autovacuum soon after the upgrade completes (arithmetic
spelled out below).
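
Spelling that out (my reading of [1]; the 1 million being the margin at which
the server stops accepting commands, per the warning above):

# XID horizon - hard-coded pg_resetxlog offset - wraparound stop margin
echo $(( 2**31 - 2000000000 - 1000000 ))   # prints 146483648, i.e. ~146 million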

Such an XID-based restriction is limiting for an instance that actively uses
a lot of Large Objects. Besides forcing the autovacuum wraparound logic to
kick in soon after the upgrade, I am unclear what else it aims to do. What it
does seem to do is block major-version upgrades when the pre-upgrade instance
has more than ~146 million Large Objects (half that, if each LO additionally
requires an ALTER LARGE OBJECT ... OWNER TO during pg_restore).

For the long term, a few ideas came to mind, though I am unsure which are
low-hanging fruit and which are outright impossible: club multiple objects
into a single transaction [2], force an autovacuum post-upgrade (and thus
remove this limitation altogether), or see whether "pg_resetxlog -x" (invoked
from within pg_upgrade) could help work around this limitation in some way.
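
To illustrate the first idea: instead of effectively issuing one lo_create()
per transaction, the restore work could be batched along these lines (only a
sketch of the idea in [2], with an arbitrary batch size and assuming the OIDs
are free; this is not how pg_restore behaves today):

{ echo 'BEGIN;'
  for oid in `seq 4980000 4980999`; do echo "SELECT pg_catalog.lo_create('$oid');"; done
  echo 'COMMIT;'
} | psql -d postgres
# creates 1000 large objects while consuming a single XID, instead of 1000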

Is there a short-term recommendation for this scenario?

I can understand that a high number of small objects is not a great way to
use pg_largeobject (since Large Objects were intended for, well, 'large
objects'), but this magic number of Large Objects has become a stalemate at
this point (with respect to the v9.5 EOL).

References:
1) pg_resetxlog - https://github.com/postgres/postgres/blob/ca3b37487be333a1d241dab1bbdd17a211a88f43/src/bin/pg_resetwal/pg_resetwal.c#L444
2) /messages/by-id/ed7d86a1-b907-4f53-9f6e-63482d2f2bac%2540manitou-mail.org

-
Thanks
Robins Tharakan

#2 Justin Pryzby
pryzby@telsasoft.com
In reply to: Tharakan, Robins (#1)
1 attachment(s)
Re: pg_upgrade failing for 200+ million Large Objects

On Wed, Mar 03, 2021 at 11:36:26AM +0000, Tharakan, Robins wrote:

While reviewing a failed upgrade from Postgres v9.5 (to v9.6), I saw that the
instance had ~200 million (in-use) Large Objects. I was able to reproduce
this on a test instance, which fails with a similar error.

If pg_upgrade can't handle millions of objects/transactions/XIDs, that seems
like a legitimate complaint, since apparently the system is working okay
otherwise.

But it also seems like you're using it outside the range of its intended use
(see also [1]). I'm guessing that not many people are going to spend time
running tests of pg_upgrade, each of which takes 25hr, not to mention some
multiple of 128GB RAM+swap.

Creating millions of large objects was too slow for me to test like this:
| time { echo 'begin;'; for a in `seq 1 99999`; do echo '\lo_import /dev/null'; done; echo 'commit;'; } |psql -qh /tmp postgres&

This seems to be enough for what's needed:
| ALTER SYSTEM SET fsync=no; ALTER SYSTEM SET full_page_writes=no; SELECT pg_reload_conf();
| INSERT INTO pg_largeobject_metadata SELECT a, 0 FROM generate_series(100000, 200111222)a;

Now, my test of pg_upgrade was killed after running 100min and using 60GB
RAM, so you might say that's a problem too. I converted getBlobs() to use a
cursor, like dumpBlobs(), but it was still killed. I think a test case and a
way to exercise this failure with a more reasonable amount of time and
resources might be a prerequisite for a patch to fix it.

pg_upgrade is meant for "immediate" upgrades, frequently allowing an upgrade
in minutes, where pg_dump | pg_restore might take hours or days. There are two
components to consider: the catalog/metadata part, and the data part. If the
data is large (let's say more than 100GB), then pg_upgrade is expected to be an
improvement over the "dump and restore" process, which is usually infeasible
for large DBs measured in TB.

But here the *catalog* part is large, and pg_upgrade still has to run pg_dump
and pg_restore over it. The time to do this might dominate the data part. Our
own customers' DBs are 100s of GB to 10TB. For large customers, pg_upgrade
takes 45min. In the past, we had tables with many column defaults, which
caused the dump+restore to be slow for a larger fraction of customers.

If it were me, in an EOL situation, I would look at either: 1) finding a way
to do dump+restore rather than pg_upgrade; and/or 2) separately pg_dump the
large objects, drop as many as you can, then pg_upgrade the DB, then restore
the large objects (and find a better way to store them in the future).
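
A very rough sketch of (2), untested; whether "--blobs -T '*'" really yields a
blobs-only custom-format dump, and whether any of this is practical at 200M
objects, are assumptions on my part (database name is a placeholder):

# dump only the large objects from the old cluster (exclude every table)
pg_dump -Fc --blobs -T '*' -f blobs.dump olddb
# unlink them in the old DB (e.g. SELECT lo_unlink(oid) FROM pg_largeobject_metadata;),
# run pg_upgrade as usual, then reload the blobs into the new cluster:
pg_restore -d olddb blobs.dump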

I was able to hack pg_upgrade to call pg_restore --single (with a separate
invocation to handle --create). That passes tests...but I can't say much
beyond that.

Regarding your existing patch: "make check" only tests SQL features.
For development, you'll want to configure like:
|./configure --enable-debug --enable-cassert --enable-tap-tests
And then use "make check-world", and in particular:
time make check -C src/bin/pg_resetwal
time make check -C src/bin/pg_upgrade

I don't think pg_restore needs a user-facing option for XIDs. I think it
should "just work", since a user might be as likely to shoot themselves in the
foot with a commandline option as they are to make an upgrade succeed that
would otherwise fail. pg_upgrade has a --check mode, and if that passes, the
upgrade is intended to work, not fail halfway through between the schema dump
and restore with the expectation that the user knows to rerun with some
commandline flags. If you pursue the patch with setting a different XID
threshold, maybe you could count the number of objects to be created, or
transactions to be used, and use that as the argument to pg_resetxlog? I'm not
sure, but pg_restore -l might be a good place to start looking.
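
For instance, something like this against the custom-format dump (filename is
a placeholder; blob TOC entries are labeled BLOB, as in the error output
above) would give a rough count of the transactions needed for large objects
alone:

pg_restore -l dumpfile.custom | grep -c ' BLOB '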

I think a goal for this patch should be to allow an increased number of
objects to be handled by pg_upgrade. Large objects may be a special case, and
increasing the number of other objects to be restored to the 100s of millions
might be unimportant.

--
Justin

[1]: /messages/by-id/502641.1606334432@sss.pgh.pa.us
| Does pg_dump really have sane performance for that situation, or
| are we soon going to be fielding requests to make it not be O(N^2)
| in the number of listed tables?

Attachments:

0001-pg_dump-use-a-cursor-in-getBlobs.patch (text/x-diff)
From cfc7400bb021659d49170e8b17d067c8e1b9fa33 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Tue, 9 Mar 2021 14:06:17 -0600
Subject: [PATCH] pg_dump: use a cursor in getBlobs..

..to mitigate huge memory use in the case of millions of large objects
---
 src/bin/pg_dump/pg_dump.c | 96 +++++++++++++++++++++------------------
 1 file changed, 52 insertions(+), 44 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index aa02ada079..3fd7f48605 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3333,7 +3333,7 @@ getBlobs(Archive *fout)
 	BlobInfo   *binfo;
 	DumpableObject *bdata;
 	PGresult   *res;
-	int			ntups;
+	int			ntups, total = 0;
 	int			i;
 	int			i_oid;
 	int			i_lomowner;
@@ -3341,9 +3341,12 @@ getBlobs(Archive *fout)
 	int			i_rlomacl;
 	int			i_initlomacl;
 	int			i_initrlomacl;
+	const char		*blobFetchQry = "FETCH 1000 IN blob";
 
 	pg_log_info("reading large objects");
 
+	appendPQExpBuffer(blobQry, "DECLARE blob CURSOR FOR ");
+
 	/* Fetch BLOB OIDs, and owner/ACL data if >= 9.0 */
 	if (fout->remoteVersion >= 90600)
 	{
@@ -3393,58 +3396,66 @@ getBlobs(Archive *fout)
 							 "NULL::oid AS initrlomacl "
 							 " FROM pg_largeobject");
 
-	res = ExecuteSqlQuery(fout, blobQry->data, PGRES_TUPLES_OK);
+	ExecuteSqlStatement(fout, blobQry->data);
+	destroyPQExpBuffer(blobQry);
 
-	i_oid = PQfnumber(res, "oid");
-	i_lomowner = PQfnumber(res, "rolname");
-	i_lomacl = PQfnumber(res, "lomacl");
-	i_rlomacl = PQfnumber(res, "rlomacl");
-	i_initlomacl = PQfnumber(res, "initlomacl");
-	i_initrlomacl = PQfnumber(res, "initrlomacl");
+	do {
+		res = ExecuteSqlQuery(fout, blobFetchQry, PGRES_TUPLES_OK);
 
-	ntups = PQntuples(res);
+		i_oid = PQfnumber(res, "oid");
+		i_lomowner = PQfnumber(res, "rolname");
+		i_lomacl = PQfnumber(res, "lomacl");
+		i_rlomacl = PQfnumber(res, "rlomacl");
+		i_initlomacl = PQfnumber(res, "initlomacl");
+		i_initrlomacl = PQfnumber(res, "initrlomacl");
 
-	/*
-	 * Each large object has its own BLOB archive entry.
-	 */
-	binfo = (BlobInfo *) pg_malloc(ntups * sizeof(BlobInfo));
+		ntups = PQntuples(res);
+		total += ntups;
 
-	for (i = 0; i < ntups; i++)
-	{
-		binfo[i].dobj.objType = DO_BLOB;
-		binfo[i].dobj.catId.tableoid = LargeObjectRelationId;
-		binfo[i].dobj.catId.oid = atooid(PQgetvalue(res, i, i_oid));
-		AssignDumpId(&binfo[i].dobj);
+		/*
+		 * Each large object has its own BLOB archive entry.
+		 */
+		binfo = (BlobInfo *) pg_malloc(ntups * sizeof(BlobInfo));
 
-		binfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_oid));
-		binfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_lomowner));
-		binfo[i].blobacl = pg_strdup(PQgetvalue(res, i, i_lomacl));
-		binfo[i].rblobacl = pg_strdup(PQgetvalue(res, i, i_rlomacl));
-		binfo[i].initblobacl = pg_strdup(PQgetvalue(res, i, i_initlomacl));
-		binfo[i].initrblobacl = pg_strdup(PQgetvalue(res, i, i_initrlomacl));
+		for (i = 0; i < ntups; i++)
+		{
+			binfo[i].dobj.objType = DO_BLOB;
+			binfo[i].dobj.catId.tableoid = LargeObjectRelationId;
+			binfo[i].dobj.catId.oid = atooid(PQgetvalue(res, i, i_oid));
+			AssignDumpId(&binfo[i].dobj);
+
+			binfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_oid));
+			binfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_lomowner));
+			binfo[i].blobacl = pg_strdup(PQgetvalue(res, i, i_lomacl));
+			binfo[i].rblobacl = pg_strdup(PQgetvalue(res, i, i_rlomacl));
+			binfo[i].initblobacl = pg_strdup(PQgetvalue(res, i, i_initlomacl));
+			binfo[i].initrblobacl = pg_strdup(PQgetvalue(res, i, i_initrlomacl));
+
+			if (PQgetisnull(res, i, i_lomacl) &&
+				PQgetisnull(res, i, i_rlomacl) &&
+				PQgetisnull(res, i, i_initlomacl) &&
+				PQgetisnull(res, i, i_initrlomacl))
+				binfo[i].dobj.dump &= ~DUMP_COMPONENT_ACL;
 
-		if (PQgetisnull(res, i, i_lomacl) &&
-			PQgetisnull(res, i, i_rlomacl) &&
-			PQgetisnull(res, i, i_initlomacl) &&
-			PQgetisnull(res, i, i_initrlomacl))
-			binfo[i].dobj.dump &= ~DUMP_COMPONENT_ACL;
+			/*
+			 * In binary-upgrade mode for blobs, we do *not* dump out the blob
+			 * data, as it will be copied by pg_upgrade, which simply copies the
+			 * pg_largeobject table. We *do* however dump out anything but the
+			 * data, as pg_upgrade copies just pg_largeobject, but not
+			 * pg_largeobject_metadata, after the dump is restored.
+			 */
+			if (dopt->binary_upgrade)
+				binfo[i].dobj.dump &= ~DUMP_COMPONENT_DATA;
+		}
 
-		/*
-		 * In binary-upgrade mode for blobs, we do *not* dump out the blob
-		 * data, as it will be copied by pg_upgrade, which simply copies the
-		 * pg_largeobject table. We *do* however dump out anything but the
-		 * data, as pg_upgrade copies just pg_largeobject, but not
-		 * pg_largeobject_metadata, after the dump is restored.
-		 */
-		if (dopt->binary_upgrade)
-			binfo[i].dobj.dump &= ~DUMP_COMPONENT_DATA;
-	}
+		PQclear(res);
+	} while (ntups != 0);
 
 	/*
 	 * If we have any large objects, a "BLOBS" archive entry is needed. This
 	 * is just a placeholder for sorting; it carries no data now.
 	 */
-	if (ntups > 0)
+	if (total > 0)
 	{
 		bdata = (DumpableObject *) pg_malloc(sizeof(DumpableObject));
 		bdata->objType = DO_BLOB_DATA;
@@ -3452,9 +3463,6 @@ getBlobs(Archive *fout)
 		AssignDumpId(bdata);
 		bdata->name = pg_strdup("BLOBS");
 	}
-
-	PQclear(res);
-	destroyPQExpBuffer(blobQry);
 }
 
 /*
-- 
2.17.0