speed up pg_upgrade with large number of tables

Started by 杨伯宇(长堂) over 1 year ago · 8 messages
#1 杨伯宇(长堂)
yangboyu.yby@alibaba-inc.com
3 attachments

Hello postgres hackers:

I have recently been working on speeding up pg_upgrade for databases with
over a million tables and would like to share some findings that may be
optimizable, or at least interesting.

1: Skip Compatibility Check In "pg_upgrade"
=============================================
In short, we have several databases, each with more than a million tables.
Running the compatibility check before pg_dump can take about half an hour.
If an online check has already been performed before the actual upgrade,
repeating it seems unnecessary and just adds to the downtime in many
situations.

So, I'm thinking, why not add a "--skip-check" option in pg_upgrade to skip it?
See "1-Skip_Compatibility_Check_v1.patch".

2: Accelerate "FastPathTransferRelationLocks"
===============================================
In this scenario, pg_restore takes much more time than pg_dump. By profiling
the "postgres" backend with perf, I found that much of the time is spent in
"LWLockAcquire" and "LWLockRelease". Diving deeper, I think I found the
reason:

When we try to create an index (very common during pg_restore), a "ShareLock"
on the relation must be held first. Such a lock is a "strong" lock, so before
we change the global lock hash table to acquire it, we must traverse every
proc and transfer its fast-path relation locks. And here the issue arises (in
"FastPathTransferRelationLocks"): we acquire "fpInfoLock" before accessing
"proc->databaseId", so we must acquire and release that LWLock "MaxBackends"
times for every index. The reason is recorded in the comment:
```
/*
* proc->databaseId is set at backend startup time and never changes
* thereafter, so it might be safe to perform this test before
* acquiring &proc->fpInfoLock. In particular, it's certainly safe to
* assume that if the target backend holds any fast-path locks, it
* must have performed a memory-fencing operation (in particular, an
* LWLock acquisition) since setting proc->databaseId. However, it's
* less clear that our backend is certain to have performed a memory
* fencing operation since the other backend set proc->databaseId. So
* for now, we test it after acquiring the LWLock just to be safe.
*/
```

I agree with the reasoning, but it seems OK to replace the LWLockAcquire with
a memory barrier for the "proc->databaseId" test, and this can save some time.
See "2-Accelerate_FastPathTransferRelationLocks_v1.patch".
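
To make the intended ordering concrete, here is a simplified sketch of the
core change (condensed from the attached patch, which is authoritative):

```
	for (i = 0; i < ProcGlobal->allProcCount; i++)
	{
		PGPROC	   *proc = &ProcGlobal->allProcs[i];

		/*
		 * proc->databaseId is written once at backend startup, and any
		 * fast-path lock recorded afterwards is taken while holding
		 * fpInfoLock, which implies a memory fence.  Our read barrier
		 * pairs with that fence: if the target backend holds any
		 * fast-path lock on this relation, we are guaranteed to see its
		 * real databaseId.
		 */
		pg_read_barrier();
		if (proc->databaseId != locktag->locktag_field1)
			continue;			/* skip without touching fpInfoLock */

		LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
		/* ... transfer any matching fast-path locks, as before ... */
		LWLockRelease(&proc->fpInfoLock);
	}
```

This turns "MaxBackends" LWLock acquire/release pairs per index into at most
one pair per backend that actually belongs to the lock's database.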

3: Optimize Toast Index Creating
====================================
While tracing the cause mentioned in point "2", I noticed an interesting
performance detail in toast index creation. In the function "create_toast_table":

```
/* ShareLock is not really needed here, but take it anyway */
toast_rel = table_open(toast_relid, ShareLock);
/* some operation */
index_create(xxxx)
```

Yep, ShareLock is not really needed here, since this is the only transaction
to which the toast relation is visible. But by design (in "relation_open"),
NoLock is only supposed to be used when the caller confirms that it already
holds an appropriate lock. So I wonder whether it is still OK to use NoLock
in a scenario where the relation was created by the current transaction.
See "3-Optimize_Toast_Index_Creating_v1.patch".
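
For reference, the gist of the attached patch's change to the assertion in
"relation_open" is roughly the following (a simplified sketch; the attached
patch is authoritative, and the new test is written here with the
SubTransactionId idiom):

```
	/*
	 * NoLock is also acceptable when the relation was created in the
	 * current (sub)transaction: no other backend can see it yet, so no
	 * other backend can hold a conflicting lock on it.
	 */
	Assert(lockmode != NoLock ||
		   IsBootstrapProcessingMode() ||
		   CheckRelationLockedByMe(r, AccessShareLock, true) ||
		   r->rd_createSubid != InvalidSubTransactionId);
```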

That's what I've got. Any response is appreciated.

Best regards,
Yang Boyu

Attachments:

1-Skip_Compatibility_Check_v1.patch
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 27924159d6..570500824d 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -590,6 +590,9 @@ check_and_dump_old_cluster(bool live_check)
 
 	get_loadable_libraries();
 
+	/* skip the below check */
+	if (user_opts.skip_check)
+		goto generate_old_dump;
 
 	/*
 	 * Check for various failure cases
@@ -660,6 +663,8 @@ check_and_dump_old_cluster(bool live_check)
 	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 905)
 		check_for_pg_role_prefix(&old_cluster);
 
+generate_old_dump:
+
 	/*
 	 * While not a check option, we do this now because this is the only time
 	 * the old server is running.
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 548ea4e623..a85b6a96fd 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -60,6 +60,7 @@ parseCommandLine(int argc, char *argv[])
 		{"copy", no_argument, NULL, 2},
 		{"copy-file-range", no_argument, NULL, 3},
 		{"sync-method", required_argument, NULL, 4},
+		{"skip-check", no_argument, NULL, 5},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -211,6 +212,9 @@ parseCommandLine(int argc, char *argv[])
 					exit(1);
 				user_opts.sync_method = pg_strdup(optarg);
 				break;
+			case 5:
+				user_opts.skip_check = true;
+				break;
 
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
@@ -289,6 +293,7 @@ usage(void)
 	printf(_("  -B, --new-bindir=BINDIR       new cluster executable directory (default\n"
 			 "                                same directory as pg_upgrade)\n"));
 	printf(_("  -c, --check                   check clusters only, don't change any data\n"));
+	printf(_("  --skip-check                  skip the checks during the actual upgrade\n"));
 	printf(_("  -d, --old-datadir=DATADIR     old cluster data directory\n"));
 	printf(_("  -D, --new-datadir=DATADIR     new cluster data directory\n"));
 	printf(_("  -j, --jobs=NUM                number of simultaneous processes or threads to use\n"));
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index af370768b6..e0071d7725 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -125,6 +125,12 @@ main(int argc, char **argv)
 
 	setup(argv[0], &live_check);
 
+	if (live_check && user_opts.skip_check)
+	{
+		pg_fatal("skip-check can't be used in check mode. Please turn off "
+				 "skip-check if you want to perform a live check.\n");
+	}
+
 	output_check_banner(live_check);
 
 	check_cluster_versions();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 8afe240bdf..adf4ab7e1c 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -327,6 +327,9 @@ typedef struct
 	int			jobs;			/* number of processes/threads to use */
 	char	   *socketdir;		/* directory to use for Unix sockets */
 	char	   *sync_method;
+	bool		skip_check; 	/* true -> skip check in actual upgrade,
+								 * assuming the various failure cases are
+								 * impossible. */
 } UserOpts;
 
 typedef struct
2-Accelerate_FastPathTransferRelationLocks_v1.patch
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a50777..8a1427cd9b 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2717,28 +2717,25 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
 		PGPROC	   *proc = &ProcGlobal->allProcs[i];
 		uint32		f;
 
-		LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
-
 		/*
 		 * If the target backend isn't referencing the same database as the
 		 * lock, then we needn't examine the individual relation IDs at all;
 		 * none of them can be relevant.
 		 *
 		 * proc->databaseId is set at backend startup time and never changes
-		 * thereafter, so it might be safe to perform this test before
-		 * acquiring &proc->fpInfoLock.  In particular, it's certainly safe to
+		 * thereafter, so it is safe to perform this test before acquiring
+		 * &proc->fpInfoLock.  In particular, it's certainly safe to
 		 * assume that if the target backend holds any fast-path locks, it
 		 * must have performed a memory-fencing operation (in particular, an
-		 * LWLock acquisition) since setting proc->databaseId.  However, it's
-		 * less clear that our backend is certain to have performed a memory
-		 * fencing operation since the other backend set proc->databaseId.  So
-		 * for now, we test it after acquiring the LWLock just to be safe.
+		 * LWLock acquisition) since setting proc->databaseId.  So once a
+		 * read barrier is performed, we can see the true proc->databaseId
+		 * if the target backend has acquired its fpInfoLock.
 		 */
+		pg_read_barrier();
 		if (proc->databaseId != locktag->locktag_field1)
-		{
-			LWLockRelease(&proc->fpInfoLock);
 			continue;
-		}
+
+		LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
 
 		for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
 		{
@@ -2964,21 +2961,19 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
 			if (proc == MyProc)
 				continue;
 
-			LWLockAcquire(&proc->fpInfoLock, LW_SHARED);
-
 			/*
 			 * If the target backend isn't referencing the same database as
 			 * the lock, then we needn't examine the individual relation IDs
 			 * at all; none of them can be relevant.
 			 *
 			 * See FastPathTransferRelationLocks() for discussion of why we do
-			 * this test after acquiring the lock.
+			 * a pg_read_barrier here.
 			 */
+			pg_read_barrier();
 			if (proc->databaseId != locktag->locktag_field1)
-			{
-				LWLockRelease(&proc->fpInfoLock);
 				continue;
-			}
+
+			LWLockAcquire(&proc->fpInfoLock, LW_SHARED);
 
 			for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
 			{
3-Optimize_Toast_Index_Creating_v1.patch
diff --git a/src/backend/access/common/relation.c b/src/backend/access/common/relation.c
index d8a313a2c9..d5b41b3887 100644
--- a/src/backend/access/common/relation.c
+++ b/src/backend/access/common/relation.c
@@ -35,7 +35,7 @@
  *		If lockmode is not "NoLock", the specified kind of lock is
  *		obtained on the relation.  (Generally, NoLock should only be
  *		used if the caller knows it has some appropriate lock on the
- *		relation already.)
+ *		relation already or the relation is only visible to the caller.)
  *
  *		An error is raised if the relation does not exist.
  *
@@ -61,12 +61,14 @@ relation_open(Oid relationId, LOCKMODE lockmode)
 		elog(ERROR, "could not open relation with OID %u", relationId);
 
 	/*
-	 * If we didn't get the lock ourselves, assert that caller holds one,
-	 * except in bootstrap mode where no locks are used.
+	 * If we didn't get the lock ourselves, assert that the caller holds one
+	 * or that this relation was created in the current transaction, except
+	 * in bootstrap mode where no locks are used.
 	 */
 	Assert(lockmode != NoLock ||
 		   IsBootstrapProcessingMode() ||
-		   CheckRelationLockedByMe(r, AccessShareLock, true));
+		   CheckRelationLockedByMe(r, AccessShareLock, true) ||
+		   OidIsValid(r->rd_createSubid));
 
 	/* Make note that we've accessed a temporary relation */
 	if (RelationUsesLocalBuffers(r))
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 738bc46ae8..37f73e98bb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -269,8 +269,11 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	/* make the toast relation visible, else table_open will fail */
 	CommandCounterIncrement();
 
-	/* ShareLock is not really needed here, but take it anyway */
-	toast_rel = table_open(toast_relid, ShareLock);
+	/*
+	 * ShareLock is not really needed here, since this is the only transaction
+	 * the relation is visible to.
+	 */
+	toast_rel = table_open(toast_relid, NoLock);
 
 	/*
 	 * Create unique index on chunk_id, chunk_seq.
#2 Daniel Gustafsson
daniel@yesql.se
In reply to: 杨伯宇(长堂) (#1)
Re: speed up pg_upgrade with large number of tables

On 5 Jul 2024, at 09:12, 杨伯宇(长堂) <yangboyu.yby@alibaba-inc.com> wrote:

1: Skip Compatibility Check In "pg_upgrade"
=============================================
Concisely, we've got several databases, each with a million-plus tables.
Running the compatibility check before pg_dump can eat up like half an hour.
If I have performed an online check before the actual upgrade, repeating it
seems unnecessary and just adds to the downtime in many situations.

So, I'm thinking, why not add a "--skip-check" option in pg_upgrade to skip it?
See "1-Skip_Compatibility_Check_v1.patch".

How would a user know that nothing has changed in the cluster between running
the check and running the upgrade with a skipped check? Considering how
complicated it is to understand exactly what pg_upgrade does, it seems like
quite a large-caliber footgun.

I would be much more interested in making the check phase go faster, and indeed
there is ongoing work in this area. Since it sounds like you have a dev and
test environment with a big workload, testing those patches would be helpful.
https://commitfest.postgresql.org/48/4995/ is one that comes to mind.

--
Daniel Gustafsson

#3 杨伯宇(长堂)
yangboyu.yby@alibaba-inc.com
In reply to: Daniel Gustafsson (#2)
Re: speed up pg_upgrade with large number of tables

So, I'm thinking, why not add a "--skip-check" option in pg_upgrade to skip it?
See "1-Skip_Compatibility_Check_v1.patch".

How would a user know that nothing has changed in the cluster between running
the check and running the upgrade with a skipped check? Considering how
complicated it is to understand exactly what pg_upgrade does, it seems like
quite a large-caliber footgun.

Indeed, it's not feasible to execute a concise check ensuring that nothing
has changed. However, in many cases, a cluster consistently executes the
same SQL commands. Thus, if we've verified that the cluster was compatible
30 minutes prior, there's a strong likelihood that it remains compatible now.
Therefore, adding such a 'trust-me' option may still be beneficial.

I would be much more interested in making the check phase go faster, and indeed
there is ongoing work in this area. Since it sounds like you have a dev and
test environment with a big workload, testing those patches would be helpful.
https://commitfest.postgresql.org/48/4995/ is one that comes to mind.

Very meaningful work! I will try it.

--
Best regards,
Yang Boyu

#4 Nathan Bossart
nathandbossart@gmail.com
In reply to: 杨伯宇(长堂) (#3)
Re: speed up pg_upgrade with large number of tables

On Fri, Jul 05, 2024 at 05:24:42PM +0800, 杨伯宇(长堂) wrote:

So, I'm thinking, why not add a "--skip-check" option in pg_upgrade to skip it?
See "1-Skip_Compatibility_Check_v1.patch".

How would a user know that nothing has changed in the cluster between running
the check and running the upgrade with a skipped check? Considering how
complicated it is to understand exactly what pg_upgrade does, it seems like
quite a large-caliber footgun.

I am also -1 on this one for the same reasons as Daniel.

I would be much more interested in making the check phase go faster, and indeed
there is ongoing work in this area. Since it sounds like you have a dev and
test environment with a big workload, testing those patches would be helpful.
https://commitfest.postgresql.org/48/4995/ is one that comes to mind.

Very meaningful work! I will try it.

Thanks! Since you mentioned that you have multiple databases with 1M+
tables, you might also be interested in commit 2329cad. That should
speed up the pg_dump step quite a bit.

--
nathan

#5 杨伯宇(长堂)
yangboyu.yby@alibaba-inc.com
In reply to: Nathan Bossart (#4)
Re: speed up pg_upgrade with large number of tables

Thanks! Since you mentioned that you have multiple databases with 1M+
tables, you might also be interested in commit 2329cad. That should
speed up the pg_dump step quite a bit.

Wow, I noticed this commit (2329cad) when it appeared in the commitfest. It
has doubled the speed of pg_dump in this scenario. Thank you for your effort!

Besides, https://commitfest.postgresql.org/48/4995/ seems insufficient for
this situation. Some time-consuming functions like check_for_data_types_usage
are not yet able to run in parallel. But these patches could be a great
starting point for a more efficient parallelism implementation. Maybe we can
do it later.

#6 Nathan Bossart
nathandbossart@gmail.com
In reply to: 杨伯宇(长堂) (#5)
Re: speed up pg_upgrade with large number of tables

On Mon, Jul 08, 2024 at 03:22:36PM +0800, 杨伯宇(长堂) wrote:

Besides, https://commitfest.postgresql.org/48/4995/ seems insufficient for
this situation. Some time-consuming functions like check_for_data_types_usage
are not yet able to run in parallel. But these patches could be a great
starting point for a more efficient parallelism implementation. Maybe we can
do it later.

I actually just wrote up the first version of the patch for parallelizing
the data type checks over the weekend. I'll post it shortly.

--
nathan

#7 Hari Krishna Sunder
hari.db.pg@gmail.com
In reply to: Nathan Bossart (#6)
Re: speed up pg_upgrade with large number of tables

The restore-side speedups suggested by Yang seem reasonable and can
potentially speed up the process. We can even go a bit further by starting
the new postgres in --binary-upgrade mode and skipping some of these locks
completely.
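
A hypothetical sketch of that idea (not part of any posted patch;
"IsBinaryUpgrade" is the existing backend flag that is set when postgres is
started by pg_upgrade):

```
	/*
	 * Hypothetical: pg_upgrade guarantees that there are no concurrent
	 * connections, so call sites such as create_toast_table() could
	 * relax their lock level when running in binary-upgrade mode.
	 */
	toast_rel = table_open(toast_relid, IsBinaryUpgrade ? NoLock : ShareLock);
```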

On Sun, Jan 19, 2025 at 3:43 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Jul 08, 2024 at 03:22:36PM +0800, 杨伯宇(长堂) wrote:

Besides, https://commitfest.postgresql.org/48/4995/ seems insufficient for
this situation. Some time-consuming functions like check_for_data_types_usage
are not yet able to run in parallel. But these patches could be a great
starting point for a more efficient parallelism implementation. Maybe we can
do it later.

I actually just wrote up the first version of the patch for parallelizing
the data type checks over the weekend. I'll post it shortly.

--
nathan

#8 Nathan Bossart
nathandbossart@gmail.com
In reply to: Hari Krishna Sunder (#7)
Re: speed up pg_upgrade with large number of tables

On Sun, Jan 19, 2025 at 03:57:18PM -0800, Hari Krishna Sunder wrote:

The restore-side speedups suggested by Yang seem reasonable and can
potentially speed up the process. We can even go a bit further by starting
the new postgres in --binary-upgrade mode and skipping some of these locks
completely.

I am curious about what kind of speedup we can expect with these patches.
IMHO some benchmarking results would be useful.

--
nathan