use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
I noticed that the "Restoring database schemas in the new cluster" part of
pg_upgrade can take a while if you have many databases, so I experimented
with a couple different settings to see if there are any easy ways to speed
it up. The FILE_COPY strategy for CREATE DATABASE helped quite
significantly on my laptop. For ~3k empty databases, this step went from
~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41
made a similar change for initdb, so there might even be an argument for
back-patching this to v15 (where STRATEGY was introduced). One thing I
still need to verify is that this doesn't harm anything when there are lots
of objects in the databases, i.e., more WAL generated during many
concurrent CREATE-DATABASE-induced checkpoints.
Thoughts?
--
nathan
Attachments:
use_file_copy_for_pg_upgrade.patch
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e324070828..bc9fef4198 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3126,7 +3126,9 @@ dumpDatabase(Archive *fout)
*/
if (dopt->binary_upgrade)
{
- appendPQExpBuffer(creaQry, "CREATE DATABASE %s WITH TEMPLATE = template0 OID = %u",
+ appendPQExpBuffer(creaQry,
+ "CREATE DATABASE %s WITH TEMPLATE = template0 "
+ "OID = %u STRATEGY = FILE_COPY",
qdatname, dbCatId.oid);
}
else
On Tue, Jun 4, 2024 at 16:39, Nathan Bossart <nathandbossart@gmail.com> wrote:
I noticed that the "Restoring database schemas in the new cluster" part of
pg_upgrade can take a while if you have many databases, so I experimented
with a couple different settings to see if there are any easy ways to speed
it up. The FILE_COPY strategy for CREATE DATABASE helped quite
significantly on my laptop. For ~3k empty databases, this step went from
~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41
made a similar change for initdb, so there might even be an argument for
back-patching this to v15 (where STRATEGY was introduced). One thing I
still need to verify is that this doesn't harm anything when there are lots
of objects in the databases, i.e., more WAL generated during many
concurrent CREATE-DATABASE-induced checkpoints.
Thoughts?
Why not use it too, if not binary_upgrade?
else
{
    appendPQExpBuffer(creaQry,
                      "CREATE DATABASE %s WITH TEMPLATE = template0 STRATEGY = FILE_COPY",
                      qdatname);
}
It seems to me that it would be an improvement in any case.
best regards,
Ranier Vilela
On Wed, Jun 05, 2024 at 01:47:09PM -0300, Ranier Vilela wrote:
Why not use it too, if not binary_upgrade?
else
{
    appendPQExpBuffer(creaQry,
                      "CREATE DATABASE %s WITH TEMPLATE = template0 STRATEGY = FILE_COPY",
                      qdatname);
}
It seems to me that it would be an improvement in any case.
Well, if that is true, and I'm not sure it is, then we should probably
consider changing the default STRATEGY in the server instead. I haven't
looked too deeply, but my assumption is that when fsync is disabled (as it
is when restoring schemas during pg_upgrade), both checkpointing and
copying the template database are fast enough to beat writing
out all the data to WAL. Furthermore, in my test, all the databases were
basically empty, so I suspect that some CREATE DATABASE commands could
piggy-back on checkpoints initiated by others. This might not be the case
when there are many objects in each database, and that is a scenario I have
yet to test.
--
nathan
On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Tue, Jun 4, 2024 at 16:39, Nathan Bossart <nathandbossart@gmail.com> wrote:
I noticed that the "Restoring database schemas in the new cluster" part of
pg_upgrade can take a while if you have many databases, so I experimented
with a couple different settings to see if there are any easy ways to speed
it up. The FILE_COPY strategy for CREATE DATABASE helped quite
significantly on my laptop. For ~3k empty databases, this step went from
~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41
made a similar change for initdb, so there might even be an argument for
back-patching this to v15 (where STRATEGY was introduced). One thing I
still need to verify is that this doesn't harm anything when there are lots
of objects in the databases, i.e., more WAL generated during many
concurrent CREATE-DATABASE-induced checkpoints.
Thoughts?
Why not use it too, if not binary_upgrade?
Because in the normal case (not during binary_upgrade) you don't want
to have to generate 2 checkpoints for every created database,
especially not when your shared buffers are large. Checkpoints' costs
scale approximately linearly with the size of shared buffers, so being
able to skip those checkpoints (with strategy=WAL_LOG) will save a lot
of performance in the systems where this performance impact matters
most.
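For reference, the two checkpoints in question are the forced
RequestCheckpoint() calls in CreateDatabaseUsingFileCopy()
(src/backend/commands/dbcommands.c); abridged, not the full function:

    /* before the copy: push all dirty buffers of the template out to disk */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                      CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);

    /* ... copydir() the template database's files into the new database ... */

    /* after the copy: make sure the XLOG_DBASE_CREATE_FILE_COPY record never
     * needs to be replayed, since replay would re-copy the template as it is then */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);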
I noticed that the "Restoring database schemas in the new cluster" part of
pg_upgrade can take a while if you have many databases, so I experimented
with a couple different settings to see if there are any easy ways to speed
it up. The FILE_COPY strategy for CREATE DATABASE helped quite
significantly on my laptop. For ~3k empty databases, this step went from
~100 seconds to ~30 seconds with the attached patch.
As for "on my laptop", that sounds very reasonable, but could you
check the performance on systems with larger shared buffer
configurations? I'd imagine (but haven't checked) that binary upgrade
target systems may already be using the shared_buffers from their
source system, which would cause a severe regression when the
to-be-upgraded system has large shared buffers. For initdb the
database size is known in advance and shared_buffers is approximately
empty, but the same is not (always) true when we're doing binary
upgrades.
Kind regards,
Matthias van de Meent
On Wed, Jun 05, 2024 at 07:28:42PM +0200, Matthias van de Meent wrote:
As for "on my laptop", that sounds very reasonable, but could you
check the performance on systems with larger shared buffer
configurations? I'd imagine (but haven't checked) that binary upgrade
target systems may already be using the shared_buffers from their
source system, which would cause a severe regression when the
to-be-upgraded system has large shared buffers. For initdb the
database size is known in advance and shared_buffers is approximately
empty, but the same is not (always) true when we're doing binary
upgrades.
Will do. FWIW I haven't had much luck improving pg_upgrade times by
adjusting server settings, but I also haven't explored it all that much.
--
nathan
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Tue, Jun 4, 2024 at 16:39, Nathan Bossart <nathandbossart@gmail.com> wrote:
I noticed that the "Restoring database schemas in the new cluster" part of
pg_upgrade can take a while if you have many databases, so I experimented
with a couple different settings to see if there are any easy ways to speed
it up. The FILE_COPY strategy for CREATE DATABASE helped quite
significantly on my laptop. For ~3k empty databases, this step went from
~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41
made a similar change for initdb, so there might even be an argument for
back-patching this to v15 (where STRATEGY was introduced). One thing I
still need to verify is that this doesn't harm anything when there are lots
of objects in the databases, i.e., more WAL generated during many
concurrent CREATE-DATABASE-induced checkpoints.
Thoughts?
Why not use it too, if not binary_upgrade?
Because in the normal case (not during binary_upgrade) you don't want
to have to generate 2 checkpoints for every created database,
especially not when your shared buffers are large. Checkpoints' costs
scale approximately linearly with the size of shared buffers, so being
able to skip those checkpoints (with strategy=WAL_LOG) will save a lot
of performance in the systems where this performance impact matters
most.
I agree with you that we introduced the WAL_LOG strategy to avoid
these force checkpoints. However, in binary upgrade cases where no
operations are happening in the system, the FILE_COPY strategy should
be faster.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
Why not use it too, if not binary_upgrade?
Because in the normal case (not during binary_upgrade) you don't want
to have to generate 2 checkpoints for every created database,
especially not when your shared buffers are large. Checkpoints' costs
scale approximately linearly with the size of shared buffers, so being
able to skip those checkpoints (with strategy=WAL_LOG) will save a lot
of performance in the systems where this performance impact matters
most.
I agree with you that we introduced the WAL_LOG strategy to avoid
these force checkpoints. However, in binary upgrade cases where no
operations are happening in the system, the FILE_COPY strategy should
be faster.
While you would be correct if there were no operations happening in
the system, during binary upgrade we're still actively modifying
catalogs; and this is done with potentially many concurrent jobs. I
think it's not unlikely that this would impact performance.
Now that I think about it, arguably, we shouldn't need to run
checkpoints during binary upgrade for the FILE_COPY strategy after
we've restored the template1 database and created a checkpoint after
that: All other databases use template1 as their template database,
and the checkpoint is there mostly to guarantee the FS knows about all
changes in the template database before we task it with copying the
template database over to our new database, so the protections we get
from more checkpoints are practically useless.
If such a change were implemented (i.e. no checkpoints for FILE_COPY
in binary upgrade, with a single manual checkpoint after restoring
template1 in create_new_objects) I think most of my concerns with this
patch would be alleviated.
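A minimal sketch of that idea, assuming the change is simply a guard around
the existing RequestCheckpoint() calls in CreateDatabaseUsingFileCopy()
(which is roughly what the patch attached later in this thread ends up doing):

    /* skip the forced checkpoints while in binary upgrade (pg_upgrade) mode */
    if (!IsBinaryUpgrade)
        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                          CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);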
Kind regards,
Matthias van de Meent
On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
Why not use it too, if not binary_upgrade?
Because in the normal case (not during binary_upgrade) you don't want
to have to generate 2 checkpoints for every created database,
especially not when your shared buffers are large. Checkpoints' costs
scale approximately linearly with the size of shared buffers, so being
able to skip those checkpoints (with strategy=WAL_LOG) will save a lot
of performance in the systems where this performance impact matters
most.
I agree with you that we introduced the WAL_LOG strategy to avoid
these force checkpoints. However, in binary upgrade cases where no
operations are happening in the system, the FILE_COPY strategy should
be faster.
While you would be correct if there were no operations happening in
the system, during binary upgrade we're still actively modifying
catalogs; and this is done with potentially many concurrent jobs. I
think it's not unlikely that this would impact performance.
Maybe, but generally, long checkpoints are problematic because they
involve a lot of I/O, which hampers overall system performance.
However, in the case of a binary upgrade, the concurrent operations
are only performing a schema restore, not a real data restore.
Therefore, it shouldn't have a significant impact, and the checkpoints
should also not do a lot of I/O during binary upgrade, right?
Now that I think about it, arguably, we shouldn't need to run
checkpoints during binary upgrade for the FILE_COPY strategy after
we've restored the template1 database and created a checkpoint after
that: All other databases use template1 as their template database,
and the checkpoint is there mostly to guarantee the FS knows about all
changes in the template database before we task it with copying the
template database over to our new database, so the protections we get
from more checkpoints are practically useless.
If such a change were implemented (i.e. no checkpoints for FILE_COPY
in binary upgrade, with a single manual checkpoint after restoring
template1 in create_new_objects) I think most of my concerns with this
patch would be alleviated.
Yeah, I think that's a valid point. The second checkpoint is to ensure
that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for
binary upgrades, we don't need that guarantee because a checkpoint
will be performed during shutdown at the end of the upgrade anyway.
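To spell out the hazard that second checkpoint normally guards against: on
replay, the redo routine re-copies the source database directory as it exists
at replay time, not as it was when the record was written. A heavily abridged
sketch of the FILE_COPY branch of dbase_redo() in dbcommands.c (error
handling and directory setup elided; names from memory):

    xl_dbase_create_file_copy_rec *xlrec =
        (xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
    char *src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
    char *dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);

    /* any changes made to the template after the record was logged leak in */
    copydir(src_path, dst_path, false);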
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, 7 Jun 2024 at 10:28, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I agree with you that we introduced the WAL_LOG strategy to avoid
these force checkpoints. However, in binary upgrade cases where no
operations are happening in the system, the FILE_COPY strategy should
be faster.
While you would be correct if there were no operations happening in
the system, during binary upgrade we're still actively modifying
catalogs; and this is done with potentially many concurrent jobs. I
think it's not unlikely that this would impact performance.
Maybe, but generally, long checkpoints are problematic because they
involve a lot of I/O, which hampers overall system performance.
However, in the case of a binary upgrade, the concurrent operations
are only performing a schema restore, not a real data restore.
Therefore, it shouldn't have a significant impact, and the checkpoints
should also not do a lot of I/O during binary upgrade, right?
My primary concern isn't the IO, but the O(shared_buffers) that we
have to go through during a checkpoint. As I mentioned upthread, it is
reasonably possible the new cluster is already setup with a good
fraction of the old system's shared_buffers configured. Every
checkpoint has to scan all those buffers, which IMV can get (much)
more expensive than the IO overhead caused by the WAL_LOG strategy. It
may be a baseless fear as I haven't done the performance benchmarks
for this, but I wouldn't be surprised if shared_buffers=8GB would
measurably impact the upgrade performance in the current patch (vs the
default 128MB).
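For anyone following along: the O(shared_buffers) cost comes from the first
phase of BufferSync() in src/backend/storage/buffer/bufmgr.c, which visits
every buffer header regardless of how many of them are actually dirty. A
heavily simplified sketch (header locking and the later sort/write/fsync
phases are omitted):

    /* phase 1: walk all of shared_buffers to flag what this checkpoint must write */
    for (buf_id = 0; buf_id < NBuffers; buf_id++)   /* NBuffers ~ shared_buffers */
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buf_id);

        if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY)
            num_to_scan++;      /* marked BM_CHECKPOINT_NEEDED in the real code */
    }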
I'll note that the documentation for upgrading with pg_upgrade has the
step for updating postgresql.conf / postgresql.auto.conf only after
pg_upgrade has run already, but that may not be how it's actually
used: after all, we don't have full control in this process, the user
is the one who provides the new cluster with initdb.
If such a change were implemented (i.e. no checkpoints for FILE_COPY
in binary upgrade, with a single manual checkpoint after restoring
template1 in create_new_objects) I think most of my concerns with this
patch would be alleviated.
Yeah, I think that's a valid point. The second checkpoint is to ensure
that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for
binary upgrades, we don't need that guarantee because a checkpoint
will be performed during shutdown at the end of the upgrade anyway.
Indeed.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Fri, Jun 7, 2024 at 2:40 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Fri, 7 Jun 2024 at 10:28, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I agree with you that we introduced the WAL_LOG strategy to avoid
these force checkpoints. However, in binary upgrade cases where no
operations are happening in the system, the FILE_COPY strategy should
be faster.
While you would be correct if there were no operations happening in
the system, during binary upgrade we're still actively modifying
catalogs; and this is done with potentially many concurrent jobs. I
think it's not unlikely that this would impact performance.
Maybe, but generally, long checkpoints are problematic because they
involve a lot of I/O, which hampers overall system performance.
However, in the case of a binary upgrade, the concurrent operations
are only performing a schema restore, not a real data restore.
Therefore, it shouldn't have a significant impact, and the checkpoints
should also not do a lot of I/O during binary upgrade, right?
My primary concern isn't the IO, but the O(shared_buffers) that we
have to go through during a checkpoint. As I mentioned upthread, it is
reasonably possible the new cluster is already setup with a good
fraction of the old system's shared_buffers configured. Every
checkpoint has to scan all those buffers, which IMV can get (much)
more expensive than the IO overhead caused by the WAL_LOG strategy. It
may be a baseless fear as I haven't done the performance benchmarks
for this, but I wouldn't be surprised if shared_buffers=8GB would
measurably impact the upgrade performance in the current patch (vs the
default 128MB).
Okay, that's a valid point.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jun 07, 2024 at 11:10:25AM +0200, Matthias van de Meent wrote:
My primary concern isn't the IO, but the O(shared_buffers) that we
have to go through during a checkpoint. As I mentioned upthread, it is
reasonably possible the new cluster is already setup with a good
fraction of the old system's shared_buffers configured. Every
checkpoint has to scan all those buffers, which IMV can get (much)
more expensive than the IO overhead caused by the WAL_LOG strategy. It
may be a baseless fear as I haven't done the performance benchmarks
for this, but I wouldn't be surprised if shared_buffers=8GB would
measurably impact the upgrade performance in the current patch (vs the
default 128MB).
I did a handful of benchmarks on an r5.24xlarge that seem to prove your
point. The following are the durations of the pg_restore step of
pg_upgrade:
* 10k empty databases, 128MB shared_buffers
WAL_LOG: 1m 01s
FILE_COPY: 0m 22s
* 10k empty databases, 100GB shared_buffers
WAL_LOG: 2m 03s
FILE_COPY: 5m 08s
* 2.5k databases with 10k tables each, 128MB shared_buffers
WAL_LOG: 17m 20s
FILE_COPY: 16m 44s
* 2.5k databases with 10k tables each, 100GB shared_buffers
WAL_LOG: 16m 39s
FILE_COPY: 15m 21s
I was surprised with the last result, but there's enough other stuff
happening during such a test that I hesitate to conclude much.
I'll note that the documentation for upgrading with pg_upgrade has the
step for updating postgresql.conf / postgresql.auto.conf only after
pg_upgrade has run already, but that may not be how it's actually
used: after all, we don't have full control in this process, the user
is the one who provides the new cluster with initdb.
Good point. I think it's clear that FILE_COPY is not necessarily a win in
all cases for pg_upgrade.
If such a change were implemented (i.e. no checkpoints for FILE_COPY
in binary upgrade, with a single manual checkpoint after restoring
template1 in create_new_objects) I think most of my concerns with this
patch would be alleviated.
Yeah, I think that's a valid point. The second checkpoint is to ensure
that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for
binary upgrades, we don't need that guarantee because a checkpoint
will be performed during shutdown at the end of the upgrade anyway.
Indeed.
It looks like pg_dump always uses template0, so AFAICT we don't even need
the suggested manual checkpoint after restoring template1.
I do like the idea of skipping a bunch of unnecessary operations in binary
upgrade mode, since it'll help me in my goal of speeding up pg_upgrade.
But I'm a bit hesitant to get too fancy here and to introduce a bunch of
new "if (IsBinaryUpgrade)" checks if the gains in the field won't be all
that exciting. However, we've already sprinkled such checks quite
liberally, so maybe I'm being too cautious...
--
nathan
On Tue, 11 Jun 2024 at 04:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Fri, Jun 07, 2024 at 11:10:25AM +0200, Matthias van de Meent wrote:
My primary concern isn't the IO, but the O(shared_buffers) that we
have to go through during a checkpoint. As I mentioned upthread, it is
reasonably possible the new cluster is already setup with a good
fraction of the old system's shared_buffers configured. Every
checkpoint has to scan all those buffers, which IMV can get (much)
more expensive than the IO overhead caused by the WAL_LOG strategy. It
may be a baseless fear as I haven't done the performance benchmarks
for this, but I wouldn't be surprised if shared_buffers=8GB would
measurably impact the upgrade performance in the current patch (vs the
default 128MB).
I did a handful of benchmarks on an r5.24xlarge that seem to prove your
point. The following are the durations of the pg_restore step of
pg_upgrade:
* 10k empty databases, 128MB shared_buffers
WAL_LOG: 1m 01s
FILE_COPY: 0m 22s
* 10k empty databases, 100GB shared_buffers
WAL_LOG: 2m 03s
FILE_COPY: 5m 08s
* 2.5k databases with 10k tables each, 128MB shared_buffers
WAL_LOG: 17m 20s
FILE_COPY: 16m 44s
* 2.5k databases with 10k tables each, 100GB shared_buffers
WAL_LOG: 16m 39s
FILE_COPY: 15m 21s
I was surprised with the last result, but there's enough other stuff
happening during such a test that I hesitate to conclude much.
If you still have the test data set up, could you test the attached
patch (which does skip the checkpoints in FILE_COPY mode during binary
upgrades)?
If such a change were implemented (i.e. no checkpoints for FILE_COPY
in binary upgrade, with a single manual checkpoint after restoring
template1 in create_new_objects) I think most of my concerns with this
patch would be alleviated.
Yeah, I think that's a valid point. The second checkpoint is to ensure
that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for
binary upgrades, we don't need that guarantee because a checkpoint
will be performed during shutdown at the end of the upgrade anyway.
Indeed.
It looks like pg_dump always uses template0, so AFAICT we don't even need
the suggested manual checkpoint after restoring template1.
Thanks for reminding me. It seems I misunderstood the reason why we
first process template1 in create_new_objects, as I didn't read the
comments thoroughly enough.
I do like the idea of skipping a bunch of unnecessary operations in binary
upgrade mode, since it'll help me in my goal of speeding up pg_upgrade.
But I'm a bit hesitant to get too fancy here and to introduce a bunch of
new "if (IsBinaryUpgrade)" checks if the gains in the field won't be all
that exciting. However, we've already sprinkled such checks quite
liberally, so maybe I'm being too cautious...
Hmm, yes. From an IO perspective I think this could be an improvement,
but let's check the numbers first.
Kind regards,
Matthias van de Meent
Attachments:
0001-pg_upgrade-use-CREATE-DATABASE-.-STRATEGY-FILE_COPY.patch
From 3cce74e0a8124e2c7c3745e961b99e4700677692 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 11 Jun 2024 10:24:00 +0200
Subject: [PATCH] pg_upgrade: use CREATE DATABASE ... STRATEGY=FILE_COPY
While the strategy is considered very costly due to expensive checkpoints
when the target system has configured large shared_buffers, the issues that
those checkpoints protect against don't apply in a pg_upgrade system.
By skipping the checkpoints during CREATE DATABASE ... STRATEGY=FILE_COPY
and using that strategy for pg_upgrade, we can save a lot of time and IO
that would otherwise be spent on reading pages from disk, generating WAL
records, writing those WAL records to disk, and then writing the pages to
disk - which has the potential of at least double the IO of FILE_COPY.
Author: Nathan Bossart
Author: Matthias van de Meent
---
src/backend/commands/dbcommands.c | 20 +++++++++++++++++---
src/bin/pg_dump/pg_dump.c | 4 +++-
2 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index be629ea92c..04ad6f0b34 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -563,9 +563,16 @@ CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
* happened while we're copying files, a file might be deleted just when
* we're about to copy it, causing the lstat() call in copydir() to fail
* with ENOENT.
+ *
+ * While we're in binary upgrade mode we have full control over which
+ * databases we access. As the template database is always going to be
+ * template0, which doesn't allow connections, we can be sure that that
+ * database won't have any modified pages in shared memory, thus we can
+ * omit this checkpoint.
*/
- RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
- CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+ if (!IsBinaryUpgrade)
+ RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+ CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
/*
* Iterate through all tablespaces of the template database, and copy each
@@ -657,10 +664,17 @@ CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
* seem to be much we can do about that except document it as a
* limitation.
*
+ * In binary upgrade mode, neither of these are a concern: generated WAL
+ * is dropped, and the template0 database (which is the only source
+ * database used in binary upgrade) can't have concurrent modifications,
+ * so we skip this checkpoint.
+ *
* See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
* strategy that avoids these problems.
*/
- RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+ if (!IsBinaryUpgrade)
+ RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+ CHECKPOINT_WAIT);
}
/*
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index cb14fcafea..aec702675c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3126,7 +3126,9 @@ dumpDatabase(Archive *fout)
*/
if (dopt->binary_upgrade)
{
- appendPQExpBuffer(creaQry, "CREATE DATABASE %s WITH TEMPLATE = template0 OID = %u",
+ appendPQExpBuffer(creaQry,
+ "CREATE DATABASE %s WITH TEMPLATE = template0 OID = %u "
+ "STRATEGY = FILE_COPY",
qdatname, dbCatId.oid);
}
else
--
2.40.1
On Tue, Jun 11, 2024 at 10:39:51AM +0200, Matthias van de Meent wrote:
On Tue, 11 Jun 2024 at 04:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
I did a handful of benchmarks on an r5.24xlarge that seem to prove your
point. The following are the durations of the pg_restore step of
pg_upgrade:
* 10k empty databases, 128MB shared_buffers
WAL_LOG: 1m 01s
FILE_COPY: 0m 22s
* 10k empty databases, 100GB shared_buffers
WAL_LOG: 2m 03s
FILE_COPY: 5m 08s
* 2.5k databases with 10k tables each, 128MB shared_buffers
WAL_LOG: 17m 20s
FILE_COPY: 16m 44s
* 2.5k databases with 10k tables each, 100GB shared_buffers
WAL_LOG: 16m 39s
FILE_COPY: 15m 21s
I was surprised with the last result, but there's enough other stuff
happening during such a test that I hesitate to conclude much.
If you still have the test data set up, could you test the attached
patch (which does skip the checkpoints in FILE_COPY mode during binary
upgrades)?
With your patch, I see the following:
* 10k empty databases, 128MB shared_buffers: 0m 27s
* 10k empty databases, 100GB shared_buffers: 1m 44s
I believe the reason the large buffer cache test is still quite a bit
slower is due to the truncation of pg_largeobject (specifically its call to
DropRelationsAllBuffers()). This TRUNCATE command was added in commit
bbe08b8.
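For context on why that is also O(shared_buffers): as I understand it,
DropRelationsAllBuffers() only takes the cheap per-block lookup path when the
relations' total size is known and small enough; otherwise it falls back to
scanning every buffer. A rough sketch of the structure in bufmgr.c (names and
bookkeeping abridged from memory):

    if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
    {
        /* probe the buffer mapping table for each individual block */
    }
    else
    {
        /* full scan: visit every one of NBuffers buffer headers */
        for (i = 0; i < NBuffers; i++)
            /* ... invalidate the buffer if it belongs to one of the relations ... */ ;
    }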
--
nathan
On Tue, Jun 11, 2024 at 10:39:51AM +0200, Matthias van de Meent wrote:
On Tue, 11 Jun 2024 at 04:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
It looks like pg_dump always uses template0, so AFAICT we don't even need
the suggested manual checkpoint after restoring template1.
Thanks for reminding me. It seems I misunderstood the reason why we
first process template1 in create_new_objects, as I didn't read the
comments thoroughly enough.
Actually, I think you are right that we need a manual checkpoint, except I
think we need it to be after prepare_new_globals(). set_frozenxids()
connects to each database (including template0) and updates a bunch of
pg_class rows, and we probably want those on disk before we start copying
the files to create all the user's databases.
--
nathan
On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
Actually, I think you are right that we need a manual checkpoint, except I
think we need it to be after prepare_new_globals(). set_frozenxids()
connects to each database (including template0) and updates a bunch of
pg_class rows, and we probably want those on disk before we start copying
the files to create all the user's databases.
Here is an updated patch.
--
nathan
Attachments:
v3-0001-Use-CREATE-DATABASE-.-STRATEGY-FILE_COPY-in-pg_up.patch
From c3aeed33b351475896c2b4002cff125f1edeeb42 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 14 Jun 2024 16:25:04 -0500
Subject: [PATCH v3 1/1] Use CREATE DATABASE ... STRATEGY = FILE_COPY in
pg_upgrade.
While the strategy is considered very costly due to expensive checkpoints
when the target system has configured large shared_buffers, the issues that
those checkpoints protect against don't apply in a pg_upgrade system.
By skipping the checkpoints during CREATE DATABASE ... STRATEGY=FILE_COPY
and using that strategy for pg_upgrade, we can save a lot of time and IO
that would otherwise be spent on reading pages from disk, generating WAL
records, writing those WAL records to disk, and then writing the pages to
disk - which has the potential of at least double the IO of FILE_COPY.
Co-authored-by: Matthias van de Meent
Reviewed-by: Ranier Vilela, Dilip Kumar
Discussion: https://postgr.es/m/Zl9ta3FtgdjizkJ5%40nathan
---
src/backend/commands/dbcommands.c | 18 +++++++++++++++---
src/bin/pg_dump/pg_dump.c | 9 ++++++++-
src/bin/pg_upgrade/pg_upgrade.c | 12 ++++++++++++
3 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index be629ea92c..99d7a9ba3c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -563,9 +563,14 @@ CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
* happened while we're copying files, a file might be deleted just when
* we're about to copy it, causing the lstat() call in copydir() to fail
* with ENOENT.
+ *
+ * In binary upgrade mode, we can skip this checkpoint because we are
+ * careful to ensure that template0 is fully written to disk prior to any
+ * CREATE DATABASE commands.
*/
- RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
- CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+ if (!IsBinaryUpgrade)
+ RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+ CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
/*
* Iterate through all tablespaces of the template database, and copy each
@@ -657,10 +662,17 @@ CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
* seem to be much we can do about that except document it as a
* limitation.
*
+ * In binary upgrade mode, we can skip this checkpoint because neither of
+ * these problems applies: we don't ever replay the WAL generated during
+ * pg_upgrade, and we don't concurrently modify template0 (not to mention
+ * that trying to take a backup during pg_upgrade is pointless).
+ *
* See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
* strategy that avoids these problems.
*/
- RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+ if (!IsBinaryUpgrade)
+ RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+ CHECKPOINT_WAIT);
}
/*
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e324070828..1e51a63442 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3123,10 +3123,17 @@ dumpDatabase(Archive *fout)
* since those can't be altered later. Other DB properties are left to
* the DATABASE PROPERTIES entry, so that they can be applied after
* reconnecting to the target DB.
+ *
+ * For binary upgrade, we use the FILE_COPY strategy because testing has
+ * shown it to be faster. When the server is in binary upgrade mode, it
+ * will also skip the checkpoints normally triggered by CREATE DATABASE
+ * STRATEGY = FILE_COPY.
*/
if (dopt->binary_upgrade)
{
- appendPQExpBuffer(creaQry, "CREATE DATABASE %s WITH TEMPLATE = template0 OID = %u",
+ appendPQExpBuffer(creaQry,
+ "CREATE DATABASE %s WITH TEMPLATE = template0 "
+ "OID = %u STRATEGY = FILE_COPY",
qdatname, dbCatId.oid);
}
else
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index af370768b6..45f095a230 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -534,9 +534,21 @@ static void
create_new_objects(void)
{
int dbnum;
+ PGconn *conn_new_template1;
prep_status_progress("Restoring database schemas in the new cluster");
+ /*
+ * Ensure that any changes to template0 are fully written out to disk
+ * prior to restoring the databases. This is necessary because we use the
+ * FILE_COPY strategy to create the databases (which testing has shown to
+ * be faster), and we skip the checkpoints that this strategy ordinarily
+ * incurs.
+ */
+ conn_new_template1 = connectToServer(&new_cluster, "template1");
+ PQclear(executeQueryOrDie(conn_new_template1, "CHECKPOINT"));
+ PQfinish(conn_new_template1);
+
/*
* We cannot process the template1 database concurrently with others,
* because when it's transiently dropped, connection attempts would fail.
--
2.39.3 (Apple Git-146)
On Fri, 14 Jun 2024 at 23:29, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
Actually, I think you are right that we need a manual checkpoint, except I
think we need it to be after prepare_new_globals(). set_frozenxids()
connects to each database (including template0) and updates a bunch of
pg_class rows, and we probably want those on disk before we start copying
the files to create all the user's databases.
Good catch, I hadn't thought of that.
Here is an updated patch.
LGTM.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Fri, Jun 14, 2024 at 5:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
Actually, I think you are right that we need a manual checkpoint, except I
think we need it to be after prepare_new_globals(). set_frozenxids()
connects to each database (including template0) and updates a bunch of
pg_class rows, and we probably want those on disk before we start copying
the files to create all the user's databases.
Here is an updated patch.
OK, I have a (probably) stupid question. The comment says:
+ * In binary upgrade mode, we can skip this checkpoint because neither of
+ * these problems applies: we don't ever replay the WAL generated during
+ * pg_upgrade, and we don't concurrently modify template0 (not to mention
+ * that trying to take a backup during pg_upgrade is pointless).
But what happens if the system crashes during pg_upgrade? Does this
patch make things worse than they are today? And should we care?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jun 19, 2024 at 09:17:00AM -0400, Robert Haas wrote:
OK, I have a (probably) stupid question. The comment says:
+ * In binary upgrade mode, we can skip this checkpoint because neither of
+ * these problems applies: we don't ever replay the WAL generated during
+ * pg_upgrade, and we don't concurrently modify template0 (not to mention
+ * that trying to take a backup during pg_upgrade is pointless).
But what happens if the system crashes during pg_upgrade? Does this
patch make things worse than they are today? And should we care?
My understanding is that you basically have to restart the upgrade from
scratch if that happens. I suppose there could be a problem if you try to
use the half-upgraded cluster after a crash, but I imagine you have a good
chance of encountering other problems if you do that, too. So I don't
think we care...
--
nathan
On Wed, 19 Jun 2024 at 15:17, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 14, 2024 at 5:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
Actually, I think you are right that we need a manual checkpoint, except I
think we need it to be after prepare_new_globals(). set_frozenxids()
connects to each database (including template0) and updates a bunch of
pg_class rows, and we probably want those on disk before we start copying
the files to create all the user's databases.
Here is an updated patch.
OK, I have a (probably) stupid question. The comment says:
+ * In binary upgrade mode, we can skip this checkpoint because neither of
+ * these problems applies: we don't ever replay the WAL generated during
+ * pg_upgrade, and we don't concurrently modify template0 (not to mention
+ * that trying to take a backup during pg_upgrade is pointless).
But what happens if the system crashes during pg_upgrade? Does this
patch make things worse than they are today? And should we care?
As Nathan just said, AFAIK we don't have a way to resume progress from
a crashed pg_upgrade, which implies you currently have to start over
with a fresh initdb.
This patch wouldn't change that for the worse, it'd only add one more
process that depends on that behaviour.
Maybe in the future that'll change if pg_upgrade and pg_dump are made
smart enough to resume their restore progress across forceful
disconnects, but for now this patch seems like a nice performance
improvement.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Wed, Jun 19, 2024 at 08:37:17AM -0500, Nathan Bossart wrote:
My understanding is that you basically have to restart the upgrade from
scratch if that happens. I suppose there could be a problem if you try to
use the half-upgraded cluster after a crash, but I imagine you have a good
chance of encountering other problems if you do that, too. So I don't
think we care...
It's never been assumed that it would be safe to redo a
pg_upgrade after a crash on a cluster initdb'd for the upgrade, so I
don't think we need to care about that, either.
One failure I suspect would quickly be faced is OIDs getting reused
again as these are currently kept consistent.
--
Michael