optimize file transfer in pg_upgrade

Started by Nathan Bossart, about 1 year ago, 45 messages
#1 Nathan Bossart
nathandbossart@gmail.com
8 attachment(s)

For clusters with many relations, the file transfer step of pg_upgrade can
take the longest. This step clones, copies, or links the user relation
files from the older cluster to the new cluster, so the amount of time it
takes is closely related to the number of relations. However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.
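
To make the idea concrete, here is a minimal sketch of the per-database file
operations in catalog-swap mode (the paths and database OID 16384 are
hypothetical, error handling is reduced to a bare exit, and the real
implementation lives in do_catalog_transfer() in the attached patches, which
also sets the old catalog files aside rather than overwriting them):

#include <stdio.h>
#include <stdlib.h>

static void
swap_database_dir(void)
{
    /* set the pg_restore-generated database directory aside */
    if (rename("new/base/16384", "moved/16384") != 0)
        exit(1);

    /* adopt the old cluster's database directory wholesale */
    if (rename("old/base/16384", "new/base/16384") != 0)
        exit(1);

    /*
     * Then walk moved/16384 and move only the non-user files (catalog
     * relation files, PG_VERSION, pg_filenode.map, ...) into new/base/16384,
     * so that the new cluster ends up with the restored catalogs but the
     * old cluster's user relation files.
     */
}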

The attached proof-of-concept patches implement this "catalog-swap" mode
for demonstration purposes. I tested this mode on a cluster with 200
databases, each with 10,000 tables with 1,000 rows and 2 unique constraints
apiece. Each database also had 10,000 sequences. The test used 96 jobs.

pg_upgrade --link --sync-method syncfs --> 10m 23s (~5m linking)
pg_upgrade --catalog-swap --> 5m 32s (~30s linking)

While these results are encouraging, there are a couple of interesting
problems to manage. First, in order to move the data directory from the
old cluster to the new cluster, we will have first moved the new cluster's
data directory (full of files created by pg_restore) aside. After the file
transfer stage, this directory will be filled with useless empty files that
should eventually be deleted. Furthermore, none of these files will have
been synchronized to disk (outside of whatever the kernel has done in the
background), so pg_upgrade's data synchronization step can take a very long
time, even when syncfs() is used (so long that pg_upgrade can take even
longer than before). After much testing, the best way I've found to deal
with this problem is to introduce a special mode for "initdb --sync-only"
that calls fsync() for everything _except_ the actual data files. If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.
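
A rough sketch of the rule that special mode applies while walking the data
directory (this simplifies the --no-sync-data-files handling in the attached
initdb patch, which also skips the per-tablespace walks; the relocated
catalog files are fsync()'d separately as they are moved into place):

#include <stdbool.h>
#include <string.h>

/*
 * During "initdb --sync-only --no-sync-data-files", don't recurse into the
 * base/ directory, so the relation files beneath it are never opened or
 * fsync()'d.  WAL, pg_xact, configuration files, and so on are still
 * synchronized as usual.
 */
static bool
should_descend(const char *dirname, bool sync_data_files)
{
    if (!sync_data_files && strcmp(dirname, "base") == 0)
        return false;
    return true;
}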

Another interesting problem is that pg_upgrade currently doesn't transfer
the sequence data files. Since v10, we've restored these via pg_restore.
I believe this was originally done for the introduction of the pg_sequence
catalog, which changed the format of sequence tuples. In the new
catalog-swap mode I am proposing, this means we need to transfer all the
pg_restore-generated sequence data files. If there are many sequences, it
can be difficult to determine which transfer mode and synchronization
method will be faster. Since sequence tuple modifications are very rare, I
think the new catalog-swap mode should just use the sequence data files
from the old cluster whenever possible.
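
The attached patches (0007 and 0008) arrange this by running pg_dump without
--sequence-data in catalog-swap mode and by including sequences among the
relations whose files pg_upgrade transfers.  A simplified excerpt of the
latter, from get_rel_infos_query() (surrounding query text omitted):

    /*
     * Treat sequences like user tables in catalog-swap mode so that their
     * data files are carried over from the old cluster.
     */
    appendPQExpBuffer(&query,
                      "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
                      CppAsString2(RELKIND_MATVIEW) "%s) AND ...",
                      (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
                      ", " CppAsString2(RELKIND_SEQUENCE) : "");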

There are a couple of other smaller trade-offs with this approach, too.
First, this new mode complicates rollback if, say, the machine loses power
during file transfer. IME the vast majority of failures happen before this
step, and it should be relatively simple to generate a script that will
safely perform the required rollback steps, so I don't think this is a
deal-breaker. Second, this mode leaves around a bunch of files that users
would likely want to clean up at some point. I think the easiest way to
handle this is to just put all these files in the old cluster's data
directory so that the cleanup script generated by pg_upgrade also takes
care of them.

Thoughts?

--
nathan

Attachments:

v1-0001-Export-walkdir.patch (text/plain; charset=us-ascii)
From f800010296b1749b57e0fe3dcde010cc2ba41973 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 15:59:51 -0600
Subject: [PATCH v1 1/8] Export walkdir().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to swap catalog files
between database directories during pg_upgrade.
---
 src/common/file_utils.c         | 5 +----
 src/include/common/file_utils.h | 3 +++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 398fe1c334..3f488bf5ec 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -48,9 +48,6 @@
 #ifdef PG_FLUSH_DATA_WORKS
 static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
-static void walkdir(const char *path,
-					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
 
 #ifdef HAVE_SYNCFS
 
@@ -268,7 +265,7 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
  */
-static void
+void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
 		bool process_symlinks)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index e4339fb7b6..5a9519acfe 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -39,6 +39,9 @@ extern void sync_pgdata(const char *pg_data, int serverVersion,
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
+extern void walkdir(const char *path,
+					int (*action) (const char *fname, bool isdir),
+					bool process_symlinks);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v1-0002-Add-void-arg-parameter-to-walkdir-that-is-passed-.patch (text/plain; charset=us-ascii)
From 2d6b0d5708f07203ad2ffbd889404094d0c5969c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 13:59:39 -0600
Subject: [PATCH v1 2/8] Add "void *arg" parameter to walkdir() that is passed
 to function.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This will be used in follow up commits to pass private state to the
functions called by walkdir().
---
 src/bin/pg_basebackup/walmethods.c |  8 +++----
 src/bin/pg_dump/pg_backup_custom.c |  2 +-
 src/bin/pg_dump/pg_backup_tar.c    |  2 +-
 src/bin/pg_dump/pg_dumpall.c       |  2 +-
 src/common/file_utils.c            | 38 +++++++++++++++---------------
 src/include/common/file_utils.h    |  6 ++---
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 215b24597f..51640cb493 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -251,7 +251,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	 */
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tmppath, false) != 0 ||
+		if (fsync_fname(tmppath, false, NULL) != 0 ||
 			fsync_parent_path(tmppath) != 0)
 		{
 			wwmethod->lasterrno = errno;
@@ -486,7 +486,7 @@ dir_close(Walfile *f, WalCloseMethod method)
 			 */
 			if (f->wwmethod->sync)
 			{
-				r = fsync_fname(df->fullpath, false);
+				r = fsync_fname(df->fullpath, false, NULL);
 				if (r == 0)
 					r = fsync_parent_path(df->fullpath);
 			}
@@ -617,7 +617,7 @@ dir_finish(WalWriteMethod *wwmethod)
 		 * Files are fsynced when they are closed, but we need to fsync the
 		 * directory entry here as well.
 		 */
-		if (fsync_fname(dir_data->basedir, true) != 0)
+		if (fsync_fname(dir_data->basedir, true, NULL) != 0)
 		{
 			wwmethod->lasterrno = errno;
 			return false;
@@ -1321,7 +1321,7 @@ tar_finish(WalWriteMethod *wwmethod)
 
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tar_data->tarfilename, false) != 0 ||
+		if (fsync_fname(tar_data->tarfilename, false, NULL) != 0 ||
 			fsync_parent_path(tar_data->tarfilename) != 0)
 		{
 			wwmethod->lasterrno = errno;
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index ecaad7321a..6f750c916c 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -767,7 +767,7 @@ _CloseArchive(ArchiveHandle *AH)
 
 	/* Sync the output file if one is defined */
 	if (AH->dosync && AH->mode == archModeWrite && AH->fSpec)
-		(void) fsync_fname(AH->fSpec, false);
+		(void) fsync_fname(AH->fSpec, false, NULL);
 
 	AH->FH = NULL;
 }
diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index 41ee52b1d6..ecba27b623 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -847,7 +847,7 @@ _CloseArchive(ArchiveHandle *AH)
 
 		/* Sync the output file if one is defined */
 		if (AH->dosync && AH->fSpec)
-			(void) fsync_fname(AH->fSpec, false);
+			(void) fsync_fname(AH->fSpec, false, NULL);
 	}
 
 	AH->FH = NULL;
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index e3ad8fb295..cbb1e3f9e4 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -621,7 +621,7 @@ main(int argc, char *argv[])
 
 		/* sync the resulting file, errors are not fatal */
 		if (dosync)
-			(void) fsync_fname(filename, false);
+			(void) fsync_fname(filename, false, NULL);
 	}
 
 	exit_nicely(0);
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 3f488bf5ec..dc90f35ae1 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -46,7 +46,7 @@
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
 #ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
+static int	pre_sync_fname(const char *fname, bool isdir, void *arg);
 #endif
 
 #ifdef HAVE_SYNCFS
@@ -184,10 +184,10 @@ sync_pgdata(const char *pg_data,
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, NULL);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -200,10 +200,10 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, NULL);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				walkdir(pg_tblspc, fsync_fname, true, NULL);
 			}
 			break;
 	}
@@ -242,10 +242,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -267,8 +267,8 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  */
 void
 walkdir(const char *path,
-		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		int (*action) (const char *fname, bool isdir, void *arg),
+		bool process_symlinks, void *arg)
 {
 	DIR		   *dir;
 	struct dirent *de;
@@ -293,10 +293,10 @@ walkdir(const char *path,
 		switch (get_dirent_type(subpath, de, process_symlinks, PG_LOG_ERROR))
 		{
 			case PGFILETYPE_REG:
-				(*action) (subpath, false);
+				(*action) (subpath, false, arg);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, arg);
 				break;
 			default:
 
@@ -320,7 +320,7 @@ walkdir(const char *path,
 	 * synced.  Recent versions of ext4 have made the window much wider but
 	 * it's been an issue for ext3 and other filesystems in the past.
 	 */
-	(*action) (path, true);
+	(*action) (path, true, arg);
 }
 
 /*
@@ -332,7 +332,7 @@ walkdir(const char *path,
 #ifdef PG_FLUSH_DATA_WORKS
 
 static int
-pre_sync_fname(const char *fname, bool isdir)
+pre_sync_fname(const char *fname, bool isdir, void *arg)
 {
 	int			fd;
 
@@ -373,7 +373,7 @@ pre_sync_fname(const char *fname, bool isdir)
  * are fatal.
  */
 int
-fsync_fname(const char *fname, bool isdir)
+fsync_fname(const char *fname, bool isdir, void *arg)
 {
 	int			fd;
 	int			flags;
@@ -444,7 +444,7 @@ fsync_parent_path(const char *fname)
 	if (strlen(parentpath) == 0)
 		strlcpy(parentpath, ".", MAXPGPATH);
 
-	if (fsync_fname(parentpath, true) != 0)
+	if (fsync_fname(parentpath, true, NULL) != 0)
 		return -1;
 
 	return 0;
@@ -467,7 +467,7 @@ durable_rename(const char *oldfile, const char *newfile)
 	 * because it's then guaranteed that either source or target file exists
 	 * after a crash.
 	 */
-	if (fsync_fname(oldfile, false) != 0)
+	if (fsync_fname(oldfile, false, NULL) != 0)
 		return -1;
 
 	fd = open(newfile, PG_BINARY | O_RDWR, 0);
@@ -502,7 +502,7 @@ durable_rename(const char *oldfile, const char *newfile)
 	 * To guarantee renaming the file is persistent, fsync the file with its
 	 * new name, and its containing directory.
 	 */
-	if (fsync_fname(newfile, false) != 0)
+	if (fsync_fname(newfile, false, NULL) != 0)
 		return -1;
 
 	if (fsync_parent_path(newfile) != 0)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 5a9519acfe..c328f56a85 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,15 +33,15 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
-extern int	fsync_fname(const char *fname, bool isdir);
+extern int	fsync_fname(const char *fname, bool isdir, void *arg);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
 extern void walkdir(const char *path,
-					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					int (*action) (const char *fname, bool isdir, void *arg),
+					bool process_symlinks, void *arg);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v1-0003-Introduce-catalog-swap-mode-for-pg_upgrade.patch (text/plain; charset=us-ascii)
From c248059f7276578f5e8abd00ea9efb007423f94d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 16:38:19 -0600
Subject: [PATCH v1 3/8] Introduce catalog-swap mode for pg_upgrade.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This new mode moves the database directories from the old cluster
to the new cluster and then swaps the pg_restore-generated catalog
files in place.  This can significantly increase the length of the
following data synchronization step (due to the large number of
unsynchronized pg_restore-generated files), but this problem will
be handled in follow-up commits.
---
 src/bin/pg_upgrade/check.c         |   2 +
 src/bin/pg_upgrade/option.c        |   5 +
 src/bin/pg_upgrade/pg_upgrade.h    |   1 +
 src/bin/pg_upgrade/relfilenumber.c | 167 +++++++++++++++++++++++++++++
 src/tools/pgindent/typedefs.list   |   1 +
 5 files changed, 176 insertions(+)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 94164f0472..a4bb365718 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -711,6 +711,8 @@ check_new_cluster(void)
 		case TRANSFER_MODE_LINK:
 			check_hard_link();
 			break;
+		case TRANSFER_MODE_CATALOG_SWAP:
+			break;
 	}
 
 	check_is_install_user(&new_cluster);
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 6f41d63eed..64091a54c4 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -60,6 +60,7 @@ parseCommandLine(int argc, char *argv[])
 		{"copy", no_argument, NULL, 2},
 		{"copy-file-range", no_argument, NULL, 3},
 		{"sync-method", required_argument, NULL, 4},
+		{"catalog-swap", no_argument, NULL, 5},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -212,6 +213,10 @@ parseCommandLine(int argc, char *argv[])
 				user_opts.sync_method = pg_strdup(optarg);
 				break;
 
+			case 5:
+				user_opts.transfer_mode = TRANSFER_MODE_CATALOG_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..19cb5a011e 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -256,6 +256,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_CATALOG_SWAP,
 } transferMode;
 
 /*
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 07baa49a02..9d8fce3c4a 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,21 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "fe_utils/option_utils.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+typedef struct move_catalog_file_context
+{
+	FileNameMap *maps;
+	int			size;
+	char	   *target;
+} move_catalog_file_context;
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +51,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_CATALOG_SWAP:
+			prep_status_progress("Swapping catalog files");
+			break;
 	}
 
 	/*
@@ -127,6 +140,144 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 	}
 }
 
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	return pg_cmp_u32(((const FileNameMap *) a)->relfilenumber,
+					  ((const FileNameMap *) b)->relfilenumber);
+}
+
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+static int
+move_catalog_file(const char *fname, bool isdir, void *arg)
+{
+	char		dst[MAXPGPATH];
+	const char *filename = last_dir_separator(fname) + 1;
+	RelFileNumber rfn = parse_relfilenumber(filename);
+	move_catalog_file_context *context = (move_catalog_file_context *) arg;
+
+	/*
+	 * XXX: Is this right?  AFAICT we don't really expect there to be
+	 * directories within database directories, so perhaps it would be better
+	 * to either unconditionally rename or to fail.  Further investigation is
+	 * required.
+	 */
+	if (isdir)
+		return 0;
+
+	if (RelFileNumberIsValid(rfn))
+	{
+		FileNameMap key;
+
+		key.relfilenumber = (RelFileNumber) rfn;
+		if (bsearch(&key, context->maps, context->size,
+					sizeof(FileNameMap), FileNameMapCmp))
+			return 0;
+	}
+
+	snprintf(dst, sizeof(dst), "%s/%s", context->target, filename);
+	if (rename(fname, dst) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", fname, dst);
+
+	return 0;
+}
+
+/*
+ * XXX: This proof-of-concept patch doesn't yet handle non-default tablespaces.
+ */
+static void
+do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+	char		old_cat[MAXPGPATH];
+	move_catalog_file_context context;
+	DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+	parse_sync_method(user_opts.sync_method, &sync_method);
+
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s",
+			 maps[0].old_tablespace, maps[0].old_tablespace_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s",
+			 maps[0].new_tablespace, maps[0].new_tablespace_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, maps[0].db_oid);
+	snprintf(new_dat, sizeof(new_dat), "%s/%u", new_tblspc, maps[0].db_oid);
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s_moved", old_tblspc);
+	snprintf(moved_dat, sizeof(moved_dat), "%s/%u",
+			 moved_tblspc, maps[0].db_oid);
+	snprintf(old_cat, sizeof(old_cat), "%s/%u_old_cat",
+			 moved_tblspc, maps[0].db_oid);
+
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/* create dir for stuff that is moved aside */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\": %m", moved_tblspc);
+
+	/* move new cluster data dir aside */
+	if (rename(new_dat, moved_dat))
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", new_dat, moved_dat);
+
+	/* move old cluster data dir in place */
+	if (rename(old_dat, new_dat))
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", old_dat, new_dat);
+
+	/* create dir for old catalogs */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode))
+		pg_fatal("could not create directory \"%s\": %m", old_cat);
+
+	/* move catalogs in new data dir aside */
+	context.maps = maps;
+	context.size = size;
+	context.target = old_cat;
+	walkdir(new_dat, move_catalog_file, false, &context);
+
+	/* move catalogs in moved-aside data dir in place */
+	context.target = new_dat;
+	walkdir(moved_dat, move_catalog_file, false, &context);
+
+	/* no need to sync things individually if we are going to syncfs() later */
+	if (sync_method == DATA_DIR_SYNC_METHOD_SYNCFS)
+		return;
+
+	/* fsync directory entries */
+	if (fsync_fname(moved_dat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", moved_dat);
+	if (fsync_fname(old_cat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", old_cat);
+
+	/*
+	 * XXX: We could instead fsync() these directories once at the end instead
+	 * of once per-database, but it doesn't affect performance meaningfully,
+	 * and this is just a proof-of-concept patch, so I haven't bothered doing
+	 * the required refactoring yet.
+	 */
+	if (fsync_fname(old_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", old_tblspc);
+	if (fsync_fname(moved_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", moved_tblspc);
+}
+
 /*
  * transfer_single_new_db()
  *
@@ -145,6 +296,18 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/*
+	 * XXX: In catalog-swap mode, vm_must_add_frozenbit isn't handled yet.  We
+	 * could either disallow using catalog-swap mode if the upgrade involves
+	 * versions older than v9.6, or we could add code to handle rewriting the
+	 * visibility maps in this mode (like the other modes do).
+	 */
+	if (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP)
+	{
+		do_catalog_transfer(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +422,10 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_CATALOG_SWAP:
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1847bbfa95..58c339af85 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3649,6 +3649,7 @@ mix_data_t
 mixedStruct
 mode_t
 movedb_failure_params
+move_catalog_file_context
 multirange_bsearch_comparison
 multirange_unnest_fctx
 mxact
-- 
2.39.5 (Apple Git-154)

v1-0004-Add-no-sync-data-files-flag-to-initdb.patch (text/plain; charset=us-ascii)
From 1fd29fa00c777b2c04683394f54261b446e46a61 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 16:47:42 -0600
Subject: [PATCH v1 4/8] Add --no-sync-data-files flag to initdb.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This new mode causes 'initdb --sync-only' to synchronize everything
except for the database directories.  It will be used in a
follow-up commit that aims to reduce the duration of the data
synchronization step in pg_upgrade's catalog-swap mode.
---
 src/bin/initdb/initdb.c                     |  9 ++++--
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 35 ++++++++++++++++-----
 src/include/common/file_utils.h             |  3 +-
 7 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783..53c6e86a80 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -3183,6 +3184,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3377,6 +3379,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3428,7 +3433,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3491,7 +3496,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index e41a6cfbda..43526e3246 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b86bc417c9..06ccaacfda 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 5f1f62f1db..80a137be4e 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -420,7 +420,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 67a86bb4c5..ceb1c3ac6d 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index dc90f35ae1..65cdf07ae7 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -94,7 +94,8 @@ do_syncfs(const char *path)
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -184,10 +185,11 @@ sync_pgdata(const char *pg_data,
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false, NULL);
+				walkdir(pg_data, pre_sync_fname, false, &sync_data_files);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false, NULL);
-				walkdir(pg_tblspc, pre_sync_fname, true, NULL);
+					walkdir(pg_wal, pre_sync_fname, false, &sync_data_files);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -200,10 +202,11 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false, NULL);
+				walkdir(pg_data, fsync_fname, false, &sync_data_files);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false, NULL);
-				walkdir(pg_tblspc, fsync_fname, true, NULL);
+					walkdir(pg_wal, fsync_fname, false, &sync_data_files);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
 			}
 			break;
 	}
@@ -296,7 +299,23 @@ walkdir(const char *path,
 				(*action) (subpath, false, arg);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false, arg);
+
+				/*
+				 * XXX: Checking here for the "sync_data_files" case is quite
+				 * hacky, but it's not clear how to do better.  Another option
+				 * would be to send "de" down to the function, but that would
+				 * introduce a huge number of function pointer calls and
+				 * directory reads that we are trying to avoid.
+				 */
+#ifdef PG_FLUSH_DATA_WORKS
+				if ((action != pre_sync_fname && action != fsync_fname) ||
+#else
+				if (action != fsync_fname ||
+#endif
+					!arg || *((bool *) arg) ||
+					strcmp(de->d_name, "base") != 0)
+					walkdir(subpath, action, false, arg);
+
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index c328f56a85..3743caa63e 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,8 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir, void *arg);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method,
+						bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v1-0005-Export-pre_sync_fname.patch (text/plain; charset=us-ascii)
From e70eea50e81d8f40fb8db15c06b23305d4b8698f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 09:52:19 -0600
Subject: [PATCH v1 5/8] Export pre_sync_fname().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to alert the file system
that we want a file's data on disk so that subsequent calls to
fsync() are faster.
---
 src/common/file_utils.c         | 18 +++++-------------
 src/include/common/file_utils.h |  1 +
 2 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 65cdf07ae7..5c201ec6e8 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,10 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir, void *arg);
-#endif
-
 #ifdef HAVE_SYNCFS
 
 /*
@@ -307,11 +303,7 @@ walkdir(const char *path,
 				 * introduce a huge number of function pointer calls and
 				 * directory reads that we are trying to avoid.
 				 */
-#ifdef PG_FLUSH_DATA_WORKS
 				if ((action != pre_sync_fname && action != fsync_fname) ||
-#else
-				if (action != fsync_fname ||
-#endif
 					!arg || *((bool *) arg) ||
 					strcmp(de->d_name, "base") != 0)
 					walkdir(subpath, action, false, arg);
@@ -348,11 +340,12 @@ walkdir(const char *path,
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir, void *arg)
 {
+#ifndef PG_FLUSH_DATA_WORKS
+	return 0;
+#else
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -380,9 +373,8 @@ pre_sync_fname(const char *fname, bool isdir, void *arg)
 
 	(void) close(fd);
 	return 0;
-}
-
 #endif							/* PG_FLUSH_DATA_WORKS */
+}
 
 /*
  * fsync_fname -- Try to fsync a file or directory
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 3743caa63e..e7a34d4c4e 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -43,6 +43,7 @@ extern int	fsync_parent_path(const char *fname);
 extern void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir, void *arg),
 					bool process_symlinks, void *arg);
+extern int	pre_sync_fname(const char *fname, bool isdir, void *arg);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v1-0006-In-pg_upgrade-s-catalog-swap-mode-only-sync-files.patch (text/plain; charset=us-ascii)
From 5d17fd66e08612574ffe0c39ff7624259319059c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:40:43 -0600
Subject: [PATCH v1 6/8] In pg_upgrade's catalog-swap mode, only sync files as
 necessary.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

In this mode, it can be much faster to use "--sync-method fsync",
which now skips synchronizing data files moved from the old cluster
(which we assumed were synchronized before pg_upgrade).
---
 src/bin/pg_upgrade/pg_upgrade.c    |  6 ++--
 src/bin/pg_upgrade/relfilenumber.c | 52 ++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..f5946ac89a 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -210,10 +210,12 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s %s",
 				  new_cluster.bindir,
 				  new_cluster.pgdata,
-				  user_opts.sync_method);
+				  user_opts.sync_method,
+				  (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+				  "--no-sync-data-files" : "");
 		check_ok();
 	}
 
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 9d8fce3c4a..dcca4bb2e7 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -25,8 +25,49 @@ typedef struct move_catalog_file_context
 	FileNameMap *maps;
 	int			size;
 	char	   *target;
+	bool		sync_moved;
 } move_catalog_file_context;
 
+#define SYNC_QUEUE_MAX_LEN (1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+		fsync_fname(sync_queue[i], false, NULL);
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false, NULL);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
 /*
  * transfer_all_new_tablespaces()
  *
@@ -138,6 +179,8 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	sync_queue_sync_all();
 }
 
 static int
@@ -195,6 +238,9 @@ move_catalog_file(const char *fname, bool isdir, void *arg)
 	if (rename(fname, dst) != 0)
 		pg_fatal("could not rename \"%s\" to \"%s\": %m", fname, dst);
 
+	if (context->sync_moved)
+		sync_queue_push(dst);
+
 	return 0;
 }
 
@@ -250,10 +296,12 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 	context.maps = maps;
 	context.size = size;
 	context.target = old_cat;
+	context.sync_moved = false;
 	walkdir(new_dat, move_catalog_file, false, &context);
 
 	/* move catalogs in moved-aside data dir in place */
 	context.target = new_dat;
+	context.sync_moved = (sync_method != DATA_DIR_SYNC_METHOD_SYNCFS);
 	walkdir(moved_dat, move_catalog_file, false, &context);
 
 	/* no need to sync things individually if we are going to syncfs() later */
@@ -265,6 +313,8 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 		pg_fatal("could not synchronize directory \"%s\": %m", moved_dat);
 	if (fsync_fname(old_cat, true, NULL) != 0)
 		pg_fatal("could not synchronize directory \"%s\": %m", old_cat);
+	if (fsync_fname(new_dat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
 
 	/*
 	 * XXX: We could instead fsync() these directories once at the end instead
@@ -276,6 +326,8 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 		pg_fatal("could not synchronize directory \"%s\": %m", old_tblspc);
 	if (fsync_fname(moved_tblspc, true, NULL) != 0)
 		pg_fatal("could not synchronize directory \"%s\": %m", moved_tblspc);
+	if (fsync_fname(new_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_tblspc);
 }
 
 /*
-- 
2.39.5 (Apple Git-154)

v1-0007-Add-sequence-data-flag-to-pg_dump.patch (text/plain; charset=us-ascii)
From 234d40f4e45d56f40465037dcf83c8c980edf095 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:46:11 -0600
Subject: [PATCH v1 7/8] Add --sequence-data flag to pg_dump.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This flag can be used to optionally dump the sequence data even
when --schema-only is used.  It is primarily intended for use in a
follow-up commit that will cause sequence data files to be carried
over from the old cluster in pg_upgrade's new catalog-swap mode.
---
 src/bin/pg_dump/pg_dump.c                   | 9 +--------
 src/bin/pg_upgrade/dump.c                   | 2 +-
 src/test/modules/test_pg_dump/t/001_base.pl | 2 +-
 3 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index b2f4eb2c6d..2b57abd305 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -501,6 +501,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -768,14 +769,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (dopt.dataOnly && dopt.schemaOnly)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 8345f55be8..8453722833 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -53,7 +53,7 @@ generate_old_dump(void)
 
 		parallel_exec_prog(log_file_name, NULL,
 						   "\"%s/pg_dump\" %s --schema-only --quote-all-identifiers "
-						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
+						   "--binary-upgrade --sequence-data --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
 						   log_opts.dumpdir,
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index e2579e29cd..46231c93f1 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			"--file=$tempdir/binary_upgrade.sql", '--schema-only',
-			'--binary-upgrade', '--dbname=postgres',
+			'--binary-upgrade', '--sequence-data', '--dbname=postgres',
 		],
 	},
 	clean => {
-- 
2.39.5 (Apple Git-154)

v1-0008-Avoid-copying-sequence-files-in-pg_upgrade-s-cata.patch (text/plain; charset=us-ascii)
From 51be6c09256272e1ce0360b5376b4a14cd1d9a61 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:53:40 -0600
Subject: [PATCH v1 8/8] Avoid copying sequence files in pg_upgrade's
 catalog-swap mode.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

On clusters with many sequences, this can further reduce the amount
of time required to wire up the data files in the new cluster.  If
the sequence data file format changes, this optimization cannot be
used, but that seems rare enough.
---
 src/bin/pg_upgrade/dump.c | 8 +++++++-
 src/bin/pg_upgrade/info.c | 6 +++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 8453722833..d5a81cc29c 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -51,10 +51,16 @@ generate_old_dump(void)
 		snprintf(sql_file_name, sizeof(sql_file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
+		/*
+		 * XXX: We need to be sure that the sequence data format hasn't
+		 * changed.
+		 */
 		parallel_exec_prog(log_file_name, NULL,
 						   "\"%s/pg_dump\" %s --schema-only --quote-all-identifiers "
-						   "--binary-upgrade --sequence-data --format=custom %s --no-sync --file=\"%s/%s\" %s",
+						   "--binary-upgrade %s --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   log_opts.dumpdir,
 						   sql_file_name, escaped_connstr.data);
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index f83ded89cb..786d17e32f 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -483,6 +483,8 @@ get_rel_infos_query(void)
 	 * pg_largeobject contains user data that does not appear in pg_dump
 	 * output, so we have to copy that system table.  It's easiest to do that
 	 * by treating it as a user table.
+	 *
+	 * XXX: We need to be sure that the sequence data format hasn't changed.
 	 */
 	appendPQExpBuffer(&query,
 					  "WITH regular_heap (reloid, indtable, toastheap) AS ( "
@@ -490,7 +492,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +501,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
-- 
2.39.5 (Apple Git-154)

#2 Greg Sabino Mullane
htamfids@gmail.com
In reply to: Nathan Bossart (#1)
Re: optimize file transfer in pg_upgrade

On Wed, Nov 6, 2024 at 5:07 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

Therefore, it can be much faster to instead move the entire data directory
from the old cluster to the new cluster and to then swap the catalog
relation files.

Thank you for breaking this up so clearly into separate commits. I think it
is a very interesting idea, and anything to speed up pg_upgrade is always
welcome. Some minor thoughts:

[PATCH v1 3/8] Introduce catalog-swap mode for pg_upgrade.
.. we don't really expect there to be directories within database
directories, so perhaps it would be better to either unconditionally rename
or to fail.

Failure seems the best option here, so we can cleanly handle any future
cases in which we decide to put dirs in this directory.

    if (RelFileNumberIsValid(rfn))
    {
        FileNameMap key;

        key.relfilenumber = (RelFileNumber) rfn;
        if (bsearch(&key, context->maps, context->size,
                    sizeof(FileNameMap), FileNameMapCmp))
            return 0;
    }

    snprintf(dst, sizeof(dst), "%s/%s", context->target, filename);
    if (rename(fname, dst) != 0)

I'm not quite clear what we are doing here with falling through
for InvalidOid entries, could you explain?

.. vm_must_add_frozenbit isn't handled yet. We could either disallow
using catalog-swap mode if the upgrade involves versions older than v9.6

Yes, this. No need for more code to handle super old versions when other
options exist.

with this problem is to introduce a special mode for "initdb --sync-only"
that calls fsync() for everything _except_ the actual data files. If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.

Very cool approach!

Cheers,
Greg

#3 Bruce Momjian
bruce@momjian.us
In reply to: Nathan Bossart (#1)
Re: optimize file transfer in pg_upgrade

On Wed, Nov 6, 2024 at 04:07:35PM -0600, Nathan Bossart wrote:

For clusters with many relations, the file transfer step of pg_upgrade can
take the longest. This step clones, copies, or links the user relation
files from the older cluster to the new cluster, so the amount of time it
takes is closely related to the number of relations. However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

That is certainly a creative idea. I am surprised the links take so
long. Obviously rollback would be hard, as you mentioned, whereas today you
can roll back a --link upgrade right up until you start the new cluster. I
think it clearly should be considered. The patch is smaller than I expected.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

#4 Nathan Bossart
nathandbossart@gmail.com
In reply to: Greg Sabino Mullane (#2)
Re: optimize file transfer in pg_upgrade

On Sun, Nov 17, 2024 at 01:50:53PM -0500, Greg Sabino Mullane wrote:

On Wed, Nov 6, 2024 at 5:07 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

Therefore, it can be much faster to instead move the entire data directory
from the old cluster to the new cluster and to then swap the catalog
relation files.

Thank you for breaking this up so clearly into separate commits. I think it
is a very interesting idea, and anything to speed up pg_upgrade is always
welcome. Some minor thoughts:

Thank you for reviewing!

.. we don't really expect there to be directories within database

directories,

so perhaps it would be better to either unconditionally rename or to fail.

Failure seems the best option here, so we can cleanly handle any future
cases in which we decide to put dirs in this directory.

Good point.

    if (RelFileNumberIsValid(rfn))
    {
        FileNameMap key;

        key.relfilenumber = (RelFileNumber) rfn;
        if (bsearch(&key, context->maps, context->size,
                    sizeof(FileNameMap), FileNameMapCmp))
            return 0;
    }

    snprintf(dst, sizeof(dst), "%s/%s", context->target, filename);
    if (rename(fname, dst) != 0)

I'm not quite clear what we are doing here with falling through
for InvalidOid entries, could you explain?

The idea is that if it looks like a data file that we might want to
transfer (i.e., it starts with a RelFileNumber), we should consult our map
to determine whether to move it. Otherwise, we want to unconditionally
transfer it so that we always use the files generated during pg_restore in
the new cluster (e.g., PG_VERSION and pg_filenode.map). In theory, this
should result in the same end state as what --link mode does today (for the
new cluster, at least).

.. vm_must_add_frozenbit isn't handled yet. We could either disallow
using catalog-swap mode if the upgrade involves versions older than v9.6

Yes, this. No need for more code to handle super old versions when other
options exist.

I'm inclined to agree.

with this problem is to introduce a special mode for "initdb --sync-only"
that calls fsync() for everything _except_ the actual data files. If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.

Very cool approach!

:)

--
nathan

#5 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bruce Momjian (#3)
Re: optimize file transfer in pg_upgrade

On Mon, Nov 18, 2024 at 10:34:00PM -0500, Bruce Momjian wrote:

On Wed, Nov 6, 2024 at 04:07:35PM -0600, Nathan Bossart wrote:

For clusters with many relations, the file transfer step of pg_upgrade can
take the longest. This step clones, copies, or links the user relation
files from the older cluster to the new cluster, so the amount of time it
takes is closely related to the number of relations. However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

That is certainly a creative idea. I am surprised the links take so
long. Obviously rollback would be hard, as you mentioned, whereas today you
can roll back a --link upgrade right up until you start the new cluster. I
think it clearly should be considered.

I've yet to try, but I'm cautiously optimistic that it will be possible to
generate simple scripts that can unwind things by just looking at the
directory entries, even if pg_upgrade crashed halfway through the linking
stage.

The patch is smaller than I expected.

I was surprised by this, too. Obviously, this one is a bit smaller than
the "real" patches will be because it's just a proof-of-concept, but it
should still be pretty manageable.

--
nathan

#6 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#4)
8 attachment(s)
Re: optimize file transfer in pg_upgrade

Here is a rebased patch set for cfbot. I'm planning to spend some time
getting these patches into a more reviewable state in the near future.

--
nathan

Attachments:

v2-0001-Export-walkdir.patch (text/plain; charset=us-ascii)
From 81fe66e0f0aa4f958a8707df669f60756c89bb85 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 15:59:51 -0600
Subject: [PATCH v2 1/8] Export walkdir().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to swap catalog files
between database directories during pg_upgrade.
---
 src/common/file_utils.c         | 5 +----
 src/include/common/file_utils.h | 3 +++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 398fe1c334..3f488bf5ec 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -48,9 +48,6 @@
 #ifdef PG_FLUSH_DATA_WORKS
 static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
-static void walkdir(const char *path,
-					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
 
 #ifdef HAVE_SYNCFS
 
@@ -268,7 +265,7 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
  */
-static void
+void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
 		bool process_symlinks)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index e4339fb7b6..5a9519acfe 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -39,6 +39,9 @@ extern void sync_pgdata(const char *pg_data, int serverVersion,
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
+extern void walkdir(const char *path,
+					int (*action) (const char *fname, bool isdir),
+					bool process_symlinks);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v2-0002-Add-void-arg-parameter-to-walkdir-that-is-passed-.patch (text/plain; charset=us-ascii)
From 36d6a1aad5cbfeb05954886bb336cfa9ec01c5c3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 13:59:39 -0600
Subject: [PATCH v2 2/8] Add "void *arg" parameter to walkdir() that is passed
 to function.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This will be used in follow up commits to pass private state to the
functions called by walkdir().
---
 src/bin/pg_basebackup/walmethods.c |  8 +++----
 src/bin/pg_dump/pg_backup_custom.c |  2 +-
 src/bin/pg_dump/pg_backup_tar.c    |  2 +-
 src/bin/pg_dump/pg_dumpall.c       |  2 +-
 src/common/file_utils.c            | 38 +++++++++++++++---------------
 src/include/common/file_utils.h    |  6 ++---
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 215b24597f..51640cb493 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -251,7 +251,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	 */
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tmppath, false) != 0 ||
+		if (fsync_fname(tmppath, false, NULL) != 0 ||
 			fsync_parent_path(tmppath) != 0)
 		{
 			wwmethod->lasterrno = errno;
@@ -486,7 +486,7 @@ dir_close(Walfile *f, WalCloseMethod method)
 			 */
 			if (f->wwmethod->sync)
 			{
-				r = fsync_fname(df->fullpath, false);
+				r = fsync_fname(df->fullpath, false, NULL);
 				if (r == 0)
 					r = fsync_parent_path(df->fullpath);
 			}
@@ -617,7 +617,7 @@ dir_finish(WalWriteMethod *wwmethod)
 		 * Files are fsynced when they are closed, but we need to fsync the
 		 * directory entry here as well.
 		 */
-		if (fsync_fname(dir_data->basedir, true) != 0)
+		if (fsync_fname(dir_data->basedir, true, NULL) != 0)
 		{
 			wwmethod->lasterrno = errno;
 			return false;
@@ -1321,7 +1321,7 @@ tar_finish(WalWriteMethod *wwmethod)
 
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tar_data->tarfilename, false) != 0 ||
+		if (fsync_fname(tar_data->tarfilename, false, NULL) != 0 ||
 			fsync_parent_path(tar_data->tarfilename) != 0)
 		{
 			wwmethod->lasterrno = errno;
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index e44b887eb2..51edf147d6 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -767,7 +767,7 @@ _CloseArchive(ArchiveHandle *AH)
 
 	/* Sync the output file if one is defined */
 	if (AH->dosync && AH->mode == archModeWrite && AH->fSpec)
-		(void) fsync_fname(AH->fSpec, false);
+		(void) fsync_fname(AH->fSpec, false, NULL);
 
 	AH->FH = NULL;
 }
diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index b5ba3b46dd..5ea6a472d4 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -847,7 +847,7 @@ _CloseArchive(ArchiveHandle *AH)
 
 		/* Sync the output file if one is defined */
 		if (AH->dosync && AH->fSpec)
-			(void) fsync_fname(AH->fSpec, false);
+			(void) fsync_fname(AH->fSpec, false, NULL);
 	}
 
 	AH->FH = NULL;
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index 9a04e51c81..58a9f6e748 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -621,7 +621,7 @@ main(int argc, char *argv[])
 
 		/* sync the resulting file, errors are not fatal */
 		if (dosync)
-			(void) fsync_fname(filename, false);
+			(void) fsync_fname(filename, false, NULL);
 	}
 
 	exit_nicely(0);
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 3f488bf5ec..dc90f35ae1 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -46,7 +46,7 @@
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
 #ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
+static int	pre_sync_fname(const char *fname, bool isdir, void *arg);
 #endif
 
 #ifdef HAVE_SYNCFS
@@ -184,10 +184,10 @@ sync_pgdata(const char *pg_data,
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, NULL);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -200,10 +200,10 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, NULL);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				walkdir(pg_tblspc, fsync_fname, true, NULL);
 			}
 			break;
 	}
@@ -242,10 +242,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -267,8 +267,8 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  */
 void
 walkdir(const char *path,
-		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		int (*action) (const char *fname, bool isdir, void *arg),
+		bool process_symlinks, void *arg)
 {
 	DIR		   *dir;
 	struct dirent *de;
@@ -293,10 +293,10 @@ walkdir(const char *path,
 		switch (get_dirent_type(subpath, de, process_symlinks, PG_LOG_ERROR))
 		{
 			case PGFILETYPE_REG:
-				(*action) (subpath, false);
+				(*action) (subpath, false, arg);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, arg);
 				break;
 			default:
 
@@ -320,7 +320,7 @@ walkdir(const char *path,
 	 * synced.  Recent versions of ext4 have made the window much wider but
 	 * it's been an issue for ext3 and other filesystems in the past.
 	 */
-	(*action) (path, true);
+	(*action) (path, true, arg);
 }
 
 /*
@@ -332,7 +332,7 @@ walkdir(const char *path,
 #ifdef PG_FLUSH_DATA_WORKS
 
 static int
-pre_sync_fname(const char *fname, bool isdir)
+pre_sync_fname(const char *fname, bool isdir, void *arg)
 {
 	int			fd;
 
@@ -373,7 +373,7 @@ pre_sync_fname(const char *fname, bool isdir)
  * are fatal.
  */
 int
-fsync_fname(const char *fname, bool isdir)
+fsync_fname(const char *fname, bool isdir, void *arg)
 {
 	int			fd;
 	int			flags;
@@ -444,7 +444,7 @@ fsync_parent_path(const char *fname)
 	if (strlen(parentpath) == 0)
 		strlcpy(parentpath, ".", MAXPGPATH);
 
-	if (fsync_fname(parentpath, true) != 0)
+	if (fsync_fname(parentpath, true, NULL) != 0)
 		return -1;
 
 	return 0;
@@ -467,7 +467,7 @@ durable_rename(const char *oldfile, const char *newfile)
 	 * because it's then guaranteed that either source or target file exists
 	 * after a crash.
 	 */
-	if (fsync_fname(oldfile, false) != 0)
+	if (fsync_fname(oldfile, false, NULL) != 0)
 		return -1;
 
 	fd = open(newfile, PG_BINARY | O_RDWR, 0);
@@ -502,7 +502,7 @@ durable_rename(const char *oldfile, const char *newfile)
 	 * To guarantee renaming the file is persistent, fsync the file with its
 	 * new name, and its containing directory.
 	 */
-	if (fsync_fname(newfile, false) != 0)
+	if (fsync_fname(newfile, false, NULL) != 0)
 		return -1;
 
 	if (fsync_parent_path(newfile) != 0)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 5a9519acfe..c328f56a85 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,15 +33,15 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
-extern int	fsync_fname(const char *fname, bool isdir);
+extern int	fsync_fname(const char *fname, bool isdir, void *arg);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
 extern void walkdir(const char *path,
-					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					int (*action) (const char *fname, bool isdir, void *arg),
+					bool process_symlinks, void *arg);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v2-0003-Introduce-catalog-swap-mode-for-pg_upgrade.patch
From f5eca13b8b04760977ab41ef9cd023a47e5cbbbd Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 16:38:19 -0600
Subject: [PATCH v2 3/8] Introduce catalog-swap mode for pg_upgrade.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This new mode moves the database directories from the old cluster
to the new cluster and then swaps the pg_restore-generated catalog
files in place.  This can significantly increase the length of the
following data synchronization step (due to the large number of
unsynchronized pg_restore-generated files), but this problem will
be handled in follow-up commits.
---
 src/bin/pg_upgrade/check.c         |   2 +
 src/bin/pg_upgrade/option.c        |   5 +
 src/bin/pg_upgrade/pg_upgrade.h    |   1 +
 src/bin/pg_upgrade/relfilenumber.c | 167 +++++++++++++++++++++++++++++
 src/tools/pgindent/typedefs.list   |   1 +
 5 files changed, 176 insertions(+)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 94164f0472..a4bb365718 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -711,6 +711,8 @@ check_new_cluster(void)
 		case TRANSFER_MODE_LINK:
 			check_hard_link();
 			break;
+		case TRANSFER_MODE_CATALOG_SWAP:
+			break;
 	}
 
 	check_is_install_user(&new_cluster);
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 6f41d63eed..64091a54c4 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -60,6 +60,7 @@ parseCommandLine(int argc, char *argv[])
 		{"copy", no_argument, NULL, 2},
 		{"copy-file-range", no_argument, NULL, 3},
 		{"sync-method", required_argument, NULL, 4},
+		{"catalog-swap", no_argument, NULL, 5},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -212,6 +213,10 @@ parseCommandLine(int argc, char *argv[])
 				user_opts.sync_method = pg_strdup(optarg);
 				break;
 
+			case 5:
+				user_opts.transfer_mode = TRANSFER_MODE_CATALOG_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..19cb5a011e 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -256,6 +256,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_CATALOG_SWAP,
 } transferMode;
 
 /*
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 07baa49a02..9d8fce3c4a 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,21 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "fe_utils/option_utils.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+typedef struct move_catalog_file_context
+{
+	FileNameMap *maps;
+	int			size;
+	char	   *target;
+} move_catalog_file_context;
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +51,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_CATALOG_SWAP:
+			prep_status_progress("Swapping catalog files");
+			break;
 	}
 
 	/*
@@ -127,6 +140,144 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 	}
 }
 
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	return pg_cmp_u32(((const FileNameMap *) a)->relfilenumber,
+					  ((const FileNameMap *) b)->relfilenumber);
+}
+
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+static int
+move_catalog_file(const char *fname, bool isdir, void *arg)
+{
+	char		dst[MAXPGPATH];
+	const char *filename = last_dir_separator(fname) + 1;
+	RelFileNumber rfn = parse_relfilenumber(filename);
+	move_catalog_file_context *context = (move_catalog_file_context *) arg;
+
+	/*
+	 * XXX: Is this right?  AFAICT we don't really expect there to be
+	 * directories within database directories, so perhaps it would be better
+	 * to either unconditionally rename or to fail.  Further investigation is
+	 * required.
+	 */
+	if (isdir)
+		return 0;
+
+	if (RelFileNumberIsValid(rfn))
+	{
+		FileNameMap key;
+
+		key.relfilenumber = (RelFileNumber) rfn;
+		if (bsearch(&key, context->maps, context->size,
+					sizeof(FileNameMap), FileNameMapCmp))
+			return 0;
+	}
+
+	snprintf(dst, sizeof(dst), "%s/%s", context->target, filename);
+	if (rename(fname, dst) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", fname, dst);
+
+	return 0;
+}
+
+/*
+ * XXX: This proof-of-concept patch doesn't yet handle non-default tablespaces.
+ */
+static void
+do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+	char		old_cat[MAXPGPATH];
+	move_catalog_file_context context;
+	DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+	parse_sync_method(user_opts.sync_method, &sync_method);
+
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s",
+			 maps[0].old_tablespace, maps[0].old_tablespace_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s",
+			 maps[0].new_tablespace, maps[0].new_tablespace_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, maps[0].db_oid);
+	snprintf(new_dat, sizeof(new_dat), "%s/%u", new_tblspc, maps[0].db_oid);
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s_moved", old_tblspc);
+	snprintf(moved_dat, sizeof(moved_dat), "%s/%u",
+			 moved_tblspc, maps[0].db_oid);
+	snprintf(old_cat, sizeof(old_cat), "%s/%u_old_cat",
+			 moved_tblspc, maps[0].db_oid);
+
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/* create dir for stuff that is moved aside */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\": %m", moved_tblspc);
+
+	/* move new cluster data dir aside */
+	if (rename(new_dat, moved_dat))
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", new_dat, moved_dat);
+
+	/* move old cluster data dir in place */
+	if (rename(old_dat, new_dat))
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", old_dat, new_dat);
+
+	/* create dir for old catalogs */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode))
+		pg_fatal("could not create directory \"%s\": %m", old_cat);
+
+	/* move catalogs in new data dir aside */
+	context.maps = maps;
+	context.size = size;
+	context.target = old_cat;
+	walkdir(new_dat, move_catalog_file, false, &context);
+
+	/* move catalogs in moved-aside data dir in place */
+	context.target = new_dat;
+	walkdir(moved_dat, move_catalog_file, false, &context);
+
+	/* no need to sync things individually if we are going to syncfs() later */
+	if (sync_method == DATA_DIR_SYNC_METHOD_SYNCFS)
+		return;
+
+	/* fsync directory entries */
+	if (fsync_fname(moved_dat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", moved_dat);
+	if (fsync_fname(old_cat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", old_cat);
+
+	/*
+	 * XXX: We could instead fsync() these directories once at the end instead
+	 * of once per-database, but it doesn't affect performance meaningfully,
+	 * and this is just a proof-of-concept patch, so I haven't bothered doing
+	 * the required refactoring yet.
+	 */
+	if (fsync_fname(old_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", old_tblspc);
+	if (fsync_fname(moved_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", moved_tblspc);
+}
+
 /*
  * transfer_single_new_db()
  *
@@ -145,6 +296,18 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/*
+	 * XXX: In catalog-swap mode, vm_must_add_frozenbit isn't handled yet.  We
+	 * could either disallow using catalog-swap mode if the upgrade involves
+	 * versions older than v9.6, or we could add code to handle rewriting the
+	 * visibility maps in this mode (like the other modes do).
+	 */
+	if (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP)
+	{
+		do_catalog_transfer(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +422,10 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_CATALOG_SWAP:
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2d4c870423..f721f934c0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3655,6 +3655,7 @@ mix_data_t
 mixedStruct
 mode_t
 movedb_failure_params
+move_catalog_file_context
 multirange_bsearch_comparison
 multirange_unnest_fctx
 mxact
-- 
2.39.5 (Apple Git-154)

v2-0004-Add-no-sync-data-files-flag-to-initdb.patch
From 44d566652775b2a88789e181d95a15b92ac913c6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 16:47:42 -0600
Subject: [PATCH v2 4/8] Add --no-sync-data-files flag to initdb.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This new flag causes 'initdb --sync-only' to synchronize everything
except for the database directories.  It will be used in a
follow-up commit that aims to reduce the duration of the data
synchronization step in pg_upgrade's catalog-swap mode.
---
 src/bin/initdb/initdb.c                     |  9 ++++--
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 35 ++++++++++++++++-----
 src/include/common/file_utils.h             |  3 +-
 7 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783..53c6e86a80 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -3183,6 +3184,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3377,6 +3379,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3428,7 +3433,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3491,7 +3496,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index e41a6cfbda..43526e3246 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b86bc417c9..06ccaacfda 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 5f1f62f1db..80a137be4e 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -420,7 +420,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 67a86bb4c5..ceb1c3ac6d 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index dc90f35ae1..65cdf07ae7 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -94,7 +94,8 @@ do_syncfs(const char *path)
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -184,10 +185,11 @@ sync_pgdata(const char *pg_data,
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false, NULL);
+				walkdir(pg_data, pre_sync_fname, false, &sync_data_files);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false, NULL);
-				walkdir(pg_tblspc, pre_sync_fname, true, NULL);
+					walkdir(pg_wal, pre_sync_fname, false, &sync_data_files);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -200,10 +202,11 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false, NULL);
+				walkdir(pg_data, fsync_fname, false, &sync_data_files);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false, NULL);
-				walkdir(pg_tblspc, fsync_fname, true, NULL);
+					walkdir(pg_wal, fsync_fname, false, &sync_data_files);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
 			}
 			break;
 	}
@@ -296,7 +299,23 @@ walkdir(const char *path,
 				(*action) (subpath, false, arg);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false, arg);
+
+				/*
+				 * XXX: Checking here for the "sync_data_files" case is quite
+				 * hacky, but it's not clear how to do better.  Another option
+				 * would be to send "de" down to the function, but that would
+				 * introduce a huge number of function pointer calls and
+				 * directory reads that we are trying to avoid.
+				 */
+#ifdef PG_FLUSH_DATA_WORKS
+				if ((action != pre_sync_fname && action != fsync_fname) ||
+#else
+				if (action != fsync_fname ||
+#endif
+					!arg || *((bool *) arg) ||
+					strcmp(de->d_name, "base") != 0)
+					walkdir(subpath, action, false, arg);
+
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index c328f56a85..3743caa63e 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,8 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir, void *arg);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method,
+						bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v2-0005-Export-pre_sync_fname.patch
From c911cee97239542e14a489ed84ab3b46a047659e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 09:52:19 -0600
Subject: [PATCH v2 5/8] Export pre_sync_fname().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to hint to the file system
that we want a file's data on disk soon, so that subsequent calls to
fsync() are faster.
---
 src/common/file_utils.c         | 18 +++++-------------
 src/include/common/file_utils.h |  1 +
 2 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 65cdf07ae7..5c201ec6e8 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,10 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir, void *arg);
-#endif
-
 #ifdef HAVE_SYNCFS
 
 /*
@@ -307,11 +303,7 @@ walkdir(const char *path,
 				 * introduce a huge number of function pointer calls and
 				 * directory reads that we are trying to avoid.
 				 */
-#ifdef PG_FLUSH_DATA_WORKS
 				if ((action != pre_sync_fname && action != fsync_fname) ||
-#else
-				if (action != fsync_fname ||
-#endif
 					!arg || *((bool *) arg) ||
 					strcmp(de->d_name, "base") != 0)
 					walkdir(subpath, action, false, arg);
@@ -348,11 +340,12 @@ walkdir(const char *path,
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir, void *arg)
 {
+#ifndef PG_FLUSH_DATA_WORKS
+	return 0;
+#else
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -380,9 +373,8 @@ pre_sync_fname(const char *fname, bool isdir, void *arg)
 
 	(void) close(fd);
 	return 0;
-}
-
 #endif							/* PG_FLUSH_DATA_WORKS */
+}
 
 /*
  * fsync_fname -- Try to fsync a file or directory
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 3743caa63e..e7a34d4c4e 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -43,6 +43,7 @@ extern int	fsync_parent_path(const char *fname);
 extern void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir, void *arg),
 					bool process_symlinks, void *arg);
+extern int	pre_sync_fname(const char *fname, bool isdir, void *arg);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v2-0006-In-pg_upgrade-s-catalog-swap-mode-only-sync-files.patch
From 6565fea925c2bb51a03428fa0a40728588220eed Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:40:43 -0600
Subject: [PATCH v2 6/8] In pg_upgrade's catalog-swap mode, only sync files as
 necessary.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

In this mode, it can be much faster to use "--sync-method fsync",
which now skips synchronizing data files moved from the old cluster
(those are assumed to have been synchronized before pg_upgrade ran).
---
 src/bin/pg_upgrade/pg_upgrade.c    |  6 ++--
 src/bin/pg_upgrade/relfilenumber.c | 52 ++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..f5946ac89a 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -210,10 +210,12 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s %s",
 				  new_cluster.bindir,
 				  new_cluster.pgdata,
-				  user_opts.sync_method);
+				  user_opts.sync_method,
+				  (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+				  "--no-sync-data-files" : "");
 		check_ok();
 	}
 
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 9d8fce3c4a..dcca4bb2e7 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -25,8 +25,49 @@ typedef struct move_catalog_file_context
 	FileNameMap *maps;
 	int			size;
 	char	   *target;
+	bool		sync_moved;
 } move_catalog_file_context;
 
+#define SYNC_QUEUE_MAX_LEN (1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+		fsync_fname(sync_queue[i], false, NULL);
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false, NULL);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
 /*
  * transfer_all_new_tablespaces()
  *
@@ -138,6 +179,8 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	sync_queue_sync_all();
 }
 
 static int
@@ -195,6 +238,9 @@ move_catalog_file(const char *fname, bool isdir, void *arg)
 	if (rename(fname, dst) != 0)
 		pg_fatal("could not rename \"%s\" to \"%s\": %m", fname, dst);
 
+	if (context->sync_moved)
+		sync_queue_push(dst);
+
 	return 0;
 }
 
@@ -250,10 +296,12 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 	context.maps = maps;
 	context.size = size;
 	context.target = old_cat;
+	context.sync_moved = false;
 	walkdir(new_dat, move_catalog_file, false, &context);
 
 	/* move catalogs in moved-aside data dir in place */
 	context.target = new_dat;
+	context.sync_moved = (sync_method != DATA_DIR_SYNC_METHOD_SYNCFS);
 	walkdir(moved_dat, move_catalog_file, false, &context);
 
 	/* no need to sync things individually if we are going to syncfs() later */
@@ -265,6 +313,8 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 		pg_fatal("could not synchronize directory \"%s\": %m", moved_dat);
 	if (fsync_fname(old_cat, true, NULL) != 0)
 		pg_fatal("could not synchronize directory \"%s\": %m", old_cat);
+	if (fsync_fname(new_dat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
 
 	/*
 	 * XXX: We could instead fsync() these directories once at the end instead
@@ -276,6 +326,8 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 		pg_fatal("could not synchronize directory \"%s\": %m", old_tblspc);
 	if (fsync_fname(moved_tblspc, true, NULL) != 0)
 		pg_fatal("could not synchronize directory \"%s\": %m", moved_tblspc);
+	if (fsync_fname(new_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_tblspc);
 }
 
 /*
-- 
2.39.5 (Apple Git-154)

v2-0007-Add-sequence-data-flag-to-pg_dump.patch
From d60d05908793a1850c6e34979924d1502449bbfc Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:46:11 -0600
Subject: [PATCH v2 7/8] Add --sequence-data flag to pg_dump.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This flag can be used to optionally dump the sequence data even
when --schema-only is used.  It is primarily intended for use in a
follow-up commit that will cause sequence data files to be carried
over from the old cluster in pg_upgrade's new catalog-swap mode.
---
 src/bin/pg_dump/pg_dump.c                   | 9 +--------
 src/bin/pg_upgrade/dump.c                   | 2 +-
 src/test/modules/test_pg_dump/t/001_base.pl | 2 +-
 3 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index add7f16c90..a0810aaefd 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -507,6 +507,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -774,14 +775,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 8345f55be8..8453722833 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -53,7 +53,7 @@ generate_old_dump(void)
 
 		parallel_exec_prog(log_file_name, NULL,
 						   "\"%s/pg_dump\" %s --schema-only --quote-all-identifiers "
-						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
+						   "--binary-upgrade --sequence-data --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
 						   log_opts.dumpdir,
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index e2579e29cd..46231c93f1 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			"--file=$tempdir/binary_upgrade.sql", '--schema-only',
-			'--binary-upgrade', '--dbname=postgres',
+			'--binary-upgrade', '--sequence-data', '--dbname=postgres',
 		],
 	},
 	clean => {
-- 
2.39.5 (Apple Git-154)

v2-0008-Avoid-copying-sequence-files-in-pg_upgrade-s-cata.patch
From 0d7af5d0333b71d3f583077d3c2bd6cdb06fbf79 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:53:40 -0600
Subject: [PATCH v2 8/8] Avoid copying sequence files in pg_upgrade's
 catalog-swap mode.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

On clusters with many sequences, this can further reduce the amount
of time required to wire up the data files in the new cluster.  If
the sequence data file format changes, this optimization cannot be
used, but that seems rare enough.
---
 src/bin/pg_upgrade/dump.c | 8 +++++++-
 src/bin/pg_upgrade/info.c | 6 +++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 8453722833..d5a81cc29c 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -51,10 +51,16 @@ generate_old_dump(void)
 		snprintf(sql_file_name, sizeof(sql_file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
+		/*
+		 * XXX: We need to be sure that the sequence data format hasn't
+		 * changed.
+		 */
 		parallel_exec_prog(log_file_name, NULL,
 						   "\"%s/pg_dump\" %s --schema-only --quote-all-identifiers "
-						   "--binary-upgrade --sequence-data --format=custom %s --no-sync --file=\"%s/%s\" %s",
+						   "--binary-upgrade %s --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   log_opts.dumpdir,
 						   sql_file_name, escaped_connstr.data);
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index f83ded89cb..786d17e32f 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -483,6 +483,8 @@ get_rel_infos_query(void)
 	 * pg_largeobject contains user data that does not appear in pg_dump
 	 * output, so we have to copy that system table.  It's easiest to do that
 	 * by treating it as a user table.
+	 *
+	 * XXX: We need to be sure that the sequence data format hasn't changed.
 	 */
 	appendPQExpBuffer(&query,
 					  "WITH regular_heap (reloid, indtable, toastheap) AS ( "
@@ -490,7 +492,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +501,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
-- 
2.39.5 (Apple Git-154)

#7Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#6)
4 attachment(s)
Re: optimize file transfer in pg_upgrade

I've spent quite a bit of time recently trying to get this patch set into a
reasonable state. It's still a little rough around the edges, and the code
for the generated scripts is incomplete, but I figured I'd at least get
some CI testing going.

--
nathan

Attachments:

v3-0001-initdb-Add-no-sync-data-files.patch
From 0af23114cfe5d00ab0b69ff804bb92d58d485adb Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v3 1/4] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 20 +++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 89 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..14c401b9a99 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,26 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories and the
+        database directories themselves, i.e., everything in the
+        <filename>base</filename> subdirectory and any other tablespace
+        directories.  Other files, such as those in <literal>pg_wal</literal>
+        and <literal>pg_xact</literal>, will still be synchronized unless the
+        <option>--no-sync</option> option is also specified.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index dc0c805137a..bc94c114d27 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index e1acb6e933d..3bbd8f616cf 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 5864ec574fb..c0ec09485c3 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -420,7 +420,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)
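
A minimal sketch of the intended sync-only invocation (assuming this patch
is applied; "$PGDATA" is just a placeholder for the target data directory,
and the pg_upgrade side lands later in this series):

    initdb --sync-only --sync-method fsync --no-sync-data-files "$PGDATA"

This syncs pg_wal/, pg_xact/, and the rest of the data directory while
leaving base/ and any other tablespace directories for the caller to handle.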

v3-0002-pg_dump-Add-sequence-data.patch
From d344dfcc9b96253702025e551ee3e8dd720bb0d6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v3 2/4] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 1975054d7bf..b05f16995c3 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,6 +1289,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4f4ad2ee150..f63215eb3f9 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -517,6 +517,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,14 +804,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index c7bffc1b045..8ae6c5374fc 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index 9b2a90b0469..27c6c2ab0f3 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			"--file=$tempdir/binary_upgrade.sql", '--schema-only',
-			'--binary-upgrade', '--dbname=postgres',
+			'--sequence-data', '--binary-upgrade', '--dbname=postgres',
 		],
 	},
 	clean => {
-- 
2.39.5 (Apple Git-154)

v3-0003-Add-new-frontend-functions-for-durable-file-opera.patch
From 04063a995759c9f32bd87b0155c68a2c5fb346ed Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 26 Feb 2025 11:44:36 -0600
Subject: [PATCH v3 3/4] Add new frontend functions for durable file
 operations.

This commit exports the existing pre_sync_fname() function and adds
durable_mkdir_p() and durable_rename_dir() for use in frontend
programs.  A follow-up commit will use this to help optimize
pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 src/common/file_utils.c         | 55 +++++++++++++++++++++++++++------
 src/include/common/file_utils.h |  3 ++
 2 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..a5a03abd7ca 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -26,6 +26,7 @@
 
 #include "common/file_utils.h"
 #ifdef FRONTEND
+#include "common/file_perm.h"
 #include "common/logging.h"
 #endif
 #include "common/relpath.h"
@@ -45,9 +46,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +350,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +386,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
@@ -539,6 +536,46 @@ durable_rename(const char *oldfile, const char *newfile)
 	return 0;
 }
 
+/*
+ * durable_rename_dir: rename(2) wrapper for directories, issuing fsyncs
+ * required for durability.
+ */
+int
+durable_rename_dir(const char *olddir, const char *newdir)
+{
+	if (fsync_fname(olddir, true) != 0 ||
+		fsync_parent_path(olddir) != 0 ||
+		fsync_parent_path(newdir) != 0)
+		return -1;
+
+	if (rename(olddir, newdir) != 0)
+		return -1;
+
+	if (fsync_fname(newdir, true) != 0 ||
+		fsync_parent_path(olddir) != 0 ||
+		fsync_parent_path(newdir) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_mkdir_p: pg_mkdir_p() wrapper, issuing fsyncs required for
+ * durability.
+ */
+int
+durable_mkdir_p(char *newdir)
+{
+	if (pg_mkdir_p(newdir, pg_dir_create_mode) && errno != EEXIST)
+		return -1;
+
+	if (fsync_fname(newdir, true) != 0 ||
+		fsync_parent_path(newdir) != 0)
+		return -1;
+
+	return 0;
+}
+
 #endif							/* FRONTEND */
 
 /*
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..7d253a4cb51 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,11 +33,14 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
+extern int	durable_rename_dir(const char *olddir, const char *newdir);
+extern int	durable_mkdir_p(char *newdir);
 extern int	fsync_parent_path(const char *fname);
 #endif
 
-- 
2.39.5 (Apple Git-154)

v3-0004-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From c8540d235c0dc6cac817a3b9f3336c3336af5886 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 28 Feb 2025 13:00:50 -0600
Subject: [PATCH v3 4/4] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster.  For this reason,
pg_upgrade generates a script to perform the necessary steps.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  69 +++++-
 src/bin/pg_upgrade/.gitignore      |   2 +
 src/bin/pg_upgrade/Makefile        |   2 +-
 src/bin/pg_upgrade/check.c         |  82 ++++++-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  16 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |   4 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   4 +-
 src/bin/pg_upgrade/relfilenumber.c | 374 +++++++++++++++++++++++++++++
 11 files changed, 558 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7bdd85c5cff..6ca20f19ec2 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, this mode complicates reverting to the old cluster.  For
+        this reason, <application>pg_upgrade</application> generates a script
+        to perform the necessary steps.  See
+        <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +558,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but like link
+     mode, you will not be able to access your old cluster once you start the
+     new cluster after the upgrade.  Swap mode also requires that the old and
+     new cluster data directories be in the same file system.
     </para>
 
     <para>
@@ -889,6 +921,41 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the data directories
+        and their files might be moved between the old and new clusters:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborted before moving any data
+           directories or their files, the old cluster was unmodified; it can
+           be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If you did <emphasis>not</emphasis> start the new cluster, the
+           content of the database files was unmodified, but the data
+           directories and their files were moved between the old and new
+           clusters.  To reuse the old cluster, run the script that
+           <command>pg_upgrade</command> reported before the file transfer step.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If you did start the new cluster, it has written to the files, and
+           it is unsafe to use the old cluster.  The old cluster will need to be
+           restored from backup in this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/.gitignore b/src/bin/pg_upgrade/.gitignore
index a66166ea0fa..ea3a0046e51 100644
--- a/src/bin/pg_upgrade/.gitignore
+++ b/src/bin/pg_upgrade/.gitignore
@@ -3,6 +3,8 @@
 /delete_old_cluster.sh
 /delete_old_cluster.bat
 /reindex_hash.sql
+/revert_to_old_cluster.sh
+/revert_to_old_cluster.bat
 # Generated by test suite
 /log/
 /tmp_check/
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d309..67ac34443af 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -53,7 +53,7 @@ uninstall:
 clean distclean:
 	rm -f pg_upgrade$(X) $(OBJS)
 	rm -rf delete_old_cluster.sh log/ tmp_check/ \
-	       reindex_hash.sql
+	       reindex_hash.sql revert_to_old_cluster.sh
 
 export with_icu
 
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 88db8869b6e..9d27097ad94 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
@@ -928,6 +955,8 @@ check_for_new_tablespace_dir(void)
  * create_script_for_old_cluster_deletion()
  *
  *	This is particularly useful for tablespace deletion.
+ *
+ * XXX: DO WE NEED TO MODIFY THIS FOR SWAP MODE?
  */
 void
 create_script_for_old_cluster_deletion(char **deletion_script_file_name)
@@ -1046,6 +1075,57 @@ create_script_for_old_cluster_deletion(char **deletion_script_file_name)
 }
 
 
+/*
+ * create_script_for_swap_revert()
+ *
+ * Reverting to the old cluster when --swap is used is complicated, so we
+ * generate a script to make it easy.
+ */
+void
+create_script_for_swap_revert(void)
+{
+	char	   *script;
+	FILE	   *fd;
+
+	script = psprintf("%srevert_to_old_cluster.%s", SCRIPT_PREFIX, SCRIPT_EXT);
+
+	prep_status("Creating script to revert to old cluster");
+
+	if ((fd = fopen_priv(script, "w")) == NULL)
+		pg_fatal("could not open file \"%s\": %m", script);
+
+#ifndef WIN32
+	/* add shebang header */
+	fprintf(fd, "#!/bin/sh\n\n");
+#endif
+
+	/* handle default tablespace */
+	/* TODO */
+
+	/* handle alternate tablespaces */
+	for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+	{
+		/* TODO */
+	}
+
+	fclose(fd);
+
+#ifndef WIN32
+	if (chmod(script, S_IRWXU) != 0)
+		pg_fatal("could not add execute permission to file \"%s\": %m", script);
+#endif
+
+	check_ok();
+
+	/* report location of script to user */
+	pg_log(PG_REPORT, "\n"
+		   "    To revert to the old cluster, run this script before\n"
+		   "    starting the new cluster:\n"
+		   "        %s",
+		   script);
+}
+
+
 /*
  *	check_is_install_user()
  *
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..4fe784e8b94 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,18 +434,28 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
 
+	/* only used for --link and --swap */
+	Assert(transfer_mode == TRANSFER_MODE_LINK ||
+		   transfer_mode == TRANSFER_MODE_SWAP);
+
 	snprintf(existing_file, sizeof(existing_file), "%s/PG_VERSION", old_cluster.pgdata);
 	snprintf(new_link_file, sizeof(new_link_file), "%s/PG_VERSION.linktest", new_cluster.pgdata);
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..a538d407f74 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -212,8 +212,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index f4e375d27c7..9403c0ac78f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -385,6 +386,7 @@ void		output_completion_banner(char *deletion_script_file_name);
 void		check_cluster_versions(void);
 void		check_cluster_compatibility(void);
 void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
+void		create_script_for_swap_revert(void);
 
 
 /* controldata.c */
@@ -423,7 +425,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..059ef98350f 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,91 @@
 
 #include <sys/stat.h>
 
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +121,17 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We generate the revert script for this mode before starting
+			 * file transfer so that it can be used in the case of a crash
+			 * halfway through.
+			 */
+			create_script_for_swap_revert();
+
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +216,271 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function durably moves the database directories from the old cluster to
+ * the new cluster in preparation for moving the pg_restore-generated catalog
+ * files into place.  Returns false if the database with the given OID does not
+ * have a directory in the given tablespace, otherwise returns true.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		new_tablespace = old_tablespace;
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s_moved", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_cat", moved_tblspc, db_oid);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (durable_mkdir_p(moved_tblspc) != 0)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (durable_mkdir_p(old_cat) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (durable_rename_dir(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (durable_rename_dir(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/*
+	 * Move the old catalog files aside.
+	 */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key;
+
+			key.relfilenumber = (RelFileNumber) rfn;
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/*
+	 * Move the new catalog files into place.
+	 */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key;
+
+			key.relfilenumber = (RelFileNumber) rfn;
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the old cluster was
+		 * last shut down.
+		 */
+		sync_queue_push(dest);
+	}
+
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/*
+	 * Ensure the directory entries are persisted to disk.
+	 */
+	if (fsync_fname(old_cat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", old_cat);
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_fname(moved_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", moved_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +501,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +629,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
-- 
2.39.5 (Apple Git-154)

#8Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#1)
Re: optimize file transfer in pg_upgrade

On Wed, Nov 6, 2024 at 5:07 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

This is a cool idea.

Another interesting problem is that pg_upgrade currently doesn't transfer
the sequence data files. Since v10, we've restored these via pg_restore.
I believe this was originally done for the introduction of the pg_sequence
catalog, which changed the format of sequence tuples. In the new
catalog-swap mode I am proposing, this means we need to transfer all the
pg_restore-generated sequence data files. If there are many sequences, it
can be difficult to determine which transfer mode and synchronization
method will be faster. Since sequence tuple modifications are very rare, I
think the new catalog-swap mode should just use the sequence data files
from the old cluster whenever possible.

Maybe we should rethink the decision not to transfer relfilenodes for
sequences. Or have more than one way to do it. pg_upgrade
--binary-upgrade --binary-upgrade-even-for-sequences, or whatever.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#8)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 2:40 PM Robert Haas <robertmhaas@gmail.com> wrote:

Maybe we should rethink the decision not to transfer relfilenodes for
sequences. Or have more than one way to do it. pg_upgrade
--binary-upgrade --binary-upgrade-even-for-sequences, or whatever.

Sorry, I meant: pg_dump --binary-upgrade --binary-upgrade-even-for-sequences

i.e. pg_upgrade could decide which way to ask pg_dump to do it,
depending on versions and flags.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#9)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 02:41:22PM -0500, Robert Haas wrote:

On Fri, Feb 28, 2025 at 2:40 PM Robert Haas <robertmhaas@gmail.com> wrote:

Maybe we should rethink the decision not to transfer relfilenodes for
sequences. Or have more than one way to do it. pg_upgrade
--binary-upgrade --binary-upgrade-even-for-sequences, or whatever.

Sorry, I meant: pg_dump --binary-upgrade --binary-upgrade-even-for-sequences

i.e. pg_upgrade could decide which way to ask pg_dump to do it,
depending on versions and flags.

That's exactly where I landed (see v3-0002). I haven't measured whether
transferring relfilenodes or dumping the sequence data is faster for the
existing modes, but for now I've left those alone, i.e., they still dump
sequence data. The new "swap" mode just uses the old cluster's sequence
files, and I've disallowed using swap mode for upgrades from <v10 to avoid
the sequence tuple format change (along with other incompatible changes).

I'll admit I'm a bit concerned that this will cause problems if and when
someone wants to change the sequence tuple format again. But that hasn't
happened for a while, AFAIK nobody's planning to change it, and even if it
does happen, we just need to have my proposed new mode transfer the
sequence files like it transfers the catalog files. That will make this
mode slower, especially if you have a ton of sequences, but maybe it'll
still be a win in most cases. Of course, we probably will need to have
pg_upgrade handle other kinds of format changes, too, but IMHO it's still
worth trying to speed up pg_upgrade despite the potential future
complexities.

--
nathan

#11Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#10)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 3:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

That's exactly where I landed (see v3-0002). I haven't measured whether
transferring relfilenodes or dumping the sequence data is faster for the
existing modes, but for now I've left those alone, i.e., they still dump
sequence data. The new "swap" mode just uses the old cluster's sequence
files, and I've disallowed using swap mode for upgrades from <v10 to avoid
the sequence tuple format change (along with other incompatible changes).

Ah. Perhaps I should have read the thread more carefully before
commenting. Sounds good, at any rate.

I'll admit I'm a bit concerned that this will cause problems if and when
someone wants to change the sequence tuple format again. But that hasn't
happened for a while, AFAIK nobody's planning to change it, and even if it
does happen, we just need to have my proposed new mode transfer the
sequence files like it transfers the catalog files. That will make this
mode slower, especially if you have a ton of sequences, but maybe it'll
still be a win in most cases. Of course, we probably will need to have
pg_upgrade handle other kinds of format changes, too, but IMHO it's still
worth trying to speed up pg_upgrade despite the potential future
complexities.

I think it's fine. If somebody comes along and says "hey, when v23
came out Nathan's feature only sped up pg_upgrade by 2x instead of 3x
like it did for v22, so Nathan is a bad person," I think we can fairly
reply "thanks for sharing your opinion, feel free not to use the
feature and run at 1x speed". There's no rule saying that every
optimization must always produce the maximum possible benefit in every
scenario. We're just concerned about regressions, and "only delivers
some of the speedup if the sequence format has changed on disk" is not
a regression.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#11)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 03:37:49PM -0500, Robert Haas wrote:

On Fri, Feb 28, 2025 at 3:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

That's exactly where I landed (see v3-0002). I haven't measured whether
transferring relfilenodes or dumping the sequence data is faster for the
existing modes, but for now I've left those alone, i.e., they still dump
sequence data. The new "swap" mode just uses the old cluster's sequence
files, and I've disallowed using swap mode for upgrades from <v10 to avoid
the sequence tuple format change (along with other incompatible changes).

Ah. Perhaps I should have read the thread more carefully before
commenting. Sounds good, at any rate.

On the contrary, I'm glad you independently came to the same conclusion.

I'll admit I'm a bit concerned that this will cause problems if and when
someone wants to change the sequence tuple format again. But that hasn't
happened for a while, AFAIK nobody's planning to change it, and even if it
does happen, we just need to have my proposed new mode transfer the
sequence files like it transfers the catalog files. That will make this
mode slower, especially if you have a ton of sequences, but maybe it'll
still be a win in most cases. Of course, we probably will need to have
pg_upgrade handle other kinds of format changes, too, but IMHO it's still
worth trying to speed up pg_upgrade despite the potential future
complexities.

I think it's fine. If somebody comes along and says "hey, when v23
came out Nathan's feature only sped up pg_upgrade by 2x instead of 3x
like it did for v22, so Nathan is a bad person," I think we can fairly
reply "thanks for sharing your opinion, feel free not to use the
feature and run at 1x speed". There's no rule saying that every
optimization must always produce the maximum possible benefit in every
scenario. We're just concerned about regressions, and "only delivers
some of the speedup if the sequence format has changed on disk" is not
a regression.

Cool. I appreciate the design feedback.

--
nathan

#13Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#12)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 02:51:27PM -0600, Nathan Bossart wrote:

Cool. I appreciate the design feedback.

One other design point I wanted to bring up is whether we should bother
generating a rollback script for the new "swap" mode. In short, I'm
wondering if it would be unreasonable to say that, just for this mode, once
pg_upgrade enters the file transfer step, reverting to the old cluster
requires restoring a backup. I believe that's worth considering for the
following reasons:

* Anecdotally, I'm not sure I've ever actually seen pg_upgrade fail during
or after file transfer, and I'm hoping to get some real data about that
in the near future. Has anyone else dealt with such a failure? I
suspect that failures during file transfer are typically due to OS
crashes, power losses, etc., and hopefully those are rare.

* I've spent quite some time trying to generate a portable script, but it's
quite complicated and difficult to reason about its correctness. And I
haven't even started on the Windows version. Leaving this part out would
simplify the patch set quite a bit.

* If we give up the idea of reverting to the old cluster, we can also avoid
a bunch of intermediate fsync() calls, which I only included to help
reason about the state of the files in case of a failure halfway through.
This might not add up to much, but it's at least another area of
simplification.

Of course, rollback would still be possible, but you'd really need to
understand what "swap" mode does behind the scenes to do so safely. In any
case, I'm growing skeptical that a probably-not-super-well-tested script
that extremely few people will need and fewer will use is worth the
complexity.

--
nathan

#14Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#13)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 5, 2025 at 2:42 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

Of course, rollback would still be possible, but you'd really need to
understand what "swap" mode does behind the scenes to do so safely. In any
case, I'm growing skeptical that a probably-not-super-well-tested script
that extremely few people will need and fewer will use is worth the
complexity.

I don't have a super-strong view on what the right thing to do is
here, but I'm definitely in favor of not doing something half-baked.
If you think the revert script is going to suck, then let's not have
it at all and just be clear about what the requirements for using this
mode are.

One serious danger that you didn't mention here is that, if pg_upgrade
does fail, you may well try several times. And if you forget the
revert script at some point, or run it more than once, or run the
wrong version, you will be in trouble. I feel like this is something
someone could very easily mess up even if in theory it works
perfectly. Upgrades are often stressful times.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15Greg Sabino Mullane
htamfids@gmail.com
In reply to: Nathan Bossart (#13)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 5, 2025 at 2:43 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

One other design point I wanted to bring up is whether we should bother
generating a rollback script for the new "swap" mode. In short, I'm
wondering if it would be unreasonable to say that, just for this mode, once
pg_upgrade enters the file transfer step, reverting to the old cluster
requires restoring a backup.

I think that's a fair requirement. And like Robert, revert scripts make me
nervous.

* Anecdotally, I'm not sure I've ever actually seen pg_upgrade fail
during or after file transfer, and I'm hoping to get some real data about
that in the near future. Has anyone else dealt with such a failure?

I've seen various failures, but they always get caught quite early.
Certainly early enough to easily abort, fix perms/mounts/etc., then retry.
I think your instinct is correct that this reversion is more trouble than
it's worth. I don't think the pg_upgrade docs mention taking a backup, but
that's always step 0 in my playbook, and that's the rollback plan in the
unlikely event of failure.
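
For what it's worth, "step 0" in that playbook is nothing fancy; something
along these lines, with the paths below being examples only:

    # plain copy of the stopped old cluster
    pg_ctl -D /var/lib/postgresql/17/main stop
    cp -a /var/lib/postgresql/17/main /backups/pg17-pre-upgrade

    # or a base backup taken while the old cluster is still running
    pg_basebackup -D /backups/pg17-basebackup -Ft -X stream -P

Either way, that copy is the rollback plan, not anything pg_upgrade itself
produces.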

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

#16Nathan Bossart
nathandbossart@gmail.com
In reply to: Greg Sabino Mullane (#15)
3 attachment(s)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 05, 2025 at 03:40:52PM -0500, Greg Sabino Mullane wrote:

I've seen various failures, but they always get caught quite early.
Certainly early enough to easily abort, fix perms/mounts/etc., then retry.
I think your instinct is correct that this reversion is more trouble than
its worth. I don't think the pg_upgrade docs mention taking a backup, but
that's always step 0 in my playbook, and that's the rollback plan in the
unlikely event of failure.

Thank you, Greg and Robert, for sharing your thoughts. With that, here's
what I'm considering to be a reasonably complete patch set for this
feature. This leaves about a month for rigorous testing and editing, so
I'm hopeful it'll be ready for v18.
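
For anyone who wants to kick the tires: with the patches applied, it's just
the new --swap flag on top of a normal pg_upgrade run, e.g. (the binaries,
data directories, and job count below are placeholders):

    pg_upgrade \
      --old-bindir  /usr/lib/postgresql/17/bin \
      --new-bindir  /usr/lib/postgresql/18/bin \
      --old-datadir /var/lib/postgresql/17/main \
      --new-datadir /var/lib/postgresql/18/main \
      --jobs 8 --sync-method fsync --swap

Note that the docs in the swap patch recommend --sync-method=fsync with
--swap, since syncfs would otherwise spend time on the garbage files left
behind in the old cluster.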

--
nathan

Attachments:

v4-0001-initdb-Add-no-sync-data-files.patch (text/plain; charset=us-ascii)
From 6716e7b16a795911f55432dfd6d3c246aa8fd9fe Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v4 1/3] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 20 +++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 89 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..14c401b9a99 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,26 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories and the
+        database directories themselves, i.e., everything in the
+        <filename>base</filename> subdirectory and any other tablespace
+        directories.  Other files, such as those in <literal>pg_wal</literal>
+        and <literal>pg_xact</literal>, will still be synchronized unless the
+        <option>--no-sync</option> option is also specified.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index dc0c805137a..bc94c114d27 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 5864ec574fb..c0ec09485c3 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -420,7 +420,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v4-0002-pg_dump-Add-sequence-data.patch (text/plain; charset=us-ascii)
From e6dc183b8a80a32f6ca52b0d21a173d1b291deea Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v4 2/3] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 1975054d7bf..b05f16995c3 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,6 +1289,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4f4ad2ee150..f63215eb3f9 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -517,6 +517,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,14 +804,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index c7bffc1b045..8ae6c5374fc 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index 9b2a90b0469..27c6c2ab0f3 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			"--file=$tempdir/binary_upgrade.sql", '--schema-only',
-			'--binary-upgrade', '--dbname=postgres',
+			'--sequence-data', '--binary-upgrade', '--dbname=postgres',
 		],
 	},
 	clean => {
-- 
2.39.5 (Apple Git-154)

v4-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From ecd5b53daefd3187195e5a8fbf47e0a8a278bf30 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 5 Mar 2025 17:36:54 -0600
Subject: [PATCH v4 3/3] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster, so we recommend restoring
from backup upon failure during or after file transfer.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  59 ++++-
 src/bin/pg_upgrade/check.c         |  29 ++-
 src/bin/pg_upgrade/controldata.c   |  23 +-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  16 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   5 +-
 src/bin/pg_upgrade/relfilenumber.c | 358 +++++++++++++++++++++++++++++
 src/common/file_utils.c            |  14 +-
 src/include/common/file_utils.h    |   1 +
 12 files changed, 505 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7bdd85c5cff..08278232e71 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -889,6 +920,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 88db8869b6e..5405ed7bc8f 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..391ed3e1085 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,11 +751,15 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
 
+	/* only used for --link and --swap */
+	Assert(transfer_mode == TRANSFER_MODE_LINK ||
+		   transfer_mode == TRANSFER_MODE_SWAP);
+
 	/* rename pg_control so old server cannot be accidentally started */
 	prep_status("Adding \".old\" suffix to old global/pg_control");
 
@@ -766,10 +770,15 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..4fe784e8b94 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,18 +434,28 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
 
+	/* only used for --link and --swap */
+	Assert(transfer_mode == TRANSFER_MODE_LINK ||
+		   transfer_mode == TRANSFER_MODE_SWAP);
+
 	snprintf(existing_file, sizeof(existing_file), "%s/PG_VERSION", old_cluster.pgdata);
 	snprintf(new_link_file, sizeof(new_link_file), "%s/PG_VERSION.linktest", new_cluster.pgdata);
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index f4e375d27c7..120c38929d4 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..2abe90dd239 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,261 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function durably moves the database directories from the old cluster to
+ * the new cluster in preparation for moving the pg_restore-generated catalog
+ * files into place.  Returns false if the database with the given OID does not
+ * have a directory in the given tablespace, otherwise returns true.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		new_tablespace = old_tablespace;
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s_moved", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_cat", moved_tblspc, db_oid);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the old cluster
+		 * was last shut down.
+		 */
+		sync_queue_push(dest);
+	}
+
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +484,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +612,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..4405ef8b425 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)

#17Bruce Momjian
bruce@momjian.us
In reply to: Greg Sabino Mullane (#15)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 5, 2025 at 03:40:52PM -0500, Greg Sabino Mullane wrote:

On Wed, Mar 5, 2025 at 2:43 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

One other design point I wanted to bring up is whether we should bother
generating a rollback script for the new "swap" mode.  In short, I'm
wondering if it would be unreasonable to say that, just for this mode, once
pg_upgrade enters the file transfer step, reverting to the old cluster
requires restoring a backup.

I think that's a fair requirement. And, like Robert, I'm nervous about revert
scripts.

* Anecdotally, I'm not sure I've ever actually seen pg_upgrade fail
during or after file transfer, and I'm hoping to get some real data about
that in the near future.  Has anyone else dealt with such a failure?

I've seen various failures, but they always get caught quite early. Certainly
early enough to easily abort, fix perms/mounts/etc., then retry. I think your
instinct is correct that this reversion is more trouble than it's worth. I don't
think the pg_upgrade docs mention taking a backup, but that's always step 0 in
my playbook, and that's the rollback plan in the unlikely event of failure.

I avoided many optimizations in pg_upgrade for fear that they would lead
to hard-to-detect bugs or breakage from major release changes.
pg_upgrade is probably old enough now (15 years) that we can risk these
optimizations.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.

#18Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#16)
3 attachment(s)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 05, 2025 at 08:34:37PM -0600, Nathan Bossart wrote:

Thank you, Greg and Robert, for sharing your thoughts. With that, here's
what I'm considering to be a reasonably complete patch set for this
feature. This leaves about a month for rigorous testing and editing, so
I'm hopeful it'll be ready for v18.

Here are my notes after a round of self-review.

0001:
* The documentation does not adequately describe the interaction between
--no-sync-data-files and --sync-method=syncfs.
* I really don't like the exclude_dir hack for skipping the main tablespace
directory, but I haven't thought of anything that seems better.
* I should verify that there are no path separator issues on Windows for the
  exclude_dir hack (see the sketch below). From some quick code analysis, I
  think it should work fine, but I probably ought to test it out to be sure.
* The documentation needs to mention that the tablespace directories
themselves are not synchronized.
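
For the Windows path-separator question above, the kind of check I have in
mind would treat '/' and '\' as equivalent when comparing the excluded
directory against the path being walked.  A minimal sketch (hypothetical
helper, not part of the attached patches):

    static bool
    paths_equal_ignore_sep(const char *a, const char *b)
    {
        for (;; a++, b++)
        {
            /* normalize both separators before comparing */
            char        ca = (*a == '\\') ? '/' : *a;
            char        cb = (*b == '\\') ? '/' : *b;

            if (ca != cb)
                return false;
            if (ca == '\0')
                return true;
        }
    }

walkdir() could then call paths_equal_ignore_sep(exclude_dir, path) instead
of strcmp() for the exclusion test, if testing shows that's actually needed.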

0002:
* The documentation changes are subject to update based on ongoing stats
import/export work.
* Does --statistics-only --sequence-data make any sense? It seems like it
ought to function as expected, but it's hard to see a use-case.

0003:
* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.
* For check_hard_link() and disable_old_cluster(), move the Assert() to an
  "else" block with a pg_fatal() call for sturdiness (sketched at the end of
  these notes).
* I need to do a thorough pass-through on all comments. Many are not
sufficiently detailed.
* The "." and ".." checks in the catalog swap code are redundant and can be
removed.
* The directory for "moved-aside" stuff should be placed within the old
cluster's corresponding tablespace directory so that no changes need to
be made to delete_old_cluster.{sh,bat}.
* Manual testing with non-default tablespaces!
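
To illustrate the sturdier check mentioned above for check_hard_link() and
disable_old_cluster(), I'm thinking of roughly the following shape (a sketch
only; the reports are elided and the exact wording is not final):

    void
    disable_old_cluster(transferMode transfer_mode)
    {
        ...

        if (transfer_mode == TRANSFER_MODE_LINK)
        {
            /* report how to re-enable the old cluster, as today */
        }
        else if (transfer_mode == TRANSFER_MODE_SWAP)
        {
            /* report that the old cluster is no longer safe to start */
        }
        else
            pg_fatal("unrecognized transfer mode");
    }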

Updated patches based on these notes are attached.

--
nathan

Attachments:

v5-0001-initdb-Add-no-sync-data-files.patch (text/plain; charset=us-ascii)
From c0180c868c0e08c088ba40dbba071ce100a67b44 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v5 1/3] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 27 +++++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..2f1f9a42f90 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories, the
+        database directories themselves, and the tablespace directories, i.e.,
+        everything in the <filename>base</filename> subdirectory and any other
+        tablespace directories.  Other files, such as those in
+        <literal>pg_wal</literal> and <literal>pg_xact</literal>, will still be
+        synchronized unless the <option>--no-sync</option> option is also
+        specified.
+       </para>
+       <para>
+        Note that if <option>--no-sync-data-files</option> is used in
+        conjunction with <option>--sync-method=syncfs</option>, some or all of
+        the aforementioned files and directories will be synchronized because
+        <literal>syncfs</literal> processes entire file systems.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d4b4e334014..1da4bfc2351 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index d480dc74436..050260ee832 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -424,7 +424,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v5-0002-pg_dump-Add-sequence-data.patch (text/plain; charset=us-ascii)
From 7db4f4ff914d454c28ef9439e76fd16a0cd094a4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v5 2/3] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0ae40f9be58..63cca18711a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1298,6 +1298,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 428ed2d60fc..e6253331e27 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -518,6 +518,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -801,14 +802,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d281e27aa67..ed379033da7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index a9bcac4169d..adcaa419616 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			'--file' => "$tempdir/binary_upgrade.sql",
-			'--schema-only', '--binary-upgrade',
+			'--schema-only', '--sequence-data', '--binary-upgrade',
 			'--dbname' => 'postgres',
 		],
 	},
-- 
2.39.5 (Apple Git-154)

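To make the new flag's effect concrete, here is an illustrative TAP-style
check (a hedged sketch, not part of the posted patches; it assumes the
standard PostgreSQL::Test::Cluster and PostgreSQL::Test::Utils helpers): a
plain --schema-only dump omits sequence values, while adding
--sequence-data brings the usual setval() calls back.

use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

my $tempdir = PostgreSQL::Test::Utils::tempdir;

# One sequence whose current value should appear in the dump.
my $node = PostgreSQL::Test::Cluster->new('seqdata');
$node->init;
$node->start;
$node->safe_psql('postgres', "CREATE SEQUENCE s; SELECT nextval('s');");

# Schema-only dump, but ask for sequence data as well.
command_ok(
	[
		'pg_dump', '--schema-only', '--sequence-data',
		'--file' => "$tempdir/schema_with_seq.sql",
		'-d' => $node->connstr('postgres'),
	],
	'pg_dump --schema-only --sequence-data');

# The plain-format dump should now contain the setval() call.
my $dump = slurp_file("$tempdir/schema_with_seq.sql");
like($dump, qr/setval/, 'sequence value included despite --schema-only');

$node->stop;
done_testing();
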
v5-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From 26799878d4d4e0ad12b4e42c6d8e6b11296126a1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 5 Mar 2025 17:36:54 -0600
Subject: [PATCH v5 3/3] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster, so we recommend restoring
from backup upon failure during or after file transfer.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  59 ++++-
 src/bin/pg_upgrade/TESTING         |   6 +-
 src/bin/pg_upgrade/check.c         |  29 ++-
 src/bin/pg_upgrade/controldata.c   |  21 +-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  14 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   5 +-
 src/bin/pg_upgrade/relfilenumber.c | 364 +++++++++++++++++++++++++++++
 src/common/file_utils.c            |  14 +-
 src/include/common/file_utils.h    |   1 +
 13 files changed, 510 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 9ef7a84eed0..6deee1607ec 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -888,6 +919,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING
index 00842ac6ec3..c3d463c9c29 100644
--- a/src/bin/pg_upgrade/TESTING
+++ b/src/bin/pg_upgrade/TESTING
@@ -20,13 +20,13 @@ export oldinstall=...otherversion/	(old version's install base path)
 See DETAILS below for more information about creation of the dump.
 
 You can also test the different transfer modes (--copy, --link,
---clone, --copy-file-range) by setting the environment variable
+--clone, --copy-file-range, --swap) by setting the environment variable
 PG_TEST_PG_UPGRADE_MODE to the respective command-line option, like
 
 	make check PG_TEST_PG_UPGRADE_MODE=--link
 
-The default is --copy.  Note that the other modes are not supported on
-all operating systems.
+The default is --copy.  Note that not all modes are supported on all
+operating systems.
 
 DETAILS
 -------
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index d32fc3d88ec..81c91fc2912 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..47ee27ec835 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,7 +751,7 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
@@ -766,10 +766,17 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else if (transfer_mode == TRANSFER_MODE_SWAP)
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
+	else
+		pg_fatal("unrecognized transfer mode");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..91ed16acb08 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,7 +434,7 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
@@ -444,8 +444,16 @@ check_hard_link(void)
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else if (transfer_mode == TRANSFER_MODE_SWAP)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("unrecognized transfer mode");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 4c9d0172149..69c965bb7d0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..a87e6156911 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,267 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function durably moves the database directory from the old cluster to
+ * the new cluster in preparation for moving the pg_restore-generated catalog
+ * files into place.  Returns false if the database with the given OID does not
+ * have a directory in the given tablespace, otherwise returns true.
+ *
+ * old_cat (the directory for the old catalog files), new_dat (the database
+ * directory in the new cluster), and moved_dat (the destination for the
+ * pg_restore-generated database directory) should be sized to MAXPGPATH bytes.
+ * This function will return the appropriate paths in those variables.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		new_tablespace = old_tablespace;
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	/* Old and new cluster paths. */
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+
+	/*
+	 * Paths for "moved aside" stuff.  We intentionally put these in the old
+	 * cluster so that the delete_old_cluster.{sh,bat} script handles them.
+	 */
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s/moved_for_upgrade", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_catalogs", moved_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\": %m", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\": %m", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.  Note that this code snippet is lifted from
+ * parse_filename_for_nontemp_relation().
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the old cluster
+		 * was last shut down.
+		 */
+		if (user_opts.do_sync)
+			sync_queue_push(dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted by relfilenumber.  maps should already be sorted by
+	 * OID, so in general this shouldn't have much work to do.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +490,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +618,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..4405ef8b425 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)

#19Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#18)
Re: optimize file transfer in pg_upgrade

On Mon, Mar 17, 2025 at 3:34 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.

It would be better if we didn't need a separate buildfarm animal to
test this, because that means you won't get notified by local testing
OR by CI if you break this. Can we instead have one test that checks
this as part of the normal test run?

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#19)
3 attachment(s)
Re: optimize file transfer in pg_upgrade

On Mon, Mar 17, 2025 at 04:04:45PM -0400, Robert Haas wrote:

On Mon, Mar 17, 2025 at 3:34 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.

It would be better if we didn't need a separate buildfarm animal to
test this, because that means you won't get notified by local testing
OR by CI if you break this. Can we instead have one test that checks
this which is part of the normal test run?

That's what I set out to do before I discovered PG_TEST_PG_UPGRADE_MODE.
The commit message for b059a24 seemed to indicate that we don't want to
automatically test all supported modes, but I agree that it would be nice
to have some basic coverage for --swap on CI/buildfarm regardless of
PG_TEST_PG_UPGRADE_MODE. How about we add a simple TAP test (attached)?
I still plan on switching a buildfarm animal to --swap for more in-depth
testing.
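
For the archives, roughly what I have in mind looks like the following
(an abbreviated, illustrative sketch only; the attached 006_swap.pl is
the authoritative version, and the setup details below are approximate):

use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# Old cluster with a little data to carry across the upgrade.
my $old = PostgreSQL::Test::Cluster->new('old');
$old->init;
$old->start;
$old->safe_psql('postgres',
	'CREATE TABLE t (a int); INSERT INTO t VALUES (1), (2), (3);');
$old->stop;

# The new cluster starts out empty; pg_upgrade fills it in.
my $new = PostgreSQL::Test::Cluster->new('new');
$new->init;

command_ok(
	[
		'pg_upgrade', '--no-sync',
		'-d' => $old->data_dir,
		'-D' => $new->data_dir,
		'-b' => $old->config_data('--bindir'),
		'-B' => $new->config_data('--bindir'),
		'-s' => $new->host,
		'-p' => $old->port,
		'-P' => $new->port,
		'--swap',
	],
	'pg_upgrade --swap');

$new->start;
is( $new->safe_psql('postgres', 'SELECT count(*) FROM t'),
	'3', 'data survived the swap');
$new->stop;

done_testing();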

--
nathan

Attachments:

v6-0001-initdb-Add-no-sync-data-files.patch (text/plain; charset=us-ascii)
From 0a228f150d101ef2ffe38c88fe290e313142a2d9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v6 1/3] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 27 +++++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..2f1f9a42f90 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories, the
+        database directories themselves, and the tablespace directories, i.e.,
+        everything in the <filename>base</filename> subdirectory and any other
+        tablespace directories.  Other files, such as those in
+        <literal>pg_wal</literal> and <literal>pg_xact</literal>, will still be
+        synchronized unless the <option>--no-sync</option> option is also
+        specified.
+       </para>
+       <para>
+        Note that if <option>--no-sync-data-files</option> is used in
+        conjunction with <option>--sync-method=syncfs</option>, some or all of
+        the aforementioned files and directories will be synchronized because
+        <literal>syncfs</literal> processes entire file systems.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d4b4e334014..1da4bfc2351 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index d480dc74436..050260ee832 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -424,7 +424,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v6-0002-pg_dump-Add-sequence-data.patch (text/plain; charset=us-ascii)
From e4c414ce5202efcb86f74f8d41c926575656a527 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v6 2/3] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0ae40f9be58..63cca18711a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1298,6 +1298,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 428ed2d60fc..e6253331e27 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -518,6 +518,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -801,14 +802,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d281e27aa67..ed379033da7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index a9bcac4169d..adcaa419616 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			'--file' => "$tempdir/binary_upgrade.sql",
-			'--schema-only', '--binary-upgrade',
+			'--schema-only', '--sequence-data', '--binary-upgrade',
 			'--dbname' => 'postgres',
 		],
 	},
-- 
2.39.5 (Apple Git-154)

v6-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From 267867927687279840742b76d58580ac5efb45ea Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 5 Mar 2025 17:36:54 -0600
Subject: [PATCH v6 3/3] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster, so we recommend restoring
from backup upon failure during or after file transfer.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  59 ++++-
 src/bin/pg_upgrade/TESTING         |   6 +-
 src/bin/pg_upgrade/check.c         |  29 ++-
 src/bin/pg_upgrade/controldata.c   |  21 +-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  14 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/meson.build     |   1 +
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   5 +-
 src/bin/pg_upgrade/relfilenumber.c | 364 +++++++++++++++++++++++++++++
 src/bin/pg_upgrade/t/006_swap.pl   |  42 ++++
 src/common/file_utils.c            |  14 +-
 src/include/common/file_utils.h    |   1 +
 15 files changed, 553 insertions(+), 34 deletions(-)
 create mode 100644 src/bin/pg_upgrade/t/006_swap.pl

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 9ef7a84eed0..6deee1607ec 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -888,6 +919,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING
index 00842ac6ec3..c3d463c9c29 100644
--- a/src/bin/pg_upgrade/TESTING
+++ b/src/bin/pg_upgrade/TESTING
@@ -20,13 +20,13 @@ export oldinstall=...otherversion/	(old version's install base path)
 See DETAILS below for more information about creation of the dump.
 
 You can also test the different transfer modes (--copy, --link,
---clone, --copy-file-range) by setting the environment variable
+--clone, --copy-file-range, --swap) by setting the environment variable
 PG_TEST_PG_UPGRADE_MODE to the respective command-line option, like
 
 	make check PG_TEST_PG_UPGRADE_MODE=--link
 
-The default is --copy.  Note that the other modes are not supported on
-all operating systems.
+The default is --copy.  Note that not all modes are supported on all
+operating systems.
 
 DETAILS
 -------
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index d32fc3d88ec..81c91fc2912 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..47ee27ec835 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,7 +751,7 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
@@ -766,10 +766,17 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else if (transfer_mode == TRANSFER_MODE_SWAP)
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
+	else
+		pg_fatal("unrecognized transfer mode");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..91ed16acb08 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,7 +434,7 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
@@ -444,8 +444,16 @@ check_hard_link(void)
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else if (transfer_mode == TRANSFER_MODE_SWAP)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("unrecognized transfer mode");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..a4a5eb82690 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_swap.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 4c9d0172149..69c965bb7d0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..a87e6156911 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,267 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function durably moves the database directory from the old cluster to
+ * the new cluster in preparation for moving the pg_restore-generated catalog
+ * files into place.  Returns false if the database with the given OID does not
+ * have a directory in the given tablespace, otherwise returns true.
+ *
+ * old_cat (the directory for the old catalog files), new_dat (the database
+ * directory in the new cluster), and moved_dat (the destination for the
+ * pg_restore-generated database directory) should be sized to MAXPGPATH bytes.
+ * This function will return the appropriate paths in those variables.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		new_tablespace = old_tablespace;
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	/* Old and new cluster paths. */
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+
+	/*
+	 * Paths for "moved aside" stuff.  We intentionally put these in the old
+	 * cluster so that the delete_old_cluster.{sh,bat} script handles them.
+	 */
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s/moved_for_upgrade", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_catalogs", moved_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.  Note that this code snippet is lifted from
+ * parse_filename_for_nontemp_relation().
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the old cluster was
+		 * last shut down.
+		 */
+		if (user_opts.do_sync)
+			sync_queue_push(dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted by relfilenumber.  maps should already be sorted by
+	 * OID, so in general this shouldn't have much work to do.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +490,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +618,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/bin/pg_upgrade/t/006_swap.pl b/src/bin/pg_upgrade/t/006_swap.pl
new file mode 100644
index 00000000000..5ab0cc1dc00
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_swap.pl
@@ -0,0 +1,42 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for --swap
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize old and new clusters
+my $old = PostgreSQL::Test::Cluster->new('old');
+my $new = PostgreSQL::Test::Cluster->new('new');
+$old->init();
+$new->init();
+
+$old->start;
+$old->safe_psql('postgres', "CREATE TABLE test AS SELECT generate_series(1, 5432)");
+$old->stop;
+
+# pg_upgrade should be successful.
+command_ok(
+	[
+		'pg_upgrade', '--no-sync',
+		'--old-datadir' => $old->data_dir,
+		'--new-datadir' => $new->data_dir,
+		'--old-bindir' => $old->config_data('--bindir'),
+		'--new-bindir' => $new->config_data('--bindir'),
+		'--socketdir' => $new->host,
+		'--old-port' => $old->port,
+		'--new-port' => $new->port,
+		'--swap'
+	],
+	'run of pg_upgrade --swap');
+
+$new->start;
+my $result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test");
+is($result, '5432', 'table data after pg_upgrade --swap');
+$new->stop;
+
+done_testing();
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..4405ef8b425 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)
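To illustrate the flush-hint-then-fsync pattern that the sync_queue_* helpers
above rely on (each file is first handed to the kernel as a cheap writeback
hint via pre_sync_fname(), and the blocking fsync() calls are issued later in
a batch), here is a minimal standalone sketch.  This is not the patch's code:
it assumes Linux's sync_file_range() for the hint, and the relation file
names are hypothetical.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Ask the kernel to start writing the file back, without waiting. */
static void
hint_writeback(const char *path)
{
	int		fd = open(path, O_RDONLY);

	if (fd < 0)
		return;				/* ignore unreadable files */
#if defined(__linux__)
	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#endif
	close(fd);
}

/* Block until the file is durably on disk. */
static void
fsync_file(const char *path)
{
	int		fd = open(path, O_RDONLY);

	if (fd < 0 || fsync(fd) != 0)
	{
		perror(path);
		exit(1);
	}
	close(fd);
}

int
main(void)
{
	/* hypothetical relation file names */
	const char *files[] = {"16384", "16385", "16386"};
	int		nfiles = sizeof(files) / sizeof(files[0]);

	/* cheap, non-blocking hints first ... */
	for (int i = 0; i < nfiles; i++)
		hint_writeback(files[i]);

	/* ... then the blocking syncs, which should find most pages written */
	for (int i = 0; i < nfiles; i++)
		fsync_file(files[i]);

	return 0;
}

Issuing fsync() immediately after each file instead would serialize the
waits; batching lets the kernel overlap the writeback across files.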

#21Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#20)
Re: optimize file transfer in pg_upgrade

On Mon, Mar 17, 2025 at 4:30 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Mar 17, 2025 at 04:04:45PM -0400, Robert Haas wrote:

On Mon, Mar 17, 2025 at 3:34 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.

It would be better if we didn't need a separate buildfarm animal to
test this, because that means you won't get notified by local testing
OR by CI if you break this. Can we instead have one test that checks
this which is part of the normal test run?

That's what I set out to do before I discovered PG_TEST_PG_UPGRADE_MODE.
The commit message for b059a24 seemed to indicate that we don't want to
automatically test all supported modes, but I agree that it would be nice
to have some basic coverage for --swap on CI/buildfarm regardless of
PG_TEST_PG_UPGRADE_MODE. How about we add a simple TAP test (attached),
and I still plan on switching a buildfarm animal to --swap for more
in-depth testing?

The background here is that I'm kind of on the warpath against weird
configurations that we only test on certain buildfarm animals at the
moment, because the result of that is that CI is clean and then the
buildfarm turns red when you commit. That's an unenjoyable experience
for the committer and for everyone who looks at the buildfarm results.
The way to fix it is to stop relying on "rerun all the tests with this
weird mode flag" and rely more on tests that are designed to test that
specific flag and, ideally, that get run by in local testing or at
least by CI.

I'm not quite sure what the best thing to do is for the pg_upgrade
tests in particular, and it may well be best to do as you propose for
now and figure that out later. But I question whether just rerunning
all of those tests with several different mode flags is the right
thing to do. Why, for example, does 005_char_signedness.pl need to be
checked under both --link and --clone? I would guess that there are
one or maybe two tests in src/bin/pg_upgrade/t that need to test
--link and --clone, and they should grow internal loops to do that
(when supported by the local platform) and PG_TEST_PG_UPGRADE_MODE
should go in the garbage.

--
Robert Haas
EDB: http://www.enterprisedb.com

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#21)
Re: optimize file transfer in pg_upgrade

Robert Haas <robertmhaas@gmail.com> writes:

I'm not quite sure what the best thing to do is for the pg_upgrade
tests in particular, and it may well be best to do as you propose for
now and figure that out later. But I question whether just rerunning
all of those tests with several different mode flags is the right
thing to do. Why, for example, does 005_char_signedness.pl need to be
checked under both --link and --clone? I would guess that there are
one or maybe two tests in src/bin/pg_upgrade/t that need to test
--link and --clone, and they should grow internal loops to do that
(when supported by the local platform) and PG_TEST_PG_UPGRADE_MODE
should go in the garbage.

+1

I'd be particularly allergic to running 002_pg_upgrade.pl multiple
times, as that's one of our most expensive tests, and I flat out
don't believe that expending that many cycles could be justified.
Surely we can test these modes sufficiently in some much cheaper and
more targeted way.

regards, tom lane

#23Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
Re: optimize file transfer in pg_upgrade

Hi,

On 2025-03-18 10:04:41 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I'm not quite sure what the best thing to do is for the pg_upgrade
tests in particular, and it may well be best to do as you propose for
now and figure that out later. But I question whether just rerunning
all of those tests with several different mode flags is the right
thing to do. Why, for example, does 005_char_signedness.pl need to be
checked under both --link and --clone? I would guess that there are
one or maybe two tests in src/bin/pg_upgrade/t that need to test
--link and --clone, and they should grow internal loops to do that
(when supported by the local platform) and PG_TEST_PG_UPGRADE_MODE
should go in the garbage.

+1

I'd be particularly allergic to running 002_pg_upgrade.pl multiple
times, as that's one of our most expensive tests, and I flat out
don't believe that expending that many cycles could be justified.
Surely we can test these modes sufficiently in some much cheaper and
more targeted way.

+1

It's useful to have coverage of as many object types as possible in pg_upgrade
- hence 002_pg_upgrade.pl. It helps us to find problems in new code that
didn't think about pg_upgrade.

But that doesn't mean that it's a good idea to run all other pg_upgrade tests
the same way, to the contrary - the cost is too high.

Even leaving runtime aside, I have a hard time believing that --link, --clone,
--swap benefit from running the same way as 002_pg_upgrade.pl - the
implementation of those flags is on a lower level and works the same across
e.g. different index AMs.

I'd go so far as to say that 002_pg_upgrade.pl-style testing actually makes it
*harder* to diagnose problems related to things like --link, because there are
no targeted tests, just a huge set of things that might let you infer the bug
if you spend a lot of time.

Greetings,

Andres Freund

#24Álvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Robert Haas (#21)
Re: optimize file transfer in pg_upgrade

On 2025-Mar-18, Robert Haas wrote:

The background here is that I'm kind of on the warpath against weird
configurations that we only test on certain buildfarm animals at the
moment, because the result of that is that CI is clean and then the
buildfarm turns red when you commit. That's an unenjoyable experience
for the committer and for everyone who looks at the buildfarm results.
The way to fix it is to stop relying on "rerun all the tests with this
weird mode flag" and rely more on tests that are designed to test that
specific flag and, ideally, that get run by in local testing or at
least by CI.

FWIW this is exactly the rationale that got me writing an email on
Ashutosh's thread for a new pg_dump/restore test under
002_pg_upgrade.pl, whereby I was saying that we should not hide it
behind PG_TEST_EXTRA which almost nobody would remember to use. But I
discarded that draft, because that had actually been Ashutosh's idea at
some point in the thread and had been discarded because of the runtime
increase it'd cause. But, somehow, I still don't believe the theory
that it's such a bad idea to add a few seconds so that we have such a
comprehensive pg_dump test, with much less programmer overhead than
pg_dump's own weird enormous test script.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Cada quien es cada cual y baja las escaleras como quiere" (JMSerrat)

#25Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#23)
1 attachment(s)
Re: optimize file transfer in pg_upgrade

On Tue, Mar 18, 2025 at 10:12:51AM -0400, Andres Freund wrote:

On 2025-03-18 10:04:41 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I'm not quite sure what the best thing to do is for the pg_upgrade
tests in particular, and it may well be best to do as you propose for
now and figure that out later. But I question whether just rerunning
all of those tests with several different mode flags is the right
thing to do. Why, for example, does 005_char_signedness.pl need to be
checked under both --link and --clone? I would guess that there are
one or maybe two tests in src/bin/pg_upgrade/t that need to test
--link and --clone, and they should grow internal loops to do that
(when supported by the local platform) and PG_TEST_PG_UPGRADE_MODE
should go in the garbage.

+1

I'd be particularly allergic to running 002_pg_upgrade.pl multiple
times, as that's one of our most expensive tests, and I flat out
don't believe that expending that many cycles could be justified.
Surely we can test these modes sufficiently in some much cheaper and
more targeted way.

+1

Here is a first sketch at a test that cycles through all the transfer modes
and makes sure they succeed or fail with an error along the lines of "not
supported on this platform." Each test verifies that some very simple
objects make it to the new version, which we could of course expand on.
Would something like this suffice?

--
nathan

Attachments:

0001-Add-test-for-pg_upgrade-file-transfer-modes.patch (text/plain; charset=us-ascii)
From 89b27b68194c2be0e1aebdee871e556028cdd5b5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 18 Mar 2025 12:21:03 -0500
Subject: [PATCH 1/1] Add test for pg_upgrade file transfer modes.

---
 src/bin/pg_upgrade/meson.build           |  1 +
 src/bin/pg_upgrade/t/006_modes.pl        | 63 ++++++++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm | 19 +++++++
 src/test/perl/PostgreSQL/Test/Utils.pm   | 25 ++++++++++
 4 files changed, 108 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_modes.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..16cd9247e76 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_modes.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
new file mode 100644
index 00000000000..77ddf042ce0
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -0,0 +1,63 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for file transfer modes
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_mode
+{
+	my ($mode) = @_;
+
+	my $old = PostgreSQL::Test::Cluster->new('old');
+	my $new = PostgreSQL::Test::Cluster->new('new');
+
+	$old->init();
+	$new->init();
+
+	$old->start;
+	$old->safe_psql('postgres', "CREATE TABLE test AS SELECT generate_series(1, 100)");
+	$old->safe_psql('postgres', "CREATE DATABASE test");
+	$old->safe_psql('test', "CREATE SEQUENCE test START 5432");
+	$old->stop;
+
+	my $result = command_ok_or_fails_like(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $old->data_dir,
+			'--new-datadir' => $new->data_dir,
+			'--old-bindir' => $old->config_data('--bindir'),
+			'--new-bindir' => $new->config_data('--bindir'),
+			'--socketdir' => $new->host,
+			'--old-port' => $old->port,
+			'--new-port' => $new->port,
+			$mode
+		],
+		qr/.* not supported on this platform/,
+		qr/^$/,
+		"pg_upgrade with transfer mode $mode");
+
+	if ($result)
+	{
+		$new->start;
+		$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test");
+		is($result, '100', "table data after pg_upgrade $mode");
+		$result = $new->safe_psql('test', "SELECT nextval('test')");
+		is($result, '5432', "sequence data after pg_upgrade $mode");
+		$new->stop;
+	}
+
+	$old->clean_node();
+	$new->clean_node();
+}
+
+test_mode('--clone');
+test_mode('--copy');
+test_mode('--copy-file-range');
+test_mode('--link');
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index bab3f3d2dbe..bda19bbcee2 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2801,6 +2801,25 @@ sub command_fails_like
 
 =pod
 
+=item $node->command_ok_or_fails_like(...)
+
+PostgreSQL::Test::Utils::command_ok_or_fails_like with our connection parameters. See command_ok(...)
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $self = shift;
+
+	local %ENV = $self->_get_env();
+
+	return PostgreSQL::Test::Utils::command_ok_or_fails_like(@_);
+}
+
+=pod
+
 =item $node->command_checks_all(...)
 
 PostgreSQL::Test::Utils::command_checks_all with our connection parameters. See
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index d1ad131eadf..7d7ca83495f 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -89,6 +89,7 @@ our @EXPORT = qw(
   command_like
   command_like_safe
   command_fails_like
+  command_ok_or_fails_like
   command_checks_all
 
   $windows_os
@@ -1067,6 +1068,30 @@ sub command_fails_like
 
 =pod
 
+=item command_ok_or_fails_like(cmd, expected_stdout, expected_stderr, test_name)
+
+Check that the command either succeeds or fails with an error that matches the
+given regular expressions.
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($cmd, $expected_stdout, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
+	if (!$result)
+	{
+		like($stdout, $expected_stdout, "$test_name: stdout matches");
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	return $result;
+}
+
+=pod
+
 =item command_checks_all(cmd, ret, out, err, test_name)
 
 Run a command and check its status and outputs.
-- 
2.39.5 (Apple Git-154)

#26Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#25)
Re: optimize file transfer in pg_upgrade

Hi,

On 2025-03-18 12:29:02 -0500, Nathan Bossart wrote:

Here is a first sketch at a test that cycles through all the transfer modes
and makes sure they succeed or fail with an error along the lines of "not
supported on this platform." Each test verifies that some very simple
objects make it to the new version, which we could of course expand on.
Would something like this suffice?

I'd add a few more complications:

- Create and test a relation that was rewritten, to ensure we test the
relfilenode != oid case and one that isn't rewritten.

- Perhaps create a tablespace?

- Do we need a new old cluster for each of the modes? That seems like wasted
time? I guess it's required for --link...

Greetings,

Andres Freund

#27Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#26)
Re: optimize file transfer in pg_upgrade

On Tue, Mar 18, 2025 at 01:37:02PM -0400, Andres Freund wrote:

I'd add a few more complications:

- Create and test a relation that was rewritten, to ensure we test the
relfilenode != oid case and one that isn't rewritten.

+1

- Perhaps create a tablespace?

+1, I don't think we have much, if any, coverage of pg_upgrade with
non-default tablespaces.

- Do we need a new old cluster for each of the modes? That seems like wasted
time? I guess it's required for --link...

It'll also be needed for --swap. We could optionally save the old cluster
for a couple of modes if we really wanted to. *shrug*

I'll work on the first two...

--
nathan

#28Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#27)
Re: optimize file transfer in pg_upgrade

On 2025-03-18 12:47:01 -0500, Nathan Bossart wrote:

On Tue, Mar 18, 2025 at 01:37:02PM -0400, Andres Freund wrote:

- Do we need a new old cluster for each of the modes? That seems like wasted
time? I guess it's required for --link...

It'll also be needed for --swap. We could optionally save the old cluster
for a couple of modes if we really wanted to. *shrug*

Don't worry about it, I think the template initdb stuff should make it cheap enough...

#29Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#28)
1 attachment(s)
Re: optimize file transfer in pg_upgrade

On Tue, Mar 18, 2025 at 01:50:10PM -0400, Andres Freund wrote:

On 2025-03-18 12:47:01 -0500, Nathan Bossart wrote:

On Tue, Mar 18, 2025 at 01:37:02PM -0400, Andres Freund wrote:

- Do we need a new old cluster for each of the modes? That seems like wasted
time? I guess it's required for --link...

It'll also be needed for --swap. We could optionally save the old cluster
for a couple of modes if we really wanted to. *shrug*

Don't worry about it, I think the template initdb stuff should make it cheap enough...

Cool. I realize now why there's poor coverage for pg_upgrade with
tablespaces: you can't upgrade between the same version with tablespaces
(presumably due to the version-specific subdirectory conflict). I don't
know if the regression tests leave around any tablespaces for the
cross-version pg_upgrade tests, but that's probably the best we can do at
the moment.

For now, here's a new version of the test with a rewritten table. I also
tried to fix the expected error regex to handle some of the other error
messages for unsupported modes (as revealed by cfbot).

--
nathan

Attachments:

v2-0001-Add-test-for-pg_upgrade-file-transfer-modes.patch (text/plain; charset=us-ascii)
From c4b7816955cfb5d331851e14f9a93cbb182f4d1e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 18 Mar 2025 12:21:03 -0500
Subject: [PATCH v2 1/1] Add test for pg_upgrade file transfer modes.

---
 src/bin/pg_upgrade/meson.build           |  1 +
 src/bin/pg_upgrade/t/006_modes.pl        | 67 ++++++++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm | 19 +++++++
 src/test/perl/PostgreSQL/Test/Utils.pm   | 25 +++++++++
 4 files changed, 112 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_modes.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..16cd9247e76 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_modes.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
new file mode 100644
index 00000000000..468591fc486
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -0,0 +1,67 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for file transfer modes
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_mode
+{
+	my ($mode) = @_;
+
+	my $old = PostgreSQL::Test::Cluster->new('old');
+	my $new = PostgreSQL::Test::Cluster->new('new');
+
+	$old->init();
+	$new->init();
+
+	$old->start;
+	$old->safe_psql('postgres', "CREATE TABLE test1 AS SELECT generate_series(1, 100)");
+	$old->safe_psql('postgres', "CREATE DATABASE testdb");
+	$old->safe_psql('testdb', "CREATE TABLE test2 AS SELECT generate_series(200, 300)");
+	$old->safe_psql('testdb', "VACUUM FULL test2");
+	$old->safe_psql('testdb', "CREATE SEQUENCE testseq START 5432");
+	$old->stop;
+
+	my $result = command_ok_or_fails_like(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $old->data_dir,
+			'--new-datadir' => $new->data_dir,
+			'--old-bindir' => $old->config_data('--bindir'),
+			'--new-bindir' => $new->config_data('--bindir'),
+			'--socketdir' => $new->host,
+			'--old-port' => $old->port,
+			'--new-port' => $new->port,
+			$mode
+		],
+		qr/.* not supported on this platform|could not .* between old and new data directories: .*/,
+		qr/^$/,
+		"pg_upgrade with transfer mode $mode");
+
+	if ($result)
+	{
+		$new->start;
+		$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test1");
+		is($result, '100', "test1 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb', "SELECT COUNT(*) FROM test2");
+		is($result, '101', "test2 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb', "SELECT nextval('testseq')");
+		is($result, '5432', "sequence data after pg_upgrade $mode");
+		$new->stop;
+	}
+
+	$old->clean_node();
+	$new->clean_node();
+}
+
+test_mode('--clone');
+test_mode('--copy');
+test_mode('--copy-file-range');
+test_mode('--link');
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index bab3f3d2dbe..bda19bbcee2 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2801,6 +2801,25 @@ sub command_fails_like
 
 =pod
 
+=item $node->command_ok_or_fails_like(...)
+
+PostgreSQL::Test::Utils::command_ok_or_fails_like with our connection parameters. See command_ok(...)
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $self = shift;
+
+	local %ENV = $self->_get_env();
+
+	return PostgreSQL::Test::Utils::command_ok_or_fails_like(@_);
+}
+
+=pod
+
 =item $node->command_checks_all(...)
 
 PostgreSQL::Test::Utils::command_checks_all with our connection parameters. See
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index d1ad131eadf..7d7ca83495f 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -89,6 +89,7 @@ our @EXPORT = qw(
   command_like
   command_like_safe
   command_fails_like
+  command_ok_or_fails_like
   command_checks_all
 
   $windows_os
@@ -1067,6 +1068,30 @@ sub command_fails_like
 
 =pod
 
+=item command_ok_or_fails_like(cmd, expected_stdout, expected_stderr, test_name)
+
+Check that the command either succeeds or fails with an error that matches the
+given regular expressions.
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($cmd, $expected_stdout, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
+	if (!$result)
+	{
+		like($stdout, $expected_stdout, "$test_name: stdout matches");
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	return $result;
+}
+
+=pod
+
 =item command_checks_all(cmd, ret, out, err, test_name)
 
 Run a command and check its status and outputs.
-- 
2.39.5 (Apple Git-154)

#30Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#29)
4 attachment(s)
Re: optimize file transfer in pg_upgrade

On Tue, Mar 18, 2025 at 02:08:42PM -0500, Nathan Bossart wrote:

For now, here's a new version of the test with a rewritten table. I also
tried to fix the expected error regex to handle some of the other error
messages for unsupported modes (as revealed by cfbot).

And here is a new version of the full patch set.

--
nathan

Attachments:

v7-0001-Add-test-for-pg_upgrade-file-transfer-modes.patch (text/plain; charset=us-ascii)
From 8b6a5e0148c2f7a663f5003f12ae9461d2b06a5c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 18 Mar 2025 20:58:07 -0500
Subject: [PATCH v7 1/4] Add test for pg_upgrade file transfer modes.

This new test checks all of pg_upgrade's file transfer modes.  For
each mode, we verify that pg_upgrade either succeeds (and some test
objects successfully reach the new version) or fails with an error
that indicates the mode is not supported on the current platform.

Suggested-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 src/bin/pg_upgrade/meson.build           |  1 +
 src/bin/pg_upgrade/t/006_modes.pl        | 67 ++++++++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm | 19 +++++++
 src/test/perl/PostgreSQL/Test/Utils.pm   | 25 +++++++++
 4 files changed, 112 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_modes.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..16cd9247e76 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_modes.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
new file mode 100644
index 00000000000..468591fc486
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -0,0 +1,67 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for file transfer modes
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_mode
+{
+	my ($mode) = @_;
+
+	my $old = PostgreSQL::Test::Cluster->new('old');
+	my $new = PostgreSQL::Test::Cluster->new('new');
+
+	$old->init();
+	$new->init();
+
+	$old->start;
+	$old->safe_psql('postgres', "CREATE TABLE test1 AS SELECT generate_series(1, 100)");
+	$old->safe_psql('postgres', "CREATE DATABASE testdb");
+	$old->safe_psql('testdb', "CREATE TABLE test2 AS SELECT generate_series(200, 300)");
+	$old->safe_psql('testdb', "VACUUM FULL test2");
+	$old->safe_psql('testdb', "CREATE SEQUENCE testseq START 5432");
+	$old->stop;
+
+	my $result = command_ok_or_fails_like(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $old->data_dir,
+			'--new-datadir' => $new->data_dir,
+			'--old-bindir' => $old->config_data('--bindir'),
+			'--new-bindir' => $new->config_data('--bindir'),
+			'--socketdir' => $new->host,
+			'--old-port' => $old->port,
+			'--new-port' => $new->port,
+			$mode
+		],
+		qr/.* not supported on this platform|could not .* between old and new data directories: .*/,
+		qr/^$/,
+		"pg_upgrade with transfer mode $mode");
+
+	if ($result)
+	{
+		$new->start;
+		$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test1");
+		is($result, '100', "test1 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb', "SELECT COUNT(*) FROM test2");
+		is($result, '101', "test2 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb', "SELECT nextval('testseq')");
+		is($result, '5432', "sequence data after pg_upgrade $mode");
+		$new->stop;
+	}
+
+	$old->clean_node();
+	$new->clean_node();
+}
+
+test_mode('--clone');
+test_mode('--copy');
+test_mode('--copy-file-range');
+test_mode('--link');
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 05bd94609d4..8759ed2cbba 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2801,6 +2801,25 @@ sub command_fails_like
 
 =pod
 
+=item $node->command_ok_or_fails_like(...)
+
+PostgreSQL::Test::Utils::command_ok_or_fails_like with our connection parameters. See command_ok(...)
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $self = shift;
+
+	local %ENV = $self->_get_env();
+
+	return PostgreSQL::Test::Utils::command_ok_or_fails_like(@_);
+}
+
+=pod
+
 =item $node->command_checks_all(...)
 
 PostgreSQL::Test::Utils::command_checks_all with our connection parameters. See
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index d1ad131eadf..7d7ca83495f 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -89,6 +89,7 @@ our @EXPORT = qw(
   command_like
   command_like_safe
   command_fails_like
+  command_ok_or_fails_like
   command_checks_all
 
   $windows_os
@@ -1067,6 +1068,30 @@ sub command_fails_like
 
 =pod
 
+=item command_ok_or_fails_like(cmd, expected_stdout, expected_stderr, test_name)
+
+Check that the command either succeeds or fails with an error that matches the
+given regular expressions.
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($cmd, $expected_stdout, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
+	if (!$result)
+	{
+		like($stdout, $expected_stdout, "$test_name: stdout matches");
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	return $result;
+}
+
+=pod
+
 =item command_checks_all(cmd, ret, out, err, test_name)
 
 Run a command and check its status and outputs.
-- 
2.39.5 (Apple Git-154)

v7-0002-initdb-Add-no-sync-data-files.patch (text/plain; charset=us-ascii)
From 70770a018ef4d800ce5fccc302d164522ff5c278 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v7 2/4] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 27 +++++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..2f1f9a42f90 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories, the
+        database directories themselves, and the tablespace directories, i.e.,
+        everything in the <filename>base</filename> subdirectory and any other
+        tablespace directories.  Other files, such as those in
+        <literal>pg_wal</literal> and <literal>pg_xact</literal>, will still be
+        synchronized unless the <option>--no-sync</option> option is also
+        specified.
+       </para>
+       <para>
+        Note that if <option>--no-sync-data-files</option> is used in
+        conjunction with <option>--sync-method=syncfs</option>, some or all of
+        the aforementioned files and directories will be synchronized because
+        <literal>syncfs</literal> processes entire file systems.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d4b4e334014..1da4bfc2351 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index d480dc74436..050260ee832 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -424,7 +424,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v7-0003-pg_dump-Add-sequence-data.patch (text/plain; charset=us-ascii)
From 887ca6a0cb221016a9a366d45f6d3b60c538fe3a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v7 3/4] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0ae40f9be58..63cca18711a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1298,6 +1298,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 428ed2d60fc..e6253331e27 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -518,6 +518,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -801,14 +802,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d281e27aa67..ed379033da7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index a9bcac4169d..adcaa419616 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			'--file' => "$tempdir/binary_upgrade.sql",
-			'--schema-only', '--binary-upgrade',
+			'--schema-only', '--sequence-data', '--binary-upgrade',
 			'--dbname' => 'postgres',
 		],
 	},
-- 
2.39.5 (Apple Git-154)

v7-0004-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From 73f9af4c0068c9003e229762ee91efe88949db3d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 5 Mar 2025 17:36:54 -0600
Subject: [PATCH v7 4/4] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster, so we recommend restoring
from backup upon failure during or after file transfer.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  59 ++++-
 src/bin/pg_upgrade/TESTING         |   6 +-
 src/bin/pg_upgrade/check.c         |  29 ++-
 src/bin/pg_upgrade/controldata.c   |  21 +-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  14 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   5 +-
 src/bin/pg_upgrade/relfilenumber.c | 364 +++++++++++++++++++++++++++++
 src/bin/pg_upgrade/t/006_modes.pl  |   1 +
 src/common/file_utils.c            |  14 +-
 src/include/common/file_utils.h    |   1 +
 14 files changed, 511 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 5db761d1ff1..da261619043 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -889,6 +920,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING
index 00842ac6ec3..c3d463c9c29 100644
--- a/src/bin/pg_upgrade/TESTING
+++ b/src/bin/pg_upgrade/TESTING
@@ -20,13 +20,13 @@ export oldinstall=...otherversion/	(old version's install base path)
 See DETAILS below for more information about creation of the dump.
 
 You can also test the different transfer modes (--copy, --link,
---clone, --copy-file-range) by setting the environment variable
+--clone, --copy-file-range, --swap) by setting the environment variable
 PG_TEST_PG_UPGRADE_MODE to the respective command-line option, like
 
 	make check PG_TEST_PG_UPGRADE_MODE=--link
 
-The default is --copy.  Note that the other modes are not supported on
-all operating systems.
+The default is --copy.  Note that not all modes are supported on all
+operating systems.
 
 DETAILS
 -------
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 88daa808035..564a9116ca5 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..47ee27ec835 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,7 +751,7 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
@@ -766,10 +766,17 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else if (transfer_mode == TRANSFER_MODE_SWAP)
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
+	else
+		pg_fatal("unrecognized transfer mode");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..91ed16acb08 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,7 +434,7 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
@@ -444,8 +444,16 @@ check_hard_link(void)
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else if (transfer_mode == TRANSFER_MODE_SWAP)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("unrecognized transfer mode");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 4c9d0172149..69c965bb7d0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..a87e6156911 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,267 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function durably moves the database directory from the old cluster to
+ * the new cluster in preparation for moving the pg_restore-generated catalog
+ * files into place.  Returns false if the database with the given OID does not
+ * have a directory in the given tablespace, otherwise returns true.
+ *
+ * old_cat (the directory for the old catalog files), new_dat (the database
+ * directory in the new cluster), and moved_dat (the destination for the
+ * pg_restore-generated database directory) should be sized to MAXPGPATH bytes.
+ * This function will return the appropriate paths in those variables.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		new_tablespace = old_tablespace;
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	/* Old and new cluster paths. */
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+
+	/*
+	 * Paths for "moved aside" stuff.  We intentionally put these in the old
+	 * cluster so that the delete_old_cluster.{sh,bat} script handles them.
+	 */
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s/moved_for_upgrade", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_catalogs", moved_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.  Note that this code snippet is lifted from
+ * parse_filename_for_nontemp_relation().
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the user last
+		 * shut it down.
+		 */
+		if (user_opts.do_sync)
+			sync_queue_push(dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted by relfilenumber.  maps should already be sorted by
+	 * OID, so in general this shouldn't have much work to do.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +490,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +618,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
index 468591fc486..34f362c3fea 100644
--- a/src/bin/pg_upgrade/t/006_modes.pl
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -63,5 +63,6 @@ test_mode('--clone');
 test_mode('--copy');
 test_mode('--copy-file-range');
 test_mode('--link');
+test_mode('--swap');
 
 done_testing();
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..4405ef8b425 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)

#31Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#30)
Re: optimize file transfer in pg_upgrade

On Tue, Mar 18, 2025 at 09:14:22PM -0500, Nathan Bossart wrote:

And here is a new version of the full patch set.

I'm currently planning to commit this sometime early-ish next week. One
notable loose end is the lack of a pg_upgrade test with a non-default
tablespace, but that is an existing problem that IMHO is best handled
separately (since we can only test it in cross-version upgrades).

--
nathan

#32Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#30)
Re: optimize file transfer in pg_upgrade

Hi,

On 2025-03-18 21:14:22 -0500, Nathan Bossart wrote:

From 8b6a5e0148c2f7a663f5003f12ae9461d2b06a5c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 18 Mar 2025 20:58:07 -0500
Subject: [PATCH v7 1/4] Add test for pg_upgrade file transfer modes.

This new test checks all of pg_upgrade's file transfer modes. For
each mode, we verify that pg_upgrade either succeeds (and some test
objects successfully reach the new version) or fails with an error
that indicates the mode is not supported on the current platform.

LGTM. I'm sure we could do more than the test does today, but I think it's a
good improvement.

Greetings,

Andres Freund

#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#31)
Re: optimize file transfer in pg_upgrade

Nathan Bossart <nathandbossart@gmail.com> writes:

I'm currently planning to commit this sometime early-ish next week. One
notable loose end is the lack of a pg_upgrade test with a non-default
tablespace, but that is an existing problem that IMHO is best handled
separately (since we can only test it in cross-version upgrades).

Agreed that that shouldn't block this, but we need some kind of
plan for testing it better.

regards, tom lane

#34Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#33)
Re: optimize file transfer in pg_upgrade

Hi,

On 2025-03-19 12:28:33 -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I'm currently planning to commit this sometime early-ish next week. One
notable loose end is the lack of a pg_upgrade test with a non-default
tablespace, but that is an existing problem that IMHO is best handled
separately (since we can only test it in cross-version upgrades).

Agreed that that shouldn't block this, but we need some kind of
plan for testing it better.

Yea, this is really suboptimal.

Shouldn't allow_in_place_tablespaces be sufficient to deal with that scenario?
Or at least it should make it reasonably easy to cope if it doesn't already
suffice?

Greetings,

Andres Freund

#35Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#34)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 19, 2025 at 12:44:38PM -0400, Andres Freund wrote:

On 2025-03-19 12:28:33 -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I'm currently planning to commit this sometime early-ish next week. One
notable loose end is the lack of a pg_upgrade test with a non-default
tablespace, but that is an existing problem that IMHO is best handled
separately (since we can only test it in cross-version upgrades).

Agreed that that shouldn't block this, but we need some kind of
plan for testing it better.

Yea, this is really suboptimal.

Shouldn't allow_in_place_tablespaces be sufficient to deal with that scenario?
Or at least it should make it reasonably easy to cope if it doesn't already
suffice?

Unfortunately, pg_upgrade can't yet handle in-place tablespaces. One
reason is that pg_tablespace_location() returns a relative path for those
(e.g., "pg_tblspc/123456"). We'd also need to adjust init_tablespaces() to
not fail if all the tablespaces are in-place. There may be other reasons,
too. I'm confident we could get it working, but I'm not too excited about
trying to sneak this into v18.
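
For illustration, a minimal SQL sketch of that scenario (the tablespace
name and OID below are made up):

    SET allow_in_place_tablespaces = on;
    CREATE TABLESPACE regress_inplace LOCATION '';
    SELECT pg_tablespace_location(oid)
      FROM pg_tablespace WHERE spcname = 'regress_inplace';
    -- reports a relative path such as 'pg_tblspc/16385' rather than an
    -- absolute directory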

In addition to testing with in-place tablespaces, we might also want to
teach the transfer modes test to do cross-version testing when possible.
In that case, we can test normal (non-in-place) tablespaces. However, that
would be limited to the buildfarm.

Does this seem like a reasonable plan for v19?

--
nathan

#36Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#35)
1 attachment(s)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 19, 2025 at 02:32:01PM -0500, Nathan Bossart wrote:

In addition to testing with in-place tablespaces, we might also want to
teach the transfer modes test to do cross-version testing when possible.
In that case, we can test normal (non-in-place) tablespaces. However, that
would be limited to the buildfarm.

Actually, this one was pretty easy to do.

--
nathan

Attachments:

0001-Add-test-for-pg_upgrade-file-transfer-modes.patch (text/plain; charset=us-ascii)
From cc9bcd456f0d7d6ab19a89813755a3e76993cfb9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 18 Mar 2025 20:58:07 -0500
Subject: [PATCH 1/1] Add test for pg_upgrade file transfer modes.

This new test checks all of pg_upgrade's file transfer modes.  For
each mode, we verify that pg_upgrade either succeeds (and some test
objects successfully reach the new version) or fails with an error
that indicates the mode is not supported on the current platform.

Suggested-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 src/bin/pg_upgrade/meson.build           |  1 +
 src/bin/pg_upgrade/t/006_modes.pl        | 89 ++++++++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm | 19 +++++
 src/test/perl/PostgreSQL/Test/Utils.pm   | 25 +++++++
 4 files changed, 134 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_modes.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..16cd9247e76 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_modes.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
new file mode 100644
index 00000000000..85098e86245
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -0,0 +1,89 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for file transfer modes
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_mode
+{
+	my ($mode) = @_;
+
+	my $old = PostgreSQL::Test::Cluster->new('old', install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new('new');
+
+	if (defined($ENV{oldinstall}))
+	{
+		$old->init(force_initdb => 1, extra => ['-k']);
+	}
+	else
+	{
+		$old->init();
+	}
+	$new->init();
+
+	$old->start;
+	$old->safe_psql('postgres', "CREATE TABLE test1 AS SELECT generate_series(1, 100)");
+	$old->safe_psql('postgres', "CREATE DATABASE testdb1");
+	$old->safe_psql('testdb1', "CREATE TABLE test2 AS SELECT generate_series(200, 300)");
+	$old->safe_psql('testdb1', "VACUUM FULL test2");
+	$old->safe_psql('testdb1', "CREATE SEQUENCE testseq START 5432");
+	if (defined($ENV{oldinstall}))
+	{
+		my $tblspc = PostgreSQL::Test::Utils::tempdir_short();
+		$old->safe_psql('postgres', "CREATE TABLESPACE test_tblspc LOCATION '$tblspc'");
+		$old->safe_psql('postgres', "CREATE DATABASE testdb2 TABLESPACE test_tblspc");
+		$old->safe_psql('postgres', "CREATE TABLE test3 TABLESPACE test_tblspc AS SELECT generate_series(300, 401)");
+		$old->safe_psql('testdb2', "CREATE TABLE test4 AS SELECT generate_series(400, 502)");
+	}
+	$old->stop;
+
+	my $result = command_ok_or_fails_like(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $old->data_dir,
+			'--new-datadir' => $new->data_dir,
+			'--old-bindir' => $old->config_data('--bindir'),
+			'--new-bindir' => $new->config_data('--bindir'),
+			'--socketdir' => $new->host,
+			'--old-port' => $old->port,
+			'--new-port' => $new->port,
+			$mode
+		],
+		qr/.* not supported on this platform|could not .* between old and new data directories: .*/,
+		qr/^$/,
+		"pg_upgrade with transfer mode $mode");
+
+	if ($result)
+	{
+		$new->start;
+		$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test1");
+		is($result, '100', "test1 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb1', "SELECT COUNT(*) FROM test2");
+		is($result, '101', "test2 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb1', "SELECT nextval('testseq')");
+		is($result, '5432', "sequence data after pg_upgrade $mode");
+		if (defined($ENV{oldinstall}))
+		{
+			$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test3");
+			is($result, '102', "test3 data after pg_upgrade $mode");
+			$result = $new->safe_psql('testdb2', "SELECT COUNT(*) FROM test4");
+			is($result, '103', "test4 data after pg_upgrade $mode");
+		}
+		$new->stop;
+	}
+
+	$old->clean_node();
+	$new->clean_node();
+}
+
+test_mode('--clone');
+test_mode('--copy');
+test_mode('--copy-file-range');
+test_mode('--link');
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 05bd94609d4..8759ed2cbba 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2801,6 +2801,25 @@ sub command_fails_like
 
 =pod
 
+=item $node->command_ok_or_fails_like(...)
+
+PostgreSQL::Test::Utils::command_ok_or_fails_like with our connection parameters. See command_ok(...)
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $self = shift;
+
+	local %ENV = $self->_get_env();
+
+	return PostgreSQL::Test::Utils::command_ok_or_fails_like(@_);
+}
+
+=pod
+
 =item $node->command_checks_all(...)
 
 PostgreSQL::Test::Utils::command_checks_all with our connection parameters. See
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index d1ad131eadf..7d7ca83495f 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -89,6 +89,7 @@ our @EXPORT = qw(
   command_like
   command_like_safe
   command_fails_like
+  command_ok_or_fails_like
   command_checks_all
 
   $windows_os
@@ -1067,6 +1068,30 @@ sub command_fails_like
 
 =pod
 
+=item command_ok_or_fails_like(cmd, expected_stdout, expected_stderr, test_name)
+
+Check that the command either succeeds or fails with an error that matches the
+given regular expressions.
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($cmd, $expected_stdout, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
+	if (!$result)
+	{
+		like($stdout, $expected_stdout, "$test_name: stdout matches");
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	return $result;
+}
+
+=pod
+
 =item command_checks_all(cmd, ret, out, err, test_name)
 
 Run a command and check its status and outputs.
-- 
2.39.5 (Apple Git-154)

#37Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#36)
4 attachment(s)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 19, 2025 at 04:28:23PM -0500, Nathan Bossart wrote:

On Wed, Mar 19, 2025 at 02:32:01PM -0500, Nathan Bossart wrote:

In addition to testing with in-place tablespaces, we might also want to
teach the transfer modes test to do cross-version testing when possible.
In that case, we can test normal (non-in-place) tablespaces. However, that
would be limited to the buildfarm.

Actually, this one was pretty easy to do.

And here is yet another new version of the full patch set. I'm planning to
commit 0001 (the new pg_upgrade transfer mode test) tomorrow so that I can
deal with any buildfarm indigestion before committing swap mode. I did run
the test locally for upgrades from v9.6, v13, and v17, but who knows what
unique configurations I've failed to anticipate...
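
(For anyone who wants to reproduce the cross-version runs, here's a rough
sketch with a placeholder install path; the details are in
src/bin/pg_upgrade/TESTING:

	# point the TAP tests at an existing old-version installation
	export oldinstall=/path/to/old/version/install
	cd src/bin/pg_upgrade
	make check PROVE_TESTS='t/006_modes.pl'

Without oldinstall set, the test upgrades between two clusters built from the
current sources and skips the tablespace checks.)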

--
nathan

Attachments:

v8-0001-Add-test-for-pg_upgrade-file-transfer-modes.patch (text/plain)
From 5b5fbd87faac7041ad5dd2defacd29cf1eaf6397 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Mar 2025 20:24:41 -0500
Subject: [PATCH v8 1/4] Add test for pg_upgrade file transfer modes.

This new test checks all of pg_upgrade's file transfer modes.  For
each mode, we verify that pg_upgrade either succeeds (and some test
objects successfully reach the new version) or fails with an error
that indicates the mode is not supported on the current platform.
For cross-version tests, we also check that pg_upgrade transfers
non-default tablespaces correctly.  (Tablespaces can't be tested on
same-version upgrades because of the version-specific subdirectory
conflict, but we might be able to enable such tests once we teach
pg_upgrade how to handle in-place tablespaces.)

Suggested-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 src/bin/pg_upgrade/meson.build           |   1 +
 src/bin/pg_upgrade/t/006_modes.pl        | 101 +++++++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm |  19 +++++
 src/test/perl/PostgreSQL/Test/Utils.pm   |  25 ++++++
 4 files changed, 146 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_modes.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..16cd9247e76 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_modes.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
new file mode 100644
index 00000000000..518e0994145
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -0,0 +1,101 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for file transfer modes
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_mode
+{
+	my ($mode) = @_;
+
+	my $old = PostgreSQL::Test::Cluster->new('old', install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new('new');
+
+	if (defined($ENV{oldinstall}))
+	{
+		# Checksums are now enabled by default, but weren't before 18, so pass
+		# '-k' to initdb on older versions so that upgrades work.
+		$old->init(extra => ['-k']);
+	}
+	else
+	{
+		$old->init();
+	}
+	$new->init();
+
+	# Create a small variety of simple test objects on the old cluster.  We'll
+	# check that these reach the new version after upgrading.
+	$old->start;
+	$old->safe_psql('postgres', "CREATE TABLE test1 AS SELECT generate_series(1, 100)");
+	$old->safe_psql('postgres', "CREATE DATABASE testdb1");
+	$old->safe_psql('testdb1', "CREATE TABLE test2 AS SELECT generate_series(200, 300)");
+	$old->safe_psql('testdb1', "VACUUM FULL test2");
+	$old->safe_psql('testdb1', "CREATE SEQUENCE testseq START 5432");
+
+	# For cross-version tests, we can also check that pg_upgrade handles
+	# tablespaces.
+	if (defined($ENV{oldinstall}))
+	{
+		my $tblspc = PostgreSQL::Test::Utils::tempdir_short();
+		$old->safe_psql('postgres', "CREATE TABLESPACE test_tblspc LOCATION '$tblspc'");
+		$old->safe_psql('postgres', "CREATE DATABASE testdb2 TABLESPACE test_tblspc");
+		$old->safe_psql('postgres', "CREATE TABLE test3 TABLESPACE test_tblspc AS SELECT generate_series(300, 401)");
+		$old->safe_psql('testdb2', "CREATE TABLE test4 AS SELECT generate_series(400, 502)");
+	}
+	$old->stop;
+
+	my $result = command_ok_or_fails_like(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $old->data_dir,
+			'--new-datadir' => $new->data_dir,
+			'--old-bindir' => $old->config_data('--bindir'),
+			'--new-bindir' => $new->config_data('--bindir'),
+			'--socketdir' => $new->host,
+			'--old-port' => $old->port,
+			'--new-port' => $new->port,
+			$mode
+		],
+		qr/.* not supported on this platform|could not .* between old and new data directories: .*/,
+		qr/^$/,
+		"pg_upgrade with transfer mode $mode");
+
+	# If pg_upgrade was successful, check that all of our test objects reached
+	# the new version.
+	if ($result)
+	{
+		$new->start;
+		$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test1");
+		is($result, '100', "test1 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb1', "SELECT COUNT(*) FROM test2");
+		is($result, '101', "test2 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb1', "SELECT nextval('testseq')");
+		is($result, '5432', "sequence data after pg_upgrade $mode");
+
+		# For cross-version tests, we should have some objects in a non-default
+		# tablespace.
+		if (defined($ENV{oldinstall}))
+		{
+			$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test3");
+			is($result, '102', "test3 data after pg_upgrade $mode");
+			$result = $new->safe_psql('testdb2', "SELECT COUNT(*) FROM test4");
+			is($result, '103', "test4 data after pg_upgrade $mode");
+		}
+		$new->stop;
+	}
+
+	$old->clean_node();
+	$new->clean_node();
+}
+
+test_mode('--clone');
+test_mode('--copy');
+test_mode('--copy-file-range');
+test_mode('--link');
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 05bd94609d4..8759ed2cbba 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2801,6 +2801,25 @@ sub command_fails_like
 
 =pod
 
+=item $node->command_ok_or_fails_like(...)
+
+PostgreSQL::Test::Utils::command_ok_or_fails_like with our connection parameters. See command_ok(...)
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $self = shift;
+
+	local %ENV = $self->_get_env();
+
+	return PostgreSQL::Test::Utils::command_ok_or_fails_like(@_);
+}
+
+=pod
+
 =item $node->command_checks_all(...)
 
 PostgreSQL::Test::Utils::command_checks_all with our connection parameters. See
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index d1ad131eadf..7d7ca83495f 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -89,6 +89,7 @@ our @EXPORT = qw(
   command_like
   command_like_safe
   command_fails_like
+  command_ok_or_fails_like
   command_checks_all
 
   $windows_os
@@ -1067,6 +1068,30 @@ sub command_fails_like
 
 =pod
 
+=item command_ok_or_fails_like(cmd, expected_stdout, expected_stderr, test_name)
+
+Check that the command either succeeds or fails with an error that matches the
+given regular expressions.
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($cmd, $expected_stdout, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
+	if (!$result)
+	{
+		like($stdout, $expected_stdout, "$test_name: stdout matches");
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	return $result;
+}
+
+=pod
+
 =item command_checks_all(cmd, ret, out, err, test_name)
 
 Run a command and check its status and outputs.
-- 
2.39.5 (Apple Git-154)

v8-0002-initdb-Add-no-sync-data-files.patch (text/plain)
From 1afc1225ce3e49b1da3d97ada50fa01444bdafc4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v8 2/4] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 27 +++++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..2f1f9a42f90 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories, the
+        database directories themselves, and the tablespace directories, i.e.,
+        everything in the <filename>base</filename> subdirectory and any other
+        tablespace directories.  Other files, such as those in
+        <literal>pg_wal</literal> and <literal>pg_xact</literal>, will still be
+        synchronized unless the <option>--no-sync</option> option is also
+        specified.
+       </para>
+       <para>
+        Note that if <option>--no-sync-data-files</option> is used in
+        conjuction with <option>--sync-method=syncfs</option>, some or all of
+        the aforementioned files and directories will be synchronized because
+        <literal>syncfs</literal> processes entire file systems.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d4b4e334014..1da4bfc2351 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index d480dc74436..050260ee832 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -424,7 +424,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v8-0003-pg_dump-Add-sequence-data.patch (text/plain)
From 4325f2786554c79480993284117bb583298127a3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v8 3/4] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0ae40f9be58..63cca18711a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1298,6 +1298,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 428ed2d60fc..e6253331e27 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -518,6 +518,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -801,14 +802,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d281e27aa67..ed379033da7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index a9bcac4169d..adcaa419616 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			'--file' => "$tempdir/binary_upgrade.sql",
-			'--schema-only', '--binary-upgrade',
+			'--schema-only', '--sequence-data', '--binary-upgrade',
 			'--dbname' => 'postgres',
 		],
 	},
-- 
2.39.5 (Apple Git-154)

v8-0004-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain)
From 0bb275bea08d724a32d3f5154cd5d583b9c87ace Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 5 Mar 2025 17:36:54 -0600
Subject: [PATCH v8 4/4] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster, so we recommend restoring
from backup upon failure during or after file transfer.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  59 ++++-
 src/bin/pg_upgrade/TESTING         |   6 +-
 src/bin/pg_upgrade/check.c         |  29 ++-
 src/bin/pg_upgrade/controldata.c   |  21 +-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  14 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   5 +-
 src/bin/pg_upgrade/relfilenumber.c | 371 +++++++++++++++++++++++++++++
 src/bin/pg_upgrade/t/006_modes.pl  |  10 +
 src/common/file_utils.c            |  14 +-
 src/include/common/file_utils.h    |   1 +
 14 files changed, 527 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 5db761d1ff1..da261619043 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -889,6 +920,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING
index 00842ac6ec3..c3d463c9c29 100644
--- a/src/bin/pg_upgrade/TESTING
+++ b/src/bin/pg_upgrade/TESTING
@@ -20,13 +20,13 @@ export oldinstall=...otherversion/	(old version's install base path)
 See DETAILS below for more information about creation of the dump.
 
 You can also test the different transfer modes (--copy, --link,
---clone, --copy-file-range) by setting the environment variable
+--clone, --copy-file-range, --swap) by setting the environment variable
 PG_TEST_PG_UPGRADE_MODE to the respective command-line option, like
 
 	make check PG_TEST_PG_UPGRADE_MODE=--link
 
-The default is --copy.  Note that the other modes are not supported on
-all operating systems.
+The default is --copy.  Note that not all modes are supported on all
+operating systems.
 
 DETAILS
 -------
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 88daa808035..564a9116ca5 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..47ee27ec835 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,7 +751,7 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
@@ -766,10 +766,17 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else if (transfer_mode == TRANSFER_MODE_SWAP)
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
+	else
+		pg_fatal("unrecognized transfer mode");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..91ed16acb08 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,7 +434,7 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
@@ -444,8 +444,16 @@ check_hard_link(void)
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else if (transfer_mode == TRANSFER_MODE_SWAP)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("unrecognized transfer mode");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 4c9d0172149..69c965bb7d0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..b07f3330fee 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,274 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function moves the database directory from the old cluster to the new
+ * cluster in preparation for moving the pg_restore-generated catalog files
+ * into place.  Returns false if the database with the given OID does not have
+ * a directory in the given tablespace, otherwise returns true.
+ *
+ * old_cat (the directory for the old catalog files), new_dat (the database
+ * directory in the new cluster), and moved_dat (the destination for the
+ * pg_restore-generated database directory) should be sized to MAXPGPATH bytes.
+ * This function will return the appropriate paths in those variables.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		/*
+		 * XXX: The below line is a hack to deal with the fact that we
+		 * presently don't have an easy way to find the corresponding new
+		 * tablespace's path.  This will need to be fixed if/when we add
+		 * pg_upgrade support for in-place tablespaces.
+		 */
+		new_tablespace = old_tablespace;
+
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	/* Old and new cluster paths. */
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+
+	/*
+	 * Paths for "moved aside" stuff.  We intentionally put these in the old
+	 * cluster so that the delete_old_cluster.{sh,bat} script handles them.
+	 */
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s/moved_for_upgrade", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_catalogs", moved_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.  Note that this code snippet is lifted from
+ * parse_filename_for_nontemp_relation().
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the user last
+		 * shut it down.
+		 */
+		if (user_opts.do_sync)
+			sync_queue_push(dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted by relfilenumber.  maps should already be sorted by
+	 * OID, so in general this shouldn't have much work to do.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +497,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +625,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
index 518e0994145..34fddbcdab5 100644
--- a/src/bin/pg_upgrade/t/006_modes.pl
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -16,6 +16,15 @@ sub test_mode
 	my $old = PostgreSQL::Test::Cluster->new('old', install_path => $ENV{oldinstall});
 	my $new = PostgreSQL::Test::Cluster->new('new');
 
+	# --swap can't be used to upgrade from versions older than 10, so just skip
+	# the test if the old cluster version is too old.
+	if ($old->pg_version < 10 && $mode eq "--swap")
+	{
+		$old->clean_node();
+		$new->clean_node();
+		return;
+	}
+
 	if (defined($ENV{oldinstall}))
 	{
 		# Checksums are now enabled by default, but weren't before 18, so pass
@@ -97,5 +106,6 @@ test_mode('--clone');
 test_mode('--copy');
 test_mode('--copy-file-range');
 test_mode('--link');
+test_mode('--swap');
 
 done_testing();
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..4405ef8b425 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)

#38Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#37)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 19, 2025 at 09:02:42PM -0500, Nathan Bossart wrote:

On Wed, Mar 19, 2025 at 04:28:23PM -0500, Nathan Bossart wrote:

On Wed, Mar 19, 2025 at 02:32:01PM -0500, Nathan Bossart wrote:

In addition to testing with in-place tablespaces, we might also want to
teach the transfer modes test to do cross-version testing when possible.
In that case, we can test normal (non-in-place) tablespaces. However, that
would be limited to the buildfarm.

Actually, this one was pretty easy to do.

And here is yet another new version of the full patch set. I'm planning to
commit 0001 (the new pg_upgrade transfer mode test) tomorrow so that I can
deal with any buildfarm indigestion before committing swap mode. I did run
the test locally for upgrades from v9.6, v13, and v17, but who knows what
unique configurations I've failed to anticipate...

As promised, I've committed just 0001 for now. I'll watch closely for any
issues in the buildfarm.

--
nathan

#39Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#38)
3 attachment(s)
Re: optimize file transfer in pg_upgrade

On Thu, Mar 20, 2025 at 11:11:46AM -0500, Nathan Bossart wrote:

As promised, I've committed just 0001 for now. I'll watch closely for any
issues in the buildfarm.

Seeing none, here is a rebased patch set without 0001. The only changes
are some fleshed-out comments and commit messages. I'm still aiming to
commit this sometime early next week.
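
For anyone who wants to try it out before then, the existing transfer-mode
test should exercise the new mode once the patches are applied, e.g. (run
from src/bin/pg_upgrade, per TESTING; set oldinstall first if you want a
cross-version upgrade):

	make check PG_TEST_PG_UPGRADE_MODE=--swap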

--
nathan

Attachments:

v9-0002-pg_dump-Add-sequence-data.patch (text/plain; charset=us-ascii)
From b085d8e10e8ce5fd7213a27d294bf7556e4d7430 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v9 2/3] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Bruce Momjian <bruce@momjian.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0ae40f9be58..63cca18711a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1298,6 +1298,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 428ed2d60fc..e6253331e27 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -518,6 +518,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -801,14 +802,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d281e27aa67..ed379033da7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index a9bcac4169d..adcaa419616 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			'--file' => "$tempdir/binary_upgrade.sql",
-			'--schema-only', '--binary-upgrade',
+			'--schema-only', '--sequence-data', '--binary-upgrade',
 			'--dbname' => 'postgres',
 		],
 	},
-- 
2.39.5 (Apple Git-154)
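
For readers skimming the patch, a minimal usage sketch of the new switch
(paths and database name are hypothetical) that mirrors the pg_upgrade
invocation updated in dump.c above:

	pg_dump --no-data --sequence-data --binary-upgrade \
		--quote-all-identifiers --format=custom --no-sync \
		--file=/tmp/db1.dump db1

With this patch, --binary-upgrade alone no longer implies dumping sequence
data, so a --no-data or --schema-only dump skips sequence values unless
--sequence-data is given explicitly; that is why the pg_upgrade call site in
dump.c now passes it.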

v9-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From 209a6fc03ddc94ca879809ae0794fd1ace841224 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Mar 2025 14:22:27 -0500
Subject: [PATCH v9 3/3] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.

This mode also complicates reverting to the old cluster, so we
recommend restoring from backup upon failure during or after file
transfer.  We did consider teaching pg_upgrade how to generate a
revert script for such failures, but we decided against it due to
the rarity of failing during file transfer, the complexity of
generating the script, and the potential for misusing the script.

The new mode is limited to clusters located in the same file
system.  With some effort, we could probably support upgrades
between different file systems, but this mode is unlikely to offer
much benefit if we have to copy the files across file system
boundaries.

It is also limited to upgrades from version 10 or newer.  There are
a few known obstacles for using swap mode to upgrade from older
versions.  For example, the visibility map format changed in v9.6,
and the sequence tuple format changed in v10.  In fact, swap mode
omits the --sequence-data option in its uses of pg_dump and instead
reuses the old cluster's sequence data files.  While teaching swap
mode to deal with these kinds of changes is surely possible (and we
may have to deal with similar problems in the future, anyway), it
doesn't seem worth the effort to support upgrades from
long-unsupported versions.

Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Bruce Momjian <bruce@momjian.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml            |  59 +++-
 src/bin/pg_upgrade/TESTING                 |   6 +-
 src/bin/pg_upgrade/check.c                 |  29 +-
 src/bin/pg_upgrade/controldata.c           |  21 +-
 src/bin/pg_upgrade/dump.c                  |   4 +-
 src/bin/pg_upgrade/file.c                  |  14 +-
 src/bin/pg_upgrade/info.c                  |   4 +-
 src/bin/pg_upgrade/option.c                |   7 +
 src/bin/pg_upgrade/pg_upgrade.c            |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h            |   5 +-
 src/bin/pg_upgrade/relfilenumber.c         | 375 +++++++++++++++++++++
 src/bin/pg_upgrade/t/006_transfer_modes.pl |  10 +
 src/common/file_utils.c                    |  14 +-
 src/include/common/file_utils.h            |   1 +
 14 files changed, 531 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 5db761d1ff1..da261619043 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -889,6 +920,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING
index 00842ac6ec3..c3d463c9c29 100644
--- a/src/bin/pg_upgrade/TESTING
+++ b/src/bin/pg_upgrade/TESTING
@@ -20,13 +20,13 @@ export oldinstall=...otherversion/	(old version's install base path)
 See DETAILS below for more information about creation of the dump.
 
 You can also test the different transfer modes (--copy, --link,
---clone, --copy-file-range) by setting the environment variable
+--clone, --copy-file-range, --swap) by setting the environment variable
 PG_TEST_PG_UPGRADE_MODE to the respective command-line option, like
 
 	make check PG_TEST_PG_UPGRADE_MODE=--link
 
-The default is --copy.  Note that the other modes are not supported on
-all operating systems.
+The default is --copy.  Note that not all modes are supported on all
+operating systems.
 
 DETAILS
 -------
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 88daa808035..564a9116ca5 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..47ee27ec835 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,7 +751,7 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
@@ -766,10 +766,17 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else if (transfer_mode == TRANSFER_MODE_SWAP)
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
+	else
+		pg_fatal("unrecognized transfer mode");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..91ed16acb08 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,7 +434,7 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
@@ -444,8 +444,16 @@ check_hard_link(void)
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else if (transfer_mode == TRANSFER_MODE_SWAP)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("unrecognized transfer mode");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 4c9d0172149..69c965bb7d0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..c0affa5565c 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,278 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function moves the database directory from the old cluster to the new
+ * cluster in preparation for moving the pg_restore-generated catalog files
+ * into place.  Returns false if the database with the given OID does not have
+ * a directory in the given tablespace, otherwise returns true.
+ *
+ * old_cat (the directory for the old catalog files), new_dat (the database
+ * directory in the new cluster), and moved_dat (the destination for the
+ * pg_restore-generated database directory) should be sized to MAXPGPATH bytes.
+ * This function will return the appropriate paths in those variables.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		/*
+		 * XXX: The below line is a hack to deal with the fact that we
+		 * presently don't have an easy way to find the corresponding new
+		 * tablespace's path.  This will need to be fixed if/when we add
+		 * pg_upgrade support for in-place tablespaces.
+		 */
+		new_tablespace = old_tablespace;
+
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	/* Old and new cluster paths. */
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+
+	/*
+	 * Paths for "moved aside" stuff.  We intentionally put these in the old
+	 * cluster so that the delete_old_cluster.{sh,bat} script handles them.
+	 */
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s/moved_for_upgrade", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_catalogs", moved_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.  Note that this code snippet is lifted from
+ * parse_filename_for_nontemp_relation().
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.  prepare_for_swap() should have already been called (and returned
+ * true) for the tablespace being transferred.  old_cat (the directory for the
+ * old catalog files), new_dat (the database directory in the new cluster), and
+ * moved_dat (the location of the moved-aside pg_restore-generated database
+ * directory) should be the variables returned by prepare_for_swap().
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the user last
+		 * shut it down.
+		 */
+		if (user_opts.do_sync)
+			sync_queue_push(dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted by relfilenumber.  maps should already be sorted by
+	 * OID, so in general this shouldn't have much work to do.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +501,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +629,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/bin/pg_upgrade/t/006_transfer_modes.pl b/src/bin/pg_upgrade/t/006_transfer_modes.pl
index 518e0994145..34fddbcdab5 100644
--- a/src/bin/pg_upgrade/t/006_transfer_modes.pl
+++ b/src/bin/pg_upgrade/t/006_transfer_modes.pl
@@ -16,6 +16,15 @@ sub test_mode
 	my $old = PostgreSQL::Test::Cluster->new('old', install_path => $ENV{oldinstall});
 	my $new = PostgreSQL::Test::Cluster->new('new');
 
+	# --swap can't be used to upgrade from versions older than 10, so just skip
+	# the test if the old cluster version is too old.
+	if ($old->pg_version < 10 && $mode eq "--swap")
+	{
+		$old->clean_node();
+		$new->clean_node();
+		return;
+	}
+
 	if (defined($ENV{oldinstall}))
 	{
 		# Checksums are now enabled by default, but weren't before 18, so pass
@@ -97,5 +106,6 @@ test_mode('--clone');
 test_mode('--copy');
 test_mode('--copy-file-range');
 test_mode('--link');
+test_mode('--swap');
 
 done_testing();
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 1e6250cc190..7b62687a2aa 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)
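
To make the intended invocation concrete, here is a rough sketch with
hypothetical paths (the bin and data directories are placeholders; --swap
and the advice to pair it with --sync-method=fsync come from the patch and
its documentation changes):

	pg_upgrade --swap --sync-method=fsync \
		-b /usr/pgsql-17/bin -B /usr/pgsql-18/bin \
		-d /srv/pg17/data -D /srv/pg18/data -j 8

fsync is suggested because syncfs processes whole file systems and would
also sweep up the many garbage files left behind in the old cluster,
prolonging the sync step.  Internally, the sync step becomes
"initdb --sync-only --no-sync-data-files" per the pg_upgrade.c change, and,
as the docs note, once file transfer begins the old cluster is no longer
safe to start.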

v9-0001-initdb-Add-no-sync-data-files.patch (text/plain; charset=us-ascii)
From bace00bcbf3d96bf5cc9f8865e9d1650f2d48ace Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v9 1/3] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Bruce Momjian <bruce@momjian.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 27 +++++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..2f1f9a42f90 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories, the
+        database directories themselves, and the tablespace directories, i.e.,
+        everything in the <filename>base</filename> subdirectory and any other
+        tablespace directories.  Other files, such as those in
+        <literal>pg_wal</literal> and <literal>pg_xact</literal>, will still be
+        synchronized unless the <option>--no-sync</option> option is also
+        specified.
+       </para>
+       <para>
+        Note that if <option>--no-sync-data-files</option> is used in
+        conjunction with <option>--sync-method=syncfs</option>, some or all of
+        the aforementioned files and directories will be synchronized because
+        <literal>syncfs</literal> processes entire file systems.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d4b4e334014..1da4bfc2351 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index d480dc74436..050260ee832 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -424,7 +424,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index eaa2e76f43f..1e6250cc190 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

#40Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#39)
Re: optimize file transfer in pg_upgrade

On Thu, Mar 20, 2025 at 03:23:13PM -0500, Nathan Bossart wrote:

I'm still aiming to commit this sometime early next week.

Committed. Thanks to everyone who chimed in on this thread.

While writing the attributions, I noticed that nobody seems to have
commented specifically on 0001. The closest thing to a review I see is
Greg's note upthread [0]. This patch is a little bigger than what I'd
ordinarily feel comfortable with committing unilaterally, but it's been
posted in its current form since February 28th, this thread has gotten a
decent amount of traffic since then, and it's not a huge change ("9 files
changed, 96 insertions(+), 37 deletions(-)"). I'm happy to address any
post-commit feedback that folks have. As noted earlier [1], I'm not wild
about how it's implemented, but this is the nicest approach I've thought of
thus far.

I also wanted to draw attention to this note in 0003:

/*
* XXX: The below line is a hack to deal with the fact that we
* presently don't have an easy way to find the corresponding new
* tablespace's path. This will need to be fixed if/when we add
* pg_upgrade support for in-place tablespaces.
*/
new_tablespace = old_tablespace;

I intend to address this in v19, primarily to enable same-version
pg_upgrade testing with non-default tablespaces. My current thinking is
that we should have pg_upgrade also gather the new cluster tablespace
information and map them to the corresponding tablespaces on the old
cluster. This might require some refactoring in pg_upgrade. In any case,
I didn't feel this should block the feature for v18.

[0]: /messages/by-id/CAKAnmm+i3Q1pZ05N_b8=S3B=rztQDn--HoW8BRKVtCg53r8NiQ@mail.gmail.com
[1]: /messages/by-id/Z9h5Spp76EBygyEL@nathan

--
nathan

#41Alexander Lakhin
exclusion@gmail.com
In reply to: Nathan Bossart (#37)
Re: optimize file transfer in pg_upgrade

Hello Nathan,

20.03.2025 04:02, Nathan Bossart wrote:

On Wed, Mar 19, 2025 at 04:28:23PM -0500, Nathan Bossart wrote:
And here is yet another new version of the full patch set. I'm planning to
commit 0001 (the new pg_upgrade transfer mode test) tomorrow so that I can
deal with any buildfarm indigestion before committing swap mode. I did run
the test locally for upgrades from v9.6, v13, and v17, but who knows what
unique configurations I've failed to anticipate...

I found a couple of 006_transfer_modes failures during the past month:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-04-08%2004%3A18%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-04-21%2008%3A03%3A06

Both happened on Windows, but what's worse is that the failure logs
contain no information on the exact reason. We can see:
#   Failed test 'pg_upgrade with transfer mode --swap: stdout matches'
#   at C:/tools/xmsys64/home/pgrunner/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/006_transfer_modes.pl line 61.
...
# Restoring database schemas in the new cluster
# *failure*
#
# Consult the last few lines of
# "C:/tools/xmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/006_transfer_modes/data/t_006_transfer_modes_new_data/pgdata/pg_upgrade_output.d/20250421T081115.575/log/pg_upgrade_dump_1.log" for
# the probable cause of the failure.
# Failure, exiting
# '
#     doesn't match '(?^:.* not supported on this platform|could not .* between old and new data directories: .*)'

there is a reference to pg_upgrade_dump_x.log, but no such files saved.

I tried to reproduce this failure locally, but failed. Still, I discovered
that when the test fails, the target directory containing pgdata/ gets
removed because of this coding:
    my $result = command_ok_or_fails_like(
...
    # If pg_upgrade was successful, check that all of our test objects reached
    # the new version.
    if ($result)
    {
...
    }

    $old->clean_node();
    $new->clean_node();

Moreover, even when pg_upgrade succeeds, IPC::Run::run inside
command_ok_or_fails_like() returns false, as we can see from a
successful test run:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2025-04-27%2001%3A03%3A06&stg=misc-check

pgsql.build/testrun/pg_upgrade/006_transfer_modes/log/regress_log_006_transfer_modes
[01:18:38.210](21.036s) ok 1 - pg_upgrade with transfer mode --clone: stdout matches
[01:18:38.211](0.001s) ok 2 - pg_upgrade with transfer mode --clone: stderr matches

The corresponding code is:
    print("# Running: " . join(" ", @{$cmd}) . "\n");
    my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
    if (!$result)
    {
        like($stdout, $expected_stdout, "$test_name: stdout matches");
        like($stderr, $expected_stderr, "$test_name: stderr matches");
    }

So maybe it's worth adjusting the test somehow so that the interesting logs
are left behind after a failure?
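
For example, here is a rough, untested sketch of what I mean. It reuses the
$result/$old/$new variables from the snippet above and assumes File::Find
from core Perl plus the usual note()/slurp_file()/data_dir() helpers from
the TAP infrastructure are already available in the test file:

    # Rough sketch: if the pg_upgrade step failed, dump the server-side logs
    # before clean_node() removes the data directories.
    use File::Find;

    if (!$result)
    {
        my $outputdir = $new->data_dir . '/pg_upgrade_output.d';
        if (-d $outputdir)
        {
            find(
                sub {
                    # $_ is the basename inside the wanted sub
                    return unless -f $_ && /\.(?:log|txt)$/;
                    note("=== $File::Find::name ===");
                    note(slurp_file($File::Find::name));
                },
                $outputdir);
        }
    }

    $old->clean_node();
    $new->clean_node();

Of course, this would also fire for the expected "not supported on this
platform" failures, but that seems like acceptable noise in the regress log.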

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

#42Nathan Bossart
nathandbossart@gmail.com
In reply to: Alexander Lakhin (#41)
Re: optimize file transfer in pg_upgrade

On Sun, Apr 27, 2025 at 05:00:01PM +0300, Alexander Lakhin wrote:

Both happened on Windows, but what's worse is that the failure logs
contain no information on the exact reason. We can see:
#   Failed test 'pg_upgrade with transfer mode --swap: stdout matches'
#   at C:/tools/xmsys64/home/pgrunner/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/006_transfer_modes.pl line 61.
...
# Restoring database schemas in the new cluster
# *failure*

I see a couple of other pg_upgrade failures on drongo and fairywren that
look similar, although these are for different tests:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-03-30%2013%3A03%3A05

Moreover, even when pg_upgrade succeeds, IPC::Run::run inside
command_ok_or_fails_like() returns false, as we can see from a
successful test run:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2025-04-27%2001%3A03%3A06&stg=misc-check

pgsql.build/testrun/pg_upgrade/006_transfer_modes/log/regress_log_006_transfer_modes
[01:18:38.210](21.036s) ok 1 - pg_upgrade with transfer mode --clone: stdout matches
[01:18:38.211](0.001s) ok 2 - pg_upgrade with transfer mode --clone: stderr matches

That's expected for platforms that don't support all of the modes. We
verify the output matches a known error message in that case.

So maybe it's worth adjusting the test somehow so that the interesting logs
are left behind after a failure?

I see some other discussion about failures with similar symptoms [0] [1].
Commit 6f97ef0 seems to have helped with one of the tests, and there is a
proposed patch in the latest thread [2] that AFAICT aims to fix the
underlying issue.

[0]: /messages/by-id/TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A@TYAPR01MB5866.jpnprd01.prod.outlook.com
[1]: /messages/by-id/CALDaNm3tjY44HoSwY84=XGEbTg0ruVfD4hAMTm=TgBqVysH4Qw@mail.gmail.com
[2]: /messages/by-id/CALDaNm2y+nf-V9tjKwvbPprobZs1t_UrcCpJ0qYD5-KkOUFAyg@mail.gmail.com

--
nathan

#43Alexander Lakhin
exclusion@gmail.com
In reply to: Nathan Bossart (#42)
Re: optimize file transfer in pg_upgrade

Hello Nathan,

28.04.2025 18:15, Nathan Bossart wrote:

I see a couple of other pg_upgrade failures on drongo and fairywren that
look similar, although these are for different tests:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-03-30%2013%3A03%3A05

Yeah, I've categorized the first one as [1], but now I see that it's
something different, because "pg_upgrade_output.d/ removed after
successful pg_upgrade" is not the only (and not the first) failure there.
Will fix. As to the second one, yes, it's similar in the sense that the
failed test log doesn't contain information needed to understand the
cause.

Moreover, even when pg_upgrade succeeds, IPC::Run::run inside
command_ok_or_fails_like() returns false, as we can see from a
successful test run:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2025-04-27%2001%3A03%3A06&stg=misc-check

pgsql.build/testrun/pg_upgrade/006_transfer_modes/log/regress_log_006_transfer_modes
[01:18:38.210](21.036s) ok 1 - pg_upgrade with transfer mode --clone: stdout matches
[01:18:38.211](0.001s) ok 2 - pg_upgrade with transfer mode --clone: stderr matches

That's expected for platforms that don't support all of the modes. We
verify the output matches a known error message in that case.

Yes, I meant that in that case we can't determine, outside of
command_ok_or_fails_like(), whether to preserve the logs of the failed
pg_upgrade.

So maybe it's worth adjusting the test somehow so that the interesting logs
are left behind after a failure?

I see some other discussion about failures with similar symptoms [0] [1].
Commit 6f97ef0 seems to have helped with one of the tests, and there is a
proposed patch in the latest thread [2] that AFAICT aims to fix the
underlying issue.

Thank you for the references! Unfortunately I still can't see where the
lack of upgrade log files is discussed.

In other words, if we had logs like in the case [2], it could be helpful.

[1]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#Upgrade_tests_fail_on_Windows_due_to_pg_upgrade_output.d.2F_not_removed
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2025-02-20%2017%3A01%3A23

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

#44Nathan Bossart
nathandbossart@gmail.com
In reply to: Alexander Lakhin (#43)
Re: optimize file transfer in pg_upgrade

On Mon, Apr 28, 2025 at 09:00:01PM +0300, Alexander Lakhin wrote:

Thank you for the references! Unfortunately I still can't see where the
lack of upgrade log files is discussed.

That was briefly discussed here:

/messages/by-id/644cf995-e3a5-4f69-9398-7db500e2673d@dunslane.net

One other potential problem with this test is that we reuse the directory
names for each transfer mode. That seems easy enough to fix.
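
For example, something along these lines would keep the directories
separate. This is just a sketch; the create_nodes() helper, the
$suffix-based naming, and the oldinstall handling here are hypothetical,
not what the test currently does:

    # Hypothetical sketch: derive the node names (and therefore the data
    # directory names) from the transfer mode under test, so a later mode
    # doesn't clobber the previous mode's directories and logs.
    use PostgreSQL::Test::Cluster;

    sub create_nodes
    {
        my ($mode) = @_;

        # Make the mode name safe for use in a directory name, e.g.
        # "--copy-file-range" becomes "copyfilerange".
        (my $suffix = $mode) =~ s/[^A-Za-z0-9]+//g;

        # For cross-version upgrades, the old node comes from the
        # installation pointed to by $ENV{oldinstall}.
        my @old_opts =
          defined $ENV{oldinstall} ? (install_path => $ENV{oldinstall}) : ();

        my $old = PostgreSQL::Test::Cluster->new("old_$suffix", @old_opts);
        my $new = PostgreSQL::Test::Cluster->new("new_$suffix");

        return ($old, $new);
    }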

--
nathan

#45Alexander Lakhin
exclusion@gmail.com
In reply to: Nathan Bossart (#44)
Re: optimize file transfer in pg_upgrade

Hello Nathan,

28.04.2025 21:26, Nathan Bossart wrote:

One other potential problem with this test is that we reuse the directory
names for each transfer mode. That seems easy enough to fix.

FWIW, I've counted seven 006_transfer_modes failures that happened during
this year. Five are from Windows animals:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-04-08%2004%3A18%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-04-21%2008%3A03%3A06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-07-21%2012%3A35%3A58
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-08-22%2000%3A04%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-12-28%2003%3A43%3A24

And two from culicidae, which tests EXEC_BACKEND and thus suffers from [1]
([2]):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2025-11-22%2012%3A31%3A23
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2025-12-14%2018%3A24%3A48

This number could probably justify improving the test so that we can
reliably identify the failure reason from the upgrade logs. As of now, we
can only guess at it, based on the animals' specifics...

[1]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#culicidae_failed_to_restart_server_due_to_incorrect_checksum_in_control_file
[2]: /messages/by-id/7ff9de7f-7203-cad9-76d9-45497cbedac7@gmail.com

Best regards,
Alexander