optimize file transfer in pg_upgrade

Started by Nathan Bossartover 1 year ago45 messages
Jump to latest
#1Nathan Bossart
nathandbossart@gmail.com

For clusters with many relations, the file transfer step of pg_upgrade can
take the longest. This step clones, copies, or links the user relation
files from the older cluster to the new cluster, so the amount of time it
takes is closely related to the number of relations. However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

The attached proof-of-concept patches implement this "catalog-swap" mode
for demonstration purposes. I tested this mode on a cluster with 200
databases, each with 10,000 tables with 1,000 rows and 2 unique constraints
apiece. Each database also had 10,000 sequences. The test used 96 jobs.

pg_upgrade --link --sync-method syncfs --> 10m 23s (~5m linking)
pg_upgrade --catalog-swap --> 5m 32s (~30s linking)

While these results are encouraging, there are a couple of interesting
problems to manage. First, in order to move the data directory from the
old cluster to the new cluster, we will have first moved the new cluster's
data directory (full of files created by pg_restore) aside. After the file
transfer stage, this directory will be filled with useless empty files that
should eventually be deleted. Furthermore, none of these files will have
been synchronized to disk (outside of whatever the kernel has done in the
background), so pg_upgrade's data synchronization step can take a very long
time, even when syncfs() is used (so long that pg_upgrade can take even
longer than before). After much testing, the best way I've found to deal
with this problem is to introduce a special mode for "initdb --sync-only"
that calls fsync() for everything _except_ the actual data files. If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.

Another interesting problem is that pg_upgrade currently doesn't transfer
the sequence data files. Since v10, we've restored these via pg_restore.
I believe this was originally done for the introduction of the pg_sequence
catalog, which changed the format of sequence tuples. In the new
catalog-swap mode I am proposing, this means we need to transfer all the
pg_restore-generated sequence data files. If there are many sequences, it
can be difficult to determine which transfer mode and synchronization
method will be faster. Since sequence tuple modifications are very rare, I
think the new catalog-swap mode should just use the sequence data files
from the old cluster whenever possible.

There are a couple of other smaller trade-offs with this approach, too.
First, this new mode complicates rollback if, say, the machine loses power
during file transfer. IME the vast majority of failures happen before this
step, and it should be relatively simple to generate a script that will
safely perform the required rollback steps, so I don't think this is a
deal-breaker. Second, this mode leaves around a bunch of files that users
would likely want to clean up at some point. I think the easiest way to
handle this is to just put all these files in the old cluster's data
directory so that the cleanup script generated by pg_upgrade also takes
care of them.

Thoughts?

--
nathan

Attachments:

v1-0001-Export-walkdir.patchtext/plain; charset=us-asciiDownload+4-5
v1-0002-Add-void-arg-parameter-to-walkdir-that-is-passed-.patchtext/plain; charset=us-asciiDownload+29-30
v1-0003-Introduce-catalog-swap-mode-for-pg_upgrade.patchtext/plain; charset=us-asciiDownload+176-1
v1-0004-Add-no-sync-data-files-flag-to-initdb.patchtext/plain; charset=us-asciiDownload+40-16
v1-0005-Export-pre_sync_fname.patchtext/plain; charset=us-asciiDownload+6-14
v1-0006-In-pg_upgrade-s-catalog-swap-mode-only-sync-files.patchtext/plain; charset=us-asciiDownload+56-3
v1-0007-Add-sequence-data-flag-to-pg_dump.patchtext/plain; charset=us-asciiDownload+3-11
v1-0008-Avoid-copying-sequence-files-in-pg_upgrade-s-cata.patchtext/plain; charset=us-asciiDownload+12-3
#2Greg Sabino Mullane
greg@turnstep.com
In reply to: Nathan Bossart (#1)
Re: optimize file transfer in pg_upgrade

On Wed, Nov 6, 2024 at 5:07 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

Therefore, it can be much faster to instead move the entire data directory
from the old cluster
to the new cluster and to then swap the catalog relation files.

Thank you for breaking this up so clearly into separate commits. I think it
is a very interesting idea, and anything to speed up pg_upgrade is always
welcome. Some minor thoughts:

[PATCH v1 3/8] Introduce catalog-swap mode for pg_upgrade.
.. we don't really expect there to be directories within database

directories,

so perhaps it would be better to either unconditionally rename or to fail.

Failure seems the best option here, so we can cleanly handle any future
cases in which we decide to put dirs in this directory.

if (RelFileNumberIsValid(rfn))
{
FileNameMap key;

key.relfilenumber = (RelFileNumber) rfn;
if (bsearch(&key, context->maps, context->size,
sizeof(FileNameMap), FileNameMapCmp))
return 0;
}

snprintf(dst, sizeof(dst), "%s/%s", context->target, filename);
if (rename(fname, dst) != 0)

I'm not quite clear what we are doing here with falling through
for InvalidOid entries, could you explain?

.. vm_must_add_frozenbit isn't handled yet. We could either disallow
using catalog-swap mode if the upgrade involves versions older than v9.6

Yes, this. No need for more code to handle super old versions when other
options exist.

with this problem is to introduce a special mode for "initdb --sync-only"

that calls fsync() for everything _except_ the actual data files. If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.

Very cool approach!

Cheers,
Greg

#3Bruce Momjian
bruce@momjian.us
In reply to: Nathan Bossart (#1)
Re: optimize file transfer in pg_upgrade

On Wed, Nov 6, 2024 at 04:07:35PM -0600, Nathan Bossart wrote:

For clusters with many relations, the file transfer step of pg_upgrade can
take the longest. This step clones, copies, or links the user relation
files from the older cluster to the new cluster, so the amount of time it
takes is closely related to the number of relations. However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

That is certainly a creative idea. I am surprised the links take so
long. Obviously rollback would be hard, as you mentioned, while now you
can rollback --link until you start. I think it clearly should be
considered. The patch is smaller than I expected.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

#4Nathan Bossart
nathandbossart@gmail.com
In reply to: Greg Sabino Mullane (#2)
Re: optimize file transfer in pg_upgrade

On Sun, Nov 17, 2024 at 01:50:53PM -0500, Greg Sabino Mullane wrote:

On Wed, Nov 6, 2024 at 5:07 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

Therefore, it can be much faster to instead move the entire data directory
from the old cluster
to the new cluster and to then swap the catalog relation files.

Thank you for breaking this up so clearly into separate commits. I think it
is a very interesting idea, and anything to speed up pg_upgrade is always
welcome. Some minor thoughts:

Thank you for reviewing!

.. we don't really expect there to be directories within database

directories,

so perhaps it would be better to either unconditionally rename or to fail.

Failure seems the best option here, so we can cleanly handle any future
cases in which we decide to put dirs in this directory.

Good point.

if (RelFileNumberIsValid(rfn))
{
FileNameMap key;

key.relfilenumber = (RelFileNumber) rfn;
if (bsearch(&key, context->maps, context->size,
sizeof(FileNameMap), FileNameMapCmp))
return 0;
}

snprintf(dst, sizeof(dst), "%s/%s", context->target, filename);
if (rename(fname, dst) != 0)

I'm not quite clear what we are doing here with falling through
for InvalidOid entries, could you explain?

The idea is that if it looks like a data file that we might want to
transfer (i.e., it starts with a RelFileNumber), we should consult our map
to determine whether to move it. Otherwise, we want to unconditionally
transfer it so that we always use the files generated during pg_restore in
the new cluster (e.g., PG_VERSION and pg_filenode.map). In theory, this
should result in the same end state as what --link mode does today (for the
new cluster, at least).

.. vm_must_add_frozenbit isn't handled yet. We could either disallow
using catalog-swap mode if the upgrade involves versions older than v9.6

Yes, this. No need for more code to handle super old versions when other
options exist.

I'm inclined to agree.

with this problem is to introduce a special mode for "initdb --sync-only"
that calls fsync() for everything _except_ the actual data files. If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.

Very cool approach!

:)

--
nathan

#5Nathan Bossart
nathandbossart@gmail.com
In reply to: Bruce Momjian (#3)
Re: optimize file transfer in pg_upgrade

On Mon, Nov 18, 2024 at 10:34:00PM -0500, Bruce Momjian wrote:

On Wed, Nov 6, 2024 at 04:07:35PM -0600, Nathan Bossart wrote:

For clusters with many relations, the file transfer step of pg_upgrade can
take the longest. This step clones, copies, or links the user relation
files from the older cluster to the new cluster, so the amount of time it
takes is closely related to the number of relations. However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

That is certainly a creative idea. I am surprised the links take so
long. Obviously rollback would be hard, as you mentioned, while now you
can rollback --link until you start. I think it clearly should be
considered.

I've yet to try, but I'm cautiously optimistic that it will be possible to
generate simple scripts that can unwind things by just looking at the
directory entries, even if pg_upgrade crashed halfway through the linking
stage.

The patch is smaller than I expected.

I was surprised by this, too. Obviously, this one is a bit smaller than
the "real" patches will be because it's just a proof-of-concept, but it
should still be pretty manageable.

--
nathan

#6Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#4)
Re: optimize file transfer in pg_upgrade

Here is a rebased patch set for cfbot. I'm planning to spend some time
getting these patches into a more reviewable state in the near future.

--
nathan

Attachments:

v2-0001-Export-walkdir.patchtext/plain; charset=us-asciiDownload+4-5
v2-0002-Add-void-arg-parameter-to-walkdir-that-is-passed-.patchtext/plain; charset=us-asciiDownload+29-30
v2-0003-Introduce-catalog-swap-mode-for-pg_upgrade.patchtext/plain; charset=us-asciiDownload+176-1
v2-0004-Add-no-sync-data-files-flag-to-initdb.patchtext/plain; charset=us-asciiDownload+40-16
v2-0005-Export-pre_sync_fname.patchtext/plain; charset=us-asciiDownload+6-14
v2-0006-In-pg_upgrade-s-catalog-swap-mode-only-sync-files.patchtext/plain; charset=us-asciiDownload+56-3
v2-0007-Add-sequence-data-flag-to-pg_dump.patchtext/plain; charset=us-asciiDownload+3-11
v2-0008-Avoid-copying-sequence-files-in-pg_upgrade-s-cata.patchtext/plain; charset=us-asciiDownload+12-3
#7Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#6)
Re: optimize file transfer in pg_upgrade

I've spent quite a bit of time recently trying to get this patch set into a
reasonable state. It's still a little rough around the edges, and the code
for the generated scripts is incomplete, but I figured I'd at least get
some CI testing going.

--
nathan

Attachments:

v3-0001-initdb-Add-no-sync-data-files.patchtext/plain; charset=us-asciiDownload+89-38
v3-0002-pg_dump-Add-sequence-data.patchtext/plain; charset=us-asciiDownload+16-12
v3-0003-Add-new-frontend-functions-for-durable-file-opera.patchtext/plain; charset=us-asciiDownload+49-10
v3-0004-pg_upgrade-Add-swap-for-faster-file-transfer.patchtext/plain; charset=us-asciiDownload+558-11
#8Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#1)
Re: optimize file transfer in pg_upgrade

On Wed, Nov 6, 2024 at 5:07 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

This is a cool idea.

Another interesting problem is that pg_upgrade currently doesn't transfer
the sequence data files. Since v10, we've restored these via pg_restore.
I believe this was originally done for the introduction of the pg_sequence
catalog, which changed the format of sequence tuples. In the new
catalog-swap mode I am proposing, this means we need to transfer all the
pg_restore-generated sequence data files. If there are many sequences, it
can be difficult to determine which transfer mode and synchronization
method will be faster. Since sequence tuple modifications are very rare, I
think the new catalog-swap mode should just use the sequence data files
from the old cluster whenever possible.

Maybe we should rethink the decision not to transfer relfilenodes for
sequences. Or have more than one way to do it. pg_upgrade
--binary-upgrade --binary-upgrade-even-for-sequences, or whatever.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#8)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 2:40 PM Robert Haas <robertmhaas@gmail.com> wrote:

Maybe we should rethink the decision not to transfer relfilenodes for
sequences. Or have more than one way to do it. pg_upgrade
--binary-upgrade --binary-upgrade-even-for-sequences, or whatever.

Sorry, I meant: pg_dump --binary-upgrade --binary-upgrade-even-for-sequences

i.e. pg_upgrade could decide which way to ask pg_dump to do it,
depending on versions and flags.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#9)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 02:41:22PM -0500, Robert Haas wrote:

On Fri, Feb 28, 2025 at 2:40 PM Robert Haas <robertmhaas@gmail.com> wrote:

Maybe we should rethink the decision not to transfer relfilenodes for
sequences. Or have more than one way to do it. pg_upgrade
--binary-upgrade --binary-upgrade-even-for-sequences, or whatever.

Sorry, I meant: pg_dump --binary-upgrade --binary-upgrade-even-for-sequences

i.e. pg_upgrade could decide which way to ask pg_dump to do it,
depending on versions and flags.

That's exactly where I landed (see v3-0002). I haven't measured whether
transferring relfilenodes or dumping the sequence data is faster for the
existing modes, but for now I've left those alone, i.e., they still dump
sequence data. The new "swap" mode just uses the old cluster's sequence
files, and I've disallowed using swap mode for upgrades from <v10 to avoid
the sequence tuple format change (along with other incompatible changes).

I'll admit I'm a bit concerned that this will cause problems if and when
someone wants to change the sequence tuple format again. But that hasn't
happened for a while, AFAIK nobody's planning to change it, and even if it
does happen, we just need to have my proposed new mode transfer the
sequence files like it transfers the catalog files. That will make this
mode slower, especially if you have a ton of sequences, but maybe it'll
still be a win in most cases. Of course, we probably will need to have
pg_upgrade handle other kinds of format changes, too, but IMHO it's still
worth trying to speed up pg_upgrade despite the potential future
complexities.

--
nathan

#11Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#10)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 3:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

That's exactly where I landed (see v3-0002). I haven't measured whether
transferring relfilenodes or dumping the sequence data is faster for the
existing modes, but for now I've left those alone, i.e., they still dump
sequence data. The new "swap" mode just uses the old cluster's sequence
files, and I've disallowed using swap mode for upgrades from <v10 to avoid
the sequence tuple format change (along with other incompatible changes).

Ah. Perhaps I should have read the thread more carefully before
commenting. Sounds good, at any rate.

I'll admit I'm a bit concerned that this will cause problems if and when
someone wants to change the sequence tuple format again. But that hasn't
happened for a while, AFAIK nobody's planning to change it, and even if it
does happen, we just need to have my proposed new mode transfer the
sequence files like it transfers the catalog files. That will make this
mode slower, especially if you have a ton of sequences, but maybe it'll
still be a win in most cases. Of course, we probably will need to have
pg_upgrade handle other kinds of format changes, too, but IMHO it's still
worth trying to speed up pg_upgrade despite the potential future
complexities.

I think it's fine. If somebody comes along and says "hey, when v23
came out Nathan's feature only sped up pg_upgrade by 2x instead of 3x
like it did for v22, so Nathan is a bad person," I think we can fairly
reply "thanks for sharing your opinion, feel free not to use the
feature and run at 1x speed". There's no rule saying that every
optimization must always produce the maximum possible benefit in every
scenario. We're just concerned about regressions, and "only delivers
some of the speedup if the sequence format has changed on disk" is not
a regression.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#11)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 03:37:49PM -0500, Robert Haas wrote:

On Fri, Feb 28, 2025 at 3:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

That's exactly where I landed (see v3-0002). I haven't measured whether
transferring relfilenodes or dumping the sequence data is faster for the
existing modes, but for now I've left those alone, i.e., they still dump
sequence data. The new "swap" mode just uses the old cluster's sequence
files, and I've disallowed using swap mode for upgrades from <v10 to avoid
the sequence tuple format change (along with other incompatible changes).

Ah. Perhaps I should have read the thread more carefully before
commenting. Sounds good, at any rate.

On the contrary, I'm glad you independently came to the same conclusion.

I'll admit I'm a bit concerned that this will cause problems if and when
someone wants to change the sequence tuple format again. But that hasn't
happened for a while, AFAIK nobody's planning to change it, and even if it
does happen, we just need to have my proposed new mode transfer the
sequence files like it transfers the catalog files. That will make this
mode slower, especially if you have a ton of sequences, but maybe it'll
still be a win in most cases. Of course, we probably will need to have
pg_upgrade handle other kinds of format changes, too, but IMHO it's still
worth trying to speed up pg_upgrade despite the potential future
complexities.

I think it's fine. If somebody comes along and says "hey, when v23
came out Nathan's feature only sped up pg_upgrade by 2x instead of 3x
like it did for v22, so Nathan is a bad person," I think we can fairly
reply "thanks for sharing your opinion, feel free not to use the
feature and run at 1x speed". There's no rule saying that every
optimization must always produce the maximum possible benefit in every
scenario. We're just concerned about regressions, and "only delivers
some of the speedup if the sequence format has changed on disk" is not
a regression.

Cool. I appreciate the design feedback.

--
nathan

#13Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#12)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 02:51:27PM -0600, Nathan Bossart wrote:

Cool. I appreciate the design feedback.

One other design point I wanted to bring up is whether we should bother
generating a rollback script for the new "swap" mode. In short, I'm
wondering if it would be unreasonable to say that, just for this mode, once
pg_upgrade enters the file transfer step, reverting to the old cluster
requires restoring a backup. I believe that's worth considering for the
following reasons:

* Anecdotally, I'm not sure I've ever actually seen pg_upgrade fail during
or after file transfer, and I'm hoping to get some real data about that
in the near future. Has anyone else dealt with such a failure? I
suspect that failures during file transfer are typically due to OS
crashes, power losses, etc., and hopefully those are rare.

* I've spent quite some time trying to generate a portable script, but it's
quite complicated and difficult to reason about its correctness. And I
haven't even started on the Windows version. Leaving this part out would
simplify the patch set quite a bit.

* If we give up the idea of reverting to the old cluster, we also can avoid
a bunch of intermediate fsync() calls which I only included to help
reason about the state of the files in case you failed halfway through.
This might not add up to much, but it's at least another area of
simplification.

Of course, rollback would still be possible, but you'd really need to
understand what "swap" mode does behind the scenes to do so safely. In any
case, I'm growing skeptical that a probably-not-super-well-tested script
that extremely few people will need and fewer will use is worth the
complexity.

--
nathan

#14Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#13)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 5, 2025 at 2:42 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

Of course, rollback would still be possible, but you'd really need to
understand what "swap" mode does behind the scenes to do so safely. In any
case, I'm growing skeptical that a probably-not-super-well-tested script
that extremely few people will need and fewer will use is worth the
complexity.

I don't have a super-strong view on what the right thing to do is
here, but I'm definitely in favor of not doing something half-baked.
If you think the revert script is going to suck, then let's not have
it at all and just be clear about what the requirements for using this
mode are.

One serious danger that you didn't mention here is that, if pg_upgrade
does fail, you may well try several times. And if you forget the
revert script at some point, or run it more than once, or run the
wrong version, you will be in trouble. I feel like this is something
someone could very easily mess up even if in theory it works
perfectly. Upgrades are often stressful times.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15Greg Sabino Mullane
greg@turnstep.com
In reply to: Nathan Bossart (#13)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 5, 2025 at 2:43 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

One other design point I wanted to bring up is whether we should bother
generating a rollback script for the new "swap" mode. In short, I'm
wondering if it would be unreasonable to say that, just for this mode, once
pg_upgrade enters the file transfer step, reverting to the old cluster
requires restoring a backup.

I think that's a fair requirement. And like Robert, revert scripts make me
nervous.

* Anecdotally, I'm not sure I've ever actually seen pg_upgrade fail

during or after file transfer, and I'm hoping to get some real data about
that in the near future. Has anyone else dealt with such a failure?

I've seen various failures, but they always get caught quite early.
Certainly early enough to easily abort, fix perms/mounts/etc., then retry.
I think your instinct is correct that this reversion is more trouble than
its worth. I don't think the pg_upgrade docs mention taking a backup, but
that's always step 0 in my playbook, and that's the rollback plan in the
unlikely event of failure.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

#16Nathan Bossart
nathandbossart@gmail.com
In reply to: Greg Sabino Mullane (#15)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 05, 2025 at 03:40:52PM -0500, Greg Sabino Mullane wrote:

I've seen various failures, but they always get caught quite early.
Certainly early enough to easily abort, fix perms/mounts/etc., then retry.
I think your instinct is correct that this reversion is more trouble than
its worth. I don't think the pg_upgrade docs mention taking a backup, but
that's always step 0 in my playbook, and that's the rollback plan in the
unlikely event of failure.

Thank you, Greg and Robert, for sharing your thoughts. With that, here's
what I'm considering to be a reasonably complete patch set for this
feature. This leaves about a month for rigorous testing and editing, so
I'm hopeful it'll be ready v18.

--
nathan

Attachments:

v4-0001-initdb-Add-no-sync-data-files.patchtext/plain; charset=us-asciiDownload+89-38
v4-0002-pg_dump-Add-sequence-data.patchtext/plain; charset=us-asciiDownload+16-12
v4-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patchtext/plain; charset=us-asciiDownload+505-32
#17Bruce Momjian
bruce@momjian.us
In reply to: Greg Sabino Mullane (#15)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 5, 2025 at 03:40:52PM -0500, Greg Sabino Mullane wrote:

On Wed, Mar 5, 2025 at 2:43 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

One other design point I wanted to bring up is whether we should bother
generating a rollback script for the new "swap" mode.  In short, I'm
wondering if it would be unreasonable to say that, just for this mode, once
pg_upgrade enters the file transfer step, reverting to the old cluster
requires restoring a backup.

I think that's a fair requirement. And like Robert, revert scripts make me
nervous.

* Anecdotally, I'm not sure I've ever actually seen pg_upgrade fail
during or after file transfer, and I'm hoping to get some real data about
that in the near future.  Has anyone else dealt with such a failure?

I've seen various failures, but they always get caught quite early. Certainly
early enough to easily abort, fix perms/mounts/etc., then retry. I think your
instinct is correct that this reversion is more trouble than its worth. I don't
think the pg_upgrade docs mention taking a backup, but that's always step 0 in
my playbook, and that's the rollback plan in the unlikely event of failure.

I avoided many optimizations in pg_upgrade in the fear they would lead
to hard-to-detect bugs, or breakage from major release changes.
pg_upgrade is probably old enough now (15 years) that we can risk these
optimizations.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.

#18Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#16)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 05, 2025 at 08:34:37PM -0600, Nathan Bossart wrote:

Thank you, Greg and Robert, for sharing your thoughts. With that, here's
what I'm considering to be a reasonably complete patch set for this
feature. This leaves about a month for rigorous testing and editing, so
I'm hopeful it'll be ready v18.

Here are my notes after a round of self-review.

0001:
* The documentation does not adequately describe the interaction between
--no-sync-data-files and --sync-method=syncfs.
* I really don't like the exclude_dir hack for skipping the main tablespace
directory, but I haven't thought of anything that seems better.
* I should verify that there's no path separator issues on Windows for the
exclude_dir hack. From some quick code analysis, I think it should work
fine, but I probably ought to test it out to be sure.
* The documentation needs to mention that the tablespace directories
themselves are not synchronized.

0002:
* The documentation changes are subject to update based on ongoing stats
import/export work.
* Does --statistics-only --sequence-data make any sense? It seems like it
ought to function as expected, but it's hard to see a use-case.

0003:
* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.
* For check_hard_link() and disable_old_cluster(), move the Assert() to an
"else" block with a pg_fatal() call for sturdiness.
* I need to do a thorough pass-through on all comments. Many are not
sufficiently detailed.
* The "." and ".." checks in the catalog swap code are redundant and can be
removed.
* The directory for "moved-aside" stuff should be placed within the old
cluster's corresponding tablespace directory so that no changes need to
be made to delete_old_cluster.{sh,bat}.
* Manual testing with non-default tablespaces!

Updated patches based on these notes are attached.

--
nathan

Attachments:

v5-0001-initdb-Add-no-sync-data-files.patchtext/plain; charset=us-asciiDownload+96-38
v5-0002-pg_dump-Add-sequence-data.patchtext/plain; charset=us-asciiDownload+16-12
v5-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patchtext/plain; charset=us-asciiDownload+510-34
#19Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#18)
Re: optimize file transfer in pg_upgrade

On Mon, Mar 17, 2025 at 3:34 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.

It would be better if we didn't need a separate buildfarm animal to
test this, because that means you won't get notified by local testing
OR by CI if you break this. Can we instead have one test that checks
this which is part of the normal test run?

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#19)
Re: optimize file transfer in pg_upgrade

On Mon, Mar 17, 2025 at 04:04:45PM -0400, Robert Haas wrote:

On Mon, Mar 17, 2025 at 3:34 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.

It would be better if we didn't need a separate buildfarm animal to
test this, because that means you won't get notified by local testing
OR by CI if you break this. Can we instead have one test that checks
this which is part of the normal test run?

That's what I set out to do before I discovered PG_TEST_PG_UPGRADE_MODE.
The commit message for b059a24 seemed to indicate that we don't want to
automatically test all supported modes, but I agree that it would be nice
to have some basic coverage for --swap on CI/buildfarm regardless of
PG_TEST_PG_UPGRADE_MODE. How about we add a simple TAP test (attached),
and I still plan on switching a buildfarm animal to --swap for more
in-depth testing?

--
nathan

Attachments:

v6-0001-initdb-Add-no-sync-data-files.patchtext/plain; charset=us-asciiDownload+96-38
v6-0002-pg_dump-Add-sequence-data.patchtext/plain; charset=us-asciiDownload+16-12
v6-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patchtext/plain; charset=us-asciiDownload+553-34
#21Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
#24Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#21)
#25Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#23)
#26Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#25)
#27Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#26)
#28Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#27)
#29Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#28)
#30Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#29)
#31Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#30)
#32Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#30)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#31)
#34Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#33)
#35Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#34)
#36Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#35)
#37Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#36)
#38Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#37)
#39Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#38)
#40Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#39)
#41Alexander Lakhin
exclusion@gmail.com
In reply to: Nathan Bossart (#37)
#42Nathan Bossart
nathandbossart@gmail.com
In reply to: Alexander Lakhin (#41)
#43Alexander Lakhin
exclusion@gmail.com
In reply to: Nathan Bossart (#42)
#44Nathan Bossart
nathandbossart@gmail.com
In reply to: Alexander Lakhin (#43)
#45Alexander Lakhin
exclusion@gmail.com
In reply to: Nathan Bossart (#44)