Horribly slow pg_upgrade performance with many Large Objects

Started by Hannu Krosing · 9 months ago · 48 messages
#1 Hannu Krosing
hannuk@google.com

Hi Hackers

## The issue

I have now met a not insignificant number of cases where pg_upgrade
performance is really bad when the database has a large number of
Large Objects.

The average time to `pg_dump --binary-upgrade --format=custom ...` a
database and then `pg_restore ...` it back is 1 minute per 1 million
LOs.

This does not seem too bad until you realize that it is
- 100 minutes, or 1h 40min, for 100M LOs
- 16h 40min for 1 billion large objects
- 66h 40min, or 2 days 18 hours and 40 minutes, for those unfortunate
enough to have the full 4 billion LOs

My direct tests took 10m5s for 10M LOs and 100m54s for 100M.

Also a smaller server just runs out of shared memory when restoring 100M LOs.

I know that our semi-official recommendation for using Large Objects
is "Don't!", but people still use them and sometimes they have a lot
of them. One of the reasons seems to be an older version of Hibernate
which converted any column of type CLOB into a Large Object, so we see
cases where the average LO size is a few hundred to a few thousand bytes.

I read through the old discussions from around the time of the v12
release, and the consensus at that time seemed to be that dumping the
LOs without data was OK "because there will not be too many LOs", which
unfortunately has turned out to be wishful thinking, as there are lots
of tiny Large Objects out there and these really slow down pg_upgrade.

## So what to do about this?

The obvious solution would be to handle the table
`pg_largeobject_metadata` the same way as we currently handle
`pg_largeobject`, by not doing anything with it in `pg_dump
--binary-upgrade` and just handling its contents like we do for user
tables in pg_upgrade itself.

This should work fine for all source database versions starting from PgSQL v12.

For the older supported versions 9.2 to 11, where `oid` was a system
column, we should just dump the three fields of
`pg_largeobject_metadata` directly, so that the dump command would be
`COPY (SELECT oid, lomowner, lomacl) TO stdout`, and then restore would
happen more or less automatically.
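
A minimal sketch of that, assuming the pre-v12 layout on the source and
loading into the v12+ user-column layout on the target, would be
something like:

COPY (SELECT oid, lomowner, lomacl FROM pg_largeobject_metadata) TO STDOUT;
-- and on the target side:
COPY pg_largeobject_metadata (oid, lomowner, lomacl) FROM STDIN;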

Or we could just modify `pg_restore` so that it restores LOs using
COPY instead of doing the current dance of three separate SELECTs from
lo_xxx() functions.

`pg_dump --binary-upgrade --format=custom` is relatively fast for LOs -
1m58s for 100M objects -

though still much slower than `pg_dump --data-only -t
pg_largeobject_metadata`, which takes 20 seconds, or 23 when also run
through gzip.

## Preferred solution for --binary-upgrade

- link/copy data files when source is v12+
- dump the triplet of pg_largeobject_metadata columns if source is v9.2 - v11

An acceptable, faster-to-implement solution would be to just always dump
pg_largeobject_metadata (oid, lomowner, lomacl), as this would not
need any changes to `pg_upgrade`, just to `pg_dump --binary-upgrade`.
It would not be as fast as linking/copying, but it would be several
orders of magnitude faster than the current behaviour.

Does anyone have any objections to this?

Would anyone be willing to take up fixing this?

Or perhaps advise where in the code the change should be done?

I started looking at this, and at a high level it looks like I need to
create a task to copy out specific data columns rather than the
structure of pg_largeobject_metadata, but as pg_dump.c is 582,673 bytes
it would take some time to understand where exactly to do this and all
the possible interactions with flags like --binary-upgrade and specific
formats.

---
Cheers,
Hannu

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#1)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hannu Krosing <hannuk@google.com> writes:

I have now met a not insignificant number of cases where pg_upgrade
performance is really bad when the database has a large number of
Large Objects.

What version are you testing? We did some work in that area in the
v17 cycle (a45c78e32).

regards, tom lane

#3 Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#1)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Mon, Apr 07, 2025 at 10:33:47PM +0200, Hannu Krosing wrote:

The obvious solution would be to handle the table
`pg_largeobject_metadata` the same way as we currently handle
`pg_largeobject `by not doing anything with it in `pg_dump
--binary-upgrade` and just handle the contents it like we do for user
tables in pg_upgrade itself.

This should work fine for all source database versions starting from PgSQL v12.

Unfortunately, the storage format for aclitem changed in v16, so this would
need to be restricted to upgrades from v16 and newer. That being said, I
regularly hear about slow upgrades with many LOs, so I think it'd be
worthwhile to try to improve matters in v19.

--
nathan

#4 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#2)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Mon, Apr 07, 2025 at 05:25:32PM -0400, Tom Lane wrote:

What version are you testing? We did some work in that area in the
v17 cycle (a45c78e32).

I am puzzled by the target version used here, as well. If there is
more that can be improved, v19 would be the version to consider for
future improvements at this stage.
--
Michael

#5 Hannu Krosing
hannuk@google.com
In reply to: Michael Paquier (#4)
Re: Horribly slow pg_upgrade performance with many Large Objects

I was testing on version 17

On Tue, Apr 8, 2025 at 6:52 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Apr 07, 2025 at 05:25:32PM -0400, Tom Lane wrote:

What version are you testing? We did some work in that area in the
v17 cycle (a45c78e32).

I am puzzled by the target version used here, as well.

I was testing on version 17

Here is how you can easily test this too (since --binary-upgrade does
not dump the actual data, it is OK for the test to not put anything there):

hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5433 lodb
hannuk@db01-c1a:~/work/lo-testing$ psql -p 5433 lodb
psql (17.4 (Ubuntu 17.4-1.pgdg22.04+2))
Type "help" for help.

lodb=# insert into pg_largeobject_metadata(oid, lomowner) SELECT i,
16384 FROM generate_series(1, 100_000_000) g(i);
INSERT 0 100000000
Time: 162414.216 ms (02:42.414)
lodb=#
\q

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
pg_largeobject_metadata -p 5433 lodb | gzip >
pg_largeobject_metadata.data.gz
real 0m22.094s
user 0m20.741s
sys 0m2.085s

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
pg_largeobject_metadata --format=custom -p 5433 lodb -f
pg_largeobject_metadata.dump
real 0m20.226s
user 0m18.068s
sys 0m0.824s

If there is
more that can be improved, v19 would be the version to consider for
future improvements at this stage.

If the internal format has changed in 16, the correct way would be to
go through the data-only dump of pg_largeobject_metadata in all cases.
Even for the 100M case you get the restore in 2 minutes instead
of 100 minutes:

hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5434 lodb
hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb
pg_largeobject_metadata.dump

real 2m2.277s
user 0m2.594s
sys 0m0.549s

And even in the case of the user-visible change in the acl format, it is
most likely that changing the visible format using some regexp magic,
or even a dedicated function, would still be much faster than creating
all the LOs through creation commands.

------
The commands I used to do the pg_upgrade-like test were

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb100m.dump -p 5433 lodb
real 1m58.241s
user 0m35.229s
sys 0m17.854s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb lodb100m.dump
real 100m54.878s
user 3m23.885s
sys 20m33.761s

(I left out the --verbose part that pg_upgrade also sets, as I did not
want to get 100M lines of "large object created" messages.)

Also, the postgres server at -p 5434 needs to be started with the -b flag
to accept loading a dump from --binary-upgrade. In Debian/Ubuntu
this can be passed directly to pg_ctlcluster as follows:

sudo pg_ctlcluster 17 target -o -b

----
Hannu


#6 Hannu Krosing
hannuk@google.com
In reply to: Hannu Krosing (#5)
Re: Horribly slow pg_upgrade performance with many Large Objects

On re-reading, the order looked a bit illogical, so I want to make
clear that the pg_upgrade-like test showing 100 min for 100 million LOs
is at the end of the last message and the proposed solution is at the
beginning.


#7 Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#3)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Apr 07, 2025 at 10:33:47PM +0200, Hannu Krosing wrote:

The obvious solution would be to handle the table
`pg_largeobject_metadata` the same way as we currently handle
`pg_largeobject `by not doing anything with it in `pg_dump
--binary-upgrade` and just handle the contents it like we do for user
tables in pg_upgrade itself.

This should work fine for all source database versions starting from PgSQL v12.

Unfortunately, the storage format for aclitem changed in v16, so this would
need to be restricted to upgrades from v16 and newer.

Have we also changed the external format of aclitem any time since
v9.2, or are the changes just to storage?

If the external formats have been stable, we can still get reasonable
performance by dumping the data (2 min for 100M rows). Plus, dumping
the data would work for all the supported source versions.

The worst case would still be quite bad, with 80+ min for the full set
of 4 billion LOs, but even that would be much better than the 3 days
with the current way.

That being said, I
regularly hear about slow upgrades with many LOs, so I think it'd be
worthwhile to try to improve matters in v19.

Changing the LO export to dumping pg_largeobject_metadata content
instead of creating the LOs should be a nice small change confined to
pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
still.

--
Hannu

#8 Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#7)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:

On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

That being said, I
regularly hear about slow upgrades with many LOs, so I think it'd be
worthwhile to try to improve matters in v19.

Changing the LO export to dumping pg_largeobject_metadata content
instead of creating the LOs should be a nice small change confined to
pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
still.

Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
material at this point.

--
nathan

#9 Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#8)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 8, 2025 at 5:46 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:

On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

That being said, I
regularly hear about slow upgrades with many LOs, so I think it'd be
worthwhile to try to improve matters in v19.

Changing the LO export to dumping pg_largeobject_metadata content
instead of creating the LOs should be a nice small change confined to
pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
still.

Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
material at this point.

Sure. But I actually think that this is something that should be
back-ported to at least all supported versions at some point,
possibly made dependent on some environment flag so that only people
who desperately need it will get it.

Btw, who would be the right person(s) to ask questions about the
internals of pg_dump?
I have a few more things in the pipeline to add there and would like
to make sure that I have the right approach.

------
Hannu

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#8)
Re: Horribly slow pg_upgrade performance with many Large Objects

Nathan Bossart <nathandbossart@gmail.com> writes:

On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:

Changing the LO export to dumping pg_largeobject_metadata content
instead of creating the LOs should be a nice small change confined to
pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
still.

Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
material at this point.

Yeah, even if we had a patch in hand, it's too late for v18. However,
there are additional problems with this idea:

1. The idea requires role OIDs to match across the upgrade.
I don't believe that pg_upgrade tries to preserve role OIDs --- and
doing so would be problematic, because what if the new cluster's
bootstrap superuser is named differently in the old and new clusters?

It might be possible to work around that with some casting to/from
regrole, but I don't think a simple COPY into pg_largeobject_metadata
will play along with that.
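
For illustration only, a rough sketch of the regrole idea would be to
dump the owner by name on the source side:

COPY (SELECT oid, lomowner::regrole, lomacl FROM pg_largeobject_metadata) TO STDOUT;

but since lomowner is an oid column on the target, loading that name
form back would still need an extra name-to-OID mapping step rather
than a plain COPY FROM.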

2. If you just do the equivalent of an INSERT or COPY into
pg_largeobject_metadata, you could create entries that look right,
but they are actually not right because there should be pg_shdepend
entries backing each ownership or permission reference (for non-pinned
roles) and there won't be.

I guess you could think of also manually inserting rows into
pg_shdepend, but (a) ugh and (b) the claimed speedup is kind
of vanishing into the distance at this point.

regards, tom lane

#11 Hannu Krosing
hannuk@google.com
In reply to: Tom Lane (#10)
Re: Horribly slow pg_upgrade performance with many Large Objects

This is what the opening comment in pg_upgrade.c says.

I think we do preserve role oids:

/*
* To simplify the upgrade process, we force certain system values to be
* identical between old and new clusters:
*
* We control all assignments of pg_class.oid (and relfilenode) so toast
* oids are the same between old and new clusters. This is important
* because toast oids are stored as toast pointers in user tables.
*
* While pg_class.oid and pg_class.relfilenode are initially the same in a
* cluster, they can diverge due to CLUSTER, REINDEX, or VACUUM FULL. We
* control assignments of pg_class.relfilenode because we want the filenames
* to match between the old and new cluster.
*
* We control assignment of pg_tablespace.oid because we want the oid to match
* between the old and new cluster.
*
* We control all assignments of pg_type.oid because these oids are stored
* in user composite type values.
*
* We control all assignments of pg_enum.oid because these oids are stored
* in user tables as enum values.
*
* We control all assignments of pg_authid.oid for historical reasons (the
* oids used to be stored in pg_largeobject_metadata, which is now copied via
* SQL commands), that might change at some point in the future.
*/


#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#11)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hannu Krosing <hannuk@google.com> writes:

I think we do preserve role oids

Oh ... I'd been looking for mentions of "role" in
pg_upgrade_support.c, but what I should have looked for was
"pg_authid". So yeah, we do preserve role OIDs, and maybe that's
enough to make this workable, at least with source versions that
share the same rules for what goes into pg_largeobject_metadata and
pg_shdepend. It's not something I'd risk back-patching though.

regards, tom lane

#13 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#12)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 12:37:43PM -0400, Tom Lane wrote:

Hannu Krosing <hannuk@google.com> writes:

I think we do preserve role oids

Oh ... I'd been looking for mentions of "role" in
pg_upgrade_support.c, but what I should have looked for was
"pg_authid". So yeah, we do preserve role OIDs, and maybe that's
enough to make this workable, at least with source versions that
share the same rules for what goes into pg_largeobject_metadata and
pg_shdepend. It's not something I'd risk back-patching though.

I do think it's worth considering going back to copying
pg_largeobject_metadata's files for upgrades from v16 and newer. That
sounds restrictive at the moment, but it'll mean that all but one supported
major version can copy the files during upgrade to v19. I'll admit I'm a
tad worried about having to go back to copying via SQL commands in the
future and re-regressing things (leading to unpredictable differences in
upgrade downtime), but I'm not sure that's a great reason to withhold this
optimization.

Of course, I wouldn't be opposed to optimizing the SQL command strategy,
too, but I suspect that won't compare to copying the files.

--
nathan

#14 Hannu Krosing
hannuk@google.com
In reply to: Tom Lane (#12)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 8, 2025 at 6:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hannu Krosing <hannuk@google.com> writes:

I think we do preserve role oids

Oh ... I'd been looking for mentions of "role" in
pg_upgrade_support.c, but what I should have looked for was
"pg_authid". So yeah, we do preserve role OIDs, and maybe that's
enough to make this workable, at least with source versions that
share the same rules for what goes into pg_largeobject_metadata and
pg_shdepend. It's not something I'd risk back-patching though.

The risk is why I suggest having this in backports as a last-resort
option gated on an environment variable.

I have been forced to do this half-manually a few times to make
pg_upgrade feasible at all.

The current, really UGH, workaround is to use pg_repack's support for
swapping the relfilenodes of pg_largeobject_metadata with a user table
before and after the upgrade and then manually patching up leftovers
like pg_shdepend after pg_upgrade.

These have fortunately been cases where there were no other
dependencies like comments or security labels on LOs, but those too
should be doable.


#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#13)
Re: Horribly slow pg_upgrade performance with many Large Objects

Nathan Bossart <nathandbossart@gmail.com> writes:

I do think it's worth considering going back to copying
pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16. I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time. The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

and some appropriate COPY data for pg_shdepend too.

(The fact that this representation will contain both numeric and
symbolic role OIDs is why I was concerned about OID stability.)

The key thing that worries me here is if the source and target
versions have different ideas of which roles are pinned, which would
silently change what appears in pg_shdepend. But it'd only really
break if a role mentioned in some LO's owner or ACL is pinned in the
source and not in the target, which seems unlikely. (In the other
direction, we'd just be adding a useless row in pg_shdepend.)

regards, tom lane

#16 Hannu Krosing
hannuk@google.com
In reply to: Tom Lane (#15)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 8, 2025 at 7:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I do think it's worth considering going back to copying
pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16. I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time. The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

Also note that in my --binary-upgrade tests the 100 min / 100M objects
ratio was for a case with no grants. I would expect this to grow to at
least 120 to 150 minutes when grants are also involved.

In the COPY case I would expect the presence of grants not to make much difference.

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

--
Hannu

#17 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#15)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I do think it's worth considering going back to copying
pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16. I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time. The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

I was thinking of actually reverting commit 12a53c7 for upgrades from v16,
which AFAICT is the last release where any relevant storage formats changed
(aclitem changed in v16). But if COPY gets us pretty close to that and is
less likely to be disrupted by future changes, it could be a better
long-term approach.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

and some appropriate COPY data for pg_shdepend too.

Unless I'm missing something, we don't seem to have had any dependency
handling before commit 12a53c7. Was that broken before we moved to SQL
commands?

--
nathan

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#16)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hannu Krosing <hannuk@google.com> writes:

In copy case I would expect the presence of grants to not make much difference.

aclitemin is slower than a lot of other datatype input functions,
but it's still got to be faster than a GRANT.

regards, tom lane

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#16)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
does *not* preserve database OIDs, which is problematic for using
COPY to load pg_shdepend rows.

regards, tom lane

#20 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#19)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:

Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
does *not* preserve database OIDs, which is problematic for using
COPY to load pg_shdepend rows.

I think it does; see commit aa01051.

--
nathan

#21 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#17)
Re: Horribly slow pg_upgrade performance with many Large Objects

Nathan Bossart <nathandbossart@gmail.com> writes:

Unless I'm missing something, we don't seem to have had any dependency
handling before commit 12a53c7. Was that broken before we moved to SQL
commands?

Sounds like it :-(

regards, tom lane

#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#20)
Re: Horribly slow pg_upgrade performance with many Large Objects

Nathan Bossart <nathandbossart@gmail.com> writes:

On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:

Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
does *not* preserve database OIDs, which is problematic for using
COPY to load pg_shdepend rows.

I think it does; see commit aa01051.

Ah --- I thought I remembered something having been done about that,
but I failed to find it because I was looking in pg_upgrade not
pg_dump. Too bad aa01051 didn't update the comment at the top of
pg_upgrade.c.

regards, tom lane

#23 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#21)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 01:42:20PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

Unless I'm missing something, we don't seem to have had any dependency
handling before commit 12a53c7. Was that broken before we moved to SQL
commands?

Sounds like it :-(

Huh. Sure enough, it seems to be lost during an upgrade from 9.6 to 10.

v9.6:

postgres=# select lo_from_bytea(1234, '1234');
lo_from_bytea
---------------
1234
(1 row)

postgres=# create role bob;
CREATE ROLE
postgres=# grant select on large object 1234 to bob;
GRANT
postgres=# drop role bob;
ERROR: role "bob" cannot be dropped because some objects depend on it
DETAIL: privileges for large object 1234

v10 (upgraded from v9.6):

postgres=# select lo_get(1234);
lo_get
------------
\x31323334
(1 row)

postgres=# drop role bob;
DROP ROLE

If I then try to upgrade that database to v17, it fails like this:

pg_restore: from TOC entry 2422; 0 0 ACL LARGE OBJECT 1234 nathan
pg_restore: error: could not execute query: ERROR: role "16384" does not exist
Command was: GRANT SELECT ON LARGE OBJECT 1234 TO "16384";

I've also verified that the dependency information is carried over in
upgrades to later versions (AFAICT all the supported ones).

--
nathan

#24 Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#23)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

...

I've also verified that the dependency information is carried over in
upgrades to later versions (AFAICT all the supported ones).

If I remember correctly, the change from copying the
pg_largeobject_metadata data file to moving LOs as part of the schema
was made in v12, when oid, which had been a system column in v11,
became a user column, so upgrades to v11 are likely also missing the
dependencies.

#25 Jan Wieck
jan@wi3ck.info
In reply to: Hannu Krosing (#24)
Re: Horribly slow pg_upgrade performance with many Large Objects

On 4/8/25 15:41, Hannu Krosing wrote:

On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

...

I've also verified that the dependency information is carried over in
upgrades to later versions (AFAICT all the supported ones).

If I remember correctly the change to not copying
pg_largeobject_metadata data file but instead moving LOs as part of
schema was done in v12 when oid,, which had been a system column in
v11, became a user column, so upgrade to v11 is likely also missing
the dependencies

I remember an incident where a large number of LOs ran pg_upgrade into a
transaction-ID wraparound because the restore part would create an
individual single-statement transaction per LO to create it, then change
permissions and ownership, and finally fill in the data. Could that be
related here?

Regards, Jan

#26 Nathan Bossart
nathandbossart@gmail.com
In reply to: Jan Wieck (#25)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 05:37:50PM -0400, Jan Wieck wrote:

I remember an incident where large amounts of LOs ran pg_upgrade into a
transaction-ID wrap around because the restore part would create individual
single statement transactions per LO to create, then change permissions and
ownership and finally fill in the data. Could that be related here?

I believe commits 74cf7d4 and 959b38d have largely fixed that problem.

--
nathan

#27 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#22)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 01:51:22PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:

Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
does *not* preserve database OIDs, which is problematic for using
COPY to load pg_shdepend rows.

I think it does; see commit aa01051.

Ah --- I thought I remembered something having been done about that,
but I failed to find it because I was looking in pg_upgrade not
pg_dump. Too bad aa01051 didn't update the comment at the top of
pg_upgrade.c.

I'll apply the attached patch to fix the comment shortly.

--
nathan

Attachments:

v1-0001-Mention-that-we-preserve-database-OIDs-in-pg_upgr.patch (text/plain; charset=us-ascii)
From 0ca95279bde8f3ce069a749253eba7c04988003e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 9 Apr 2025 12:12:27 -0500
Subject: [PATCH v1 1/1] Mention that we preserve database OIDs in pg_upgrade.c
 comment.

Oversight in commit aa01051418.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/4055696.1744134682%40sss.pgh.pa.us
---
 src/bin/pg_upgrade/pg_upgrade.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 72b5a5fde5a..d9db48dba0d 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -32,6 +32,9 @@
  *	We control all assignments of pg_authid.oid for historical reasons (the
  *	oids used to be stored in pg_largeobject_metadata, which is now copied via
  *	SQL commands), that might change at some point in the future.
+ *
+ *	We control all assignments of pg_database.oid because we want the directory
+ *	names to match between the old and new cluster.
  */
 
 
-- 
2.39.5 (Apple Git-154)

#28 Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#24)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 09:41:06PM +0200, Hannu Krosing wrote:

On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

I've also verified that the dependency information is carried over in
upgrades to later versions (AFAICT all the supported ones).

If I remember correctly the change to not copying
pg_largeobject_metadata data file but instead moving LOs as part of
schema was done in v12 when oid,, which had been a system column in
v11, became a user column, so upgrade to v11 is likely also missing
the dependencies

Right.

--
nathan

#29 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#17)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 12:22:00PM -0500, Nathan Bossart wrote:

On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I do think it's worth considering going back to copying
pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16. I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time. The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

I was thinking of actually reverting commit 12a53c7 for upgrades from v16,
which AFAICT is the last release where any relevant storage formats changed
(aclitem changed in v16). But if COPY gets us pretty close to that and is
less likely to be disrupted by future changes, it could be a better
long-term approach.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

and some appropriate COPY data for pg_shdepend too.

I did some more research here. For many large objects without ACLs to
dump, I noticed that the vast majority of time is going to restoring the
ALTER OWNER commands. For 1 million such large objects, restoring took ~73
seconds on my machine. If I instead invented an lo_create_with_owner()
function and created 100 per SELECT command, the same restore takes ~7
seconds. Copying the relevant pg_shdepend rows out and back in takes ~2.5
seconds. I imagine using COPY for pg_largeobject_metadata would also take
a couple of seconds in this case.
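
As a rough sketch, that pg_shdepend round trip looks something like the
following (the proof-of-concept patch later in this thread does
essentially this through a server-side file; the path here is only
illustrative):

COPY (SELECT * FROM pg_shdepend
      WHERE classid = 'pg_largeobject'::regclass) TO '/tmp/lo_shdep.out';
-- and on the new cluster:
COPY pg_shdepend FROM '/tmp/lo_shdep.out';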

For upgrading, I don't think there's any huge benefit to optimizing the
restore commands versus using COPY. It might make future catalog changes
for large object stuff easier, but I'd expect those to be rare. However,
the optimized restore commands could be nice for non-pg_upgrade use-cases.
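
For reference, the batching idea mentioned above would look roughly like
this; lo_create_with_owner() is purely hypothetical and does not exist
in core today:

SELECT pg_catalog.lo_create_with_owner(loid, ownerid)  -- hypothetical helper
FROM (VALUES (2121::oid, 10::oid),
             (2122::oid, 10::oid)) AS b(loid, ownerid); -- batched, ~100 rows per statement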

--
nathan

#30 Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#29)
Re: Horribly slow pg_upgrade performance with many Large Objects

And in case there *is* an ACL present, each user mentioned in the ACL
adds more overhead.

Also, the separate GRANT calls cause bloat, as the
pg_largeobject_metadata row gets updated for each ALTER ... OWNER or GRANT.

The following is for 10 million LOs with 1 and 3 users being GRANTed
SELECT on each object (with no grants the pg_restore run was 10
minutes)

Nr of GRANTS | pg_dump time | pg_restore time
-------------+--------------+----------------
           0 |       0m 10s |          10m 5s
           1 |       0m 17s |          15m 3s
           3 |       0m 21s |         27m 15s

NB! - I left out the --verbose flag from pg_dump as used by
pg_upgrade, as it will emit one line per LO dumped

## 1 GRANT / LO

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb10m.dump -p 5433 lodb10m
real 0m17.022s
user 0m2.956s
sys 0m1.453s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb10m lodb10m.dump
real 15m3.136s
user 0m28.991s
sys 2m54.164s

## 3 GRANTs / LO

make sample LO with 3 grants

ALTER LARGE OBJECT 1 OWNER TO "hannuk";
GRANT SELECT ON LARGE OBJECT 1 TO "bob";
GRANT SELECT ON LARGE OBJECT 1 TO "joe";
GRANT SELECT ON LARGE OBJECT 1 TO "tom";
lodb10m=# select * from pg_shdepend where objid = 1;

┌───────┬─────────┬───────┬──────────┬────────────┬──────────┬─────────┐
│ dbid │ classid │ objid │ objsubid │ refclassid │ refobjid │ deptype │
├───────┼─────────┼───────┼──────────┼────────────┼──────────┼─────────┤
│ 16406 │ 2613 │ 1 │ 0 │ 1260 │ 16384 │ o │
│ 16406 │ 2613 │ 1 │ 0 │ 1260 │ 16393 │ a │
│ 16406 │ 2613 │ 1 │ 0 │ 1260 │ 16394 │ a │
│ 16406 │ 2613 │ 1 │ 0 │ 1260 │ 16395 │ a │
└───────┴─────────┴───────┴──────────┴────────────┴──────────┴─────────┘

lodb10m=# select * from pg_largeobject_metadata ;
┌─────┬──────────┬───────────────────────────────────────────────────────────┐
│ oid │ lomowner │ lomacl │
├─────┼──────────┼───────────────────────────────────────────────────────────┤
│ 1 │ 16384 │ {hannuk=rw/hannuk,bob=r/hannuk,joe=r/hannuk,tom=r/hannuk} │
└─────┴──────────┴───────────────────────────────────────────────────────────┘

Make the remaining 10M-1 LOs

lodb10m=# insert into pg_largeobject_metadata(oid, lomowner, lomacl)
SELECT i, 16384,
'{hannuk=rw/hannuk,bob=r/hannuk,joe=r/hannuk,tom=r/hannuk}' FROM
generate_series(2, 10_000_000) g(i);
INSERT 0 9999999
Time: 18859.341 ms (00:18.859)

And add their sharedeps

lodb10m=# WITH refdeps (robj, rdeptype)
AS ( VALUES
(16384, 'o'),
(16393, 'a'),
(16394, 'a'),
(16395, 'a')
)
INSERT INTO pg_shdepend SELECT 16396, 2613, i, 0, 1260, robj, rdeptype
FROM generate_series(2, 10_000_000) g(i)
, refdeps
;
INSERT 0 39999996
Time: 116697.342 ms (01:56.697)

Time pg_upgrade's pg_dump and pg_restore:

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb10m-3grants.dump -p 5433 lodb10m
real 0m21.519s
user 0m2.951s
sys 0m1.723s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb10m
lodb10m-3grants.dump
real 27m15.372s
user 0m45.157s
sys 4m57.513s


#31 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#15)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I do think it's worth considering going back to copying
pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16. I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time. The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

and some appropriate COPY data for pg_shdepend too.

Attached is a proof-of-concept grade patch for using COPY for
pg_largeobject_metadata and the relevant pg_shdepend entries. On my
laptop, pg_upgrade with 10M LOs (each with a non-bootstrap-superuser owner
and another role with SELECT rights) goes from ~8.5 minutes to ~1 minute
with this patch.

I originally set out to invent a new lo_create_with_owner() function and
teach pg_dump to batch those together in large groups, but as I started the
required pg_dump surgery, I was quickly scared away by the complexity.
Next, I gave COPY a try. The improvements from using COPY will likely be
limited to the pg_upgrade case, but that's the only case I regularly hear
complaints about for zillions of large objects, so maybe it's good enough
for now.

For the COPY approach, I modified pg_dump to dump the contents of
pg_largeobject_metadata. This is easy enough, but I ran into problems with
the dependent comments and security labels. It turns out that even before
v12, we run all the lo_create() commands just so that creating the comments
and security labels works (AFAICT). So I suspect upgrading with many large
objects has always been slow. The comment/security label dependency issue
can be fixed (at least well enough for the tests) by moving
PRIO_LARGE_OBJECT below PRIO_TABLE_DATA. There might be an existing issue
here, because dbObjectTypePriorities has the following comment:

* NOTE: object-type priorities must match the section assignments made in
* pg_dump.c; that is, PRE_DATA objects must sort before DO_PRE_DATA_BOUNDARY,
* POST_DATA objects must sort after DO_POST_DATA_BOUNDARY, and DATA objects
* must sort between them.

But dumpLO() puts large objects in SECTION_DATA, and PRIO_LARGE_OBJECT is
before PRIO_PRE_DATA_BOUNDARY. I admittedly haven't spent too much time
investigating this, though. In any case, it might be a good idea to also
make sure we explicitly mark the large objects and their comments/seclabels
as dependent on the pg_largeobject_metadata data.

That leaves pg_shdepend. For now, I've just instructed pg_upgrade to COPY
the relevant pg_shdepend rows as an independent step, but perhaps there's a
reasonably straightforward way to put that in pg_dump, too.

--
nathan

Attachments:

v1-0001-pg_upgrade-Use-COPY-for-large-object-metadata.patch (text/plain; charset=us-ascii)
From 59fa6558a9258719840e4d423c86313937156f95 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 5 May 2025 14:11:56 -0500
Subject: [PATCH v1 1/1] pg_upgrade: Use COPY for large object metadata.

---
 src/bin/pg_dump/pg_backup_archiver.c |  5 +++++
 src/bin/pg_dump/pg_dump.c            | 16 +++++++++++++++-
 src/bin/pg_dump/pg_dump_sort.c       |  2 +-
 src/bin/pg_dump/t/002_pg_dump.pl     |  4 +++-
 src/bin/pg_upgrade/dump.c            |  9 +++++++++
 src/bin/pg_upgrade/pg_upgrade.c      |  9 +++++++++
 6 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index afa42337b11..0d61155ec0a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -2974,6 +2974,11 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	int			res = REQ_SCHEMA | REQ_DATA;
 	RestoreOptions *ropt = AH->public.ropt;
 
+	if (ropt->binary_upgrade &&
+		strcmp(te->tag, "pg_largeobject_metadata") == 0 &&
+		strcmp(te->namespace, "pg_catalog") == 0)
+		return REQ_DATA;
+
 	/* These items are treated specially */
 	if (strcmp(te->desc, "ENCODING") == 0 ||
 		strcmp(te->desc, "STDSTRINGS") == 0 ||
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e2e7975b34e..312e1010456 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -1083,6 +1083,20 @@ main(int argc, char **argv)
 	if (!dopt.dumpData && dopt.sequence_data)
 		getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
 
+	if (dopt.binary_upgrade)
+	{
+		for (int i = 0; i < numTables; i++)
+		{
+			if (tblinfo[i].relkind == RELKIND_RELATION &&
+				strcmp(tblinfo[i].dobj.name, "pg_largeobject_metadata") == 0 &&
+				strcmp(tblinfo[i].dobj.namespace->dobj.name, "pg_catalog") == 0)
+			{
+				makeTableDataInfo(&dopt, &(tblinfo[i]));
+				break;
+			}
+		}
+	}
+
 	/*
 	 * In binary-upgrade mode, we do not have to worry about the actual LO
 	 * data or the associated metadata that resides in the pg_largeobject and
@@ -3925,7 +3939,7 @@ getLOs(Archive *fout)
 		 * pg_largeobject_metadata, after the dump is restored.
 		 */
 		if (dopt->binary_upgrade)
-			loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+			loinfo->dobj.dump &= ~(DUMP_COMPONENT_DATA | DUMP_COMPONENT_ACL | DUMP_COMPONENT_DEFINITION);
 
 		/*
 		 * Create a "BLOBS" data item for the group, too. This is just a
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 0b0977788f1..538e7dcb493 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -76,10 +76,10 @@ enum dbObjectTypePriorities
 	PRIO_TABLE_ATTACH,
 	PRIO_DUMMY_TYPE,
 	PRIO_ATTRDEF,
-	PRIO_LARGE_OBJECT,
 	PRIO_PRE_DATA_BOUNDARY,		/* boundary! */
 	PRIO_TABLE_DATA,
 	PRIO_SEQUENCE_SET,
+	PRIO_LARGE_OBJECT,
 	PRIO_LARGE_OBJECT_DATA,
 	PRIO_STATISTICS_DATA_DATA,
 	PRIO_POST_DATA_BOUNDARY,	/* boundary! */
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 55d892d9c16..2417f537ac1 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1029,6 +1029,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_owner => 1,
 			schema_only => 1,
@@ -1517,6 +1518,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			schema_only => 1,
 			schema_only_with_statistics => 1,
 			no_large_objects => 1,
@@ -4524,9 +4526,9 @@ my %tests = (
 			no_schema => 1,
 			section_data => 1,
 			test_schema_plus_large_objects => 1,
-			binary_upgrade => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_privs => 1,
 			schema_only => 1,
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23cb08e8347..e5b73b7e7a0 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -15,6 +15,9 @@
 void
 generate_old_dump(void)
 {
+	char	   *path = psprintf("%s/lo_shdep.out", realpath(log_opts.dumpdir, NULL));
+	PQExpBuffer buf = createPQExpBuffer();
+	PGconn	   *conn;
 	int			dbnum;
 
 	prep_status("Creating dump of global objects");
@@ -70,5 +73,11 @@ generate_old_dump(void)
 		;
 
 	end_progress_output();
+
+	conn = connectToServer(&old_cluster, "template1");
+	appendStringLiteralConn(buf, path, conn);
+	PQclear(executeQueryOrDie(conn, "COPY (SELECT * FROM pg_shdepend WHERE classid = 'pg_largeobject'::regclass) TO %s", buf->data));
+	destroyPQExpBuffer(buf);
+	PQfinish(conn);
 	check_ok();
 }
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 536e49d2616..7a6234d48c0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -571,6 +571,8 @@ prepare_new_globals(void)
 static void
 create_new_objects(void)
 {
+	char	   *path = psprintf("%s/lo_shdep.out", realpath(log_opts.dumpdir, NULL));
+	PQExpBuffer buf = createPQExpBuffer();
 	int			dbnum;
 	PGconn	   *conn_new_template1;
 
@@ -688,6 +690,13 @@ create_new_objects(void)
 		;
 
 	end_progress_output();
+
+	conn_new_template1 = connectToServer(&new_cluster, "template1");
+	appendStringLiteralConn(buf, path, conn_new_template1);
+	PQclear(executeQueryOrDie(conn_new_template1, "COPY pg_shdepend FROM %s", buf->data));
+	destroyPQExpBuffer(buf);
+	PQfinish(conn_new_template1);
+
 	check_ok();
 
 	/*
-- 
2.39.5 (Apple Git-154)

#32 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#31)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Mon, May 05, 2025 at 02:23:25PM -0500, Nathan Bossart wrote:

That leaves pg_shdepend. For now, I've just instructed pg_upgrade to COPY
the relevant pg_shdepend rows as an independent step, but perhaps there's a
reasonably straightforward way to put that in pg_dump, too.

It turns out there is. TableDataInfo's filtercond field can be used to
easily add a WHERE clause to the data dumping command. On my laptop,
upgrading with --jobs=8 with 10M large objects evenly distributed across 10
databases (each with a non-bootstrap-superuser owner and another role with
select rights) takes ~100 seconds without this patch and ~30 seconds with
it.

I've also added dependency tracking, version checks (this only works for
upgrades from >=v12 for now), a hack to ensure the columns for
pg_largeobject_metadata/pg_shdepend are collected, and comments. I'm sure
there's something I've missed, but this patch has worked well in my tests
thus far.

Taking a step back, I'm a little disappointed in the gains here. A 3-9x
speedup is nice, but I guess I was hoping to find another order of
magnitude somewhere. To do any better, I think we'd need to copy the files
for pg_largeobject_metadata directly for upgrades from >= v16, but that
would have to fit somewhere between when pg_restore creates the database
and when it restores any large object comments/seclabels. I'm not wild
about the amount of hackery required to get that working.

--
nathan

Attachments:

v2-0001-pg_upgrade-Use-COPY-for-large-object-metadata.patch (text/plain; charset=us-ascii)
From b14b74a4ad3ffff60c2fcb3c241766a651d681c5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 7 May 2025 09:06:05 -0500
Subject: [PATCH v2 1/1] pg_upgrade: Use COPY for large object metadata.

Reported-by: Hannu Krosing <hannuk@google.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMT0RQSS-6qLH%2BzYsOeUbAYhop3wmQTkNmQpo5--QRDUR%2BqYmQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_backup_archiver.c | 13 +++++
 src/bin/pg_dump/pg_dump.c            | 78 ++++++++++++++++++++++++++--
 src/bin/pg_dump/pg_dump_sort.c       |  2 +-
 src/bin/pg_dump/t/002_pg_dump.pl     |  4 +-
 4 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index afa42337b11..4c387b0cb5e 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -31,6 +31,8 @@
 #endif
 
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_largeobject_metadata.h"
+#include "catalog/pg_shdepend.h"
 #include "common/string.h"
 #include "compress_io.h"
 #include "dumputils.h"
@@ -2974,6 +2976,17 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	int			res = REQ_SCHEMA | REQ_DATA;
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.
+	 */
+	if (ropt->binary_upgrade && AH->public.remoteVersion >= 120000 &&
+		strcmp(te->desc, "TABLE DATA") == 0 &&
+		(te->catalogId.oid == LargeObjectMetadataRelationId ||
+		 te->catalogId.oid == SharedDependRelationId))
+		return REQ_DATA;
+
 	/* These items are treated specially */
 	if (strcmp(te->desc, "ENCODING") == 0 ||
 		strcmp(te->desc, "STDSTRINGS") == 0 ||
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e2e7975b34e..1666a4ec40f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -49,8 +49,10 @@
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_default_acl_d.h"
 #include "catalog/pg_largeobject_d.h"
+#include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
 #include "catalog/pg_publication_d.h"
+#include "catalog/pg_shdepend.h"
 #include "catalog/pg_subscription_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -209,6 +211,12 @@ static int	nbinaryUpgradeClassOids = 0;
 static SequenceItem *sequences = NULL;
 static int	nsequences = 0;
 
+/*
+ * For binary upgrade, the dump ID of pg_largeobject_metadata is saved for use
+ * as a dependency for pg_shdepend and any large object comments/seclabels.
+ */
+static DumpId lo_metadata_dumpId;
+
 /* Maximum number of relations to fetch in a fetchAttributeStats() call. */
 #define MAX_ATTR_STATS_RELS 64
 
@@ -1083,6 +1091,34 @@ main(int argc, char **argv)
 	if (!dopt.dumpData && dopt.sequence_data)
 		getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.
+	 */
+	if (dopt.binary_upgrade && fout->remoteVersion >= 120000)
+	{
+		TableInfo  *lo_metadata = findTableByOid(LargeObjectMetadataRelationId);
+		TableInfo  *shdepend = findTableByOid(SharedDependRelationId);
+
+		makeTableDataInfo(&dopt, lo_metadata);
+		makeTableDataInfo(&dopt, shdepend);
+
+		/*
+		 * Save pg_largeobject_metadata's dump ID for use as a dependency on
+		 * pg_shdepend and any large object comments/seclabels.
+		 */
+		lo_metadata_dumpId = lo_metadata->dataObj->dobj.dumpId;
+		addObjectDependency(&shdepend->dataObj->dobj, lo_metadata_dumpId);
+
+		/*
+		 * Only dump large object shdepend rows for this database.
+		 */
+		shdepend->dataObj->filtercond = "WHERE classid = 'pg_largeobject'::regclass "
+			"AND dbid = (SELECT oid FROM pg_database "
+			"            WHERE datname = current_database())";
+	}
+
 	/*
 	 * In binary-upgrade mode, we do not have to worry about the actual LO
 	 * data or the associated metadata that resides in the pg_largeobject and
@@ -3922,10 +3958,29 @@ getLOs(Archive *fout)
 		 * as it will be copied by pg_upgrade, which simply copies the
 		 * pg_largeobject table. We *do* however dump out anything but the
 		 * data, as pg_upgrade copies just pg_largeobject, but not
-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.
 		 */
 		if (dopt->binary_upgrade)
-			loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		{
+			if (fout->remoteVersion >= 120000)
+			{
+				loinfo->dobj.dump &= ~(DUMP_COMPONENT_DATA | DUMP_COMPONENT_ACL | DUMP_COMPONENT_DEFINITION);
+
+				/*
+				 * Mark the large object as dependent on
+				 * pg_largeobject_metadata so that any large object
+				 * comments/seclables are dumped after it.
+				 */
+				loinfo->dobj.dependencies = (DumpId *) pg_malloc(sizeof(DumpId));
+				loinfo->dobj.dependencies[0] = lo_metadata_dumpId;
+				loinfo->dobj.nDeps = loinfo->dobj.allocDeps = 1;
+			}
+			else
+				loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		}
 
 		/*
 		 * Create a "BLOBS" data item for the group, too. This is just a
@@ -9034,8 +9089,18 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		if (tbinfo->relkind == RELKIND_SEQUENCE)
 			continue;
 
-		/* Don't bother with uninteresting tables, either */
-		if (!tbinfo->interesting)
+		/*
+		 * Don't bother with uninteresting tables, either.  For binary
+		 * upgrades, this is bypassed for pg_largeobject_metadata and
+		 * pg_shdepend so that the columns names are collected for the
+		 * corresponding COPY commands.  Restoring the data for those catalogs
+		 * is faster than restoring the equivalent set of large object
+		 * commands.
+		 */
+		if (!tbinfo->interesting &&
+			!(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			  (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+			   tbinfo->dobj.catId.oid == SharedDependRelationId)))
 			continue;
 
 		/* OK, we need info for this table */
@@ -9232,7 +9297,10 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			pg_fatal("unrecognized table OID %u", attrelid);
 		/* cross-check that we only got requested tables */
 		if (tbinfo->relkind == RELKIND_SEQUENCE ||
-			!tbinfo->interesting)
+			(!tbinfo->interesting &&
+			 !(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			   (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+				tbinfo->dobj.catId.oid == SharedDependRelationId))))
 			pg_fatal("unexpected column data for table \"%s\"",
 					 tbinfo->dobj.name);
 
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 0b0977788f1..538e7dcb493 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -76,10 +76,10 @@ enum dbObjectTypePriorities
 	PRIO_TABLE_ATTACH,
 	PRIO_DUMMY_TYPE,
 	PRIO_ATTRDEF,
-	PRIO_LARGE_OBJECT,
 	PRIO_PRE_DATA_BOUNDARY,		/* boundary! */
 	PRIO_TABLE_DATA,
 	PRIO_SEQUENCE_SET,
+	PRIO_LARGE_OBJECT,
 	PRIO_LARGE_OBJECT_DATA,
 	PRIO_STATISTICS_DATA_DATA,
 	PRIO_POST_DATA_BOUNDARY,	/* boundary! */
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 55d892d9c16..2417f537ac1 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1029,6 +1029,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_owner => 1,
 			schema_only => 1,
@@ -1517,6 +1518,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			schema_only => 1,
 			schema_only_with_statistics => 1,
 			no_large_objects => 1,
@@ -4524,9 +4526,9 @@ my %tests = (
 			no_schema => 1,
 			section_data => 1,
 			test_schema_plus_large_objects => 1,
-			binary_upgrade => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_privs => 1,
 			schema_only => 1,
-- 
2.39.5 (Apple Git-154)

#33Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#32)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hi Nathan,

Did a quick check of the patch and it seems to work ok.

What do you think of the idea of not dumping pg_shdepend here, but
instead adding the required entries after loading
pg_largeobject_metadata based on the contents of it ?

The query for this would be:

    WITH db AS (
        SELECT oid FROM pg_database WHERE datname = current_database()
    )
    INSERT INTO pg_shdepend
    SELECT db.oid AS dbid,
           2613 AS classid,
           lm.oid AS objid,
           0 AS objsubid,
           1260 AS refclassid,
           COALESCE(acl.grantee, lm.lomowner) AS refobjid,
           CASE WHEN grantee IS NULL OR grantee = lomowner
                THEN 'o'
                ELSE 'a'
           END AS deptype
    FROM pg_largeobject_metadata AS lm
    LEFT JOIN LATERAL (
        SELECT DISTINCT (aclexplode(lm.lomacl)).grantee
    ) AS acl ON true,
    db
    WHERE (lm.oid, COALESCE(acl.grantee, lm.lomowner)) NOT IN
          (SELECT objid, refobjid FROM pg_shdepend WHERE dbid = db.oid);

(I had hoped to use ON CONFLICT DO NOTHING but this is not supported
for system tables.)

--
Hannu


On Wed, May 7, 2025 at 4:51 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, May 05, 2025 at 02:23:25PM -0500, Nathan Bossart wrote:

That leaves pg_shdepend. For now, I've just instructed pg_upgrade to COPY
the relevant pg_shdepend rows as an independent step, but perhaps there's a
reasonably straightforward way to put that in pg_dump, too.

It turns out there is. TableDataInfo's filtercond field can be used to
easily add a WHERE clause to the data dumping command. On my laptop,
upgrading with --jobs=8 with 10M large objects evenly distributed across 10
databases (each with a non-bootstrap-superuser owner and another role with
select rights) takes ~100 seconds without this patch and ~30 seconds with
it.

I've also added dependency tracking, version checks (this only works for
upgrades from >=v12 for now), a hack to ensure the columns for
pg_largeobject_metadata/pg_shdepend are collected, and comments. I'm sure
there's something I've missed, but this patch has worked well in my tests
thus far.

Taking a step back, I'm a little disappointed in the gains here. A 3-9x
speedup is nice, but I guess I was hoping to find another order of
magnitude somewhere. To do any better, I think we'd need to copy the files
for pg_largeobject_metadata directly for upgrades from >= v16, but that
would have to fit somewhere between when pg_restore creates the database
and when it restores any large object comments/seclabels. I'm not wild
about the amount of hackery required to get that working.

--
nathan

#34Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#33)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Sun, Jul 06, 2025 at 02:48:08PM +0200, Hannu Krosing wrote:

Did a quick check of the patch and it seems to work ok.

Thanks for taking a look.

What do you think of the idea of not dumping pg_shdepend here, but
instead adding the required entries after loading
pg_largeobject_metadata based on the contents of it ?

While not dumping it might save a little space during upgrade, the query
seems to be extremely slow. So, I don't see any strong advantage.

--
nathan

#35Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#34)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Jul 8, 2025 at 11:06 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Sun, Jul 06, 2025 at 02:48:08PM +0200, Hannu Krosing wrote:

Did a quick check of the patch and it seems to work ok.

Thanks for taking a look.

What do you think of the idea of not dumping pg_shdepend here, but
instead adding the required entries after loading
pg_largeobject_metadata based on the contents of it ?

While not dumping it might save a little space during upgrade, the query
seems to be extremely slow. So, I don't see any strong advantage.

Yeah, looks like the part that avoids duplicates made it slow.

If you run it without the last WHERE clause, it is reasonably fast. And it
then behaves the same as just inserting from the dump, which also does not
have any checks against duplicates.
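
That is, roughly the same INSERT with only the final deduplication
predicate removed (an untested sketch):

    WITH db AS (
        SELECT oid FROM pg_database WHERE datname = current_database()
    )
    INSERT INTO pg_shdepend
    SELECT db.oid AS dbid,
           2613 AS classid,                                -- pg_largeobject
           lm.oid AS objid,
           0 AS objsubid,
           1260 AS refclassid,                             -- pg_authid
           COALESCE(acl.grantee, lm.lomowner) AS refobjid,
           CASE WHEN acl.grantee IS NULL OR acl.grantee = lm.lomowner
                THEN 'o'                                   -- owner dependency
                ELSE 'a'                                   -- ACL dependency
           END AS deptype
    FROM pg_largeobject_metadata AS lm
    LEFT JOIN LATERAL (
        SELECT DISTINCT (aclexplode(lm.lomacl)).grantee
    ) AS acl ON true,
    db;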

#36Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#35)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Wed, Jul 09, 2025 at 04:52:16PM +0200, Hannu Krosing wrote:

If you run it without the last WHERE it is reasonably fast. And it
behaves the same as just inserting from the dump which also does not
have any checks against duplicates.

With that change, the query is much faster, but my testing indicates that
it's still measurably slower than dumping and restoring the relevant
contents of pg_shdepend.

--
nathan

#37Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#36)
Re: Horribly slow pg_upgrade performance with many Large Objects

Ah, I see.

I retried and now it runs for 21 to 29 seconds for 1 million large
objects with two grants each, i.e. 3M rows inserted in total.

Yesterday it ran in 8 seconds. Unfortunately I did not save the plan from
yesterday.

Anyway, it looks like just dumping pg_shdepend is better, and it likely
also compresses very well for most users.

--
Hannu

#38Nitin Motiani
nitinmotiani@google.com
In reply to: Hannu Krosing (#37)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hi,

I have a couple of comments/questions.

There might be an existing issue here, because dbObjectTypePriorities has
the following comment:

 * NOTE: object-type priorities must match the section assignments made in
 * pg_dump.c; that is, PRE_DATA objects must sort before DO_PRE_DATA_BOUNDARY,
 * POST_DATA objects must sort after DO_POST_DATA_BOUNDARY, and DATA objects
 * must sort between them.

But dumpLO() puts large objects in SECTION_DATA, and PRIO_LARGE_OBJECT is
before PRIO_PRE_DATA_BOUNDARY. I admittedly haven't spent too much time
investigating this, though.

I looked through the history of this to see how it happened and whether it
could be an existing issue. Prior to a45c78e3284b, dumpLO() used to put large
objects in SECTION_PRE_DATA. That commit changed dumpLO() and also changed
addBoundaryDependencies() to move DO_LARGE_OBJECT from the pre-data to the
data section. It seems that since then this has been inconsistent with
pg_dump_sort.c. I think the change in pg_dump_sort.c should be backported
to PG17 and PG18 independently of the state of the larger patch.

But even with the inconsistency, it doesn't look like there is an existing
issue. Since the dependencies were changed in addBoundaryDependencies(), they
should take precedence over the order in pg_dump_sort.c. The
dbObjectTypePriorities are used by sortDumpableObjectsByTypeName(), but right
after that sortDumpableObjects() sorts the objects based on dependencies, so
the change in boundary dependencies should ensure that this works as
intended. Still, I think dbObjectTypePriorities should be made consistent
with the rest.

Also regarding this change in the patch

-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.
 		 */

Should the comment provide further detail on why this is only being done
for v12 and above?

Thanks & Regards,
Nitin Motiani
Google

#39Nathan Bossart
nathandbossart@gmail.com
In reply to: Nitin Motiani (#38)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Thu, Jul 10, 2025 at 06:05:26PM +0530, Nitin Motiani wrote:

-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.
*/

Should the comment provide further detail on why this is only being done
for v12 and above?

Yes. I've fixed this in v3.

--
nathan

Attachments:

v3-0001-pg_upgrade-Use-COPY-for-large-object-metadata.patch (text/plain; charset=us-ascii)
From b784cd881933a1c01f4ea143f06bf77972924459 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 10 Jul 2025 16:20:33 -0500
Subject: [PATCH v3 1/1] pg_upgrade: Use COPY for large object metadata.

Reported-by: Hannu Krosing <hannuk@google.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Hannu Krosing <hannuk@google.com>
Reviewed-by: Nitin Motiani <nitinmotiani@google.com>
Discussion: https://postgr.es/m/CAMT0RQSS-6qLH%2BzYsOeUbAYhop3wmQTkNmQpo5--QRDUR%2BqYmQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_backup_archiver.c | 15 +++++
 src/bin/pg_dump/pg_dump.c            | 84 ++++++++++++++++++++++++++--
 src/bin/pg_dump/t/002_pg_dump.pl     |  4 +-
 3 files changed, 97 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 197c1295d93..1b1c61e5ce5 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -31,6 +31,8 @@
 #endif
 
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_largeobject_metadata.h"
+#include "catalog/pg_shdepend.h"
 #include "common/string.h"
 #include "compress_io.h"
 #include "dumputils.h"
@@ -2974,6 +2976,19 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	int			res = REQ_SCHEMA | REQ_DATA;
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (ropt->binary_upgrade && AH->public.remoteVersion >= 120000 &&
+		strcmp(te->desc, "TABLE DATA") == 0 &&
+		(te->catalogId.oid == LargeObjectMetadataRelationId ||
+		 te->catalogId.oid == SharedDependRelationId))
+		return REQ_DATA;
+
 	/* These items are treated specially */
 	if (strcmp(te->desc, "ENCODING") == 0 ||
 		strcmp(te->desc, "STDSTRINGS") == 0 ||
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 1937997ea67..bdea0ef7c50 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -49,8 +49,10 @@
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_default_acl_d.h"
 #include "catalog/pg_largeobject_d.h"
+#include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
 #include "catalog/pg_publication_d.h"
+#include "catalog/pg_shdepend.h"
 #include "catalog/pg_subscription_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -209,6 +211,12 @@ static int	nbinaryUpgradeClassOids = 0;
 static SequenceItem *sequences = NULL;
 static int	nsequences = 0;
 
+/*
+ * For binary upgrade, the dump ID of pg_largeobject_metadata is saved for use
+ * as a dependency for pg_shdepend and any large object comments/seclabels.
+ */
+static DumpId lo_metadata_dumpId;
+
 /* Maximum number of relations to fetch in a fetchAttributeStats() call. */
 #define MAX_ATTR_STATS_RELS 64
 
@@ -1085,6 +1093,36 @@ main(int argc, char **argv)
 	if (!dopt.dumpData && dopt.sequence_data)
 		getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (dopt.binary_upgrade && fout->remoteVersion >= 120000)
+	{
+		TableInfo  *lo_metadata = findTableByOid(LargeObjectMetadataRelationId);
+		TableInfo  *shdepend = findTableByOid(SharedDependRelationId);
+
+		makeTableDataInfo(&dopt, lo_metadata);
+		makeTableDataInfo(&dopt, shdepend);
+
+		/*
+		 * Save pg_largeobject_metadata's dump ID for use as a dependency on
+		 * pg_shdepend and any large object comments/seclabels.
+		 */
+		lo_metadata_dumpId = lo_metadata->dataObj->dobj.dumpId;
+		addObjectDependency(&shdepend->dataObj->dobj, lo_metadata_dumpId);
+
+		/*
+		 * Only dump large object shdepend rows for this database.
+		 */
+		shdepend->dataObj->filtercond = "WHERE classid = 'pg_largeobject'::regclass "
+			"AND dbid = (SELECT oid FROM pg_database "
+			"            WHERE datname = current_database())";
+	}
+
 	/*
 	 * In binary-upgrade mode, we do not have to worry about the actual LO
 	 * data or the associated metadata that resides in the pg_largeobject and
@@ -3924,10 +3962,31 @@ getLOs(Archive *fout)
 		 * as it will be copied by pg_upgrade, which simply copies the
 		 * pg_largeobject table. We *do* however dump out anything but the
 		 * data, as pg_upgrade copies just pg_largeobject, but not
-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.  (On
+		 * <v12, pg_largeobject_metadata was created WITH OIDS, so the OID
+		 * column is hidden and won't be dumped.)
 		 */
 		if (dopt->binary_upgrade)
-			loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		{
+			if (fout->remoteVersion >= 120000)
+			{
+				loinfo->dobj.dump &= ~(DUMP_COMPONENT_DATA | DUMP_COMPONENT_ACL | DUMP_COMPONENT_DEFINITION);
+
+				/*
+				 * Mark the large object as dependent on
+				 * pg_largeobject_metadata so that any large object
+				 * comments/seclables are dumped after it.
+				 */
+				loinfo->dobj.dependencies = (DumpId *) pg_malloc(sizeof(DumpId));
+				loinfo->dobj.dependencies[0] = lo_metadata_dumpId;
+				loinfo->dobj.nDeps = loinfo->dobj.allocDeps = 1;
+			}
+			else
+				loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		}
 
 		/*
 		 * Create a "BLOBS" data item for the group, too. This is just a
@@ -9039,8 +9098,20 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		if (tbinfo->relkind == RELKIND_SEQUENCE)
 			continue;
 
-		/* Don't bother with uninteresting tables, either */
-		if (!tbinfo->interesting)
+		/*
+		 * Don't bother with uninteresting tables, either.  For binary
+		 * upgrades, this is bypassed for pg_largeobject_metadata and
+		 * pg_shdepend so that the columns names are collected for the
+		 * corresponding COPY commands.  Restoring the data for those catalogs
+		 * is faster than restoring the equivalent set of large object
+		 * commands.  We can only do this for upgrades from v12 and newer; in
+		 * older versions, pg_largeobject_metadata was created WITH OIDS, so
+		 * the OID column is hidden and won't be dumped.
+		 */
+		if (!tbinfo->interesting &&
+			!(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			  (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+			   tbinfo->dobj.catId.oid == SharedDependRelationId)))
 			continue;
 
 		/* OK, we need info for this table */
@@ -9244,7 +9315,10 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			pg_fatal("unrecognized table OID %u", attrelid);
 		/* cross-check that we only got requested tables */
 		if (tbinfo->relkind == RELKIND_SEQUENCE ||
-			!tbinfo->interesting)
+			(!tbinfo->interesting &&
+			 !(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			   (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+				tbinfo->dobj.catId.oid == SharedDependRelationId))))
 			pg_fatal("unexpected column data for table \"%s\"",
 					 tbinfo->dobj.name);
 
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 2485d8f360e..d8330e2bd17 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1087,6 +1087,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_owner => 1,
 			schema_only => 1,
@@ -1605,6 +1606,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			schema_only => 1,
 			schema_only_with_statistics => 1,
 			no_large_objects => 1,
@@ -4612,9 +4614,9 @@ my %tests = (
 			no_schema => 1,
 			section_data => 1,
 			test_schema_plus_large_objects => 1,
-			binary_upgrade => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_privs => 1,
 			schema_only => 1,
-- 
2.39.5 (Apple Git-154)

#40Nitin Motiani
nitinmotiani@google.com
In reply to: Nathan Bossart (#39)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Fri, Jul 11, 2025 at 3:12 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Thu, Jul 10, 2025 at 06:05:26PM +0530, Nitin Motiani wrote:

-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.
 		 */

Should the comment provide further detail on why this is only being done
for v12 and above?

Yes. I've fixed this in v3.

Thanks. Looks good to me. I would also just like to confirm that the
pg_dump_sort change will go in a different patch.

Regards,
Nitin Motiani
Google

#41Nathan Bossart
nathandbossart@gmail.com
In reply to: Nitin Motiani (#40)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Fri, Jul 11, 2025 at 07:49:47PM +0530, Nitin Motiani wrote:

Thanks. Looks good to me.

Thank you for reviewing.

Also just would like to confirm that the pg_dump_sort change will go in a
different patch.

That's already been committed: http://postgr.es/c/fb6c860.

--
nathan

#42Nitin Motiani
nitinmotiani@google.com
In reply to: Nathan Bossart (#41)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Fri, Jul 11, 2025 at 8:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

Also just would like to confirm that the pg_dump_sort change will go in a
different patch.

That's already been committed: http://postgr.es/c/fb6c860.

That's great. Thank you.

#43Nathan Bossart
nathandbossart@gmail.com
In reply to: Nitin Motiani (#42)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

Here is what I have staged for commit, which (barring feedback or
objections) I am planning to do towards the end of the week.

--
nathan

Attachments:

v4-0001-pg_upgrade-Use-COPY-for-large-object-metadata.patch (text/plain; charset=us-ascii)
From 9501f31cdf3ce25012cec8f9e2d7c433df749979 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 14 Jul 2025 11:17:47 -0500
Subject: [PATCH v4 1/1] pg_upgrade: Use COPY for large object metadata.

Presently, pg_dump generates commands like

    SELECT pg_catalog.lo_create('5432');
    ALTER LARGE OBJECT 5432 OWNER TO alice;
    GRANT SELECT ON LARGE OBJECT 5432 TO bob;

for each large object.  This is particularly slow at restore time,
especially when there are tens or hundreds of millions of large
objects.  From reports and personal experience, such slow restores
are primarily encountered at pg_upgrade time.  This commit teaches
pg_dump to instead dump pg_largeobject_metadata and the
corresponding pg_shdepend rows when in binary upgrade mode, i.e.,
pg_dump now generates commands like

    COPY pg_catalog.pg_largeobject_metadata (oid, lomowner, lomacl) FROM stdin;
    5432	16384	{alice=rw/alice,bob=r/alice}
    \.

    COPY pg_catalog.pg_shdepend (dbid, classid, objid, objsubid, refclassid, refobjid, deptype) FROM stdin;
    5	2613	5432	0	1260	16384	o
    5	2613	5432	0	1260	16385	a
    \.

Testing indicates the COPY approach can be significantly faster.
To do any better, we'd probably need to find a way to copy/link
pg_largeobject_metadata's files during pg_upgrade, which would be
complicated and limited to upgrades from >= v16 (since commit
7b378237aa changed the storage format for aclitem, which is used
for pg_largeobject_metadata.lomacl).

Note that this change only applies to binary upgrade mode (i.e.,
dumps initiated by pg_upgrade) since it inserts rows directly into
catalogs, bypassing the ordinary privilege checks.  Also, this
optimization can only be used for upgrades from >= v12 because
pg_largeobject_metadata was created WITH OIDS in older versions,
which prevents pg_dump from handling pg_largeobject_metadata.oid
properly.  With some extra effort, it might be possible to support
upgrades from older versions, but the added complexity didn't seem
worth it to support versions that will have been out-of-support for
nearly 3 years by the time this change is released.

The astute hacker may remember that prior to v12, pg_upgrade
copied/linked pg_largeobject_metadata's files (see commit
12a53c732c).  Besides the aforementioned storage format issues,
this approach failed to transfer the relevant pg_shdepend rows, and
pg_dump still had to generate an lo_create() command per large
object so that creating the dependent comments and security labels
worked.  We could perhaps adopt a hybrid approach for upgrades from
v16 and newer (i.e., generate lo_create() commands for each large
object, copy/link pg_largeobject_metadata's files, and COPY the
relevant pg_shdepend rows), but further testing is needed.

Reported-by: Hannu Krosing <hannuk@google.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Hannu Krosing <hannuk@google.com>
Reviewed-by: Nitin Motiani <nitinmotiani@google.com>
Discussion: https://postgr.es/m/CAMT0RQSS-6qLH%2BzYsOeUbAYhop3wmQTkNmQpo5--QRDUR%2BqYmQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_backup_archiver.c | 15 +++++
 src/bin/pg_dump/pg_dump.c            | 90 ++++++++++++++++++++++++++--
 src/bin/pg_dump/t/002_pg_dump.pl     |  4 +-
 3 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 197c1295d93..1b1c61e5ce5 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -31,6 +31,8 @@
 #endif
 
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_largeobject_metadata.h"
+#include "catalog/pg_shdepend.h"
 #include "common/string.h"
 #include "compress_io.h"
 #include "dumputils.h"
@@ -2974,6 +2976,19 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	int			res = REQ_SCHEMA | REQ_DATA;
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (ropt->binary_upgrade && AH->public.remoteVersion >= 120000 &&
+		strcmp(te->desc, "TABLE DATA") == 0 &&
+		(te->catalogId.oid == LargeObjectMetadataRelationId ||
+		 te->catalogId.oid == SharedDependRelationId))
+		return REQ_DATA;
+
 	/* These items are treated specially */
 	if (strcmp(te->desc, "ENCODING") == 0 ||
 		strcmp(te->desc, "STDSTRINGS") == 0 ||
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 1937997ea67..89a4927a25a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -49,8 +49,10 @@
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_default_acl_d.h"
 #include "catalog/pg_largeobject_d.h"
+#include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
 #include "catalog/pg_publication_d.h"
+#include "catalog/pg_shdepend.h"
 #include "catalog/pg_subscription_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -209,6 +211,12 @@ static int	nbinaryUpgradeClassOids = 0;
 static SequenceItem *sequences = NULL;
 static int	nsequences = 0;
 
+/*
+ * For binary upgrade, the dump ID of pg_largeobject_metadata is saved for use
+ * as a dependency for pg_shdepend and any large object comments/seclabels.
+ */
+static DumpId lo_metadata_dumpId;
+
 /* Maximum number of relations to fetch in a fetchAttributeStats() call. */
 #define MAX_ATTR_STATS_RELS 64
 
@@ -1085,6 +1093,36 @@ main(int argc, char **argv)
 	if (!dopt.dumpData && dopt.sequence_data)
 		getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (dopt.binary_upgrade && fout->remoteVersion >= 120000)
+	{
+		TableInfo  *lo_metadata = findTableByOid(LargeObjectMetadataRelationId);
+		TableInfo  *shdepend = findTableByOid(SharedDependRelationId);
+
+		makeTableDataInfo(&dopt, lo_metadata);
+		makeTableDataInfo(&dopt, shdepend);
+
+		/*
+		 * Save pg_largeobject_metadata's dump ID for use as a dependency on
+		 * pg_shdepend and any large object comments/seclabels.
+		 */
+		lo_metadata_dumpId = lo_metadata->dataObj->dobj.dumpId;
+		addObjectDependency(&shdepend->dataObj->dobj, lo_metadata_dumpId);
+
+		/*
+		 * Only dump large object shdepend rows for this database.
+		 */
+		shdepend->dataObj->filtercond = "WHERE classid = 'pg_largeobject'::regclass "
+			"AND dbid = (SELECT oid FROM pg_database "
+			"            WHERE datname = current_database())";
+	}
+
 	/*
 	 * In binary-upgrade mode, we do not have to worry about the actual LO
 	 * data or the associated metadata that resides in the pg_largeobject and
@@ -3924,10 +3962,37 @@ getLOs(Archive *fout)
 		 * as it will be copied by pg_upgrade, which simply copies the
 		 * pg_largeobject table. We *do* however dump out anything but the
 		 * data, as pg_upgrade copies just pg_largeobject, but not
-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.  (On
+		 * <v12, pg_largeobject_metadata was created WITH OIDS, so the OID
+		 * column is hidden and won't be dumped.)
 		 */
 		if (dopt->binary_upgrade)
-			loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		{
+			if (fout->remoteVersion >= 120000)
+			{
+				/*
+				 * We should've saved pg_largeobject_metadata's dump ID before
+				 * this point.
+				 */
+				Assert(lo_metadata_dumpId);
+
+				loinfo->dobj.dump &= ~(DUMP_COMPONENT_DATA | DUMP_COMPONENT_ACL | DUMP_COMPONENT_DEFINITION);
+
+				/*
+				 * Mark the large object as dependent on
+				 * pg_largeobject_metadata so that any large object
+				 * comments/seclables are dumped after it.
+				 */
+				loinfo->dobj.dependencies = (DumpId *) pg_malloc(sizeof(DumpId));
+				loinfo->dobj.dependencies[0] = lo_metadata_dumpId;
+				loinfo->dobj.nDeps = loinfo->dobj.allocDeps = 1;
+			}
+			else
+				loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		}
 
 		/*
 		 * Create a "BLOBS" data item for the group, too. This is just a
@@ -9039,8 +9104,20 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		if (tbinfo->relkind == RELKIND_SEQUENCE)
 			continue;
 
-		/* Don't bother with uninteresting tables, either */
-		if (!tbinfo->interesting)
+		/*
+		 * Don't bother with uninteresting tables, either.  For binary
+		 * upgrades, this is bypassed for pg_largeobject_metadata and
+		 * pg_shdepend so that the columns names are collected for the
+		 * corresponding COPY commands.  Restoring the data for those catalogs
+		 * is faster than restoring the equivalent set of large object
+		 * commands.  We can only do this for upgrades from v12 and newer; in
+		 * older versions, pg_largeobject_metadata was created WITH OIDS, so
+		 * the OID column is hidden and won't be dumped.
+		 */
+		if (!tbinfo->interesting &&
+			!(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			  (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+			   tbinfo->dobj.catId.oid == SharedDependRelationId)))
 			continue;
 
 		/* OK, we need info for this table */
@@ -9244,7 +9321,10 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			pg_fatal("unrecognized table OID %u", attrelid);
 		/* cross-check that we only got requested tables */
 		if (tbinfo->relkind == RELKIND_SEQUENCE ||
-			!tbinfo->interesting)
+			(!tbinfo->interesting &&
+			 !(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			   (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+				tbinfo->dobj.catId.oid == SharedDependRelationId))))
 			pg_fatal("unexpected column data for table \"%s\"",
 					 tbinfo->dobj.name);
 
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 2485d8f360e..d8330e2bd17 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1087,6 +1087,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_owner => 1,
 			schema_only => 1,
@@ -1605,6 +1606,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			schema_only => 1,
 			schema_only_with_statistics => 1,
 			no_large_objects => 1,
@@ -4612,9 +4614,9 @@ my %tests = (
 			no_schema => 1,
 			section_data => 1,
 			test_schema_plus_large_objects => 1,
-			binary_upgrade => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_privs => 1,
 			schema_only => 1,
-- 
2.39.5 (Apple Git-154)

#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#43)
Re: Horribly slow pg_upgrade performance with many Large Objects

Nathan Bossart <nathandbossart@gmail.com> writes:

Here is what I have staged for commit, which (barring feedback or
objections) I am planning to do towards the end of the week.

Is it intentional that this does

+#include "catalog/pg_largeobject_metadata.h"
+#include "catalog/pg_shdepend.h"

rather than including the corresponding *_d.h headers?
If so, why? Our normal coding convention is that frontend
code should only include catalog *_d.h files, since the main
headers might contain frontend-unfriendly declarations.
If there is something we need to expose in these catalogs'
*_d.h headers, we should probably do that.

regards, tom lane

#45Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#44)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Mon, Jul 14, 2025 at 01:28:03PM -0400, Tom Lane wrote:

Is it intentional that this does

+#include "catalog/pg_largeobject_metadata.h"
+#include "catalog/pg_shdepend.h"

rather than including the corresponding *_d.h headers?

Nope, that was an oversight.

--
nathan

Attachments:

v5-0001-pg_upgrade-Use-COPY-for-large-object-metadata.patch (text/plain; charset=us-ascii)
From 99551525801a462ca7f1beaa0096890df8d67c0a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 14 Jul 2025 12:47:14 -0500
Subject: [PATCH v5 1/1] pg_upgrade: Use COPY for large object metadata.

Presently, pg_dump generates commands like

    SELECT pg_catalog.lo_create('5432');
    ALTER LARGE OBJECT 5432 OWNER TO alice;
    GRANT SELECT ON LARGE OBJECT 5432 TO bob;

for each large object.  This is particularly slow at restore time,
especially when there are tens or hundreds of millions of large
objects.  From reports and personal experience, such slow restores
are primarily encountered at pg_upgrade time.  This commit teaches
pg_dump to instead dump pg_largeobject_metadata and the
corresponding pg_shdepend rows when in binary upgrade mode, i.e.,
pg_dump now generates commands like

    COPY pg_catalog.pg_largeobject_metadata (oid, lomowner, lomacl) FROM stdin;
    5432	16384	{alice=rw/alice,bob=r/alice}
    \.

    COPY pg_catalog.pg_shdepend (dbid, classid, objid, objsubid, refclassid, refobjid, deptype) FROM stdin;
    5	2613	5432	0	1260	16384	o
    5	2613	5432	0	1260	16385	a
    \.

Testing indicates the COPY approach can be significantly faster.
To do any better, we'd probably need to find a way to copy/link
pg_largeobject_metadata's files during pg_upgrade, which would be
complicated and limited to upgrades from >= v16 (since commit
7b378237aa changed the storage format for aclitem, which is used
for pg_largeobject_metadata.lomacl).

Note that this change only applies to binary upgrade mode (i.e.,
dumps initiated by pg_upgrade) since it inserts rows directly into
catalogs, bypassing the ordinary privilege checks.  Also, this
optimization can only be used for upgrades from >= v12 because
pg_largeobject_metadata was created WITH OIDS in older versions,
which prevents pg_dump from handling pg_largeobject_metadata.oid
properly.  With some extra effort, it might be possible to support
upgrades from older versions, but the added complexity didn't seem
worth it to support versions that will have been out-of-support for
nearly 3 years by the time this change is released.

The astute hacker may remember that prior to v12, pg_upgrade
copied/linked pg_largeobject_metadata's files (see commit
12a53c732c).  Besides the aforementioned storage format issues,
this approach failed to transfer the relevant pg_shdepend rows, and
pg_dump still had to generate an lo_create() command per large
object so that creating the dependent comments and security labels
worked.  We could perhaps adopt a hybrid approach for upgrades from
v16 and newer (i.e., generate lo_create() commands for each large
object, copy/link pg_largeobject_metadata's files, and COPY the
relevant pg_shdepend rows), but further testing is needed.

Reported-by: Hannu Krosing <hannuk@google.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Hannu Krosing <hannuk@google.com>
Reviewed-by: Nitin Motiani <nitinmotiani@google.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMT0RQSS-6qLH%2BzYsOeUbAYhop3wmQTkNmQpo5--QRDUR%2BqYmQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_backup_archiver.c | 15 +++++
 src/bin/pg_dump/pg_dump.c            | 90 ++++++++++++++++++++++++++--
 src/bin/pg_dump/t/002_pg_dump.pl     |  4 +-
 3 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 197c1295d93..30e0da31aa3 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -31,6 +31,8 @@
 #endif
 
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_largeobject_metadata_d.h"
+#include "catalog/pg_shdepend_d.h"
 #include "common/string.h"
 #include "compress_io.h"
 #include "dumputils.h"
@@ -2974,6 +2976,19 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	int			res = REQ_SCHEMA | REQ_DATA;
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (ropt->binary_upgrade && AH->public.remoteVersion >= 120000 &&
+		strcmp(te->desc, "TABLE DATA") == 0 &&
+		(te->catalogId.oid == LargeObjectMetadataRelationId ||
+		 te->catalogId.oid == SharedDependRelationId))
+		return REQ_DATA;
+
 	/* These items are treated specially */
 	if (strcmp(te->desc, "ENCODING") == 0 ||
 		strcmp(te->desc, "STDSTRINGS") == 0 ||
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 1937997ea67..e1e4710e913 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -49,8 +49,10 @@
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_default_acl_d.h"
 #include "catalog/pg_largeobject_d.h"
+#include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
 #include "catalog/pg_publication_d.h"
+#include "catalog/pg_shdepend_d.h"
 #include "catalog/pg_subscription_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -209,6 +211,12 @@ static int	nbinaryUpgradeClassOids = 0;
 static SequenceItem *sequences = NULL;
 static int	nsequences = 0;
 
+/*
+ * For binary upgrade, the dump ID of pg_largeobject_metadata is saved for use
+ * as a dependency for pg_shdepend and any large object comments/seclabels.
+ */
+static DumpId lo_metadata_dumpId;
+
 /* Maximum number of relations to fetch in a fetchAttributeStats() call. */
 #define MAX_ATTR_STATS_RELS 64
 
@@ -1085,6 +1093,36 @@ main(int argc, char **argv)
 	if (!dopt.dumpData && dopt.sequence_data)
 		getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (dopt.binary_upgrade && fout->remoteVersion >= 120000)
+	{
+		TableInfo  *lo_metadata = findTableByOid(LargeObjectMetadataRelationId);
+		TableInfo  *shdepend = findTableByOid(SharedDependRelationId);
+
+		makeTableDataInfo(&dopt, lo_metadata);
+		makeTableDataInfo(&dopt, shdepend);
+
+		/*
+		 * Save pg_largeobject_metadata's dump ID for use as a dependency on
+		 * pg_shdepend and any large object comments/seclabels.
+		 */
+		lo_metadata_dumpId = lo_metadata->dataObj->dobj.dumpId;
+		addObjectDependency(&shdepend->dataObj->dobj, lo_metadata_dumpId);
+
+		/*
+		 * Only dump large object shdepend rows for this database.
+		 */
+		shdepend->dataObj->filtercond = "WHERE classid = 'pg_largeobject'::regclass "
+			"AND dbid = (SELECT oid FROM pg_database "
+			"            WHERE datname = current_database())";
+	}
+
 	/*
 	 * In binary-upgrade mode, we do not have to worry about the actual LO
 	 * data or the associated metadata that resides in the pg_largeobject and
@@ -3924,10 +3962,37 @@ getLOs(Archive *fout)
 		 * as it will be copied by pg_upgrade, which simply copies the
 		 * pg_largeobject table. We *do* however dump out anything but the
 		 * data, as pg_upgrade copies just pg_largeobject, but not
-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.  (On
+		 * <v12, pg_largeobject_metadata was created WITH OIDS, so the OID
+		 * column is hidden and won't be dumped.)
 		 */
 		if (dopt->binary_upgrade)
-			loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		{
+			if (fout->remoteVersion >= 120000)
+			{
+				/*
+				 * We should've saved pg_largeobject_metadata's dump ID before
+				 * this point.
+				 */
+				Assert(lo_metadata_dumpId);
+
+				loinfo->dobj.dump &= ~(DUMP_COMPONENT_DATA | DUMP_COMPONENT_ACL | DUMP_COMPONENT_DEFINITION);
+
+				/*
+				 * Mark the large object as dependent on
+				 * pg_largeobject_metadata so that any large object
+				 * comments/seclables are dumped after it.
+				 */
+				loinfo->dobj.dependencies = (DumpId *) pg_malloc(sizeof(DumpId));
+				loinfo->dobj.dependencies[0] = lo_metadata_dumpId;
+				loinfo->dobj.nDeps = loinfo->dobj.allocDeps = 1;
+			}
+			else
+				loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		}
 
 		/*
 		 * Create a "BLOBS" data item for the group, too. This is just a
@@ -9039,8 +9104,20 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		if (tbinfo->relkind == RELKIND_SEQUENCE)
 			continue;
 
-		/* Don't bother with uninteresting tables, either */
-		if (!tbinfo->interesting)
+		/*
+		 * Don't bother with uninteresting tables, either.  For binary
+		 * upgrades, this is bypassed for pg_largeobject_metadata and
+		 * pg_shdepend so that the columns names are collected for the
+		 * corresponding COPY commands.  Restoring the data for those catalogs
+		 * is faster than restoring the equivalent set of large object
+		 * commands.  We can only do this for upgrades from v12 and newer; in
+		 * older versions, pg_largeobject_metadata was created WITH OIDS, so
+		 * the OID column is hidden and won't be dumped.
+		 */
+		if (!tbinfo->interesting &&
+			!(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			  (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+			   tbinfo->dobj.catId.oid == SharedDependRelationId)))
 			continue;
 
 		/* OK, we need info for this table */
@@ -9244,7 +9321,10 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			pg_fatal("unrecognized table OID %u", attrelid);
 		/* cross-check that we only got requested tables */
 		if (tbinfo->relkind == RELKIND_SEQUENCE ||
-			!tbinfo->interesting)
+			(!tbinfo->interesting &&
+			 !(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			   (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+				tbinfo->dobj.catId.oid == SharedDependRelationId))))
 			pg_fatal("unexpected column data for table \"%s\"",
 					 tbinfo->dobj.name);
 
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 2485d8f360e..d8330e2bd17 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1087,6 +1087,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_owner => 1,
 			schema_only => 1,
@@ -1605,6 +1606,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			schema_only => 1,
 			schema_only_with_statistics => 1,
 			no_large_objects => 1,
@@ -4612,9 +4614,9 @@ my %tests = (
 			no_schema => 1,
 			section_data => 1,
 			test_schema_plus_large_objects => 1,
-			binary_upgrade => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_privs => 1,
 			schema_only => 1,
-- 
2.39.5 (Apple Git-154)

#46Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#45)
Re: Horribly slow pg_upgrade performance with many Large Objects

Committed.

I'm cautiously optimistic that we can find some better gains for upgrades
from v16 and newer. That would involve dumping lo_create() commands for
all LOs with comments/seclabels, dumping the relevant pg_shdepend rows, and
then copying/linking the pg_largeobject_metadata files like we did prior to
v12.
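
For illustration only, a sketch of what such a hybrid dump might emit (not
what the committed patch does; the OIDs and role OIDs reuse the examples
from the commit message):

    -- lo_create() only for large objects that have a comment or security
    -- label, so the dependent objects can still be restored
    SELECT pg_catalog.lo_create('5432');
    COMMENT ON LARGE OBJECT 5432 IS 'example comment';

    -- pg_shdepend rows for this database, as in the committed patch
    -- (the values are tab-separated in the real COPY data)
    COPY pg_catalog.pg_shdepend (dbid, classid, objid, objsubid, refclassid, refobjid, deptype) FROM stdin;
    5	2613	5432	0	1260	16384	o
    5	2613	5432	0	1260	16385	a
    \.

pg_largeobject_metadata itself would then be copied or linked at the file
level by pg_upgrade, as was done prior to v12, instead of being restored
through the dump.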

--
nathan

#47Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#46)
Re: Horribly slow pg_upgrade performance with many Large Objects

Do you think the current patch could be backported to at least some of the
latest versions?


On Fri, Jul 18, 2025 at 6:05 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

Committed.

I'm cautiously optimistic that we can find some better gains for upgrades
from v16 and newer. That would involve dumping lo_create() commands for
all LOs with comments/seclabels, dumping the relevant pg_shdepend rows, and
then copying/linking the pg_largeobject_metadata files like we did prior to
v12.

--
nathan

#48Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#47)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Mon, Jul 21, 2025 at 02:03:45AM +0200, Hannu Krosing wrote:

Do you think the current patch could be backported to at least some
latest versions ?

I think that's pretty unlikely. It'd be a pretty big departure from our
versioning policy. In the past, we have back-patched "critical"
performance fixes, but that involved changes that had already been in
released versions for years.

--
nathan