Horribly slow pg_upgrade performance with many Large Objects

Started by Hannu Krosing · 9 months ago · 48 messages
#1 Hannu Krosing
hannuk@google.com

Hi Hackers

## The issue

I have now met a not insignificant number of cases where pg_upgrade
performance is really bad when the database has a large number of
Large Objects.

The average time to `pg_dump --binary-upgrade --format=custom ...` a
database and then `pg_restore ...` it back is 1 minute per 1 million
LOs.

This does not seem too bad until you realize that it is
- 100 minutes, or 1h 40min, for 100M LOs
- 16h 40min for 1 billion large objects
- 66h 40min, or 2 days 18 hours and 40 minutes, for those unfortunate
enough to have the full 4 billion LOs

My direct tests took 10m5s for 10M LOs and 100m54s for 100M.

Also a smaller server just runs out of shared memory when restoring 100M LOs.

I know that our semi-official recommendation for using Large Objects
is "Don't!", but people still use them and sometimes they have a lot
of them. One of the reasons seems to be an older version of Hibernate
which converted any column of type CLOB into a Large Object, so we see
cases where the average LO size is a few hundred to a few thousand bytes.

I read through the old discussions from around the time of the v12
release, and the consensus at that time seemed to be that dumping the
LOs without data was OK "because there will not be too many LOs", which
unfortunately has turned out to be wishful thinking, as there are lots
of tiny Large Objects out there and these really slow down pg_upgrade.

## So what to do about this?

The obvious solution would be to handle the table
`pg_largeobject_metadata` the same way as we currently handle
`pg_largeobject`, by not doing anything with it in `pg_dump
--binary-upgrade` and just handling its contents like we do for user
tables in pg_upgrade itself.

This should work fine for all source database versions starting from PgSQL v12.

For the older supported versions 9.2 to 11, where `oid` was a system
column, we should just dump the three fields of
`pg_largeobject_metadata` directly, so that the dump command would be
`COPY (SELECT oid, lomowner, lomacl) TO stdout`, and then restore would
happen more or less automatically.
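
A minimal sketch of that, assuming the pre-v12 layout on the source and
loading into the v12+ user-column layout on the target, would be
something like:

COPY (SELECT oid, lomowner, lomacl FROM pg_largeobject_metadata) TO STDOUT;
-- and on the target side:
COPY pg_largeobject_metadata (oid, lomowner, lomacl) FROM STDIN;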

Or we could just modify `pg_restore` so that it restores LOs using
COPY instead of doing the current dance of three separate SELECTs from
lo_xxx() functions.

`pg_dump --binary-upgrade --format=custom` is relatively fast for LOs -
1m58s for 100M objects -

though still much slower than `pg_dump --data-only -t
pg_largeobject_metadata`, which takes 20 seconds, or 23 when also run
through gzip.

## Preferred solution for --binary-upgrade

- link/copy data files when source is v12+
- dump the triplet of pg_largeobject_metadata columns if source is v9.2 - v11

An acceptable, faster-to-implement solution would be to just always dump
pg_largeobject_metadata (oid, lomowner, lomacl), as this would not
need any changes to `pg_upgrade`, just to `pg_dump --binary-upgrade`.
It would not be as fast as linking/copying, but it would be several
orders of magnitude faster than the current behaviour.

Does anyone have any objections to this?

Would anyone be willing to take up fixing this?

Or perhaps advise where in the code the change should be done?

I started looking at this, and at a high level it looks like I need to
create a task to copy out specific data columns rather than the
structure of pg_largeobject_metadata, but as pg_dump.c is 582,673 bytes
it would take some time to understand where exactly to do this and all
the possible interactions with flags like --binary-upgrade and specific
formats.

---
Cheers,
Hannu

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#1)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hannu Krosing <hannuk@google.com> writes:

I have now met a not insignificant number of cases where pg_upgrade
performance is really bad when the database has a large number of
Large Objects.

What version are you testing? We did some work in that area in the
v17 cycle (a45c78e32).

regards, tom lane

#3 Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#1)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Mon, Apr 07, 2025 at 10:33:47PM +0200, Hannu Krosing wrote:

The obvious solution would be to handle the table
`pg_largeobject_metadata` the same way as we currently handle
`pg_largeobject `by not doing anything with it in `pg_dump
--binary-upgrade` and just handle the contents it like we do for user
tables in pg_upgrade itself.

This should work fine for all source database versions starting from PgSQL v12.

Unfortunately, the storage format for aclitem changed in v16, so this would
need to be restricted to upgrades from v16 and newer. That being said, I
regularly hear about slow upgrades with many LOs, so I think it'd be
worthwhile to try to improve matters in v19.

--
nathan

#4 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#2)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Mon, Apr 07, 2025 at 05:25:32PM -0400, Tom Lane wrote:

What version are you testing? We did some work in that area in the
v17 cycle (a45c78e32).

I am puzzled by the target version used here, as well. If there is
more that can be improved, v19 would be the version to consider for
future improvements at this stage.
--
Michael

#5 Hannu Krosing
hannuk@google.com
In reply to: Michael Paquier (#4)
Re: Horribly slow pg_upgrade performance with many Large Objects

I was testing on version 17

On Tue, Apr 8, 2025 at 6:52 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Apr 07, 2025 at 05:25:32PM -0400, Tom Lane wrote:

What version are you testing? We did some work in that area in the
v17 cycle (a45c78e32).

I am puzzled by the target version used here, as well.

I was testing on version 17

Here is how you can easily test this too (since --binary-upgrade does
not dump the actual data, it is OK for the test to not put anything there):

hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5433 lodb
hannuk@db01-c1a:~/work/lo-testing$ psql -p 5433 lodb
psql (17.4 (Ubuntu 17.4-1.pgdg22.04+2))
Type "help" for help.

lodb=# insert into pg_largeobject_metadata(oid, lomowner) SELECT i,
16384 FROM generate_series(1, 100_000_000) g(i);
INSERT 0 100000000
Time: 162414.216 ms (02:42.414)
lodb=#
\q

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
pg_largeobject_metadata -p 5433 lodb | gzip >
pg_largeobject_metadata.data.gz
real 0m22.094s
user 0m20.741s
sys 0m2.085s

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
pg_largeobject_metadata --format=custom -p 5433 lodb -f
pg_largeobject_metadata.dump
real 0m20.226s
user 0m18.068s
sys 0m0.824s

If there is
more that can be improved, v19 would be the version to consider for
future improvements at this stage.

If the internal format has changed in 16, the correct way would be to
go through the data-only dump of pg_largeobject_metadata in all cases.
Even for the 100M case you get the restore in 2 minutes instead
of 100 minutes:

hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5434 lodb
hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb
pg_largeobject_metadata.dump

real 2m2.277s
user 0m2.594s
sys 0m0.549s

And even in the case of the user-visible change in the acl format, it is
most likely that changing the visible format using some regexp magic,
or even a dedicated function, would still be much faster than creating
all the LOs through creation commands.

------
The commands I used to do the pg_upgrade-like test were

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb100m.dump -p 5433 lodb
real 1m58.241s
user 0m35.229s
sys 0m17.854s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb lodb100m.dump
real 100m54.878s
user 3m23.885s
sys 20m33.761s

(I left out the --verbose part that pg_upgrade also sets, as I did not
want to get 100M lines of "large object created" messages.)

Also, the postgres server at -p 5434 needs to be started with the -b flag
to accept loading a dump from --binary-upgrade. In Debian/Ubuntu
this can be passed directly to pg_ctlcluster as follows:

sudo pg_ctlcluster 17 target -o -b

----
Hannu


#6 Hannu Krosing
hannuk@google.com
In reply to: Hannu Krosing (#5)
Re: Horribly slow pg_upgrade performance with many Large Objects

On re-reading, the order looked a bit illogical, so I want to make
clear that the pg_upgrade-like test showing 100 min for 100 million LOs
is at the end of the last message and the proposed solution is at the
beginning.


#7 Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#3)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Apr 07, 2025 at 10:33:47PM +0200, Hannu Krosing wrote:

The obvious solution would be to handle the table
`pg_largeobject_metadata` the same way as we currently handle
`pg_largeobject `by not doing anything with it in `pg_dump
--binary-upgrade` and just handle the contents it like we do for user
tables in pg_upgrade itself.

This should work fine for all source database versions starting from PgSQL v12.

Unfortunately, the storage format for aclitem changed in v16, so this would
need to be restricted to upgrades from v16 and newer.

Have we also changed the external format of aclitem any time since
v9.2, or are the changes just to storage?

If the external formats have been stable, we can still get reasonable
performance by dumping the data (2 min for 100M rows). Plus, dumping
the data would work for all the supported source versions.

The worst case would still be quite bad, with 80+ min for the full set
of 4 billion LOs, but even that would be much better than the 3 days
with the current way.

That being said, I
regularly hear about slow upgrades with many LOs, so I think it'd be
worthwhile to try to improve matters in v19.

Changing the LO export to dumping pg_largeobject_metadata content
instead of creating the LOs should be a nice small change confined to
pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
still.

--
Hannu

#8 Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#7)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:

On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

That being said, I
regularly hear about slow upgrades with many LOs, so I think it'd be
worthwhile to try to improve matters in v19.

Changing the LO export to dumping pg_largeobject_metadata content
instead of creating the LOs should be a nice small change confined to
pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
still.

Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
material at this point.

--
nathan

#9 Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#8)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 8, 2025 at 5:46 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:

On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

That being said, I
regularly hear about slow upgrades with many LOs, so I think it'd be
worthwhile to try to improve matters in v19.

Changing the LO export to dumping pg_largeobject_metadata content
instead of creating the LOs should be a nice small change confined to
pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
still.

Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
material at this point.

Sure. But I actually think that this is something that should be
back-ported to at least all supported versions at some point,
possibly made dependent on some environment flag so that only people
who desperately need it will get it.

Btw, who would be the right person(s) to ask questions about the
internals of pg_dump?
I have a few more things in the pipeline to add there and would like
to make sure that I have the right approach.

------
Hannu

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#8)
Re: Horribly slow pg_upgrade performance with many Large Objects

Nathan Bossart <nathandbossart@gmail.com> writes:

On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:

Changing the LO export to dumping pg_largeobject_metadata content
instead of creating the LOs should be a nice small change confined to
pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
still.

Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
material at this point.

Yeah, even if we had a patch in hand, it's too late for v18. However,
there are additional problems with this idea:

1. The idea requires role OIDs to match across the upgrade.
I don't believe that pg_upgrade tries to preserve role OIDs --- and
doing so would be problematic, because what if the new cluster's
bootstrap superuser is named differently in the old and new clusters?

It might be possible to work around that with some casting to/from
regrole, but I don't think a simple COPY into pg_largeobject_metadata
will play along with that.
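
For illustration only, a rough sketch of the regrole idea would be to
dump the owner by name on the source side:

COPY (SELECT oid, lomowner::regrole, lomacl FROM pg_largeobject_metadata) TO STDOUT;

but since lomowner is an oid column on the target, loading that name
form back would still need an extra name-to-OID mapping step rather
than a plain COPY FROM.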

2. If you just do the equivalent of an INSERT or COPY into
pg_largeobject_metadata, you could create entries that look right,
but they are actually not right because there should be pg_shdepend
entries backing each ownership or permission reference (for non-pinned
roles) and there won't be.

I guess you could think of also manually inserting rows into
pg_shdepend, but (a) ugh and (b) the claimed speedup is kind
of vanishing into the distance at this point.

regards, tom lane

#11 Hannu Krosing
hannuk@google.com
In reply to: Tom Lane (#10)
Re: Horribly slow pg_upgrade performance with many Large Objects

This is what the opening comment in pg_upgrade.c says.

I think we do preserve role oids:

/*
* To simplify the upgrade process, we force certain system values to be
* identical between old and new clusters:
*
* We control all assignments of pg_class.oid (and relfilenode) so toast
* oids are the same between old and new clusters. This is important
* because toast oids are stored as toast pointers in user tables.
*
* While pg_class.oid and pg_class.relfilenode are initially the same in a
* cluster, they can diverge due to CLUSTER, REINDEX, or VACUUM FULL. We
* control assignments of pg_class.relfilenode because we want the filenames
* to match between the old and new cluster.
*
* We control assignment of pg_tablespace.oid because we want the oid to match
* between the old and new cluster.
*
* We control all assignments of pg_type.oid because these oids are stored
* in user composite type values.
*
* We control all assignments of pg_enum.oid because these oids are stored
* in user tables as enum values.
*
* We control all assignments of pg_authid.oid for historical reasons (the
* oids used to be stored in pg_largeobject_metadata, which is now copied via
* SQL commands), that might change at some point in the future.
*/


#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#11)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hannu Krosing <hannuk@google.com> writes:

I think we do preserve role oids

Oh ... I'd been looking for mentions of "role" in
pg_upgrade_support.c, but what I should have looked for was
"pg_authid". So yeah, we do preserve role OIDs, and maybe that's
enough to make this workable, at least with source versions that
share the same rules for what goes into pg_largeobject_metadata and
pg_shdepend. It's not something I'd risk back-patching though.

regards, tom lane

#13 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#12)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 12:37:43PM -0400, Tom Lane wrote:

Hannu Krosing <hannuk@google.com> writes:

I think we do preserve role oids

Oh ... I'd been looking for mentions of "role" in
pg_upgrade_support.c, but what I should have looked for was
"pg_authid". So yeah, we do preserve role OIDs, and maybe that's
enough to make this workable, at least with source versions that
share the same rules for what goes into pg_largeobject_metadata and
pg_shdepend. It's not something I'd risk back-patching though.

I do think it's worth considering going back to copying
pg_largeobject_metadata's files for upgrades from v16 and newer. That
sounds restrictive at the moment, but it'll mean that all but one supported
major version can copy the files during upgrade to v19. I'll admit I'm a
tad worried about having to go back to copying via SQL commands in the
future and re-regressing things (leading to unpredictable differences in
upgrade downtime), but I'm not sure that's a great reason to withhold this
optimization.

Of course, I wouldn't be opposed to optimizing the SQL command strategy,
too, but I suspect that won't compare to copying the files.

--
nathan

#14 Hannu Krosing
hannuk@google.com
In reply to: Tom Lane (#12)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 8, 2025 at 6:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hannu Krosing <hannuk@google.com> writes:

I think we do preserve role oids

Oh ... I'd been looking for mentions of "role" in
pg_upgrade_support.c, but what I should have looked for was
"pg_authid". So yeah, we do preserve role OIDs, and maybe that's
enough to make this workable, at least with source versions that
share the same rules for what goes into pg_largeobject_metadata and
pg_shdepend. It's not something I'd risk back-patching though.

The risk is why I suggest having this in backports as a last-resort
option gated on an environment variable.

I have been forced to do this half-manually a few times to make
pg_upgrade feasible at all.

The current, really UGH, workaround is to use pg_repack's support for
swapping the relfilenodes of pg_largeobject_metadata with a user table
before and after the upgrade and then manually patching up leftovers
like pg_shdepend after pg_upgrade.

These have fortunately been cases where there were no other
dependencies like comments or security labels on LOs, but those too
should be doable.


#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#13)
Re: Horribly slow pg_upgrade performance with many Large Objects

Nathan Bossart <nathandbossart@gmail.com> writes:

I do think it's worth considering going back to copying
pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16. I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time. The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

and some appropriate COPY data for pg_shdepend too.

(The fact that this representation will contain both numeric and
symbolic role OIDs is why I was concerned about OID stability.)

The key thing that worries me here is if the source and target
versions have different ideas of which roles are pinned, which would
silently change what appears in pg_shdepend. But it'd only really
break if a role mentioned in some LO's owner or ACL is pinned in the
source and not in the target, which seems unlikely. (In the other
direction, we'd just be adding a useless row in pg_shdepend.)

regards, tom lane

#16 Hannu Krosing
hannuk@google.com
In reply to: Tom Lane (#15)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 8, 2025 at 7:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I do think it's worth considering going back to copying
pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16. I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time. The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

Also note that in my --binary-upgrade tests the 100 min / 100M objects
ratio was for a case with no grants. I would expect this to grow to at
least 120 to 150 minutes when grants are also involved.

In the COPY case I would expect the presence of grants not to make much difference.

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

--
Hannu

#17 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#15)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I do think it's worth considering going back to copying
pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16. I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time. The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

I was thinking of actually reverting commit 12a53c7 for upgrades from v16,
which AFAICT is the last release where any relevant storage formats changed
(aclitem changed in v16). But if COPY gets us pretty close to that and is
less likely to be disrupted by future changes, it could be a better
long-term approach.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

and some appropriate COPY data for pg_shdepend too.

Unless I'm missing something, we don't seem to have had any dependency
handling before commit 12a53c7. Was that broken before we moved to SQL
commands?

--
nathan

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#16)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hannu Krosing <hannuk@google.com> writes:

In copy case I would expect the presence of grants to not make much difference.

aclitemin is slower than a lot of other datatype input functions,
but it's still got to be faster than a GRANT.

regards, tom lane

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#16)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
does *not* preserve database OIDs, which is problematic for using
COPY to load pg_shdepend rows.

regards, tom lane

#20 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#19)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:

Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
does *not* preserve database OIDs, which is problematic for using
COPY to load pg_shdepend rows.

I think it does; see commit aa01051.

--
nathan

#21 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#17)
Re: Horribly slow pg_upgrade performance with many Large Objects

Nathan Bossart <nathandbossart@gmail.com> writes:

Unless I'm missing something, we don't seem to have had any dependency
handling before commit 12a53c7. Was that broken before we moved to SQL
commands?

Sounds like it :-(

regards, tom lane

#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#20)
Re: Horribly slow pg_upgrade performance with many Large Objects

Nathan Bossart <nathandbossart@gmail.com> writes:

On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:

Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
does *not* preserve database OIDs, which is problematic for using
COPY to load pg_shdepend rows.

I think it does; see commit aa01051.

Ah --- I thought I remembered something having been done about that,
but I failed to find it because I was looking in pg_upgrade not
pg_dump. Too bad aa01051 didn't update the comment at the top of
pg_upgrade.c.

regards, tom lane

#23 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#21)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 01:42:20PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

Unless I'm missing something, we don't seem to have had any dependency
handling before commit 12a53c7. Was that broken before we moved to SQL
commands?

Sounds like it :-(

Huh. Sure enough, it seems to be lost during an upgrade from 9.6 to 10.

v9.6:

postgres=# select lo_from_bytea(1234, '1234');
lo_from_bytea
---------------
1234
(1 row)

postgres=# create role bob;
CREATE ROLE
postgres=# grant select on large object 1234 to bob;
GRANT
postgres=# drop role bob;
ERROR: role "bob" cannot be dropped because some objects depend on it
DETAIL: privileges for large object 1234

v10 (upgraded from v9.6):

postgres=# select lo_get(1234);
lo_get
------------
\x31323334
(1 row)

postgres=# drop role bob;
DROP ROLE

If I then try to upgrade that database to v17, it fails like this:

pg_restore: from TOC entry 2422; 0 0 ACL LARGE OBJECT 1234 nathan
pg_restore: error: could not execute query: ERROR: role "16384" does not exist
Command was: GRANT SELECT ON LARGE OBJECT 1234 TO "16384";

I've also verified that the dependency information is carried over in
upgrades to later versions (AFAICT all the supported ones).

--
nathan

#24 Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#23)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

...

I've also verified that the dependency information is carried over in
upgrades to later versions (AFAICT all the supported ones).

If I remember correctly, the change from copying the
pg_largeobject_metadata data file to moving LOs as part of the schema
was made in v12, when oid, which had been a system column in v11,
became a user column, so upgrades to v11 are likely also missing the
dependencies.

#25 Jan Wieck
jan@wi3ck.info
In reply to: Hannu Krosing (#24)
Re: Horribly slow pg_upgrade performance with many Large Objects

On 4/8/25 15:41, Hannu Krosing wrote:

On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

...

I've also verified that the dependency information is carried over in
upgrades to later versions (AFAICT all the supported ones).

If I remember correctly the change to not copying
pg_largeobject_metadata data file but instead moving LOs as part of
schema was done in v12 when oid,, which had been a system column in
v11, became a user column, so upgrade to v11 is likely also missing
the dependencies

I remember an incident where a large number of LOs ran pg_upgrade into a
transaction-ID wraparound because the restore part would create an
individual single-statement transaction per LO to create it, then change
permissions and ownership, and finally fill in the data. Could that be
related here?

Regards, Jan

#26 Nathan Bossart
nathandbossart@gmail.com
In reply to: Jan Wieck (#25)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 05:37:50PM -0400, Jan Wieck wrote:

I remember an incident where large amounts of LOs ran pg_upgrade into a
transaction-ID wrap around because the restore part would create individual
single statement transactions per LO to create, then change permissions and
ownership and finally fill in the data. Could that be related here?

I believe commits 74cf7d4 and 959b38d have largely fixed that problem.

--
nathan

#27 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#22)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 01:51:22PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:

Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
does *not* preserve database OIDs, which is problematic for using
COPY to load pg_shdepend rows.

I think it does; see commit aa01051.

Ah --- I thought I remembered something having been done about that,
but I failed to find it because I was looking in pg_upgrade not
pg_dump. Too bad aa01051 didn't update the comment at the top of
pg_upgrade.c.

I'll apply the attached patch to fix the comment shortly.

--
nathan

Attachments:

v1-0001-Mention-that-we-preserve-database-OIDs-in-pg_upgr.patch (text/plain; charset=us-ascii)
From 0ca95279bde8f3ce069a749253eba7c04988003e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 9 Apr 2025 12:12:27 -0500
Subject: [PATCH v1 1/1] Mention that we preserve database OIDs in pg_upgrade.c
 comment.

Oversight in commit aa01051418.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/4055696.1744134682%40sss.pgh.pa.us
---
 src/bin/pg_upgrade/pg_upgrade.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 72b5a5fde5a..d9db48dba0d 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -32,6 +32,9 @@
  *	We control all assignments of pg_authid.oid for historical reasons (the
  *	oids used to be stored in pg_largeobject_metadata, which is now copied via
  *	SQL commands), that might change at some point in the future.
+ *
+ *	We control all assignments of pg_database.oid because we want the directory
+ *	names to match between the old and new cluster.
  */
 
 
-- 
2.39.5 (Apple Git-154)

#28 Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#24)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 09:41:06PM +0200, Hannu Krosing wrote:

On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

I've also verified that the dependency information is carried over in
upgrades to later versions (AFAICT all the supported ones).

If I remember correctly the change to not copying
pg_largeobject_metadata data file but instead moving LOs as part of
schema was done in v12 when oid,, which had been a system column in
v11, became a user column, so upgrade to v11 is likely also missing
the dependencies

Right.

--
nathan

#29 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#17)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 12:22:00PM -0500, Nathan Bossart wrote:

On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I do think it's worth considering going back to copying
pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16. I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time. The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

I was thinking of actually reverting commit 12a53c7 for upgrades from v16,
which AFAICT is the last release where any relevant storage formats changed
(aclitem changed in v16). But if COPY gets us pretty close to that and is
less likely to be disrupted by future changes, it could be a better
long-term approach.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

and some appropriate COPY data for pg_shdepend too.

I did some more research here. For many large objects without ACLs to
dump, I noticed that the vast majority of time is going to restoring the
ALTER OWNER commands. For 1 million such large objects, restoring took ~73
seconds on my machine. If I instead invented an lo_create_with_owner()
function and created 100 per SELECT command, the same restore takes ~7
seconds. Copying the relevant pg_shdepend rows out and back in takes ~2.5
seconds. I imagine using COPY for pg_largeobject_metadata would also take
a couple of seconds in this case.
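
As a rough sketch, that pg_shdepend round trip looks something like the
following (the proof-of-concept patch later in this thread does
essentially this through a server-side file; the path here is only
illustrative):

COPY (SELECT * FROM pg_shdepend
      WHERE classid = 'pg_largeobject'::regclass) TO '/tmp/lo_shdep.out';
-- and on the new cluster:
COPY pg_shdepend FROM '/tmp/lo_shdep.out';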

For upgrading, I don't think there's any huge benefit to optimizing the
restore commands versus using COPY. It might make future catalog changes
for large object stuff easier, but I'd expect those to be rare. However,
the optimized restore commands could be nice for non-pg_upgrade use-cases.
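
For reference, the batching idea mentioned above would look roughly like
this; lo_create_with_owner() is purely hypothetical and does not exist
in core today:

SELECT pg_catalog.lo_create_with_owner(loid, ownerid)  -- hypothetical helper
FROM (VALUES (2121::oid, 10::oid),
             (2122::oid, 10::oid)) AS b(loid, ownerid); -- batched, ~100 rows per statement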

--
nathan

#30 Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#29)
Re: Horribly slow pg_upgrade performance with many Large Objects

And in case there *is* an ACL present, each user mentioned in the ACL
adds more overhead.

Also, the separate GRANT calls cause bloat, as the
pg_largeobject_metadata row gets updated for each ALTER ... OWNER or GRANT.

The following is for 10 million LOs with 1 and 3 users being GRANTed
SELECT on each object (with no grants the pg_restore run was 10
minutes)

Nr of GRANTS | pg_dump time | pg_restore time
-------------+--------------+----------------
           0 |       0m 10s |          10m 5s
           1 |       0m 17s |          15m 3s
           3 |       0m 21s |         27m 15s

NB! - I left out the --verbose flag from pg_dump as used by
pg_upgrade, as it will emit one line per LO dumped

## 1 GRANT / LO

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb10m.dump -p 5433 lodb10m
real 0m17.022s
user 0m2.956s
sys 0m1.453s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb10m lodb10m.dump
real 15m3.136s
user 0m28.991s
sys 2m54.164s

## 3 GRANTs / LO

make sample LO with 3 grants

ALTER LARGE OBJECT 1 OWNER TO "hannuk";
GRANT SELECT ON LARGE OBJECT 1 TO "bob";
GRANT SELECT ON LARGE OBJECT 1 TO "joe";
GRANT SELECT ON LARGE OBJECT 1 TO "tom";
lodb10m=# select * from pg_shdepend where objid = 1;

┌───────┬─────────┬───────┬──────────┬────────────┬──────────┬─────────┐
│ dbid │ classid │ objid │ objsubid │ refclassid │ refobjid │ deptype │
├───────┼─────────┼───────┼──────────┼────────────┼──────────┼─────────┤
│ 16406 │ 2613 │ 1 │ 0 │ 1260 │ 16384 │ o │
│ 16406 │ 2613 │ 1 │ 0 │ 1260 │ 16393 │ a │
│ 16406 │ 2613 │ 1 │ 0 │ 1260 │ 16394 │ a │
│ 16406 │ 2613 │ 1 │ 0 │ 1260 │ 16395 │ a │
└───────┴─────────┴───────┴──────────┴────────────┴──────────┴─────────┘

lodb10m=# select * from pg_largeobject_metadata ;
┌─────┬──────────┬───────────────────────────────────────────────────────────┐
│ oid │ lomowner │ lomacl │
├─────┼──────────┼───────────────────────────────────────────────────────────┤
│ 1 │ 16384 │ {hannuk=rw/hannuk,bob=r/hannuk,joe=r/hannuk,tom=r/hannuk} │
└─────┴──────────┴───────────────────────────────────────────────────────────┘

Make the remaining 10M-1 LOs

lodb10m=# insert into pg_largeobject_metadata(oid, lomowner, lomacl)
SELECT i, 16384,
'{hannuk=rw/hannuk,bob=r/hannuk,joe=r/hannuk,tom=r/hannuk}' FROM
generate_series(2, 10_000_000) g(i);
INSERT 0 9999999
Time: 18859.341 ms (00:18.859)

And add their sharedeps

lodb10m=# WITH refdeps (robj, rdeptype)
AS ( VALUES
(16384, 'o'),
(16393, 'a'),
(16394, 'a'),
(16395, 'a')
)
INSERT INTO pg_shdepend SELECT 16396, 2613, i, 0, 1260, robj, rdeptype
FROM generate_series(2, 10_000_000) g(i)
, refdeps
;
INSERT 0 39999996
Time: 116697.342 ms (01:56.697)

Time pg_upgrade's pg_dump and pg_restore:

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb10m-3grants.dump -p 5433 lodb10m
real 0m21.519s
user 0m2.951s
sys 0m1.723s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb10m
lodb10m-3grants.dump
real 27m15.372s
user 0m45.157s
sys 4m57.513s


#31 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#15)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I do think it's worth considering going back to copying
pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16. I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time. The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

and some appropriate COPY data for pg_shdepend too.

Attached is a proof-of-concept grade patch for using COPY for
pg_largeobject_metadata and the relevant pg_shdepend entries. On my
laptop, pg_upgrade with 10M LOs (each with a non-bootstrap-superuser owner
and another role with SELECT rights) goes from ~8.5 minutes to ~1 minute
with this patch.

I originally set out to invent a new lo_create_with_owner() function and
teach pg_dump to batch those together in large groups, but as I started the
required pg_dump surgery, I was quickly scared away by the complexity.
Next, I gave COPY a try. The improvements from using COPY will likely be
limited to the pg_upgrade case, but that's the only case I regularly hear
complaints about for zillions of large objects, so maybe it's good enough
for now.

For the COPY approach, I modified pg_dump to dump the contents of
pg_largeobject_metadata. This is easy enough, but I ran into problems with
the dependent comments and security labels. It turns out that even before
v12, we run all the lo_create() commands just so that creating the comments
and security labels works (AFAICT). So I suspect upgrading with many large
objects has always been slow. The comment/security label dependency issue
can be fixed (at least well enough for the tests) by moving
PRIO_LARGE_OBJECT below PRIO_TABLE_DATA. There might be an existing issue
here, because dbObjectTypePriorities has the following comment:

* NOTE: object-type priorities must match the section assignments made in
* pg_dump.c; that is, PRE_DATA objects must sort before DO_PRE_DATA_BOUNDARY,
* POST_DATA objects must sort after DO_POST_DATA_BOUNDARY, and DATA objects
* must sort between them.

But dumpLO() puts large objects in SECTION_DATA, and PRIO_LARGE_OBJECT is
before PRIO_PRE_DATA_BOUNDARY. I admittedly haven't spent too much time
investigating this, though. In any case, it might be a good idea to also
make sure we explicitly mark the large objects and their comments/seclabels
as dependent on the pg_largeobject_metadata data.

That leaves pg_shdepend. For now, I've just instructed pg_upgrade to COPY
the relevant pg_shdepend rows as an independent step, but perhaps there's a
reasonably straightforward way to put that in pg_dump, too.

--
nathan

Attachments:

v1-0001-pg_upgrade-Use-COPY-for-large-object-metadata.patch (text/plain; charset=us-ascii)
From 59fa6558a9258719840e4d423c86313937156f95 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 5 May 2025 14:11:56 -0500
Subject: [PATCH v1 1/1] pg_upgrade: Use COPY for large object metadata.

---
 src/bin/pg_dump/pg_backup_archiver.c |  5 +++++
 src/bin/pg_dump/pg_dump.c            | 16 +++++++++++++++-
 src/bin/pg_dump/pg_dump_sort.c       |  2 +-
 src/bin/pg_dump/t/002_pg_dump.pl     |  4 +++-
 src/bin/pg_upgrade/dump.c            |  9 +++++++++
 src/bin/pg_upgrade/pg_upgrade.c      |  9 +++++++++
 6 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index afa42337b11..0d61155ec0a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -2974,6 +2974,11 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	int			res = REQ_SCHEMA | REQ_DATA;
 	RestoreOptions *ropt = AH->public.ropt;
 
+	if (ropt->binary_upgrade &&
+		strcmp(te->tag, "pg_largeobject_metadata") == 0 &&
+		strcmp(te->namespace, "pg_catalog") == 0)
+		return REQ_DATA;
+
 	/* These items are treated specially */
 	if (strcmp(te->desc, "ENCODING") == 0 ||
 		strcmp(te->desc, "STDSTRINGS") == 0 ||
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e2e7975b34e..312e1010456 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -1083,6 +1083,20 @@ main(int argc, char **argv)
 	if (!dopt.dumpData && dopt.sequence_data)
 		getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
 
+	if (dopt.binary_upgrade)
+	{
+		for (int i = 0; i < numTables; i++)
+		{
+			if (tblinfo[i].relkind == RELKIND_RELATION &&
+				strcmp(tblinfo[i].dobj.name, "pg_largeobject_metadata") == 0 &&
+				strcmp(tblinfo[i].dobj.namespace->dobj.name, "pg_catalog") == 0)
+			{
+				makeTableDataInfo(&dopt, &(tblinfo[i]));
+				break;
+			}
+		}
+	}
+
 	/*
 	 * In binary-upgrade mode, we do not have to worry about the actual LO
 	 * data or the associated metadata that resides in the pg_largeobject and
@@ -3925,7 +3939,7 @@ getLOs(Archive *fout)
 		 * pg_largeobject_metadata, after the dump is restored.
 		 */
 		if (dopt->binary_upgrade)
-			loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+			loinfo->dobj.dump &= ~(DUMP_COMPONENT_DATA | DUMP_COMPONENT_ACL | DUMP_COMPONENT_DEFINITION);
 
 		/*
 		 * Create a "BLOBS" data item for the group, too. This is just a
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 0b0977788f1..538e7dcb493 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -76,10 +76,10 @@ enum dbObjectTypePriorities
 	PRIO_TABLE_ATTACH,
 	PRIO_DUMMY_TYPE,
 	PRIO_ATTRDEF,
-	PRIO_LARGE_OBJECT,
 	PRIO_PRE_DATA_BOUNDARY,		/* boundary! */
 	PRIO_TABLE_DATA,
 	PRIO_SEQUENCE_SET,
+	PRIO_LARGE_OBJECT,
 	PRIO_LARGE_OBJECT_DATA,
 	PRIO_STATISTICS_DATA_DATA,
 	PRIO_POST_DATA_BOUNDARY,	/* boundary! */
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 55d892d9c16..2417f537ac1 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1029,6 +1029,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_owner => 1,
 			schema_only => 1,
@@ -1517,6 +1518,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			schema_only => 1,
 			schema_only_with_statistics => 1,
 			no_large_objects => 1,
@@ -4524,9 +4526,9 @@ my %tests = (
 			no_schema => 1,
 			section_data => 1,
 			test_schema_plus_large_objects => 1,
-			binary_upgrade => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_privs => 1,
 			schema_only => 1,
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23cb08e8347..e5b73b7e7a0 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -15,6 +15,9 @@
 void
 generate_old_dump(void)
 {
+	char	   *path = psprintf("%s/lo_shdep.out", realpath(log_opts.dumpdir, NULL));
+	PQExpBuffer buf = createPQExpBuffer();
+	PGconn	   *conn;
 	int			dbnum;
 
 	prep_status("Creating dump of global objects");
@@ -70,5 +73,11 @@ generate_old_dump(void)
 		;
 
 	end_progress_output();
+
+	conn = connectToServer(&old_cluster, "template1");
+	appendStringLiteralConn(buf, path, conn);
+	PQclear(executeQueryOrDie(conn, "COPY (SELECT * FROM pg_shdepend WHERE classid = 'pg_largeobject'::regclass) TO %s", buf->data));
+	destroyPQExpBuffer(buf);
+	PQfinish(conn);
 	check_ok();
 }
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 536e49d2616..7a6234d48c0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -571,6 +571,8 @@ prepare_new_globals(void)
 static void
 create_new_objects(void)
 {
+	char	   *path = psprintf("%s/lo_shdep.out", realpath(log_opts.dumpdir, NULL));
+	PQExpBuffer buf = createPQExpBuffer();
 	int			dbnum;
 	PGconn	   *conn_new_template1;
 
@@ -688,6 +690,13 @@ create_new_objects(void)
 		;
 
 	end_progress_output();
+
+	conn_new_template1 = connectToServer(&new_cluster, "template1");
+	appendStringLiteralConn(buf, path, conn_new_template1);
+	PQclear(executeQueryOrDie(conn_new_template1, "COPY pg_shdepend FROM %s", buf->data));
+	destroyPQExpBuffer(buf);
+	PQfinish(conn_new_template1);
+
 	check_ok();
 
 	/*
-- 
2.39.5 (Apple Git-154)

#32 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#31)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Mon, May 05, 2025 at 02:23:25PM -0500, Nathan Bossart wrote:

That leaves pg_shdepend. For now, I've just instructed pg_upgrade to COPY
the relevant pg_shdepend rows as an independent step, but perhaps there's a
reasonably straightforward way to put that in pg_dump, too.

It turns out there is. TableDataInfo's filtercond field can be used to
easily add a WHERE clause to the data dumping command. On my laptop,
upgrading with --jobs=8 with 10M large objects evenly distributed across 10
databases (each with a non-bootstrap-superuser owner and another role with
select rights) takes ~100 seconds without this patch and ~30 seconds with
it.

I've also added dependency tracking, version checks (this only works for
upgrades from >=v12 for now), a hack to ensure the columns for
pg_largeobject_metadata/pg_shdepend are collected, and comments. I'm sure
there's something I've missed, but this patch has worked well in my tests
thus far.

Taking a step back, I'm a little disappointed in the gains here. A 3-9x
speedup is nice, but I guess I was hoping to find another order of
magnitude somewhere. To do any better, I think we'd need to copy the files
for pg_largeobject_metadata directly for upgrades from >= v16, but that
would have to fit somewhere between when pg_restore creates the database
and when it restores any large object comments/seclabels. I'm not wild
about the amount of hackery required to get that working.

--
nathan

Attachments:

v2-0001-pg_upgrade-Use-COPY-for-large-object-metadata.patch (text/plain; charset=us-ascii)
From b14b74a4ad3ffff60c2fcb3c241766a651d681c5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 7 May 2025 09:06:05 -0500
Subject: [PATCH v2 1/1] pg_upgrade: Use COPY for large object metadata.

Reported-by: Hannu Krosing <hannuk@google.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMT0RQSS-6qLH%2BzYsOeUbAYhop3wmQTkNmQpo5--QRDUR%2BqYmQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_backup_archiver.c | 13 +++++
 src/bin/pg_dump/pg_dump.c            | 78 ++++++++++++++++++++++++++--
 src/bin/pg_dump/pg_dump_sort.c       |  2 +-
 src/bin/pg_dump/t/002_pg_dump.pl     |  4 +-
 4 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index afa42337b11..4c387b0cb5e 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -31,6 +31,8 @@
 #endif
 
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_largeobject_metadata.h"
+#include "catalog/pg_shdepend.h"
 #include "common/string.h"
 #include "compress_io.h"
 #include "dumputils.h"
@@ -2974,6 +2976,17 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	int			res = REQ_SCHEMA | REQ_DATA;
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.
+	 */
+	if (ropt->binary_upgrade && AH->public.remoteVersion >= 120000 &&
+		strcmp(te->desc, "TABLE DATA") == 0 &&
+		(te->catalogId.oid == LargeObjectMetadataRelationId ||
+		 te->catalogId.oid == SharedDependRelationId))
+		return REQ_DATA;
+
 	/* These items are treated specially */
 	if (strcmp(te->desc, "ENCODING") == 0 ||
 		strcmp(te->desc, "STDSTRINGS") == 0 ||
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e2e7975b34e..1666a4ec40f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -49,8 +49,10 @@
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_default_acl_d.h"
 #include "catalog/pg_largeobject_d.h"
+#include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
 #include "catalog/pg_publication_d.h"
+#include "catalog/pg_shdepend.h"
 #include "catalog/pg_subscription_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -209,6 +211,12 @@ static int	nbinaryUpgradeClassOids = 0;
 static SequenceItem *sequences = NULL;
 static int	nsequences = 0;
 
+/*
+ * For binary upgrade, the dump ID of pg_largeobject_metadata is saved for use
+ * as a dependency for pg_shdepend and any large object comments/seclabels.
+ */
+static DumpId lo_metadata_dumpId;
+
 /* Maximum number of relations to fetch in a fetchAttributeStats() call. */
 #define MAX_ATTR_STATS_RELS 64
 
@@ -1083,6 +1091,34 @@ main(int argc, char **argv)
 	if (!dopt.dumpData && dopt.sequence_data)
 		getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.
+	 */
+	if (dopt.binary_upgrade && fout->remoteVersion >= 120000)
+	{
+		TableInfo  *lo_metadata = findTableByOid(LargeObjectMetadataRelationId);
+		TableInfo  *shdepend = findTableByOid(SharedDependRelationId);
+
+		makeTableDataInfo(&dopt, lo_metadata);
+		makeTableDataInfo(&dopt, shdepend);
+
+		/*
+		 * Save pg_largeobject_metadata's dump ID for use as a dependency on
+		 * pg_shdepend and any large object comments/seclabels.
+		 */
+		lo_metadata_dumpId = lo_metadata->dataObj->dobj.dumpId;
+		addObjectDependency(&shdepend->dataObj->dobj, lo_metadata_dumpId);
+
+		/*
+		 * Only dump large object shdepend rows for this database.
+		 */
+		shdepend->dataObj->filtercond = "WHERE classid = 'pg_largeobject'::regclass "
+			"AND dbid = (SELECT oid FROM pg_database "
+			"            WHERE datname = current_database())";
+	}
+
 	/*
 	 * In binary-upgrade mode, we do not have to worry about the actual LO
 	 * data or the associated metadata that resides in the pg_largeobject and
@@ -3922,10 +3958,29 @@ getLOs(Archive *fout)
 		 * as it will be copied by pg_upgrade, which simply copies the
 		 * pg_largeobject table. We *do* however dump out anything but the
 		 * data, as pg_upgrade copies just pg_largeobject, but not
-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.
 		 */
 		if (dopt->binary_upgrade)
-			loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		{
+			if (fout->remoteVersion >= 120000)
+			{
+				loinfo->dobj.dump &= ~(DUMP_COMPONENT_DATA | DUMP_COMPONENT_ACL | DUMP_COMPONENT_DEFINITION);
+
+				/*
+				 * Mark the large object as dependent on
+				 * pg_largeobject_metadata so that any large object
+				 * comments/seclables are dumped after it.
+				 */
+				loinfo->dobj.dependencies = (DumpId *) pg_malloc(sizeof(DumpId));
+				loinfo->dobj.dependencies[0] = lo_metadata_dumpId;
+				loinfo->dobj.nDeps = loinfo->dobj.allocDeps = 1;
+			}
+			else
+				loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		}
 
 		/*
 		 * Create a "BLOBS" data item for the group, too. This is just a
@@ -9034,8 +9089,18 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		if (tbinfo->relkind == RELKIND_SEQUENCE)
 			continue;
 
-		/* Don't bother with uninteresting tables, either */
-		if (!tbinfo->interesting)
+		/*
+		 * Don't bother with uninteresting tables, either.  For binary
+		 * upgrades, this is bypassed for pg_largeobject_metadata and
+		 * pg_shdepend so that the columns names are collected for the
+		 * corresponding COPY commands.  Restoring the data for those catalogs
+		 * is faster than restoring the equivalent set of large object
+		 * commands.
+		 */
+		if (!tbinfo->interesting &&
+			!(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			  (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+			   tbinfo->dobj.catId.oid == SharedDependRelationId)))
 			continue;
 
 		/* OK, we need info for this table */
@@ -9232,7 +9297,10 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			pg_fatal("unrecognized table OID %u", attrelid);
 		/* cross-check that we only got requested tables */
 		if (tbinfo->relkind == RELKIND_SEQUENCE ||
-			!tbinfo->interesting)
+			(!tbinfo->interesting &&
+			 !(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			   (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+				tbinfo->dobj.catId.oid == SharedDependRelationId))))
 			pg_fatal("unexpected column data for table \"%s\"",
 					 tbinfo->dobj.name);
 
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 0b0977788f1..538e7dcb493 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -76,10 +76,10 @@ enum dbObjectTypePriorities
 	PRIO_TABLE_ATTACH,
 	PRIO_DUMMY_TYPE,
 	PRIO_ATTRDEF,
-	PRIO_LARGE_OBJECT,
 	PRIO_PRE_DATA_BOUNDARY,		/* boundary! */
 	PRIO_TABLE_DATA,
 	PRIO_SEQUENCE_SET,
+	PRIO_LARGE_OBJECT,
 	PRIO_LARGE_OBJECT_DATA,
 	PRIO_STATISTICS_DATA_DATA,
 	PRIO_POST_DATA_BOUNDARY,	/* boundary! */
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 55d892d9c16..2417f537ac1 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1029,6 +1029,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_owner => 1,
 			schema_only => 1,
@@ -1517,6 +1518,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			schema_only => 1,
 			schema_only_with_statistics => 1,
 			no_large_objects => 1,
@@ -4524,9 +4526,9 @@ my %tests = (
 			no_schema => 1,
 			section_data => 1,
 			test_schema_plus_large_objects => 1,
-			binary_upgrade => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_privs => 1,
 			schema_only => 1,
-- 
2.39.5 (Apple Git-154)

#33Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#32)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hi Nathan,

Did a quick check of the patch and it seems to work ok.

What do you think of the idea of not dumping pg_shdepend here, but
instead adding the required entries after loading
pg_largeobject_metadata based on the contents of it ?

The query for this would be:

    WITH db AS (
        SELECT oid FROM pg_database WHERE datname = current_database()
    )
    INSERT INTO pg_shdepend
    SELECT db.oid AS dbid,
           2613 AS classid,
           lm.oid AS objid,
           0 AS objsubid,
           1260 AS refclassid,
           COALESCE(acl.grantee, lm.lomowner) AS refobjid,
           CASE WHEN grantee IS NULL OR grantee = lomowner
                THEN 'o'
                ELSE 'a'
           END AS deptype
    FROM pg_largeobject_metadata AS lm
    LEFT JOIN LATERAL (
        SELECT DISTINCT (aclexplode(lm.lomacl)).grantee
    ) AS acl ON true,
    db
    WHERE (lm.oid, COALESCE(acl.grantee, lm.lomowner)) NOT IN
          (SELECT objid, refobjid FROM pg_shdepend WHERE dbid = db.oid);

(I had hoped to use ON CONFLICT DO NOTHING but this is not supported
for system tables.)

--
Hannu


On Wed, May 7, 2025 at 4:51 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, May 05, 2025 at 02:23:25PM -0500, Nathan Bossart wrote:

That leaves pg_shdepend. For now, I've just instructed pg_upgrade to COPY
the relevant pg_shdepend rows as an independent step, but perhaps there's a
reasonably straightforward way to put that in pg_dump, too.

It turns out there is. TableDataInfo's filtercond field can be used to
easily add a WHERE clause to the data dumping command. On my laptop,
upgrading with --jobs=8 with 10M large objects evenly distributed across 10
databases (each with a non-bootstrap-superuser owner and another role with
select rights) takes ~100 seconds without this patch and ~30 seconds with
it.

I've also added dependency tracking, version checks (this only works for
upgrades from >=v12 for now), a hack to ensure the columns for
pg_largeobject_metadata/pg_shdepend are collected, and comments. I'm sure
there's something I've missed, but this patch has worked well in my tests
thus far.

Taking a step back, I'm a little disappointed in the gains here. A 3-9x
speedup is nice, but I guess I was hoping to find another order of
magnitude somewhere. To do any better, I think we'd need to copy the files
for pg_largeobject_metadata directly for upgrades from >= v16, but that
would have to fit somewhere between when pg_restore creates the database
and when it restores any large object comments/seclabels. I'm not wild
about the amount of hackery required to get that working.

--
nathan

#34Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#33)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Sun, Jul 06, 2025 at 02:48:08PM +0200, Hannu Krosing wrote:

Did a quick check of the patch and it seems to work ok.

Thanks for taking a look.

What do you think of the idea of not dumping pg_shdepend here, but
instead adding the required entries after loading
pg_largeobject_metadata based on the contents of it ?

While not dumping it might save a little space during upgrade, the query
seems to be extremely slow. So, I don't see any strong advantage.

--
nathan

#35Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#34)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Tue, Jul 8, 2025 at 11:06 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Sun, Jul 06, 2025 at 02:48:08PM +0200, Hannu Krosing wrote:

Did a quick check of the patch and it seems to work ok.

Thanks for taking a look.

What do you think of the idea of not dumping pg_shdepend here, but
instead adding the required entries after loading
pg_largeobject_metadata based on the contents of it ?

While not dumping it might save a little space during upgrade, the query
seems to be extremely slow. So, I don't see any strong advantage.

Yeah, looks like the part that avoids duplicates made it slow.

If you run it without the last WHERE clause, it is reasonably fast. And it
then behaves the same as just inserting from the dump, which also does not
have any checks against duplicates.
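
That is, roughly the same INSERT with only the final deduplication
predicate removed (an untested sketch):

    WITH db AS (
        SELECT oid FROM pg_database WHERE datname = current_database()
    )
    INSERT INTO pg_shdepend
    SELECT db.oid AS dbid,
           2613 AS classid,                                -- pg_largeobject
           lm.oid AS objid,
           0 AS objsubid,
           1260 AS refclassid,                             -- pg_authid
           COALESCE(acl.grantee, lm.lomowner) AS refobjid,
           CASE WHEN acl.grantee IS NULL OR acl.grantee = lm.lomowner
                THEN 'o'                                   -- owner dependency
                ELSE 'a'                                   -- ACL dependency
           END AS deptype
    FROM pg_largeobject_metadata AS lm
    LEFT JOIN LATERAL (
        SELECT DISTINCT (aclexplode(lm.lomacl)).grantee
    ) AS acl ON true,
    db;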

#36Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#35)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Wed, Jul 09, 2025 at 04:52:16PM +0200, Hannu Krosing wrote:

If you run it without the last WHERE it is reasonably fast. And it
behaves the same as just inserting from the dump which also does not
have any checks against duplicates.

With that change, the query is much faster, but my testing indicates that
it's still measurably slower than dumping and restoring the relevant
contents of pg_shdepend.

--
nathan

#37Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#36)
Re: Horribly slow pg_upgrade performance with many Large Objects

Ah, I see.

I retried and now it runs for 21 to 29 seconds for 1 million large
objects with two grants each, i.e. 3M rows inserted in total.

Yesterday it ran in 8 seconds. Unfortunately I did not save the plan from
yesterday.

Anyway, it looks like just dumping pg_shdepend is better, and it likely
also compresses very well for most users.

--
Hannu

#38Nitin Motiani
nitinmotiani@google.com
In reply to: Hannu Krosing (#37)
Re: Horribly slow pg_upgrade performance with many Large Objects

Hi,

I have a couple of comments/questions.

There might be an existing issue here, because dbObjectTypePriorities has
the following comment:

 * NOTE: object-type priorities must match the section assignments made in
 * pg_dump.c; that is, PRE_DATA objects must sort before DO_PRE_DATA_BOUNDARY,
 * POST_DATA objects must sort after DO_POST_DATA_BOUNDARY, and DATA objects
 * must sort between them.

But dumpLO() puts large objects in SECTION_DATA, and PRIO_LARGE_OBJECT is
before PRIO_PRE_DATA_BOUNDARY. I admittedly haven't spent too much time
investigating this, though.

I looked through the history of this to see how it happened and whether it
could be an existing issue. Prior to a45c78e3284b, dumpLO() used to put large
objects in SECTION_PRE_DATA. That commit changed dumpLO() and also changed
addBoundaryDependencies() to move DO_LARGE_OBJECT from the pre-data to the
data section. It seems that since then this has been inconsistent with
pg_dump_sort.c. I think the change in pg_dump_sort.c should be backported
to PG17 and PG18 independently of the state of the larger patch.

But even with the inconsistency, it doesn't look like there is an existing
issue. Since the dependencies were changed in addBoundaryDependencies(), they
should take precedence over the order in pg_dump_sort.c. The
dbObjectTypePriorities are used by sortDumpableObjectsByTypeName(), but right
after that sortDumpableObjects() sorts the objects based on dependencies, so
the change in boundary dependencies should ensure that this works as
intended. Still, I think dbObjectTypePriorities should be made consistent
with the rest.

Also regarding this change in the patch

-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.
 		 */

Should the comment provide further detail on why this is only being done
for v12 and above?

Thanks & Regards,
Nitin Motiani
Google

#39Nathan Bossart
nathandbossart@gmail.com
In reply to: Nitin Motiani (#38)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Thu, Jul 10, 2025 at 06:05:26PM +0530, Nitin Motiani wrote:

-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.
*/

Should the comment provide further detail on why this is only being done
for v12 and above?

Yes. I've fixed this in v3.

--
nathan

Attachments:

v3-0001-pg_upgrade-Use-COPY-for-large-object-metadata.patch (text/plain; charset=us-ascii)
From b784cd881933a1c01f4ea143f06bf77972924459 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 10 Jul 2025 16:20:33 -0500
Subject: [PATCH v3 1/1] pg_upgrade: Use COPY for large object metadata.

Reported-by: Hannu Krosing <hannuk@google.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Hannu Krosing <hannuk@google.com>
Reviewed-by: Nitin Motiani <nitinmotiani@google.com>
Discussion: https://postgr.es/m/CAMT0RQSS-6qLH%2BzYsOeUbAYhop3wmQTkNmQpo5--QRDUR%2BqYmQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_backup_archiver.c | 15 +++++
 src/bin/pg_dump/pg_dump.c            | 84 ++++++++++++++++++++++++++--
 src/bin/pg_dump/t/002_pg_dump.pl     |  4 +-
 3 files changed, 97 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 197c1295d93..1b1c61e5ce5 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -31,6 +31,8 @@
 #endif
 
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_largeobject_metadata.h"
+#include "catalog/pg_shdepend.h"
 #include "common/string.h"
 #include "compress_io.h"
 #include "dumputils.h"
@@ -2974,6 +2976,19 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	int			res = REQ_SCHEMA | REQ_DATA;
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (ropt->binary_upgrade && AH->public.remoteVersion >= 120000 &&
+		strcmp(te->desc, "TABLE DATA") == 0 &&
+		(te->catalogId.oid == LargeObjectMetadataRelationId ||
+		 te->catalogId.oid == SharedDependRelationId))
+		return REQ_DATA;
+
 	/* These items are treated specially */
 	if (strcmp(te->desc, "ENCODING") == 0 ||
 		strcmp(te->desc, "STDSTRINGS") == 0 ||
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 1937997ea67..bdea0ef7c50 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -49,8 +49,10 @@
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_default_acl_d.h"
 #include "catalog/pg_largeobject_d.h"
+#include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
 #include "catalog/pg_publication_d.h"
+#include "catalog/pg_shdepend.h"
 #include "catalog/pg_subscription_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -209,6 +211,12 @@ static int	nbinaryUpgradeClassOids = 0;
 static SequenceItem *sequences = NULL;
 static int	nsequences = 0;
 
+/*
+ * For binary upgrade, the dump ID of pg_largeobject_metadata is saved for use
+ * as a dependency for pg_shdepend and any large object comments/seclabels.
+ */
+static DumpId lo_metadata_dumpId;
+
 /* Maximum number of relations to fetch in a fetchAttributeStats() call. */
 #define MAX_ATTR_STATS_RELS 64
 
@@ -1085,6 +1093,36 @@ main(int argc, char **argv)
 	if (!dopt.dumpData && dopt.sequence_data)
 		getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (dopt.binary_upgrade && fout->remoteVersion >= 120000)
+	{
+		TableInfo  *lo_metadata = findTableByOid(LargeObjectMetadataRelationId);
+		TableInfo  *shdepend = findTableByOid(SharedDependRelationId);
+
+		makeTableDataInfo(&dopt, lo_metadata);
+		makeTableDataInfo(&dopt, shdepend);
+
+		/*
+		 * Save pg_largeobject_metadata's dump ID for use as a dependency on
+		 * pg_shdepend and any large object comments/seclabels.
+		 */
+		lo_metadata_dumpId = lo_metadata->dataObj->dobj.dumpId;
+		addObjectDependency(&shdepend->dataObj->dobj, lo_metadata_dumpId);
+
+		/*
+		 * Only dump large object shdepend rows for this database.
+		 */
+		shdepend->dataObj->filtercond = "WHERE classid = 'pg_largeobject'::regclass "
+			"AND dbid = (SELECT oid FROM pg_database "
+			"            WHERE datname = current_database())";
+	}
+
 	/*
 	 * In binary-upgrade mode, we do not have to worry about the actual LO
 	 * data or the associated metadata that resides in the pg_largeobject and
@@ -3924,10 +3962,31 @@ getLOs(Archive *fout)
 		 * as it will be copied by pg_upgrade, which simply copies the
 		 * pg_largeobject table. We *do* however dump out anything but the
 		 * data, as pg_upgrade copies just pg_largeobject, but not
-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.  (On
+		 * <v12, pg_largeobject_metadata was created WITH OIDS, so the OID
+		 * column is hidden and won't be dumped.)
 		 */
 		if (dopt->binary_upgrade)
-			loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		{
+			if (fout->remoteVersion >= 120000)
+			{
+				loinfo->dobj.dump &= ~(DUMP_COMPONENT_DATA | DUMP_COMPONENT_ACL | DUMP_COMPONENT_DEFINITION);
+
+				/*
+				 * Mark the large object as dependent on
+				 * pg_largeobject_metadata so that any large object
+				 * comments/seclables are dumped after it.
+				 */
+				loinfo->dobj.dependencies = (DumpId *) pg_malloc(sizeof(DumpId));
+				loinfo->dobj.dependencies[0] = lo_metadata_dumpId;
+				loinfo->dobj.nDeps = loinfo->dobj.allocDeps = 1;
+			}
+			else
+				loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		}
 
 		/*
 		 * Create a "BLOBS" data item for the group, too. This is just a
@@ -9039,8 +9098,20 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		if (tbinfo->relkind == RELKIND_SEQUENCE)
 			continue;
 
-		/* Don't bother with uninteresting tables, either */
-		if (!tbinfo->interesting)
+		/*
+		 * Don't bother with uninteresting tables, either.  For binary
+		 * upgrades, this is bypassed for pg_largeobject_metadata and
+		 * pg_shdepend so that the columns names are collected for the
+		 * corresponding COPY commands.  Restoring the data for those catalogs
+		 * is faster than restoring the equivalent set of large object
+		 * commands.  We can only do this for upgrades from v12 and newer; in
+		 * older versions, pg_largeobject_metadata was created WITH OIDS, so
+		 * the OID column is hidden and won't be dumped.
+		 */
+		if (!tbinfo->interesting &&
+			!(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			  (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+			   tbinfo->dobj.catId.oid == SharedDependRelationId)))
 			continue;
 
 		/* OK, we need info for this table */
@@ -9244,7 +9315,10 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			pg_fatal("unrecognized table OID %u", attrelid);
 		/* cross-check that we only got requested tables */
 		if (tbinfo->relkind == RELKIND_SEQUENCE ||
-			!tbinfo->interesting)
+			(!tbinfo->interesting &&
+			 !(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			   (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+				tbinfo->dobj.catId.oid == SharedDependRelationId))))
 			pg_fatal("unexpected column data for table \"%s\"",
 					 tbinfo->dobj.name);
 
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 2485d8f360e..d8330e2bd17 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1087,6 +1087,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_owner => 1,
 			schema_only => 1,
@@ -1605,6 +1606,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			schema_only => 1,
 			schema_only_with_statistics => 1,
 			no_large_objects => 1,
@@ -4612,9 +4614,9 @@ my %tests = (
 			no_schema => 1,
 			section_data => 1,
 			test_schema_plus_large_objects => 1,
-			binary_upgrade => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_privs => 1,
 			schema_only => 1,
-- 
2.39.5 (Apple Git-154)

#40Nitin Motiani
nitinmotiani@google.com
In reply to: Nathan Bossart (#39)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Fri, Jul 11, 2025 at 3:12 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Thu, Jul 10, 2025 at 06:05:26PM +0530, Nitin Motiani wrote:

-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.
 		 */

Should the comment provide further detail on why this is only being done
for v12 and above?

Yes. I've fixed this in v3.

Thanks. Looks good to me. I would also just like to confirm that the
pg_dump_sort change will go in a different patch.

Regards,
Nitin Motiani
Google

#41Nathan Bossart
nathandbossart@gmail.com
In reply to: Nitin Motiani (#40)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Fri, Jul 11, 2025 at 07:49:47PM +0530, Nitin Motiani wrote:

Thanks. Looks good to me.

Thank you for reviewing.

Also just would like to confirm that the pg_dump_sort change will go in a
different patch.

That's already been committed: http://postgr.es/c/fb6c860.

--
nathan

#42Nitin Motiani
nitinmotiani@google.com
In reply to: Nathan Bossart (#41)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Fri, Jul 11, 2025 at 8:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

Also just would like to confirm that the pg_dump_sort change will go in a
different patch.

That's already been committed: http://postgr.es/c/fb6c860.

That's great. Thank you.

#43Nathan Bossart
nathandbossart@gmail.com
In reply to: Nitin Motiani (#42)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

Here is what I have staged for commit, which (barring feedback or
objections) I am planning to do towards the end of the week.

--
nathan

Attachments:

v4-0001-pg_upgrade-Use-COPY-for-large-object-metadata.patch (text/plain; charset=us-ascii)
From 9501f31cdf3ce25012cec8f9e2d7c433df749979 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 14 Jul 2025 11:17:47 -0500
Subject: [PATCH v4 1/1] pg_upgrade: Use COPY for large object metadata.

Presently, pg_dump generates commands like

    SELECT pg_catalog.lo_create('5432');
    ALTER LARGE OBJECT 5432 OWNER TO alice;
    GRANT SELECT ON LARGE OBJECT 5432 TO bob;

for each large object.  This is particularly slow at restore time,
especially when there are tens or hundreds of millions of large
objects.  From reports and personal experience, such slow restores
are primarily encountered at pg_upgrade time.  This commit teaches
pg_dump to instead dump pg_largeobject_metadata and the
corresponding pg_shdepend rows when in binary upgrade mode, i.e.,
pg_dump now generates commands like

    COPY pg_catalog.pg_largeobject_metadata (oid, lomowner, lomacl) FROM stdin;
    5432	16384	{alice=rw/alice,bob=r/alice}
    \.

    COPY pg_catalog.pg_shdepend (dbid, classid, objid, objsubid, refclassid, refobjid, deptype) FROM stdin;
    5	2613	5432	0	1260	16384	o
    5	2613	5432	0	1260	16385	a
    \.

Testing indicates the COPY approach can be significantly faster.
To do any better, we'd probably need to find a way to copy/link
pg_largeobject_metadata's files during pg_upgrade, which would be
complicated and limited to upgrades from >= v16 (since commit
7b378237aa changed the storage format for aclitem, which is used
for pg_largeobject_metadata.lomacl).

Note that this change only applies to binary upgrade mode (i.e.,
dumps initiated by pg_upgrade) since it inserts rows directly into
catalogs, bypassing the ordinary privilege checks.  Also, this
optimization can only be used for upgrades from >= v12 because
pg_largeobject_metadata was created WITH OIDS in older versions,
which prevents pg_dump from handling pg_largeobject_metadata.oid
properly.  With some extra effort, it might be possible to support
upgrades from older versions, but the added complexity didn't seem
worth it to support versions that will have been out-of-support for
nearly 3 years by the time this change is released.

The astute hacker may remember that prior to v12, pg_upgrade
copied/linked pg_largeobject_metadata's files (see commit
12a53c732c).  Besides the aforementioned storage format issues,
this approach failed to transfer the relevant pg_shdepend rows, and
pg_dump still had to generate an lo_create() command per large
object so that creating the dependent comments and security labels
worked.  We could perhaps adopt a hybrid approach for upgrades from
v16 and newer (i.e., generate lo_create() commands for each large
object, copy/link pg_largeobject_metadata's files, and COPY the
relevant pg_shdepend rows), but further testing is needed.

Reported-by: Hannu Krosing <hannuk@google.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Hannu Krosing <hannuk@google.com>
Reviewed-by: Nitin Motiani <nitinmotiani@google.com>
Discussion: https://postgr.es/m/CAMT0RQSS-6qLH%2BzYsOeUbAYhop3wmQTkNmQpo5--QRDUR%2BqYmQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_backup_archiver.c | 15 +++++
 src/bin/pg_dump/pg_dump.c            | 90 ++++++++++++++++++++++++++--
 src/bin/pg_dump/t/002_pg_dump.pl     |  4 +-
 3 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 197c1295d93..1b1c61e5ce5 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -31,6 +31,8 @@
 #endif
 
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_largeobject_metadata.h"
+#include "catalog/pg_shdepend.h"
 #include "common/string.h"
 #include "compress_io.h"
 #include "dumputils.h"
@@ -2974,6 +2976,19 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	int			res = REQ_SCHEMA | REQ_DATA;
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (ropt->binary_upgrade && AH->public.remoteVersion >= 120000 &&
+		strcmp(te->desc, "TABLE DATA") == 0 &&
+		(te->catalogId.oid == LargeObjectMetadataRelationId ||
+		 te->catalogId.oid == SharedDependRelationId))
+		return REQ_DATA;
+
 	/* These items are treated specially */
 	if (strcmp(te->desc, "ENCODING") == 0 ||
 		strcmp(te->desc, "STDSTRINGS") == 0 ||
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 1937997ea67..89a4927a25a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -49,8 +49,10 @@
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_default_acl_d.h"
 #include "catalog/pg_largeobject_d.h"
+#include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
 #include "catalog/pg_publication_d.h"
+#include "catalog/pg_shdepend.h"
 #include "catalog/pg_subscription_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -209,6 +211,12 @@ static int	nbinaryUpgradeClassOids = 0;
 static SequenceItem *sequences = NULL;
 static int	nsequences = 0;
 
+/*
+ * For binary upgrade, the dump ID of pg_largeobject_metadata is saved for use
+ * as a dependency for pg_shdepend and any large object comments/seclabels.
+ */
+static DumpId lo_metadata_dumpId;
+
 /* Maximum number of relations to fetch in a fetchAttributeStats() call. */
 #define MAX_ATTR_STATS_RELS 64
 
@@ -1085,6 +1093,36 @@ main(int argc, char **argv)
 	if (!dopt.dumpData && dopt.sequence_data)
 		getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (dopt.binary_upgrade && fout->remoteVersion >= 120000)
+	{
+		TableInfo  *lo_metadata = findTableByOid(LargeObjectMetadataRelationId);
+		TableInfo  *shdepend = findTableByOid(SharedDependRelationId);
+
+		makeTableDataInfo(&dopt, lo_metadata);
+		makeTableDataInfo(&dopt, shdepend);
+
+		/*
+		 * Save pg_largeobject_metadata's dump ID for use as a dependency on
+		 * pg_shdepend and any large object comments/seclabels.
+		 */
+		lo_metadata_dumpId = lo_metadata->dataObj->dobj.dumpId;
+		addObjectDependency(&shdepend->dataObj->dobj, lo_metadata_dumpId);
+
+		/*
+		 * Only dump large object shdepend rows for this database.
+		 */
+		shdepend->dataObj->filtercond = "WHERE classid = 'pg_largeobject'::regclass "
+			"AND dbid = (SELECT oid FROM pg_database "
+			"            WHERE datname = current_database())";
+	}
+
 	/*
 	 * In binary-upgrade mode, we do not have to worry about the actual LO
 	 * data or the associated metadata that resides in the pg_largeobject and
@@ -3924,10 +3962,37 @@ getLOs(Archive *fout)
 		 * as it will be copied by pg_upgrade, which simply copies the
 		 * pg_largeobject table. We *do* however dump out anything but the
 		 * data, as pg_upgrade copies just pg_largeobject, but not
-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.  (On
+		 * <v12, pg_largeobject_metadata was created WITH OIDS, so the OID
+		 * column is hidden and won't be dumped.)
 		 */
 		if (dopt->binary_upgrade)
-			loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		{
+			if (fout->remoteVersion >= 120000)
+			{
+				/*
+				 * We should've saved pg_largeobject_metadata's dump ID before
+				 * this point.
+				 */
+				Assert(lo_metadata_dumpId);
+
+				loinfo->dobj.dump &= ~(DUMP_COMPONENT_DATA | DUMP_COMPONENT_ACL | DUMP_COMPONENT_DEFINITION);
+
+				/*
+				 * Mark the large object as dependent on
+				 * pg_largeobject_metadata so that any large object
+				 * comments/seclables are dumped after it.
+				 */
+				loinfo->dobj.dependencies = (DumpId *) pg_malloc(sizeof(DumpId));
+				loinfo->dobj.dependencies[0] = lo_metadata_dumpId;
+				loinfo->dobj.nDeps = loinfo->dobj.allocDeps = 1;
+			}
+			else
+				loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		}
 
 		/*
 		 * Create a "BLOBS" data item for the group, too. This is just a
@@ -9039,8 +9104,20 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		if (tbinfo->relkind == RELKIND_SEQUENCE)
 			continue;
 
-		/* Don't bother with uninteresting tables, either */
-		if (!tbinfo->interesting)
+		/*
+		 * Don't bother with uninteresting tables, either.  For binary
+		 * upgrades, this is bypassed for pg_largeobject_metadata and
+		 * pg_shdepend so that the columns names are collected for the
+		 * corresponding COPY commands.  Restoring the data for those catalogs
+		 * is faster than restoring the equivalent set of large object
+		 * commands.  We can only do this for upgrades from v12 and newer; in
+		 * older versions, pg_largeobject_metadata was created WITH OIDS, so
+		 * the OID column is hidden and won't be dumped.
+		 */
+		if (!tbinfo->interesting &&
+			!(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			  (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+			   tbinfo->dobj.catId.oid == SharedDependRelationId)))
 			continue;
 
 		/* OK, we need info for this table */
@@ -9244,7 +9321,10 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			pg_fatal("unrecognized table OID %u", attrelid);
 		/* cross-check that we only got requested tables */
 		if (tbinfo->relkind == RELKIND_SEQUENCE ||
-			!tbinfo->interesting)
+			(!tbinfo->interesting &&
+			 !(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			   (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+				tbinfo->dobj.catId.oid == SharedDependRelationId))))
 			pg_fatal("unexpected column data for table \"%s\"",
 					 tbinfo->dobj.name);
 
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 2485d8f360e..d8330e2bd17 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1087,6 +1087,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_owner => 1,
 			schema_only => 1,
@@ -1605,6 +1606,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			schema_only => 1,
 			schema_only_with_statistics => 1,
 			no_large_objects => 1,
@@ -4612,9 +4614,9 @@ my %tests = (
 			no_schema => 1,
 			section_data => 1,
 			test_schema_plus_large_objects => 1,
-			binary_upgrade => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_privs => 1,
 			schema_only => 1,
-- 
2.39.5 (Apple Git-154)

#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#43)
Re: Horribly slow pg_upgrade performance with many Large Objects

Nathan Bossart <nathandbossart@gmail.com> writes:

Here is what I have staged for commit, which (barring feedback or
objections) I am planning to do towards the end of the week.

Is it intentional that this does

+#include "catalog/pg_largeobject_metadata.h"
+#include "catalog/pg_shdepend.h"

rather than including the corresponding *_d.h headers?
If so, why? Our normal coding convention is that frontend
code should only include catalog *_d.h files, since the main
headers might contain frontend-unfriendly declarations.
If there is something we need to expose in these catalogs'
*_d.h headers, we should probably do that.

regards, tom lane

#45Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#44)
1 attachment(s)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Mon, Jul 14, 2025 at 01:28:03PM -0400, Tom Lane wrote:

Is it intentional that this does

+#include "catalog/pg_largeobject_metadata.h"
+#include "catalog/pg_shdepend.h"

rather than including the corresponding *_d.h headers?

Nope, that was an oversight.

--
nathan

Attachments:

v5-0001-pg_upgrade-Use-COPY-for-large-object-metadata.patch (text/plain; charset=us-ascii)
From 99551525801a462ca7f1beaa0096890df8d67c0a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 14 Jul 2025 12:47:14 -0500
Subject: [PATCH v5 1/1] pg_upgrade: Use COPY for large object metadata.

Presently, pg_dump generates commands like

    SELECT pg_catalog.lo_create('5432');
    ALTER LARGE OBJECT 5432 OWNER TO alice;
    GRANT SELECT ON LARGE OBJECT 5432 TO bob;

for each large object.  This is particularly slow at restore time,
especially when there are tens or hundreds of millions of large
objects.  From reports and personal experience, such slow restores
are primarily encountered at pg_upgrade time.  This commit teaches
pg_dump to instead dump pg_largeobject_metadata and the
corresponding pg_shdepend rows when in binary upgrade mode, i.e.,
pg_dump now generates commands like

    COPY pg_catalog.pg_largeobject_metadata (oid, lomowner, lomacl) FROM stdin;
    5432	16384	{alice=rw/alice,bob=r/alice}
    \.

    COPY pg_catalog.pg_shdepend (dbid, classid, objid, objsubid, refclassid, refobjid, deptype) FROM stdin;
    5	2613	5432	0	1260	16384	o
    5	2613	5432	0	1260	16385	a
    \.

Testing indicates the COPY approach can be significantly faster.
To do any better, we'd probably need to find a way to copy/link
pg_largeobject_metadata's files during pg_upgrade, which would be
complicated and limited to upgrades from >= v16 (since commit
7b378237aa changed the storage format for aclitem, which is used
for pg_largeobject_metadata.lomacl).

Note that this change only applies to binary upgrade mode (i.e.,
dumps initiated by pg_upgrade) since it inserts rows directly into
catalogs, bypassing the ordinary privilege checks.  Also, this
optimization can only be used for upgrades from >= v12 because
pg_largeobject_metadata was created WITH OIDS in older versions,
which prevents pg_dump from handling pg_largeobject_metadata.oid
properly.  With some extra effort, it might be possible to support
upgrades from older versions, but the added complexity didn't seem
worth it to support versions that will have been out-of-support for
nearly 3 years by the time this change is released.

The astute hacker may remember that prior to v12, pg_upgrade
copied/linked pg_largeobject_metadata's files (see commit
12a53c732c).  Besides the aforementioned storage format issues,
this approach failed to transfer the relevant pg_shdepend rows, and
pg_dump still had to generate an lo_create() command per large
object so that creating the dependent comments and security labels
worked.  We could perhaps adopt a hybrid approach for upgrades from
v16 and newer (i.e., generate lo_create() commands for each large
object, copy/link pg_largeobject_metadata's files, and COPY the
relevant pg_shdepend rows), but further testing is needed.

Reported-by: Hannu Krosing <hannuk@google.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Hannu Krosing <hannuk@google.com>
Reviewed-by: Nitin Motiani <nitinmotiani@google.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMT0RQSS-6qLH%2BzYsOeUbAYhop3wmQTkNmQpo5--QRDUR%2BqYmQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_backup_archiver.c | 15 +++++
 src/bin/pg_dump/pg_dump.c            | 90 ++++++++++++++++++++++++++--
 src/bin/pg_dump/t/002_pg_dump.pl     |  4 +-
 3 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 197c1295d93..30e0da31aa3 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -31,6 +31,8 @@
 #endif
 
 #include "catalog/pg_class_d.h"
+#include "catalog/pg_largeobject_metadata_d.h"
+#include "catalog/pg_shdepend_d.h"
 #include "common/string.h"
 #include "compress_io.h"
 #include "dumputils.h"
@@ -2974,6 +2976,19 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	int			res = REQ_SCHEMA | REQ_DATA;
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (ropt->binary_upgrade && AH->public.remoteVersion >= 120000 &&
+		strcmp(te->desc, "TABLE DATA") == 0 &&
+		(te->catalogId.oid == LargeObjectMetadataRelationId ||
+		 te->catalogId.oid == SharedDependRelationId))
+		return REQ_DATA;
+
 	/* These items are treated specially */
 	if (strcmp(te->desc, "ENCODING") == 0 ||
 		strcmp(te->desc, "STDSTRINGS") == 0 ||
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 1937997ea67..e1e4710e913 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -49,8 +49,10 @@
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_default_acl_d.h"
 #include "catalog/pg_largeobject_d.h"
+#include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
 #include "catalog/pg_publication_d.h"
+#include "catalog/pg_shdepend_d.h"
 #include "catalog/pg_subscription_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -209,6 +211,12 @@ static int	nbinaryUpgradeClassOids = 0;
 static SequenceItem *sequences = NULL;
 static int	nsequences = 0;
 
+/*
+ * For binary upgrade, the dump ID of pg_largeobject_metadata is saved for use
+ * as a dependency for pg_shdepend and any large object comments/seclabels.
+ */
+static DumpId lo_metadata_dumpId;
+
 /* Maximum number of relations to fetch in a fetchAttributeStats() call. */
 #define MAX_ATTR_STATS_RELS 64
 
@@ -1085,6 +1093,36 @@ main(int argc, char **argv)
 	if (!dopt.dumpData && dopt.sequence_data)
 		getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
 
+	/*
+	 * For binary upgrade mode, dump pg_largeobject_metadata and the
+	 * associated pg_shdepend rows. This is faster to restore than the
+	 * equivalent set of large object commands.  We can only do this for
+	 * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
+	 * was created WITH OIDS, so the OID column is hidden and won't be dumped.
+	 */
+	if (dopt.binary_upgrade && fout->remoteVersion >= 120000)
+	{
+		TableInfo  *lo_metadata = findTableByOid(LargeObjectMetadataRelationId);
+		TableInfo  *shdepend = findTableByOid(SharedDependRelationId);
+
+		makeTableDataInfo(&dopt, lo_metadata);
+		makeTableDataInfo(&dopt, shdepend);
+
+		/*
+		 * Save pg_largeobject_metadata's dump ID for use as a dependency on
+		 * pg_shdepend and any large object comments/seclabels.
+		 */
+		lo_metadata_dumpId = lo_metadata->dataObj->dobj.dumpId;
+		addObjectDependency(&shdepend->dataObj->dobj, lo_metadata_dumpId);
+
+		/*
+		 * Only dump large object shdepend rows for this database.
+		 */
+		shdepend->dataObj->filtercond = "WHERE classid = 'pg_largeobject'::regclass "
+			"AND dbid = (SELECT oid FROM pg_database "
+			"            WHERE datname = current_database())";
+	}
+
 	/*
 	 * In binary-upgrade mode, we do not have to worry about the actual LO
 	 * data or the associated metadata that resides in the pg_largeobject and
@@ -3924,10 +3962,37 @@ getLOs(Archive *fout)
 		 * as it will be copied by pg_upgrade, which simply copies the
 		 * pg_largeobject table. We *do* however dump out anything but the
 		 * data, as pg_upgrade copies just pg_largeobject, but not
-		 * pg_largeobject_metadata, after the dump is restored.
+		 * pg_largeobject_metadata, after the dump is restored.  In versions
+		 * before v12, this is done via proper large object commands.  In
+		 * newer versions, we dump the content of pg_largeobject_metadata and
+		 * any associated pg_shdepend rows, which is faster to restore.  (On
+		 * <v12, pg_largeobject_metadata was created WITH OIDS, so the OID
+		 * column is hidden and won't be dumped.)
 		 */
 		if (dopt->binary_upgrade)
-			loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		{
+			if (fout->remoteVersion >= 120000)
+			{
+				/*
+				 * We should've saved pg_largeobject_metadata's dump ID before
+				 * this point.
+				 */
+				Assert(lo_metadata_dumpId);
+
+				loinfo->dobj.dump &= ~(DUMP_COMPONENT_DATA | DUMP_COMPONENT_ACL | DUMP_COMPONENT_DEFINITION);
+
+				/*
+				 * Mark the large object as dependent on
+				 * pg_largeobject_metadata so that any large object
+				 * comments/seclables are dumped after it.
+				 */
+				loinfo->dobj.dependencies = (DumpId *) pg_malloc(sizeof(DumpId));
+				loinfo->dobj.dependencies[0] = lo_metadata_dumpId;
+				loinfo->dobj.nDeps = loinfo->dobj.allocDeps = 1;
+			}
+			else
+				loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA;
+		}
 
 		/*
 		 * Create a "BLOBS" data item for the group, too. This is just a
@@ -9039,8 +9104,20 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		if (tbinfo->relkind == RELKIND_SEQUENCE)
 			continue;
 
-		/* Don't bother with uninteresting tables, either */
-		if (!tbinfo->interesting)
+		/*
+		 * Don't bother with uninteresting tables, either.  For binary
+		 * upgrades, this is bypassed for pg_largeobject_metadata and
+		 * pg_shdepend so that the columns names are collected for the
+		 * corresponding COPY commands.  Restoring the data for those catalogs
+		 * is faster than restoring the equivalent set of large object
+		 * commands.  We can only do this for upgrades from v12 and newer; in
+		 * older versions, pg_largeobject_metadata was created WITH OIDS, so
+		 * the OID column is hidden and won't be dumped.
+		 */
+		if (!tbinfo->interesting &&
+			!(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			  (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+			   tbinfo->dobj.catId.oid == SharedDependRelationId)))
 			continue;
 
 		/* OK, we need info for this table */
@@ -9244,7 +9321,10 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			pg_fatal("unrecognized table OID %u", attrelid);
 		/* cross-check that we only got requested tables */
 		if (tbinfo->relkind == RELKIND_SEQUENCE ||
-			!tbinfo->interesting)
+			(!tbinfo->interesting &&
+			 !(fout->dopt->binary_upgrade && fout->remoteVersion >= 120000 &&
+			   (tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId ||
+				tbinfo->dobj.catId.oid == SharedDependRelationId))))
 			pg_fatal("unexpected column data for table \"%s\"",
 					 tbinfo->dobj.name);
 
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 2485d8f360e..d8330e2bd17 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1087,6 +1087,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_owner => 1,
 			schema_only => 1,
@@ -1605,6 +1606,7 @@ my %tests = (
 			test_schema_plus_large_objects => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			schema_only => 1,
 			schema_only_with_statistics => 1,
 			no_large_objects => 1,
@@ -4612,9 +4614,9 @@ my %tests = (
 			no_schema => 1,
 			section_data => 1,
 			test_schema_plus_large_objects => 1,
-			binary_upgrade => 1,
 		},
 		unlike => {
+			binary_upgrade => 1,
 			no_large_objects => 1,
 			no_privs => 1,
 			schema_only => 1,
-- 
2.39.5 (Apple Git-154)

#46Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#45)
Re: Horribly slow pg_upgrade performance with many Large Objects

Committed.

I'm cautiously optimistic that we can find some better gains for upgrades
from v16 and newer. That would involve dumping lo_create() commands for
all LOs with comments/seclabels, dumping the relevant pg_shdepend rows, and
then copying/linking the pg_largeobject_metadata files like we did prior to
v12.
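
For illustration only, a sketch of what such a hybrid dump might emit (not
what the committed patch does; the OIDs and role OIDs reuse the examples
from the commit message):

    -- lo_create() only for large objects that have a comment or security
    -- label, so the dependent objects can still be restored
    SELECT pg_catalog.lo_create('5432');
    COMMENT ON LARGE OBJECT 5432 IS 'example comment';

    -- pg_shdepend rows for this database, as in the committed patch
    -- (the values are tab-separated in the real COPY data)
    COPY pg_catalog.pg_shdepend (dbid, classid, objid, objsubid, refclassid, refobjid, deptype) FROM stdin;
    5	2613	5432	0	1260	16384	o
    5	2613	5432	0	1260	16385	a
    \.

pg_largeobject_metadata itself would then be copied or linked at the file
level by pg_upgrade, as was done prior to v12, instead of being restored
through the dump.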

--
nathan

#47Hannu Krosing
hannuk@google.com
In reply to: Nathan Bossart (#46)
Re: Horribly slow pg_upgrade performance with many Large Objects

Do you think the current patch could be backported to at least some of the
latest versions?


On Fri, Jul 18, 2025 at 6:05 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

Committed.

I'm cautiously optimistic that we can find some better gains for upgrades
from v16 and newer. That would involve dumping lo_create() commands for
all LOs with comments/seclabels, dumping the relevant pg_shdepend rows, and
then copying/linking the pg_largeobject_metadata files like we did prior to
v12.

--
nathan

#48Nathan Bossart
nathandbossart@gmail.com
In reply to: Hannu Krosing (#47)
Re: Horribly slow pg_upgrade performance with many Large Objects

On Mon, Jul 21, 2025 at 02:03:45AM +0200, Hannu Krosing wrote:

Do you think the current patch could be backported to at least some
latest versions ?

I think that's pretty unlikely. It'd be a pretty big departure from our
versioning policy. In the past, we have back-patched "critical"
performance fixes, but that involved changes that had already been in
released versions for years.

--
nathan