RE: pg_upgrade failing for 200+ million Large Objects

Started by Tharakan, Robins · about 5 years ago · 100 messages
#1 Tharakan, Robins
tharar@amazon.com

Hi all,

Attached is a proof-of-concept patch that allows Postgres to perform
pg_upgrade when the instance has millions of Large Objects.

It would be great if someone could take a look and see if this patch is in
the right direction. There are some pending tasks (such as documentation /
pg_resetxlog vs pg_resetwal related changes) but for now, the patch helps
remove a stalemate where if a Postgres instance has a large number
(accurately speaking 146+ Million) of Large Objects, pg_upgrade fails. This
is easily reproducible and besides deleting Large Objects before upgrade,
there is no other (apparent) way for pg_upgrade to complete.

The patch (attached):
- Applies cleanly on REL9_6_STABLE -
c7a4fc3dd001646d5938687ad59ab84545d5d043
- 'make check' passes
- Allows the user to provide a constant via pg_upgrade command-line, that
overrides the 2 billion constant in pg_resetxlog [1] thereby increasing the
(window of) Transaction IDs available for pg_upgrade to complete.

Sample argument for pg_upgrade:
$ /opt/postgres/96/bin/pg_upgrade --max-limit-xid 1000000000 --old-bindir
...

With this patch, pg_upgrade is now able to upgrade a v9.5 cluster with 500
million Large Objects successfully to v9.6 - some stats below:

Source Postgres - v9.5.24
Target Version - v9.6.21
Large Object Count: 500 Million Large Objects
Machine - r5.4xlarge (16vCPU / 128GB RAM + 1TB swap)
Memory used during pg_upgrade - ~350GB
Time taken - 25+ hrs. (tested twice) - (All LOs processed sequentially ->
Scope for optimization)

Although counter-intuitive, for this testing purpose all Large Objects were
small (essentially the idea was to test the count) and created by using
something like this:

seq 1 50000 | xargs -n 1 -i -P 10 /opt/postgres/95/bin/psql -c "select
lo_from_bytea(0, '\xffffff00') from generate_series(1,10000);" > /dev/null
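As a rough sanity check (an illustrative sketch, not part of the original test run), the loop above amounts to 50,000 psql invocations of 10,000 lo_from_bytea() calls each:

```shell
# Arithmetic behind the reproduction above: 50,000 psql invocations,
# each creating 10,000 small Large Objects in one SELECT.
invocations=50000
objects_per_invocation=10000
total=$(( invocations * objects_per_invocation ))
echo "$total"   # 500000000, i.e. the 500 million Large Objects tested
```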

I am not married to the patch (especially the argument name) but ideally I'd
prefer a way to get this upgrade going without a patch. For now, I am unable
to find any other way to upgrade a v9.5 Postgres database in this scenario,
facing End-of-Life.

Reference:
1) 2 Billion constant -
https://github.com/postgres/postgres/blob/ca3b37487be333a1d241dab1bbdd17a211a88f43/src/bin/pg_resetwal/pg_resetwal.c#L444

Thanks,
Robins Tharakan


-----Original Message-----
From: Tharakan, Robins
Sent: Wednesday, 3 March 2021 10:36 PM
To: pgsql-hackers@postgresql.org
Subject: pg_upgrade failing for 200+ million Large Objects

Hi,

While reviewing a failed upgrade from Postgres v9.5 (to v9.6) I saw that
the instance had ~200 million (in-use) Large Objects. I was able to
reproduce this on a test instance, which also fails with a similar error.

pg_restore: executing BLOB 4980622
pg_restore: WARNING: database with OID 0 must be vacuumed within 1000001 transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions.
pg_restore: executing BLOB 4980623
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2565; 2613 4980623 BLOB 4980623 postgres
pg_restore: [archiver (db)] could not execute query: ERROR: database is not accepting commands to avoid wraparound data loss in database with OID 0
HINT: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.
Command was: SELECT pg_catalog.lo_create('4980623');

To rule out the obvious possibilities: these Large Objects are still
in-use (so vacuumlo wouldn't help), giving more system resources doesn't
help, moving Large Objects to another database doesn't help (since this is
a cluster-wide restriction), the source instance is nowhere close to
wraparound, and lastly the most recent minor versions don't help either (I
tried compiling REL9_6_STABLE and upgrading a database with 150 million
LOs, and still encountered the same issue).

Do let me know if I am missing something obvious, but it appears that this
is happening owing to two things coming together:

* Each Large Object is migrated in its own transaction during pg_upgrade
* pg_resetxlog appears to be narrowing the window of Transaction IDs
available for pg_upgrade to ~146 million XIDs (2^31, minus a 1 million XID
wraparound margin, minus a hard-coded 2 billion constant - see [1] - in
what appears to be an attempt to force an autovacuum wraparound run soon
after the upgrade completes).

Such an XID-based restriction is limiting for an instance that's actively
using a lot of Large Objects. Beyond forcing the autovacuum wraparound
logic to kick in soon after the upgrade, I am unclear what else it aims to
do. What it does seem to do is block major version upgrades if the
pre-upgrade instance has more than ~146 million Large Objects (half that,
if each LO additionally requires an ALTER LARGE OBJECT OWNER TO during
pg_restore).
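For concreteness, the window arithmetic works out as follows (a sketch using the constants cited above):

```shell
# XID window left for pg_upgrade: 2^31 total XID space, minus the
# hard-coded 2 billion constant in pg_resetxlog, minus the 1 million
# XID wraparound safety margin.
xid_space=$(( 2**31 ))
resetxlog_constant=2000000000
wraparound_margin=1000000
window=$(( xid_space - resetxlog_constant - wraparound_margin ))
echo "$window"   # 146483648, i.e. the ~146 million XIDs mentioned above
```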

For the long term, these ideas came to mind, although I am unsure which
are low-hanging fruit and which are outright impossible: club multiple
objects into a single transaction [2], force autovacuum post-upgrade (and
thus remove this limitation altogether), or see whether "pg_resetxlog -x"
(called from within pg_upgrade) could help work around this limitation in
some way.

Is there a short-term recommendation for this scenario?

I can understand that a high number of small objects is not a great way to
use pg_largeobject (since Large Objects were intended for, well, 'large
objects'), but this magic number of Large Objects is now a stalemate at
this point (with respect to the v9.5 EOL).

Reference:
1) pg_resetxlog -
https://github.com/postgres/postgres/blob/ca3b37487be333a1d241dab1bbdd17a211a88f43/src/bin/pg_resetwal/pg_resetwal.c#L444
2) /messages/by-id/ed7d86a1-b907-4f53-9f6e-63482d2f2bac%40manitou-mail.org

-
Thanks
Robins Tharakan

Attachments:

pgupgrade_lo_v2.patch (application/octet-stream, +32 -5 lines)
#2 Daniel Gustafsson
daniel@yesql.se
In reply to: Tharakan, Robins (#1)
Re: pg_upgrade failing for 200+ million Large Objects

On 7 Mar 2021, at 09:43, Tharakan, Robins <tharar@amazon.com> wrote:

The patch (attached):
- Applies cleanly on REL9_6_STABLE -
c7a4fc3dd001646d5938687ad59ab84545d5d043

Did you target 9.6 because that's where you want to upgrade to, or is this not
a problem on HEAD? If it's still a problem on HEAD you should probably submit
the patch against there. You probably also want to add it to the next commit
fest to make sure it's not forgotten about: https://commitfest.postgresql.org/33/

I am not married to the patch (especially the argument name) but ideally I'd
prefer a way to get this upgrade going without a patch. For now, I am unable
to find any other way to upgrade a v9.5 Postgres database in this scenario,
facing End-of-Life.

It's obviously not my call to make in any shape or form, but this doesn't
really seem to fall under what is generally backported into a stable release?

--
Daniel Gustafsson https://vmware.com/

#3 Peter Eisentraut
peter_e@gmx.net
In reply to: Tharakan, Robins (#1)
Re: pg_upgrade failing for 200+ million Large Objects

On 07.03.21 09:43, Tharakan, Robins wrote:

Attached is a proof-of-concept patch that allows Postgres to perform
pg_upgrade when the instance has millions of Large Objects.

It would be great if someone could take a look and see if this patch is in
the right direction. There are some pending tasks (such as documentation /
pg_resetxlog vs pg_resetwal related changes) but for now, the patch helps
remove a stalemate where if a Postgres instance has a large number
(accurately speaking 146+ Million) of Large Objects, pg_upgrade fails. This
is easily reproducible and besides deleting Large Objects before upgrade,
there is no other (apparent) way for pg_upgrade to complete.

The patch (attached):
- Applies cleanly on REL9_6_STABLE -
c7a4fc3dd001646d5938687ad59ab84545d5d043
- 'make check' passes
- Allows the user to provide a constant via pg_upgrade command-line, that
overrides the 2 billion constant in pg_resetxlog [1] thereby increasing the
(window of) Transaction IDs available for pg_upgrade to complete.

Could you explain what your analysis of the problem is and why this
patch (might) fix it?

Right now, all I see here is, pass a big number via a command-line
option and hope it works.

#4 Tharakan, Robins
tharar@amazon.com
In reply to: Daniel Gustafsson (#2)

Thanks Daniel for the input / next-steps.

I see that 'master' too has this same magic constant [1] and so I expect it
to have similar restrictions, although I haven't tested this yet.

I do agree that the need then is to re-submit a patch that works against
'master'. (I am travelling the next few days, but) unless discussions go
tangential, I expect to follow up with an updated patch by the end of this
week and create a commitfest entry while at it.

Reference:
1) https://github.com/postgres/postgres/blob/master/src/bin/pg_resetwal/pg_resetwal.c#L444

-
Robins Tharakan

#5 Tharakan, Robins
tharar@amazon.com
In reply to: Peter Eisentraut (#3)
Re: pg_upgrade failing for 200+ million Large Objects

Thanks Peter.

The original email [1] had some more context that somehow didn't get
associated with this recent email. Apologies for any confusion.

In short, pg_resetxlog (and pg_resetwal) employs a magic constant [2] (for
both v9.6 as well as master) which seems to have been selected to force an
aggressive autovacuum as soon as the upgrade completes. Although that works
as planned, it narrows the window of Transaction IDs available for the
upgrade (before XID wraparound protection kicks in and aborts the upgrade)
to 146 million.

Reducing this magic constant allows a larger XID window, which is what the
patch is trying to do. With the patch, I was able to successfully upgrade a
cluster with 500 million Large Objects (which otherwise reliably fails). In
the original email [1] I had also listed a few other possible workarounds,
but was unsure which would be a good direction to start working on; hence
this patch, to make a start.

Reference:
1) /messages/by-id/12601596dbbc4c01b86b4ac4d2bd4d48@EX13D05UWC001.ant.amazon.com
2) https://github.com/postgres/postgres/blob/master/src/bin/pg_resetwal/pg_resetwal.c#L444

-
robins | tharar@ | syd12

#6 Magnus Hagander
magnus@hagander.net
In reply to: Tharakan, Robins (#5)
Re: pg_upgrade failing for 200+ million Large Objects

On Mon, Mar 8, 2021 at 12:02 PM Tharakan, Robins <tharar@amazon.com> wrote:

Thanks Peter.

The original email [1] had some more context that somehow didn't get
associated with this recent email. Apologies for any confusion.

Please take a look at your email configuration -- all your emails are
lacking both References and In-reply-to headers, so every email starts
a new thread, both for each reader and in the archives. It seems quite
broken. It makes it very hard to follow.

In short, pg_resetxlog (and pg_resetwal) employs a magic constant [2] (for
both v9.6 as well as master) which seems to have been selected to force an
aggressive autovacuum as soon as the upgrade completes. Although that works
as planned, it narrows the window of Transaction IDs available for the
upgrade (before which XID wraparound protection kicks and aborts the
upgrade) to 146 Million.

Reducing this magic constant allows a larger XID window, which is what the
patch is trying to do. With the patch, I was able to upgrade a cluster with
500m Large Objects successfully (which otherwise reliably fails). In the
original email [1] I had also listed a few other possible workarounds, but
was unsure which would be a good direction to start working on.... thus this
patch to make a start.

This still seems to just fix the symptoms and not the actual problem.

What part of the pg_upgrade process is it that actually burns through
that many transactions?

Without looking, I would guess it's the schema reload using
pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
issue in pg_dump/pg_restore. And if that is the case -- perhaps just
running all of those in a single transaction would be a better choice?
One could argue it's still not a proper fix, because we'd still have a
huge memory usage etc, but it would then only burn 1 xid instead of
500M...

AFAICT at a quick check, pg_dump in binary upgrade mode emits one
lo_create() and one ALTER ... OWNER TO for each large object - so with
500M large objects that would be a billion statements, and thus a
billion xids. And without checking, I'm fairly sure it doesn't load in
a single transaction...

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#7 Robins Tharakan
tharakan@gmail.com
In reply to: Magnus Hagander (#6)
Re: pg_upgrade failing for 200+ million Large Objects

Hi Magnus,

On Mon, 8 Mar 2021 at 23:34, Magnus Hagander <magnus@hagander.net> wrote:

AFAICT at a quick check, pg_dump in binary upgrade mode emits one
lo_create() and one ALTER ... OWNER TO for each large object - so with
500M large objects that would be a billion statements, and thus a
billion xids. And without checking, I'm fairly sure it doesn't load in
a single transaction...

Your assumptions are pretty much correct.

The issue isn't with pg_upgrade itself. During pg_restore, each Large
Object (and separately each ALTER LARGE OBJECT OWNER TO) consumes one XID.
For background, that's the reason the v9.5 production instance I was
reviewing was unable to process more than 73 million Large Objects, since
each object required a CREATE + ALTER. (To clarify, 73 million = (2^31 - 2
billion magic constant - 1 million wraparound protection) / 2)
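That halving can be sketched the same way: each object burns two XIDs during restore, one for the CREATE and one for the ALTER.

```shell
# ~146M XID window, halved because each Large Object needs two
# single-statement transactions during restore (lo_create + ALTER OWNER).
window=$(( 2**31 - 2000000000 - 1000000 ))
per_object_xids=2
max_objects=$(( window / per_object_xids ))
echo "$max_objects"   # 73241824, the ~73 million ceiling noted above
```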

Without looking, I would guess it's the schema reload using
pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
issue in pg_dump/pg_restore. And if that is the case -- perhaps just
running all of those in a single transaction would be a better choice?
One could argue it's still not a proper fix, because we'd still have a
huge memory usage etc, but it would then only burn 1 xid instead of
500M...

(I hope I am not missing something but) When I tried to force pg_restore to
use a single transaction (by hacking pg_upgrade's pg_restore call to use
--single-transaction), it too failed owing to being unable to lock so many
objects in a single transaction.

This still seems to just fix the symptoms and not the actual problem.

I agree that the patch doesn't address the root-cause, but it did get the
upgrade to complete on a test-setup. Do you think that (instead of all
objects) batching multiple Large Objects in a single transaction (and
allowing the caller to size that batch via command line) would be a good /
acceptable idea here?

Please take a look at your email configuration -- all your emails are
lacking both References and In-reply-to headers.

Thanks for highlighting the cause here. Hopefully switching mail clients
would help.
-
Robins Tharakan

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robins Tharakan (#7)
Re: pg_upgrade failing for 200+ million Large Objects

Robins Tharakan <tharakan@gmail.com> writes:

On Mon, 8 Mar 2021 at 23:34, Magnus Hagander <magnus@hagander.net> wrote:

Without looking, I would guess it's the schema reload using
pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
issue in pg_dump/pg_restore. And if that is the case -- perhaps just
running all of those in a single transaction would be a better choice?
One could argue it's still not a proper fix, because we'd still have a
huge memory usage etc, but it would then only burn 1 xid instead of
500M...

(I hope I am not missing something but) When I tried to force pg_restore to
use a single transaction (by hacking pg_upgrade's pg_restore call to use
--single-transaction), it too failed owing to being unable to lock so many
objects in a single transaction.

It does seem that --single-transaction is a better idea than fiddling with
the transaction wraparound parameters, since the latter is just going to
put off the onset of trouble. However, we'd have to do something about
the lock consumption. Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?

regards, tom lane

#9 Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#8)
Re: pg_upgrade failing for 200+ million Large Objects

On Mon, Mar 8, 2021 at 5:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robins Tharakan <tharakan@gmail.com> writes:

On Mon, 8 Mar 2021 at 23:34, Magnus Hagander <magnus@hagander.net> wrote:

Without looking, I would guess it's the schema reload using
pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
issue in pg_dump/pg_restore. And if that is the case -- perhaps just
running all of those in a single transaction would be a better choice?
One could argue it's still not a proper fix, because we'd still have a
huge memory usage etc, but it would then only burn 1 xid instead of
500M...

(I hope I am not missing something but) When I tried to force pg_restore to
use a single transaction (by hacking pg_upgrade's pg_restore call to use
--single-transaction), it too failed owing to being unable to lock so many
objects in a single transaction.

It does seem that --single-transaction is a better idea than fiddling with
the transaction wraparound parameters, since the latter is just going to
put off the onset of trouble. However, we'd have to do something about
the lock consumption. Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?

I believe the problem occurs when writing them rather than when
reading them, and I don't think we have a binary upgrade mode there.

We could invent one of course. Another option might be to exclusively
lock pg_largeobject, and just say that if you do that, we don't have
to lock the individual objects (ever)?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#9)
Re: pg_upgrade failing for 200+ million Large Objects

Magnus Hagander <magnus@hagander.net> writes:

On Mon, Mar 8, 2021 at 5:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

It does seem that --single-transaction is a better idea than fiddling with
the transaction wraparound parameters, since the latter is just going to
put off the onset of trouble. However, we'd have to do something about
the lock consumption. Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?

I believe the problem occurs when writing them rather than when
reading them, and I don't think we have a binary upgrade mode there.

You're confusing pg_dump's --binary-upgrade switch (indeed applied on
the dumping side) with the backend's -b switch (IsBinaryUpgrade,
applied on the restoring side).

We could invent one of course. Another option might be to exclusively
lock pg_largeobject, and just say that if you do that, we don't have
to lock the individual objects (ever)?

What was in the back of my mind is that we've sometimes seen complaints
about too many locks needed to dump or restore a database with $MANY
tables; so the large-object case seems like just a special case.

The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure". Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.

So it seems like the path of least resistance is

(a) make pg_upgrade use --single-transaction when calling pg_restore

(b) document (better) how to get around too-many-locks failures.
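A minimal sketch of (b), with an illustrative value and a scratch directory standing in for the target cluster's data directory (both are assumptions, not recommendations from this thread):

```shell
# Append a higher lock limit to the target cluster's postgresql.conf
# before starting pg_upgrade. The value 4096 is purely illustrative;
# a scratch directory stands in for the real data directory here.
NEWCLUSTER=$(mktemp -d)
touch "$NEWCLUSTER/postgresql.conf"
echo "max_locks_per_transaction = 4096" >> "$NEWCLUSTER/postgresql.conf"
# pg_upgrade would then be run against this cluster as usual
grep max_locks_per_transaction "$NEWCLUSTER/postgresql.conf"
```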

regards, tom lane

#11 Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#10)
Re: pg_upgrade failing for 200+ million Large Objects

On Mon, Mar 8, 2021 at 5:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Magnus Hagander <magnus@hagander.net> writes:

On Mon, Mar 8, 2021 at 5:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

It does seem that --single-transaction is a better idea than fiddling with
the transaction wraparound parameters, since the latter is just going to
put off the onset of trouble. However, we'd have to do something about
the lock consumption. Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?

I believe the problem occurs when writing them rather than when
reading them, and I don't think we have a binary upgrade mode there.

You're confusing pg_dump's --binary-upgrade switch (indeed applied on
the dumping side) with the backend's -b switch (IsBinaryUpgrade,
applied on the restoring side).

Ah. Yes, I am.

We could invent one of course. Another option might be to exclusively
lock pg_largeobject, and just say that if you do that, we don't have
to lock the individual objects (ever)?

What was in the back of my mind is that we've sometimes seen complaints
about too many locks needed to dump or restore a database with $MANY
tables; so the large-object case seems like just a special case.

It is -- but I guess it's more likely to have 100M large objects than
to have 100M tables (and the cutoff point comes a lot earlier than
100M). But the fundamental one is the same.

The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure". Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.

So it seems like the path of least resistance is

(a) make pg_upgrade use --single-transaction when calling pg_restore

(b) document (better) how to get around too-many-locks failures.

Agreed. Certainly seems like a better path forward than arbitrarily
pushing the limit on number of transactions which just postpones the
problem.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#12 Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#10)
Re: pg_upgrade failing for 200+ million Large Objects

On 3/8/21 11:58 AM, Tom Lane wrote:

The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure". Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.

So it seems like the path of least resistance is

(a) make pg_upgrade use --single-transaction when calling pg_restore

(b) document (better) how to get around too-many-locks failures.

That would first require fixing how pg_upgrade creates the databases. It
uses "pg_restore --create", which is mutually exclusive with
--single-transaction because we cannot create a database inside of a
transaction. Along the way pg_upgrade also mangles pg_database.datdba
(all databases are owned by postgres after an upgrade; I will submit a
separate patch for that, as I consider it a bug by itself).

All that aside, the entire approach doesn't scale.

In a hacked-up pg_upgrade that does "createdb" first before calling
pg_restore with --single-transaction, I can upgrade 1M large objects with
max_locks_per_transaction = 5300
max_connections = 100
which contradicts the docs. Need to find out where that math went off
the rails, because that config should only have room for 530,000 locks,
not 1M. The same test fails with max_locks_per_transaction = 5200.

But this would mean that one has to modify the postgresql.conf to
something like 530,000 max_locks_per_transaction at 100 max_connections
in order to actually run a successful upgrade of 100M large objects.
This config requires 26GB of memory just for locks. Add to that the
memory pg_restore needs to load the entire TOC before even restoring a
single object.
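A back-of-envelope check of that figure, assuming on the order of 500 bytes of shared lock-table state per lock slot (the per-slot size is an assumption, not a number from this thread):

```shell
# Shared lock-table slots = max_locks_per_transaction * max_connections;
# the memory estimate assumes ~500 bytes of shared state per slot.
slots=$(( 530000 * 100 ))
bytes_per_slot=500
gib=$(( slots * bytes_per_slot / 1024 / 1024 / 1024 ))
echo "~${gib} GiB for the lock table"   # ~24 GiB, in line with the 26GB above
```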

Not going to work. But tests are still ongoing ...

Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services

#13 Andrew Dunstan
andrew@dunslane.net
In reply to: Jan Wieck (#12)
Re: pg_upgrade failing for 200+ million Large Objects

On 3/20/21 12:39 AM, Jan Wieck wrote:

On 3/8/21 11:58 AM, Tom Lane wrote:

The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure".  Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.

So it seems like the path of least resistance is

(a) make pg_upgrade use --single-transaction when calling pg_restore

(b) document (better) how to get around too-many-locks failures.

That would first require to fix how pg_upgrade is creating the
databases. It uses "pg_restore --create", which is mutually exclusive
with --single-transaction because we cannot create a database inside
of a transaction. On the way pg_upgrade also mangles the
pg_database.datdba (all databases are owned by postgres after an
upgrade; will submit a separate patch for that as I consider that a
bug by itself).

All that aside, the entire approach doesn't scale.

In a hacked-up pg_upgrade that does "createdb" first before calling
pg_restore with --single-transaction, I can upgrade 1M large objects with
    max_locks_per_transaction = 5300
    max_connections = 100
which contradicts the docs. Need to find out where that math went off
the rails, because that config should only have room for 530,000 locks,
not 1M. The same test fails with max_locks_per_transaction = 5200.

But this would mean that one has to modify the postgresql.conf to
something like 530,000 max_locks_per_transaction at 100
max_connections in order to actually run a successful upgrade of 100M
large objects. This config requires 26GB of memory just for locks. Add
to that the memory pg_restore needs to load the entire TOC before even
restoring a single object.

Not going to work. But tests are still ongoing ...

I thought Tom's suggestion upthread:

Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?

was interesting. Could we do that on the restore side? After all, what
are we locking against in binary upgrade mode?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#12)
Re: pg_upgrade failing for 200+ million Large Objects

Jan Wieck <jan@wi3ck.info> writes:

On 3/8/21 11:58 AM, Tom Lane wrote:

So it seems like the path of least resistance is
(a) make pg_upgrade use --single-transaction when calling pg_restore
(b) document (better) how to get around too-many-locks failures.

That would first require to fix how pg_upgrade is creating the
databases. It uses "pg_restore --create", which is mutually exclusive
with --single-transaction because we cannot create a database inside of
a transaction.

Ugh.

All that aside, the entire approach doesn't scale.

Yeah, agreed. When we gave large objects individual ownership and ACL
info, it was argued that pg_dump could afford to treat each one as a
separate TOC entry because "you wouldn't have that many of them, if
they're large". The limits of that approach were obvious even at the
time, and I think now we're starting to see people for whom it really
doesn't work.

I wonder if pg_dump could improve matters cheaply by aggregating the
large objects by owner and ACL contents. That is, do

select distinct lomowner, lomacl from pg_largeobject_metadata;

and make just *one* BLOB TOC entry for each result. Then dump out
all the matching blobs under that heading.

A possible objection is that it'd reduce the ability to restore blobs
selectively, so maybe we'd need to make it optional.

Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks. Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

regards, tom lane

#15 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#14)
Re: pg_upgrade failing for 200+ million Large Objects

On Sat, Mar 20, 2021 at 11:23:19AM -0400, Tom Lane wrote:

I wonder if pg_dump could improve matters cheaply by aggregating the
large objects by owner and ACL contents. That is, do

select distinct lomowner, lomacl from pg_largeobject_metadata;

and make just *one* BLOB TOC entry for each result. Then dump out
all the matching blobs under that heading.

A possible objection is that it'd reduce the ability to restore blobs
selectively, so maybe we'd need to make it optional.

Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks. Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

Well, in pg_upgrade mode you can, since there are no other cluster
users, but you might be asking for general pg_dump usage.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#15)
Re: pg_upgrade failing for 200+ million Large Objects

Bruce Momjian <bruce@momjian.us> writes:

On Sat, Mar 20, 2021 at 11:23:19AM -0400, Tom Lane wrote:

Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks. Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

Well, in pg_upgrade mode you can, since there are no other cluster
users, but you might be asking for general pg_dump usage.

Yeah, this problem doesn't only affect pg_upgrade scenarios, so it'd
really be better to find a way that isn't dependent on binary-upgrade
mode.

regards, tom lane

#17Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#14)
Re: pg_upgrade failing for 200+ million Large Objects

On 3/20/21 11:23 AM, Tom Lane wrote:

Jan Wieck <jan@wi3ck.info> writes:

All that aside, the entire approach doesn't scale.

Yeah, agreed. When we gave large objects individual ownership and ACL
info, it was argued that pg_dump could afford to treat each one as a
separate TOC entry because "you wouldn't have that many of them, if
they're large". The limits of that approach were obvious even at the
time, and I think now we're starting to see people for whom it really
doesn't work.

It actually looks more like some users have millions of "small objects".
I am still wondering where that is coming from and why they are abusing
LOs in that way, but that is more out of curiosity. Fact is that they
are out there and that they cannot upgrade from their 9.5 databases,
which are now past EOL.

I wonder if pg_dump could improve matters cheaply by aggregating the
large objects by owner and ACL contents. That is, do

select distinct lomowner, lomacl from pg_largeobject_metadata;

and make just *one* BLOB TOC entry for each result. Then dump out
all the matching blobs under that heading.

What I am currently experimenting with is moving the BLOB TOC entries
into the parallel data phase of pg_restore "when doing binary upgrade".
It seems to scale nicely with the number of cores in the system. In
addition to that I have options for pg_upgrade and pg_restore that cause
the restore to batch them into transactions, like 10,000 objects at a
time. There was a separate thread for that but I guess it is better to
keep it all together here now.
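[Editorial sketch] The batching idea amounts to wrapping every N large-object restore commands in one transaction, so 10,000 objects consume one transaction ID instead of 10,000. A minimal illustration (the SQL emitted here is a stand-in; the actual patch does this inside pg_restore):

```python
def batched_restore_sql(lo_oids, batch_size=10_000):
    """Yield SQL scripts, each restoring up to batch_size large objects
    inside a single explicit transaction."""
    for start in range(0, len(lo_oids), batch_size):
        batch = lo_oids[start:start + batch_size]
        stmts = ["BEGIN;"]
        # One lo_create() per preserved OID, all under the same BEGIN/COMMIT.
        stmts += [f"SELECT lo_create({oid});" for oid in batch]
        stmts.append("COMMIT;")
        yield "\n".join(stmts)

# 25,000 OIDs -> 3 transactions (10,000 + 10,000 + 5,000)
scripts = list(batched_restore_sql(list(range(25_000))))
```

For 500 million blobs this reduces XID consumption from 500 million to 50,000, which is what removes the pressure on the 2-billion XID limit that the original pg_resetxlog workaround targeted.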

A possible objection is that it'd reduce the ability to restore blobs
selectively, so maybe we'd need to make it optional.

I fully intend to make all this into new "options". I am afraid that
there is no one-size-fits-all solution here.

Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks. Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

I'm not very fond of the idea going lockless when at the same time
trying to parallelize the restore phase. That can lead to really nasty
race conditions. For now I'm aiming at batches in transactions.

Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services

#18Andrew Dunstan
andrew@dunslane.net
In reply to: Jan Wieck (#17)
Re: pg_upgrade failing for 200+ million Large Objects

On 3/20/21 12:55 PM, Jan Wieck wrote:

On 3/20/21 11:23 AM, Tom Lane wrote:

Jan Wieck <jan@wi3ck.info> writes:

All that aside, the entire approach doesn't scale.

Yeah, agreed.  When we gave large objects individual ownership and ACL
info, it was argued that pg_dump could afford to treat each one as a
separate TOC entry because "you wouldn't have that many of them, if
they're large".  The limits of that approach were obvious even at the
time, and I think now we're starting to see people for whom it really
doesn't work.

It actually looks more like some users have millions of "small
objects". I am still wondering where that is coming from and why they
are abusing LOs in that way, but that is more out of curiosity. Fact
is that they are out there and that they cannot upgrade from their 9.5
databases, which are now past EOL.

One possible (probable?) source is the JDBC driver, which currently
treats all Blobs (and Clobs, for that matter) as LOs. I'm working on
improving that some: <https://github.com/pgjdbc/pgjdbc/pull/2093>

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#19Jan Wieck
JanWieck@Yahoo.com
In reply to: Jan Wieck (#12)
Fix pg_upgrade to preserve datdba (was: Re: pg_upgrade failing for 200+ million Large Objects)

On 3/20/21 12:39 AM, Jan Wieck wrote:

On the way pg_upgrade also mangles the pg_database.datdba
(all databases are owned by postgres after an upgrade; will submit a
separate patch for that as I consider that a bug by itself).

Patch attached.

Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services

Attachments:

pg_upgrade-preserve-datdba.v1.diff (text/x-patch; +37 -12)
#20Jan Wieck
JanWieck@Yahoo.com
In reply to: Andrew Dunstan (#18)
Re: pg_upgrade failing for 200+ million Large Objects

On 3/21/21 7:47 AM, Andrew Dunstan wrote:

One possible (probable?) source is the JDBC driver, which currently
treats all Blobs (and Clobs, for that matter) as LOs. I'm working on
improving that some: <https://github.com/pgjdbc/pgjdbc/pull/2093>

You mean the user is using OID columns pointing to large objects and the
JDBC driver is mapping those for streaming operations?

Yeah, that would explain a lot.

Thanks, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#19)
#22Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#21)
#23Jan Wieck
JanWieck@Yahoo.com
In reply to: Jan Wieck (#22)
#24Andrew Dunstan
andrew@dunslane.net
In reply to: Jan Wieck (#20)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#23)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#26)
#28Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#26)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#28)
#30Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#29)
#31Zhihong Yu
zyu@yugabyte.com
In reply to: Andrew Dunstan (#24)
#32Jan Wieck
JanWieck@Yahoo.com
In reply to: Zhihong Yu (#31)
#33Jan Wieck
JanWieck@Yahoo.com
In reply to: Jan Wieck (#32)
#34Bruce Momjian
bruce@momjian.us
In reply to: Jan Wieck (#33)
#35Jan Wieck
JanWieck@Yahoo.com
In reply to: Bruce Momjian (#34)
#36Bruce Momjian
bruce@momjian.us
In reply to: Jan Wieck (#35)
#37Jan Wieck
JanWieck@Yahoo.com
In reply to: Bruce Momjian (#36)
#38Bruce Momjian
bruce@momjian.us
In reply to: Jan Wieck (#37)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#37)
#40Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#39)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#40)
#42Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#41)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#42)
#44Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#43)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#44)
#46Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#45)
#47Jan Wieck
JanWieck@Yahoo.com
In reply to: Jan Wieck (#46)
#48Justin Pryzby
pryzby@telsasoft.com
In reply to: Jan Wieck (#47)
#49Nathan Bossart
nathandbossart@gmail.com
In reply to: Jan Wieck (#47)
#50Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Nathan Bossart (#49)
#51Nathan Bossart
nathandbossart@gmail.com
In reply to: Jacob Champion (#50)
#52Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Nathan Bossart (#51)
#53Nathan Bossart
nathandbossart@gmail.com
In reply to: Jacob Champion (#52)
#54Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#53)
#55Kumar, Sachin
ssetiya@amazon.com
In reply to: Jacob Champion (#52)
#56Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kumar, Sachin (#55)
#57Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#56)
#58Kumar, Sachin
ssetiya@amazon.com
In reply to: Tom Lane (#56)
#59Kumar, Sachin
ssetiya@amazon.com
In reply to: Tom Lane (#56)
#60Kumar, Sachin
ssetiya@amazon.com
In reply to: Jan Wieck (#40)
#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kumar, Sachin (#60)
#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#61)
#63Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#62)
#64Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#63)
#65Robins Tharakan
tharakan@gmail.com
In reply to: Tom Lane (#62)
#66Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robins Tharakan (#65)
#67Robins Tharakan
tharakan@gmail.com
In reply to: Tom Lane (#66)
#68Kumar, Sachin
ssetiya@amazon.com
In reply to: Tom Lane (#61)
#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kumar, Sachin (#68)
#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#69)
#71Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#62)
#72Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#70)
#73Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#71)
#74Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#71)
#75vignesh C
vignesh21@gmail.com
In reply to: Kumar, Sachin (#68)
#76Tom Lane
tgl@sss.pgh.pa.us
In reply to: vignesh C (#75)
#77Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#76)
#78Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tom Lane (#77)
#79Tom Lane
tgl@sss.pgh.pa.us
In reply to: Laurenz Albe (#78)
#80Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tom Lane (#79)
#81Michael Banck
michael.banck@credativ.de
In reply to: Tom Lane (#79)
#82Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Michael Banck (#81)
#83Michael Banck
michael.banck@credativ.de
In reply to: Laurenz Albe (#82)
#84Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Banck (#81)
#85Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#84)
#86Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#85)
#87Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#86)
#88Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#87)
#89Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#87)
#90Alexander Korotkov
aekorotkov@gmail.com
In reply to: Justin Pryzby (#89)
#91Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#90)
#92Justin Pryzby
pryzby@telsasoft.com
In reply to: Alexander Korotkov (#90)
#93Alexander Korotkov
aekorotkov@gmail.com
In reply to: Justin Pryzby (#92)
#94Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#92)
#95Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#94)
#96Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#95)
#97Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#96)
#98Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#97)
#99Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#98)
#100Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#99)