RE: pg_upgrade failing for 200+ million Large Objects
Hi all,
Attached is a proof-of-concept patch that allows pg_upgrade to complete when
the instance has millions of Large Objects.
It would be great if someone could take a look and see whether this patch is
headed in the right direction. There are some pending tasks (such as
documentation / pg_resetxlog vs pg_resetwal related changes) but for now, the
patch removes a stalemate: if a Postgres instance has a large number
(accurately speaking, 146+ million) of Large Objects, pg_upgrade fails. This
is easily reproducible and, besides deleting Large Objects before the
upgrade, there is no other (apparent) way for pg_upgrade to complete.
The patch (attached):
- Applies cleanly on REL9_6_STABLE -
c7a4fc3dd001646d5938687ad59ab84545d5d043
- 'make check' passes
- Allows the user to provide a constant via the pg_upgrade command line that
overrides the 2 billion constant in pg_resetxlog [1], thereby widening the
window of Transaction IDs available for pg_upgrade to complete.
Sample argument for pg_upgrade:
$ /opt/postgres/96/bin/pg_upgrade --max-limit-xid 1000000000 --old-bindir
...
With this patch, pg_upgrade is now able to upgrade a v9.5 cluster with 500
million Large Objects successfully to v9.6 - some stats below:
Source Postgres - v9.5.24
Target Version - v9.6.21
Large Object Count: 500 Million Large Objects
Machine - r5.4xlarge (16vCPU / 128GB RAM + 1TB swap)
Memory used during pg_upgrade - ~350GB
Time taken - 25+ hrs (tested twice). All LOs are processed sequentially, so
there is scope for optimization.
Although counter-intuitive, for this test all Large Objects were small
(essentially the idea was to test the count); they were created using
something like this:
seq 1 50000 | xargs -n 1 -i -P 10 /opt/postgres/95/bin/psql -c "select
lo_from_bytea(0, '\xffffff00') from generate_series(1,10000);" > /dev/null
I am not married to the patch (especially the argument name) but ideally I'd
prefer a way to get this upgrade going without a patch. For now, I am unable
to find any other way to upgrade a v9.5 Postgres database, now facing
End-of-Life, in this scenario.
Reference:
1) 2 Billion constant -
https://github.com/postgres/postgres/blob/ca3b37487be333a1d241dab1bbdd17a211a88f43/src/bin/pg_resetwal/pg_resetwal.c#L444
Thanks,
Robins Tharakan
-----Original Message-----
From: Tharakan, Robins
Sent: Wednesday, 3 March 2021 10:36 PM
To: pgsql-hackers@postgresql.org
Subject: pg_upgrade failing for 200+ million Large Objects

Hi,
While reviewing a failed upgrade from Postgres v9.5 (to v9.6) I saw that the
instance had ~200 million (in-use) Large Objects. I was able to reproduce
this on a test instance, which also fails with a similar error:

pg_restore: executing BLOB 4980622
pg_restore: WARNING: database with OID 0 must be vacuumed within 1000001 transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions.
pg_restore: executing BLOB 4980623
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2565; 2613 4980623 BLOB 4980623 postgres
pg_restore: [archiver (db)] could not execute query: ERROR: database is not accepting commands to avoid wraparound data loss in database with OID 0
HINT: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.
Command was: SELECT pg_catalog.lo_create('4980623');

To rule out the obvious possibilities: these Large Objects are still in use
(so vacuumlo wouldn't help), giving more system resources doesn't help,
moving Large Objects to another database doesn't help (since this is a
cluster-wide restriction), the source instance is nowhere close to
wraparound, and lastly the most recent minor versions don't help either (I
tried compiling 9_6_STABLE and upgrading a database with 150 million LOs,
and still encountered the same issue).

Do let me know if I am missing something obvious, but it appears that this
is happening owing to two things coming together:

* Each Large Object is migrated in its own transaction during pg_upgrade
* pg_resetxlog appears to narrow the window of XIDs available for pg_upgrade
to ~146 million (2^31, minus the 1 million XID wraparound margin, minus the
2 billion hard-coded constant - see [1]) in what appears to be an attempt to
force an autovacuum wraparound run soon after the upgrade completes.

Such an XID-based restriction is limiting for an instance that is actively
using a lot of Large Objects. Besides forcing the autovacuum wraparound
logic to kick in soon after the upgrade, I am unclear what else it aims to
do. What it does seem to do is block major version upgrades when the
pre-upgrade instance has more than ~146 million Large Objects (half that, if
each LO additionally requires an ALTER LARGE OBJECT ... OWNER TO during
pg_restore).

For the long term these ideas came to mind, although I am unsure which are
low-hanging fruit and which outright impossible - for example, clubbing
multiple objects into a transaction [2], forcing an autovacuum post upgrade
(and thus removing this limitation altogether), or seeing if "pg_resetxlog
-x" (from within pg_upgrade) could help in some way to work around this
limitation.

Is there a short-term recommendation for this scenario?
I can understand that a high number of small-sized objects is not a great
way to use pg_largeobject (since Large Objects were intended for, well,
'large objects'), but this magic number of Large Objects is now a stalemate
at this point (with respect to the v9.5 EOL).

Reference:
1) pg_resetxlog -
https://github.com/postgres/postgres/blob/ca3b37487be333a1d241dab1bbdd17a211a88f43/src/bin/pg_resetwal/pg_resetwal.c#L444
2) /messages/by-id/ed7d86a1-b907-4f53-9f6e-63482d2f2bac%40manitou-mail.org
-
Thanks
Robins Tharakan
On 7 Mar 2021, at 09:43, Tharakan, Robins <tharar@amazon.com> wrote:
The patch (attached):
- Applies cleanly on REL9_6_STABLE -
c7a4fc3dd001646d5938687ad59ab84545d5d043
Did you target 9.6 because that's where you want to upgrade to, or is this not
a problem on HEAD? If it's still a problem on HEAD you should probably submit
the patch against that. You probably also want to add it to the next
commitfest to make sure it's not forgotten about: https://commitfest.postgresql.org/33/
I am not married to the patch (especially the argument name) but ideally I'd
prefer a way to get this upgrade going without a patch. For now, I am unable
to find any other way to upgrade a v9.5 Postgres database in this scenario,
facing End-of-Life.
It's obviously not my call to make in any shape or form, but this doesn't
really seem to fall under what is generally backported into a stable release?
--
Daniel Gustafsson https://vmware.com/
On 07.03.21 09:43, Tharakan, Robins wrote:
Attached is a proof-of-concept patch that allows Postgres to perform
pg_upgrade if the instance has Millions of objects.

It would be great if someone could take a look and see if this patch is in
the right direction. There are some pending tasks (such as documentation /
pg_resetxlog vs pg_resetwal related changes) but for now, the patch helps
remove a stalemate where if a Postgres instance has a large number
(accurately speaking 146+ Million) of Large Objects, pg_upgrade fails. This
is easily reproducible and besides deleting Large Objects before upgrade,
there is no other (apparent) way for pg_upgrade to complete.

The patch (attached):
- Applies cleanly on REL9_6_STABLE -
c7a4fc3dd001646d5938687ad59ab84545d5d043
- 'make check' passes
- Allows the user to provide a constant via pg_upgrade command-line, that
overrides the 2 billion constant in pg_resetxlog [1] thereby increasing the
(window of) Transaction IDs available for pg_upgrade to complete.
Could you explain what your analysis of the problem is and why this
patch (might) fix it?
Right now, all I see here is, pass a big number via a command-line
option and hope it works.
Thanks Daniel for the input / next-steps.
I see that 'master' too has this same magic constant [1] and so I expect it
to have similar restrictions, although I haven't tested this yet.
I do agree that the need then is to re-submit a patch that works with
'master'. (I am travelling for the next few days but) unless discussions go
tangential, I expect to follow up with an updated patch by the end of this
week and create a commitfest entry while at it.
Reference:
1)
https://github.com/postgres/postgres/blob/master/src/bin/pg_resetwal/pg_resetwal.c#L444
-
Robins Tharakan
-----Original Message-----
From: Daniel Gustafsson <daniel@yesql.se>
Sent: Monday, 8 March 2021 9:42 AM
To: Tharakan, Robins <tharar@amazon.com>
Cc: pgsql-hackers@postgresql.org
Subject: RE: [EXTERNAL] pg_upgrade failing for 200+ million Large Objects
Thanks Peter.
The original email [1] had some more context that somehow didn't get
associated with this recent email. Apologies for any confusion.
In short, pg_resetxlog (and pg_resetwal) employs a magic constant [2] (in
both v9.6 and master) which seems to have been selected to force an
aggressive autovacuum as soon as the upgrade completes. Although that works
as planned, it narrows the window of Transaction IDs available for the
upgrade (before XID wraparound protection kicks in and aborts the upgrade)
to ~146 million.
Reducing this magic constant allows a larger XID window, which is what the
patch is trying to do. With the patch, I was able to upgrade a cluster with
500m Large Objects successfully (which otherwise reliably fails). In the
original email [1] I had also listed a few other possible workarounds, but
was unsure which would be a good direction to start working on - thus this
patch, to make a start.
Reference:
1) /messages/by-id/12601596dbbc4c01b86b4ac4d2bd4d48@EX13D05UWC001.ant.amazon.com
2) https://github.com/postgres/postgres/blob/master/src/bin/pg_resetwal/pg_resetwal.c#L444
-
robins | tharar@ | syd12
-----Original Message-----
From: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Sent: Monday, 8 March 2021 9:25 PM
To: Tharakan, Robins <tharar@amazon.com>; pgsql-hackers@postgresql.org
Subject: [EXTERNAL] [UNVERIFIED SENDER] Re: pg_upgrade failing for 200+ million Large Objects
On Mon, Mar 8, 2021 at 12:02 PM Tharakan, Robins <tharar@amazon.com> wrote:
Thanks Peter.
The original email [1] had some more context that somehow didn't get
associated with this recent email. Apologies for any confusion.
Please take a look at your email configuration -- all your emails are
lacking both References and In-reply-to headers, so every email starts
a new thread, both for each reader and in the archives. It seems quite
broken. It makes it very hard to follow.
In short, pg_resetxlog (and pg_resetwal) employs a magic constant [2] (for
both v9.6 as well as master) which seems to have been selected to force an
aggressive autovacuum as soon as the upgrade completes. Although that works
as planned, it narrows the window of Transaction IDs available for the
upgrade (before which XID wraparound protection kicks and aborts the
upgrade) to 146 Million.

Reducing this magic constant allows a larger XID window, which is what the
patch is trying to do. With the patch, I was able to upgrade a cluster with
500m Large Objects successfully (which otherwise reliably fails). In the
original email [1] I had also listed a few other possible workarounds, but
was unsure which would be a good direction to start working on.... thus this
patch to make a start.
This still seems to just fix the symptoms and not the actual problem.
What part of the pg_upgrade process is it that actually burns through
that many transactions?
Without looking, I would guess it's the schema reload using
pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
issue in pg_dump/pg_restore. And if that is the case -- perhaps just
running all of those in a single transaction would be a better choice?
One could argue it's still not a proper fix, because we'd still have a
huge memory usage etc, but it would then only burn 1 xid instead of
500M...
AFAICT at a quick check, pg_dump in binary upgrade mode emits one
lo_create() and one ALTER ... OWNER TO for each large object - so with
500M large objects that would be a billion statements, and thus a
billion xids. And without checking, I'm fairly sure it doesn't load in
a single transaction...
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Hi Magnus,
On Mon, 8 Mar 2021 at 23:34, Magnus Hagander <magnus@hagander.net> wrote:
AFAICT at a quick check, pg_dump in binary upgrade mode emits one
lo_create() and one ALTER ... OWNER TO for each large object - so with
500M large objects that would be a billion statements, and thus a
billion xids. And without checking, I'm fairly sure it doesn't load in
a single transaction...
Your assumptions are pretty much correct.
The issue isn't with pg_upgrade itself. During pg_restore, each Large
Object (and separately each ALTER LARGE OBJECT OWNER TO) consumes an XID.
For background, that's the reason the v9.5 production instance I was
reviewing was unable to process more than 73 million Large Objects, since
each object required a CREATE + ALTER. (To clarify: 73 million = (2^31 - 2
billion magic constant - 1 million wraparound protection) / 2.)
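The arithmetic behind the ~146 million and ~73 million figures quoted in this
thread can be sketched quickly (constants as described above; this is just a
check of the math, not code from the patch):

```python
# XID window left for pg_upgrade after pg_resetxlog sets the counter.
XID_SPACE = 2**31                 # total 32-bit XID space used here: 2147483648
MAGIC_CONSTANT = 2_000_000_000    # hard-coded constant in pg_resetwal/pg_resetxlog
WRAPAROUND_MARGIN = 1_000_000     # safety margin before wraparound protection

window = XID_SPACE - MAGIC_CONSTANT - WRAPAROUND_MARGIN
print(window)       # 146483648 -> the "~146 million" figure
print(window // 2)  # 73241824  -> "~73 million" when each LO needs CREATE + ALTER
```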
Without looking, I would guess it's the schema reload using
pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
issue in pg_dump/pg_restore. And if that is the case -- perhaps just
running all of those in a single transaction would be a better choice?
One could argue it's still not a proper fix, because we'd still have a
huge memory usage etc, but it would then only burn 1 xid instead of
500M...
(I hope I am not missing something but) When I tried to force pg_restore to
use a single transaction (by hacking pg_upgrade's pg_restore call to use
--single-transaction), it too failed owing to being unable to lock so many
objects in a single transaction.
This still seems to just fix the symptoms and not the actual problem.
I agree that the patch doesn't address the root cause, but it did get the
upgrade to complete on a test setup. Do you think that batching multiple
Large Objects into a single transaction (instead of all objects in one),
while allowing the caller to size that batch via the command line, would be
a good / acceptable idea here?
Please take a look at your email configuration -- all your emails are
lacking both References and In-reply-to headers.
Thanks for highlighting the cause here. Hopefully switching mail clients
would help.
-
Robins Tharakan
Robins Tharakan <tharakan@gmail.com> writes:
On Mon, 8 Mar 2021 at 23:34, Magnus Hagander <magnus@hagander.net> wrote:
Without looking, I would guess it's the schema reload using
pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
issue in pg_dump/pg_restore. And if that is the case -- perhaps just
running all of those in a single transaction would be a better choice?
One could argue it's still not a proper fix, because we'd still have a
huge memory usage etc, but it would then only burn 1 xid instead of
500M...
(I hope I am not missing something but) When I tried to force pg_restore to
use a single transaction (by hacking pg_upgrade's pg_restore call to use
--single-transaction), it too failed owing to being unable to lock so many
objects in a single transaction.
It does seem that --single-transaction is a better idea than fiddling with
the transaction wraparound parameters, since the latter is just going to
put off the onset of trouble. However, we'd have to do something about
the lock consumption. Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?
regards, tom lane
On Mon, Mar 8, 2021 at 5:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robins Tharakan <tharakan@gmail.com> writes:
On Mon, 8 Mar 2021 at 23:34, Magnus Hagander <magnus@hagander.net> wrote:
Without looking, I would guess it's the schema reload using
pg_dump/pg_restore and not actually pg_upgrade itself. This is a known
issue in pg_dump/pg_restore. And if that is the case -- perhaps just
running all of those in a single transaction would be a better choice?
One could argue it's still not a proper fix, because we'd still have a
huge memory usage etc, but it would then only burn 1 xid instead of
500M...

(I hope I am not missing something but) When I tried to force pg_restore to
use a single transaction (by hacking pg_upgrade's pg_restore call to use
--single-transaction), it too failed owing to being unable to lock so many
objects in a single transaction.

It does seem that --single-transaction is a better idea than fiddling with
the transaction wraparound parameters, since the latter is just going to
put off the onset of trouble. However, we'd have to do something about
the lock consumption. Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?
I believe the problem occurs when writing them rather than when
reading them, and I don't think we have a binary upgrade mode there.
We could invent one of course. Another option might be to exclusively
lock pg_largeobject, and just say that if you do that, we don't have
to lock the individual objects (ever)?
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes:
On Mon, Mar 8, 2021 at 5:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
It does seem that --single-transaction is a better idea than fiddling with
the transaction wraparound parameters, since the latter is just going to
put off the onset of trouble. However, we'd have to do something about
the lock consumption. Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?
I believe the problem occurs when writing them rather than when
reading them, and I don't think we have a binary upgrade mode there.
You're confusing pg_dump's --binary-upgrade switch (indeed applied on
the dumping side) with the backend's -b switch (IsBinaryUpgrade,
applied on the restoring side).
We could invent one of course. Another option might be to exclusively
lock pg_largeobject, and just say that if you do that, we don't have
to lock the individual objects (ever)?
What was in the back of my mind is that we've sometimes seen complaints
about too many locks needed to dump or restore a database with $MANY
tables; so the large-object case seems like just a special case.
The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure". Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.
So it seems like the path of least resistance is
(a) make pg_upgrade use --single-transaction when calling pg_restore
(b) document (better) how to get around too-many-locks failures.
regards, tom lane
On Mon, Mar 8, 2021 at 5:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
On Mon, Mar 8, 2021 at 5:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
It does seem that --single-transaction is a better idea than fiddling with
the transaction wraparound parameters, since the latter is just going to
put off the onset of trouble. However, we'd have to do something about
the lock consumption. Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?

I believe the problem occurs when writing them rather than when
reading them, and I don't think we have a binary upgrade mode there.

You're confusing pg_dump's --binary-upgrade switch (indeed applied on
the dumping side) with the backend's -b switch (IsBinaryUpgrade,
applied on the restoring side).
Ah. Yes, I am.
We could invent one of course. Another option might be to exclusively
lock pg_largeobject, and just say that if you do that, we don't have
to lock the individual objects (ever)?

What was in the back of my mind is that we've sometimes seen complaints
about too many locks needed to dump or restore a database with $MANY
tables; so the large-object case seems like just a special case.
It is -- but I guess it's more likely to have 100M large objects than
to have 100M tables (and the cutoff point comes a lot earlier than
100M). But the fundamental issue is the same.
The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure". Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.

So it seems like the path of least resistance is
(a) make pg_upgrade use --single-transaction when calling pg_restore
(b) document (better) how to get around too-many-locks failures.
Agreed. Certainly seems like a better path forward than arbitrarily
pushing the limit on number of transactions which just postpones the
problem.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On 3/8/21 11:58 AM, Tom Lane wrote:
The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure". Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.

So it seems like the path of least resistance is
(a) make pg_upgrade use --single-transaction when calling pg_restore
(b) document (better) how to get around too-many-locks failures.
That would first require fixing how pg_upgrade creates the databases. It
uses "pg_restore --create", which is mutually exclusive with
--single-transaction because we cannot create a database inside of a
transaction. Along the way pg_upgrade also mangles pg_database.datdba (all
databases are owned by postgres after an upgrade; I will submit a separate
patch for that, as I consider it a bug by itself).
All that aside, the entire approach doesn't scale.
In a hacked-up pg_upgrade that runs "createdb" first before calling
pg_restore with --single-transaction, I can upgrade 1M large objects with

max_locks_per_transaction = 5300
max_connections = 100

which contradicts the docs. I need to find out where that math went off the
rails, because that configuration should only have room for 530,000 locks,
not 1M. The same test fails with max_locks_per_transaction = 5200.
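For reference, the documented sizing of the shared lock table is roughly
max_locks_per_transaction * (max_connections + max_prepared_transactions),
which is where the 530,000 figure above comes from. A quick sketch of that
formula, assuming no prepared transactions:

```python
# Approximate shared lock-table capacity per the PostgreSQL docs:
# max_locks_per_transaction * (max_connections + max_prepared_transactions)
max_locks_per_transaction = 5300
max_connections = 100
max_prepared_transactions = 0

capacity = max_locks_per_transaction * (max_connections + max_prepared_transactions)
print(capacity)  # 530000 -- yet the test run above managed to lock ~1M objects
```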
But this would mean that one has to raise max_locks_per_transaction in
postgresql.conf to something like 530,000 (at 100 max_connections) in order
to actually run a successful upgrade of 100M large objects.
This config requires 26GB of memory just for locks. Add to that the
memory pg_restore needs to load the entire TOC before even restoring a
single object.
Not going to work. But tests are still ongoing ...
Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
On 3/20/21 12:39 AM, Jan Wieck wrote:
On 3/8/21 11:58 AM, Tom Lane wrote:
The answer up to now has been "raise max_locks_per_transaction enough
so you don't see the failure". Having now consumed a little more
caffeine, I remember that that works in pg_upgrade scenarios too,
since the user can fiddle with the target cluster's postgresql.conf
before starting pg_upgrade.So it seems like the path of least resistance is
(a) make pg_upgrade use --single-transaction when calling pg_restore
(b) document (better) how to get around too-many-locks failures.
That would first require to fix how pg_upgrade is creating the
databases. It uses "pg_restore --create", which is mutually exclusive
with --single-transaction because we cannot create a database inside
of a transaction. On the way pg_upgrade also mangles the
pg_database.datdba (all databases are owned by postgres after an
upgrade; will submit a separate patch for that as I consider that a
bug by itself).

All that aside, the entire approach doesn't scale.
In a hacked up pg_upgrade that does "createdb" first before calling
pg_upgrade with --single-transaction. I can upgrade 1M large objects with
max_locks_per_transaction = 5300
max_connections=100
which contradicts the docs. Need to find out where that math went off
the rails because that config should only have room for 530,000 locks,
not 1M. The same test fails with max_locks_per_transaction = 5200.

But this would mean that one has to modify the postgresql.conf to
something like 530,000 max_locks_per_transaction at 100
max_connections in order to actually run a successful upgrade of 100M
large objects. This config requires 26GB of memory just for locks. Add
to that the memory pg_restore needs to load the entire TOC before even
restoring a single object.

Not going to work. But tests are still ongoing ...
I thought Tom's suggestion upthread:
Would it be sane to have the backend not bother to
take any locks in binary-upgrade mode?
was interesting. Could we do that on the restore side? After all, what
are we locking against in binary upgrade mode?
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Jan Wieck <jan@wi3ck.info> writes:
On 3/8/21 11:58 AM, Tom Lane wrote:
So it seems like the path of least resistance is
(a) make pg_upgrade use --single-transaction when calling pg_restore
(b) document (better) how to get around too-many-locks failures.
That would first require to fix how pg_upgrade is creating the
databases. It uses "pg_restore --create", which is mutually exclusive
with --single-transaction because we cannot create a database inside of
a transaction.
Ugh.
All that aside, the entire approach doesn't scale.
Yeah, agreed. When we gave large objects individual ownership and ACL
info, it was argued that pg_dump could afford to treat each one as a
separate TOC entry because "you wouldn't have that many of them, if
they're large". The limits of that approach were obvious even at the
time, and I think now we're starting to see people for whom it really
doesn't work.
I wonder if pg_dump could improve matters cheaply by aggregating the
large objects by owner and ACL contents. That is, do
select distinct lomowner, lomacl from pg_largeobject_metadata;
and make just *one* BLOB TOC entry for each result. Then dump out
all the matching blobs under that heading.
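A rough sketch of that aggregation idea, with made-up sample rows standing in
for pg_largeobject_metadata (the column values are hypothetical): group blob
OIDs by their (owner, acl) pair and emit one TOC entry per group instead of
one per blob:

```python
from collections import defaultdict

# Made-up rows mimicking (oid, lomowner, lomacl) from pg_largeobject_metadata.
rows = [
    (1001, "alice", None),
    (1002, "alice", None),
    (1003, "bob", "{bob=rw/bob}"),
]

# One TOC entry per distinct (owner, acl); all matching OIDs dumped under it.
toc = defaultdict(list)
for oid, owner, acl in rows:
    toc[(owner, acl)].append(oid)

print(len(toc))  # 2 TOC entries instead of 3
```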
A possible objection is that it'd reduce the ability to restore blobs
selectively, so maybe we'd need to make it optional.
Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks. Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?
regards, tom lane
On Sat, Mar 20, 2021 at 11:23:19AM -0400, Tom Lane wrote:
I wonder if pg_dump could improve matters cheaply by aggregating the
large objects by owner and ACL contents. That is, do

select distinct lomowner, lomacl from pg_largeobject_metadata;

and make just *one* BLOB TOC entry for each result. Then dump out
all the matching blobs under that heading.

A possible objection is that it'd reduce the ability to restore blobs
selectively, so maybe we'd need to make it optional.

Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks. Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?
Well, in pg_upgrade mode you can, since there are no other cluster
users, but you might be asking for general pg_dump usage.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
Bruce Momjian <bruce@momjian.us> writes:
On Sat, Mar 20, 2021 at 11:23:19AM -0400, Tom Lane wrote:
Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks. Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?
Well, in pg_upgrade mode you can, since there are no other cluster
users, but you might be asking for general pg_dump usage.
Yeah, this problem doesn't only affect pg_upgrade scenarios, so it'd
really be better to find a way that isn't dependent on binary-upgrade
mode.
regards, tom lane
On 3/20/21 11:23 AM, Tom Lane wrote:
Jan Wieck <jan@wi3ck.info> writes:
All that aside, the entire approach doesn't scale.
Yeah, agreed. When we gave large objects individual ownership and ACL
info, it was argued that pg_dump could afford to treat each one as a
separate TOC entry because "you wouldn't have that many of them, if
they're large". The limits of that approach were obvious even at the
time, and I think now we're starting to see people for whom it really
doesn't work.
It actually looks more like some users have millions of "small objects".
I am still wondering where that is coming from and why they are abusing
LOs in that way, but that is more out of curiosity. Fact is that they
are out there and that they cannot upgrade from their 9.5 databases,
which are now past EOL.
I wonder if pg_dump could improve matters cheaply by aggregating the
large objects by owner and ACL contents. That is, do

select distinct lomowner, lomacl from pg_largeobject_metadata;
and make just *one* BLOB TOC entry for each result. Then dump out
all the matching blobs under that heading.
What I am currently experimenting with is moving the BLOB TOC entries
into the parallel data phase of pg_restore "when doing binary upgrade".
It seems to scale nicely with the number of cores in the system. In
addition to that, I have options for pg_upgrade and pg_restore that cause
the restore to batch them into transactions, like 10,000 objects at a
time. There was a separate thread for that, but I guess it is better to
keep it all together here now.
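A minimal sketch of that batching idea (hypothetical helper, not the actual
patch): wrap every N restore statements in an explicit transaction, so a
restore of `count` objects burns roughly count/N XIDs instead of one per
object:

```python
def batch_statements(oids, batch_size=10_000):
    """Yield SQL lines, wrapping each batch_size OIDs in BEGIN/COMMIT."""
    for i in range(0, len(oids), batch_size):
        yield "BEGIN;"
        for oid in oids[i:i + batch_size]:
            yield f"SELECT pg_catalog.lo_create('{oid}');"
        yield "COMMIT;"

# e.g. 25,000 objects in batches of 10,000 -> only 3 transactions (3 XIDs)
lines = list(batch_statements(list(range(25_000))))
print(lines.count("BEGIN;"))  # 3
```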
A possible objection is that it'd reduce the ability to restore blobs
selectively, so maybe we'd need to make it optional.
I fully intend to make all this into new "options". I am afraid that
there is no one-size-fits-all solution here.
Of course, that just reduces the memory consumption on the client
side; it does nothing for the locks. Can we get away with releasing the
lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?
I'm not very fond of the idea going lockless when at the same time
trying to parallelize the restore phase. That can lead to really nasty
race conditions. For now I'm aiming at batches in transactions.
Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
On 3/20/21 12:55 PM, Jan Wieck wrote:
On 3/20/21 11:23 AM, Tom Lane wrote:
Jan Wieck <jan@wi3ck.info> writes:
All that aside, the entire approach doesn't scale.
Yeah, agreed. When we gave large objects individual ownership and ACL
info, it was argued that pg_dump could afford to treat each one as a
separate TOC entry because "you wouldn't have that many of them, if
they're large". The limits of that approach were obvious even at the
time, and I think now we're starting to see people for whom it really
doesn't work.

It actually looks more like some users have millions of "small
objects". I am still wondering where that is coming from and why they
are abusing LOs in that way, but that is more out of curiosity. Fact
is that they are out there and that they cannot upgrade from their 9.5
databases, which are now past EOL.
One possible (probable?) source is the JDBC driver, which currently
treats all Blobs (and Clobs, for that matter) as LOs. I'm working on
improving that some: <https://github.com/pgjdbc/pgjdbc/pull/2093>
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 3/20/21 12:39 AM, Jan Wieck wrote:
On the way pg_upgrade also mangles the pg_database.datdba
(all databases are owned by postgres after an upgrade; will submit a
separate patch for that as I consider that a bug by itself).
Patch attached.
Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
Attachments:
pg_upgrade-preserve-datdba.v1.diff (text/x-patch; charset=UTF-8) +37/-12
On 3/21/21 7:47 AM, Andrew Dunstan wrote:
One possible (probable?) source is the JDBC driver, which currently
treats all Blobs (and Clobs, for that matter) as LOs. I'm working on
improving that some: <https://github.com/pgjdbc/pgjdbc/pull/2093>
You mean the user is using OID columns pointing to large objects and the
JDBC driver is mapping those for streaming operations?
Yeah, that would explain a lot.
Thanks, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services