pg_upgrade and wraparound
Hello!
I've been trying to upgrade a PostgreSQL cluster from 9.6 to 10. I ran
pg_upgrade with the following options:
/usr/lib/postgresql/10/bin/pg_upgrade -b /usr/lib/postgresql/9.6/bin/ -B
/usr/lib/postgresql/10/bin/ -d /var/lib/postgresql/9.6/main -D
/var/lib/postgresql/10/main -o ' -c
config_file=/etc/postgresql/9.6/main/postgresql.conf' -O ' -c
config_file=/etc/postgresql/10/main/postgresql.conf'
The upgrade operation failed after several hours with the following error:
database is not accepting commands to avoid wraparound data loss in
database with OID 0
Earlier in the log there are a lot of messages like
pg_restore: executing BLOB 1740736966
pg_restore: WARNING: database with OID 0 must be vacuumed within 1000279
transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that
database.
You might also need to commit or roll back old prepared transactions.
pg_restore: WARNING: database with OID 0 must be vacuumed within 1000278
transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that
database.
You might also need to commit or roll back old prepared transactions.
I've tried running VACUUM FULL on all databases in my 9.6 cluster and then
retried the pg_upgrade - it failed in the same way.
Also to be noted: this cluster was previously upgraded successfully with
pg_upgrade using similar parameters from older versions (at least twice,
something like 9.1 -> 9.3, then 9.3 -> 9.6). The database is around 700 GB
and contains a very large number of large objects.
What could be the reason for this, and how can I complete my upgrade?
Thanks in advance,
Alexander
On 06/09/2018 03:46 AM, Alexander Shutyaev wrote:
I've been trying to upgrade a postgresql cluster from 9.6 to 10. [...]
The upgrade operation failed after several hours with the following error:
database is not accepting commands to avoid wraparound data loss in
database with OID 0
Do you know which database has an OID of 0?
--
Adrian Klaver
adrian.klaver@aklaver.com
Adrian Klaver <adrian.klaver@aklaver.com> writes:
On 06/09/2018 03:46 AM, Alexander Shutyaev wrote:
The upgrade operation failed after several hours with the following error:
database is not accepting commands to avoid wraparound data loss in
database with OID 0
Do you know which database has an OID of 0?
Well, none do, so the correct question is what is passing an invalid
database OID to the code that's complaining. This sure looks like a
bug, though I'm not sure we have enough info to locate it.
regards, tom lane
I can provide you with more info - please tell me what you need. I really hope
someone here can help me solve or work around this, because I really
need to migrate to v10 for its features.
Thanks in advance,
Alexander
On 06/10/2018 01:46 PM, Alexander Shutyaev wrote:
I can provide you with more info. Please tell me what you need. I really hope
someone here can help me somehow solve or workaround this, because I really
need to migrate to v10 for its features.
Well, as Tom Lane said, no database will have an OID of 0, so we need
to know where that is coming from. On the off chance it occurred in your
9.6 cluster, can you check whether:
select oid, datname from pg_database;
in the 9.6 cluster shows an OID of 0.
Also:
1) OS and version you are using?
2) Where are you getting your Postgres from?:
a) Package
Repo?
b) Source
File source?
3) Were there any issues with the 9.6 cluster before you tried to upgrade
from it?
--
Adrian Klaver
adrian.klaver@aklaver.com
No database with oid 0 in 9.6 cluster:
postgres=# select oid, datname from pg_database;
oid | datname
------------+-----------
1 | template1
12438 | template0
16400 | bof
12439 | postgres
1016305714 | sslentry
(5 rows)
1) OS and version you are using?
Ubuntu 16.04.4
2) Where are you getting your Postgres from?:
a) Package
Repo?
b) Source
File source?
I've installed it from postgresql repo -
http://apt.postgresql.org/pub/repos/apt/
3) Were there any issues with the 9.6 cluster before you tried to upgrade
from it?
No, I don't remember anything specific about it.
Some more notes on databases.
Although the pg_upgrade failed, I decided to check the databases in the
new cluster (10.4). There is no database with OID 0 there either. Also to be
noted that some system databases changed their OIDs while others retained
them.
As for my own databases - sslentry had a very big OID (does that seem
strange?) and its OID has changed.
select oid, datname from pg_database;
oid | datname
-------+-----------
13011 | template0
16400 | bof
13012 | postgres
16401 | sslentry
1 | template1
(5 rows)
On 06/10/2018 02:09 PM, Alexander Shutyaev wrote:
Some more notes on databases.
Although the pg_upgrade failed, I've decided to check the databases in
the new cluster (10.4). There is no database with oid 0 either. Also to
be noted that some system databases changed the oids while others
retained them.
If I am following the pg_upgrade source code correctly, that is
expected - pretty sure because the order of object creation is different.
As for my own databases - sslentry had a very big OID (does that seem
strange?) and its OID has changed.
OIDs are assigned at object creation time, so I would say the sslentry
database was created some time after the other databases in the 9.6
cluster. It's probably more accurate to say it was created after roughly
1016305714 - 16400 (the bof db's OID) objects that take OIDs were created.
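As a sanity check on that figure (a rough sketch; it assumes the shared OID counter advanced monotonically between the two database creations):

```python
# Rough count of OID-consuming objects created between the bof database
# (OID 16400) and the sslentry database (OID 1016305714), assuming the
# shared OID counter only moved forward in between.
bof_oid = 16400
sslentry_oid = 1016305714
print(sslentry_oid - bof_oid)  # 1016289314, i.e. about a billion objects
```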
When the upgrade failed, pg_upgrade should have pointed you at an error log.
Did it, and is there anything useful in there?
From your original post:
"You might also need to commit or roll back old prepared transactions."
Do you have any of those in the 9.6 cluster?
See:
https://www.postgresql.org/docs/10/static/view-pg-prepared-xacts.html
--
Adrian Klaver
adrian.klaver@aklaver.com
The error log is like this. Here's its tail:
pg_restore: executing BLOB 1740737401
pg_restore: WARNING: database with OID 0 must be vacuumed within 1000003
transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that
database.
You might also need to commit or roll back old prepared transactions.
pg_restore: WARNING: database with OID 0 must be vacuumed within 1000002
transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that
database.
You might also need to commit or roll back old prepared transactions.
pg_restore: executing BLOB 1740737402
pg_restore: WARNING: database with OID 0 must be vacuumed within 1000001
transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that
database.
You might also need to commit or roll back old prepared transactions.
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 9759463; 2613 1740737402
BLOB 1740737402 bof_user
pg_restore: [archiver (db)] could not execute query: ERROR: database is
not accepting commands to avoid wraparound data loss in database with OID 0
HINT: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.
Command was: ALTER LARGE OBJECT 1740737402 OWNER TO bof_user;
Before that there are a lot of similar messages - the only things changing
are the "executing BLOB nnn" number and the "must be vacuumed within nnn
transactions" number.
As for the prepared transactions - no, I don't have any; our application
doesn't use that functionality.
On 06/10/2018 02:45 PM, Alexander Shutyaev wrote:
Comments inline.
The error log is like this. Here's its tail:
Is this the regular Postgres log or the pg_upgrade log which should be
something like pg_upgrade_server.log?
pg_restore: [archiver (db)] could not execute query: ERROR: database is
not accepting commands to avoid wraparound data loss in database with OID 0
HINT: Stop the postmaster and vacuum that database in single-user mode.
How did you get into the 10 cluster to report on the database OID's and
names?
You might also need to commit or roll back old prepared transactions.
Command was: ALTER LARGE OBJECT 1740737402 OWNER TO bof_user;
Before that there are a lot of similar messages - the only things
changing are the "executing BLOB nnn" number and "must be vacuumed
within nnn transactions" number.
Which database has the large objects?
As for the prepared transactions - no, I don't have them, our
application doesn't use this functionality.
Did you check this view to confirm?:
https://www.postgresql.org/docs/10/static/view-pg-prepared-xacts.html
Just trying to eliminate possibilities.
--
Adrian Klaver
adrian.klaver@aklaver.com
Is this the regular Postgres log or the pg_upgrade log which should be
something like pg_upgrade_server.log?
This is the pg_upgrade_dump_16400.log.
How did you get into the 10 cluster to report on the database OID's and
names?
After the pg_upgrade failed I was able to start both clusters, so I
connected to the new 10.4 cluster and ran the query.
Which database has the large objects?
bof (OID=16400). It is also effectively the only database that matters
here. The other one - sslentry - only contains a couple of tables and a
dozen records.
Did you check this view to confirm?
Yes, I did:
select * from pg_prepared_xacts;
transaction | gid | prepared | owner | database
-------------+-----+----------+-------+----------
(0 rows)
On 06/10/2018 11:46 PM, Alexander Shutyaev wrote:
Is this the regular Postgres log or the pg_upgrade log which should
be something like pg_upgrade_server.log?
This is the pg_upgrade_dump_16400.log.
How did you get into the 10 cluster to report on the database OID's and
names?
After the pg_upgrade failed I was able to start both clusters, so I
connected to the new 10.4 cluster and ran the query.
I am at a loss for an explanation. My thoughts:
1) The database with an OID of 0 is a mystery, though it does not seem
to be stopping the upgrade by itself.
2) The upgrade stops because of transaction ID wraparound, which is
strange as that is not showing up in the 9.6 cluster, I presume. You
might want to run the queries found below on the 9.6 and 10 clusters to
help figure this out:
https://www.postgresql.org/docs/10/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
SELECT c.oid::regclass as table_name,
greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 'm');
SELECT datname, age(datfrozenxid) FROM pg_database;
Which database has the large objects?
bof (OID=16400). It is also effectively the only database that matters
here. The other one - sslentry - only contains a couple of tables and a
dozen records.
Did you check this view to confirm?
Yes, I did:
select * from pg_prepared_xacts;
transaction | gid | prepared | owner | database
-------------+-----+----------+-------+----------
(0 rows)
--
Adrian Klaver
adrian.klaver@aklaver.com
Hi,
On 2018-06-09 13:46:16 +0300, Alexander Shutyaev wrote:
Hello!
I've been trying to upgrade a postgresql cluster from 9.6 to 10. I've
executed the pg_upgrade with the following options:
/usr/lib/postgresql/10/bin/pg_upgrade -b /usr/lib/postgresql/9.6/bin/ -B
/usr/lib/postgresql/10/bin/ -d /var/lib/postgresql/9.6/main -D
/var/lib/postgresql/10/main -o ' -c
config_file=/etc/postgresql/9.6/main/postgresql.conf' -O ' -c
config_file=/etc/postgresql/10/main/postgresql.conf'
The upgrade operation failed after several hours with the following error:
database is not accepting commands to avoid wraparound data loss in
database with OID 0
Earlier in the log there are a lot of messages like
pg_restore: executing BLOB 1740736966
pg_restore: WARNING: database with OID 0 must be vacuumed within 1000279
transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that
database.
You might also need to commit or roll back old prepared transactions.
Since this happens in the version you're migrating to, I suggest
replacing the warning with a PANIC and then using the debugger to look
at the core file generated (when running with an appropriate ulimit). It's not
immediately obvious why there'd not be correct knowledge about the
oldest database around. If you do so, please include 'p *ShmemVariableCache'
output.
Greetings,
Andres Freund
On 2018-06-09 15:52:26 -0400, Tom Lane wrote:
Adrian Klaver <adrian.klaver@aklaver.com> writes:
On 06/09/2018 03:46 AM, Alexander Shutyaev wrote:
The upgrade operation failed after several hours with the following error:
database is not accepting commands to avoid wraparound data loss in
database with OID 0
Do you know which database has an OID of 0?
Well, none do, so the correct question is what is passing an invalid
database OID to the code that's complaining. This sure looks like a
bug, though I'm not sure we have enough info to locate it.
It sure looks like ShmemVariableCache->oldestXidDB isn't initialized.
As far as I remember we just initialize that from a checkpoint at
startup. I suspect the issue is that pg_resetwal does:
if (set_xid != 0)
{
ControlFile.checkPointCopy.nextXid = set_xid;
/*
* For the moment, just set oldestXid to a value that will force
* immediate autovacuum-for-wraparound. It's not clear whether adding
* user control of this is useful, so let's just do something that's
* reasonably safe. The magic constant here corresponds to the
* maximum allowed value of autovacuum_freeze_max_age.
*/
ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId)
ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId;
ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
}
but we have code paths that don't check for oldestXidDB being
InvalidOid. Not great.
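For context, the headroom that magic constant leaves can be worked out. This is a back-of-the-envelope sketch; the 1,000,000-XID stop margin below is the constant varsup.c applies before the "not accepting commands" shutdown:

```python
# After pg_resetwal -x, oldestXid = nextXid - 2000000000, and the hard
# stop kicks in 1M XIDs before the wraparound limit (oldestXid + 2^31).
HALF_XID_SPACE = 2**31            # wraparound limit is oldestXid + 2^31
RESETWAL_OFFSET = 2_000_000_000   # the hardcoded constant above
STOP_MARGIN = 1_000_000           # "not accepting commands" safety margin

headroom = HALF_XID_SPACE - RESETWAL_OFFSET - STOP_MARGIN
print(headroom)  # 146483648
```

So the restored cluster starts out with only about 146 million XIDs of headroom before the shutdown, which lines up with the countdown seen in the pg_restore log.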
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I suspect the issue is that pg_resetwal does:
[...]
but we have codepath that doesn't check for oldestXidDB being
InvalidOid. Not great.
Hm, I think I'd define the problem as "pg_resetwal is violating the
expectation that oldestXidDB be valid".
However, this just explains the basically-cosmetic issue that the
complaint message mentions OID 0. It doesn't really get us to the
answer to why Alexander is seeing a failure. It might be useful
to see pg_controldata output for the old cluster, as well as
"select datname, datfrozenxid from pg_database" output from the
old cluster.
regards, tom lane
On 2018-06-11 13:14:12 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I suspect the issue is that pg_resetwal does:
[...]
but we have codepath that doesn't check for oldestXidDB being
InvalidOid. Not great.
Hm, I think I'd define the problem as "pg_resetwal is violating the
expectation that oldestXidDB be valid".
Well, what could it do otherwise? ForceTransactionIdLimitUpdate()
currently does a syscache check for database existence. That'll just
return a lookup failure for InvalidOid, so we're reasonably good on that
front.
Using a hardcoded 2000000000 seems worse; it will have funny results if
running with a smaller autovacuum_freeze_max_age...
However, this just explains the basically-cosmetic issue that the
complaint message mentions OID 0. It doesn't really get us to the
answer to why Alexander is seeing a failure. It might be useful
to see pg_controldata output for the old cluster, as well as
"select datname, datfrozenxid from pg_database" output from the
old cluster.
pg_upgrade starts the server with autovacuum disabled, I suspect
restoring all the large objects ends up using a lot of transaction
ids. GetNewTransactionId() should start autovacuum, but I'd guess that's
where things are going wrong for some reason.
Alexander, could you hack things up so autovacuum logging is enabled
(log_autovacuum_min_duration=0), and see whether it's triggered?
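A sketch of what to set in the new cluster's postgresql.conf for that run (the setting name is standard; 0 logs every autovacuum action regardless of duration):

```
# Diagnostic run only: log every autovacuum/autoanalyze action
log_autovacuum_min_duration = 0
```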
I'm not entirely clear on why pg_restore appears to use a separate
transaction for each large object, surely exacerbating the problem.
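Under that assumption the XID burn rate is easy to model. A sketch (the two-transactions-per-object figure is an assumption based on the log, which shows a "BLOB nnn" entry and an "ALTER LARGE OBJECT ... OWNER TO" entry per object, not a measured number):

```python
# If each large object costs its own transaction(s) during pg_restore, the
# number of objects needed to exhaust the post-resetwal XID headroom is
# small compared to a big pg_largeobject catalog.
HEADROOM = 2**31 - 2_000_000_000 - 1_000_000  # XIDs left after pg_resetwal's
                                              # offset and the 1M stop margin

def objects_until_shutdown(xacts_per_object: int) -> int:
    """How many large objects can be restored before the wraparound stop."""
    return HEADROOM // xacts_per_object

print(objects_until_shutdown(1))  # 146483648
print(objects_until_shutdown(2))  # 73241824
```

With hundreds of millions of large objects, either figure is easily exceeded in a single restore, which would explain the shutdown while autovacuum is disabled.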
Greetings,
Andres Freund
I'm back with more details.
First, I've deleted the smaller sslentry database, since I don't need it,
just so that it doesn't somehow spoil the picture. Now there is only one
user database - bof (OID=16400). After that I ran pg_upgrade on a clean
10.4 cluster and it failed in the same way.
Now, the answers to your queries.
2) The upgrade stops because of transaction ID wraparound, which is
strange as that is not showing up in the 9.6 cluster I presume. You might
want the queries found below on the 9.6 and 10 clusters to help figure this
out:
SELECT c.oid::regclass as table_name,
greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 'm');
9.6 result - see attached 96-query1.txt
10.4 result - see attached 104-query1.txt
SELECT datname, age(datfrozenxid) FROM pg_database;
9.6 result - see attached 96-query2.txt
10.4 result - see attached 104-query2.txt
It might be useful
to see pg_controldata output for the old cluster, as well as
"select datname, datfrozenxid from pg_database" output from the
old cluster.
for the query - see above, for pg_controldata:
9.6 - see attached 96-pg_controldata.txt
10.4 - see attached 104-pg_controldata.txt
Alexander, could you hack things up so autovacuum logging is enabled
(log_autovacuum_min_duration=0), and see whether it's triggered?
I'll be happy to, but that will require running pg_upgrade once more, and
that takes more than half a day, during which the clusters are not
available to me. Given the data I'm attaching, it may happen that the
colleagues will want to see something else from my clusters or change
some settings before running pg_upgrade again. Therefore, I'll wait 12
hours after this message in case there are any more requests, and then
I'll run pg_upgrade again.
Thank you all for trying to solve this matter, this is much appreciated! :)
Attachments:
96-pg_controldata.txt
96-query1.txt
96-query2.txt
104-pg_controldata.txt
104-query1.txt
104-query2.txt
On 06/11/2018 11:32 AM, Alexander Shutyaev wrote:
I'm back with more details.
First, I've deleted the smaller sslentry database, since I don't need
it, just so that it doesn't somehow spoil the picture. Now there is only
one user database - bof (OID=16400). After that I ran pg_upgrade on
a clean 10.4 cluster and it failed in the same way.
Now, the answers to your queries.
2) The upgrade stops because of transaction ID wraparound, which is
strange as that is not showing up in the 9.6 cluster, I presume. You
might want to run the queries found below on the 9.6 and 10 clusters to
help figure this out:
SELECT c.oid::regclass as table_name,
greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 'm');
9.6 result - see attached 96-query1.txt
10.4 result - see attached 104-query1.txt
I should have been clearer in my previous post: the above query is per
database. From the query below, I am going to say the above query was
done on the bof database. Is that correct?
Given the below from 96-query2.txt:
template0 | 110588398
Can you run the table_name query in template0 in the 9.6 cluster?
--
Adrian Klaver
adrian.klaver@aklaver.com
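The arithmetic in the quoted pg_resetwal snippet, and why the later warnings name "database with OID 0", can be sketched in isolation (a standalone illustration mirroring the quoted C, not the actual source):

```python
# Sketch of pg_resetwal's oldestXid computation from the snippet
# quoted above: back oldestXid off by 2 billion (the maximum
# allowed autovacuum_freeze_max_age) so that an anti-wraparound
# autovacuum is forced immediately.
FIRST_NORMAL_XID = 3   # XIDs 0-2 are reserved
INVALID_OID = 0

def reset_oldest_xid(set_xid: int):
    oldest_xid = (set_xid - 2_000_000_000) % 2**32
    if oldest_xid < FIRST_NORMAL_XID:
        # Skip the reserved XIDs when the subtraction wraps to 0-2.
        oldest_xid += FIRST_NORMAL_XID
    # pg_resetwal does not know which database holds oldest_xid, so
    # it stores InvalidOid -- which is why the warnings later
    # complain about "database with OID 0".
    return oldest_xid, INVALID_OID

print(reset_oldest_xid(2_000_000_100))  # (100, 0)
```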
From the query below I am going to say the above query was done on the
bof database. Is that correct?
Yes, it is.
Can you run the table_name query in template0 in the 9.6 cluster?
At first I couldn't. There was an error:
psql: FATAL: database "template0" is not currently accepting connections
I've googled for it and found out that template0 is some special system
database that is protected from connections, but that can be changed (see
https://wiki.postgresql.org/wiki/Adventures_in_PostgreSQL,_Episode_1)
I've changed that and connected to this database and ran your query. The
result is attached in 96-query1-template0.txt.
2018-06-11 22:10 GMT+03:00 Adrian Klaver <adrian.klaver@aklaver.com>:
On 06/11/2018 11:32 AM, Alexander Shutyaev wrote:

I'm back with more details.

First, I've deleted the smaller sslentry database, since I don't need it,
just so that it doesn't somehow spoil the picture. Now there is only one
user database - bof (OID=16400). After that I ran the pg_upgrade on a
clean 10.4 cluster and it failed in the same way.

Now, the answers to your queries.

2) The upgrade stops because of transaction ID wraparound, which is
strange as that is not showing up in the 9.6 cluster, I presume. You
might want to run the queries found below on the 9.6 and 10 clusters to
help figure this out:

SELECT c.oid::regclass as table_name,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 'm');