Speeding up pg_upgrade
As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about
zero-downtime upgrades. After the usual discussion of using logical
replication, Slony, and perhaps having the server be able to read old
and new system catalogs, we discussed speeding up pg_upgrade.
There are clusters that take a long time to dump the schema from the old
cluster and recreate it in the new cluster. One idea of speeding up
pg_upgrade would be to allow pg_upgrade to be run in two stages:
1. prevent system catalog changes while the old cluster is running, and
dump the old cluster's schema and restore it in the new cluster
2. shut down the old cluster and copy/link the data files
My question is whether the schema dump/restore is time-consuming enough
to warrant this optional more complex API, and whether people would
support adding a server setting that prevented all system table changes?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Bruce,
* Bruce Momjian (bruce@momjian.us) wrote:
As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about
zero-downtime upgrades. After the usual discussion of using logical
replication, Slony, and perhaps having the server be able to read old
and new system catalogs, we discussed speeding up pg_upgrade.
Sounds familiar.
There are clusters that take a long time to dump the schema from the old
cluster and recreate it in the new cluster. One idea of speeding up
pg_upgrade would be to allow pg_upgrade to be run in two stages:
1. prevent system catalog changes while the old cluster is running, and
dump the old cluster's schema and restore it in the new cluster
2. shut down the old cluster and copy/link the data files
Perhaps a bit more complicated, but couldn't we copy/link while the
old cluster is online and in backup mode, finish backup mode, shut down
the old cluster, and then play forward the WAL to catch any relation
extents being added or similar, and then flip to the new PG version?
My question is whether the schema dump/restore is time-consuming enough
to warrant this optional more complex API, and whether people would
support adding a server setting that prevented all system table changes?
When you say 'system table changes', you're referring to basically all
DDL, right? Just wish to clarify as there might be some confusion
between the terminology you're using here and allow_system_table_mods.
Would we need to have autovacuum shut down too..?
The other concern is if there's changes made to the catalogs by non-DDL
activity that needs to be addressed too (logical replication?); nothing
definite springs to mind off-hand for me, but perhaps others will think
of things.
Thanks!
Stephen
Hi
On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian <bruce@momjian.us> wrote:
As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about
zero-downtime upgrades. After the usual discussion of using logical
replication, Slony, and perhaps having the server be able to read old
and new system catalogs, we discussed speeding up pg_upgrade.
There are clusters that take a long time to dump the schema from the old
cluster and recreate it in the new cluster. One idea of speeding up
pg_upgrade would be to allow pg_upgrade to be run in two stages:
1. prevent system catalog changes while the old cluster is running, and
dump the old cluster's schema and restore it in the new cluster
2. shut down the old cluster and copy/link the data files
When we were discussing this, I was thinking that the linking could be done
in phase 1 too, as that's potentially slow on a very large schema.
My question is whether the schema dump/restore is time-consuming enough
to warrant this optional more complex API, and whether people would
support adding a server setting that prevented all system table changes?
I've certainly heard of cases where pg_upgrade takes significant amounts of
time to run on very complex databases.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 5, 2017 at 09:16:02AM -0500, Stephen Frost wrote:
Bruce,
* Bruce Momjian (bruce@momjian.us) wrote:
As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about
zero-downtime upgrades. After the usual discussion of using logical
replication, Slony, and perhaps having the server be able to read old
and new system catalogs, we discussed speeding up pg_upgrade.
Sounds familiar.
Yeah. :-|
There are clusters that take a long time to dump the schema from the old
cluster and recreate it in the new cluster. One idea of speeding up
pg_upgrade would be to allow pg_upgrade to be run in two stages:
1. prevent system catalog changes while the old cluster is running, and
dump the old cluster's schema and restore it in the new cluster
2. shut down the old cluster and copy/link the data files
Perhaps a bit more complicated, but couldn't we copy/link while the
old cluster is online and in backup mode, finish backup mode, shut down
the old cluster, and then play forward the WAL to catch any relation
extents being added or similar, and then flip to the new PG version?
Well, that would require reading the old WAL, which would add an
additional compatibility requirement that seems unwise.
My question is whether the schema dump/restore is time-consuming enough
to warrant this optional more complex API, and whether people would
support adding a server setting that prevented all system table changes?
When you say 'system table changes', you're referring to basically all
DDL, right? Just wish to clarify as there might be some confusion
between the terminology you're using here and allow_system_table_mods.
Not only all DDL, but even updating them for the internal stuff, like
pg_class.relfrozenxid.
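To illustrate the kind of internal churn involved: even plain (auto)vacuum activity, with no user DDL at all, rewrites catalog rows. A standard catalog query (not from the thread) shows the freeze tracking that would have to be frozen along with everything else:

```sql
-- relfrozenxid is maintained by (auto)vacuum, not by user DDL; a setting
-- that blocks all system table changes would have to block these updates too
SELECT relname, relfrozenxid, age(relfrozenxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 5;
```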
Would we need to have autovacuum shut down too..?
Yes.
The other concern is if there's changes made to the catalogs by non-DDL
activity that needs to be addressed too (logical replication?); nothing
definite springs to mind off-hand for me, but perhaps others will think
of things.
Yes, it could extend to many parts of the system, which is why I am
asking if it is worth it.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Tue, Dec 5, 2017 at 11:16:26PM +0900, Dave Page wrote:
Hi
On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian <bruce@momjian.us> wrote:
As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about
zero-downtime upgrades. After the usual discussion of using logical
replication, Slony, and perhaps having the server be able to read old
and new system catalogs, we discussed speeding up pg_upgrade.
There are clusters that take a long time to dump the schema from the old
cluster and recreate it in the new cluster. One idea of speeding up
pg_upgrade would be to allow pg_upgrade to be run in two stages:
1. prevent system catalog changes while the old cluster is running, and
dump the old cluster's schema and restore it in the new cluster
2. shut down the old cluster and copy/link the data files
When we were discussing this, I was thinking that the linking could be done in
phase 1 too, as that's potentially slow on a very large schema.
Uh, good point! You can create the hard links while the system is
running, no problem! It would only be the copy that can't be done while the
system is running. Of course a big question is whether hard linking
takes any measurable time.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Dave,
* Dave Page (dpage@pgadmin.org) wrote:
On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian <bruce@momjian.us> wrote:
As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about
zero-downtime upgrades. After the usual discussion of using logical
replication, Slony, and perhaps having the server be able to read old
and new system catalogs, we discussed speeding up pg_upgrade.
There are clusters that take a long time to dump the schema from the old
cluster and recreate it in the new cluster. One idea of speeding up
pg_upgrade would be to allow pg_upgrade to be run in two stages:
1. prevent system catalog changes while the old cluster is running, and
dump the old cluster's schema and restore it in the new cluster
2. shut down the old cluster and copy/link the data files
When we were discussing this, I was thinking that the linking could be done
in phase 1 too, as that's potentially slow on a very large schema.
Right, I had that thought too when first reading this, but the problem
there is that new files can show up due to a relation being extended (at
least, and perhaps in other cases too..).
My question is whether the schema dump/restore is time-consuming enough
to warrant this optional more complex API, and whether people would
support adding a server setting that prevented all system table changes?
I've certainly heard of cases where pg_upgrade takes significant amounts of
time to run on very complex databases.
Right, but that doesn't really answer the question as to which part of
the pg_upgrade process is taking up the time.
In any case, of course, if we're able to move part of what pg_upgrade
does to be while the old server is online then that takes whatever the
cost of that is out of the downtime window. The question is if that's a
5% improvement in the overall performance of pg_upgrade or a 70% one.
This will be case-by-case, of course, but if, in the best-case, we only
get a 5% improvement then this might not be worth the risk.
Thanks!
Stephen
On Tue, Dec 5, 2017 at 09:23:49AM -0500, Stephen Frost wrote:
Dave,
* Dave Page (dpage@pgadmin.org) wrote:
On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian <bruce@momjian.us> wrote:
As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about
zero-downtime upgrades. After the usual discussion of using logical
replication, Slony, and perhaps having the server be able to read old
and new system catalogs, we discussed speeding up pg_upgrade.
There are clusters that take a long time to dump the schema from the old
cluster and recreate it in the new cluster. One idea of speeding up
pg_upgrade would be to allow pg_upgrade to be run in two stages:
1. prevent system catalog changes while the old cluster is running, and
dump the old cluster's schema and restore it in the new cluster
2. shut down the old cluster and copy/link the data files
When we were discussing this, I was thinking that the linking could be done
in phase 1 too, as that's potentially slow on a very large schema.
Right, I had that thought too when first reading this, but the problem
there is that new files can show up due to a relation being extended (at
least, and perhaps in other cases too..).
Oh, yikes, yes.
My question is whether the schema dump/restore is time-consuming enough
to warrant this optional more complex API, and whether people would
support adding a server setting that prevented all system table changes?
I've certainly heard of cases where pg_upgrade takes significant amounts of
time to run on very complex databases.
Right, but that doesn't really answer the question as to which part of
the pg_upgrade process is taking up the time.
In any case, of course, if we're able to move part of what pg_upgrade
does to be while the old server is online then that takes whatever the
cost of that is out of the downtime window. The question is if that's a
5% improvement in the overall performance of pg_upgrade or a 70% one.
This will be case-by-case, of course, but if, in the best-case, we only
get a 5% improvement then this might not be worth the risk.
Yes, and who is going to know if they have a setup where the more
complex API is worth it? pg_upgrade is already complex enough to use.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Bruce,
* Bruce Momjian (bruce@momjian.us) wrote:
On Tue, Dec 5, 2017 at 09:16:02AM -0500, Stephen Frost wrote:
There are clusters that take a long time to dump the schema from the old
cluster and recreate it in the new cluster. One idea of speeding up
pg_upgrade would be to allow pg_upgrade to be run in two stages:
1. prevent system catalog changes while the old cluster is running, and
dump the old cluster's schema and restore it in the new cluster
2. shut down the old cluster and copy/link the data files
Perhaps a bit more complicated, but couldn't we copy/link while the
old cluster is online and in backup mode, finish backup mode, shut down
the old cluster, and then play forward the WAL to catch any relation
extents being added or similar, and then flip to the new PG version?
Well, that would require reading the old WAL, which would add an
additional compatibility requirement that seems unwise.
In my proposal, this would be the old version of PG reading the old WAL.
Thinking about it a bit further though, I'm not sure it'd end up working
in link mode anyway, due to post-backup-finish changes that could be
made by the old server on the data files before it's shut down.
We have to have a way of dealing with the delta between the hard link
trees after the old server is shut down though because there could be
new relation extents, at least.
My question is whether the schema dump/restore is time-consuming enough
to warrant this optional more complex API, and whether people would
support adding a server setting that prevented all system table changes?
When you say 'system table changes', you're referring to basically all
DDL, right? Just wish to clarify as there might be some confusion
between the terminology you're using here and allow_system_table_mods.
Not only all DDL, but even updating them for the internal stuff, like
pg_class.relfrozenxid.
Good point. We'd really need a pretty bullet-proof way to ensure that
the catalog isn't changed during this time period and that seems like it
might be difficult without a lot of work.
Would we need to have autovacuum shut down too..?
Yes.
Ok, makes sense.
The other concern is if there's changes made to the catalogs by non-DDL
activity that needs to be addressed too (logical replication?); nothing
definite springs to mind off-hand for me, but perhaps others will think
of things.
Yes, it could extend to many parts of the system, which is why I am
asking if it is worth it.
My initial reaction is that it's worth it, but then I also wonder about
other issues (having to get an ANALYZE done on the new cluster before
opening it up, for example..) and it makes me wonder if perhaps it'll
end up being too much risk for too little gain.
Thanks!
Stephen
Bruce,
* Bruce Momjian (bruce@momjian.us) wrote:
In any case, of course, if we're able to move part of what pg_upgrade
does to be while the old server is online then that takes whatever the
cost of that is out of the downtime window. The question is if that's a
5% improvement in the overall performance of pg_upgrade or a 70% one.
This will be case-by-case, of course, but if, in the best-case, we only
get a 5% improvement then this might not be worth the risk.
Yes, and who is going to know if they have a setup where the more
complex API is worth it? pg_upgrade is already complex enough to use.
Sure, but the solution there is really to make pg_upgrade simpler to
use, even as we add these more complicated APIs to it. What that likely
means in practical terms is that we have another utility, which uses
pg_upgrade underneath, that you're able to configure to know about your
existing cluster and the version of PG you want to upgrade to and where
you want it and if you want a copy or if hard-links are ok, etc.
Having such a tool is actually what I'd been hoping would come out of
the documented process for doing a "pg_upgrade" on replicas that's
currently in our documentation. That's not happened yet, but it's
something that David Steele and I have been chatting about because the
procedure in the documentation is terribly difficult and dangerous for
those who aren't as familiar with the system.
Perhaps we could have one tool that handles both the more complicated
pg_upgrade API and deals with upgrading replicas. Alternatively, we
could have a config file for pg_upgrade instead which might be a simpler
way for people to describe exactly their current configuration and what
they'd like to go to. Upgrading replicas involves using something like
SSH though..
Thanks!
Stephen
On 12/5/17 09:23, Stephen Frost wrote:
Right, but that doesn't really answer the question as to which part of
the pg_upgrade process is taking up the time.
Yeah, that should be measured before we do anything.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 5, 2017 at 09:30:53AM -0500, Stephen Frost wrote:
The other concern is if there's changes made to the catalogs by non-DDL
activity that needs to be addressed too (logical replication?); nothing
definite springs to mind off-hand for me, but perhaps others will think
of things.
Yes, it could extend to many parts of the system, which is why I am
asking if it is worth it.
My initial reaction is that it's worth it, but then I also wonder about
other issues (having to get an ANALYZE done on the new cluster before
opening it up, for example..) and it makes me wonder if perhaps it'll
end up being too much risk for too little gain.
Yes, dump/reload of analyze statistics seems like a better use of time.
I have avoided it since it locks us into supporting the text
representation of the data types, but at this point it might be worth it.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Hi,
Yes, dump/reload of analyze statistics seems like a better use of time.
I have avoided it since it locks us into supporting the text
representation of the data types, but at this point it might be worth it.
Couple of months ago we at Zalando upgraded a few databases of different
sizes to 9.6.
During preparations for the upgrade I've found 2.5 pain-points:
1. We are using schema-based api deployment. Basically ~every week we
create a new schema in the database and hundreds of stored procedures in it.
Of course we remove old API schemas and try not to keep more than the
last 10. Before the upgrade we basically dropped all API schemas except the
one used in production.
And even in this case dump-restore phase was taking much more time than
relinking of datafiles.
Unfortunately I don't have any numbers right now, but usually a run of
pg_upgrade was taking about 30-35 seconds, and about 2/3 of the time was
spent in dump-restore.
2 ANALYZE phase is a pain. I think everybody agrees with it.
2.5 Usually ANALYZE stage 1 completes quite fast and performance becomes
reasonable, except one case: some of the columns might have non default
statistics target.
It breaks `vacuumdb --analyze-in-stages`, because for those specific
columns it will not use the value of default_statistics_target provided by
vacuumdb.
What I did was reset those non-default values right before running
pg_upgrade and restore them only when the analyze was completed. Of course
after that I re-analyzed those columns.
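That workaround can be sketched as follows (the table and column names are hypothetical, and the `1000` stands in for whatever target was originally configured):

```sql
-- Find columns with a non-default per-column statistics target
-- (attstattarget = -1 means "use default_statistics_target")
SELECT attrelid::regclass AS table_name, attname, attstattarget
FROM pg_attribute
WHERE attstattarget > 0
  AND NOT attisdropped;

-- Reset before running pg_upgrade ...
ALTER TABLE some_table ALTER COLUMN some_column SET STATISTICS -1;

-- ... then restore and re-analyze once vacuumdb --analyze-in-stages is done
ALTER TABLE some_table ALTER COLUMN some_column SET STATISTICS 1000;
ANALYZE some_table (some_column);
```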
Regards,
Alexander Kukushkin
Alexander,
* Alexander Kukushkin (cyberdemn@gmail.com) wrote:
Couple of months ago we at Zalando upgraded a few databases of different
sizes to 9.6.
Thanks for sharing your experience!
During preparations for the upgrade I've found 2.5 pain-points:
1. We are using schema-based api deployment. Basically ~every week we
create a new schema in the database and hundreds of stored procedures in it.
Of course we remove old API schemas and try not to keep more than the
last 10. Before the upgrade we basically dropped all API schemas except the
one used in production.
And even in this case dump-restore phase was taking much more time than
relinking of datafiles.
Unfortunately I don't have any numbers right now, but usually a run of
pg_upgrade was taking about 30-35 seconds, and about 2/3 of the time was
spent in dump-restore.
Ok, so eliminating 2/3 of the time would mean bringing it down to more
like 10 seconds. That certainly seems worthwhile to me. With the
linking time being much less than the dump/restore, we could at least
consider moving forward with Bruce's original idea where we do the
dump/restore while the system is online but then the linking with it
offline and get a serious performance boost out of it. That also avoids
the issue with new files showing up while the system is running that I
brought up when we were talking about having the linking done with the
system online.
2 ANALYZE phase is a pain. I think everybody agrees with it.
2.5 Usually ANALYZE stage 1 completes quite fast and performance becomes
reasonable, except one case: some of the columns might have non default
statistics target.
Ok, if the stage-1 is very fast and performance is reasonable enough
after that then perhaps it's not so bad to keep it as-is for now and
focus on the dump/restore time. That said, we should certainly also
work on improving this too.
It breaks `vacuumdb --analyze-in-stages`, because for those specific
columns it will not use the value of default_statistics_target provided by
vacuumdb.
What I did was reset those non-default values right before running
pg_upgrade and restore them only when the analyze was completed. Of course
after that I re-analyzed those columns.
Ah, yeah, ouch, that's unfortunate.. I wonder if there's something we
could do there to fix it..
Thanks!
Stephen
Stephen Frost wrote:
Alexander,
* Alexander Kukushkin (cyberdemn@gmail.com) wrote:
2 ANALYZE phase is a pain. I think everybody agrees with it.
2.5 Usually ANALYZE stage 1 completes quite fast and performance becomes
reasonable, except one case: some of the columns might have non default
statistics target.
Ok, if the stage-1 is very fast and performance is reasonable enough
after that then perhaps it's not so bad to keep it as-is for now and
focus on the dump/restore time. That said, we should certainly also
work on improving this too.
It seems pretty clear to me that we should somehow transfer stats from
the old server to the new one. Shouldn't it just be a matter of
serializing the MCV/histogram/ndistinct values, then have capabilities
to load on the new server? I suppose it'd just be used during binary
upgrade, but the point seems painful enough for a lot of users.
Obviously it would not be the raw contents of pg_statistic{,_ext}, but
rather something a bit higher-level.
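A read-only version of that higher-level representation already exists as the pg_stats view; a transfer mechanism would presumably need a loadable counterpart to roughly these per-column values (the table name below is hypothetical):

```sql
-- pg_stats is the human-readable face of pg_statistic; an import facility
-- would need to accept approximately this set of values per column
SELECT schemaname, tablename, attname,
       null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs,
       histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'some_table';
```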
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro,
* Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
Stephen Frost wrote:
* Alexander Kukushkin (cyberdemn@gmail.com) wrote:
2 ANALYZE phase is a pain. I think everybody agrees with it.
2.5 Usually ANALYZE stage 1 completes quite fast and performance becomes
reasonable, except one case: some of the columns might have non default
statistics target.
Ok, if the stage-1 is very fast and performance is reasonable enough
after that then perhaps it's not so bad to keep it as-is for now and
focus on the dump/restore time. That said, we should certainly also
work on improving this too.
It seems pretty clear to me that we should somehow transfer stats from
the old server to the new one. Shouldn't it just be a matter of
serializing the MCV/histogram/ndistinct values, then have capabilities
to load on the new server? I suppose it'd just be used during binary
upgrade, but the point seems painful enough for a lot of users.
Obviously it would not be the raw contents of pg_statistic{,_ext}, but
rather something a bit higher-level.
Right, I think that's what Bruce was getting at and certainly makes
sense to me as well. I agree that it's a definite pain point for
people. One complication is going to be custom data types, of course..
Thanks!
Stephen
On Tue, Dec 05, 2017 at 09:01:35AM -0500, Bruce Momjian wrote:
As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about
zero-downtime upgrades. ... we discussed speeding up pg_upgrade.
There are clusters that take a long time to dump the schema from the old
cluster
Maybe it isn't representative of a typical case, but I can offer a data point:
For us, we have ~40 customers with DBs ranging in size from <100GB to ~25TB
(for which ~90% is on a ZFS tablespace with compression). We have what's
traditionally considered to be an excessive number of child tables, which works
okay since planning time is unimportant to us for the report queries which hit
them. Some of the tables are wide (historically up to 1600 columns). Some of
those have default values on nearly every column, and pg_attrdef was large
(>500MB), causing pg_dump --section pre-data to be slow (10+ minutes). Since
something similar is run by pg_upgrade, I worked around the issue for now by
dropping defaults on the historic children in advance of upgrades (at some
point I'll figure out what I have to do to allow DROPing DEFAULTs). It's not
the first time we've seen an issue with larger number of children*columns.
Our slowest pg_upgrade was ~40min, caused by column defaults in a case where I
failed to re-DROP DEFAULTs after our first scheduled upgrade date was pushed
back by over a month. Most of the rest were completed in less than 15min.
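The advance workaround described above amounts to generating DROP DEFAULT statements for the historic children before the upgrade window, along these lines (the `child_%` name pattern is hypothetical):

```sql
-- Emit one ALTER TABLE ... DROP DEFAULT per defaulted column on the
-- historic child tables, to shrink pg_attrdef before running pg_upgrade
SELECT format('ALTER TABLE %s ALTER COLUMN %I DROP DEFAULT;',
              attrelid::regclass, attname)
FROM pg_attribute
WHERE atthasdef
  AND attrelid::regclass::text LIKE 'child\_%';
```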
Justin
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
It seems pretty clear to me that we should somehow transfer stats from
the old server to the new one. Shouldn't it just be a matter of
serializing the MCV/histogram/ndistinct values, then have capabilities
to load on the new server?
The reason pg_upgrade hasn't done that in the past is not wishing to
assume that the new version does stats identically to the old version.
Since we do in fact add stats or change stuff around from time to time,
that's not a negligible consideration.
regards, tom lane
On Thu, Dec 7, 2017 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
It seems pretty clear to me that we should somehow transfer stats from
the old server to the new one. Shouldn't it just be a matter of
serializing the MCV/histogram/ndistinct values, then have capabilities
to load on the new server?
The reason pg_upgrade hasn't done that in the past is not wishing to
assume that the new version does stats identically to the old version.
Since we do in fact add stats or change stuff around from time to time,
that's not a negligible consideration.
Yes, but we don't do that for every release. We could put rules into
pg_upgrade about which releases changed the stats format incompatibly,
and not transfer the stats when crossing between two releases with
incompatible formats. That's more than zero effort, of course, but it
might be worth it. We've already got CATALOG_VERSION_NO,
XLOG_PAGE_MAGIC, PG_CONTROL_VERSION, PG_PROTOCOL_LATEST,
BTREE_VERSION, HASH_VERSION, BRIN_CURRENT_VERSION,
GIN_CURRENT_VERSION, LOGICALREP_PROTO_VERSION_NUM,
PG_PAGE_LAYOUT_VERSION, PG_DATA_CHECKSUM_VERSION, K_VERS_MAJOR,
K_VERS_MINOR, K_VERS_REV, and the utterly unused MIGRATOR_API_VERSION.
Now, I have to admit that I find the process of trying to remember to
bump the correct set of version numbers in every commit just a tad
frustrating; it adds a cognitive burden I'd just as well skip.
However, the failure to transfer stats over the years seems to have
actually caused real problems for many users, so I think in this case
we might be best off sucking it up and adding one more version number.
We might even want to make it a little more fine-grained and track it
separately by data type, but I'm not sure if that's really worth it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert, all,
* Robert Haas (robertmhaas@gmail.com) wrote:
On Thu, Dec 7, 2017 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
It seems pretty clear to me that we should somehow transfer stats from
the old server to the new one. Shouldn't it just be a matter of
serializing the MCV/histogram/ndistinct values, then have capabilities
to load on the new server?
The reason pg_upgrade hasn't done that in the past is not wishing to
assume that the new version does stats identically to the old version.
Since we do in fact add stats or change stuff around from time to time,
that's not a negligible consideration.
Yes, but we don't do that for every release. We could put rules into
pg_upgrade about which releases changed the stats format incompatibly,
and not transfer the stats when crossing between two releases with
incompatible formats. That's more than zero effort, of course, but it
might be worth it. We've already got CATALOG_VERSION_NO,
XLOG_PAGE_MAGIC, PG_CONTROL_VERSION, PG_PROTOCOL_LATEST,
BTREE_VERSION, HASH_VERSION, BRIN_CURRENT_VERSION,
GIN_CURRENT_VERSION, LOGICALREP_PROTO_VERSION_NUM,
PG_PAGE_LAYOUT_VERSION, PG_DATA_CHECKSUM_VERSION, K_VERS_MAJOR,
K_VERS_MINOR, K_VERS_REV, and the utterly unused MIGRATOR_API_VERSION.
If we go down that route, since this makes a pretty serious difference
in terms of what the user has to deal with post-pg_upgrade, I'd suggest
we require an additional option for the user to pass when stats aren't
going to be migrated, so they are aware of that.
The concern I have here is that we end up changing things in v13 and
suddenly everyone has to re-analyze, even though they didn't have to going
from 10->11 or 11->12, and they'll get caught off-guard by it.
Of course, this might end up having an entirely different effect: it
might mean that we're suddenly a lot shier about changing the stats in a
backwards-incompatible way, just as we now are basically stuck with the
existing on-disk heap format..
Now, I have to admit that I find the process of trying to remember to
bump the correct set of version numbers in every commit just a tad
frustrating; it adds a cognitive burden I'd just as well skip.
Agreed, would be great if we could improve on this.
However, the failure to transfer stats over the years seems to have
actually caused real problems for many users, so I think in this case
we might be best off sucking it up and adding one more version number.
Yes, it's definitely been an issue for users.
We might even want to make it a little more fine-grained and track it
separately by data type, but I'm not sure if that's really worth it.
This would have the nice property that we could just re-analyze the data
types where things changed, something that's more likely to happen with
new data types than existing ones, I'd guess, and so that might be much
more reasonable for users.
Thanks!
Stephen
Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
It seems pretty clear to me that we should somehow transfer stats from
the old server to the new one. Shouldn't it just be a matter of
serializing the MCV/histogram/ndistinct values, then have capabilities
to load on the new server?
The reason pg_upgrade hasn't done that in the past is not wishing to
assume that the new version does stats identically to the old version.
Since we do in fact add stats or change stuff around from time to time,
that's not a negligible consideration.
Sure, but the new version can probably limp along with incomplete stats
until the next natural ANALYZE runs -- the system is operational in much
shorter time than if you have to make it all wait for the post-upgrade
full-database ANALYZE run. The serialization step is so that the
underlying representation doesn't have to remain identical -- surely the
new server would be able to represent whatever the old server was able
to, regardless of any improvement made.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services