vacuumdb changes for stats import/export
On Mon, Jan 06, 2025 at 03:27:18PM -0600, Nathan Bossart wrote:
On Mon, Dec 30, 2024 at 03:45:03PM -0500, Bruce Momjian wrote:
On Mon, Dec 30, 2024 at 12:02:47PM -0800, Jeff Davis wrote:
I suggest that we make a new thread about the vacuumdb changes and
focus this thread and patch series on the pg_dump changes (and minor
flag adjustments to pg_upgrade).Unless you think that the pg_dump changes should block on the vacuumdb
changes? In which case please let me know because the pg_dump changes
are otherwise close to commit.I think that is a good idea. I don't see vacuumdb blocking this.
+1, I've been reviewing the vacuumdb portion and am planning to start a new
thread in the near future. IIUC the bulk of the vacuumdb changes are
relatively noncontroversial, we just haven't reached consensus on the user
interface.
As promised, I'm starting a new thread for this. The original thread [0]/messages/by-id/CADkLM=fR7TwH0cLREQkf5_=KLcOYVxJw0Km0i5MpaWeuDwVo6g@mail.gmail.com
has some preliminary discussion about the subject.
As you may be aware, there is an ongoing effort to carry over statistics
during pg_upgrade. Today, we encourage users to use vacuumdb to run
ANALYZE on all relations after upgrading. There's even a special
--analyze-in-stages option that fast-tracks an initial set of minimal
statistics for this use-case. Once the statistics are carried over by
pg_upgrade, there will be little need to do this, except for perhaps
extended statistics if they aren't carried over. But there are patches in
flight for that, too [1]/messages/by-id/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com.
This thread is dedicated to figuring out what, if anything, to do about
vacuumdb. I see the following general categories of options:
* Do nothing. Other than updating our recommended guidance for
post-upgrade analyzing, we'd leave vacuumdb alone. While this is
certainly a simple option, it has a couple of key drawbacks. For one,
anyone who doesn't see the new vacuumdb guidance may continue to do
unnecessary post-upgrade analyzes. Also, if we don't get the extended
statistics piece completed for v18, users will have to manually construct
ANALYZE commands for those to run post-upgrade.
* Add a breaking change so that users are forced to fix any outdated
post-upgrade scripts. This is what the attached patches do. In short,
they add a required parameter to --analyze-in-stages that can be set to
either "all" or "missing". The new "missing" mode generates ANALYZE
commands for relations that are missing statistics, while the "all" mode
does the same thing that --analyze-in-stages does today. While the
"missing" mode might be useful outside of upgrade cases, we could also
recommend it as a post-upgrade step if the extended statistics work
doesn't get committed for v18.
* Add a new option that will make it easy to ANALYZE any relations that are
missing statistics, but don't make any breaking changes to existing
post-upgrade scripts. This option isn't really strictly necessary if we
get the extended statistics parts committed, but it could be a nice
feature, anyway.
I chose the second approach because it had the most support in the other
thread, but I definitely wouldn't characterize it as a consensus. 0001
simply refactors the main catalog query to its own function so that its
results can be reused in later stages of --analyze-in-stages. This might
require a bit more memory and make --analyze-in-stages less responsive to
concurrent changes, but it wasn't all that responsive to begin with. 0002
adds the new "missing" mode functionality. Note that it regenerates all
statistics for a relation if any applicable statistics types are missing.
It's not clear whether we can or should do any better than that. Corey and
I put a lot of effort into the catalog query changes, and we think we've
covered everything, but we would of course appreciate some review on that
part.
BTW as long as the basic "missing" mode idea seems reasonable, it's easy
enough to adjust the user interface to whatever we want, and I'm happy to
do so as needed.
Finally, I think another open question is whether any of this should apply
to --analyze and/or --analyze-only. We do recommend the latter as a
post-upgrade step in our pg_upgrade documentation, and I could see the
"missing" mode being useful on its own for these modes, too.
Thoughts?
[0]: /messages/by-id/CADkLM=fR7TwH0cLREQkf5_=KLcOYVxJw0Km0i5MpaWeuDwVo6g@mail.gmail.com
[1]: /messages/by-id/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
--
nathan
Thoughts?
I don't have anything to add to what Nathan said, but thought I should say
so since this thread was broken off from my earlier thread. Eagerly
awaiting feedback.
On Fri, Jan 24, 2025 at 7:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Jan 06, 2025 at 03:27:18PM -0600, Nathan Bossart wrote:
On Mon, Dec 30, 2024 at 03:45:03PM -0500, Bruce Momjian wrote:
On Mon, Dec 30, 2024 at 12:02:47PM -0800, Jeff Davis wrote:
I suggest that we make a new thread about the vacuumdb changes and
focus this thread and patch series on the pg_dump changes (and minor
flag adjustments to pg_upgrade).Unless you think that the pg_dump changes should block on the vacuumdb
changes? In which case please let me know because the pg_dump changes
are otherwise close to commit.I think that is a good idea. I don't see vacuumdb blocking this.
+1, I've been reviewing the vacuumdb portion and am planning to start a new
thread in the near future. IIUC the bulk of the vacuumdb changes are
relatively noncontroversial, we just haven't reached consensus on the user
interface.As promised, I'm starting a new thread for this. The original thread [0]
has some preliminary discussion about the subject.As you may be aware, there is an ongoing effort to carry over statistics
during pg_upgrade. Today, we encourage users to use vacuumdb to run
ANALYZE on all relations after upgrading. There's even a special
--analyze-in-stages option that fast-tracks an initial set of minimal
statistics for this use-case. Once the statistics are carried over by
pg_upgrade, there will be little need to do this, except for perhaps
extended statistics if they aren't carried over. But there are patches in
flight for that, too [1].This thread is dedicated to figuring out what, if anything, to do about
vacuumdb. I see the following general categories of options:* Do nothing. Other than updating our recommended guidance for
post-upgrade analyzing, we'd leave vacuumdb alone. While this is
certainly a simple option, it has a couple of key drawbacks. For one,
anyone who doesn't see the new vacuumdb guidance may continue to do
unnecessary post-upgrade analyzes. Also, if we don't get the extended
statistics piece completed for v18, users will have to manually construct
ANALYZE commands for those to run post-upgrade.* Add a breaking change so that users are forced to fix any outdated
post-upgrade scripts. This is what the attached patches do. In short,
they add a required parameter to --analyze-in-stages that can be set to
either "all" or "missing". The new "missing" mode generates ANALYZE
commands for relations that are missing statistics, while the "all" mode
does the same thing that --analyze-in-stages does today. While the
"missing" mode might be useful outside of upgrade cases, we could also
recommend it as a post-upgrade step if the extended statistics work
doesn't get committed for v18.* Add a new option that will make it easy to ANALYZE any relations that are
missing statistics, but don't make any breaking changes to existing
post-upgrade scripts. This option isn't really strictly necessary if we
get the extended statistics parts committed, but it could be a nice
feature, anyway.I chose the second approach because it had the most support in the other
thread, but I definitely wouldn't characterize it as a consensus. 0001
simply refactors the main catalog query to its own function so that its
results can be reused in later stages of --analyze-in-stages. This might
require a bit more memory and make --analyze-in-stages less responsive to
concurrent changes, but it wasn't all that responsive to begin with. 0002
adds the new "missing" mode functionality. Note that it regenerates all
statistics for a relation if any applicable statistics types are missing.
It's not clear whether we can or should do any better than that. Corey and
I put a lot of effort into the catalog query changes, and we think we've
covered everything, but we would of course appreciate some review on that
part.BTW as long as the basic "missing" mode idea seems reasonable, it's easy
enough to adjust the user interface to whatever we want, and I'm happy to
do so as needed.Finally, I think another open question is whether any of this should apply
to --analyze and/or --analyze-only. We do recommend the latter as a
post-upgrade step in our pg_upgrade documentation, and I could see the
"missing" mode being useful on its own for these modes, too.Thoughts?
I've not closely reviewed the patches yet but I find that the second
approach is reasonable to me. IIUC if the extended statistics
import/export work gets committed for v18, we don't need a new option
for post-upgrade analyze. But as you mentioned, these new modes seem
useful for other use cases too. Given that it could be used outside of
post-upgrading, I think it would make more sense to apply the "all"
and "missing" modes to --analyze and --analyze-only options too.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
I had the opportunity to bring this patch set up for discussion at the
developer meeting at FOSDEM PGDay last week [0]https://2025.fosdempgday.org/devmeeting. There seemed to be a
strong consensus that the idea of a "missing only" mode for vacuumdb's
analyze options was useful (especially so if the extended stats piece of
the stats import/export project doesn't make it into v18), but that we
shouldn't change the default behavior of the existing options. Given that,
I have modified the patch set to instead introduce a --missing-only option
that can be used in conjuction with --analyze-only and --analyze-in-stages.
The underlying implementation is the same as in v1 of the patch set, except
for the following changes:
* I've modified the extended stats part of the query to also check for
pg_statistic_ext.stxstattarget IS DISTINCT FROM 0.
* I've added a new clause to check for extended statistics on tables with
inheritance children, i.e., those with pg_statistic_ext_data.stxdinherit
set to true.
* I've added a server version check that disallows --missing-only on
servers older than v15. The catalog query would need some adjustments to
work on older versions, but that didn't seem critically important. We
could always revisit this in the future.
[0]: https://2025.fosdempgday.org/devmeeting
--
nathan
On Wed, Feb 5, 2025 at 4:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
[v2]
I started looking just at 0001, and it seems like a fairly
straightforward rearrangement. I found this comment quite hard to
read:
+ * 'found_objs' should be a fully qualified list of objects to process, as
+ * returned by a previous call to vacuum_one_database(). If *found_objs is
+ * NULL, it is set to the results of the catalog query discussed below. If
+ * found_objs is NULL, the results of the catalog query are not returned.
+ *
+ * If *found_objs is NULL, this function performs a catalog query to retrieve
+ * the list of tables to process. When 'objects' is NULL, all tables in the
I had to read it several times before I noticed the difference between
"* found_objs" and "*found_objs". Maybe some extra spacing and breaks
would help, or other reorganization.
--
John Naylor
Amazon Web Services
On Thu, Feb 27, 2025 at 04:36:04PM +0700, John Naylor wrote:
On Wed, Feb 5, 2025 at 4:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
[v2]
I started looking just at 0001, and it seems like a fairly
straightforward rearrangement.
Thanks for taking a look.
I found this comment quite hard to read:
+ * 'found_objs' should be a fully qualified list of objects to process, as + * returned by a previous call to vacuum_one_database(). If *found_objs is + * NULL, it is set to the results of the catalog query discussed below. If + * found_objs is NULL, the results of the catalog query are not returned. + * + * If *found_objs is NULL, this function performs a catalog query to retrieve + * the list of tables to process. When 'objects' is NULL, all tables in theI had to read it several times before I noticed the difference between
"* found_objs" and "*found_objs". Maybe some extra spacing and breaks
would help, or other reorganization.
Yeah, it's pretty atrocious. I think the main problem is that the
interface is just too complicated, so I'll take a step back and see if I
can make it more understandable to humans. In the meantime, here's an
attempt at adjusting the comment:
* found_objs is a double pointer to a fully qualified list of objects to
* process, as returned by a previous call to vacuum_one_database(). If
* *found_objs (the once-dereferenced double pointer) is NULL, it is set to the
* results of the catalog query discussed below. If found_objs (the double
* pointer) is NULL, the results of the catalog query are not returned.
*
* If *found_objs (the once-dereferenced double pointer) is NULL, this function
* performs a catalog query to retrieve the list of tables to process. When
* "objects" is NULL, all tables in the database are processed. Otherwise, the
* catalog query performs a lookup for the objects listed in "objects".
--
nathan
On Sat, Mar 1, 2025 at 3:42 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Thu, Feb 27, 2025 at 04:36:04PM +0700, John Naylor wrote:
I had to read it several times before I noticed the difference between
"* found_objs" and "*found_objs". Maybe some extra spacing and breaks
would help, or other reorganization.Yeah, it's pretty atrocious. I think the main problem is that the
interface is just too complicated, so I'll take a step back and see if I
can make it more understandable to humans.
The interface is awkward, but on the other hand only a small part has
to really know about it. It's worth trying to make it more readable if
you can.
In the meantime, here's an
attempt at adjusting the comment:
That's better, and if we end up with this interface, we'll want quotes
around the names so the eye can tell where the "*" belong.
--
John Naylor
Amazon Web Services
On Mon, Mar 03, 2025 at 05:58:43PM +0700, John Naylor wrote:
On Sat, Mar 1, 2025 at 3:42 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Thu, Feb 27, 2025 at 04:36:04PM +0700, John Naylor wrote:
I had to read it several times before I noticed the difference between
"* found_objs" and "*found_objs". Maybe some extra spacing and breaks
would help, or other reorganization.Yeah, it's pretty atrocious. I think the main problem is that the
interface is just too complicated, so I'll take a step back and see if I
can make it more understandable to humans.The interface is awkward, but on the other hand only a small part has
to really know about it. It's worth trying to make it more readable if
you can.
True. One small thing we could do is to require "found_objs" (the double
pointer) to always be non-NULL, but that just compels some callers to
provide otherwise-unused variables. I've left the interface alone for now.
In the meantime, here's an attempt at adjusting the comment:
That's better, and if we end up with this interface, we'll want quotes
around the names so the eye can tell where the "*" belong.
I did that in v3. I also tried to break up this comment into bullet points
for better separation and logical flow.
--
nathan
On Mon, Mar 3, 2025 at 11:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Mar 03, 2025 at 05:58:43PM +0700, John Naylor wrote:
True. One small thing we could do is to require "found_objs" (the double
pointer) to always be non-NULL, but that just compels some callers to
provide otherwise-unused variables. I've left the interface alone for now.
One thing to consider is that a pointer named "dummy" is self-documenting.
That's better, and if we end up with this interface, we'll want quotes
around the names so the eye can tell where the "*" belong.I did that in v3. I also tried to break up this comment into bullet points
for better separation and logical flow.
That's much easier to follow, thanks.
--
John Naylor
Amazon Web Services
On Tue, Mar 04, 2025 at 01:05:17PM +0700, John Naylor wrote:
On Mon, Mar 3, 2025 at 11:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
I did that in v3. I also tried to break up this comment into bullet points
for better separation and logical flow.That's much easier to follow, thanks.
Thanks for looking. I'll probably commit 0001 sooner than later so that we
can focus our attention on 0002.
--
nathan
On Wed, Mar 5, 2025 at 12:13 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Mar 04, 2025 at 01:05:17PM +0700, John Naylor wrote:
On Mon, Mar 3, 2025 at 11:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
I did that in v3. I also tried to break up this comment into bullet points
for better separation and logical flow.That's much easier to follow, thanks.
Thanks for looking. I'll probably commit 0001 sooner than later so that we
can focus our attention on 0002.
+1
On 0002:
+ This option can only be used in conjunction with
+ <option>--analyze-only</option> and
<option>--analyze-in-stages</option>.
+ /* Prohibit --missing-only without --analyze-only or --analyze-in-stages */
+ if (vacopts.missing_only && !vacopts.analyze_only)
+ pg_fatal("cannot use the \"%s\" option without \"%s\" or \"%s\"",
+ "missing-only", "analyze-only", "analyze-in-stages");
The first is slightly ambiguous, so maybe "or" is better throughout.
+ " CROSS JOIN LATERAL (SELECT c.relkind IN ('p', 'I')) as p (inherited)\n"
Looking elsewhere in this file, I think we prefer something like
"(c.relkind OPERATOR(pg_catalog.=) ANY (array["
CppAsString2(RELKIND_PARTITIONED_TABLE) ", "
CppAsString2(RELKIND_PARTITIONED_INDEX) "]) as p (inherited)\n"
+ " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n"
+ " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n"
+ " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n"
+ " AND s.stainherit OPERATOR(pg_catalog.=) p.inherited))\n");
IIUC correctly, pg_statistic doesn't store stats on itself, so this
causes the query result to always contain pg_statistic -- does that
get removed elsewhere?
--
John Naylor
Amazon Web Services
On Thu, Mar 06, 2025 at 06:30:59PM +0700, John Naylor wrote:
+ This option can only be used in conjunction with + <option>--analyze-only</option> and <option>--analyze-in-stages</option>.+ /* Prohibit --missing-only without --analyze-only or --analyze-in-stages */ + if (vacopts.missing_only && !vacopts.analyze_only) + pg_fatal("cannot use the \"%s\" option without \"%s\" or \"%s\"", + "missing-only", "analyze-only", "analyze-in-stages");The first is slightly ambiguous, so maybe "or" is better throughout.
Agreed.
+ " CROSS JOIN LATERAL (SELECT c.relkind IN ('p', 'I')) as p (inherited)\n"
Looking elsewhere in this file, I think we prefer something like
"(c.relkind OPERATOR(pg_catalog.=) ANY (array["
CppAsString2(RELKIND_PARTITIONED_TABLE) ", "
CppAsString2(RELKIND_PARTITIONED_INDEX) "]) as p (inherited)\n"
Fixed.
+ " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n" + " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n" + " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n" + " AND s.stainherit OPERATOR(pg_catalog.=) p.inherited))\n");IIUC correctly, pg_statistic doesn't store stats on itself, so this
causes the query result to always contain pg_statistic -- does that
get removed elsewhere?
Good catch. I think the current behavior is to call ANALYZE on
pg_statistic, too, but that should be mostly harmless (analyze_rel()
refuses to process it). I suppose we could try to avoid returning
pg_statistic from the catalog query, but we don't bother doing that for any
other vacuumdb modes, so I'm tempted to leave it alone.
--
nathan
On Fri, Mar 7, 2025 at 4:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Thu, Mar 06, 2025 at 06:30:59PM +0700, John Naylor wrote:
IIUC correctly, pg_statistic doesn't store stats on itself, so this
causes the query result to always contain pg_statistic -- does that
get removed elsewhere?Good catch. I think the current behavior is to call ANALYZE on
pg_statistic, too, but that should be mostly harmless (analyze_rel()
refuses to process it). I suppose we could try to avoid returning
pg_statistic from the catalog query, but we don't bother doing that for any
other vacuumdb modes, so I'm tempted to leave it alone.
Okay, thanks for confirming. I have no further comments.
--
John Naylor
Amazon Web Services
On Mon, Mar 10, 2025 at 12:35:22PM +0700, John Naylor wrote:
I have no further comments.
Thanks. I'll give this a little more time for review before committing.
We'll still need to update the recommendation in pg_upgrade's
documentation. I'm going to keep that separate because the stats
import/export work is still settling.
--
nathan
On Mon, Mar 10, 2025 at 10:08:49AM -0500, Nathan Bossart wrote:
On Mon, Mar 10, 2025 at 12:35:22PM +0700, John Naylor wrote:
I have no further comments.
Thanks. I'll give this a little more time for review before committing.
I realized that we could limit the catalog query reuse to only when
--missing-only is specified, so I've updated 0001 and 0002 accordingly.
This avoids changing any existing behavior.
We'll still need to update the recommendation in pg_upgrade's
documentation. I'm going to keep that separate because the stats
import/export work is still settling.
0003 is a first attempt at this. Unless I am missing something, there's
really not much to update.
--
nathan
Attachments:
v5-0001-vacuumdb-Teach-vacuum_one_database-to-reuse-query.patchtext/plain; charset=us-asciiDownload+194-138
v5-0002-vacuumdb-Add-option-for-analyzing-only-relations-.patchtext/plain; charset=us-asciiDownload+207-3
v5-0003-Update-guidance-for-running-vacuumdb-after-pg_upg.patchtext/plain; charset=us-asciiDownload+2-3
On Wed, Mar 12, 2025 at 12:00 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Mon, Mar 10, 2025 at 10:08:49AM -0500, Nathan Bossart wrote:
On Mon, Mar 10, 2025 at 12:35:22PM +0700, John Naylor wrote:
I have no further comments.
Thanks. I'll give this a little more time for review before committing.
I realized that we could limit the catalog query reuse to only when
--missing-only is specified, so I've updated 0001 and 0002 accordingly.
This avoids changing any existing behavior.We'll still need to update the recommendation in pg_upgrade's
documentation. I'm going to keep that separate because the stats
import/export work is still settling.0003 is a first attempt at this. Unless I am missing something, there's
really not much to update.
The change here seems fine. My only quibble is that this sentence now
seems out of place: "Option --analyze-in-stages can be used to
generate minimal statistics quickly." I'm thinking we should make a
clearer separation for with and without the --no-statistics option
specified.
--
John Naylor
Amazon Web Services
On Wed, Mar 12, 2025 at 01:56:04PM +0700, John Naylor wrote:
The change here seems fine. My only quibble is that this sentence now
seems out of place: "Option --analyze-in-stages can be used to
generate minimal statistics quickly." I'm thinking we should make a
clearer separation for with and without the --no-statistics option
specified.
I think the recommendation stays the same regardless of whether
--no-statistics was used. --missing-only should do the right think either
way. But I do agree that this paragraph feels a bit haphazard. I modified
it to call out the full command for both --analyze-only and
--analyze-in-stages, and I moved the note about --jobs to its own sentence
and made it clear that it is applicable to both of the suggested vacuumdb
commands.
--
nathan
Attachments:
v6-0001-vacuumdb-Teach-vacuum_one_database-to-reuse-query.patchtext/plain; charset=us-asciiDownload+194-138
v6-0002-vacuumdb-Add-option-for-analyzing-only-relations-.patchtext/plain; charset=us-asciiDownload+207-3
v6-0003-Update-guidance-for-running-vacuumdb-after-pg_upg.patchtext/plain; charset=us-asciiDownload+6-6
Out of curiosity, I generated many relations with the following command
(stolen from [0]/messages/by-id/3612876.1689443232@sss.pgh.pa.us):
do $$
begin
for i in 1..100000 loop
execute format('create table t%s (f1 int unique, f2 int unique);', i);
execute format('insert into t%s select x, x from generate_series(1,1000) x',
i);
if i % 100 = 0 then commit; end if;
end loop;
end
$$;
And then I ran a database-wide ANALYZE. Without --missing-only, vacuumdb's
catalog query took 65 ms. With --missing-only, it took 735 ms. While
that's a big jump, this query will only run once for a given vacuumdb, and
--missing-only is likely to save a lot of time elsewhere.
If no feedback or objections materialize, I'm planning to commit these
early next week.
[0]: /messages/by-id/3612876.1689443232@sss.pgh.pa.us
--
nathan
On Fri, Mar 14, 2025 at 02:05:30PM -0500, Nathan Bossart wrote:
If no feedback or objections materialize, I'm planning to commit these
early next week.
While preparing this for commit, I noticed that the expression index part
of the query was disregarding attstattarget. To fix, I've modified that
part to look at the index's pg_attribute entries.
--
nathan