Proposal: ANALYZE (MODIFIED_STATS) using autoanalyze thresholds

Started by VASUKI M2 months ago7 messageshackers

vasukianand0119@gmail.com

2 months ago

Hi all,

Following up on the recent discussion around ANALYZE
(MISSING_STATS_ONLY):[1][ /messages/by-id/CAE2r8H61ZtT4ek3JmLKDPmr7ALQ0uE9WSwwJRFHbXm0WdOJnEQ@mail.gmail.com ]

I would like to start a separate discussion about a potential ANALYZE
(MODIFIED_STATS) option.

The idea is to allow manual ANALYZE to reuse the same threshold logic that
autoanalyze uses, so that when a user explicitly runs ANALYZE, only
relations that have crossed the modification threshold are processed.

Conceptually, this would use the existing formula:

analyze threshold = analyze_base_threshold
+ analyze_scale_factor * reltuples

and compare it against n_mod_since_analyze, similar to how autovacuum
decides when to trigger analyze.

The goal is not to replace autoanalyze, but to expose its decision model at
SQL level for deterministic, user-controlled execution. For example:

-Running maintenance in scripted environments
-Triggering analysis immediately after batch data loads
-Avoiding unnecessary work when running manual ANALYZE across many relations
- Environments where autovacuum is tuned conservatively or partially
disabled

Autoanalyze runs opportunistically in the background. This proposal would
allow a user to apply the same threshold logic explicitly and immediately.

A possible usage would look like: ANALYZE (MODIFIED_STATS);

I understand there is conceptual overlap with autovacuum, so I would
especially appreciate feedback on:

-Whether exposing the threshold logic at SQL level makes sense
architecturally
-Whether this should remain an explicit opt-in option
-Naming (e.g., MODIFIED_STATS vs SKIP_UNMODIFIED or something clearer)
-Whether the thresholds should reuse existing GUCs or accept per-command
overrides

I intentionally kept this separate from MISSING_STATS_ONLY, since that
option answers a different question (“are stats missing?”) while this one
would answer (“have enough rows changed to justify re-analysis?”).

I would greatly appreciate thoughts before working on a prototype patch.

Thanks again for all the feedback so far — it has been very helpful
,expecting here the same.

Regards,
Vasuki M
C-DAC,Chennai

[1]: [ /messages/by-id/CAE2r8H61ZtT4ek3JmLKDPmr7ALQ0uE9WSwwJRFHbXm0WdOJnEQ@mail.gmail.com ]
/messages/by-id/CAE2r8H61ZtT4ek3JmLKDPmr7ALQ0uE9WSwwJRFHbXm0WdOJnEQ@mail.gmail.com
]

Nathan Bossart

nathandbossart@gmail.com

2 months ago

In reply to: VASUKI M (#1)

Re: Proposal: ANALYZE (MODIFIED_STATS) using autoanalyze thresholds

On Tue, Feb 17, 2026 at 04:12:58PM +0530, VASUKI M wrote:

I would greatly appreciate thoughts before working on a prototype patch.

At the risk of substantially expanding the scope of these patches, I wonder
if a better long-term approach would be to first centralize the
autovacuum/autoanalyze prioritization code and share it via a system view
that vacuumdb could use in place of the giant query for
--missing-stats-only. That would also give users visibility into
auto{vacuum,analyze}'s decisions. TBH I'm not totally sold on the idea of
giving ANALYZE more options for this stuff, if for no other reason than I'm
not really following the concrete use-cases, but I wouldn't say I'm
mortally opposed to it.

--
nathan

David Rowley

dgrowleyml@gmail.com

2 months ago

In reply to: Nathan Bossart (#2)

Re: Proposal: ANALYZE (MODIFIED_STATS) using autoanalyze thresholds

On Wed, 18 Feb 2026 at 06:35, Nathan Bossart <nathandbossart@gmail.com> wrote:

On Tue, Feb 17, 2026 at 04:12:58PM +0530, VASUKI M wrote:

I would greatly appreciate thoughts before working on a prototype patch.

At the risk of substantially expanding the scope of these patches, I wonder
if a better long-term approach would be to first centralize the
autovacuum/autoanalyze prioritization code and share it via a system view
that vacuumdb could use in place of the giant query for
--missing-stats-only. That would also give users visibility into
auto{vacuum,analyze}'s decisions. TBH I'm not totally sold on the idea of
giving ANALYZE more options for this stuff, if for no other reason than I'm
not really following the concrete use-cases, but I wouldn't say I'm
mortally opposed to it.

Yeah, I'm not at all excited about adding options to ANALYZE for this
sort of thing either. I agree with the VIEW idea. If we had the vacuum
scoring stuff, I imagined it'd be useful to have a view that lists
tables and their vacuum/analyze score. If people want to script this
sort of thing then querying that view and running analyze on the
returned tables seems simple enough. I imagined a common thing that
people might want to do would be freeze tables that have a freeze age
somewhere close to autovacuum_freeze_max_age while off-peak so that
autovacuum doesn't trigger for that on-peak. If we allow ANALYZE to
have the proposed option, then we're opening a can of worms for
various other possible requirements for similar options in the VACUUM
command.

I also agree that vacuumdb could be a good place to code this up.

David

Corey Huinker

corey.huinker@gmail.com

about 2 months ago

In reply to: David Rowley (#3)

Re: Proposal: ANALYZE (MODIFIED_STATS) using autoanalyze thresholds

Yeah, I'm not at all excited about adding options to ANALYZE for this
sort of thing either. I agree with the VIEW idea. If we had the vacuum
scoring stuff, I imagined it'd be useful to have a view that lists
tables and their vacuum/analyze score.

Several of the stats import iterations that never made it to commits had
exportable stats view, so I took a stab at making pg_missing_stats, and a
couple of problems emerged.

1. The view would need to expose pg_class.oid so that it could be joined to
the listed_object CTE. Our pattern seems to be schemaname+tablename with no
exposed oids, something that has caused a problem in the past, namely in
pg_dump when batching attribute stats. Is exposing oids in these views a
no-no?
2. The view either needs the pg_class.oid to join back to pg_class to get
relkind, relpersistence, relfrozenxid, relminmxid, etc...or it needs to
expose those columns as a part of the view, which would be problematic when
vacuumdb and other apps decide that they need to filter on some other part
of pg_class, and then we've got different versions of the system view which
makes that option highly unattractive.

If we instead did a system function `pg_rel_is_missing_stats(oid) returns
boolean`, but it would still need to sanity check on relkind and filter on
relpersistence and inherited.

So either way we're doing some self-joins on pg_class, probably with a
security barrier.

Sami Imseih

samimseih@gmail.com

about 2 months ago

In reply to: Corey Huinker (#4)

Re: Proposal: ANALYZE (MODIFIED_STATS) using autoanalyze thresholds

Hi,

Yeah, I'm not at all excited about adding options to ANALYZE for this
sort of thing either. I agree with the VIEW idea. If we had the vacuum
scoring stuff, I imagined it'd be useful to have a view that lists
tables and their vacuum/analyze score.

If we instead did a system function `pg_rel_is_missing_stats(oid) returns boolean`, but it would still need to sanity check on relkind and filter on relpersistence and inherited.

So either way we're doing some self-joins on pg_class, probably with a security barrier.

It seems like this thread is now discussing both the modified stats
option ( this thread )
and the skipped stats option [1][/messages/by-id/CAE2r8H61ZtT4ek3JmLKDPmr7ALQ0uE9WSwwJRFHbXm0WdOJnEQ@mail.gmail.com%5D, which will be hard to follow.

But from a high-level, I am not sure how providing a VIEW or functions
in which users
will be required to script out the ANALYZE commands to run is better
than providing
ANALYZE options? the difference is putting the burden on the user to
come up with
the SQL script vs a simple ANALYZE option. IMO, the latter is more appealing
from a user perspective.

I think the view is a good idea, and this was discussed a bit here [2][/messages/by-id/CAApHDvpVE5F-_8rpPC+-L98mA0yK0S_jtQGqLn69fkRevf726g@mail.gmail.com%5D, if we
do implement a scoring algorithm, but this view will be more for
monitoring/visibility.

[1]: [/messages/by-id/CAE2r8H61ZtT4ek3JmLKDPmr7ALQ0uE9WSwwJRFHbXm0WdOJnEQ@mail.gmail.com%5D
[2]: [/messages/by-id/CAApHDvpVE5F-_8rpPC+-L98mA0yK0S_jtQGqLn69fkRevf726g@mail.gmail.com%5D

--
Sami Imseih
Amazon Web Services (AWS)

Nathan Bossart

nathandbossart@gmail.com

about 2 months ago

In reply to: Corey Huinker (#4)

Re: Proposal: ANALYZE (MODIFIED_STATS) using autoanalyze thresholds

On Tue, Feb 24, 2026 at 08:45:26AM -0500, Corey Huinker wrote:

1. The view would need to expose pg_class.oid so that it could be joined to
the listed_object CTE. Our pattern seems to be schemaname+tablename with no
exposed oids, something that has caused a problem in the past, namely in
pg_dump when batching attribute stats. Is exposing oids in these views a
no-no?

I don't think so. Counterexamples include pg_stat_activity,
pg_stat_subscription, pg_stat_database, pg_stat_all_tables, and more. If
anything, maybe we should consider _adding_ oid columns to views like
pg_stats_ext...

2. The view either needs the pg_class.oid to join back to pg_class to get
relkind, relpersistence, relfrozenxid, relminmxid, etc...or it needs to
expose those columns as a part of the view, which would be problematic when
vacuumdb and other apps decide that they need to filter on some other part
of pg_class, and then we've got different versions of the system view which
makes that option highly unattractive.

Hm. I don't know how much to worry about this. System view/catalog
changes do happen from time to time, so unless we have a bunch of possible
changes that will happen soon, I'd expect it to probably be okay.

If we instead did a system function `pg_rel_is_missing_stats(oid) returns
boolean`, but it would still need to sanity check on relkind and filter on
relpersistence and inherited.

Couldn't the function handle those sanity checks?

--
nathan

Nathan Bossart

nathandbossart@gmail.com

about 2 months ago

In reply to: Sami Imseih (#5)

Re: Proposal: ANALYZE (MODIFIED_STATS) using autoanalyze thresholds

On Tue, Feb 24, 2026 at 08:19:55AM -0600, Sami Imseih wrote:

But from a high-level, I am not sure how providing a VIEW or functions in
which users will be required to script out the ANALYZE commands to run is
better than providing ANALYZE options? the difference is putting the
burden on the user to come up with the SQL script vs a simple ANALYZE
option. IMO, the latter is more appealing from a user perspective.

My principle objection isn't with the option itself, it's that _just_
adding the option papers over the current state of affairs without taking
the opportunity to do a bunch of really useful centralization/refactoring.
By exposing the auto{vacuum,analyze} prioritization code and building off
that, we get a consistent picture wherever it is needed (e.g.,
auto{vacuum,analyze}, vacuumdb, monitoring tools, custom scripts, and any
new VACUUM/ANALYZE options) without having to reinvent the wheel each time.
Sure, we could have vacuumdb just use MISSING_STATS_ONLY for every command
in --missing-stats-only mode, but that seems like a lot of overhead, and we
still aren't helping with observability (which I expect to become even more
important as autovacuum's prioritization code evolves).

I'm admittedly waving my hands a bit here, but in any case, I think this is
an opportunity to do a bunch of super useful improvements that will have
much bigger benefits down the road.

--
nathan