another autovacuum scheduling thread

Started by Nathan Bossart, 5 months ago, 103 messages
#1 Nathan Bossart
nathandbossart@gmail.com

/me dons flame-proof suit

My goal with this thread is to produce some incremental autovacuum
scheduling improvements for v19, but realistically speaking, I know that
it's a bit of a long-shot. There have been many discussions over the
years, and I've read through a few of them [0]/messages/by-id/CA+TgmoafJPjB3WVqB3FrGWUU4NLRc3VHx8GXzLL-JM++JPwK+Q@mail.gmail.com [1]/messages/by-id/CAEG8a3+3fwQbgzak+h3Q7Bp=vK_aWhw1X7w7g5RCgEW9ufdvtA@mail.gmail.com [2]/messages/by-id/CAD21AoBUaSRBypA6pd9ZD=U-2TJCHtbyZRmrS91Nq0eVQ0B3BA@mail.gmail.com [3]/messages/by-id/CA+TgmobT3m=+dU5HF3VGVqiZ2O+v6P5wN1Gj+Prq+hj7dAm9AQ@mail.gmail.com [4]/messages/by-id/20130124215715.GE4528@alvh.no-ip.org, but there
are certainly others I haven't found. Since this seems to be a contentious
topic, I figured I'd start small to see if we can get _something_
committed.

While I am by no means wedded to a specific idea, my current concrete
proposal (proof-of-concept patch attached) is to start by ordering the
tables a worker will process by (M)XID age. Here are the reasons:

* We do some amount of prioritization of databases at risk of wraparound at
the database level, per the following comment from autovacuum.c:

* Choose a database to connect to. We pick the database that was least
* recently auto-vacuumed, or one that needs vacuuming to prevent Xid
* wraparound-related data loss. If any db at risk of Xid wraparound is
* found, we pick the one with oldest datfrozenxid, independently of
* autovacuum times; similarly we pick the one with the oldest datminmxid
* if any is in MultiXactId wraparound. Note that those in Xid wraparound
* danger are given more priority than those in multi wraparound danger.

However, we do no such prioritization of the tables within a database. In
fact, the ordering of the tables is effectively random. IMHO this gives us
quite a bit of wiggle room to experiment; since we are processing tables in
no specific order today, changing the order to something vacuuming-related
seems more likely to help than to harm.

* Prioritizing tables based on their (M)XID age might help avoid more
aggressive vacuums, not to mention wraparound. Of course, there are
scenarios where this doesn't work. For example, the age of a table may
have changed greatly between the time we recorded it and the time we
process it. Or maybe there is another table in a different database that
is more important from a wraparound perspective. We could complicate the
patch to try to handle some of these things, but I maintain that even some
basic, incremental scheduling improvements would be better than the status
quo. And we can always change it further in the future to handle these
problems and to consider other things like bloat.

The attached patch works by storing the maximum of the XID age and the MXID
age in the list with the OIDs and sorting it prior to processing.
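In rough terms, the ordering amounts to the following (a Python sketch for illustration, not the actual C patch; the tuple layout is an assumption):

```python
def order_tables_for_worker(tables):
    """Sort candidate tables so those with the oldest (M)XID come first.

    `tables` is a list of (oid, xid_age, mxid_age) tuples; the patch keeps
    the maximum of the two ages alongside each OID and sorts on it.
    """
    return sorted(tables, key=lambda t: max(t[1], t[2]), reverse=True)

# Three hypothetical tables: the one with the highest max(XID age, MXID age)
# is scheduled first.
tables = [(16384, 100_000, 50_000),
          (16385, 900_000, 10_000),
          (16386, 5_000, 700_000)]
print([oid for oid, _, _ in order_tables_for_worker(tables)])  # [16385, 16386, 16384]
```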

Thoughts?

[0]: /messages/by-id/CA+TgmoafJPjB3WVqB3FrGWUU4NLRc3VHx8GXzLL-JM++JPwK+Q@mail.gmail.com
[1]: /messages/by-id/CAEG8a3+3fwQbgzak+h3Q7Bp=vK_aWhw1X7w7g5RCgEW9ufdvtA@mail.gmail.com
[2]: /messages/by-id/CAD21AoBUaSRBypA6pd9ZD=U-2TJCHtbyZRmrS91Nq0eVQ0B3BA@mail.gmail.com
[3]: /messages/by-id/CA+TgmobT3m=+dU5HF3VGVqiZ2O+v6P5wN1Gj+Prq+hj7dAm9AQ@mail.gmail.com
[4]: /messages/by-id/20130124215715.GE4528@alvh.no-ip.org

--
nathan

Attachments:

v1-0001-autovacuum-order-tables-by-m-xid-age.patch (text/plain, +41 -8)
#2 Sami Imseih
samimseih@gmail.com
In reply to: Nathan Bossart (#1)
Re: another autovacuum scheduling thread

Thanks for raising this topic! I agree that autovacuum scheduling
could be improved.

> * Prioritizing tables based on their (M)XID age might help avoid more
> aggressive vacuums, not to mention wraparound. Of course, there are
> scenarios where this doesn't work. For example, the age of a table may
> have changed greatly between the time we recorded it and the time we
> process it. Or maybe there is another table in a different database that
> is more important from a wraparound perspective. We could complicate the
> patch to try to handle some of these things, but I maintain that even some
> basic, incremental scheduling improvements would be better than the status
> quo. And we can always change it further in the future to handle these
> problems and to consider other things like bloat.

One risk I see with this approach is that we will end up autovacuuming
tables that also take the longest time to complete, which could cause
smaller, quick-to-process tables to be neglected.

It's not always the case that the oldest tables in terms of (M)XID age
are also the most expensive to vacuum, but more often than not it is.

I'm not saying that the current approach, which as you mention is
effectively random, is any better; however, this approach will likely
make it more common for large tables to saturate the workers.

But I also see the merit of this approach when we know we are in
failsafe territory, because then I would want my oldest tables to be
autovacuumed first.

--
Sami Imseih
Amazon Web Services (AWS)

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Sami Imseih (#2)
Re: another autovacuum scheduling thread

On 2025-Oct-08, Sami Imseih wrote:

> One risk I see with this approach is that we will end up autovacuuming
> tables that also take the longest time to complete, which could cause
> smaller, quick-to-process tables to be neglected.

Perhaps we can have autovacuum workers decide on a mode to use at
startup (or launcher decides for them), and use different prioritization
heuristics depending on the mode. For instance if we're past max freeze
age for any tables then we know we have to first vacuum tables with
higher MXID ages regardless of size considerations, but if there's at
least one worker in that mode then we use the mode where smaller
high-churn tables go first.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"We do not dare many things because they are difficult,
but they are difficult because we do not dare to do them" (Seneca)

#4 Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#1)
Re: another autovacuum scheduling thread

Hi,

On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:

> However, we do no such prioritization of the tables within a database. In
> fact, the ordering of the tables is effectively random.

We don't prioritize tables, but I don't think the order really is random?
Isn't it basically in the order in which the data is in pg_class? That
typically won't change from one autovacuum pass to the next...

> * Prioritizing tables based on their (M)XID age might help avoid more
> aggressive vacuums, not to mention wraparound. Of course, there are
> scenarios where this doesn't work. For example, the age of a table may
> have changed greatly between the time we recorded it and the time we
> process it.
>
> Or maybe there is another table in a different database that
> is more important from a wraparound perspective.

That seems like something no ordering within a single AV worker can address. I
think it's fine to just define that to be out of scope.

> We could complicate the patch to try to handle some of these things, but I
> maintain that even some basic, incremental scheduling improvements would be
> better than the status quo. And we can always change it further in the
> future to handle these problems and to consider other things like bloat.

Agreed! It doesn't take much to be better at scheduling than "order in
pg_class".

> The attached patch works by storing the maximum of the XID age and the MXID
> age in the list with the OIDs and sorting it prior to processing.

I think it may be worth trying to avoid reliably using the same order -
otherwise e.g. a corrupt index on the first scheduled table can cause
autovacuum to reliably fail on the same relation, never allowing it to
progress past that point.
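One simple way to avoid a perfectly deterministic order (purely an illustrative sketch, not something in the patch) is to quantize the ages into coarse buckets and break ties randomly, so tables of similar age are shuffled among themselves on each pass:

```python
import random

def order_with_jitter(tables, bucket=100_000):
    """Sort by max (M)XID age, but quantize ages into buckets and break
    ties randomly, so two passes over the same tables need not produce
    the same order and one failing table can't always be scheduled first.

    `tables` is a list of (oid, xid_age, mxid_age) tuples (an assumed
    layout for illustration)."""
    keyed = [(max(xid_age, mxid_age) // bucket, random.random(), oid)
             for oid, xid_age, mxid_age in tables]
    keyed.sort(reverse=True)
    return [oid for _, _, oid in keyed]

# Table 1 is far older than the others and always sorts first; tables 2
# and 3 fall in the same bucket and come out in random relative order.
tables = [(1, 10_000_000, 0), (2, 50_000, 0), (3, 60_000, 0)]
print(order_with_jitter(tables))
```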

Greetings,

Andres Freund

#5 Sami Imseih
samimseih@gmail.com
In reply to: Sami Imseih (#2)
Re: another autovacuum scheduling thread

> Not saying that the current approach, which is as you mention is
> random, is any better, however this approach will likely increase
> the behavior of large tables saturating workers.

Maybe it would be good to allocate some workers to the oldest tables
and other workers to a randomized list? This could balance things
out between the oldest (large) tables and everything else to avoid
this problem.

--
Sami Imseih
Amazon Web Services (AWS)

#6 Jeremy Schneider
schneider@ardentperf.com
In reply to: Sami Imseih (#2)
Re: another autovacuum scheduling thread

On Wed, 8 Oct 2025 12:06:29 -0500
Sami Imseih <samimseih@gmail.com> wrote:

> One risk I see with this approach is that we will end up autovacuuming
> tables that also take the longest time to complete, which could cause
> smaller, quick-to-process tables to be neglected.
>
> It's not always the case that the oldest tables in terms of (M)XID age
> are also the most expensive to vacuum, but that is often more true
> than not.

I think an approach of doing largest objects first actually might work
really well for balancing work amongst autovacuum workers. Many years
ago I designed a system to backup many databases with a pool of workers
and used this same simple & naive algorithm of just reverse sorting on
db size, and it worked remarkably well. If you have one big thing then
you probably want someone to get started on that first. As long as
there's a pool of workers available, as you work through the queue, you
can actually end up with pretty optimal use of all the workers.

-Jeremy

#7 David Rowley
dgrowleyml@gmail.com
In reply to: Jeremy Schneider (#6)
Re: another autovacuum scheduling thread

On Thu, 9 Oct 2025 at 12:41, Jeremy Schneider <schneider@ardentperf.com> wrote:

> I think an approach of doing largest objects first actually might work
> really well for balancing work amongst autovacuum workers. Many years
> ago I designed a system to backup many databases with a pool of workers
> and used this same simple & naive algorithm of just reverse sorting on
> db size, and it worked remarkably well. If you have one big thing then
> you probably want someone to get started on that first. As long as
> there's a pool of workers available, as you work through the queue, you
> can actually end up with pretty optimal use of all the workers.

I believe that methodology for processing work applies much better
in scenarios where there's no new work continually arriving and
there's no adverse effect from giving a lower priority to certain
portions of the work. I don't think you can apply it so easily to
autovacuum, as there are scenarios where the work can pile up faster
than it can be handled. Also, smaller tables can bloat, in terms of
growth proportional to the original table size, much more quickly than
larger tables, and that could have huge consequences for queries on
small tables which are not indexed sufficiently to handle becoming
bloated and large.

David

#8 Jeremy Schneider
schneider@ardentperf.com
In reply to: David Rowley (#7)
Re: another autovacuum scheduling thread

On Thu, 9 Oct 2025 12:59:23 +1300
David Rowley <dgrowleyml@gmail.com> wrote:

> I believe that is methodology for processing work applies much better
> in scenarios where there's no new work continually arriving and
> there's no adverse effects from giving a lower priority to certain
> portions of the work. I don't think you can apply that so easily to
> autovacuum as there are scenarios where the work can pile up faster
> than it can be handled. Also, smaller tables can bloat in terms of
> growth proportional to the original table size much more quickly than
> larger tables and that could have huge consequences for queries to
> small tables which are not indexed sufficiently to handle being
> becoming bloated and large.

I'm arguing that it works well with autovacuum. Not saying there aren't
going to be certain workloads that it's suboptimal for. We're talking
about sorting by (M)XID age. As the clock continues to move forward any
table that doesn't get processed naturally moves up the queue for the
next autovac run. I think the concerns are minimal here and this would
be a good change in general.

-Jeremy

--
To know the thoughts and deeds that have marked man's progress is to
feel the great heart throbs of humanity through the centuries; and if
one does not feel in these pulsations a heavenward striving, one must
indeed be deaf to the harmonies of life.

Helen Keller, The Story Of My Life, 1902, 1903, 1905, introduction by
Ralph Barton Perry (Garden City, NY: Doubleday & Company, 1954), p90.

#9 Jeremy Schneider
schneider@ardentperf.com
In reply to: Jeremy Schneider (#8)
Re: another autovacuum scheduling thread

On Wed, 8 Oct 2025 17:27:27 -0700
Jeremy Schneider <schneider@ardentperf.com> wrote:

> On Thu, 9 Oct 2025 12:59:23 +1300
> David Rowley <dgrowleyml@gmail.com> wrote:
>
>> I believe that is methodology for processing work applies much
>> better in scenarios where there's no new work continually arriving
>> and there's no adverse effects from giving a lower priority to
>> certain portions of the work. I don't think you can apply that so
>> easily to autovacuum as there are scenarios where the work can pile
>> up faster than it can be handled. Also, smaller tables can bloat
>> in terms of growth proportional to the original table size much
>> more quickly than larger tables and that could have huge
>> consequences for queries to small tables which are not indexed
>> sufficiently to handle being becoming bloated and large.
>
> I'm arguing that it works well with autovacuum. Not saying there
> aren't going to be certain workloads that it's suboptimal for. We're
> talking about sorting by (M)XID age. As the clock continues to move
> forward any table that doesn't get processed naturally moves up the
> queue for the next autovac run. I think the concerns are minimal here
> and this would be a good change in general.

Hmm, it doesn't work quite like that if the full queue needs to be
processed before the next iteration ~ but at steady state these small
tables are going to get processed at the same rate whether they were at
the top or bottom of the queue, right?

And in non-steady-state conditions, this seems like a better order than
pg_class ordering?

-Jeremy

#10 David Rowley
dgrowleyml@gmail.com
In reply to: Jeremy Schneider (#8)
Re: another autovacuum scheduling thread

On Thu, 9 Oct 2025 at 13:27, Jeremy Schneider <schneider@ardentperf.com> wrote:

> I'm arguing that it works well with autovacuum. Not saying there aren't
> going to be certain workloads that it's suboptimal for. We're talking
> about sorting by (M)XID age. As the clock continues to move forward any
> table that doesn't get processed naturally moves up the queue for the
> next autovac run. I think the concerns are minimal here and this would
> be a good change in general.

I thought that if we're to have a priority queue, it would be hard to
argue against sorting by how far over its autovacuum threshold each
table is. For example, a table that just meets the dead-row count
required to trigger autovacuum based on the
autovacuum_vacuum_scale_factor setting gets a priority of 1.0, while
another table whose n_mod_since_analyze is twice over the
autovacuum_analyze_scale_factor threshold gets priority 2.0.
Effectively, prioritise by the percentage over the given threshold the
table is. That way users could still tune things when they weren't
happy with the priority given to a table by adjusting the
corresponding reloption.

It just seems strange to me to only account for 1 of the 4 trigger
points for autovacuum when it's possible to account for all 4 without
much extra trouble.

David

#11 Jeremy Schneider
schneider@ardentperf.com
In reply to: David Rowley (#10)
Re: another autovacuum scheduling thread

On Thu, 9 Oct 2025 14:03:34 +1300
David Rowley <dgrowleyml@gmail.com> wrote:

> I thought if we're to have a priority queue that it would be hard to
> argue against sorting by how far over the given auto-vacuum threshold
> that the table is. If you assume that a table that just meets the
> dead rows required to trigger autovacuum based on the
> autovacuum_vacuum_scale_factor setting gets a priority of 1.0, but
> another table that has n_mod_since_analyze twice over the
> autovacuum_analyze_scale_factor gets priority 2.0. Effectively,
> prioritise by the percentage over the given threshold the table is.
> That way users could still tune things when they weren't happy with
> the priority given to a table by adjusting the corresponding
> reloption.

If users are tuning this thing then I feel like we've already lost the
battle :)

On a healthy system, autovac runs continually and hits tables at
regular intervals based on their steady state change rates. We have
existing knobs (for better or worse) that people can use to tell PG to
hit certain tables more frequently, to get rid of sleeps/delays, etc.

With our fleet of PG databases here, my current approach is geared
toward setting log_autovacuum_min_duration to some conservative value
fleet-wide, then monitoring based on the logs for any cases where it
runs longer than a defined threshold. I'm able to catch problems sooner
this way, versus monitoring on xid age alone.

Whenever there are problems with autovacuum, the actual issue is never
going to be resolved by what order autovacuum processes tables. I don't
think we should encourage any tunables here... to me it seems like
putting focus entirely in the wrong place.

-Jeremy

#12 Jeremy Schneider
schneider@ardentperf.com
In reply to: Jeremy Schneider (#11)
Re: another autovacuum scheduling thread

On Wed, 8 Oct 2025 18:25:20 -0700
Jeremy Schneider <schneider@ardentperf.com> wrote:

> On Thu, 9 Oct 2025 14:03:34 +1300
> David Rowley <dgrowleyml@gmail.com> wrote:
>
>> I thought if we're to have a priority queue that it would be hard to
>> argue against sorting by how far over the given auto-vacuum
>> threshold that the table is. If you assume that a table that just
>> meets the dead rows required to trigger autovacuum based on the
>> autovacuum_vacuum_scale_factor setting gets a priority of 1.0, but
>> another table that has n_mod_since_analyze twice over the
>> autovacuum_analyze_scale_factor gets priority 2.0. Effectively,
>> prioritise by the percentage over the given threshold the table is.
>> That way users could still tune things when they weren't happy with
>> the priority given to a table by adjusting the corresponding
>> reloption.
>
> If users are tuning this thing then I feel like we've already lost the
> battle :)

I replied too quickly. Re-reading your email, I think you're proposing
a different algorithm, taking tuple counts into account. No tunables.
Is there a fully fleshed-out version of the proposed alternative
algorithm somewhere? (one of the older threads?) I guess this is why
it's so hard to get anything committed in this area...

-J

#13 David Rowley
dgrowleyml@gmail.com
In reply to: Jeremy Schneider (#12)
Re: another autovacuum scheduling thread

On Thu, 9 Oct 2025 at 14:47, Jeremy Schneider <schneider@ardentperf.com> wrote:

> On Wed, 8 Oct 2025 18:25:20 -0700
> Jeremy Schneider <schneider@ardentperf.com> wrote:
>
>> If users are tuning this thing then I feel like we've already lost the
>> battle :)
>
> I replied too quickly. Re-reading your email, I think your proposing a
> different algorithm, taking tuple counts into account. No tunables. Is
> there a fully fleshed out version of the proposed alternative algorithm
> somewhere? (one of the older threads?) I guess this is why its so hard
> to get anything committed in this area...

It's along the lines of the "1a)" from [1]. I don't think that post
does a great job of explaining it.

I think the best way to understand it is to look at
relation_needs_vacanalyze() and see how it calculates boolean values
for its boolean output params. Instead of calculating just a boolean
value, it would calculate a float4, where < 1.0 means don't do the
operation and anything >= 1.0 means do the operation. For example,
let's say a table has 600 dead rows, and the scale factor and threshold
settings mean that autovacuum will trigger at 200 (3 times more dead
tuples than the trigger point). That would result in the value of 3.0
(600 / 200). The priority for the relfrozenxid portion is basically
age(relfrozenxid) / autovacuum_freeze_max_age (plus we need to account
for mxid by doing the same for that and taking the maximum of the two
values). The priority for autovacuum would then be the maximum of
those component "scores".

Effectively, it's a method of aligning the different units of measure,
transactions or tuples into a single value which is calculated based
on the very same values that we use today to trigger autovacuums.
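A minimal sketch of that scoring, in Python rather than C, using the default values of autovacuum_freeze_max_age (200 million) and autovacuum_multixact_freeze_max_age (400 million); the function name and signature are made up for illustration:

```python
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000            # default GUC value
AUTOVACUUM_MULTIXACT_FREEZE_MAX_AGE = 400_000_000  # default GUC value

def autovacuum_priority(dead_tuples, vacuum_trigger, xid_age, mxid_age):
    """Each component is (current value / trigger point), so >= 1.0 means
    that trigger has fired; the table's priority is the maximum of them."""
    bloat_score = dead_tuples / vacuum_trigger
    wraparound_score = max(xid_age / AUTOVACUUM_FREEZE_MAX_AGE,
                           mxid_age / AUTOVACUUM_MULTIXACT_FREEZE_MAX_AGE)
    return max(bloat_score, wraparound_score)

# David's example: 600 dead rows against a trigger point of 200 -> 3.0.
print(autovacuum_priority(600, 200, 50_000_000, 0))  # 3.0
```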

David

[1]: /messages/by-id/CAApHDvo8DWyt4CWhF=NPeRstz_78SteEuuNDfYO7cjp=7YTK4g@mail.gmail.com

#14 Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#4)
Re: another autovacuum scheduling thread

On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote:

> On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
>
>> The attached patch works by storing the maximum of the XID age and the MXID
>> age in the list with the OIDs and sorting it prior to processing.
>
> I think it may be worth trying to avoid reliably using the same order -
> otherwise e.g. a corrupt index on the first scheduled table can cause
> autovacuum to reliably fail on the same relation, never allowing it to
> progress past that point.

Hm. What if we kept a short array of "failed" tables in shared memory?
Each worker would consult this table before processing. If the table is
there, it would remove it from the shared table and skip processing it.
Then the next worker would try processing the table again.

I also wonder how hard it would be to gracefully catch the error and let
the worker continue with the rest of its list...

--
nathan

#15 Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#13)
Re: another autovacuum scheduling thread

On Thu, Oct 09, 2025 at 04:13:23PM +1300, David Rowley wrote:

> I think the best way to understand it is if you look at
> relation_needs_vacanalyze() and see how it calculates boolean values
> for boolean output params. So, instead of calculating just a boolean
> value it instead calculates a float4 where < 1.0 means don't do the
> operation and anything >= 1.0 means do the operation. For example,
> let's say a table has 600 dead rows and the scale factor and threshold
> settings mean that autovacuum will trigger at 200 (3 times more dead
> tuples than the trigger point). That would result in the value of 3.0
> (600 / 200). The priority for relfrozenxid portion is basically
> age(relfrozenxid) / autovacuum_freeze_max_age (plus need to account
> for mxid by doing the same for that and taking the maximum of each
> value). For each of those component "scores", the priority for
> autovacuum would be the maximum of each of those.
>
> Effectively, it's a method of aligning the different units of measure,
> transactions or tuples into a single value which is calculated based
> on the very same values that we use today to trigger autovacuums.

I like the idea of a "score" approach, but I'm worried that we'll never
come to an agreement on the formula to use. Perhaps we'd have more luck
getting consensus on a multifaceted strategy if we kept it brutally simple.
IMHO it's worth a try...

--
nathan

#16 Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#14)
Re: another autovacuum scheduling thread

Hi,

On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote:

> On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote:
>
>> On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
>>
>>> The attached patch works by storing the maximum of the XID age and the MXID
>>> age in the list with the OIDs and sorting it prior to processing.
>>
>> I think it may be worth trying to avoid reliably using the same order -
>> otherwise e.g. a corrupt index on the first scheduled table can cause
>> autovacuum to reliably fail on the same relation, never allowing it to
>> progress past that point.
>
> Hm. What if we kept a short array of "failed" tables in shared memory?

I've thought about having that as part of pgstats...

> Each worker would consult this table before processing. If the table is
> there, it would remove it from the shared table and skip processing it.
> Then the next worker would try processing the table again.
>
> I also wonder how hard it would be to gracefully catch the error and let
> the worker continue with the rest of its list...

The main set of cases I've seen are when workers get hung up permanently in
corrupt indexes. There never is actually an error, the autovacuums just get
terminated as part of whatever independent reason there is to restart. The
problem with that is that you'll never actually have vacuum fail...

Greetings,

Andres Freund

#17 Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#16)
Re: another autovacuum scheduling thread

On Thu, Oct 09, 2025 at 12:15:31PM -0400, Andres Freund wrote:

> On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote:
>
>> I also wonder how hard it would be to gracefully catch the error and let
>> the worker continue with the rest of its list...
>
> The main set of cases I've seen are when workers get hung up permanently in
> corrupt indexes. There never is actually an error, the autovacuums just get
> terminated as part of whatever independent reason there is to restart. The
> problem with that is that you'll never actually have vacuum fail...

Ah. Wouldn't the other workers skip that table in that scenario? I'm not
following the great advantage of varying the order in this case. I suppose
the full set of workers might be able to process more tables before one
inevitably gets stuck. Is that it?

--
nathan

#18 Peter Geoghegan
In reply to: Andres Freund (#16)
Re: another autovacuum scheduling thread

On Thu, Oct 9, 2025 at 12:15 PM Andres Freund <andres@anarazel.de> wrote:

>> Each worker would consult this table before processing. If the table is
>> there, it would remove it from the shared table and skip processing it.
>> Then the next worker would try processing the table again.
>>
>> I also wonder how hard it would be to gracefully catch the error and let
>> the worker continue with the rest of its list...
>
> The main set of cases I've seen are when workers get hung up permanently in
> corrupt indexes.

How recently was this? I'm aware of problems like that that we
discussed around 2018, but they were greatly mitigated, first by your
commit 3a01f68e and then by my commit c34787f9.

In general, there's no particularly good reason why (at least with
nbtree indexes) VACUUM should ever hang forever. The access pattern is
overwhelmingly simple, sequential access. The only exception is nbtree
page deletion (plus backtracking), where it isn't particularly hard to
just be very careful about self-deadlock.

> There never is actually an error, the autovacuums just get
> terminated as part of whatever independent reason there is to restart.

What do you mean?

In general I'd expect nbtree VACUUM of a corrupt index to either not
fail at all (we'll soldier on to the best of our ability when page
deletion encounters an inconsistency), or to get permanently stuck due
to locking the same page twice/self-deadlock (though as I said, those
problems were mitigated, and might even be almost impossible these
days). Every other case involves some kind of error (e.g., an OOM is
just about possible).

I agree with you that using a perfectly deterministic order comes
with real downsides, without any upside. Don't interpret what I've
said as expressing opposition to that idea.

--
Peter Geoghegan

#19 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#15)
Re: another autovacuum scheduling thread

On Thu, Oct 09, 2025 at 11:13:48AM -0500, Nathan Bossart wrote:

> On Thu, Oct 09, 2025 at 04:13:23PM +1300, David Rowley wrote:
>
>> I think the best way to understand it is if you look at
>> relation_needs_vacanalyze() and see how it calculates boolean values
>> for boolean output params. So, instead of calculating just a boolean
>> value it instead calculates a float4 where < 1.0 means don't do the
>> operation and anything >= 1.0 means do the operation. For example,
>> let's say a table has 600 dead rows and the scale factor and threshold
>> settings mean that autovacuum will trigger at 200 (3 times more dead
>> tuples than the trigger point). That would result in the value of 3.0
>> (600 / 200). The priority for relfrozenxid portion is basically
>> age(relfrozenxid) / autovacuum_freeze_max_age (plus need to account
>> for mxid by doing the same for that and taking the maximum of each
>> value). For each of those component "scores", the priority for
>> autovacuum would be the maximum of each of those.
>>
>> Effectively, it's a method of aligning the different units of measure,
>> transactions or tuples into a single value which is calculated based
>> on the very same values that we use today to trigger autovacuums.
>
> I like the idea of a "score" approach, but I'm worried that we'll never
> come to an agreement on the formula to use. Perhaps we'd have more luck
> getting consensus on a multifaceted strategy if we kept it brutally simple.
> IMHO it's worth a try...

Here's a prototype of a "score" approach. Two notes:

* I've given special priority to anti-wraparound vacuums. I think this is
important to avoid focusing too much on bloat when wraparound is imminent.
In any case, we need a separate wraparound score in case autovacuum is
disabled.

* I didn't include the analyze threshold in the score because it doesn't
apply to TOAST tables, and therefore would artificially lower their
priority. Perhaps there is another way to deal with this.

This is very much just a prototype of the basic idea. As-is, I think it'll
favor processing tables with lots of bloat unless we're in an
anti-wraparound scenario. Maybe that's okay. I'm not sure how scientific
we want to be about all of this, but I do intend to try some long-running
tests.

--
nathan

Attachments:

v2-0001-autovacuum-scheduling-improvements.patch (text/plain, +81 -13)
#20 Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#19)
Re: another autovacuum scheduling thread

On Fri, Oct 10, 2025 at 1:31 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

> Here's a prototype of a "score" approach. Two notes:
>
> * I've given special priority to anti-wraparound vacuums. I think this is
> important to avoid focusing too much on bloat when wraparound is imminent.
> In any case, we need a separate wraparound score in case autovacuum is
> disabled.
>
> * I didn't include the analyze threshold in the score because it doesn't
> apply to TOAST tables, and therefore would artificially lower their
> prioritiy. Perhaps there is another way to deal with this.
>
> This is very much just a prototype of the basic idea. As-is, I think it'll
> favor processing tables with lots of bloat unless we're in an
> anti-wraparound scenario. Maybe that's okay. I'm not sure how scientific
> we want to be about all of this, but I do intend to try some long-running
> tests.

I think this is a reasonable starting point, although I'm surprised
that you chose to combine the sub-scores using + rather than Max.

I think it will take a lot of experimentation to figure out whether
this particular algorithm (or any other) works well in practice. My
intuition (for whatever that is worth to you, which may not be much)
is that what will anger users is cases when we ignore a horrible
problem to deal with a routine problem. Figuring out how to design the
scoring system to avoid such outcomes is the hard part of this
problem, IMHO. For this particular algorithm, the main hazards that
spring to mind for me are:

- The wraparound score can't be more than about 10, but the bloat
score could be arbitrarily large, especially for tables with few
tuples, so there may be lots of cases in which the wraparound score
has no impact on the behavior.

- The patch attempts to guard against this by disregarding the
non-wraparound portion of the score once the wraparound portion
reaches 1.0, but that results in an abrupt behavior shift at that
point. Suddenly we go from mostly ignoring the wraparound score to
entirely ignoring the bloat score. This might result in the system
abruptly ignoring tables that are bloating extremely rapidly in favor
of trying to catch up in a wraparound situation that is not yet
terribly urgent.

When I've thought about this problem -- and I can't claim to have
thought about it very hard -- it's seemed to me that we need to (1)
somehow normalize everything to somewhat similar units and (2) make
sure that severe wraparound danger always wins over every other
consideration, but mild wraparound danger can lose to severe bloat.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#21)
#23Jeremy Schneider
schneider@ardentperf.com
In reply to: Robert Haas (#22)
#24David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#20)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Jeremy Schneider (#23)
#26Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#24)
#27David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#26)
#28Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#27)
#29Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#28)
#30David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#29)
#31Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#30)
#32Sami Imseih
samimseih@gmail.com
In reply to: Nathan Bossart (#31)
#33Nathan Bossart
nathandbossart@gmail.com
In reply to: Sami Imseih (#32)
#34Sami Imseih
samimseih@gmail.com
In reply to: Nathan Bossart (#33)
#35David Rowley
dgrowleyml@gmail.com
In reply to: Sami Imseih (#34)
#36Sami Imseih
samimseih@gmail.com
In reply to: David Rowley (#35)
#37David Rowley
dgrowleyml@gmail.com
In reply to: Sami Imseih (#36)
#38Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#37)
#39Peter Geoghegan
In reply to: David Rowley (#30)
#40David Rowley
dgrowleyml@gmail.com
In reply to: Peter Geoghegan (#39)
#41David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#38)
#42Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#41)
#43Sami Imseih
samimseih@gmail.com
In reply to: Nathan Bossart (#42)
#44Nathan Bossart
nathandbossart@gmail.com
In reply to: Sami Imseih (#43)
#45Sami Imseih
samimseih@gmail.com
In reply to: Nathan Bossart (#44)
#46David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#42)
#47David Rowley
dgrowleyml@gmail.com
In reply to: Sami Imseih (#45)
#48Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#46)
#49Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#47)
#50Sami Imseih
samimseih@gmail.com
In reply to: Nathan Bossart (#48)
#51wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Sami Imseih (#50)
#52Sami Imseih
samimseih@gmail.com
In reply to: Nathan Bossart (#49)
#53Nathan Bossart
nathandbossart@gmail.com
In reply to: Sami Imseih (#50)
#54Nathan Bossart
nathandbossart@gmail.com
In reply to: wenhui qiu (#51)
#55Nathan Bossart
nathandbossart@gmail.com
In reply to: Sami Imseih (#52)
#56wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Nathan Bossart (#55)
#57David Rowley
dgrowleyml@gmail.com
In reply to: wenhui qiu (#56)
#58wenhui qiu
qiuwenhuifx@gmail.com
In reply to: David Rowley (#57)
#59David Rowley
dgrowleyml@gmail.com
In reply to: wenhui qiu (#58)
#60Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#53)
#61Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#60)
#62Sami Imseih
samimseih@gmail.com
In reply to: Nathan Bossart (#61)
#63Sami Imseih
samimseih@gmail.com
In reply to: Sami Imseih (#62)
#64Nathan Bossart
nathandbossart@gmail.com
In reply to: Sami Imseih (#63)
#65David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#64)
#66David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#65)
#67Sami Imseih
samimseih@gmail.com
In reply to: David Rowley (#66)
#68David Rowley
dgrowleyml@gmail.com
In reply to: Sami Imseih (#67)
#69Sami Imseih
samimseih@gmail.com
In reply to: David Rowley (#68)
#70David Rowley
dgrowleyml@gmail.com
In reply to: Sami Imseih (#69)
#71Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#70)
#72Robert Treat
xzilla@users.sourceforge.net
In reply to: Nathan Bossart (#71)
#73Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Treat (#72)
#74Robert Treat
xzilla@users.sourceforge.net
In reply to: Nathan Bossart (#73)
#75David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#71)
#76Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#75)
#77Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Treat (#74)
#78Sami Imseih
samimseih@gmail.com
In reply to: David Rowley (#70)
#79David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#76)
#80David Rowley
dgrowleyml@gmail.com
In reply to: Sami Imseih (#78)
#81Sami Imseih
samimseih@gmail.com
In reply to: David Rowley (#80)
#82Robert Treat
xzilla@users.sourceforge.net
In reply to: David Rowley (#79)
#83Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Treat (#82)
#84Sami Imseih
samimseih@gmail.com
In reply to: Nathan Bossart (#83)
#85Jeremy Schneider
schneider@ardentperf.com
In reply to: Sami Imseih (#84)
#86Sami Imseih
samimseih@gmail.com
In reply to: Jeremy Schneider (#85)
#87Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#83)
#88Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#87)
#89Sami Imseih
samimseih@gmail.com
In reply to: Nathan Bossart (#88)
#90Robert Haas
robertmhaas@gmail.com
In reply to: Sami Imseih (#89)
#91Sami Imseih
samimseih@gmail.com
In reply to: Robert Haas (#90)
#92David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#90)
#93Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#92)
#94David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#93)
#95Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#94)
#96Dilip Kumar
dilipbalaut@gmail.com
In reply to: Nathan Bossart (#88)
#97Sami Imseih
samimseih@gmail.com
In reply to: Robert Haas (#95)
#98Robert Haas
robertmhaas@gmail.com
In reply to: Sami Imseih (#97)
#99Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#95)
#100David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#98)
#101Sami Imseih
samimseih@gmail.com
In reply to: Robert Haas (#98)
#102Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#100)
#103Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#102)