We probably need autovacuum_max_wraparound_workers

Started by Josh Berkus · almost 14 years ago · 25 messages · pgsql-hackers
#1Josh Berkus
josh@agliodbs.com

Folks,

Yeah, I can't believe I'm calling for *yet another* configuration
variable either. Suggested workaround fixes very welcome.

The basic issue is that autovacuum_max_workers is set by most users
based on autovac's fairly lightweight action most of the time: analyze,
vacuuming pages not on the visibility list, etc. However, when XID
wraparound kicks in, then autovac starts reading entire tables from disk
... and those tables may be very large.

This becomes a downtime issue if you've set autovacuum_max_workers to,
say, 5 and several large tables hit the wraparound threshold at the same
time (as they tend to do if you're using the default settings). Then
you have 5 autovacuum processes concurrently doing heavy IO and getting
in each other's way.

I've seen this at two sites now, and my conclusion is that a single
autovacuum_max_workers isn't sufficient to cover the case of
wraparound vacuum. Nor can we just single-thread the wraparound vacuum
(i.e. just one worker) since that would hurt users who have thousands of
small tables.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#1)
Re: We probably need autovacuum_max_wraparound_workers

Josh Berkus <josh@agliodbs.com> writes:

Yeah, I can't believe I'm calling for *yet another* configuration
variable either. Suggested workaround fixes very welcome.

The basic issue is that autovacuum_max_workers is set by most users
based on autovac's fairly lightweight action most of the time: analyze,
vacuuming pages not on the visibility list, etc. However, when XID
wraparound kicks in, then autovac starts reading entire tables from disk
... and those tables may be very large.

It doesn't seem to me that this has much of anything to do with
wraparound; that just happens to be one possible trigger condition
for a lot of vacuuming activity to be happening. (Others are bulk
data loads or bulk updates, for instance.) Nor am I convinced that
changing the max_workers setting is an appropriate fix anyway.

I think what you've really got here is inappropriate autovacuum cost
delay settings, and/or the logic in autovacuum.c to try to divvy up the
available I/O capacity by tweaking workers' delay settings isn't working
very well. It's hard to propose improvements without a lot more detail
than you've provided, though.

regards, tom lane
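For context on the cost-delay settings Tom refers to: the throttle works by charging each page access a cost and sleeping once a budget is used up. A back-of-the-envelope sketch of the I/O rate this implies (the numbers are the defaults of that era, used purely for illustration):

```python
# Rough throughput implied by the autovacuum cost-delay GUCs.
# Illustrative defaults: vacuum_cost_limit = 200,
# autovacuum_vacuum_cost_delay = 20 ms, vacuum_cost_page_miss = 10,
# 8 kB blocks.

def autovac_read_rate_mb_per_s(cost_limit=200, delay_ms=20.0,
                               page_miss_cost=10, block_kb=8):
    """Worst-case (all cache misses) read rate allowed to one worker."""
    pages_per_round = cost_limit / page_miss_cost   # pages before sleeping
    rounds_per_sec = 1000.0 / delay_ms              # sleep cycles per second
    return pages_per_round * rounds_per_sec * block_kb / 1024.0

print(autovac_read_rate_mb_per_s())  # ≈ 7.8 MB/s with these defaults
```

At roughly 8 MB/s per worker, a multi-gigabyte partition takes a long time, which is why the tuning of these knobs dominates the rest of the thread.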

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Josh Berkus (#1)
Re: We probably need autovacuum_max_wraparound_workers

On Jun 27, 2012, at 22:00, Josh Berkus <josh@agliodbs.com> wrote:

Folks,

Yeah, I can't believe I'm calling for *yet another* configuration
variable either. Suggested workaround fixes very welcome.

The basic issue is that autovacuum_max_workers is set by most users
based on autovac's fairly lightweight action most of the time: analyze,
vacuuming pages not on the visibility list, etc. However, when XID
wraparound kicks in, then autovac starts reading entire tables from disk
... and those tables may be very large.

This becomes a downtime issue if you've set autovacuum_max_workers to,
say, 5 and several large tables hit the wraparound threshold at the same
time (as they tend to do if you're using the default settings). Then
you have 5 autovacuum processes concurrently doing heavy IO and getting
in each other's way.

I've seen this at two sites now, and my conclusion is that a single
autovacuum_max_workers isn't sufficient to cover the case of
wraparound vacuum. Nor can we just single-thread the wraparound vacuum
(i.e. just one worker) since that would hurt users who have thousands of
small tables.

Would there be enough benefit to setting up separate small/medium?/large thresholds, with user-changeable default table-size boundaries, so that you could configure 6 workers where 3 handle the small tables, 2 handle the medium tables, and 1 handles the large tables? Or, alternatively, a small worker consumes 1, a medium 2, and a large 3 'units' from whatever size pool has been defined, so you could have 6 small tables or two large tables in-progress simultaneously.

David J.

#4Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#2)
Re: We probably need autovacuum_max_wraparound_workers

I think what you've really got here is inappropriate autovacuum cost
delay settings, and/or the logic in autovacuum.c to try to divvy up the
available I/O capacity by tweaking workers' delay settings isn't working
very well. It's hard to propose improvements without a lot more detail
than you've provided, though.

Wait, we *have* that logic? If so, that's the problem ... it's not
working very well.

What detail do you want?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#5Stephen Frost
sfrost@snowman.net
In reply to: Josh Berkus (#1)
Re: We probably need autovacuum_max_wraparound_workers

Josh, all,

* Josh Berkus (josh@agliodbs.com) wrote:

Yeah, I can't believe I'm calling for *yet another* configuration
variable either. Suggested workaround fixes very welcome.

As I suggested on IRC, my thought would be to have a goal-based system
for autovacuum which is similar to our goal-based commit system. We
don't need autovacuum sucking up all the I/O in the box, nor should we
ask the users to manage that. Instead, let's decide when the autovacuum
on a given table needs to finish and then plan to keep on working at a
rate that'll allow us to get done well in advance of that deadline.

Just my 2c.

Thanks,

Stephen
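Stephen's goal-based idea parallels checkpoint pacing: pick a deadline, then work only as fast as needed to finish comfortably early. A minimal sketch of that pacing rule (the function name and completion-target analogy are assumptions for illustration; nothing like this exists in autovacuum.c):

```python
# Hypothetical goal-based pacing for a vacuum, modeled on the
# checkpoint_completion_target idea: finish by target * deadline.

def required_pages_per_sec(pages_total, pages_done, seconds_to_deadline,
                           completion_target=0.9):
    """Rate needed to finish the remaining pages well before the deadline."""
    budget = seconds_to_deadline * completion_target
    if budget <= 0:
        return float("inf")   # past the deadline: go flat out
    return (pages_total - pages_done) / budget

# A 4 GB table (524288 8 kB pages), nothing done, 26 hours to the deadline:
rate = required_pages_per_sec(524288, 0, 26 * 3600)   # ~6 pages/sec suffices
```

The point of the sketch: when the deadline is far away, the required rate is tiny, so the I/O cost can be spread almost invisibly instead of arriving as a spike.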

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#4)
Re: We probably need autovacuum_max_wraparound_workers

Josh Berkus <josh@agliodbs.com> writes:

I think what you've really got here is inappropriate autovacuum cost
delay settings, and/or the logic in autovacuum.c to try to divvy up the
available I/O capacity by tweaking workers' delay settings isn't working
very well. It's hard to propose improvements without a lot more detail
than you've provided, though.

Wait, we *have* that logic? If so, that's the problem ... it's not
working very well.

What detail do you want?

What's it doing? What do you think it should do instead?

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#5)
Re: We probably need autovacuum_max_wraparound_workers

Stephen Frost <sfrost@snowman.net> writes:

* Josh Berkus (josh@agliodbs.com) wrote:

Yeah, I can't believe I'm calling for *yet another* configuration
variable either. Suggested workaround fixes very welcome.

As I suggested on IRC, my thought would be to have a goal-based system
for autovacuum which is similar to our goal-based commit system. We
don't need autovacuum sucking up all the I/O in the box, nor should we
ask the users to manage that. Instead, let's decide when the autovacuum
on a given table needs to finish and then plan to keep on working at a
rate that'll allow us to get done well in advance of that deadline.

If we allow individual vacuum operations to stretch out just because
they don't need to be completed right away, we will need more concurrent
vacuum workers (so that we can respond to vacuum requirements for other
tables). So I submit that this would only move the problem around:
the number of active workers would increase to the point where things
are just as painful, plus or minus a bit.

The intent of the autovacuum cost delay features is to ensure that
autovacuum doesn't suck an untenable fraction of the machine's I/O
capacity, even when it's running flat out. So I think Josh's complaint
indicates that we have a problem with cost-delay tuning; hard to tell
what exactly without more info. It might only be that the defaults
are bad for these particular users, or it could be more involved.

regards, tom lane

#8Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#7)
Re: We probably need autovacuum_max_wraparound_workers

On Wed, Jun 27, 2012 at 11:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Stephen Frost <sfrost@snowman.net> writes:

* Josh Berkus (josh@agliodbs.com) wrote:

Yeah, I can't believe I'm calling for *yet another* configuration
variable either.  Suggested workaround fixes very welcome.

As I suggested on IRC, my thought would be to have a goal-based system
for autovacuum which is similar to our goal-based commit system.  We
don't need autovacuum sucking up all the I/O in the box, nor should we
ask the users to manage that.  Instead, let's decide when the autovacuum
on a given table needs to finish and then plan to keep on working at a
rate that'll allow us to get done well in advance of that deadline.

If we allow individual vacuum operations to stretch out just because
they don't need to be completed right away, we will need more concurrent
vacuum workers (so that we can respond to vacuum requirements for other
tables).  So I submit that this would only move the problem around:
the number of active workers would increase to the point where things
are just as painful, plus or minus a bit.

The intent of the autovacuum cost delay features is to ensure that
autovacuum doesn't suck an untenable fraction of the machine's I/O
capacity, even when it's running flat out.  So I think Josh's complaint
indicates that we have a problem with cost-delay tuning; hard to tell
what exactly without more info.  It might only be that the defaults
are bad for these particular users, or it could be more involved.

I've certainly come across many reports of the cost delay settings
being difficult to tune, both on pgsql-hackers/performance and in
various private EnterpriseDB correspondence. I think Stephen's got it
exactly right: the system needs to figure out the rate at which vacuum
needs to happen, not rely on the user to provide that information.

For checkpoints, we estimated the percentage of the checkpoint that
ought to be completed and the percentage that actually is completed;
if the latter is less than the former, we speed things up until we're
back on track. For autovacuum, the trick is to speed things up when
the rate at which tables are coming due for autovacuum exceeds the
rate at which we are vacuuming them; or, when we anticipate that a
whole bunch of wraparound vacuums are going to come due
simultaneously, to start doing them sooner so that they are more
spread out.

For example, suppose that 26 tables each of which is 4GB in size are
going to simultaneously come due for an anti-wraparound vacuum in 26
hours. For the sake of simplicity suppose that each will take 1 hour
to vacuum. What we currently do is wait for 26 hours and then start
vacuuming them all at top speed, thrashing the I/O system. What we
ought to do is start vacuuming them much sooner and do them
consecutively. Of course, the trick is to design a mechanism that
does something intelligent if we think we're on track and then all of
a sudden the rate of XID consumption changes dramatically, and now
we've got to vacuum faster or with more workers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
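Robert's 26-table scenario can be put in numbers (an illustration of the argument only, not any actual scheduler behavior):

```python
# 26 tables, each needing 1 hour of vacuum, all coming due at hour 26.
# Strategy A starts early and runs them consecutively with one worker;
# strategy B waits for the deadline and then runs max_workers at once.

import math

tables, hours_each, deadline, max_workers = 26, 1, 26, 5

# Strategy A: one worker, started at hour 0, finishes exactly on time.
finish_consecutive = tables * hours_each

# Strategy B: nothing until hour 26, then batches of max_workers.
finish_parallel = deadline + math.ceil(tables / max_workers) * hours_each

print(finish_consecutive, finish_parallel)  # 26 32
```

Strategy A finishes on time with a flat, single-worker I/O load; strategy B finishes six hours *late* despite hammering the disk with five concurrent workers.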

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#8)
Re: We probably need autovacuum_max_wraparound_workers

Robert Haas <robertmhaas@gmail.com> writes:

For example, suppose that 26 tables each of which is 4GB in size are
going to simultaneously come due for an anti-wraparound vacuum in 26
hours. For the sake of simplicity suppose that each will take 1 hour
to vacuum. What we currently do is wait for 26 hours and then start
vacuuming them all at top speed, thrashing the I/O system.

This is a nice description of a problem that has nothing to do with
reality. In the first place, we don't vacuum them all at once; we can
only vacuum max_workers of them at a time. In the second place, the
cost-delay features ought to be keeping autovacuum from thrashing the
I/O, entirely independently of what the reason was for starting the
vacuums. Clearly, since people are complaining, there's something that
needs work there. But not the work you're proposing.

regards, tom lane

#10Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#9)
Re: We probably need autovacuum_max_wraparound_workers

On Thu, Jun 28, 2012 at 12:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

For example, suppose that 26 tables each of which is 4GB in size are
going to simultaneously come due for an anti-wraparound vacuum in 26
hours.  For the sake of simplicity suppose that each will take 1 hour
to vacuum.  What we currently do is wait for 26 hours and then start
vacuuming them all at top speed, thrashing the I/O system.

This is a nice description of a problem that has nothing to do with
reality.  In the first place, we don't vacuum them all at once; we can
only vacuum max_workers of them at a time.  In the second place, the
cost-delay features ought to be keeping autovacuum from thrashing the
I/O, entirely independently of what the reason was for starting the
vacuums.

I don't think it works that way. The point is that the workload
imposed by autovac is intermittent and spikey. If you configure the
cost limit too low, or the delay too high, or the number of autovac
workers is too low, then autovac can't keep up, which causes all of
your tables to bloat and is a total disaster. You have to make sure
that isn't going to happen, so you naturally configure the settings
aggressively enough that you're sure autovac will be able to stay
ahead of your bloat problem. But then autovac is more
resource-intensive ALL the time, not just when there's a real need for
it. This is like giving a kid a $20 bill to buy lunch and having them
walk around until they find a restaurant sufficiently expensive that
lunch there costs $20. The point of handing over $20 was that you
were willing to spend that much *if needed*, not that the money was
burning a hole in your pocket.

To make that more concrete, suppose that a table has an update rate
such that it hits the autovac threshold every 10 minutes. If you set
the autovac settings such that an autovacuum of that table takes 9
minutes to complete, you are hosed: there will eventually be some
10-minute period where the update rate is ten times the typical
amount, and the table will gradually become horribly bloated. But if
you set the autovac settings such that an autovacuum of the table can
finish in 1 minute, so that you can cope with a spike, then whenever
there isn't a spike you are processing the table ten times faster than
necessary, and now one minute out of every ten carries a heavier I/O
load than the other 9, leading to uneven response times.

It's just ridiculous to assert that it doesn't matter if all the
anti-wraparound vacuums start simultaneously. It does matter. For
one thing, once every single autovacuum worker is pinned down doing an
anti-wraparound vacuum of some table, then a table that needs an
ordinary vacuum may have to wait quite some time before a worker is
available. Depending on the order in which workers iterate through
the tables, you could end up finishing all of the anti-wraparound
vacuums before doing any of the regular vacuums. If the wraparound
vacuums had been properly spread out, then there would at all times
have been workers available for regular vacuums as needed. For
another thing, you can't possibly think that three or five workers
running simultaneously, each reading a different table, is just as
efficient as having one worker grind through them consecutively.
Parallelism is not free, ever, and particularly not here, where it has
the potential to yank the disk head around between five different
files, seeking like crazy, instead of a nice sequential I/O pattern on
each file in turn. Josh wouldn't keep complaining about this if it
didn't suck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#10)
Re: We probably need autovacuum_max_wraparound_workers

Robert Haas <robertmhaas@gmail.com> writes:

It's just ridiculous to assert that it doesn't matter if all the
anti-wraparound vacuums start simultaneously. It does matter. For
one thing, once every single autovacuum worker is pinned down doing an
anti-wraparound vacuum of some table, then a table that needs an
ordinary vacuum may have to wait quite some time before a worker is
available.

Well, that's a fair point, but I don't think it has anything to do with
Josh's complaint --- which AFAICT is about imposed load, not about
failure to vacuum things that need vacuumed. Any scheme you care to
design will sometimes be running max_workers workers at once, and if
that's too much load there will be trouble. I grant that there can be
value in a more complex strategy for when to schedule vacuuming
activities, but I don't think that it has a lot to do with solving the
present complaint.

Parallelism is not free, ever, and particularly not here, where it has
the potential to yank the disk head around between five different
files, seeking like crazy, instead of a nice sequential I/O pattern on
each file in turn.

Interesting point. Maybe what's going on here is that
autovac_balance_cost() is wrong to suppose that N workers can each have
1/N of the I/O bandwidth that we'd consider okay for a single worker to
eat. Maybe extra seek costs mean we have to derate that curve by some
large factor. 1/(N^2), perhaps? I bet the nature of the disk subsystem
affects this a lot, though.

regards, tom lane
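Tom's derating speculation is simple to state as arithmetic (the function below is a made-up illustration, not the actual autovac_balance_cost() logic):

```python
def per_worker_cost_limit(total_limit, n_workers, derate=1):
    """Split an I/O budget among workers.

    derate=1 models the existing assumption: N workers each get 1/N.
    derate=2 models Tom's speculative 1/N^2, penalizing concurrency to
    account for seek interference between workers (an assumption only).
    """
    return total_limit / n_workers ** derate

print(per_worker_cost_limit(200, 5), per_worker_cost_limit(200, 5, 2))  # 40.0 8.0
```

Under 1/N², five concurrent workers would together be allowed only a fifth of the single-worker budget, reflecting the idea that five interleaved sequential scans cost far more than their sum in seeks.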

#12Daniel Farina
daniel@heroku.com
In reply to: Josh Berkus (#1)
Re: We probably need autovacuum_max_wraparound_workers

On Wed, Jun 27, 2012 at 7:00 PM, Josh Berkus <josh@agliodbs.com> wrote:

I've seen this at two sites now, and my conclusion is that a single
autovacuum_max_workers isn't sufficient to cover the case of
wraparound vacuum. Nor can we just single-thread the wraparound vacuum
(i.e. just one worker) since that would hurt users who have thousands of
small tables.

I have also witnessed very unfortunate, un-smooth performance behavior
around wraparound time. A bit of adaptive response, scaling allowed
autovacuum throughput to the number of pages requiring wraparound
vacuuming, would be one load off my mind. Gradually getting slower, with
some way to know that autovacuum has decided it should work harder and
harder, is better than the brick wall that can sneak up currently.

Count me as appreciative for improvements in this area.

--
fdr

#13Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#11)
Re: We probably need autovacuum_max_wraparound_workers

On Thu, Jun 28, 2012 at 2:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

It's just ridiculous to assert that it doesn't matter if all the
anti-wraparound vacuums start simultaneously.  It does matter.  For
one thing, once every single autovacuum worker is pinned down doing an
anti-wraparound vacuum of some table, then a table that needs an
ordinary vacuum may have to wait quite some time before a worker is
available.

Well, that's a fair point, but I don't think it has anything to do with
Josh's complaint --- which AFAICT is about imposed load, not about
failure to vacuum things that need vacuumed.  Any scheme you care to
design will sometimes be running max_workers workers at once, and if
that's too much load there will be trouble.  I grant that there can be
value in a more complex strategy for when to schedule vacuuming
activities, but I don't think that it has a lot to do with solving the
present complaint.

I think it's got everything to do with it. Josh could fix his problem
by increasing the cost limit and/or reducing the cost delay, but if he
did that then his database would get bloated...

Parallelism is not free, ever, and particularly not here, where it has
the potential to yank the disk head around between five different
files, seeking like crazy, instead of a nice sequential I/O pattern on
each file in turn.

Interesting point.  Maybe what's going on here is that
autovac_balance_cost() is wrong to suppose that N workers can each have
1/N of the I/O bandwidth that we'd consider okay for a single worker to
eat.  Maybe extra seek costs mean we have to derate that curve by some
large factor.  1/(N^2), perhaps?  I bet the nature of the disk subsystem
affects this a lot, though.

...and this would have the same effect. Let's not assume that the
problem is that Josh doesn't know how to make autovacuum less
aggressive, because I'm pretty sure that ain't the issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Cédric Villemain
cedric@2ndquadrant.com
In reply to: Robert Haas (#13)
Re: We probably need autovacuum_max_wraparound_workers

Parallelism is not free, ever, and particularly not here, where it has
the potential to yank the disk head around between five different
files, seeking like crazy, instead of a nice sequential I/O pattern on
each file in turn.

Interesting point. Maybe what's going on here is that
autovac_balance_cost() is wrong to suppose that N workers can each have
1/N of the I/O bandwidth that we'd consider okay for a single worker to
eat. Maybe extra seek costs mean we have to derate that curve by some
large factor. 1/(N^2), perhaps? I bet the nature of the disk subsystem
affects this a lot, though.

...and this would have the same effect. Let's not assume that the
problem is that Josh doesn't know how to make autovacuum less
aggressive, because I'm pretty sure that ain't the issue.

we may need reserved workers to work on system tables, at least.
Just as a protection in case all the workers are locked up for hours
walking 'log' tables. In the meantime, a table like pg_type can bloat a lot, for example.

It might be that limiting the number of workers in 'antiwraparound-mode' to
(max_workers - round(max_workers/3)) is enough.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
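Cédric's suggested cap is a one-liner; spelled out (illustration of his formula only):

```python
def max_wraparound_workers(max_workers):
    """Cap anti-wraparound workers, reserving roughly a third of the
    pool for ordinary vacuum/analyze work (Cédric's suggestion)."""
    return max_workers - round(max_workers / 3)

# With the default max_workers = 3, at most 2 may do wraparound vacuums;
# with 5, at most 3; with 6, at most 4.
```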

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: We probably need autovacuum_max_wraparound_workers

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jun 28, 2012 at 2:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, that's a fair point, but I don't think it has anything to do with
Josh's complaint --- which AFAICT is about imposed load, not about
failure to vacuum things that need vacuumed.

I think it's got everything to do with it. Josh could fix his problem
by increasing the cost limit and/or reducing the cost delay, but if he
did that then his database would get bloated...

Josh hasn't actually explained what his problem is, nor what if any
adjustments he made to try to ameliorate it. In the absence of data
I refuse to rule out misconfiguration. But, again, to the extent that
he's given us any info at all, it seemed to be a complaint about
oversaturated I/O at max load, *not* about inability to complete
vacuuming tasks as needed. You are inventing problem details to fit
your solution.

regards, tom lane

#16Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#15)
Re: We probably need autovacuum_max_wraparound_workers

On Thu, Jun 28, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

 You are inventing problem details to fit
your solution.

Well, what I'm actually doing is assuming that Josh's customers have
the same problem that our customers do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#10)
Re: We probably need autovacuum_max_wraparound_workers

Robert, Tom, Stephen,

So, first, a description of the specific problem I've encountered at two
sites. I'm working on another email suggesting workarounds and
solutions, but that's going to take a bit longer.

Observation
-----------

This problem occurred on two database systems which shared the following
characteristics:

1) They were running with default autovacuum & vacuum settings, except
that one database had 5 workers instead of 3.

2) They have large partitioned tables, in which the partitions are
time-based and do not receive UPDATES after a certain date. Each
partition was larger than RAM.

3) The databases are old enough, and busy enough, to have been through
XID wraparound at least a couple of times.

Users reported that the database system became unresponsive, which was
surprising since both of these DBs had been placed on hardware which was
engineered for at least 100% growth over the current database size. On
investigation, we discovered the following things:

a) Each database had autovacuum_max_workers worth of workers (5 on one
DB, 3 on the other) doing anti-wraparound vacuum on several partitions simultaneously.

b) The I/O created by the anti-wraparound vacuum was tying up the system.

c) terminating any individual autovacuum process didn't help, as it
simply caused autovac to start on a different partition.

So, first question was: why was autovacuum wanting to anti-wrapround
vacuum dozens of tables at the same time? A quick check showed that all
of these partitions had nearly identical XID ages (as in less than
100,000 transactions apart), which all had exceeded
autovacuum_freeze_max_age. How did this happen? I'm still not sure.

One thought is: this is an artifact of the *previous* wraparound vacuums
on each database. On cold partitions with old dead rows which have
been through wraparound vacuum several times, this tends to result in
the cold partitions converging towards having the same relfrozenxid over
time; I'm still working on the math to prove this. Alternately, it's
possible that a schema change to the partitioned tables gave them all
the same effective relfrozenxid at some point in the past; both
databases are still in development.

So there are two parts to this problem, each of which needs a different
solution:

1. Databases can inadvertently get to the state where many tables need
wraparound vacuuming at exactly the same time, especially if they have
many "cold" data partition tables.

2. When we do hit wraparound thresholds for multiple tables, autovacuum
has no hesitation about doing autovacuum_max_workers worth of wraparound
vacuum simultaneously, even when that exceeds the I/O capacity of the
system.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Josh Berkus (#17)
Re: We probably need autovacuum_max_wraparound_workers

Excerpts from Josh Berkus's message of Thu Jun 28 15:03:15 -0400 2012:

2) They have large partitioned tables, in which the partitions are
time-based and do not receive UPDATES after a certain date. Each
partition was larger than RAM.

I think the solution to this problem has nothing to do with vacuum or
autovacuum settings, and lots to do with cataloguing enough info about
each of these tables to note that, past a certain point, they don't need
any vacuuming at all.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#19Chris Browne
cbbrowne@acm.org
In reply to: Josh Berkus (#17)
Re: We probably need autovacuum_max_wraparound_workers

On Thu, Jun 28, 2012 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote:

1. Databases can inadvertently get to the state where many tables need
wraparound vacuuming at exactly the same time, especially if they have
many "cold" data partition tables.

This suggests that this should be handled rather earlier, and with
some attempt to not do them all simultaneously.

In effect, if there are 25 tables that will need wraparound vacuums in
the next million transactions, it is presumably beneficial to start
hitting on them right away, ideally one at a time, so as to draw their
future needs further apart.

The loose thought is that any time autovac isn't very busy, it should
consider (perhaps based on probability?) picking a table that is in a
cluster of tables that currently have wraparound needs at about the
same time, and, in effect, spread that cluster out.

I suppose there are two considerations, that conflict somewhat:
a) If there are tables that Absolutely Require wraparound vacuuming,
Right Soon Now, there's nothing to help this. They MUST be vacuumed,
otherwise the system will get very unhappy.
b) It's undesirable to *worsen* things by 'clustering' future
wraparound vacuums together, which gets induced any time autovac is
continually vacuuming a series of tables. If 25 tables get vacuumed
right around now, then that may cluster their next wraparound vacuum
to 2^31 transactions from 'right around now.'

But there's no helping a).

I suppose this suggests having an autovac thread that is 'devoted' to
spreading out future wraparound vacuums.

- If a *lot* of tables were just vacuumed recently, then it shouldn't
do anything, as Right Now is a cluster of 'badness.'
- It should group tables by slicing their next wraparounds (grouping
by rounding wraparound txid to the nearest, say, 10M or 20M), and
consider vacuuming a table Right Now that would take that table out of
the worst such "slice"

Thus, supposing the grouping is like:

| TxId - nearest 10 million | Tables Wrapping In Range |
|---------------------------+--------------------------|
|                         0 |                      250 |
|                         1 |                       80 |
|                         2 |                       72 |
|                         3 |                       30 |
|                         4 |                       21 |
|                         5 |                       35 |
|                         6 |                        9 |
|                         7 |                       15 |
|                         8 |                        8 |
|                         9 |                        7 |
|                        10 |                       22 |
|                        11 |                       35 |
|                        12 |                       14 |
|                        13 |                      135 |
|                        14 |                      120 |
|                        15 |                       89 |
|                        16 |                       35 |
|                        17 |                       45 |
|                        18 |                       60 |
|                        19 |                       25 |
|                        20 |                       15 |
|                        21 |                      150 |

Suppose current txid is 7500000, and the reason for there to be 250
tables in the current range is that there are a bunch of tables that
get *continually* vacuumed. No need to worry about that range, and
I'll presume that these are all in the past.

In this example, it's crucial to, pretty soon, vacuum the 150 tables
in partition #21, as they're getting near wraparound. Nothing to be
improved on there. Though it would be kind of nice to start on the
150 as early as possible, so that we *might* avoid having them
dominate autovac, as in Josh Berkus' example.

But once those are done, the next "crucial" set, in partition #20, are
a much smaller set of tables. It would be nice, at that point, to add
in a few tables from partitions #13 and #14, to smooth out the burden.
The ideal "steady state" would look like the following:

| TxId - nearest 10 million | Tables Wrapping In Range |
|---------------------------+--------------------------|
|                         0 |                      250 |
|                         1 |                       51 |
|                         2 |                       51 |
|                         3 |                       51 |
|                         4 |                       51 |
|                         5 |                       51 |
|                         6 |                       51 |
|                         7 |                       51 |
|                         8 |                       51 |
|                         9 |                       51 |
|                        10 |                       51 |
|                        11 |                       51 |
|                        12 |                       51 |
|                        13 |                       51 |
|                        14 |                       51 |
|                        15 |                       51 |
|                        16 |                       51 |
|                        17 |                       51 |
|                        18 |                       51 |
|                        19 |                       51 |
|                        20 |                       51 |
|                        21 |                       51 |

We might not get something totally smooth, but getting rid of the
*really* chunky ranges would be good.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
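Chris's slicing idea can be sketched as a small scheduler helper (all names here are hypothetical; this is an illustration of the proposal, not existing autovacuum code):

```python
# Bucket tables by the txid at which their next anti-wraparound vacuum
# comes due, rounded to the nearest 10M, then pick a candidate from the
# fullest future slice so idle autovac time flattens the worst spike.

from collections import defaultdict

SLICE = 10_000_000

def pick_table_to_smooth(tables_due_at, current_txid):
    """tables_due_at maps table name -> txid at which wraparound is due.

    Returns a table from the most crowded future slice, or None if there
    is no future work to spread out.
    """
    slices = defaultdict(list)
    for name, due in tables_due_at.items():
        if due > current_txid:                  # only consider future work
            slices[due // SLICE].append(name)
    if not slices:
        return None
    worst = max(slices, key=lambda s: len(slices[s]))
    return slices[worst][0]                     # any table from that slice
```

Vacuuming the chosen table now moves its next due-date away from the crowded slice, which is exactly the "spreading" effect the steady-state table above depicts.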

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#17)
Re: We probably need autovacuum_max_wraparound_workers

Josh Berkus <josh@agliodbs.com> writes:

So there are two parts to this problem, each of which needs a different
solution:

1. Databases can inadvertently get to the state where many tables need
wraparound vacuuming at exactly the same time, especially if they have
many "cold" data partition tables.

I'm not especially sold on your theory that there's some behavior that
forces such convergence, but it's certainly plausible that there was,
say, a schema alteration applied to all of those partitions at about the
same time. In any case, as Robert has been saying, it seems like it
would be smart to try to get autovacuum to spread out the
anti-wraparound work a bit better when it's faced with a lot of tables
with similar relfrozenxid values.

2. When we do hit wraparound thresholds for multiple tables, autovacuum
has no hesitation about doing autovacuum_max_workers worth of wraparound
vacuum simultaneously, even when that exceeds the I/O capacity of the
system.

I continue to maintain that this problem is unrelated to wraparound as
such, and that thinking it is is a great way to design a bad solution.
There are any number of reasons why autovacuum might need to run
max_workers at once. What we need to look at is making sure that they
don't run the system into the ground when that happens.

Since your users weren't complaining about performance with one or two
autovac workers running (were they?), we can assume that the cost-delay
settings were such as to not create a problem in that scenario. So it
seems to me that it's down to autovac_balance_cost(). Either there's
a plain-vanilla bug in there, or seek costs are breaking the assumption
that it's okay to give N workers each 1/Nth of the single-worker I/O
capacity.

As far as bugs are concerned, I wonder if the premise of the calculation

* The idea here is that we ration out I/O equally. The amount of I/O
* that a worker can consume is determined by cost_limit/cost_delay, so we
* try to equalize those ratios rather than the raw limit settings.

might be wrong in itself? The ratio idea seems plausible but ...

regards, tom lane

#21Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#20)
#22Stephen Frost
sfrost@snowman.net
In reply to: Josh Berkus (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#21)
#24Cédric Villemain
cedric@2ndquadrant.com
In reply to: Tom Lane (#23)
#25Jeff Janes
jeff.janes@gmail.com
In reply to: Josh Berkus (#21)