PG12 autovac issues

Started by Justin King · about 6 years ago · 44 messages · general
#1 Justin King
kingpin867@gmail.com

We have a database that isn't overly large (~20G), but gets incredibly
frequent updates. Here's an example table:

feedi=# select * from pg_stat_all_tables where schemaname =
'production' and relname = 'tita';
relid = 16786
schemaname = production
relname = tita
seq_scan = 23365
seq_tup_read = 403862091
idx_scan = 26612759248
idx_tup_fetch = 19415752701
n_tup_ins = 24608806
n_tup_upd = 4207076934
n_tup_del = 24566916
n_tup_hot_upd = 4073821079
n_live_tup = 79942
n_dead_tup = 71969
n_mod_since_analyze = 12020
last_vacuum = 2020-03-17 15:35:19.588859+00
last_autovacuum = 2020-03-17 21:31:08.248598+00
last_analyze = 2020-03-17 15:35:20.372875+00
last_autoanalyze = 2020-03-17 22:04:41.76743+00
vacuum_count = 9
autovacuum_count = 135693
analyze_count = 9
autoanalyze_count = 495877

As you can see in this table, there are only ~80K rows, but billions
of updates. What we have observed is that the frozenxid reaches the
200M mark fairly quickly because of the amount of activity. What is
interesting is that this happens with the 'postgres' and 'template1'
databases as well and there is absolutely no activity in those
databases.

When the 'postgres' and/or 'template1' databases hit the
freeze_max_age, there are cases where it kicks off an aggressive
autovac of those tables which seems to prevent autovacs from running
elsewhere. Oddly, this is not consistent, but that condition seems to
be required. We have observed this across multiple PG12 servers (dev,
test, staging, production) all with similar workloads.

$ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c
17 Mar 17 06
34 Mar 17 07
31 Mar 17 08
31 Mar 17 09
30 Mar 17 10
34 Mar 17 11
33 Mar 17 12
19 Mar 17 13
40 Mar 17 15
31 Mar 17 16
36 Mar 17 17
34 Mar 17 18
35 Mar 17 19
35 Mar 17 20
33 Mar 17 21

As you can see above, we end up with around 33 autovacs/hr, and at
about 13:30 today they stopped, until we ran a "vacuum freeze verbose
analyze;" against the 'postgres' database (around 15:30), which then
caused the autovacs to resume running against the "feedi" database.

I'm completely perplexed as to what is happening and why it suddenly
started when we moved from PG10 > PG12. The configs and workload are
essentially the same between versions. We realize we could simply
increase the autovacuum_freeze_max_age, but that doesn't seem to
actually resolve anything -- it just pushes the problem out. Has
anyone seen anything similar to this?

Thanks very much for the consideration.

Justin King
http://flightaware.com/

#2 Justin King
kingpin867@gmail.com
In reply to: Justin King (#1)
Fwd: PG12 autovac issues

Apologies, I accidentally sent this to the pgsql-admin list initially
but intended it to go here:

We have a database that isn't overly large (~20G), but gets incredibly
frequent updates. Here's an example table:

feedi=# select * from pg_stat_all_tables where schemaname =
'production' and relname = 'tita';
relid = 16786
schemaname = production
relname = tita
seq_scan = 23365
seq_tup_read = 403862091
idx_scan = 26612759248
idx_tup_fetch = 19415752701
n_tup_ins = 24608806
n_tup_upd = 4207076934
n_tup_del = 24566916
n_tup_hot_upd = 4073821079
n_live_tup = 79942
n_dead_tup = 71969
n_mod_since_analyze = 12020
last_vacuum = 2020-03-17 15:35:19.588859+00
last_autovacuum = 2020-03-17 21:31:08.248598+00
last_analyze = 2020-03-17 15:35:20.372875+00
last_autoanalyze = 2020-03-17 22:04:41.76743+00
vacuum_count = 9
autovacuum_count = 135693
analyze_count = 9
autoanalyze_count = 495877

As you can see in this table, there are only ~80K rows, but billions
of updates. What we have observed is that the frozenxid reaches the
200M mark fairly quickly because of the amount of activity. What is
interesting is that this happens with the 'postgres' and 'template1'
databases as well and there is absolutely no activity in those
databases.

When the 'postgres' and/or 'template1' databases hit the
freeze_max_age, there are cases where it kicks off an aggressive
autovac of those tables which seems to prevent autovacs from running
elsewhere. Oddly, this is not consistent, but that condition seems to
be required. We have observed this across multiple PG12 servers (dev,
test, staging, production) all with similar workloads.

$ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c
17 Mar 17 06
34 Mar 17 07
31 Mar 17 08
31 Mar 17 09
30 Mar 17 10
34 Mar 17 11
33 Mar 17 12
19 Mar 17 13
40 Mar 17 15
31 Mar 17 16
36 Mar 17 17
34 Mar 17 18
35 Mar 17 19
35 Mar 17 20
33 Mar 17 21

As you can see above, we end up with around 33 autovacs/hr, and at
about 13:30 today they stopped, until we ran a "vacuum freeze verbose
analyze;" against the 'postgres' database (around 15:30), which then
caused the autovacs to resume running against the "feedi" database.

I'm completely perplexed as to what is happening and why it suddenly
started when we moved from PG10 > PG12. The configs and workload are
essentially the same between versions. We realize we could simply
increase the autovacuum_freeze_max_age, but that doesn't seem to
actually resolve anything -- it just pushes the problem out. Has
anyone seen anything similar to this?

Thanks very much for the consideration.

Justin King
http://flightaware.com/

#3 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Justin King (#2)
Re: Fwd: PG12 autovac issues

On 3/17/20 3:22 PM, Justin King wrote:

Apologies, I accidentally sent this to the pgsql-admin list initially
but intended it go here:

We have a database that isn't overly large (~20G), but gets incredibly
frequent updates. Here's an example table:

feedi=# select * from pg_stat_all_tables where schemaname =
'production' and relname = 'tita';
relid = 16786
schemaname = production
relname = tita
seq_scan = 23365
seq_tup_read = 403862091
idx_scan = 26612759248
idx_tup_fetch = 19415752701
n_tup_ins = 24608806
n_tup_upd = 4207076934
n_tup_del = 24566916
n_tup_hot_upd = 4073821079
n_live_tup = 79942
n_dead_tup = 71969
n_mod_since_analyze = 12020
last_vacuum = 2020-03-17 15:35:19.588859+00
last_autovacuum = 2020-03-17 21:31:08.248598+00
last_analyze = 2020-03-17 15:35:20.372875+00
last_autoanalyze = 2020-03-17 22:04:41.76743+00
vacuum_count = 9
autovacuum_count = 135693
analyze_count = 9
autoanalyze_count = 495877

As you can see in this table, there are only ~80K rows, but billions
of updates. What we have observed is that the frozenxid reaches the
200M mark fairly quickly because of the amount of activity. What is
interesting is that this happens with the 'postgres' and 'template1'
databases as well and there is absolutely no activity in those
databases.

When the 'postgres' and/or 'template1' databases hit the
freeze_max_age, there are cases where it kicks off an aggressive
autovac of those tables which seems to prevent autovacs from running
elsewhere. Oddly, this is not consistent, but that condition seems to
be required. We have observed this across multiple PG12 servers (dev,
test, staging, production) all with similar workloads.

Is there anything in postgres and template1 besides what was created at
init?

What are your settings for autovacuum?:

https://www.postgresql.org/docs/12/runtime-config-autovacuum.html

$ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c
17 Mar 17 06
34 Mar 17 07
31 Mar 17 08
31 Mar 17 09
30 Mar 17 10
34 Mar 17 11
33 Mar 17 12
19 Mar 17 13
40 Mar 17 15
31 Mar 17 16
36 Mar 17 17
34 Mar 17 18
35 Mar 17 19
35 Mar 17 20
33 Mar 17 21

As you can see above, we end up having around ~33 autovac/hr, and
about 13:30 today, they stopped until we ran a "vacuum freeze verbose
analyze;" against the 'postgres' database (around 15:30) which then
caused the autovacs to resume running against the "feedi" database.

I'm completely perplexed as to what is happening and why it suddenly
started when we moved from PG10 > PG12. The configs and workload are
essentially the same between versions. We realize we could simply
increase the autovacuum_freeze_max_age, but that doesn't seem to
actually resolve anything -- it just pushes the problem out. Has
anyone seen anything similar to this?

Thanks very much for the consideration.

Justin King
http://flightaware.com/

--
Adrian Klaver
adrian.klaver@aklaver.com

#4 Justin King
kingpin867@gmail.com
In reply to: Adrian Klaver (#3)
Re: Fwd: PG12 autovac issues

On Tue, Mar 17, 2020 at 5:39 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 3/17/20 3:22 PM, Justin King wrote:

Apologies, I accidentally sent this to the pgsql-admin list initially
but intended it go here:

We have a database that isn't overly large (~20G), but gets incredibly
frequent updates. Here's an example table:

feedi=# select * from pg_stat_all_tables where schemaname =
'production' and relname = 'tita';
relid = 16786
schemaname = production
relname = tita
seq_scan = 23365
seq_tup_read = 403862091
idx_scan = 26612759248
idx_tup_fetch = 19415752701
n_tup_ins = 24608806
n_tup_upd = 4207076934
n_tup_del = 24566916
n_tup_hot_upd = 4073821079
n_live_tup = 79942
n_dead_tup = 71969
n_mod_since_analyze = 12020
last_vacuum = 2020-03-17 15:35:19.588859+00
last_autovacuum = 2020-03-17 21:31:08.248598+00
last_analyze = 2020-03-17 15:35:20.372875+00
last_autoanalyze = 2020-03-17 22:04:41.76743+00
vacuum_count = 9
autovacuum_count = 135693
analyze_count = 9
autoanalyze_count = 495877

As you can see in this table, there are only ~80K rows, but billions
of updates. What we have observed is that the frozenxid reaches the
200M mark fairly quickly because of the amount of activity. What is
interesting is that this happens with the 'postgres' and 'template1'
databases as well and there is absolutely no activity in those
databases.

When the 'postgres' and/or 'template1' databases hit the
freeze_max_age, there are cases where it kicks off an aggressive
autovac of those tables which seems to prevent autovacs from running
elsewhere. Oddly, this is not consistent, but that condition seems to
be required. We have observed this across multiple PG12 servers (dev,
test, staging, production) all with similar workloads.

Is there anything in postgres and template1 besides what was created at
init?

There is nothing in there at all besides system tables created at init.

What are your settings for autovacuum?:

https://www.postgresql.org/docs/12/runtime-config-autovacuum.html

Here are the settings, these are the only ones that are not set to
default with the exception of a few tables that have been overridden
with a different value due to lots of updates and few rows:

autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_naptime = 15s
autovacuum_vacuum_threshold = 500
autovacuum_analyze_threshold = 2500
vacuum_cost_limit = 1000

We want fairly aggressive autovacs to keep table bloat limited -- the
application latency suffers if it has to wade through dead tuples and
staying near realtime is important in our environment.

** Also, it should be noted that the autovacuum_analyze_threshold is
probably an incorrect value; we likely intended that to be 250 and
have just now realized it after poking more at the configuration.


$ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c
17 Mar 17 06
34 Mar 17 07
31 Mar 17 08
31 Mar 17 09
30 Mar 17 10
34 Mar 17 11
33 Mar 17 12
19 Mar 17 13
40 Mar 17 15
31 Mar 17 16
36 Mar 17 17
34 Mar 17 18
35 Mar 17 19
35 Mar 17 20
33 Mar 17 21

As you can see above, we end up having around ~33 autovac/hr, and
about 13:30 today, they stopped until we ran a "vacuum freeze verbose
analyze;" against the 'postgres' database (around 15:30) which then
caused the autovacs to resume running against the "feedi" database.

I'm completely perplexed as to what is happening and why it suddenly
started when we moved from PG10 > PG12. The configs and workload are
essentially the same between versions. We realize we could simply
increase the autovacuum_freeze_max_age, but that doesn't seem to
actually resolve anything -- it just pushes the problem out. Has
anyone seen anything similar to this?

Thanks very much for the consideration.

Justin King
http://flightaware.com/

--
Adrian Klaver
adrian.klaver@aklaver.com

#5 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Justin King (#4)
Re: Fwd: PG12 autovac issues

On 3/17/20 3:48 PM, Justin King wrote:

On Tue, Mar 17, 2020 at 5:39 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 3/17/20 3:22 PM, Justin King wrote:

Apologies, I accidentally sent this to the pgsql-admin list initially
but intended it go here:

We have a database that isn't overly large (~20G), but gets incredibly
frequent updates. Here's an example table:

feedi=# select * from pg_stat_all_tables where schemaname =
'production' and relname = 'tita';
relid = 16786
schemaname = production
relname = tita
seq_scan = 23365
seq_tup_read = 403862091
idx_scan = 26612759248
idx_tup_fetch = 19415752701
n_tup_ins = 24608806
n_tup_upd = 4207076934
n_tup_del = 24566916
n_tup_hot_upd = 4073821079
n_live_tup = 79942
n_dead_tup = 71969
n_mod_since_analyze = 12020
last_vacuum = 2020-03-17 15:35:19.588859+00
last_autovacuum = 2020-03-17 21:31:08.248598+00
last_analyze = 2020-03-17 15:35:20.372875+00
last_autoanalyze = 2020-03-17 22:04:41.76743+00
vacuum_count = 9
autovacuum_count = 135693
analyze_count = 9
autoanalyze_count = 495877

As you can see in this table, there are only ~80K rows, but billions
of updates. What we have observed is that the frozenxid reaches the
200M mark fairly quickly because of the amount of activity. What is
interesting is that this happens with the 'postgres' and 'template1'
databases as well and there is absolutely no activity in those
databases.

When the 'postgres' and/or 'template1' databases hit the
freeze_max_age, there are cases where it kicks off an aggressive
autovac of those tables which seems to prevent autovacs from running
elsewhere. Oddly, this is not consistent, but that condition seems to
be required. We have observed this across multiple PG12 servers (dev,
test, staging, production) all with similar workloads.

Is there anything in postgres and template1 besides what was created at
init?

There is nothing in there at all besides system tables created at init.

What are your settings for autovacuum?:

https://www.postgresql.org/docs/12/runtime-config-autovacuum.html

Here are the settings, these are the only ones that are not set to
default with the exception of a few tables that have been overridden
with a different value due to lots of updates and few rows:

And those values are?

More below.

autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_naptime = 15s
autovacuum_vacuum_threshold = 500
autovacuum_analyze_threshold = 2500
vacuum_cost_limit = 1000

Are either of the below set > 0?:

vacuum_cost_delay

autovacuum_vacuum_cost_delay

We want fairly aggressive autovacs to keep table bloat limited -- the
application latency suffers if it has to wade through dead tuples and
staying near realtime is important in our environment.

** Also, it should be noted that the autovacuum_analyze_threshold is
probably an incorrect value, we likely intended that to be 250 and
just have now realized it after poking more at the configuration.

--
Adrian Klaver
adrian.klaver@aklaver.com

#6 Jeff Janes
jeff.janes@gmail.com
In reply to: Justin King (#1)
Re: PG12 autovac issues

On Tue, Mar 17, 2020 at 6:19 PM Justin King <kingpin867@gmail.com> wrote:

We have a database that isn't overly large (~20G), but gets incredibly
frequent updates. Here's an example table:

feedi=# select * from pg_stat_all_tables where schemaname =
'production' and relname = 'tita';

n_tup_upd = 4207076934
n_tup_hot_upd = 4073821079
n_live_tup = 79942
n_dead_tup = 71969
n_mod_since_analyze = 12020

As you can see in this table, there are only ~80K rows, but billions
of updates.

But how long were those billions of updates spread over? You need to look
at deltas, not absolute values. And note that almost all of those updates
were HOT updates, which don't generate "vacuum debt".
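
For example (a hypothetical sketch, not your exact setup), you could capture a
delta by snapshotting the counters and diffing after a known interval:

```sql
-- Sketch: snapshot the update counters, wait a while, then diff.
CREATE TEMP TABLE stat_snap AS
SELECT now() AS ts, n_tup_upd, n_tup_hot_upd, n_dead_tup
FROM pg_stat_all_tables
WHERE schemaname = 'production' AND relname = 'tita';

-- ...after a few minutes, in the same session:
SELECT s.n_tup_upd     - t.n_tup_upd     AS upd_delta,
       s.n_tup_hot_upd - t.n_tup_hot_upd AS hot_delta,
       extract(epoch FROM now() - t.ts)  AS elapsed_sec
FROM pg_stat_all_tables s, stat_snap t
WHERE s.schemaname = 'production' AND s.relname = 'tita';
```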

What we have observed is that the frozenxid reaches the
200M mark fairly quickly because of the amount of activity. What is
interesting is that this happens with the 'postgres' and 'template1'
databases as well and there is absolutely no activity in those
databases.

When the 'postgres' and/or 'template1' databases hit the
freeze_max_age, there are cases where it kicks off an aggressive
autovac of those tables which seems to prevent autovacs from running
elsewhere.

Yes, it is a known long-outstanding bug (or malfeature) that one
database reaching autovacuum_freeze_max_age will starve all other databases
of autovac attention. But since the introduction of the "freeze map" in
9.6, it is hard to see how this starvation due to an inactive database
hitting autovacuum_freeze_max_age can last for any meaningful amount of
time. Maybe a shared catalog?
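
One way to check the shared-catalog theory (a sketch; shared relations are the
ones marked relisshared, and they are common to every database) is to look at
their relfrozenxid ages:

```sql
-- Shared catalogs (pg_authid, pg_database, ...) are flagged relisshared.
SELECT oid::regclass AS relation, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relisshared AND relkind = 'r'
ORDER BY age(relfrozenxid) DESC;
```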

Oddly, this is not consistent, but that condition seems to

be required. We have observed this across multiple PG12 servers (dev,
test, staging, production) all with similar workloads.

$ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c

It is hard to figure out what the significance of the occurrence of the
word 'vacuum' in the log file is, without being intimately familiar with
your log files. Could you interpret this some more for us? How many of
those are for 'tita'? How many for databases other than your active one?

Cheers,

Jeff

#7 Andres Freund
andres@anarazel.de
In reply to: Justin King (#1)
Re: PG12 autovac issues

Hi,

On 2020-03-17 17:18:57 -0500, Justin King wrote:

As you can see in this table, there are only ~80K rows, but billions
of updates. What we have observed is that the frozenxid reaches the
200M mark fairly quickly because of the amount of activity.

And each of those updates is in a separate transaction? Is that
required? I.e. any chance to perform multiple of those updates in one
transaction?

Have you considered just increasing the vacuum limit? It's mostly there
because it can increase space usage a bit, but given today's systems its
not a usually a problem unless you have hundreds of postgres instances
on a single system. It's pretty common to just increase that to 1.8
billion or so.
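
For illustration, raising the limit would look roughly like this (note that
autovacuum_freeze_max_age can only be changed at server start):

```sql
ALTER SYSTEM SET autovacuum_freeze_max_age = 1800000000;  -- ~1.8 billion
-- The setting has "postmaster" context, so a restart is required,
-- e.g. pg_ctl restart or systemctl restart postgresql.
```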

From a single stats snapshot we can't actually understand the actual xid
consumption - is it actually the xid usage that triggers the vacuums?

What makes you think it is a problem that you have all these vacuums? If
you actually update that much, and you have indexes, you're going to want
a lot of vacuums.

What is interesting is that this happens with the 'postgres' and
'template1' databases as well and there is absolutely no activity in
those databases.

That's normal. They should be pretty darn quick in v12?

Greetings,

Andres Freund

#8 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Justin King (#1)
Re: Fwd: PG12 autovac issues

On 3/18/20 6:57 AM, Justin King wrote:
Please reply to list also
Ccing list

Here are the settings, these are the only ones that are not set to
default with the exception of a few tables that have been overridden
with a different value due to lots of updates and few rows:

And those values are?

Thanks for the response, hopefully this will help:

The below is helpful, but what I was referring to above was the settings
for the overridden tables.

postgres=# select name,setting from pg_settings where name like '%vacuum%';
name = setting
autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 2500
autovacuum_freeze_max_age = 200000000
autovacuum_max_workers = 8
autovacuum_multixact_freeze_max_age = 400000000
autovacuum_naptime = 15
autovacuum_vacuum_cost_delay = 20
autovacuum_vacuum_cost_limit = -1
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
autovacuum_work_mem = -1
log_autovacuum_min_duration = 0
vacuum_cleanup_index_scale_factor = 0.1
vacuum_cost_delay = 0
vacuum_cost_limit = 1000
vacuum_cost_page_dirty = 20
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_defer_cleanup_age = 0
vacuum_freeze_min_age = 50000000
vacuum_freeze_table_age = 150000000
vacuum_multixact_freeze_min_age = 5000000
vacuum_multixact_freeze_table_age = 150000000

More below.

autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_naptime = 15s
autovacuum_vacuum_threshold = 500
autovacuum_analyze_threshold = 2500
vacuum_cost_limit = 1000

Are either of the below set > 0?:

vacuum_cost_delay

autovacuum_vacuum_cost_delay

We want fairly aggressive autovacs to keep table bloat limited -- the
application latency suffers if it has to wade through dead tuples and
staying near realtime is important in our environment.

** Also, it should be noted that the autovacuum_analyze_threshold is
probably an incorrect value, we likely intended that to be 250 and
just have now realized it after poking more at the configuration.

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Adrian Klaver
adrian.klaver@aklaver.com

#9 Michael Lewis
mlewis@entrata.com
In reply to: Adrian Klaver (#8)
Re: Fwd: PG12 autovac issues

Do you have default fillfactor set on this table? If not, I would wonder if
reducing it to 50% or even 20% would allow many more HOT updates that would
reduce bloat.
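
As a sketch of what that would look like for the table above (the new
fillfactor only applies to pages written afterward, so a rewrite is needed to
apply it to existing data):

```sql
-- Leave 50% free space per page so updates can stay on the same page (HOT).
ALTER TABLE production.tita SET (fillfactor = 50);
-- Applies only to newly written pages; to rewrite the existing ones:
VACUUM FULL production.tita;  -- caveat: takes an ACCESS EXCLUSIVE lock
```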

Also, is there any period of lower activity on your system that you could
schedule a vacuum freeze for daily or weekly? I believe having frozen pages
would also mean all the autovacuums would be able to skip more pages and
therefore be faster.

autovacuum_vacuum_cost_delay = 20

This was changed to 2ms in PG12. You should reduce that most likely.

#10 Justin King
kingpin867@gmail.com
In reply to: Justin King (#1)
RE: PG12 autovac issues

Hi Andres-

Thanks for the reply, answers below.

On Tue, Mar 17, 2020 at 8:19 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-03-17 17:18:57 -0500, Justin King wrote:

As you can see in this table, there are only ~80K rows, but billions
of updates. What we have observed is that the frozenxid reaches the
200M mark fairly quickly because of the amount of activity.

And each of those updates is in a separate transaction? Is that
required? I.e. any chance to perform multiple of those updates in one
transaction?

Have you considered just increasing the vacuum limit? It's mostly there
because it can increase space usage a bit, but given today's systems its
not a usually a problem unless you have hundreds of postgres instances
on a single system. It's pretty common to just increase that to 1.8
billion or so.

We have considered increasing the limit, but as I mentioned, the
problem that we're seeing is that (effectively) an autovac starts on a
system database (postgres, template1) and never completes, or
deadlocks, or something. This completely stops autovacs from running
until we manually intervene and run a VACUUM FREEZE -- at which point,
the autovacs resume. If we increase the vacuum limit and this
situation happens, we're going to be in real trouble.

From a single stats snapshot we can't actually understand the actual xid
consumption - is it actually the xid usage that triggers the vacuums?

We have looked at this and the xid consumption averages around 1250
xid/sec -- this is when we see the "aggressive" autovac kick off in
the logs. What I don't understand is why these xid's are being
consumed at this rate on the databases with no activity (postgres,
template1).

What makes you think it is a problem that you have all these vacuums? If
you actually update that much, and you have indexes, you're going want a
lot of vacuums?

I actually don't think there's a problem with the vacuums (I was
mostly pointing out that they are very regular and not problematic).
The main problem I am having is that something is causing the
autovacuums to completely stop and require manual intervention to
resume -- and it seems to be when the "postgres" or "template1"
database hits the autovacuum_freeze_max_age.

What is interesting is that this happens with the 'postgres' and
'template1' databases as well and there is absolutely no activity in
those databases.

That's normal. They should be pretty darn quick in v12?

Yes, a manual VACUUM FREEZE of either database takes less than 1
second -- which is why it's perplexing that the autovac starts and
never seems to complete and prevents other autovacs from running.


Greetings,

Andres Freund

#11 Justin King
kingpin867@gmail.com
In reply to: Justin King (#1)
Re: Re: PG12 autovac issues

On Wed, Mar 18, 2020 at 10:13 AM Adrian Klaver
<adrian.klaver@aklaver.com> wrote:

On 3/18/20 6:57 AM, Justin King wrote:
Please reply to list also
Ccing list

Here are the settings, these are the only ones that are not set to
default with the exception of a few tables that have been overridden
with a different value due to lots of updates and few rows:

And those values are?

Thanks for the response, hopefully this will help:

The below is helpful, but what I was referring to above was the settings
for the overridden tables.

Ah, apologies, I missed that. They are:

alter table production.tita set
(autovacuum_analyze_scale_factor = 0, autovacuum_vacuum_scale_factor =
0, autovacuum_vacuum_threshold = 100000, autovacuum_analyze_threshold
= 50000);


postgres=# select name,setting from pg_settings where name like '%vacuum%';
name = setting
autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 2500
autovacuum_freeze_max_age = 200000000
autovacuum_max_workers = 8
autovacuum_multixact_freeze_max_age = 400000000
autovacuum_naptime = 15
autovacuum_vacuum_cost_delay = 20
autovacuum_vacuum_cost_limit = -1
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
autovacuum_work_mem = -1
log_autovacuum_min_duration = 0
vacuum_cleanup_index_scale_factor = 0.1
vacuum_cost_delay = 0
vacuum_cost_limit = 1000
vacuum_cost_page_dirty = 20
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_defer_cleanup_age = 0
vacuum_freeze_min_age = 50000000
vacuum_freeze_table_age = 150000000
vacuum_multixact_freeze_min_age = 5000000
vacuum_multixact_freeze_table_age = 150000000

More below.

autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_naptime = 15s
autovacuum_vacuum_threshold = 500
autovacuum_analyze_threshold = 2500
vacuum_cost_limit = 1000

Are either of the below set > 0?:

vacuum_cost_delay

autovacuum_vacuum_cost_delay

We want fairly aggressive autovacs to keep table bloat limited -- the
application latency suffers if it has to wade through dead tuples and
staying near realtime is important in our environment.

** Also, it should be noted that the autovacuum_analyze_threshold is
probably an incorrect value, we likely intended that to be 250 and
just have now realized it after poking more at the configuration.

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Adrian Klaver
adrian.klaver@aklaver.com

#12 Justin King
kingpin867@gmail.com
In reply to: Michael Lewis (#9)
Re: Fwd: PG12 autovac issues

On Wed, Mar 18, 2020 at 1:40 PM Michael Lewis <mlewis@entrata.com> wrote:

Do you have default fillfactor set on this table? If not, I would wonder if reducing it to 50% or even 20% would allow many more HOT updates that would reduce bloat.

I don't believe we have a default fillfactor, but I'm still trying to
understand why autovacs would completely stop -- that seems like a
bug. Especially since there was no change between PG10 and PG12 and
this problem never existed there.

Also, is there any period of lower activity on your system that you could schedule a vacuum freeze for daily or weekly? I believe having frozen pages would also mean all the autovacuums would be able to skip more pages and therefore be faster.

autovacuum_vacuum_cost_delay = 20

This was changed to 2ms in PG12. You should reduce that most likely.

Actually, we set that back from 2ms to 20ms to replicate what PG10 was
doing, just in case it had an impact (spoiler: it did not).

#13 Michael Lewis
mlewis@entrata.com
In reply to: Justin King (#12)
Re: Fwd: PG12 autovac issues

On Thu, Mar 19, 2020 at 9:31 AM Justin King <kingpin867@gmail.com> wrote:

On Wed, Mar 18, 2020 at 1:40 PM Michael Lewis <mlewis@entrata.com> wrote:

Do you have default fillfactor set on this table? If not, I would wonder

if reducing it to 50% or even 20% would allow many more HOT updates that
would reduce bloat.

I don't believe we have a default fillfactor, but I'm still trying to
understand why autovacs would completely stop -- that seems like a
bug. Especially since there was no change between PG10 and PG12 and
this problem never existed there.

Is there any reason to not schedule vacuum freeze for each db daily? Just
curious.
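
For what it's worth, such a job can be as small as one crontab entry for the
postgres OS user, driving vacuumdb's --freeze option (the schedule here is
hypothetical):

```
# Freeze every database nightly at 03:00
0 3 * * * vacuumdb --all --freeze --quiet
```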

#14 Justin King
kingpin867@gmail.com
In reply to: Michael Lewis (#13)
Re: Fwd: PG12 autovac issues

On Thu, Mar 19, 2020 at 11:02 AM Michael Lewis <mlewis@entrata.com> wrote:

On Thu, Mar 19, 2020 at 9:31 AM Justin King <kingpin867@gmail.com> wrote:

On Wed, Mar 18, 2020 at 1:40 PM Michael Lewis <mlewis@entrata.com> wrote:

Do you have default fillfactor set on this table? If not, I would wonder if reducing it to 50% or even 20% would allow many more HOT updates that would reduce bloat.

I don't believe we have a default fillfactor, but I'm still trying to
understand why autovacs would completely stop -- that seems like a
bug. Especially since there was no change between PG10 and PG12 and
this problem never existed there.

Is there any reason to not schedule vacuum freeze for each db daily? Just curious.

No, not really -- it just feels like a bandaid for something that
should be (and was) working already. It's possibly where I'm headed,
but I thought I'd first reach out to the community to see if anyone
has seen this, or could think of a reason why it might have started
when moving from PG10 to PG12 and whether it's a possible regression.

#15 Andres Freund
andres@anarazel.de
In reply to: Justin King (#10)
Re: PG12 autovac issues

Hi,

On 2020-03-19 10:23:48 -0500, Justin King wrote:

From a single stats snapshot we can't actually understand the actual xid
consumption - is it actually the xid usage that triggers the vacuums?

We have looked at this and the xid consumption averages around 1250
xid/sec -- this is when we see the "aggressive" autovac kick off in
the logs. What I don't understand is why these xid's are being
consumed at this rate on the databases with no activity (postgres,
template1).

The xid counter is global across all databases.
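
You can watch that directly; as a sketch, sampling txid_current() from any
database in the cluster shows the cluster-wide rate (keeping in mind the call
itself consumes an xid):

```sql
SELECT txid_current();  -- note the value, wait N seconds...
SELECT txid_current();  -- ...then (second - first) / N gives ~xids/sec
```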

What makes you think it is a problem that you have all these vacuums? If
you actually update that much, and you have indexes, you're going want a
lot of vacuums?

I actually don't think there's a problem with the vacuums (I was
mostly pointing out that they are very regular and not problematic).
The main problem I am having is that something is causing the
autovacuums to completely stop and require manual intervention to
resume -- and it seems to be when the "postgres" or "template1"
database hits the autovacuum_freeze_max_age.

Did you look at pg_stat_activity for those autovacuums to see whether
they're blocked on something?
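
Something like the following (a sketch using the backend_type column, which is
available in v12) would show whether the workers are waiting on a lock:

```sql
SELECT pid, pg_blocking_pids(pid) AS blocked_by,
       wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
```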

What is interesting is that this happens with the 'postgres' and
'template1' databases as well and there is absolutely no activity in
those databases.

That's normal. They should be pretty darn quick in v12?

Yes, a manual VACUUM FREEZE of either database takes less than 1
second -- which is why it's perplexing that the autovac starts and
never seems to complete and prevents other autovacs from running.

One big difference between a manual VACUUM and autovacuum is that with
the default settings VACUUM is not throttled, but autovacuum is.

What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
vacuum_cost_page_miss set to?

Greetings,

Andres Freund

#16 Justin King
kingpin867@gmail.com
In reply to: Andres Freund (#15)
Re: PG12 autovac issues

On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-03-19 10:23:48 -0500, Justin King wrote:

From a single stats snapshot we can't actually understand the actual xid
consumption - is it actually the xid usage that triggers the vacuums?

We have looked at this and the xid consumption averages around 1250
xids/sec -- this is when we see the "aggressive" autovac kick off in
the logs. What I don't understand is why these xids are being
consumed at this rate on the databases with no activity (postgres,
template1).

The xid counter is global across all databases.

Then what does the "age" value represent for each database in this
case? Perhaps I'm misunderstanding what I'm looking at?

postgres=# SELECT datname, age(datfrozenxid),
current_setting('autovacuum_freeze_max_age') FROM pg_database;
datname | age | current_setting
-----------+-----------+-----------------
postgres | 100937449 | 200000000
template1 | 50244438 | 200000000
template0 | 160207297 | 200000000
feedi | 150147602 | 200000000

What makes you think it is a problem that you have all these vacuums? If
you actually update that much, and you have indexes, you're going to want
a lot of vacuums?

I actually don't think there's a problem with the vacuums (I was
mostly pointing out that they are very regular and not problematic).
The main problem I am having is that something is causing the
autovacuums to completely stop and require manual intervention to
resume -- and it seems to be when the "postgres" or "template1"
database hits the autovacuum_freeze_max_age.

Did you look at pg_stat_activity for those autovacuums to see whether
they're blocked on something?

This is not something we've done yet but will next time it occurs.

What is interesting is that this happens with the 'postgres' and
'template1' databases as well and there is absolutely no activity in
those databases.

That's normal. They should be pretty darn quick in v12?

Yes, a manual VACUUM FREEZE of either database takes less than 1
second -- which is why it's perplexing that the autovac starts and
never seems to complete and prevents other autovacs from running.

One big difference between a manual VACUUM and autovacuum is that with
the default settings VACUUM is not throttled, but autovacuum is.

What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
vacuum_cost_page_miss set to?

Here are all the vacuum related values for the server:

postgres=# select name,setting from pg_settings where name like '%vacuum%';
autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 2500
autovacuum_freeze_max_age = 200000000
autovacuum_max_workers = 8
autovacuum_multixact_freeze_max_age = 400000000
autovacuum_naptime = 15
autovacuum_vacuum_cost_delay = 20
autovacuum_vacuum_cost_limit = -1
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
autovacuum_work_mem = -1
log_autovacuum_min_duration = 0
vacuum_cleanup_index_scale_factor = 0.1
vacuum_cost_delay = 0
vacuum_cost_limit = 1000
vacuum_cost_page_dirty = 20
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_defer_cleanup_age = 0
vacuum_freeze_min_age = 50000000
vacuum_freeze_table_age = 150000000
vacuum_multixact_freeze_min_age = 5000000
vacuum_multixact_freeze_table_age = 150000000

I know the database is busy, so the throttling makes sense, but it
seems like it would complete eventually. We see blocked autovacs for
many hours.


Greetings,

Andres Freund

#17Andres Freund
andres@anarazel.de
In reply to: Justin King (#16)
Re: PG12 autovac issues

Hi,

On 2020-03-19 18:07:14 -0500, Justin King wrote:

On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-03-19 10:23:48 -0500, Justin King wrote:

From a single stats snapshot we can't actually understand the actual xid
consumption - is it actually the xid usage that triggers the vacuums?

We have looked at this and the xid consumption averages around 1250
xids/sec -- this is when we see the "aggressive" autovac kick off in
the logs. What I don't understand is why these xids are being
consumed at this rate on the databases with no activity (postgres,
template1).

The xid counter is global across all databases.

Then what does the "age" value represent for each database in this
case? Perhaps I'm misunderstanding what I'm looking at?

postgres=# SELECT datname, age(datfrozenxid),
current_setting('autovacuum_freeze_max_age') FROM pg_database;
datname | age | current_setting
-----------+-----------+-----------------
postgres | 100937449 | 200000000
template1 | 50244438 | 200000000
template0 | 160207297 | 200000000
feedi | 150147602 | 200000000

Look at datfrozenxid without the age(). age(xid) computes how "old" xid
is compared to the "next" xid to be assigned. Until vacuum comes around
and performs work, pg_database.datfrozenxid / pg_class.relfrozenxid are
constant, since they represent the values actually present in the
table. But if xids are being consumed, their "age" increases, because
they're further and further in the past relative to the "newest" xids.
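For example, putting the raw value and age() side by side makes this visible:

```sql
-- datfrozenxid itself only moves when vacuum advances it;
-- age() grows continuously as the global xid counter advances.
SELECT datname, datfrozenxid, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```

Run it twice a few minutes apart: datfrozenxid stays constant while the age column climbs, at roughly your xid consumption rate.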

One big difference between a manual VACUUM and autovacuum is that with
the default settings VACUUM is not throttled, but autovacuum is.

What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
vacuum_cost_page_miss set to?

Here are all the vacuum related values for the server:

postgres=# select name,setting from pg_settings where name like '%vacuum%';
autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 2500
autovacuum_freeze_max_age = 200000000
autovacuum_max_workers = 8
autovacuum_multixact_freeze_max_age = 400000000
autovacuum_naptime = 15
autovacuum_vacuum_cost_delay = 20
autovacuum_vacuum_cost_limit = -1
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
autovacuum_work_mem = -1
log_autovacuum_min_duration = 0
vacuum_cleanup_index_scale_factor = 0.1
vacuum_cost_delay = 0
vacuum_cost_limit = 1000
vacuum_cost_page_dirty = 20
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_defer_cleanup_age = 0
vacuum_freeze_min_age = 50000000
vacuum_freeze_table_age = 150000000
vacuum_multixact_freeze_min_age = 5000000
vacuum_multixact_freeze_table_age = 150000000

I know the database is busy, so the throttling makes sense, but it
seems like it would complete eventually.

The cost delay is way too long, and the cost limit way too low, for a
busy postgres instance.
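As a sketch (the exact numbers are illustrative and depend on your I/O headroom, not a recommendation from this thread):

```sql
-- Illustrative values only; tune against your I/O capacity.
-- 2ms is the v12 default for autovacuum_vacuum_cost_delay.
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
SELECT pg_reload_conf();
```

Lowering the delay and raising the limit lets each worker do proportionally more page I/O per second before sleeping.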

We see blocked autovacs for many hours.

On the same table, or just generally being busy?

Greetings,

Andres Freund

#18Justin King
kingpin867@gmail.com
In reply to: Andres Freund (#17)
Re: PG12 autovac issues

On Thu, Mar 19, 2020 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-03-19 18:07:14 -0500, Justin King wrote:

On Thu, Mar 19, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-03-19 10:23:48 -0500, Justin King wrote:

From a single stats snapshot we can't actually understand the actual xid
consumption - is it actually the xid usage that triggers the vacuums?

We have looked at this and the xid consumption averages around 1250
xids/sec -- this is when we see the "aggressive" autovac kick off in
the logs. What I don't understand is why these xids are being
consumed at this rate on the databases with no activity (postgres,
template1).

The xid counter is global across all databases.

Then what does the "age" value represent for each database in this
case? Perhaps I'm misunderstanding what I'm looking at?

postgres=# SELECT datname, age(datfrozenxid),
current_setting('autovacuum_freeze_max_age') FROM pg_database;
datname | age | current_setting
-----------+-----------+-----------------
postgres | 100937449 | 200000000
template1 | 50244438 | 200000000
template0 | 160207297 | 200000000
feedi | 150147602 | 200000000

Look at datfrozenxid without the age(). age(xid) computes how "old" xid
is compared to the "next" xid to be assigned. Until vacuum comes around
and performs work, pg_database.datfrozenxid / pg_class.relfrozenxid are
constant, since they represent the values actually present in the
table. But if xids are being consumed, their "age" increases, because
they're further and further in the past relative to the "newest" xids.

One big difference between a manual VACUUM and autovacuum is that with
the default settings VACUUM is not throttled, but autovacuum is.

What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
vacuum_cost_page_miss set to?

Here are all the vacuum related values for the server:

postgres=# select name,setting from pg_settings where name like '%vacuum%';
autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 2500
autovacuum_freeze_max_age = 200000000
autovacuum_max_workers = 8
autovacuum_multixact_freeze_max_age = 400000000
autovacuum_naptime = 15
autovacuum_vacuum_cost_delay = 20
autovacuum_vacuum_cost_limit = -1
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
autovacuum_work_mem = -1
log_autovacuum_min_duration = 0
vacuum_cleanup_index_scale_factor = 0.1
vacuum_cost_delay = 0
vacuum_cost_limit = 1000
vacuum_cost_page_dirty = 20
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_defer_cleanup_age = 0
vacuum_freeze_min_age = 50000000
vacuum_freeze_table_age = 150000000
vacuum_multixact_freeze_min_age = 5000000
vacuum_multixact_freeze_table_age = 150000000

I know the database is busy, so the throttling makes sense, but it
seems like it would complete eventually.

The cost delay is way too long, and the cost limit way too low, for a
busy postgres instance.

This does make sense, we will look into adjusting those values.

We see blocked autovacs for many hours.

On the same table, or just generally being busy?

We haven't isolated *which* table it is blocked on (assuming it is),
but all autovacs cease running until we manually intervene.

When we get into this state again, is there some other information
(other than what is in pg_stat_statement or pg_stat_activity) that
would be useful for folks here to help understand what is going on?


Greetings,

Andres Freund

#19Andres Freund
andres@anarazel.de
In reply to: Justin King (#18)
Re: PG12 autovac issues

Hi,

On 2020-03-20 12:42:31 -0500, Justin King wrote:

When we get into this state again, is there some other information
(other than what is in pg_stat_statement or pg_stat_activity) that
would be useful for folks here to help understand what is going on?

If it's actually stuck on a single table, and that table is not large,
it would be useful to get a backtrace with gdb.
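Something like this (assuming gdb is available, debug symbols are installed, and update_process_title is on so the worker is identifiable by its process title):

```shell
# Find a stuck autovacuum worker's pid, then grab a one-shot backtrace.
PID=$(pgrep -f 'autovacuum worker' | head -n1)
gdb -p "$PID" --batch -ex 'bt'
```

Note that attaching gdb pauses the backend briefly, so avoid leaving it attached on a production instance.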

Greetings,

Andres Freund

#20Michael Lewis
mlewis@entrata.com
In reply to: Justin King (#18)
Re: PG12 autovac issues

We haven't isolated *which* table it is blocked on (assuming it is),
but all autovacs cease running until we manually intervene.

When we get into this state again, is there some other information
(other than what is in pg_stat_statement or pg_stat_activity) that
would be useful for folks here to help understand what is going on?

https://www.postgresql.org/docs/current/progress-reporting.html#VACUUM-PROGRESS-REPORTING
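For example, something along these lines shows each backend (including autovacuum workers) that is mid-vacuum and which phase it is in:

```sql
-- One row per backend currently running VACUUM (incl. autovacuum).
-- relname only resolves for relations in the current database.
SELECT p.pid, p.datname, c.relname, p.phase,
       p.heap_blks_scanned, p.heap_blks_total
FROM pg_stat_progress_vacuum p
LEFT JOIN pg_class c ON c.oid = p.relid;
```

A worker whose heap_blks_scanned stops advancing for a long stretch is a good candidate for the pg_stat_activity wait-event check.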

#21Julien Rouhaud
rjuju123@gmail.com
In reply to: Andres Freund (#19)
#22Andres Freund
andres@anarazel.de
In reply to: Julien Rouhaud (#21)
#23Julien Rouhaud
rjuju123@gmail.com
In reply to: Andres Freund (#22)
#24Andres Freund
andres@anarazel.de
In reply to: Julien Rouhaud (#23)
#25Justin King
kingpin867@gmail.com
In reply to: Andres Freund (#24)
#26Andres Freund
andres@anarazel.de
In reply to: Justin King (#25)
#27Justin King
kingpin867@gmail.com
In reply to: Justin King (#25)
#28Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#24)
#29Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#28)
#30Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#29)
#31Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#30)
#32Justin King
kingpin867@gmail.com
In reply to: Andres Freund (#31)
#33Michael Paquier
michael@paquier.xyz
In reply to: Justin King (#32)
#34Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#33)
#35Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#34)
#36Justin King
kingpin867@gmail.com
In reply to: Michael Paquier (#33)
#37Michael Paquier
michael@paquier.xyz
In reply to: Justin King (#36)
#38Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#37)
#39Justin King
kingpin867@gmail.com
In reply to: Michael Paquier (#37)
#40Michael Paquier
michael@paquier.xyz
In reply to: Justin King (#39)
#41Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#38)
#42Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#40)
#43Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#42)
#44Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#41)