Need help debugging why autovacuum seems "stuck" -- until I use superuser to vacuum freeze pg_database

Started by Shawnalmost 10 years ago4 messageshackers
Jump to latest
#1Shawn
shamccoy@amazon.com

I have been debugging a problem on a 9.3.10 Postgres database cluster with over 1200 databases. 10 workers, increased maintenance_work_mem, auto vacuum settings to run more frequently than default. What I will notice is that autovacuum will run for a week or so and traverse databases as expected. I will be able to see that age(datfrozenxid) for all 1200 databases will stay close to autovacuum_freeze_max_age as desired.

Then, suddenly I will see it get “stuck”. Autovacuum launcher will not launch worker processes even though databases start to age past autovacuum_freeze_max_age. If I create a list of databases and sort by age(datfrozenxid), connect to the database with the oldest and execute a simple: "vacuum freeze pg_database;”, autovacuum springs back into action.

It’s never the same database where autovacuum seems to get “stuck”. I’m attempting to gather more debugging information, but, also can’t understand why simply doing a “vacuum freeze pg_database” breaks up the jam.

Any thoughts?

Shawn

#2Robert Haas
robertmhaas@gmail.com
In reply to: Shawn (#1)
Re: Need help debugging why autovacuum seems "stuck" -- until I use superuser to vacuum freeze pg_database

On Sun, May 1, 2016 at 10:39 PM, McCoy, Shawn <shamccoy@amazon.com> wrote:

I have been debugging a problem on a 9.3.10 Postgres database cluster with
over 1200 databases. 10 workers, increased maintenance_work_mem, auto
vacuum settings to run more frequently than default. What I will notice is
that autovacuum will run for a week or so and traverse databases as
expected. I will be able to see that age(datfrozenxid) for all 1200
databases will stay close to autovacuum_freeze_max_age as desired.

Then, suddenly I will see it get “stuck”. Autovacuum launcher will not
launch worker processes even though databases start to age past
autovacuum_freeze_max_age. If I create a list of databases and sort by
age(datfrozenxid), connect to the database with the oldest and execute a
simple: "vacuum freeze pg_database;”, autovacuum springs back into action.

It’s never the same database where autovacuum seems to get “stuck”. I’m
attempting to gather more debugging information, but, also can’t understand
why simply doing a “vacuum freeze pg_database” breaks up the jam.

Any thoughts?

So when it's stuck, there are no AV worker processes running at all,
for a sustained period of time?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Shawn
shamccoy@amazon.com
In reply to: Robert Haas (#2)
Re: Need help debugging why autovacuum seems "stuck" -- until I use superuser to vacuum freeze pg_database

I enabled debug level and can see that the launcher does in fact spawn a
worker, but, it appears, scans and disappears in milliseconds.

It was stuck again this morning and it was spitting this out ad nauseum for
"my_database_name":

2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: pg_database: vac: 0 (threshold
141), anl: 57 (threshold 166)
2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: autovac_balance_cost(pid=13567
db=143187, rel=1262, dobalance=yes cost_limit=1000, cost_limit_base=1000,
cost_delay=10)
2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: vacuuming
"pg_catalog.pg_database"
2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: index "pg_database_datname_index"
now contains 1325 row versions in 11 pages
2016-05-11 16:03:42 UTC::@:[13567]:DETAIL: 0 index row versions were
removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: index "pg_database_oid_index" now
contains 1325 row versions in 6 pages
2016-05-11 16:03:42 UTC::@:[13567]:DETAIL: 0 index row versions were
removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
2016-05-11 16:03:42 UTC::@:[13567]:DEBUG: "pg_database": found 0 removable,
0 nonremovable row versions in 0 out of 43 pages
2016-05-11 16:03:42 UTC::@:[13567]:DETAIL: 0 dead row versions cannot be
removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
2016-05-11 16:03:42 UTC::@:[13567]:LOG: automatic vacuum of table
"my_database_name.pg_catalog.pg_database": index scans: 0
pages: 0 removed, 43 remain
tuples: 0 removed, 1156 remain
buffer usage: 45 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec

The problem is though, when I look at age(datfrozenxid) for
"my_database_name" it is the oldest at 654,189,546 and should trigger the
autovacuum_freeze_max_age of 200,000,000 but it doesn't. The relation with
this age is "pg_database". If I issue a manual "vacuum freeze pg_database"
in the database, autovacuum will run on the next oldest database but get
stuck again with same symptoms (i.e. pg_database is the relation with the
oldest age). If I don't issue the manual "vacuum freeze" for all databases
and left things alone, it would get to wrap-around state.

--
View this message in context: http://postgresql.nabble.com/Need-help-debugging-why-autovacuum-seems-stuck-until-I-use-superuser-to-vacuum-freeze-pg-database-tp5901425p5903104.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Shawn (#3)
Re: Re: Need help debugging why autovacuum seems "stuck" -- until I use superuser to vacuum freeze pg_database

Shawn wrote:

The problem is though, when I look at age(datfrozenxid) for
"my_database_name" it is the oldest at 654,189,546 and should trigger the
autovacuum_freeze_max_age of 200,000,000 but it doesn't. The relation with
this age is "pg_database". If I issue a manual "vacuum freeze pg_database"
in the database, autovacuum will run on the next oldest database but get
stuck again with same symptoms (i.e. pg_database is the relation with the
oldest age). If I don't issue the manual "vacuum freeze" for all databases
and left things alone, it would get to wrap-around state.

I was thinking about this in connection with the shared catalog
vacuuming bug I fixed yesterday. One problem here is that the
relfrozenxid/relminmxid settings are local to each database, so a
for-wraparound vacuum would have to scan the table in all databases, and
advance the counter in all databases, before the overall "age" of the
system goes down again, which is inconvenient. This is a problem
inherent to having these columns in a per-database catalog, and the only
solution I can see is to have another shared catalog containing age
state for the shared catalogs. (However, this issue isn't terribly
serious -- it just means you have to scan the catalog N times instead of
one.)

However this doesn't explain why the vacuuming of pg_database in one
database doesn't actually advance the relfrozenxid in that database; it
certainly should. This is the serious problem you're facing, it seems.
I wonder if you've tweaked the freeze settings somehow.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers