Autovacuum stuck for hours, blocking queries

Started by Tim Bellisabout 9 years ago25 messagesgeneral
Jump to latest
#1Tim Bellis
Tim.Bellis@metaswitch.com

I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked indefinitely (at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never take any blocking locks for any significant period of time, and so would like help resolving the issue.

The process blocking the query is:
postgres 21985 11304 98 Feb13 ? 1-14:20:52 postgres: autovacuum worker process <db_name>
which is running the query
autovacuum: VACUUM public.<table_name>

The query being blocked is:
ALTER TABLE <table_name> ALTER COLUMN <column_name> DROP DEFAULT
(But I have seen this previously with other queries being blocked. I used the SQL in https://wiki.postgresql.org/wiki/Lock_Monitoring to determine which queries were blocked)

Notes:
- This database table is used for about 6 million row writes per day, all of which are then deleted at the end of the day.
- We have a reasonable number of systems running identical databases and near identical load profiles, and this problem has only been seen on two of those systems. It occurs intermittently but reliably (once every few days). It persists until postgres is restarted (usually accomplished by a reboot) or the autovacuum thread is forcibly killed. Although it often happens again after a short period of time when the autovacuum worker process starts.
- I don't have data on whether this vacuum ever finishes - it certainly takes longer than a few hours.
- Our application kicks off a manual vacuum against this table each night which doesn't hit this problem, as far as we're aware.
- I have sometimes seen this behaviour when there was database corruption. In that case I see logs like the following. But it also occurs when there are no logs indicating database corruption (which is the purpose of this email)
LOG: no left sibling (concurrent deletion?) in "<index_name>"
ERROR: right sibling's left-link doesn't match: block 41 links to 127 instead of expected 79 in index "<index_name>"

Things that I have tried:
- Printing out backtrace a few times from connecting to that process with gdb (see below)
- This indicates that the vacuum process isn't stuck, but I don't know what else it says
- Reading the FAQ and release notes of other 9.3.x versions to see whether this looks like a known issue. I couldn't see anything obvious apart from an issue about 0000 files (which didn't apply to my system - there was SQL in the release note to check).
- Forcibly killing the autovacuum thread (obviously not a good solution to the problem!). This causes the ALTER TABLE query to complete correctly.

Questions:
- Is this expected? I was under the impression that vacuum should never take any blocking locks for any significant period of time.
- Should I report this as a bug?
- Should I change some config options to avoid this? (I know turning off autovacuum would avoid it, but that's not good practice) I've put the vacuum settings for my database below.
- Should I avoid certain queries which autovacuum is happening?
- Are there more diags that I should gather to help diagnose this issue?

Thanks!

Tim

--System details--
Postgres 9.3.4 running on Linux x64 VMWare virtual machine. 24 CPUs of 2.7 GHz. 48GB RAM
It is also running another application (so the database is not the only thing on the server).

---Backtrace (from gdb) ---

#0 0x00000000007328c3 in hash_search_with_hash_value ()
#1 0x00000000006369ba in BufTableLookup ()
#2 0x0000000000639192 in ?? ()
#3 0x0000000000639ade in ReadBufferExtended ()
#4 0x00000000004906d9 in _bt_getbuf ()
#5 0x00000000004910e1 in _bt_pagedel ()
#6 0x00000000004924d1 in ?? ()
#7 0x00000000004926da in ?? ()
#8 0x00000000004928fa in btbulkdelete ()
#9 0x000000000072cdbd in FunctionCall4Coll ()
#10 0x000000000048cf58 in index_bulk_delete ()
#11 0x000000000057bb85 in ?? ()
#12 0x000000000057cfe9 in lazy_vacuum_rel ()
#13 0x000000000057af56 in ?? ()
#14 0x000000000057b28c in vacuum ()
#15 0x000000000060c8fa in ?? ()
#16 0x000000000060cd96 in ?? ()
#17 0x000000000060ce66 in StartAutoVacWorker ()
#18 0x0000000000617602 in ?? ()
#19 <signal handler called>
#20 0x00007f9210c9d393 in __select_nocancel () from /lib64/libc.so.6
#21 0x0000000000618add in PostmasterMain ()
#22 0x00000000005b58d0 in main ()

#0 0x0000000000491107 in _bt_pagedel ()
#1 0x00000000004924d1 in ?? ()
#2 0x00000000004926da in ?? ()
#3 0x00000000004928fa in btbulkdelete ()
...

#0 0x000000000047a1ef in hash_any ()
#1 0x00000000007336e9 in tag_hash ()
#2 0x000000000063916c in ?? ()
#3 0x0000000000639ade in ReadBufferExtended ()
#4 0x00000000004906d9 in _bt_getbuf ()
#5 0x00000000004910e1 in _bt_pagedel ()
#6 0x00000000004924d1 in ?? ()
#7 0x00000000004926da in ?? ()
#8 0x00000000004928fa in btbulkdelete ()
...

---Vacuum settings in postgresql.conf---

# - Cost-Based Vacuum Delay -

vacuum_cost_delay = 10 # 0-100 milliseconds
#vacuum_cost_page_hit = 1 # 0-10000 credits
#vacuum_cost_page_miss = 10 # 0-10000 credits
#vacuum_cost_page_dirty = 20 # 0-10000 credits
vacuum_cost_limit = 200 # 1-10000 credits

#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed

#------------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#------------------------------------------------------------------------------

autovacuum = on # Enable autovacuum subprocess? 'on'
# requires track_counts to also be on.
#log_autovacuum_min_duration = -1 # -1 disables, 0 logs all actions and
# their durations, > 0 logs only
# actions running at least this number
# of milliseconds.
#autovacuum_max_workers = 3 # max number of autovacuum subprocesses
# (change requires restart)
#autovacuum_naptime = 1min # time between autovacuum runs
#autovacuum_vacuum_threshold = 50 # min number of row updates before
# vacuum
#autovacuum_analyze_threshold = 50 # min number of row updates before
# analyze
#autovacuum_vacuum_scale_factor = 0.2 # fraction of table size before vacuum
#autovacuum_analyze_scale_factor = 0.1 # fraction of table size before analyze
#autovacuum_freeze_max_age = 200000000 # maximum XID age before forced vacuum
# (change requires restart)
#autovacuum_multixact_freeze_max_age = 400000000 # maximum Multixact age
# before forced vacuum
# (change requires restart)
#autovacuum_vacuum_cost_delay = 20ms # default vacuum cost delay for
# autovacuum, in milliseconds;
# -1 means use vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1 # default vacuum cost limit for
# autovacuum, -1 means use
# vacuum_cost_limit

#vacuum_freeze_min_age = 50000000
#vacuum_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_freeze_table_age = 150000000

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tim Bellis (#1)
Re: Autovacuum stuck for hours, blocking queries

On 02/15/2017 09:30 AM, Tim Bellis wrote:

I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked indefinitely (at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never take any blocking locks for any significant period of time, and so would like help resolving the issue.

The process blocking the query is:
postgres 21985 11304 98 Feb13 ? 1-14:20:52 postgres: autovacuum worker process <db_name>
which is running the query
autovacuum: VACUUM public.<table_name>

The query being blocked is:
ALTER TABLE <table_name> ALTER COLUMN <column_name> DROP DEFAULT
(But I have seen this previously with other queries being blocked. I used the SQL in https://wiki.postgresql.org/wiki/Lock_Monitoring to determine which queries were blocked)

Other ALTER TABLE queries?

If so I believe this might apply:

https://www.postgresql.org/docs/9.5/static/explicit-locking.html

SHARE UPDATE EXCLUSIVE

Conflicts with the SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW
EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes. This mode
protects a table against concurrent schema changes and VACUUM runs.

Acquired by VACUUM (without FULL), ANALYZE, CREATE INDEX
CONCURRENTLY, and ALTER TABLE VALIDATE and other ALTER TABLE variants
(for full details see ALTER TABLE).

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Scott Marlowe
scott.marlowe@gmail.com
In reply to: Tim Bellis (#1)
Re: Autovacuum stuck for hours, blocking queries

On Wed, Feb 15, 2017 at 10:30 AM, Tim Bellis <Tim.Bellis@metaswitch.com> wrote:

I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked indefinitely (at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never take any blocking locks for any significant period of time, and so would like help resolving the issue.

The process blocking the query is:
postgres 21985 11304 98 Feb13 ? 1-14:20:52 postgres: autovacuum worker process <db_name>
which is running the query
autovacuum: VACUUM public.<table_name>

The query being blocked is:
ALTER TABLE <table_name> ALTER COLUMN <column_name> DROP DEFAULT
(But I have seen this previously with other queries being blocked. I used the SQL in https://wiki.postgresql.org/wiki/Lock_Monitoring to determine which queries were blocked)

Yup, there's a priority inversion in DDL, DML and maintenance
(vacuum). Vacuum runs slow in the background. Normal
update/delete/insert work fine because of the type of lock vacuum has.

Then a ddl gets in line. It has to wait on the vacuum, and the vacuum,
set to run super slow. And everybody waits. On vacuum.

Basically it's bad practice to alter tables that are big and being
worked on, because one way or another you're going to pay a price.

I've used partitions for logging and auditing that autocreate and drop
and vacuum, but they never get ddl done on them when they're getting
updated and vice versa.

There are also ways of making the table less likely / not likely /
will not get vacuum automatically. If you're willing to schedule ddl
and vacuum on your own you can then mix the two in relative safety.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Scott Marlowe
scott.marlowe@gmail.com
In reply to: Scott Marlowe (#3)
Re: Autovacuum stuck for hours, blocking queries

On Wed, Feb 15, 2017 at 3:26 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

On Wed, Feb 15, 2017 at 10:30 AM, Tim Bellis <Tim.Bellis@metaswitch.com> wrote:

I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked indefinitely (at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never take any blocking locks for any significant period of time, and so would like help resolving the issue.

The process blocking the query is:
postgres 21985 11304 98 Feb13 ? 1-14:20:52 postgres: autovacuum worker process <db_name>
which is running the query
autovacuum: VACUUM public.<table_name>

The query being blocked is:
ALTER TABLE <table_name> ALTER COLUMN <column_name> DROP DEFAULT
(But I have seen this previously with other queries being blocked. I used the SQL in https://wiki.postgresql.org/wiki/Lock_Monitoring to determine which queries were blocked)

There are also ways of making the table less likely / not likely /
will not get vacuum automatically. If you're willing to schedule ddl
and vacuum on your own you can then mix the two in relative safety.

Followup: https://www.postgresql.org/docs/9.3/static/sql-createtable.html#SQL-CREATETABLE-STORAGE-PARAMETERS

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Scott Marlowe (#3)
Re: Autovacuum stuck for hours, blocking queries

Scott Marlowe wrote:

Then a ddl gets in line. It has to wait on the vacuum, and the vacuum,
set to run super slow. And everybody waits. On vacuum.

Note that this is normally not seen, because autovacuum cancels itself
when somebody is blocked behind it -- until the table reaches the
freeze_max_age limit, and then autovacuum is a for-wraparound one that
is no longer terminated, and then everybody has to wait on it.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Tim Bellis
Tim.Bellis@metaswitch.com
In reply to: Adrian Klaver (#2)
Re: Autovacuum stuck for hours, blocking queries

Thank you all - that's really useful :-)

The other query that gets blocked behind the vacuum is the below (truncated).

This query is generated by jdbc in this method:
org.postgresql.jdbc2.AbstractJdbc2DatabaseMetaData.getIndexInfo(AbstractJdbc2DatabaseMetaData.java:4023)

Even though this is a read only query, is it also expected to be blocked behind the vacuum? Is there a way of getting indexes for a table which won't be blocked behind a vacuum?

Thank you all again,

Tim

SELECT NULL AS TABLE_CAT, n.nspname AS TABLE_SCHEM, ct.relname AS TABLE_NAME, NOT i.indisunique AS NON_UNIQUE, NULL AS INDEX_QUALIFIER, ci.relname AS INDEX_NAME, CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS TYPE, (i.keys).n AS ORDINAL_POSITION, pg_catalog.pg_get_indexdef(ci.oid, (i.keys).n, false) AS COLUMN_NAME, CASE am.amcanorder WHEN true THEN CASE i.indoption[(i.keys).n - 1] & 1 WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS ASC_OR_DESC, ci.reltuples AS CARDINALITY, ci.relpages AS PAGES, pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS FILTER_CONDITION FROM pg_catalog.pg_class ct JOIN pg_catalog.pg_namespace n ON (ct.relnamespace = n.oid) JOIN (SELECT i.indexrelid, i.indrelid, i.indoption, i.indisunique, i.indisclustered, i.indpred, i.indexprs, information_schema._pg_expandarray(i.indkey) AS keys FROM pg_catalog.pg_index i) i ON (ct.oid = i.ind

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tim Bellis (#6)
Re: Autovacuum stuck for hours, blocking queries

On 02/16/2017 08:45 AM, Tim Bellis wrote:

Thank you all - that's really useful :-)

The other query that gets blocked behind the vacuum is the below (truncated).

This query is generated by jdbc in this method:
org.postgresql.jdbc2.AbstractJdbc2DatabaseMetaData.getIndexInfo(AbstractJdbc2DatabaseMetaData.java:4023)

Is JDBC doing anything else before issuing this?

Even though this is a read only query, is it also expected to be blocked behind the vacuum? Is there a way of getting indexes for a table which won't be blocked behind a vacuum?

Table 13.2 here:

https://www.postgresql.org/docs/9.5/static/explicit-locking.html

shows the conflicts with SHARE UPDATE EXCLUSIVE(vacuum).

pg_locks:

https://www.postgresql.org/docs/9.5/static/view-pg-locks.html

shows locks being held. So next time it happens I would take a look and
see if you can work backwards from there.

You could directly access the index information using:

https://www.postgresql.org/docs/9.6/static/catalog-pg-index.html
https://www.postgresql.org/docs/9.6/static/catalog-pg-class.html

Thank you all again,

Tim

SELECT NULL AS TABLE_CAT, n.nspname AS TABLE_SCHEM, ct.relname AS TABLE_NAME, NOT i.indisunique AS NON_UNIQUE, NULL AS INDEX_QUALIFIER, ci.relname AS INDEX_NAME, CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS TYPE, (i.keys).n AS ORDINAL_POSITION, pg_catalog.pg_get_indexdef(ci.oid, (i.keys).n, false) AS COLUMN_NAME, CASE am.amcanorder WHEN true THEN CASE i.indoption[(i.keys).n - 1] & 1 WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS ASC_OR_DESC, ci.reltuples AS CARDINALITY, ci.relpages AS PAGES, pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS FILTER_CONDITION FROM pg_catalog.pg_class ct JOIN pg_catalog.pg_namespace n ON (ct.relnamespace = n.oid) JOIN (SELECT i.indexrelid, i.indrelid, i.indoption, i.indisunique, i.indisclustered, i.indpred, i.indexprs, information_schema._pg_expandarray(i.indkey) AS keys FROM pg_catalog.pg_index i) i ON (ct.oid = i.ind

This query is cut off so cannot say whether it is the issue or not.

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tim Bellis (#6)
Re: Autovacuum stuck for hours, blocking queries

Tim Bellis <Tim.Bellis@metaswitch.com> writes:

Even though this is a read only query, is it also expected to be blocked behind the vacuum? Is there a way of getting indexes for a table which won't be blocked behind a vacuum?

It's not the vacuum that's blocking your read-only queries. It's the
ALTER TABLE, which needs an exclusive lock in order to alter the table's
schema. The ALTER is queued waiting for the vacuum to finish, and lesser
lock requests queue up behind it. We could let the non-exclusive lock
requests go ahead of the ALTER, but that would create a severe risk of the
ALTER *never* getting to run.

I'd kill the ALTER and figure on trying again after the vacuum is done.

Also you might want to look into how you got into a situation where
you have an anti-wraparound vacuum that's taking so long to run.
You didn't do something silly like disable autovacuum did you?

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#8)
Re: Autovacuum stuck for hours, blocking queries

Tom Lane wrote:

Also you might want to look into how you got into a situation where
you have an anti-wraparound vacuum that's taking so long to run.

If there are ALTERs running all the time, regular (non-anti-wraparound)
vacuums would be canceled and never get a chance to run. Eventually,
autovacuum decides it's had enough and doesn't cancel anymore, so
everyone else gets stuck behind.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Jeff Janes
jeff.janes@gmail.com
In reply to: Tim Bellis (#1)
Re: Autovacuum stuck for hours, blocking queries

On Wed, Feb 15, 2017 at 9:30 AM, Tim Bellis <Tim.Bellis@metaswitch.com>
wrote:

I have a postgres 9.3.4 database table which (intermittently but reliably)
gets into a state where queries get blocked indefinitely (at least for many
hours) behind an automatic vacuum. I was under the impression that vacuum
should never take any blocking locks for any significant period of time,
and so would like help resolving the issue.

The process blocking the query is:
postgres 21985 11304 98 Feb13 ? 1-14:20:52 postgres: autovacuum
worker process <db_name>
which is running the query
autovacuum: VACUUM public.<table_name>

Are you sure it doesn't really say:

autovacuum: VACUUM public.<table_name> (to prevent wraparound)

If it doesn't include the "to prevent wraparound", then it should sacrifice
itself as soon as it realizes it is blocking something else. If it is not
doing that, something is wrong.

If it does say "(to prevent wraparound)", then see all the other comments
on this thread.

Notes:
- This database table is used for about 6 million row writes per day,
all of which are then deleted at the end of the day.

How many transactions do those 6 million writes comprise?

- Our application kicks off a manual vacuum against this table each
night which doesn't hit this problem, as far as we're aware.

If it were having problems, would you be aware of it? Do you see in the
log files the completion of the vacuum? Or look in pg_stat_user_tables to
see when last_vacuum was. If it runs every night and succeeds, it is hard
to see why wraparound would ever kick in. Unless you are hitting
150,000,000 transactions in a day.

Cheers,

Jeff

#11Hannes Erven
hannes@erven.at
In reply to: Tim Bellis (#1)
Re: Autovacuum stuck for hours, blocking queries

Hi Tim,

Am 2017-02-15 um 18:30 schrieb Tim Bellis:

I have a postgres 9.3.4 database table which (intermittently but reliably)
gets into a state where queries get blocked indefinitely
[..]
Notes:
- This database table is used for about 6 million row writes per day,
all of which are then deleted at the end of the day.

If you are dumping the contents of the table anyways, why not use
TRUNCATE instead of DELETE? It unlinks and recreates the table data
files, requiring nearly zero IO and analyzing.
Or even drop, and recreate the table with the correct new structure so
you do not even need to ALTER TABLE ?

I'm a bit skeptical of these suggestions since very competent people
have already answered your post and did not come up with this... ;-) the
only drawback I'm aware of is that TRUNCATE will immediatly free disk
space on the OS level, so the table's space will not be "blocked". But
probably the VACUUM you are currently performing will also eventually
release the unused disk space, so this may or may not match the current
behaviour.

Best regards,

-hannes

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12Tim Bellis
Tim.Bellis@metaswitch.com
In reply to: Hannes Erven (#11)
Re: Autovacuum stuck for hours, blocking queries

The DELETE operations only deletes rows from the previous day. It's possible that there have been rows added that day which ought not to be deleted, so TRUNCATE wouldn't work.

But that was a helpful suggestion - thanks!

Tim

-----Original Message-----
From: Hannes Erven [mailto:hannes@erven.at]
Sent: 17 February 2017 11:47
To: pgsql-general@postgresql.org
Cc: Tim Bellis <Tim.Bellis@metaswitch.com>
Subject: Re: [GENERAL] Autovacuum stuck for hours, blocking queries

Hi Tim,

Am 2017-02-15 um 18:30 schrieb Tim Bellis:

I have a postgres 9.3.4 database table which (intermittently but
reliably)
gets into a state where queries get blocked indefinitely > [..]
Notes:
- This database table is used for about 6 million row writes per
day,
all of which are then deleted at the end of the day.

If you are dumping the contents of the table anyways, why not use TRUNCATE instead of DELETE? It unlinks and recreates the table data files, requiring nearly zero IO and analyzing.
Or even drop, and recreate the table with the correct new structure so you do not even need to ALTER TABLE ?

I'm a bit skeptical of these suggestions since very competent people have already answered your post and did not come up with this... ;-) the only drawback I'm aware of is that TRUNCATE will immediatly free disk space on the OS level, so the table's space will not be "blocked". But probably the VACUUM you are currently performing will also eventually release the unused disk space, so this may or may not match the current behaviour.

Best regards,

-hannes

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#13Tim Bellis
Tim.Bellis@metaswitch.com
In reply to: Jeff Janes (#10)
Re: Autovacuum stuck for hours, blocking queries

From: Jeff Janes [mailto:jeff.janes@gmail.com]
Sent: 17 February 2017 02:59
To: Tim Bellis <Tim.Bellis@metaswitch.com>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Autovacuum stuck for hours, blocking queries

On Wed, Feb 15, 2017 at 9:30 AM, Tim Bellis <Tim.Bellis@metaswitch.com<mailto:Tim.Bellis@metaswitch.com>> wrote:
I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked indefinitely (at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never take any blocking locks for any significant period of time, and so would like help resolving the issue.

The process blocking the query is:
postgres 21985 11304 98 Feb13 ? 1-14:20:52 postgres: autovacuum worker process <db_name>
which is running the query
autovacuum: VACUUM public.<table_name>

Are you sure it doesn't really say:

autovacuum: VACUUM public.<table_name> (to prevent wraparound)
[Tim Bellis] It doesn’t. I was using the query from https://wiki.postgresql.org/wiki/Lock_Monitoring and looking at the ‘current_statement_in_blocking_process’ column. Is there a different query I should be using?

If it doesn't include the "to prevent wraparound", then it should sacrifice itself as soon as it realizes it is blocking something else. If it is not doing that, something is wrong.

If it does say "(to prevent wraparound)", then see all the other comments on this thread.

Notes:
- This database table is used for about 6 million row writes per day, all of which are then deleted at the end of the day.

How many transactions do those 6 million writes comprise?
[Tim Bellis] I’m not sure - I think it’s 6 million transactions, but there might be some batching going on that I’m unaware of. What would this affect? (I can dig in if necessary)
(I might have been slightly wrong in characterising the exact behaviour; the table might be cleared every hour rather than every day, but there are still 6 million writes per day)

- Our application kicks off a manual vacuum against this table each night which doesn't hit this problem, as far as we're aware.

If it were having problems, would you be aware of it? Do you see in the log files the completion of the vacuum? Or look in pg_stat_user_tables to see when last_vacuum was. If it runs every night and succeeds, it is hard to see why wraparound would ever kick in. Unless you are hitting 150,000,000 transactions in a day.
[Tim Bellis] I shall investigate this.

Cheers,

Jeff

#14Hannes Erven
hannes@erven.at
In reply to: Tim Bellis (#12)
Re: Autovacuum stuck for hours, blocking queries

Hi Tim,

Am 2017-02-17 um 17:02 schrieb Tim Bellis:

The DELETE operations only deletes rows from the
previous day. It's possible that there have been rows
added that day which ought not to be deleted, so
TRUNCATE wouldn't work.

OK, then I'll try two other suggestions:

- use table partitioning (
https://www.postgresql.org/docs/9.3/static/ddl-partitioning.html )

- if the number of rows you need to keep is small, you could try
something like this:
LOCK TABLE yourtable <a suitable LOCKMODE>;
CREATE TEMPORARY TABLE keep AS SELECT * FROM yourtable WHERE <keep>;
TRUNCATE yourtable;
INSERT INTO yourtable SELECT * from keep;
COMMIT;

Best regards,

-hannes

-----Original Message-----
From: Hannes Erven [mailto:hannes@erven.at]
Sent: 17 February 2017 11:47
To: pgsql-general@postgresql.org
Cc: Tim Bellis <Tim.Bellis@metaswitch.com>
Subject: Re: [GENERAL] Autovacuum stuck for hours, blocking queries

Hi Tim,

Am 2017-02-15 um 18:30 schrieb Tim Bellis:

I have a postgres 9.3.4 database table which (intermittently but
reliably)
gets into a state where queries get blocked indefinitely > [..]
Notes:
- This database table is used for about 6 million row writes per
day,
all of which are then deleted at the end of the day.

If you are dumping the contents of the table anyways, why not use TRUNCATE instead of DELETE? It unlinks and recreates the table data files, requiring nearly zero IO and analyzing.
Or even drop, and recreate the table with the correct new structure so you do not even need to ALTER TABLE ?

I'm a bit skeptical of these suggestions since very competent people have already answered your post and did not come up with this... ;-) the only drawback I'm aware of is that TRUNCATE will immediatly free disk space on the OS level, so the table's space will not be "blocked". But probably the VACUUM you are currently performing will also eventually release the unused disk space, so this may or may not match the current behaviour.

Best regards,

-hannes

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#15Rakesh Kumar
rakeshkumar464@outlook.com
In reply to: Hannes Erven (#14)
Re: Autovacuum stuck for hours, blocking queries

LOCK TABLE yourtable <a suitable LOCKMODE>;
CREATE TEMPORARY TABLE keep AS SELECT * FROM yourtable WHERE <keep>;
TRUNCATE yourtable;
INSERT INTO yourtable SELECT * from keep;
COMMIT;
===
the above snippet assumes truncate in PG can be in a transaction. In other words, while truncate by itself
is atomic, it can't be rolled back. So in the above case, if "INSERT INTO yourtable SELECT * from keep;" and
we rollback, will it rollback yourtable.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#16Scott Marlowe
scott.marlowe@gmail.com
In reply to: Rakesh Kumar (#15)
Re: Autovacuum stuck for hours, blocking queries

On Fri, Feb 17, 2017 at 1:38 PM, Rakesh Kumar
<rakeshkumar464@outlook.com> wrote:

LOCK TABLE yourtable <a suitable LOCKMODE>;
CREATE TEMPORARY TABLE keep AS SELECT * FROM yourtable WHERE <keep>;
TRUNCATE yourtable;
INSERT INTO yourtable SELECT * from keep;
COMMIT;
===
the above snippet assumes truncate in PG can be in a transaction. In other words, while truncate by itself
is atomic, it can't be rolled back. So in the above case, if "INSERT INTO yourtable SELECT * from keep;" and
we rollback, will it rollback yourtable.

Yes it can. Truncate has been rollbackable for a while now.

begin;
create table
insert into table
truncate old table
. something goes wrong .
rollback;

Unless I misunderstand your meaning.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#17Michael Paquier
michael@paquier.xyz
In reply to: Scott Marlowe (#16)
Re: Autovacuum stuck for hours, blocking queries

On Sat, Feb 18, 2017 at 1:32 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

Yes it can. Truncate has been rollbackable for a while now.

Per the docs:
https://www.postgresql.org/docs/9.6/static/sql-truncate.html
"TRUNCATE is transaction-safe with respect to the data in the tables:
the truncation will be safely rolled back if the surrounding
transaction does not commit."
In short yes a transaction doing a truncate can be rollbacked.
--
Michael

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#18Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Michael Paquier (#17)
Re: Autovacuum stuck for hours, blocking queries

On 02/17/2017 11:54 PM, Michael Paquier wrote:

On Sat, Feb 18, 2017 at 1:32 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

Yes it can. Truncate has been rollbackable for a while now.

Per the docs:
https://www.postgresql.org/docs/9.6/static/sql-truncate.html
"TRUNCATE is transaction-safe with respect to the data in the tables:
the truncation will be safely rolled back if the surrounding
transaction does not commit."
In short yes a transaction doing a truncate can be rollbacked.

I think the part that confuses people into thinking it can not be
rollbacked is this:

"TRUNCATE is not MVCC-safe. After truncation, the table will appear
empty to concurrent transactions, if they are using a snapshot taken
before the truncation occurred. See Section 13.5 for more details."

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#19Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#8)
Re: Autovacuum stuck for hours, blocking queries

On Thursday, February 16, 2017, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tim Bellis <Tim.Bellis@metaswitch.com <javascript:;>> writes:

Even though this is a read only query, is it also expected to be blocked

behind the vacuum? Is there a way of getting indexes for a table which
won't be blocked behind a vacuum?

It's not the vacuum that's blocking your read-only queries. It's the
ALTER TABLE, which needs an exclusive lock in order to alter the table's
schema. The ALTER is queued waiting for the vacuum to finish, and lesser
lock requests queue up behind it. We could let the non-exclusive lock
requests go ahead of the ALTER, but that would create a severe risk of the
ALTER *never* getting to run.

I'd kill the ALTER and figure on trying again after the vacuum is done.

I've been drilled by this and similar lock stacking issues enough times to
make me near 100% sure deferring the ALTER would be the better choice

merlin

#20Tim Bellis
Tim.Bellis@metaswitch.com
In reply to: Tom Lane (#8)
Re: Autovacuum stuck for hours, blocking queries

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 16 February 2017 22:40
To: Tim Bellis <Tim.Bellis@metaswitch.com>
Cc: Adrian Klaver <adrian.klaver@aklaver.com>; pgsql-general@postgresql.org; Alvaro Herrera <alvherre@2ndquadrant.com>; Scott Marlowe <scott.marlowe@gmail.com>
Subject: Re: [GENERAL] Autovacuum stuck for hours, blocking queries

Tim Bellis <Tim.Bellis@metaswitch.com> writes:

Even though this is a read only query, is it also expected to be blocked behind the vacuum? Is there a way of getting indexes for a table which won't be blocked behind a vacuum?

It's not the vacuum that's blocking your read-only queries. It's the ALTER TABLE, which needs an exclusive lock in order to alter the table's schema. The ALTER is queued waiting for the vacuum to finish, and lesser lock requests queue up behind it. We could let the non-exclusive lock requests go ahead of the ALTER, but that would create a severe risk of the ALTER *never* getting to run.

The lock monitoring query (https://wiki.postgresql.org/wiki/Lock_Monitoring) said that the blocking_pid and the current_statement_in_blocking_process for the queries reading the index data was the autovacuum, not the ALTER. Am I reading the output wrong? Does it not correctly represent the chain of locks?

I'd kill the ALTER and figure on trying again after the vacuum is done.

Also you might want to look into how you got into a situation where you have an anti-wraparound vacuum that's taking so long to run.
You didn't do something silly like disable autovacuum did you?

No, autovacuum is on (and this is an autovacuum which is in progress). But I will look at why I'm getting a blocking autovacuum.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#21Jeff Janes
jeff.janes@gmail.com
In reply to: Merlin Moncure (#19)
#22Jeff Janes
jeff.janes@gmail.com
In reply to: Tim Bellis (#13)
#23Bill Moran
wmoran@potentialtech.com
In reply to: Jeff Janes (#21)
#24Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#21)
#25Jeff Janes
jeff.janes@gmail.com
In reply to: Bill Moran (#23)