autovacuum on primary blocking queries on replica?

Started by Don Seiler · almost 4 years ago · 3 messages · general
#1 Don Seiler
don@seiler.us

- PostgreSQL 12.9 - PGDG Ubuntu 18.04 image
- Streaming physical replication
- hot_standby_feedback = on

We use a read replica to offload a lot of (what should be) quick queries.
This morning we had an incident where these queries were all blocking on
AccessShareLock waits, written to the log as:

2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp LOG: process 8185
still waiting for AccessShareLock on relation 16834 of database 16401 after
1000.228 ms at character 204
2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp DETAIL: Process
holding the lock: 10822. Wait queue: 32373, 8185, 13782, 14290, 13215,
4427, 16056, 13446, 13448, 13445, 9268, 11784, 14469
, 14502, 14340, 6510, 8730, 10876, 13012, 15760, 6645, 14712, 13006, 15923,
14485, 15924, 13447, 9926, 13005, 11803, 13013, 13298, 16059, 9266, 4943,
14470, 13148, 12118, 14491, 9927, 11791, 15097,
11804, 16060, 14711, 5469, 8113, 16061, 14698, 14475, 10620, 13150, 14501,
14465, 14341, 16062, 16065, 14484, 7060, 11785, 16068, 16063, 16064, 16066.

This went on for 30 seconds (the value of max_standby_streaming_delay)
until PG killed the blocking process:

2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp FATAL:
terminating connection due to conflict with recovery
2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp DETAIL: User
was holding a relation lock for too long.
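For incidents like this, the standby itself keeps cumulative per-database counters of queries cancelled by recovery conflicts, broken out by cause; relation-lock conflicts like the one above show up in confl_lock. This is a standard catalog view, to be run on the replica:

```sql
-- Run on the standby: query cancellations due to recovery conflicts, by cause.
SELECT datname, confl_lock, confl_snapshot, confl_bufferpin
FROM pg_stat_database_conflicts
WHERE datname = 'foo_all';
```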

I'm trying to find the root cause of why this started. We did see an UPDATE
that was executed at 2022-05-27 15:23:37.000 UTC on the table in question
(relation 16384) that ran against all rows of the table (only ~8200 rows,
not huge) but the transaction was rolled back (due to a syntax error later
in the transaction).

15 seconds later we then see an aggressive autovacuum on this table:

2022-05-27 15:23:52.507 UTC [30513] LOG: automatic aggressive vacuum of
table "foo_all.public.industry": index scans: 1
pages: 252 removed, 323 remain, 0 skipped due to pins, 0 skipped
frozen
tuples: 8252 removed, 8252 remain, 0 are dead but not yet
removable, oldest xmin: 1670999292
buffer usage: 12219 hits, 137 misses, 54 dirtied
avg read rate: 2.372 MB/s, avg write rate: 0.935 MB/s
system usage: CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.45 s

and less than a second after that is when we see the first AccessShareLock
message on the replica.

I've been reading tales of autovacuum taking an AccessExclusiveLock when
truncating empty pages at the end of a table. I'm imagining that updating
every row of a table and then rolling back would leave all of those rows
empty at the end and qualify for truncation and lead to the scenario I saw
this morning.
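If that theory is right, the sequence sketches out roughly like this (table and column names invented for illustration). The rolled-back UPDATE appends a full set of new row versions at the end of the table; after the rollback those versions are all dead, so the trailing pages end up empty once vacuum cleans them out:

```sql
BEGIN;
UPDATE industry SET name = name;  -- writes ~8200 new row versions at the end of the table
ROLLBACK;                         -- all of those new versions are now dead

-- The next (auto)vacuum removes the dead tuples; the trailing pages are then
-- completely empty, so it takes a brief ACCESS EXCLUSIVE lock and truncates
-- them -- which would match "pages: 252 removed" in the log below.
VACUUM industry;
```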

I'm still not entirely satisfied since that table in question was so small
(only 8252 rows) so I wouldn't imagine it would hold things up as long as
it did. Although the blocking session on the replica was an application
session, not any background/recovery process.

I'm wondering if I'm on the right trail or if there is a much firmer
explanation for what happened.

Thanks,
Don.

--
Don Seiler
www.seiler.us

#2 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Don Seiler (#1)
Re: autovacuum on primary blocking queries on replica?

On Fri, 2022-05-27 at 14:00 -0500, Don Seiler wrote:

> * PostgreSQL 12.9 - PGDG Ubuntu 18.04 image
> * Streaming physical replication
> * hot_standby_feedback = on
>
> We use a read replica to offload a lot of (what should be) quick queries. This morning we had an incident
> where these queries were all blocking on AccessShareLock waits, written to the log as:
>
> 2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp LOG:  process 8185 still waiting for AccessShareLock on relation 16834 of database 16401 after 1000.228 ms at character 204
> 2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp DETAIL:  Process holding the lock: 10822. Wait queue: 32373, 8185, 13782, [...]
>
> This went on for 30 seconds (the value of max_standby_streaming_delay) until PG killed the blocking process:
>
> 2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp FATAL:  terminating connection due to conflict with recovery
> 2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp DETAIL:  User was holding a relation lock for too long.
>
> I'm trying to find the root cause of why this started. We did see an UPDATE [...]

An UPDATE cannot be the problem.

> 15 seconds later we then see an aggressive autovacuum on this table:
>
> 2022-05-27 15:23:52.507 UTC [30513] LOG:  automatic aggressive vacuum of table "foo_all.public.industry": index scans: 1
>         pages: 252 removed, 323 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 8252 removed, 8252 remain, 0 are dead but not yet removable, oldest xmin: 1670999292
>         buffer usage: 12219 hits, 137 misses, 54 dirtied
>         avg read rate: 2.372 MB/s, avg write rate: 0.935 MB/s
>         system usage: CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.45 s
>
> and less than a second after that is when we see the first AccessShareLock message on the replica.

> I've been reading tales of autovacuum taking an AccessExclusiveLock when truncating empty pages at the end of a table.
> I'm imagining that updating every row of a table and then rolling back would leave all of those rows empty at the end
> and qualify for truncation and lead to the scenario I saw this morning.
>
> I'm still not entirely satisfied since that table in question was so small (only 8252 rows) so I wouldn't imagine it would
> hold things up as long as it did. Although the blocking session on the replica was an application session,
> not any background/recovery process.

I think you are on the right trail.

VACUUM will truncate trailing empty pages if it can get a (very short) ACCESS EXCLUSIVE lock on the table.
Now that lock and the truncation are replicated, and they can create a replication conflict just like
the one you describe. Even if the lock is held for a very short time on the primary, replaying it will
conflict with any query on that table on the standby.

You can disable autovacuum truncation on the table with

ALTER TABLE foo_all.public.industry SET (vacuum_truncate = off);

if you know that you can do without autovacuum truncation for that table.
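For what it's worth, the stored setting can be checked afterwards via pg_class, and a manual VACUUM can make the same choice per run (a two-part table name is assumed here):

```sql
-- Check the stored storage parameters for the table:
SELECT reloptions
FROM pg_class
WHERE oid = 'public.industry'::regclass;

-- Skip the truncation step for a single manual run (PostgreSQL 12+):
VACUUM (TRUNCATE FALSE) public.industry;
```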

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#3 Jeff Janes
jeff.janes@gmail.com
In reply to: Don Seiler (#1)
Re: autovacuum on primary blocking queries on replica?

On Fri, May 27, 2022 at 3:01 PM Don Seiler <don@seiler.us> wrote:

> I've been reading tales of autovacuum taking an AccessExclusiveLock when
> truncating empty pages at the end of a table. I'm imagining that updating
> every row of a table and then rolling back would leave all of those rows
> empty at the end and qualify for truncation and lead to the scenario I saw
> this morning.

That is likely.

> I'm still not entirely satisfied since that table in question was so small
> (only 8252 rows) so I wouldn't imagine it would hold things up as long as
> it did. Although the blocking session on the replica was an application
> session, not any background/recovery process.

Yeah, so to me the interesting question is, what was that application
session doing for so long?

Once the recovery process acquires the lock it needs, it would presumably
only hold it for a short amount of time. But it has to get the lock first,
which it can't do due to that other process camping on the access share
lock. And once it places itself in the queue for the lock, any newcomers
can't jump over it. Even if the new entrants only want an access share
lock, they are stuck behind the access exclusive lock request, which is in
turn stuck behind the already granted share lock.
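That queueing behaviour can be sketched with a toy model (plain Python, nothing like PostgreSQL's actual lock manager): requests are granted in FIFO order, so once the exclusive request from recovery is queued, even newcomers asking only for a compatible access share lock must wait behind it.

```python
# Toy model of a FIFO lock wait queue. Illustration only: the real
# PostgreSQL lock manager is far more elaborate, but the fairness rule
# shown here (no jumping over queued waiters) is the relevant piece.

ACCESS_SHARE = "AccessShare"
ACCESS_EXCLUSIVE = "AccessExclusive"

def conflicts(a, b):
    # ACCESS SHARE conflicts only with ACCESS EXCLUSIVE.
    return ACCESS_EXCLUSIVE in (a, b)

class Relation:
    def __init__(self):
        self.holders = []  # granted locks: (pid, mode)
        self.queue = []    # waiting requests, FIFO: (pid, mode)

    def request(self, pid, mode):
        # Grant only if compatible with all current holders AND nobody
        # is already queued ahead of us.
        if not self.queue and all(not conflicts(mode, m) for _, m in self.holders):
            self.holders.append((pid, mode))
            return "granted"
        self.queue.append((pid, mode))
        return "waiting"

rel = Relation()
print(rel.request(10822, ACCESS_SHARE))      # long-running query: granted
print(rel.request(1, ACCESS_EXCLUSIVE))      # recovery's truncation lock: waiting
print(rel.request(8185, ACCESS_SHARE))       # newcomer: waiting, stuck behind it
```

The third request would have been compatible with the granted lock, but it lands behind the queued exclusive request, which matches the long wait queue in the log.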

Cheers,

Jeff
