Prolonged truncation phase during VACUUM on TOAST tables due to repeated interruptions by lock waiters, and a proposed POC patch
Hello hackers,
I'd like to discuss an issue we observed where the truncation phase of
autovacuum can become significantly prolonged on the TOAST tables of busy
systems due to repeated interruptions by lock waiters, leading to extended
`AccessExclusiveLock` contention and impacting overall database and
application availability.
*Recap of the production incident*
We experienced this issue directly: a `VACUUM` on a TOAST table took
approximately 5 hours. The following shows the scale of pages eventually
removed and the duration:
```
automatic vacuum of table "tines.pg_toast.pg_toast_21059": index scans: 1
pages: 26096516 removed, 1421195 remain, 0 skipped due to pins, 26090236
skipped frozen
tuples: 4575083 removed, 26717710 remain, 21498 are dead but not yet
removable, oldest xmin: 173810717
index scan needed: 1151751 pages from table (4.19% of total) had 4629490
dead item identifiers removed
index "pg_toast_21059_index": pages: 303420 in total, 14038 newly deleted,
298668 currently deleted, 298418 reusable
I/O timings: read: 18428758.270 ms, write: 0.000 ms
avg read rate: 10.730 MB/s, avg write rate: 0.000 MB/s
buffer usage: 5132686 hits, 25714763 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 327.79 s, system: 251.11 s, elapsed: 18722.21 s
```
During this 5-hour window, read queries against primary and replica
databases (on AWS Aurora) were frequently failing due to lock timeouts
(lock_timeout was set to 1s), consistent with the `AccessExclusiveLock`
being held and released repeatedly by the truncation part of the autovacuum
process.
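As an aside, a query along the following lines (a rough, illustrative sketch, not something taken from the incident itself) can show whether an (auto)vacuum worker is currently in the truncation phase and which sessions are queued behind it:
```
-- Illustrative only: find (auto)vacuum workers in the heap-truncation phase
-- and any sessions currently blocked behind them.
SELECT v.pid               AS vacuum_pid,
       v.relid::regclass   AS relation,
       v.phase,
       w.pid               AS waiting_pid,
       w.wait_event_type,
       left(w.query, 60)   AS waiting_query
FROM pg_stat_progress_vacuum v
LEFT JOIN pg_stat_activity w
       ON v.pid = ANY (pg_blocking_pids(w.pid))
WHERE v.phase = 'truncating heap';
```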
This led me to build a better understanding of the problem and the order of
events, which brings me to:
*Understanding the problem*
My understanding is that the PostgreSQL autovacuum process, specifically
its lazy variant (`lazy_vacuum_heap`), makes a few passes to reclaim space.
One critical phase, if deemed beneficial, is truncating empty pages from the
end of the relation, handled by the `lazy_truncate_heap` function.
As I traced through the code, the interaction plays out as follows:
- To physically truncate a relation, `VACUUM` must acquire an
`AccessExclusiveLock`.
- After acquiring the `AccessExclusiveLock`, `VACUUM` performs a
verification step by scanning the relation backwards from its apparent end
(`count_nondeletable_pages`). This is necessary because other transactions
might have added data to these trailing pages.
- While holding the `AccessExclusiveLock` and performing this backward
scan, `VACUUM` periodically checks if any other database sessions are
waiting for locks on the same relation using
`LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock)`. This check
occurs at intervals defined by `VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL` (20ms)
and roughly every 32 blocks scanned.
- Interruption and Retry Mechanism:
- If `LockHasWaitersRelation` returns `true` during the backward scan,
`count_nondeletable_pages` sets a flag (`lock_waiter_detected = true`) and
returns, effectively "suspending" its scan to allow waiting sessions to
proceed.
- Back in `lazy_truncate_heap`, if `lock_waiter_detected` is true,
`VACUUM` releases the `AccessExclusiveLock`.
- `lazy_truncate_heap` contains a `do...while` loop. If
`lock_waiter_detected` was true AND there are still potentially truncatable
pages (`new_rel_pages > vacrel->nonempty_pages`), this loop causes `VACUUM`
to retry the entire truncation sequence: attempt to re-acquire the
`AccessExclusiveLock` (with its own timeout of
`VACUUM_TRUNCATE_LOCK_TIMEOUT`, default 5s), restart the
`count_nondeletable_pages` backward scan, and so on.
- Next comes the "stuck in a loop" problem, and this is where things get
interesting:
- On a very busy system with constant concurrent access, `VACUUM` can
repeatedly acquire the `AccessExclusiveLock`, start the backward scan,
detect waiters, release the lock, and retry. Each retry of
`count_nondeletable_pages` involves re-reading potentially many blocks from
disk while holding the exclusive lock, which aggravates the lock contention.
- This cycle leads to prolonged `VACUUM` operations, significant
repeated I/O, and extended lock contention, reducing database availability
and causing query timeouts for other sessions (e.g., `canceling statement
due to lock timeout`, `LockNotAvailable`). This is particularly acute for
the TOAST table in my case, which sadly happens to be very large (for now).
*Testing the theory and patch*
To validate this understanding, I was able to reproduce a similar scenario
by loading a test table with a significant amount of TOAST-able data and
then simulating a busy read workload against it concurrently with a
`VACUUM` operation. I have attached a SQL script to load the data and a
Ruby script to simulate the concurrent workload, if anyone is curious. The
tests demonstrated how the truncation phase could indeed enter extended
periods of repeatedly attempting to acquire the `AccessExclusiveLock` and
performing the backward scan, only to be interrupted by the simulated
concurrent workload.
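The attached load_data.sql is the actual script; purely as an illustrative sketch of the same idea (the table name matches the test output below, and the sizes are made up):
```
-- Rough sketch of the reproduction setup, not the attached load_data.sql.
CREATE TABLE toast_target_table (id bigserial PRIMARY KEY, payload text);

-- Force out-of-line storage without compression so rows really land in the
-- TOAST relation.
ALTER TABLE toast_target_table ALTER COLUMN payload SET STORAGE EXTERNAL;

-- Load enough wide rows to build a large TOAST table (adjust the row count
-- to taste).
INSERT INTO toast_target_table (payload)
SELECT repeat(md5(i::text), 200)
FROM generate_series(1, 1000000) AS i;

-- Delete most of the tail so that a long run of trailing pages becomes empty
-- and the next VACUUM has something to truncate.
DELETE FROM toast_target_table WHERE id > 100000;

-- Then, with the concurrent read workload running, kick off the vacuum:
VACUUM (VERBOSE) toast_target_table;
```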
Considering the potential for the truncation phase to cause prolonged lock
contention under these conditions, I am wondering if it would be beneficial
to explicitly cap the number of retries for the truncation process within
the `do...while` loop in `lazy_truncate_heap` when interruptions by lock
waiters occur?
Since I was already deep in the code, I decided to write a small patch that
introduces a limit on the retries so I could observe the behavior.
Specifically, it adds a counter `truncate_interruption_retry_count` and
checks it against a new constant `VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES`
(which for now I have set to 3). If the backward scan in
`count_nondeletable_pages` is interrupted by lock waiters this many times,
`VACUUM` will cease further truncation attempts for the current `VACUUM`
cycle on that table.
I was able to run the test on a similar setup and the results are as
follows (note the `stopping truncate after 3 retries due to repeated
conflicting lock requests` message coming from the patch):
```
VACUUM (VERBOSE) toast_target_table;
INFO: vacuuming "postgres.public.toast_target_table"
INFO: table "toast_target_table": suspending truncate due to conflicting
lock request
INFO: table "toast_target_table": truncated 144032 to 143744 pages
INFO: table "toast_target_table": suspending truncate due to conflicting
lock request
INFO: table "toast_target_table": truncated 143744 to 143456 pages
INFO: table "toast_target_table": suspending truncate due to conflicting
lock request
INFO: table "toast_target_table": truncated 143456 to 143168 pages
INFO: table "toast_target_table": stopping truncate after 3 retries due to
repeated conflicting lock requests
INFO: finished vacuuming "postgres.public.toast_target_table": index
scans: 0
pages: 864 removed, 143168 remain, 1 scanned (0.00% of total), 0 eagerly
scanned
tuples: 0 removed, 7544 remain, 0 are dead but not yet removable
removable cutoff: 2985896, which was 1 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were
all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
identifiers removed
index "toast_target_table_pkey": pages: 28132 in total, 0 newly deleted,
27646 currently deleted, 27646 reusable
avg read rate: 13.423 MB/s, avg write rate: 0.006 MB/s
buffer usage: 28172 hits, 28571 reads, 12 dirtied
WAL usage: 22 records, 15 full page images, 120289 bytes, 0 buffers full
system usage: CPU: user: 0.13 s, system: 3.55 s, elapsed: 16.62 s
INFO: vacuuming "postgres.pg_toast.pg_toast_649469"
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock
request
INFO: table "pg_toast_649469": truncated 2553888 to 2552416 pages
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock
request
INFO: table "pg_toast_649469": truncated 2552416 to 2550560 pages
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock
request
INFO: table "pg_toast_649469": truncated 2550560 to 2548832 pages
INFO: table "pg_toast_649469": stopping truncate after 3 retries due to
repeated conflicting lock requests
INFO: finished vacuuming "postgres.pg_toast.pg_toast_649469": index scans:
0
pages: 5056 removed, 2548832 remain, 1 scanned (0.00% of total), 0 eagerly
scanned
tuples: 0 removed, 14008 remain, 0 are dead but not yet removable
removable cutoff: 2985897, which was 1 XIDs old when operation ended
new relfrozenxid: 2985897, which is 2 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were
all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
identifiers removed
index "pg_toast_649469_index": pages: 56259 in total, 0 newly deleted,
55293 currently deleted, 55293 reusable
avg read rate: 5.067 MB/s, avg write rate: 0.002 MB/s
buffer usage: 56371 hits, 61075 reads, 19 dirtied
WAL usage: 28 records, 20 full page images, 165414 bytes, 0 buffers full
system usage: CPU: user: 1.67 s, system: 59.10 s, elapsed: 94.16 s
```
All that said, I wanted to bring some visibility into this part of the
problem and also learn from the community about potential ways to solve it.
One way is to cap the number of retries, as in the attached patch. I feel
something like this would be beneficial because it prioritizes overall
system availability by preventing the lazy truncation from repeatedly
acquiring the `AccessExclusiveLock` and holding it for extended periods.
The patch is definitely more of a proof of concept for now (it lacks tests
and perhaps some configurable parameters), and more importantly I am curious
to hear from the community whether this feels like a worthwhile change or
approach (assuming my understanding of the problem makes sense), or whether
there are other ways to solve this problem.
Thank you
Shayon Mukherjee
Attachments:
v18-0001-Limit-VACUUM-truncation-retries-after-scan-inter.patch
From adca67867ff5f37ee8c97879a1ec5dc2bf0150f0 Mon Sep 17 00:00:00 2001
From: Shayon Mukherjee <shayonj@gmail.com>
Date: Mon, 5 May 2025 15:32:41 -0400
Subject: [PATCH v18] Limit VACUUM truncation retries after scan interruption
---
src/backend/access/heap/vacuumlazy.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f28326bad0..a91c1b97cb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -179,6 +179,8 @@
#define VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
#define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
#define VACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
+/* Max retries for the truncation attempt if the backward scan is interrupted */
+#define VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES 3
/*
* Threshold that controls whether we bypass index vacuuming and heap
@@ -3213,6 +3215,7 @@ lazy_truncate_heap(LVRelState *vacrel)
BlockNumber new_rel_pages;
bool lock_waiter_detected;
int lock_retry;
+ int truncate_interruption_retry_count = 0;
/* Report that we are now truncating */
pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
@@ -3227,6 +3230,19 @@ lazy_truncate_heap(LVRelState *vacrel)
*/
do
{
+ /*
+ * Check if we've retried too many times due to interruptions. We
+ * check here to allow at least one full attempt even if
+ * max_truncate_retries is set low.
+ */
+ if (truncate_interruption_retry_count >= VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES)
+ {
+ ereport(vacrel->verbose ? INFO : DEBUG2,
+ (errmsg("table \"%s\": stopping truncate after %d retries due to repeated conflicting lock requests",
+ vacrel->relname, truncate_interruption_retry_count)));
+ break;
+ }
+
/*
* We need full exclusive lock on the relation in order to do
* truncation. If we can't get it, give up rather than waiting --- we
@@ -3329,6 +3345,14 @@ lazy_truncate_heap(LVRelState *vacrel)
vacrel->relname,
orig_rel_pages, new_rel_pages)));
orig_rel_pages = new_rel_pages;
+
+ /*
+ * Increment retry count only if we were interrupted and will loop
+ * again
+ */
+ if (lock_waiter_detected && new_rel_pages > vacrel->nonempty_pages)
+ truncate_interruption_retry_count++;
+
} while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
}
--
2.39.5 (Apple Git-154)
Hello,
Attached are the updated files to help reproduce the state mentioned in the
previous email more easily.
- load_data.sql contains the commands to insert a lot of dummy data that
will overflow to TOAST
- That data is then deleted (via the commands in the SQL file) to create a
situation where a truncate will happen during a future vacuum
- Updated the Ruby script.rb to make simple SELECT calls with high
concurrency
- Renamed the POC patch to `v1`
Thank you
Shayon
Attachments:
v1-0001-Limit-VACUUM-truncation-retries-after-scan-interr.patch
From adca67867ff5f37ee8c97879a1ec5dc2bf0150f0 Mon Sep 17 00:00:00 2001
From: Shayon Mukherjee <shayonj@gmail.com>
Date: Mon, 5 May 2025 15:32:41 -0400
Subject: [PATCH v1] Limit VACUUM truncation retries after scan interruption
---
src/backend/access/heap/vacuumlazy.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f28326bad0..a91c1b97cb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -179,6 +179,8 @@
#define VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
#define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
#define VACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
+/* Max retries for the truncation attempt if the backward scan is interrupted */
+#define VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES 3
/*
* Threshold that controls whether we bypass index vacuuming and heap
@@ -3213,6 +3215,7 @@ lazy_truncate_heap(LVRelState *vacrel)
BlockNumber new_rel_pages;
bool lock_waiter_detected;
int lock_retry;
+ int truncate_interruption_retry_count = 0;
/* Report that we are now truncating */
pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
@@ -3227,6 +3230,19 @@ lazy_truncate_heap(LVRelState *vacrel)
*/
do
{
+ /*
+ * Check if we've retried too many times due to interruptions. We
+ * check here to allow at least one full attempt even if
+ * max_truncate_retries is set low.
+ */
+ if (truncate_interruption_retry_count >= VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES)
+ {
+ ereport(vacrel->verbose ? INFO : DEBUG2,
+ (errmsg("table \"%s\": stopping truncate after %d retries due to repeated conflicting lock requests",
+ vacrel->relname, truncate_interruption_retry_count)));
+ break;
+ }
+
/*
* We need full exclusive lock on the relation in order to do
* truncation. If we can't get it, give up rather than waiting --- we
@@ -3329,6 +3345,14 @@ lazy_truncate_heap(LVRelState *vacrel)
vacrel->relname,
orig_rel_pages, new_rel_pages)));
orig_rel_pages = new_rel_pages;
+
+ /*
+ * Increment retry count only if we were interrupted and will loop
+ * again
+ */
+ if (lock_waiter_detected && new_rel_pages > vacrel->nonempty_pages)
+ truncate_interruption_retry_count++;
+
} while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
}
--
2.39.5 (Apple Git-154)
Hi,
Thanks for the work on this, but I am -1 on this idea. It sounds like the
type of workload that could become problematic here is one in which a
massive delete leaves many truncatable pages at the end of the table, and
vacuum kicks in while there are heavy concurrent reads on the table, so the
AccessExclusiveLock becomes a bottleneck. This is actually much worse on hot
standbys, because there are no provisions there to prioritize normal
backends for obtaining the exclusive lock.
The way I dealt with large deletes such as this in the past is to perform
them in batches, and perhaps run vacuum after every few batches to amortize
the work vacuum has to do. But also, there are two options available that
could help here: "VACUUM (truncate off)" (v12+), which turns off the
truncation, and 0164a0f9ee, which now allows a user to disable the truncate
work for autovacuum [0].
I think it's better to simply disable the truncate work and perform it at a
later time than to introduce some new limit on how many times the truncation
can be suspended. In the type of workload you are referring to, it is likely
that the truncation will end up not completing, so why even try at all?
[0]: /messages/by-id/Z2DE4lDX4tHqNGZt@dev.null
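For reference, a minimal sketch of how those existing knobs are used (the table name is just a placeholder):
```
-- Skip the truncation step for a single manual VACUUM run:
VACUUM (TRUNCATE FALSE, VERBOSE) toast_target_table;

-- Make autovacuum skip truncation for a table and its TOAST table (v12+
-- reloptions; the commit referenced above additionally adds a server-wide
-- setting):
ALTER TABLE toast_target_table SET (vacuum_truncate = false);
ALTER TABLE toast_target_table SET (toast.vacuum_truncate = false);
```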
--
Sami Imseih
Amazon Web Services (AWS)
On Thu, May 8, 2025 at 12:10 PM Sami Imseih <samimseih@gmail.com> wrote:
I think it's better to simply disable the truncate work and perform it at a
later time than to introduce some new limit on how many times the truncation
can be suspended. In the type of workload you are referring to, it is likely
that the truncation will end up not completing, so why even try at all?
Yeah, I pretty much had the same thought; if you bail out after some
kind of cap, there will be workloads that will never complete the
truncate heap phase, in which case you should probably be using
truncate off. But the one thing I was a bit sympathetic on was how to
actually determine you are in this situation, or how bad the situation
was, in order to know that setting truncate off would help? To that
end, I wonder if it'd be worth adding the counter and printing that
information (and maybe elapsed time in this phase?) in vacuum verbose
output?
Robert Treat
https://xzilla.net
actually determine you are in this situation, or how bad the situation
was, in order to know that setting truncate off would help? To that
Adding counters for this area is not a bad idea in general, as this hits
customers particularly hard on hot standbys when the truncate does actually
occur on the primary.
What about adding cumulative counters (per table and in pg_stat_database)
such as:
pages_vac_truncated - # of pages truncated by vacuum
vac_truncate_conflicts - # of times truncate was skipped due to a conflict
vac_truncate_suspended - # of times the truncate was suspended
The difference between a conflict and a suspension is that a conflict causes
vacuum to skip the truncate phase, while a suspension causes vacuum to retry
the phase continually.
The argument against adding these counters is that the views
pg_stat_all_tables / pg_stat_database are becoming super-wide, so maybe we
need to think about inventing a new view for vacuum-related counter metrics.
This seems like a good discussion for v19.
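Purely to illustrate how such counters might be consumed if they ended up in pg_stat_all_tables (none of these columns exist today; the names are just the ones proposed above):
```
-- Hypothetical query: these columns do not exist yet, the names are taken
-- from the counters proposed above.
SELECT relname,
       pages_vac_truncated,
       vac_truncate_conflicts,
       vac_truncate_suspended
FROM pg_stat_all_tables
WHERE vac_truncate_suspended > 0
ORDER BY vac_truncate_suspended DESC;
```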
--
Sami
Thank you for engaging in the conversation and sharing your thoughts, Sami
and Robert.
On Thu, May 8, 2025 at 2:57 PM Sami Imseih <samimseih@gmail.com> wrote:
actually determine you are in this situation, or how bad the situation
was, in order to know that setting truncate off would help? To that
That's definitely a sound idea; the way I was able to replicate and learn
about this myself was with a patch that simply printed the number of
attempts it was making. That said, just to add more color: I was coming at
this from the point of view that truncation would be on a "best effort"
basis, with the trade-off being less availability risk at the cost of missed
truncations. My first instinct was that vacuum_truncate being on by default
on busy systems with a lot of churn can catch a lot of developers by
surprise (it sure did me :D), so a best-effort, "fail fast" behavior could
help. At the same time, I agree that it doesn't solve either of the problems
really well. Another idea I had was to expose an attribute/GUC/setting like
VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES with a default of `1` or so, which
users could optionally configure; perhaps that could be a middle ground, but
I am not fully sold on that idea either.
Adding counters for this area is not a bad idea in general, as this hits
customers particularly hard on hot standbys when the truncate does actually
occur on the primary.
What about adding cumulative counters (per table and in pg_stat_database)
such as:
pages_vac_truncated - # of pages truncated by vacuum
vac_truncate_conflicts - # of times truncate was skipped due to a conflict
vac_truncate_suspended - # of times the truncate was suspended
The difference between a conflict and a suspension is that a conflict causes
vacuum to skip the truncate phase, while a suspension causes vacuum to retry
the phase continually.
Would vac_truncate_retries or something similar be more immediately clear?
If so, maybe something like:
- vac_truncate_interruptions - # of times truncate was skipped due to a
conflict and not retried
- vac_truncate_retries - # of times the truncate was suspended and retried
The argument against adding these counters is that the views
pg_stat_all_tables / pg_stat_database are becoming super-wide, so maybe we
need to think about inventing a new view for vacuum-related counter metrics.
This seems like a good discussion for v19.
+1 for this. I'd even be happy to just start showing the attempts as part
of INFO or VERBOSE output on conflicts & suspensions, without keeping track
of them in stats, though it is useful statistical data.
Thanks
Shayon
On Thu, May 8, 2025 at 4:58 PM Shayon Mukherjee <shayonj@gmail.com> wrote:
+1 for this. I'd even be happy to just start showing the attempts as part
of INFO or VERBOSE output on conflicts & suspensions, without keeping track
of them in stats, though it is useful statistical data.
Hi hackers,
Following up on the recent discussion about prolonged VACUUM truncation
phases on busy systems, I took a stab at extending the logging to add a
little bit more visibility into the truncation process.
With this patch, when VACUUM is run with VERBOSE, it will now log the
following:
1. Individual suspension messages with attempt count and cumulative elapsed
time.
2. A summary at the end of the truncation process showing the total number
of suspensions and the total elapsed time.
Example output:
```
INFO: table "toast_target_table": suspending truncate due to conflicting lock request (attempt 1, elapsed: 0.121 s)
INFO: table "toast_target_table": truncated 372096 to 370624 pages
INFO: table "toast_target_table": suspending truncate due to conflicting lock request (attempt 2, elapsed: 6.108 s)
...
INFO: table "toast_target_table": truncation completed after 8 suspension(s), total elapsed: 58.090 s
```
Perhaps this can give developers the information needed to identify
problematic truncation behavior and decide whether to use VACUUM (TRUNCATE
OFF) for specific tables.
Additionally, tracking these stats in a table would be useful as well.
Curious whether something like this is desirable to the community?
The proposed patch is attached. Some notes on it:
- I am unsure of the best way to write regression tests for this, since they
could be flaky.
- There seems to be translation support for the message strings. At least
for this initial patch I have left that aside, and I am happy to discuss it
further if the patch feels viable.
Thank you
Shayon
Attachments:
v1-0001-Extend-logging-for-VACUUM-truncation-suspensions-.patch
From 07b468eb32e4980eb006e5da07c9a610c28fc47f Mon Sep 17 00:00:00 2001
From: Shayon Mukherjee <shayonj@gmail.com>
Date: Sat, 9 Aug 2025 10:43:47 -0400
Subject: [PATCH v1] Extend logging for VACUUM truncation suspensions with
attempts and elapsed time
This patch adds visibility into VACUUM truncation suspension behavior
by logging attempt counts and elapsed time during verbose output.
When VACUUM truncation is suspended due to lock conflicts, users now see
- Individual suspension messages with attempt number and cumulative time
- Summary statistics showing total suspensions and elapsed time
Thus, giving users more visibility into the process
---
src/backend/access/heap/vacuumlazy.c | 55 +++++++++++++++++++++++++---
1 file changed, 49 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 14036c27e8..4cfb31480e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -458,7 +458,9 @@ static IndexBulkDeleteResult *lazy_cleanup_one_index(Relation indrel,
static bool should_attempt_truncation(LVRelState *vacrel);
static void lazy_truncate_heap(LVRelState *vacrel);
static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
- bool *lock_waiter_detected);
+ bool *lock_waiter_detected,
+ int suspension_count,
+ instr_time *truncate_start_time);
static void dead_items_alloc(LVRelState *vacrel, int nworkers);
static void dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets);
@@ -3203,6 +3205,8 @@ lazy_truncate_heap(LVRelState *vacrel)
BlockNumber new_rel_pages;
bool lock_waiter_detected;
int lock_retry;
+ int truncate_suspension_count = 0;
+ instr_time truncate_start_time;
/* Report that we are now truncating */
pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
@@ -3212,6 +3216,9 @@ lazy_truncate_heap(LVRelState *vacrel)
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
vacrel->nonempty_pages, InvalidOffsetNumber);
+ /* Record start time for truncation phase */
+ INSTR_TIME_SET_CURRENT(truncate_start_time);
+
/*
* Loop until no more truncating can be done.
*/
@@ -3282,7 +3289,7 @@ lazy_truncate_heap(LVRelState *vacrel)
* other backends could have added tuples to these pages whilst we
* were vacuuming.
*/
- new_rel_pages = count_nondeletable_pages(vacrel, &lock_waiter_detected);
+ new_rel_pages = count_nondeletable_pages(vacrel, &lock_waiter_detected, truncate_suspension_count, &truncate_start_time);
vacrel->blkno = new_rel_pages;
if (new_rel_pages >= orig_rel_pages)
@@ -3292,6 +3299,10 @@ lazy_truncate_heap(LVRelState *vacrel)
return;
}
+ /* Count this suspension if lock waiters were detected */
+ if (lock_waiter_detected)
+ truncate_suspension_count++;
+
/*
* Okay to truncate.
*/
@@ -3320,6 +3331,23 @@ lazy_truncate_heap(LVRelState *vacrel)
orig_rel_pages, new_rel_pages)));
orig_rel_pages = new_rel_pages;
} while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
+
+ /* Report truncation suspension statistics in verbose mode */
+ if (vacrel->verbose && truncate_suspension_count > 0)
+ {
+ instr_time truncate_end_time;
+ instr_time truncate_elapsed;
+
+ INSTR_TIME_SET_CURRENT(truncate_end_time);
+ truncate_elapsed = truncate_end_time;
+ INSTR_TIME_SUBTRACT(truncate_elapsed, truncate_start_time);
+
+ ereport(INFO,
+ (errmsg("table \"%s\": truncation completed after %d suspension(s), total elapsed: %.3f s",
+ vacrel->relname,
+ truncate_suspension_count,
+ INSTR_TIME_GET_DOUBLE(truncate_elapsed))));
+ }
}
/*
@@ -3328,7 +3356,7 @@ lazy_truncate_heap(LVRelState *vacrel)
* Returns number of nondeletable pages (last nonempty page + 1).
*/
static BlockNumber
-count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
+count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected, int suspension_count, instr_time *truncate_start_time)
{
BlockNumber blkno;
BlockNumber prefetchedUntil;
@@ -3376,9 +3404,24 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
- ereport(vacrel->verbose ? INFO : DEBUG2,
- (errmsg("table \"%s\": suspending truncate due to conflicting lock request",
- vacrel->relname)));
+ if (vacrel->verbose)
+ {
+ instr_time elapsed_so_far;
+
+ elapsed_so_far = currenttime;
+ INSTR_TIME_SUBTRACT(elapsed_so_far, *truncate_start_time);
+
+ ereport(INFO,
+ (errmsg("table \"%s\": suspending truncate due to conflicting lock request (attempt %d, elapsed: %.3f s)",
+ vacrel->relname, suspension_count + 1,
+ INSTR_TIME_GET_DOUBLE(elapsed_so_far))));
+ }
+ else
+ {
+ ereport(DEBUG2,
+ (errmsg("table \"%s\": suspending truncate due to conflicting lock request",
+ vacrel->relname)));
+ }
*lock_waiter_detected = true;
return blkno;
--
2.39.5 (Apple Git-154)