048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

Started by Alexander Lakhin6 days ago9 messageshackers
Jump to latest
#1Alexander Lakhin
exclusion@gmail.com

Hello hackers,

A recent failure produced by skink [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2026-06-14%2001%3A21%3A38 apparently materializes an
interesting race condition:
398/398 recovery - postgresql:recovery/048_vacuum_horizon_floor TIMEOUT        15000.51s   killed by signal 15 SIGTERM

regress_log_048_vacuum_horizon_floor contains:
[05:01:16.990](0.000s) ok 3 - Cursor query returned 7 from second fetch. Expected value 7.
[09:08:43.280](14846.290s) # die: death by signal at
/home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
[09:08:43.281](0.001s) 1..3
death by signal at /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.

048_vacuum_horizon_floor_primary.log contains:
2026-06-14 05:00:51.700 CEST [655012][client backend][1/9:0] LOG: statement: VACUUM (VERBOSE, FREEZE, PARALLEL 0)
vac_horizon_floor_table;
2026-06-14 05:00:51.887 CEST [655012][client backend][1/10:0] INFO: aggressively vacuuming
"test_db.public.vac_horizon_floor_table"
2026-06-14 05:00:53.494 CEST [669376][client backend][5/2:0] LOG: statement: SELECT count(*) >= 1 FROM pg_stat_activity
            WHERE pid = 655012
            AND wait_event = 'BufferCleanup';
2026-06-14 05:01:07.789 CEST [676703][walsender][26/0:0] LOG: received replication command: IDENTIFY_SYSTEM
2026-06-14 05:01:07.789 CEST [676703][walsender][26/0:0] STATEMENT: IDENTIFY_SYSTEM
2026-06-14 05:01:07.992 CEST [676703][walsender][26/0:0] LOG: received replication command: START_REPLICATION 0/03000000
TIMELINE 1
2026-06-14 05:01:07.992 CEST [676703][walsender][26/0:0] STATEMENT: START_REPLICATION 0/03000000 TIMELINE 1
2026-06-14 05:01:11.486 CEST [678834][client backend][6/2:0] LOG: statement: SELECT EXISTS (SELECT * FROM
pg_stat_replication);
2026-06-14 05:01:15.472 CEST [656147][client backend][2/2:0] LOG: statement: FETCH vac_horizon_floor_cursor1
    ;
2026-06-14 05:01:19.594 CEST [681642][client backend][7/2:0] LOG: statement: SELECT index_vacuum_count > 0
        FROM pg_stat_progress_vacuum
        WHERE datname='test_db' AND relid::regclass = 'vac_horizon_floor_table'::regclass;
2026-06-14 05:01:24.183 CEST [656147][client backend][2/2:0] LOG: statement: COMMIT;
2026-06-14 05:01:26.473 CEST [684424][client backend][8/2:0] LOG: statement: SELECT vacuum_count > 0
        FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
...
2026-06-14 09:08:40.354 CEST [3155195][client backend][7/2090:0] LOG:  statement: SELECT vacuum_count > 0
        FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
2026-06-14 09:08:42.036 CEST [3156594][client backend][8/2090:0] LOG:  statement: SELECT vacuum_count > 0
        FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
2026-06-14 09:08:43.285 CEST [636342][postmaster][:0] LOG:  received immediate shutdown request

The next run of the test [2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2026-06-14%2007%3A22%3A09 took about 2 minutes:
 53/398 recovery - postgresql:recovery/048_vacuum_horizon_floor OK               120.90s   3 subtests passed

There was a similar failure in the past year [3]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-05%2015%3A43%3A53:
 38/267 postgresql:recovery / recovery/043_vacuum_horizon_floor                   ERROR 335.83s   exit status 29

regress_log_043_vacuum_horizon_floor
[16:59:38.183](0.003s) ok 3 - Cursor query returned 1 from second fetch. Expected value 1.
IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3025.
# Postmaster PID for node "primary" is 2872485

But that time it looked more like an ordinary timeout.

I've managed to reproduce such a failure locally, on a slowed down VM,
under Valgrind:
ITERATION 23
Sun Jun 14 01:02:30 PM UTC 2026
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. ok
All tests successful.
Files=1, Tests=3, 131 wallclock secs ( 0.01 usr  0.00 sys + 46.21 cusr  3.12 csys = 49.34 CPU)
Result: PASS
ITERATION 24
Sun Jun 14 01:04:41 PM UTC 2026
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. 3/? # die: IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3361.
# Looks like your test exited with 29 just after 3.
t/048_vacuum_horizon_floor.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)

With this diagnostic addition:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6743,9 +6743,12 @@ LockBufferForCleanup(Buffer buffer)
                 }
                 bufHdr->wait_backend_pgprocno = MyProcNumber;
                 PinCountWaitBuf = bufHdr;
-               UnlockBufHdrExt(bufHdr, buf_state,
-                                               BM_PIN_COUNT_WAITER, 0,
-                                               0);
+for (volatile int i = 0; i < 10000000; i++);
+               buf_state = UnlockBufHdrExt(bufHdr, buf_state,
+ BM_PIN_COUNT_WAITER, 0,
+                                                               0);
+               if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+                       elog(LOG, "!!!LockBufferForCleanup| wakeup lost");
                 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                 /* Wait to be signaled by UnpinBuffer() */
@@ -6794,7 +6797,11 @@ LockBufferForCleanup(Buffer buffer)
                         SetStartupBufferPinWaitBufId(-1);
                 }
                 else
+{
+elog(LOG, "!!!LockBufferForCleanup]| before ProcWaitForSignal");
ProcWaitForSignal(WAIT_EVENT_BUFFER_CLEANUP);
+elog(LOG, "!!!LockBufferForCleanup]| after ProcWaitForSignal");
+}

                /*
                 * Remove flag marking us as waiter. Normally this will not be set

(The delay before UnlockBufHdrExt() helps reproducing for me --- the test
fails within 10 iterations.)

I could see:
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. 3/? # die: IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3156.
# Looks like your test exited with 29 just after 3.
t/048_vacuum_horizon_floor.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
All 3 subtests passed

Test Summary Report
-------------------
t/048_vacuum_horizon_floor.pl (Wstat: 7424 (exited 29) Tests: 3 Failed: 0)
  Non-zero exit status: 29
Files=1, Tests=3, 318 wallclock secs ( 0.01 usr  0.00 sys + 93.12 cusr  4.73 csys = 97.86 CPU)

048_vacuum_horizon_floor_primary.log contains:
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG:  !!!LockBufferForCleanup| wakeup lost
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT:  while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT:  VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG:  !!!LockBufferForCleanup]| before
ProcWaitForSignal
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT:  while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT:  VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;
...
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG:  !!!LockBufferForCleanup]| after
ProcWaitForSignal
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT:  while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT:  VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2026-06-14%2001%3A21%3A38
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2026-06-14%2007%3A22%3A09
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2025-03-05%2015%3A43%3A53

Best regards,
Alexander

#2Xuneng Zhou
xunengzhou@gmail.com
In reply to: Alexander Lakhin (#1)
Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

Hi Alexander,

On Thu, Jun 18, 2026 at 4:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

Hello hackers,

A recent failure produced by skink [1] apparently materializes an
interesting race condition:
398/398 recovery - postgresql:recovery/048_vacuum_horizon_floor TIMEOUT 15000.51s killed by signal 15 SIGTERM

regress_log_048_vacuum_horizon_floor contains:
[05:01:16.990](0.000s) ok 3 - Cursor query returned 7 from second fetch. Expected value 7.
[09:08:43.280](14846.290s) # die: death by signal at
/home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
[09:08:43.281](0.001s) 1..3
death by signal at /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.

048_vacuum_horizon_floor_primary.log contains:
2026-06-14 05:00:51.700 CEST [655012][client backend][1/9:0] LOG: statement: VACUUM (VERBOSE, FREEZE, PARALLEL 0)
vac_horizon_floor_table;
2026-06-14 05:00:51.887 CEST [655012][client backend][1/10:0] INFO: aggressively vacuuming
"test_db.public.vac_horizon_floor_table"
2026-06-14 05:00:53.494 CEST [669376][client backend][5/2:0] LOG: statement: SELECT count(*) >= 1 FROM pg_stat_activity
WHERE pid = 655012
AND wait_event = 'BufferCleanup';
2026-06-14 05:01:07.789 CEST [676703][walsender][26/0:0] LOG: received replication command: IDENTIFY_SYSTEM
2026-06-14 05:01:07.789 CEST [676703][walsender][26/0:0] STATEMENT: IDENTIFY_SYSTEM
2026-06-14 05:01:07.992 CEST [676703][walsender][26/0:0] LOG: received replication command: START_REPLICATION 0/03000000
TIMELINE 1
2026-06-14 05:01:07.992 CEST [676703][walsender][26/0:0] STATEMENT: START_REPLICATION 0/03000000 TIMELINE 1
2026-06-14 05:01:11.486 CEST [678834][client backend][6/2:0] LOG: statement: SELECT EXISTS (SELECT * FROM
pg_stat_replication);
2026-06-14 05:01:15.472 CEST [656147][client backend][2/2:0] LOG: statement: FETCH vac_horizon_floor_cursor1
;
2026-06-14 05:01:19.594 CEST [681642][client backend][7/2:0] LOG: statement: SELECT index_vacuum_count > 0
FROM pg_stat_progress_vacuum
WHERE datname='test_db' AND relid::regclass = 'vac_horizon_floor_table'::regclass;
2026-06-14 05:01:24.183 CEST [656147][client backend][2/2:0] LOG: statement: COMMIT;
2026-06-14 05:01:26.473 CEST [684424][client backend][8/2:0] LOG: statement: SELECT vacuum_count > 0
FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
...
2026-06-14 09:08:40.354 CEST [3155195][client backend][7/2090:0] LOG: statement: SELECT vacuum_count > 0
FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
2026-06-14 09:08:42.036 CEST [3156594][client backend][8/2090:0] LOG: statement: SELECT vacuum_count > 0
FROM pg_stat_all_tables WHERE relname = 'vac_horizon_floor_table';
2026-06-14 09:08:43.285 CEST [636342][postmaster][:0] LOG: received immediate shutdown request

The next run of the test [2] took about 2 minutes:
53/398 recovery - postgresql:recovery/048_vacuum_horizon_floor OK 120.90s 3 subtests passed

There was a similar failure in the past year [3]:
38/267 postgresql:recovery / recovery/043_vacuum_horizon_floor ERROR 335.83s exit status 29

regress_log_043_vacuum_horizon_floor
[16:59:38.183](0.003s) ok 3 - Cursor query returned 1 from second fetch. Expected value 1.
IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3025.
# Postmaster PID for node "primary" is 2872485

But that time it looked more like an ordinary timeout.

I've managed to reproduce such a failure locally, on a slowed down VM,
under Valgrind:
ITERATION 23
Sun Jun 14 01:02:30 PM UTC 2026
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. ok
All tests successful.
Files=1, Tests=3, 131 wallclock secs ( 0.01 usr 0.00 sys + 46.21 cusr 3.12 csys = 49.34 CPU)
Result: PASS
ITERATION 24
Sun Jun 14 01:04:41 PM UTC 2026
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. 3/? # die: IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3361.
# Looks like your test exited with 29 just after 3.
t/048_vacuum_horizon_floor.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)

With this diagnostic addition:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6743,9 +6743,12 @@ LockBufferForCleanup(Buffer buffer)
}
bufHdr->wait_backend_pgprocno = MyProcNumber;
PinCountWaitBuf = bufHdr;
-               UnlockBufHdrExt(bufHdr, buf_state,
-                                               BM_PIN_COUNT_WAITER, 0,
-                                               0);
+for (volatile int i = 0; i < 10000000; i++);
+               buf_state = UnlockBufHdrExt(bufHdr, buf_state,
+ BM_PIN_COUNT_WAITER, 0,
+                                                               0);
+               if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+                       elog(LOG, "!!!LockBufferForCleanup| wakeup lost");
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/* Wait to be signaled by UnpinBuffer() */
@@ -6794,7 +6797,11 @@ LockBufferForCleanup(Buffer buffer)
SetStartupBufferPinWaitBufId(-1);
}
else
+{
+elog(LOG, "!!!LockBufferForCleanup]| before ProcWaitForSignal");
ProcWaitForSignal(WAIT_EVENT_BUFFER_CLEANUP);
+elog(LOG, "!!!LockBufferForCleanup]| after ProcWaitForSignal");
+}

/*
* Remove flag marking us as waiter. Normally this will not be set

(The delay before UnlockBufHdrExt() helps reproducing for me --- the test
fails within 10 iterations.)

I could see:
# +++ tap check in src/test/recovery +++
t/048_vacuum_horizon_floor.pl .. 3/? # die: IPC::Run: timeout on timer #2 at /usr/share/perl5/IPC/Run.pm line 3156.
# Looks like your test exited with 29 just after 3.
t/048_vacuum_horizon_floor.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
All 3 subtests passed

Test Summary Report
-------------------
t/048_vacuum_horizon_floor.pl (Wstat: 7424 (exited 29) Tests: 3 Failed: 0)
Non-zero exit status: 29
Files=1, Tests=3, 318 wallclock secs ( 0.01 usr 0.00 sys + 93.12 cusr 4.73 csys = 97.86 CPU)

048_vacuum_horizon_floor_primary.log contains:
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG: !!!LockBufferForCleanup| wakeup lost
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT: while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT: VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG: !!!LockBufferForCleanup]| before
ProcWaitForSignal
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT: while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:00:31.646 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT: VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;
...
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] LOG: !!!LockBufferForCleanup]| after
ProcWaitForSignal
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] CONTEXT: while scanning block 26 of relation
"public.vac_horizon_floor_table"
2026-06-17 19:04:36.835 UTC [178421] [048_vacuum_horizon_floor.pl:2/10/0] STATEMENT: VACUUM (VERBOSE, FREEZE, PARALLEL
0) vac_horizon_floor_table;

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2026-06-14%2001%3A21%3A38
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2026-06-14%2007%3A22%3A09
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2025-03-05%2015%3A43%3A53

Thanks for reporting this issue. I can reproduce it with the delay.
This looks like a real lost-wakeup race in LockBufferForCleanup().

The relevant sequence in failed test 048 is:

- Session B opens a cursor and fetches one heap tuple, leaving a heap
buffer pinned.
- Session A starts VACUUM (FREEZE).
- VACUUM reaches that page and waits in LockBufferForCleanup().
- Session B later advances/closes the cursor, releasing the pin.
- VACUUM is expected to wake up and finish.

In the failure, the tap test had already passed its three sql
assertions, but then it timed out waiting for VACUUM completion:

ok 3 - Cursor query returned 7 from second fetch
poll_query_until timed out:
SELECT vacuum_count > 0 ...
last actual query output: f
Looks like your test exited with 29 just after 3.

The diagnostic log shows the actual race:

LOG: !!!LockBufferForCleanup| wakeup lost
CONTEXT: while scanning block 90 of relation "public.vac_horizon_floor_table"

LOG: !!!LockBufferForCleanup]| before ProcWaitForSignal
CONTEXT: while scanning block 90 of relation "public.vac_horizon_floor_table"

There was no later "after ProcWaitForSignal" before the shutdown,
which implies that VACUUM published itself as a waiter, entered
ProcWaitForSignal(), and not been woken up later.

The direct regression appears to be 5310fac6e0f. It allows this interleaving:

W: LockBufferForCleanup() holds buffer header lock
W: observes refcount > 1
P: releases the last competing pin with atomic fetch_sub
P: old state does not contain BM_PIN_COUNT_WAITER, so no wakeup
W: publishes BM_PIN_COUNT_WAITER
W: sleeps in ProcWaitForSignal()

At this point the condition W wanted is already true: refcount is 1,
meaning only W's own pin remains. So W could sleep indefinitely as no
future unpin to wake it.

We can fix this with the state returned by UnlockBufHdrExt() when
publishing BM_PIN_COUNT_WAITER. If the wait refcount is 1, do not
enter the wait path. Instead, fall through to the existing waiter-bit
cleanup and retry the loop to acquire the cleanup lock normally. The
reproducer test passed after applying the patch.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

Attachments:

v1-0001-Fix-lost-wakeup-in-LockBufferForCleanup.patchapplication/octet-stream; name=v1-0001-Fix-lost-wakeup-in-LockBufferForCleanup.patchDownload+12-4
#3Melanie Plageman
melanieplageman@gmail.com
In reply to: Xuneng Zhou (#2)
Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

On Mon, Jun 22, 2026 at 4:14 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

The direct regression appears to be 5310fac6e0f. It allows this interleaving:

W: LockBufferForCleanup() holds buffer header lock
W: observes refcount > 1
P: releases the last competing pin with atomic fetch_sub
P: old state does not contain BM_PIN_COUNT_WAITER, so no wakeup
W: publishes BM_PIN_COUNT_WAITER
W: sleeps in ProcWaitForSignal()

At this point the condition W wanted is already true: refcount is 1,
meaning only W's own pin remains. So W could sleep indefinitely as no
future unpin to wake it.

We can fix this with the state returned by UnlockBufHdrExt() when
publishing BM_PIN_COUNT_WAITER. If the wait refcount is 1, do not
enter the wait path. Instead, fall through to the existing waiter-bit
cleanup and retry the loop to acquire the cleanup lock normally. The
reproducer test passed after applying the patch.

Thanks for investigating!
Does the reproducer pass prior to 5310fac6e0f?

- Melanie

#4Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#1)
Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

Hi,

On 2026-06-17 23:00:00 +0300, Alexander Lakhin wrote:

Hello hackers,
With this diagnostic addition:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6743,9 +6743,12 @@ LockBufferForCleanup(Buffer buffer)
                }
                bufHdr->wait_backend_pgprocno = MyProcNumber;
                PinCountWaitBuf = bufHdr;
-               UnlockBufHdrExt(bufHdr, buf_state,
-                                               BM_PIN_COUNT_WAITER, 0,
-                                               0);
+for (volatile int i = 0; i < 10000000; i++);
+               buf_state = UnlockBufHdrExt(bufHdr, buf_state,
+ BM_PIN_COUNT_WAITER, 0,
+                                                               0);
+               if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+                       elog(LOG, "!!!LockBufferForCleanup| wakeup lost");
                LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

I don't think seeing BUF_STATE_GET_REFCOUNT(buf_state) == 1 guarantees that a
wakeup was lost in any sort of way. Nothing prevents another backend from
acquiring a pin on the buffer while we are waiting for a pincount, so
occasionally seeing BUF_STATE_GET_REFCOUNT() == 1 here is to be expected.

You also can't expect repeatedly doing
UnlockBufHdrExt(set_bits=BM_PIN_COUNT_WAITER) to really make sense, because
BM_PIN_COUNT_WAITER only works if it was set *before* another backend releases
its pin, because otherwise the other backend obviously won't notice that the
flag was set if it only was set after the pg_atomic_fetch_sub_u64() in
UnpinBufferNoOwner().

I don't think the problem is that we are loosing a signal, the problem is that
we need to recheck if BUF_STATE_GET_REFCOUNT(buf_state) == 1 after setting
BM_PIN_COUNT_WAITER. I think this is a bug in (depending on your perspective)
either

commit 5310fac6e0f
Author: Andres Freund <andres@anarazel.de>
Date: 2025-11-06 16:43:16 -0500

bufmgr: Use atomic sub for unpinning buffers

or

commit c75ebc657ffce8dab76471da31aafb79fbe3fda2
Author: Andres Freund <andres@anarazel.de>
Date: 2025-11-06 16:42:10 -0500

bufmgr: Allow some buffer state modifications while holding header lock

Because before those commits, the buffer could not be unpinned while we held
the buffer header lock. With them, this race opens up.

Greetings,

Andres Freund

#5Andres Freund
andres@anarazel.de
In reply to: Xuneng Zhou (#2)
Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

Hi,

On 2026-06-22 16:14:08 +0800, Xuneng Zhou wrote:

This looks like a real lost-wakeup race in LockBufferForCleanup().

I don't think any wakeup is lost. The problem is that we're waiting for a
wakeup when it is not guaranteed that one will arrive.

The relevant sequence in failed test 048 is:

- Session B opens a cursor and fetches one heap tuple, leaving a heap
buffer pinned.
- Session A starts VACUUM (FREEZE).
- VACUUM reaches that page and waits in LockBufferForCleanup().
- Session B later advances/closes the cursor, releasing the pin.
- VACUUM is expected to wake up and finish.

In the failure, the tap test had already passed its three sql
assertions, but then it timed out waiting for VACUUM completion:

ok 3 - Cursor query returned 7 from second fetch
poll_query_until timed out:
SELECT vacuum_count > 0 ...
last actual query output: f
Looks like your test exited with 29 just after 3.

The diagnostic log shows the actual race:

LOG: !!!LockBufferForCleanup| wakeup lost
CONTEXT: while scanning block 90 of relation "public.vac_horizon_floor_table"

LOG: !!!LockBufferForCleanup]| before ProcWaitForSignal
CONTEXT: while scanning block 90 of relation "public.vac_horizon_floor_table"

I don't think this is meaningful output, as written upthread.

There was no later "after ProcWaitForSignal" before the shutdown,
which implies that VACUUM published itself as a waiter, entered
ProcWaitForSignal(), and not been woken up later.

The direct regression appears to be 5310fac6e0f. It allows this interleaving:

W: LockBufferForCleanup() holds buffer header lock
W: observes refcount > 1
P: releases the last competing pin with atomic fetch_sub
P: old state does not contain BM_PIN_COUNT_WAITER, so no wakeup
W: publishes BM_PIN_COUNT_WAITER
W: sleeps in ProcWaitForSignal()

At this point the condition W wanted is already true: refcount is 1,
meaning only W's own pin remains. So W could sleep indefinitely as no
future unpin to wake it.

However, this indeed seems to be the problem.

We can fix this with the state returned by UnlockBufHdrExt() when
publishing BM_PIN_COUNT_WAITER. If the wait refcount is 1, do not
enter the wait path. Instead, fall through to the existing waiter-bit
cleanup and retry the loop to acquire the cleanup lock normally. The
reproducer test passed after applying the patch.

Why are we retrying, when we already achieved everything we wanted? Seems we
should just unset BM_PIN_COUNT_WAITER / wait_backend_pgprocno. I think the
easiest way to do this would be to not unlock the buffer header lock, but just
atomically-or-in BM_PIN_COUNT_WAITER and then recheck if we already are done,
while still holding the header lock.

Greetings,

Andres Freund

#6Xuneng Zhou
xunengzhou@gmail.com
In reply to: Andres Freund (#4)
Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

Hi Andres,

On Tue, Jun 23, 2026 at 4:04 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2026-06-17 23:00:00 +0300, Alexander Lakhin wrote:

Hello hackers,
With this diagnostic addition:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6743,9 +6743,12 @@ LockBufferForCleanup(Buffer buffer)
}
bufHdr->wait_backend_pgprocno = MyProcNumber;
PinCountWaitBuf = bufHdr;
-               UnlockBufHdrExt(bufHdr, buf_state,
-                                               BM_PIN_COUNT_WAITER, 0,
-                                               0);
+for (volatile int i = 0; i < 10000000; i++);
+               buf_state = UnlockBufHdrExt(bufHdr, buf_state,
+ BM_PIN_COUNT_WAITER, 0,
+                                                               0);
+               if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+                       elog(LOG, "!!!LockBufferForCleanup| wakeup lost");
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

I don't think seeing BUF_STATE_GET_REFCOUNT(buf_state) == 1 guarantees that a
wakeup was lost in any sort of way. Nothing prevents another backend from
acquiring a pin on the buffer while we are waiting for a pincount, so
occasionally seeing BUF_STATE_GET_REFCOUNT() == 1 here is to be expected.

Yeah, the log is misleading as observing
BUF_STATE_GET_REFCOUNT(buf_state) == 1 could be temporary, not
necessarily leading to lost wake-up.

A better wording seems to be:

elog(LOG, "buffer refcount is 1 after publishing cleanup-lock waiter");

You also can't expect repeatedly doing
UnlockBufHdrExt(set_bits=BM_PIN_COUNT_WAITER) to really make sense, because
BM_PIN_COUNT_WAITER only works if it was set *before* another backend releases
its pin, because otherwise the other backend obviously won't notice that the
flag was set if it only was set after the pg_atomic_fetch_sub_u64() in
UnpinBufferNoOwner().

This also makes sense to me. However the reproducer seems not to
repeatedly do UnlockBufHdrExt(set_bits=BM_PIN_COUNT_WAITER). It uses
a for loop as a delay before setting the BM_PIN_COUNT_WAITER, which
seems ok to me. Am I missing sth here?

I don't think the problem is that we are loosing a signal, the problem is that
we need to recheck if BUF_STATE_GET_REFCOUNT(buf_state) == 1 after setting
BM_PIN_COUNT_WAITER. I think this is a bug in (depending on your perspective)
either

commit 5310fac6e0f
Author: Andres Freund <andres@anarazel.de>
Date: 2025-11-06 16:43:16 -0500

bufmgr: Use atomic sub for unpinning buffers

or

commit c75ebc657ffce8dab76471da31aafb79fbe3fda2
Author: Andres Freund <andres@anarazel.de>
Date: 2025-11-06 16:42:10 -0500

bufmgr: Allow some buffer state modifications while holding header lock

Because before those commits, the buffer could not be unpinned while we held
the buffer header lock. With them, this race opens up.

I agree with this as the root cause of the issue. Thanks for your analysis!

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

#7Xuneng Zhou
xunengzhou@gmail.com
In reply to: Andres Freund (#5)
Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

On Tue, Jun 23, 2026 at 4:11 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2026-06-22 16:14:08 +0800, Xuneng Zhou wrote:

This looks like a real lost-wakeup race in LockBufferForCleanup().

I don't think any wakeup is lost. The problem is that we're waiting for a
wakeup when it is not guaranteed that one will arrive.

I think this is true.

The relevant sequence in failed test 048 is:

- Session B opens a cursor and fetches one heap tuple, leaving a heap
buffer pinned.
- Session A starts VACUUM (FREEZE).
- VACUUM reaches that page and waits in LockBufferForCleanup().
- Session B later advances/closes the cursor, releasing the pin.
- VACUUM is expected to wake up and finish.

In the failure, the tap test had already passed its three sql
assertions, but then it timed out waiting for VACUUM completion:

ok 3 - Cursor query returned 7 from second fetch
poll_query_until timed out:
SELECT vacuum_count > 0 ...
last actual query output: f
Looks like your test exited with 29 just after 3.

The diagnostic log shows the actual race:

LOG: !!!LockBufferForCleanup| wakeup lost
CONTEXT: while scanning block 90 of relation "public.vac_horizon_floor_table"

LOG: !!!LockBufferForCleanup]| before ProcWaitForSignal
CONTEXT: while scanning block 90 of relation "public.vac_horizon_floor_table"

I don't think this is meaningful output, as written upthread.

I agree that the output could be misleading. The missing post-wait log
seems to be an indication of 'no one wakes us after falling asleep
unnecessary' in this specific test case since no later backend
pins/unpins that same table buffer. Sure this is just the symptom.

We can fix this with the state returned by UnlockBufHdrExt() when
publishing BM_PIN_COUNT_WAITER. If the wait refcount is 1, do not
enter the wait path. Instead, fall through to the existing waiter-bit
cleanup and retry the loop to acquire the cleanup lock normally. The
reproducer test passed after applying the patch.

Why are we retrying, when we already achieved everything we wanted? Seems we
should just unset BM_PIN_COUNT_WAITER / wait_backend_pgprocno. I think the
easiest way to do this would be to not unlock the buffer header lock, but just
atomically-or-in BM_PIN_COUNT_WAITER and then recheck if we already are done,
while still holding the header lock.

Yeah, there's no need to retry. What you proposed looks cleaner and
better. I'll update the patch as suggested. Thanks!
--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

#8Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#7)
Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

On Tue, Jun 23, 2026 at 10:44 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Tue, Jun 23, 2026 at 4:11 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2026-06-22 16:14:08 +0800, Xuneng Zhou wrote:

This looks like a real lost-wakeup race in LockBufferForCleanup().

I don't think any wakeup is lost. The problem is that we're waiting for a
wakeup when it is not guaranteed that one will arrive.

I think this is true.

The relevant sequence in failed test 048 is:

- Session B opens a cursor and fetches one heap tuple, leaving a heap
buffer pinned.
- Session A starts VACUUM (FREEZE).
- VACUUM reaches that page and waits in LockBufferForCleanup().
- Session B later advances/closes the cursor, releasing the pin.
- VACUUM is expected to wake up and finish.

In the failure, the tap test had already passed its three sql
assertions, but then it timed out waiting for VACUUM completion:

ok 3 - Cursor query returned 7 from second fetch
poll_query_until timed out:
SELECT vacuum_count > 0 ...
last actual query output: f
Looks like your test exited with 29 just after 3.

The diagnostic log shows the actual race:

LOG: !!!LockBufferForCleanup| wakeup lost
CONTEXT: while scanning block 90 of relation "public.vac_horizon_floor_table"

LOG: !!!LockBufferForCleanup]| before ProcWaitForSignal
CONTEXT: while scanning block 90 of relation "public.vac_horizon_floor_table"

I don't think this is meaningful output, as written upthread.

I agree that the output could be misleading. The missing post-wait log
seems to be an indication of 'no one wakes us after falling asleep
unnecessary' in this specific test case since no later backend
pins/unpins that same table buffer. Sure this is just the symptom.

We can fix this with the state returned by UnlockBufHdrExt() when
publishing BM_PIN_COUNT_WAITER. If the wait refcount is 1, do not
enter the wait path. Instead, fall through to the existing waiter-bit
cleanup and retry the loop to acquire the cleanup lock normally. The
reproducer test passed after applying the patch.

Why are we retrying, when we already achieved everything we wanted? Seems we
should just unset BM_PIN_COUNT_WAITER / wait_backend_pgprocno. I think the
easiest way to do this would be to not unlock the buffer header lock, but just
atomically-or-in BM_PIN_COUNT_WAITER and then recheck if we already are done,
while still holding the header lock.

Yeah, there's no need to retry. What you proposed looks cleaner and
better. I'll update the patch as suggested. Thanks!

Here is the updated patch. Please take a look.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

Attachments:

v2-0001-bufmgr-avoid-waiting-without-a-guaranteed-pin-cou.patchapplication/octet-stream; name=v2-0001-bufmgr-avoid-waiting-without-a-guaranteed-pin-cou.patchDownload+46-22
#9Xuneng Zhou
xunengzhou@gmail.com
In reply to: Melanie Plageman (#3)
Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

Hi Melanie,

Does the reproducer pass prior to 5310fac6e0f?

I've verified that the reproducer did pass prior to the commit. For a
20-times run of this test, every wait entered ProcWaitForSignal() and
returned, and the "refcount already 1 after waiter publication" line
never appeared.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

Attachments:

before-5310-cleanup-lock-reproducer.patchapplication/octet-stream; name=before-5310-cleanup-lock-reproducer.patchDownload+10-3