pgsql: Fix remaining race condition with CLOG truncation and LISTEN/NOT

Started by Heikki Linnakangas8 months ago2 messagescomitters
Jump to latest
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

Fix remaining race condition with CLOG truncation and LISTEN/NOTIFY

Previous commit fixed a bug where VACUUM would truncate the CLOG
that's still needed to check the commit status of XIDs in the async
notify queue, but as mentioned in the commit message, it wasn't a full
fix. If a backend is executing asyncQueueReadAllNotifications() and
has just made a local copy of an async SLRU page which contains old
XIDs, vacuum can concurrently truncate the CLOG covering those XIDs,
and the backend still gets an error when it calls
TransactionIdDidCommit() on those XIDs in the local copy. This commit
fixes that race condition.

To fix, hold the SLRU bank lock across the TransactionIdDidCommit()
calls in NOTIFY processing.

Per Tom Lane's idea. Backpatch to all supported versions.

Reviewed-by: Joel Jacobson <joel@compiler.org>
Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Discussion: /messages/by-id/2759499.1761756503@sss.pgh.pa.us
Backpatch-through: 14

Branch
------
REL_16_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/44e8c60be66c7c174431c4cd7948c1ff015fe516

Modified Files
--------------
src/backend/commands/async.c | 123 ++++++++++++++++++++++---------------------
1 file changed, 62 insertions(+), 61 deletions(-)

#2Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#1)
Re: pgsql: Fix remaining race condition with CLOG truncation and LISTEN/NOT

Hi Heikki,

On Wed, Nov 12, 2025 at 07:16:38PM +0000, Heikki Linnakangas wrote:

Fix remaining race condition with CLOG truncation and LISTEN/NOTIFY

Previous commit fixed a bug where VACUUM would truncate the CLOG
that's still needed to check the commit status of XIDs in the async
notify queue, but as mentioned in the commit message, it wasn't a full
fix. If a backend is executing asyncQueueReadAllNotifications() and
has just made a local copy of an async SLRU page which contains old
XIDs, vacuum can concurrently truncate the CLOG covering those XIDs,
and the backend still gets an error when it calls
TransactionIdDidCommit() on those XIDs in the local copy. This commit
fixes that race condition.

To fix, hold the SLRU bank lock across the TransactionIdDidCommit()
calls in NOTIFY processing.

While reading through this set of changes, the addition of this commit
across the v14~v16 range has made me smile:

+    * Now that we have let go of the SLRU bank lock, send the notifications
+    * to our backend

SLRU bank locks have been introduced in v17 by Alvaro in 53c2a97a9266.
Just noticed while going through.
--
Michael