LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Started by Alexandra Wang5 months ago99 messages

alexandra.wang.oss@gmail.com

5 months ago

1 attachment(s)

Hi,

I'm bringing up a bug that was reported multiple times [1]/messages/by-id/18804-bccbbde5e77a68c2@postgresql.org[2]/messages/by-id/16961-25f29f95b3604a8a@postgresql.org[3]/messages/by-id/18394-e7459245148578b2@postgresql.org in
the bugs list here, for a broader audience.

The issue is that an ERROR like the one below occurs when trying to
register any listener in the database.

test=# listen c21;
ERROR: 58P01: could not access status of transaction 14279685
DETAIL: Could not open file "pg_xact/000D": No such file or directory.
LOCATION: SlruReportIOError, slru.c:1087

In [1]/messages/by-id/18804-bccbbde5e77a68c2@postgresql.org, Andrei Varashen provided detailed reproduction steps. I’m
copying and pasting his example from thread [1]/messages/by-id/18804-bccbbde5e77a68c2@postgresql.org here, with slight
simplification.

Pre-conditions:
- Disable autovacuum to avoid its intervention.

Steps to reproduce:

1. Create a test table and notify (but not listen) to a channel
(backend 1):

create table test (id int);
insert into test values (1);
notify c1;

2. List pg_xact files so we know its starting state:

➜ ls -lah ~/pg-devel/data/pg_xact
total 16
drwx------@ 3 alex.wang staff 96B Aug 5 20:10 .
drwx------@ 26 alex.wang staff 832B Aug 5 20:12 ..
-rw-------@ 1 alex.wang staff 8.0K Aug 5 20:12 0000

3. Prepare a "test.sql" file for pgbench, and then run pgbench to
generate lots of transactions, so that pg_xact/0000 is completely
filled and leads to pg_xact/0001.

cat test.sql

UPDATE test SET id = 1;

pgbench -n -c 80 -j 10 -t 15000 -f ~/workspace/test.sql postgres

4. Verify that pg_xact/0001 is created:

➜ ls -lah ~/pg-devel/data/pg_xact
total 560
drwx------@ 4 alex.wang staff 128B Aug 5 20:25 .
drwx------@ 26 alex.wang staff 832B Aug 5 20:12 ..
-rw-------@ 1 alex.wang staff 256K Aug 5 20:25 0000
-rw-------@ 1 alex.wang staff 16K Aug 5 20:25 0001

5. Execute VACUUM FREEZE on every database on the server to freeze
rows and purge pg_xact/0000.

postgres=# VACUUM FREEZE;
VACUUM

postgres=# \c template1
You are now connected to database "template1" as user "postgres".
template1=# VACUUM FREEZE;
VACUUM

template1=# ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;
ALTER DATABASE
template1=# \c template0
You are now connected to database "template0" as user "postgres".
template0=# VACUUM FREEZE;
VACUUM
postgres=# select datname, datfrozenxid from pg_database;
datname | datfrozenxid
-----------+--------------
postgres | 1200774
template0 | 1200775
template1 | 1200774
(3 rows)

6. Ensure that pg_xact/0000 is gone:

➜ ls -lah ~/pg-devel/data/pg_xact
total 80
drwx------@ 3 alex.wang staff 96B Aug 5 20:29 .
drwx------@ 26 alex.wang staff 832B Aug 5 20:12 ..
-rw-------@ 1 alex.wang staff 40K Aug 5 20:30 0001

7. Try to listen to any channel from any backend connection on the
same database:

postgres=# listen c1;
ERROR: 58P01: could not access status of transaction 773
DETAIL: Could not open file "pg_xact/0000": No such file or directory.
LOCATION: SlruReportIOError, slru.c:1087
postgres=# listen c2;
ERROR: 58P01: could not access status of transaction 773
DETAIL: Could not open file "pg_xact/0000": No such file or directory.
LOCATION: SlruReportIOError, slru.c:1087

Here's the stack trace from the master branch:
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 21.1
* frame #0: 0x0000000102c798c8
postgres`SlruReportIOError(ctl=0x000000010378e5e0, pageno=0, xid=773) at
slru.c:1084:4
frame #1: 0x0000000102c792b0
postgres`SimpleLruReadPage(ctl=0x000000010378e5e0, pageno=0, write_ok=true,
xid=773) at slru.c:603:4
frame #2: 0x0000000102c79f7c
postgres`SimpleLruReadPage_ReadOnly(ctl=0x000000010378e5e0, pageno=0,
xid=773) at slru.c:661:9
frame #3: 0x0000000102c6a6bc postgres`TransactionIdGetStatus(xid=773,
lsn=0x000000016d2975e8) at clog.c:745:11
frame #4: 0x0000000102c7e924
postgres`TransactionLogFetch(transactionId=773) at transam.c:79:14
frame #5: 0x0000000102c7e74c
postgres`TransactionIdDidCommit(transactionId=773) at transam.c:130:14
frame #6: 0x0000000102dc8b94
postgres`asyncQueueProcessPageEntries(current=0x000000016d297720,
stop=QueuePosition @ 0x000000016d297690, page_buffer="\U00000014",
snapshot=0x000000012d812398) at async.c:2069:13
frame #7: 0x0000000102dc8954 postgres`asyncQueueReadAllNotifications at
async.c:1981:18
frame #8: 0x0000000102dc6b5c postgres`Exec_ListenPreCommit at
async.c:1127:3
frame #9: 0x0000000102dc664c postgres`PreCommit_Notify at async.c:881:6
frame #10: 0x0000000102c8c77c postgres`CommitTransaction at
xact.c:2341:2
frame #11: 0x0000000102c87b2c postgres`CommitTransactionCommandInternal
at xact.c:3214:4
frame #12: 0x0000000102c87a44 postgres`CommitTransactionCommand at
xact.c:3175:10
frame #13: 0x000000010321da94 postgres`finish_xact_command at
postgres.c:2833:3
frame #14: 0x000000010321b64c
postgres`exec_simple_query(query_string="listen c1;") at postgres.c:1298:4
frame #15: 0x000000010321a714 postgres`PostgresMain(dbname="postgres",
username="alex.wang") at postgres.c:4767:7
frame #16: 0x0000000103211a14
postgres`BackendMain(startup_data=0x000000016d299e48, startup_data_len=24)
at backend_startup.c:124:2
frame #17: 0x00000001030e938c
postgres`postmaster_child_launch(child_type=B_BACKEND, child_slot=56,
startup_data=0x000000016d299e48, startup_data_len=24,
client_sock=0x000000016d299ed8) at launch_backend.c:290:3
frame #18: 0x00000001030f0d60
postgres`BackendStartup(client_sock=0x000000016d299ed8) at
postmaster.c:3587:8
frame #19: 0x00000001030eebc4 postgres`ServerLoop at postmaster.c:1702:6
frame #20: 0x00000001030ed67c postgres`PostmasterMain(argc=3,
argv=0x00006000021f14e0) at postmaster.c:1400:11
frame #21: 0x0000000102f73e40 postgres`main(argc=3,
argv=0x00006000021f14e0) at main.c:231:4
frame #22: 0x00000001940a2b98 dyld`start + 6076

Daniil Davydov has analyzed the root cause in thread [4]/messages/by-id/CAJDiXgh3Jh2N90Fe4=X2qE+YAiZ1BSpgznhH+vkTHmvK3gjqxw@mail.gmail.com and I agree with
what he said:

On Thu, Jul 31, 2025 at 8:21 PM Daniil Davydov <3danissimo@gmail.com> wrote:

We have the following logic in the notify queue :
If there are no listeners within all databases, and we are calling
LISTEN, then we must iterate from 'tail' to 'head' of the queue and
check statuses of transactions (see Exec_ListenPreCommit).
If there is a pruned-away xid in the queue, we will try to access its
status and get an error.

Because the tail of the queue is not necessarily always advanced
forward by the listeners, we can get such error without any long lived
transactions.

The fix and workarounds were discussed in [5]/messages/by-id/CAJDiXgj1BmLKAZ=gOC1eETmctt7z=aj8MHfmd+nORE1P6qncsA@mail.gmail.com and [6]/messages/by-id/CAK98qZ0uZrf1b6L4mDs+17M07KCNt6YMc9vD-+TmeWaJHRbGpA@mail.gmail.com: In [5]/messages/by-id/CAJDiXgj1BmLKAZ=gOC1eETmctt7z=aj8MHfmd+nORE1P6qncsA@mail.gmail.com, Daniil
proposed a patch, which I’ve attached. The patch adds a call to
asyncQueueAdvanceTail() in vac_update_datfrozenxid(), so that VACUUM
advances the async queue tail.

Similarly, calling the built-in function pg_notification_queue_usage()
also advances the async queue tail. So when the issue occurs, calling
this function could also make the error go away. However, this doesn’t
prevent the error from happening in the first place.

My questions:

1. Is it acceptable to drop notifications from the async queue if
there are no active listeners? There might still be notifications that
haven’t been read by any previous listener.

2. If the answer to 1 is no, how can we teach VACUUM to respect the
minimum xid stored in all AsyncQueueEntries?

Best,
Alex

[1]: /messages/by-id/18804-bccbbde5e77a68c2@postgresql.org
/messages/by-id/18804-bccbbde5e77a68c2@postgresql.org
[2]: /messages/by-id/16961-25f29f95b3604a8a@postgresql.org
/messages/by-id/16961-25f29f95b3604a8a@postgresql.org
[3]: /messages/by-id/18394-e7459245148578b2@postgresql.org
/messages/by-id/18394-e7459245148578b2@postgresql.org
[4]: /messages/by-id/CAJDiXgh3Jh2N90Fe4=X2qE+YAiZ1BSpgznhH+vkTHmvK3gjqxw@mail.gmail.com
/messages/by-id/CAJDiXgh3Jh2N90Fe4=X2qE+YAiZ1BSpgznhH+vkTHmvK3gjqxw@mail.gmail.com
[5]: /messages/by-id/CAJDiXgj1BmLKAZ=gOC1eETmctt7z=aj8MHfmd+nORE1P6qncsA@mail.gmail.com
/messages/by-id/CAJDiXgj1BmLKAZ=gOC1eETmctt7z=aj8MHfmd+nORE1P6qncsA@mail.gmail.com
[6]: /messages/by-id/CAK98qZ0uZrf1b6L4mDs+17M07KCNt6YMc9vD-+TmeWaJHRbGpA@mail.gmail.com
/messages/by-id/CAK98qZ0uZrf1b6L4mDs+17M07KCNt6YMc9vD-+TmeWaJHRbGpA@mail.gmail.com

Attachments:

0001-Advance-tail-of-async-queue-before-updating-datfroze.patch.no-cfbotapplication/octet-stream; name=0001-Advance-tail-of-async-queue-before-updating-datfroze.patch.no-cfbotDownload

From 3ed2c16c7d4b03514d9828c96818a9b0ac6c279b Mon Sep 17 00:00:00 2001
From: Daniil Davidov <d.davydov@postgrespro.ru>
Date: Mon, 4 Aug 2025 17:49:23 +0700
Subject: [PATCH] Advance tail of async queue before updating datfrozenxid in
 vacuum

---
 src/backend/commands/async.c  |  9 ++++-----
 src/backend/commands/vacuum.c | 10 ++++++++++
 src/include/commands/async.h  |  2 ++
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..5504132cf1b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -450,7 +450,6 @@ static bool asyncQueueProcessPageEntries(volatile QueuePosition *current,
 										 QueuePosition stop,
 										 char *page_buffer,
 										 Snapshot snapshot);
-static void asyncQueueAdvanceTail(void);
 static void ProcessIncomingNotify(bool flush);
 static bool AsyncExistsPendingNotify(Notification *n);
 static void AddEventToPendingNotifies(Notification *n);
@@ -1025,7 +1024,7 @@ AtCommit_Notify(void)
 	if (tryAdvanceTail)
 	{
 		tryAdvanceTail = false;
-		asyncQueueAdvanceTail();
+		AsyncQueueAdvanceTail();
 	}
 
 	/* And clean up */
@@ -1483,7 +1482,7 @@ pg_notification_queue_usage(PG_FUNCTION_ARGS)
 	double		usage;
 
 	/* Advance the queue tail so we don't report a too-large result */
-	asyncQueueAdvanceTail();
+	AsyncQueueAdvanceTail();
 
 	LWLockAcquire(NotifyQueueLock, LW_SHARED);
 	usage = asyncQueueUsage();
@@ -2104,8 +2103,8 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
  * This is (usually) called during CommitTransaction(), so it's important for
  * it to have very low probability of failure.
  */
-static void
-asyncQueueAdvanceTail(void)
+void
+AsyncQueueAdvanceTail(void)
 {
 	QueuePosition min;
 	int64		oldtailpage;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 733ef40ae7c..aedae9bf94d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1632,6 +1633,15 @@ vac_update_datfrozenxid(void)
 	 */
 	LockDatabaseFrozenIds(ExclusiveLock);
 
+	/*
+	 * Note, that messages in async queue store transaction id of their senders.
+	 * We must explicitly advance tail of async queue so that new channel
+	 * listeners don't have to process old messages (during
+	 * asyncQueueReadAllNotifications call), i.e. don't have to check their
+	 * transactions statuses (that might be already gone after clog truncate).
+	 */
+	AsyncQueueAdvanceTail();
+
 	/*
 	 * Initialize the "min" calculation with
 	 * GetOldestNonRemovableTransactionId(), which is a reasonable
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..47275f6f959 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -43,6 +43,8 @@ extern void AtPrepare_Notify(void);
 /* signal handler for inbound notifies (PROCSIG_NOTIFY_INTERRUPT) */
 extern void HandleNotifyInterrupt(void);
 
+extern void AsyncQueueAdvanceTail(void);
+
 /* process interrupts */
 extern void ProcessNotifyInterrupt(bool flush);
 
-- 
2.43.0

Álvaro Herrera

alvherre@kurilemu.de

5 months ago

In reply to: Alexandra Wang (#1)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 2025-Aug-05, Alexandra Wang wrote:

I'm bringing up a bug that was reported multiple times [1][2][3] in
the bugs list here, for a broader audience.

The issue is that an ERROR like the one below occurs when trying to
register any listener in the database.

test=# listen c21;
ERROR: 58P01: could not access status of transaction 14279685
DETAIL: Could not open file "pg_xact/000D": No such file or directory.
LOCATION: SlruReportIOError, slru.c:1087

Oh, interesting problem. Many thanks for the excellent write-up.

My questions:

1. Is it acceptable to drop notifications from the async queue if
there are no active listeners? There might still be notifications that
haven’t been read by any previous listener.

I'm somewhat wary of this idea -- could these inactive listeners become
active later and expect to be able to read their notifies?

2. If the answer to 1 is no, how can we teach VACUUM to respect the
minimum xid stored in all AsyncQueueEntries?

Maybe we can have AsyncQueueAdvanceTail return the oldest XID of
listeners, and back off the pg_clog truncation based on that. This
could be done by having a new boolean argument that says to look up the
XID from the PGPROC using BackendPidGetProc(QUEUE_BACKEND_PID) (which
would only be passed true by vac_update_datfrozenxid(), to avoid
overhead by other callers), then collect the oldest of those and return
it.

This does create the problem that an inactive listener could cause the
XID counter to stay far in the past. Maybe we could try to avoid this
by adding more signalling (e.g, AsyncQueueAdvanceTail() itself could
send PROCSIG_NOTIFY_INTERRUPT signal?), and terminating backends that
are way overdue on reading notifies. I'm not sure if this is really
needed or useful; consider a backend stuck on SIGSTOP (debugger or
whatever): it will just sit there forever.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

Matheus Alcantara

matheusssilv97@gmail.com

5 months ago

In reply to: Álvaro Herrera (#2)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed Aug 6, 2025 at 7:44 AM -03, Álvaro Herrera wrote:

My questions:

1. Is it acceptable to drop notifications from the async queue if
there are no active listeners? There might still be notifications that
haven’t been read by any previous listener.

I'm somewhat wary of this idea -- could these inactive listeners become
active later and expect to be able to read their notifies?

I'm bit worry about this too.

2. If the answer to 1 is no, how can we teach VACUUM to respect the
minimum xid stored in all AsyncQueueEntries?

Maybe we can have AsyncQueueAdvanceTail return the oldest XID of
listeners, and back off the pg_clog truncation based on that. This
could be done by having a new boolean argument that says to look up the
XID from the PGPROC using BackendPidGetProc(QUEUE_BACKEND_PID) (which
would only be passed true by vac_update_datfrozenxid(), to avoid
overhead by other callers), then collect the oldest of those and return
it.

The problem with only considering the oldest XID of listeners is that
IIUC we may have notifications without listeners, and in this case we
may still get this error because when the LISTEN is executed we loop
through the AsyncQueueEntry's on asyncQueueProcessPageEntries() and we
call TransactionIdDidCommit() that raise the error before
IsListeningOn(channel) is called.

Another option would be to add a minXid field on AsyncQueueControl and
then update this value on asyncQueueProcessPageEntries() and
asyncQueueAddEntries() routines, and then we could check this value on
vac_update_datfrozenxid().

This does create the problem that an inactive listener could cause the
XID counter to stay far in the past. Maybe we could try to avoid this
by adding more signalling (e.g, AsyncQueueAdvanceTail() itself could
send PROCSIG_NOTIFY_INTERRUPT signal?), and terminating backends that
are way overdue on reading notifies. I'm not sure if this is really
needed or useful; consider a backend stuck on SIGSTOP (debugger or
whatever): it will just sit there forever.

With this idea that I've proposed we still could have this problem, if a
listener take too long to consume a message we would block vacuum freeze
to advance the xid. For this I think that we could have two GUC's; One
to enable and disable the oldest xmin check on async queue and the
second to control how far we want to prevent the vacuum from freezing
the oldest async queue xid, and if the min xid raises this limit we
ignore and truncate the xid.

I've write a draft patch that plays with the idea, see attached.

--
Matheus Alcantara

Attachments:

v0-0001-Consider-async-queue-min-xid-on-VACUUM-FREEZE.patchtext/plain; charset=utf-8; name=v0-0001-Consider-async-queue-min-xid-on-VACUUM-FREEZE.patchDownload

From 7ec33dfc14174e9a4424a1738977601a4fd13bf5 Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Sat, 9 Aug 2025 13:51:21 -0300
Subject: [PATCH v0] Consider async queue min xid on VACUUM FREEZE

---
 src/backend/commands/async.c  | 58 +++++++++++++++++++++++++++++++++++
 src/backend/commands/vacuum.c | 11 +++++++
 src/include/commands/async.h  |  3 ++
 3 files changed, 72 insertions(+)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..fff055601a1 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -287,6 +287,7 @@ typedef struct AsyncQueueControl
 								 * tail.page */
 	ProcNumber	firstListener;	/* id of first listener, or
 								 * INVALID_PROC_NUMBER */
+	TransactionId minXid;
 	TimestampTz lastQueueFillWarn;	/* time of last queue-full msg */
 	QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
 } AsyncQueueControl;
@@ -458,6 +459,9 @@ static uint32 notification_hash(const void *key, Size keysize);
 static int	notification_match(const void *key1, const void *key2, Size keysize);
 static void ClearPendingActionsAndNotifies(void);
 
+static void StoreEntryXid(TransactionId xid);
+static void ReleaseEntryXid(TransactionId xid);
+
 /*
  * Compute the difference between two queue page numbers.
  * Previously this function accounted for a wraparound.
@@ -521,6 +525,7 @@ AsyncShmemInit(void)
 		QUEUE_STOP_PAGE = 0;
 		QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
 		asyncQueueControl->lastQueueFillWarn = 0;
+		asyncQueueControl->minXid = MaxTransactionId;
 		for (int i = 0; i < MaxBackends; i++)
 		{
 			QUEUE_BACKEND_PID(i) = InvalidPid;
@@ -1428,6 +1433,12 @@ asyncQueueAddEntries(ListCell *nextNotify)
 			   &qe,
 			   qe.length);
 
+		/*
+		 * Remember the notification xid so that vacuum don't frozen after
+		 * this xid.
+		 */
+		StoreEntryXid(qe.xid);
+
 		/* Advance queue_head appropriately, and detect if page is full */
 		if (asyncQueueAdvance(&(queue_head), qe.length))
 		{
@@ -2077,6 +2088,12 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 					char	   *payload = qe->data + strlen(channel) + 1;
 
 					NotifyMyFrontEnd(channel, payload, qe->srcPid);
+
+					/*
+					 * Notification was sent, so release the notification xid
+					 * so that vacuum can freeze past this notification.
+					 */
+					ReleaseEntryXid(qe->xid);
 				}
 			}
 			else
@@ -2395,3 +2412,44 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
 {
 	return check_slru_buffers("notify_buffers", newval);
 }
+
+/*
+ *  Return the minimum notification xid on the queue.
+ *
+ *  TODO(matheus): handle when the queue is empty, so we don't get locked in
+ *  the past with a notification that was already sent.
+ */
+TransactionId
+AsyncQueueMinXid()
+{
+	return asyncQueueControl->minXid;
+}
+
+/*
+ *  Subroutine of asyncQueueAddEntries
+ *
+ *  We are holding NotifyQueueLock already from the caller.
+ */
+static void
+StoreEntryXid(TransactionId xid)
+{
+	if (xid < asyncQueueControl->minXid)
+		asyncQueueControl->minXid = xid;
+}
+
+/*
+ * Subroutine of asyncQueueAddEntries
+ *
+ * TODO(matheus): may update the code comment on
+ * asyncQeueuReadAllNotifications() when calling
+ * asyncQueueProcessPageEntries().
+ *
+ */
+static void
+ReleaseEntryXid(TransactionId xid)
+{
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+	if (xid > asyncQueueControl->minXid)
+		asyncQueueControl->minXid = xid;
+	LWLockRelease(NotifyQueueLock);
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 733ef40ae7c..f0d6d868e60 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -21,6 +21,7 @@
  *
  *-------------------------------------------------------------------------
  */
+#include "commands/async.h"
 #include "postgres.h"
 
 #include <math.h>
@@ -1739,6 +1740,16 @@ vac_update_datfrozenxid(void)
 	if (bogus)
 		return;
 
+	/*
+	 * We need to check transaction status of notifications before of notify
+	 * the client, if there is lag to consume the notifications we need to
+	 * consider the older xid of notification on the queue so that the
+	 * transaction status can be accessed.
+	 *
+	 * XXX(matheus): Wrap this behavior into a GUC?
+	 */
+	newFrozenXid = Min(newFrozenXid, AsyncQueueMinXid());
+
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
 
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..512a8976126 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -13,6 +13,7 @@
 #ifndef ASYNC_H
 #define ASYNC_H
 
+#include <c.h>
 #include <signal.h>
 
 extern PGDLLIMPORT bool Trace_notify;
@@ -46,4 +47,6 @@ extern void HandleNotifyInterrupt(void);
 /* process interrupts */
 extern void ProcessNotifyInterrupt(bool flush);
 
+extern TransactionId AsyncQueueMinXid(void);
+
 #endif							/* ASYNC_H */
-- 
2.39.5 (Apple Git-154)

Daniil Davydov

3danissimo@gmail.com

5 months ago

In reply to: Matheus Alcantara (#3)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Hi,

On Mon, Aug 11, 2025 at 8:41 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Wed Aug 6, 2025 at 7:44 AM -03, Álvaro Herrera wrote:

My questions:

1. Is it acceptable to drop notifications from the async queue if
there are no active listeners? There might still be notifications that
haven’t been read by any previous listener.

I'm somewhat wary of this idea -- could these inactive listeners become
active later and expect to be able to read their notifies?

I'm bit worry about this too.

What exactly do we mean by "active listener"? According to the source code,
the active listener (as far as I understand) is the one who listens to at least
one channel. If we have no active listeners in the database, the new listener
will set its pointer to the tail of the async queue. Thus, messages with old
xid will not be touched by anybody. I don't see any point in dropping them
in this case.

If the "inactive" listener is the backend which is stuck somewhere, the
answer is "no" - this backend should be able to process all notifications.

2. If the answer to 1 is no, how can we teach VACUUM to respect the
minimum xid stored in all AsyncQueueEntries?

Maybe we can have AsyncQueueAdvanceTail return the oldest XID of
listeners, and back off the pg_clog truncation based on that. This
could be done by having a new boolean argument that says to look up the
XID from the PGPROC using BackendPidGetProc(QUEUE_BACKEND_PID) (which
would only be passed true by vac_update_datfrozenxid(), to avoid
overhead by other callers), then collect the oldest of those and return
it.

The problem with only considering the oldest XID of listeners is that
IIUC we may have notifications without listeners, and in this case we
may still get this error because when the LISTEN is executed we loop
through the AsyncQueueEntry's on asyncQueueProcessPageEntries() and we
call TransactionIdDidCommit() that raise the error before
IsListeningOn(channel) is called.

Agree.

This does create the problem that an inactive listener could cause the
XID counter to stay far in the past. Maybe we could try to avoid this
by adding more signalling (e.g, AsyncQueueAdvanceTail() itself could
send PROCSIG_NOTIFY_INTERRUPT signal?), and terminating backends that
are way overdue on reading notifies. I'm not sure if this is really
needed or useful; consider a backend stuck on SIGSTOP (debugger or
whatever): it will just sit there forever.

With this idea that I've proposed we still could have this problem, if a
listener take too long to consume a message we would block vacuum freeze
to advance the xid. For this I think that we could have two GUC's; One
to enable and disable the oldest xmin check on async queue and the
second to control how far we want to prevent the vacuum from freezing
the oldest async queue xid, and if the min xid raises this limit we
ignore and truncate the xid.

1) About Álvaro's comment:
I don't think that killing the lagging backend will save us, because entries
in the async queue are not ordered by xid. Thus, even after killing such a
backend, we have no guarantees that the problem is gone, because not
lagging backends still may encounter a super old xid.

2) About Matheus's comment:
I guess that there is only one reliable way to determine the precise minimal xid
in the queue (after message processing) - scan all pages from head to tail.
It obviously can take a lot of time. The second GUC (if I get it
right) logic is based
on minimal xid determination, so it won't be the best solution for
highload systems
(that are forced to turn on the first GUC, because of safety requirements).

Another option would be to add a minXid field on AsyncQueueControl and
then update this value on asyncQueueProcessPageEntries() and
asyncQueueAddEntries() routines, and then we could check this value on
vac_update_datfrozenxid().

I've write a draft patch that plays with the idea, see attached.

Thanks for the patch! I have few comments on it :
1)
Maybe we should add an assert that NotifyQueueLock is held to the StoreEntryXid
function?

2)
For fields in AsyncQueueControl we have macros like QUEUE_HEAD,
QUEUE_TAIL and so on. Maybe minXid should also be wrapped with something
like QUEUE_MIN_XID?

3)
This logic will not work with (for example) two listeners. Steps :
session1=# listen c1;
session2=# listen c1;
session3=# notify c1; // sets minimal xid to N inside StoreEntryXid
session3=# notify c1;
session1=# listen c1; // sets minimal xid to (N + 1) inside ReleaseEntryXid

After these steps, session2 still needs to process a message with xid = N, but
autovacuum legally can freeze it and truncate clog.

Maybe we should check whether there are no listeners with smaller queue
position compared to the current backend?

4)
After ReleaseEntryXid, minXid field contains not actually minimal xid
in the queue,
but maximum processed xid. Maybe this parameter can be renamed?

--
Best regards,
Daniil Davydov

Matheus Alcantara

matheusssilv97@gmail.com

5 months ago

In reply to: Daniil Davydov (#4)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed Aug 13, 2025 at 4:29 PM -03, Daniil Davydov wrote:

Hi,

On Mon, Aug 11, 2025 at 8:41 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Wed Aug 6, 2025 at 7:44 AM -03, Álvaro Herrera wrote:

My questions:

1. Is it acceptable to drop notifications from the async queue if
there are no active listeners? There might still be notifications that
haven’t been read by any previous listener.

I'm somewhat wary of this idea -- could these inactive listeners become
active later and expect to be able to read their notifies?

I'm bit worry about this too.

What exactly do we mean by "active listener"? According to the source code,
the active listener (as far as I understand) is the one who listens to at least
one channel. If we have no active listeners in the database, the new listener
will set its pointer to the tail of the async queue. Thus, messages with old
xid will not be touched by anybody. I don't see any point in dropping them
in this case.

I think that this definition is correct, but IIUC the tail can still
have notifications with xid's that were already truncated by vacuum
freeze. When the LISTEN is executed, we first loop through the
notification queue to try to advance the queue pointers and we can
eventually iterate over a notification that was added on the queue
without any listener but it has a xid that is already truncated by vacuum
freeze, so in this case it will fail to get the transaction status. On
Alex steps to reproduce the issue it first executes the NOTIFY and
then executes the LISTEN which fails after vacuum freeze.

If the "inactive" listener is the backend which is stuck somewhere, the
answer is "no" - this backend should be able to process all notifications.

I tried to reproduce the issue by using some kind of "inactive"
listener but so far I didn't manage to trigger the error. This is what I
tried:

1. Create listener:
postgres=# listen c1;

2. Execute a very long query to make the backend busy to process the
notification:
postgres=# select * from generate_series(1,10000000000) g where g > 1;

3. On another session send the notification
postgres=# notify c1;

4. Execute pgbench test:
pgbench -n -c 80 -j 10 -t 15000 -f test.sql postgres

5. Verify that we have multiple files on pg_xact:
➜ ls -lah ~/pg-devel/data/pg_xact
total 608
-rw-------@ 1 matheus staff 256K Aug 18 20:36 0000
-rw-------@ 1 matheus staff 40K Aug 18 20:56 0001

6. Execute VACUUM FREEZE on every database on the server

postgres=# VACUUM FREEZE;
VACUUM

postgres=# \c template1
You are now connected to database "template1" as user "postgres".
template1=# VACUUM FREEZE;
VACUUM

template1=# \c template0
You are now connected to database "template0" as user "postgres".
template0=# VACUUM FREEZE;
VACUUM

After the vacuum freeze I still can see the same files on pg_xact/ and
if I cancel the long query the notification is received correctly, and
then if I execute vacuum freeze again on every database the oldest
pg_xact file is truncated.

So, if my tests are correct I don't think that storing the oldest xid is
necessary anymore since I don't think that we can lose notifications
using the patch from Daniil or I'm missing something here?

Thinking about this, maybe another solution for this would be to change
queue advancing pointers to skip the transaction status check? With this
we would not need to touch any vacuum freeze code and instead change the
LISTEN execution to handle this scenario.

Thoughts?

--
Matheus Alcantara

Daniil Davydov

3danissimo@gmail.com

5 months ago

In reply to: Matheus Alcantara (#5)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Hi,

On Tue, Aug 19, 2025 at 7:14 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Wed Aug 13, 2025 at 4:29 PM -03, Daniil Davydov wrote:

What exactly do we mean by "active listener"? According to the source code,
the active listener (as far as I understand) is the one who listens to at least
one channel. If we have no active listeners in the database, the new listener
will set its pointer to the tail of the async queue. Thus, messages with old
xid will not be touched by anybody. I don't see any point in dropping them
in this case.

I think that this definition is correct, but IIUC the tail can still
have notifications with xid's that were already truncated by vacuum
freeze. When the LISTEN is executed, we first loop through the
notification queue to try to advance the queue pointers and we can
eventually iterate over a notification that was added on the queue
without any listener but it has a xid that is already truncated by vacuum
freeze, so in this case it will fail to get the transaction status. On
Alex steps to reproduce the issue it first executes the NOTIFY and
then executes the LISTEN which fails after vacuum freeze.

Yeah, you are right. I looked at the code again, and found out that even
if there are no active listeners, new listener should iterate from the head
to the tail. Thus, it may encounter truncated xid. Anyway, I still think that
dropping notifications is not the best way to resolve this issue.

If the "inactive" listener is the backend which is stuck somewhere, the
answer is "no" - this backend should be able to process all notifications.

I tried to reproduce the issue by using some kind of "inactive"
listener but so far I didn't manage to trigger the error.

After the vacuum freeze I still can see the same files on pg_xact/ and
if I cancel the long query the notification is received correctly, and
then if I execute vacuum freeze again on every database the oldest
pg_xact file is truncated.

So, if my tests are correct I don't think that storing the oldest xid is
necessary anymore since I don't think that we can lose notifications
using the patch from Daniil or I'm missing something here?

You have started a very long transaction, which holds its xid and prevents
vacuum from freezing it. But what if the backend is stuck not inside a
transaction? Maybe we can just hardcode a huge delay (not inside the
transaction) or stop process execution via breakpoint in gdb. If we will use it
instead of a long query, I think that this error may be reproducible.

--
Best regards,
Daniil Davydov

Matheus Alcantara

matheusssilv97@gmail.com

5 months ago

In reply to: Daniil Davydov (#6)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Tue Aug 19, 2025 at 12:57 AM -03, Daniil Davydov wrote:

I think that this definition is correct, but IIUC the tail can still
have notifications with xid's that were already truncated by vacuum
freeze. When the LISTEN is executed, we first loop through the
notification queue to try to advance the queue pointers and we can
eventually iterate over a notification that was added on the queue
without any listener but it has a xid that is already truncated by vacuum
freeze, so in this case it will fail to get the transaction status. On
Alex steps to reproduce the issue it first executes the NOTIFY and
then executes the LISTEN which fails after vacuum freeze.

Yeah, you are right. I looked at the code again, and found out that even
if there are no active listeners, new listener should iterate from the head
to the tail. Thus, it may encounter truncated xid. Anyway, I still think that
dropping notifications is not the best way to resolve this issue.

In the steps that Alex shared, is it expected that the "LISTEN c1" command
consumes the notification that was sent previously with NOTIFY? IIUC the
LISTEN command should be executed before of any NOTIFY, so executing the
LISTEN after a NOTIFY will not consume any previous notification added
on the channel, so how bad would be to drop this notification from the
queue in this situation?

If the "inactive" listener is the backend which is stuck somewhere, the
answer is "no" - this backend should be able to process all notifications.

I tried to reproduce the issue by using some kind of "inactive"
listener but so far I didn't manage to trigger the error.

After the vacuum freeze I still can see the same files on pg_xact/ and
if I cancel the long query the notification is received correctly, and
then if I execute vacuum freeze again on every database the oldest
pg_xact file is truncated.

So, if my tests are correct I don't think that storing the oldest xid is
necessary anymore since I don't think that we can lose notifications
using the patch from Daniil or I'm missing something here?

You have started a very long transaction, which holds its xid and prevents
vacuum from freezing it. But what if the backend is stuck not inside a
transaction? Maybe we can just hardcode a huge delay (not inside the
transaction) or stop process execution via breakpoint in gdb. If we will use it
instead of a long query, I think that this error may be reproducible.

But how could this happen in real scenarios? I mean, how the backend
could be stuck outside a transaction?

--
Matheus Alcantara

Daniil Davydov

3danissimo@gmail.com

5 months ago

In reply to: Matheus Alcantara (#7)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Hi,

On Tue, Aug 19, 2025 at 6:31 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Tue Aug 19, 2025 at 12:57 AM -03, Daniil Davydov wrote:

You have started a very long transaction, which holds its xid and prevents
vacuum from freezing it. But what if the backend is stuck not inside a
transaction? Maybe we can just hardcode a huge delay (not inside the
transaction) or stop process execution via breakpoint in gdb. If we will use it
instead of a long query, I think that this error may be reproducible.

But how could this happen in real scenarios? I mean, how the backend
could be stuck outside a transaction?

For now, I cannot come up with a situation where it may be possible.
Perhaps, such a lagging may occur during network communication,
but I couldn't reproduce it. Maybe other people know how we can achieve
this?

I think that if such a situation may be possible, the suggestion to delete
messages will no longer be relevant. Therefore, first of all, I would like to
clarify this issue.

--
Best regards,
Daniil Davydov

Masahiko Sawada

sawada.mshk@gmail.com

5 months ago

In reply to: Matheus Alcantara (#5)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Mon, Aug 18, 2025 at 5:14 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Wed Aug 13, 2025 at 4:29 PM -03, Daniil Davydov wrote:

Hi,

On Mon, Aug 11, 2025 at 8:41 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Wed Aug 6, 2025 at 7:44 AM -03, Álvaro Herrera wrote:

My questions:

1. Is it acceptable to drop notifications from the async queue if
there are no active listeners? There might still be notifications that
haven’t been read by any previous listener.

I'm somewhat wary of this idea -- could these inactive listeners become
active later and expect to be able to read their notifies?

I'm bit worry about this too.

What exactly do we mean by "active listener"? According to the source code,
the active listener (as far as I understand) is the one who listens to at least
one channel. If we have no active listeners in the database, the new listener
will set its pointer to the tail of the async queue. Thus, messages with old
xid will not be touched by anybody. I don't see any point in dropping them
in this case.

I think that this definition is correct, but IIUC the tail can still
have notifications with xid's that were already truncated by vacuum
freeze. When the LISTEN is executed, we first loop through the
notification queue to try to advance the queue pointers and we can
eventually iterate over a notification that was added on the queue
without any listener but it has a xid that is already truncated by vacuum
freeze, so in this case it will fail to get the transaction status.

When a process first executes the LISTEN command, it iterates through
the notification queue, but it seems only to advance its queue pointer
because it doesn't add the interested channels to its list yet and it
isn't interested in the notifications queued before it registered as a
listener. I'm wondering if we could optimize this by allowing the
queue pointer to fast-forward without checking transaction status. If
feasible, this might resolve the reported issue.

However, I have a more fundamental concern regarding the LISTEN/NOTIFY
implementation. Since vacuum doesn't consider the XIDs of notification
entries, there might be a critical issue with the truncation of clog
entries that LISTEN/NOTIFY still requires. As I understand it,
notification queue entries aren't ordered by XID, and it's possible
for a notification with an older XID to be positioned at the queue's
head. If vacuum freeze then truncates the corresponding clogs,
listeners attempting to retrieve this notification would fail to
obtain the transaction status. To address this, we likely need to
either implement Álvaro's suggestion[1]/messages/by-id/202508061044.ptcyt7aqsaaa@alvherre.pgsql to make vacuum aware of the
oldest XID in the notification queue, or develop a mechanism to
remove/freeze XIDs of the notification entries.

Regards,

[1]: /messages/by-id/202508061044.ptcyt7aqsaaa@alvherre.pgsql

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#10

Matheus Alcantara

matheusssilv97@gmail.com

5 months ago

In reply to: Daniil Davydov (#8)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Tue Aug 19, 2025 at 2:37 PM -03, Daniil Davydov wrote:

Hi,

On Tue, Aug 19, 2025 at 6:31 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Tue Aug 19, 2025 at 12:57 AM -03, Daniil Davydov wrote:

You have started a very long transaction, which holds its xid and prevents
vacuum from freezing it. But what if the backend is stuck not inside a
transaction? Maybe we can just hardcode a huge delay (not inside the
transaction) or stop process execution via breakpoint in gdb. If we will use it
instead of a long query, I think that this error may be reproducible.

But how could this happen in real scenarios? I mean, how the backend
could be stuck outside a transaction?

For now, I cannot come up with a situation where it may be possible.
Perhaps, such a lagging may occur during network communication,
but I couldn't reproduce it. Maybe other people know how we can achieve
this?

Reading more the code I understand that once the a NOTIFY command is
received by a backend (and the transaction is committed) it will
emedialy signal all other listener backends and if the listener backend
is in idle it will consume the notification and then send it back to the
client as a PqMsg_NotificationResponse, so if there is a network delay
to send the notification from the listener backend back to the client I
don't think that it would be possible to get this error, because the
message was already dispatched by the backend and it will eventually get
to the client and once the notification is dispatched the backend
doesn't need to track it anymore (the queue pointers of the backend are
advanced after the dispatch).

Assuming that every SQL command is wrapped into a transaction (if it's
not already inside in) I think a busy listener backend will always
prevent the vacuum from freezing clog files past from its current xid,
so any notification that is sent while the backend is busy will not have
their transaction status removed from clog files anyway.

Is all these understandings and assumptions correct or am I missing
something here?

I think that if such a situation may be possible, the suggestion to delete
messages will no longer be relevant. Therefore, first of all, I would like to
clarify this issue.

From what I've understood until now it seems to me that this can happen
only if we have a notification on the queue without any listener, so the
notification may stay on the queue from a long time and a vacuum freeze
can be executed during this time and then when we have a new listener
(even for a different channel) it will fail to advance the pointers at
listener setup(Exec_ListenPreCommit()) because it would not be able to
get the transition status of this very old notification.

(please note that I'm not trying to invalidate your concern, I'm also
have this concern but unfortunately I'm unable to reproduce it and I'm
just sharing my thoughts to see if this issue is really possible or not)

--
Matheus Alcantara

#11

Matheus Alcantara

matheusssilv97@gmail.com

5 months ago

In reply to: Masahiko Sawada (#9)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Tue Aug 19, 2025 at 2:40 PM -03, Masahiko Sawada wrote:

However, I have a more fundamental concern regarding the LISTEN/NOTIFY
implementation. Since vacuum doesn't consider the XIDs of notification
entries, there might be a critical issue with the truncation of clog
entries that LISTEN/NOTIFY still requires. As I understand it,
notification queue entries aren't ordered by XID, and it's possible
for a notification with an older XID to be positioned at the queue's
head. If vacuum freeze then truncates the corresponding clogs,
listeners attempting to retrieve this notification would fail to
obtain the transaction status. To address this, we likely need to
either implement Álvaro's suggestion[1] to make vacuum aware of the
oldest XID in the notification queue, or develop a mechanism to
remove/freeze XIDs of the notification entries.

Thanks for the comments! Please see my reply at [1]/messages/by-id/DC7KGTXW3QSG.OZA24HONT78J@gmail.com that I mention that
I don't think that is too easy to have this specific scenario of a busy
backend loose dropped notifications.

[1]: /messages/by-id/DC7KGTXW3QSG.OZA24HONT78J@gmail.com

--
Matheus Alcantara

#12

Masahiko Sawada

sawada.mshk@gmail.com

5 months ago

In reply to: Matheus Alcantara (#10)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed, Aug 20, 2025 at 2:18 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Tue Aug 19, 2025 at 2:37 PM -03, Daniil Davydov wrote:

Hi,

On Tue, Aug 19, 2025 at 6:31 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Tue Aug 19, 2025 at 12:57 AM -03, Daniil Davydov wrote:

You have started a very long transaction, which holds its xid and prevents
vacuum from freezing it. But what if the backend is stuck not inside a
transaction? Maybe we can just hardcode a huge delay (not inside the
transaction) or stop process execution via breakpoint in gdb. If we will use it
instead of a long query, I think that this error may be reproducible.

But how could this happen in real scenarios? I mean, how the backend
could be stuck outside a transaction?

For now, I cannot come up with a situation where it may be possible.
Perhaps, such a lagging may occur during network communication,
but I couldn't reproduce it. Maybe other people know how we can achieve
this?

Reading more the code I understand that once the a NOTIFY command is
received by a backend (and the transaction is committed) it will
emedialy signal all other listener backends and if the listener backend
is in idle it will consume the notification and then send it back to the
client as a PqMsg_NotificationResponse, so if there is a network delay
to send the notification from the listener backend back to the client I
don't think that it would be possible to get this error, because the
message was already dispatched by the backend and it will eventually get
to the client and once the notification is dispatched the backend
doesn't need to track it anymore (the queue pointers of the backend are
advanced after the dispatch).

Assuming that every SQL command is wrapped into a transaction (if it's
not already inside in) I think a busy listener backend will always
prevent the vacuum from freezing clog files past from its current xid,
so any notification that is sent while the backend is busy will not have
their transaction status removed from clog files anyway.

What about backend processes that don't have any xid or xmin (i.e.,
are read-only query and in idle-in-transaction)?

IIUC we process the notification entries at the beginning of the
server loop (see L4608 in postgres.c) and when reading a command (via
ProcessClientReadInterrupt()), but it seems to me that if a process is
in idle-in-transaction state it doesn't process the entries unless the
transaction is committed. I've reproduced the missing clog entry error
even if we have a notification on the queue with a valid listener,
with the following steps:

1. Initialize the database cluster.

2. Execute "ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;".

3. Start one psql session and execute:

-- Session 1
=# listen s;
LISTEN
=# begin;
BEGIN

(keep this session open to leave the process idle-in-transaction.)

4. Open another psql session and execute:

-- Session 2
=# begin;
BEGIN
=# select txid_current();
txid_current
--------------
756
(1 row)
=# notify s;
NOTIFY
=# commit;
COMMIT

The notification to the channel 's' should be available for the
session-1's transaction.

5. Consume enough XIDs to truncate clog entries.

-- Session 2
=# create extension xid_wraparound;
CREATE EXTENSION
=# select consume_xids(10_000_000);
NOTICE: consumed 10000000 / 10000000 XIDs, latest 0:10000757
consume_xids
--------------
10000757
(1 row)
=# select txid_current();
txid_current
--------------
10000758
(1 row)

6. Execute vacuum freeze on all databases:

$ vacuumdb --all --freeze
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template0"
vacuumdb: vacuuming database "template1"

7. On the first psql session:

-- Session 1
=# commit;
COMMIT
ERROR: could not access status of transaction 756
DETAIL: Could not open file "pg_xact/0000": No such file or directory.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#13

Matheus Alcantara

matheusssilv97@gmail.com

5 months ago

In reply to: Masahiko Sawada (#12)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Thu Aug 21, 2025 at 7:25 PM -03, Masahiko Sawada wrote:

What about backend processes that don't have any xid or xmin (i.e.,
are read-only query and in idle-in-transaction)?

IIUC we process the notification entries at the beginning of the
server loop (see L4608 in postgres.c) and when reading a command (via
ProcessClientReadInterrupt()), but it seems to me that if a process is
in idle-in-transaction state it doesn't process the entries unless the
transaction is committed. I've reproduced the missing clog entry error
even if we have a notification on the queue with a valid listener,
with the following steps:

Ok, now we can reproduce this, thank you! So I think that we need a way
to teach the VACUUM FREEZE about old xid's on async queue.

I'll work on this considering the initial Daniil comments at [1]/messages/by-id/CAJDiXgg1ArRB1-6wLtXfVVnQ38P9Y+CDfEc9M2TXiOf_4kfBMg@mail.gmail.com

[1]: /messages/by-id/CAJDiXgg1ArRB1-6wLtXfVVnQ38P9Y+CDfEc9M2TXiOf_4kfBMg@mail.gmail.com

--
Matheus Alcantara

#14

Matheus Alcantara

matheusssilv97@gmail.com

5 months ago

In reply to: Matheus Alcantara (#13)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Thu Aug 21, 2025 at 8:14 PM -03, Matheus Alcantara wrote:

I'll work on this considering the initial Daniil comments at [1]

[1] /messages/by-id/CAJDiXgg1ArRB1-6wLtXfVVnQ38P9Y+CDfEc9M2TXiOf_4kfBMg@mail.gmail.com

I've been working on this on the last few days, please see the attached
patch version.

In this new version I tried to follow the suggestion from Daniil of
scanning all pages from tail to head of the async queue.

I tried to use other data structures like a list or a hashmap to store
the xids but it shows to me more complicated due to duplicated
information being stored and also considering that this data structure
should be stored on shared memory, so I just tried to get the oldest xid
with the information that we already have on the async queue.

To find the oldest xid on the async queue I basically read all
AsyncQueueEntry's on async queue pages on SLRU cache from tail to head,
skipping only uncommitted transactions and notifications for different
databases.

The code that reads the pages and the entries within the page is a bit
coupled with the send notification logic. The important functions that
execute this is asyncQueueReadAllNotifications() which read the pages
from SLRU cache and then call the asyncQueueProcessPageEntries() which
read the entries within the page and send the notification to the client
if needed. Since I need a very similar code to read the notification
entries on the queue to check the min xid I think that it would be good
to have an API that just works like an iterator and returns the next
queue entry to work on so that it can be used for both cases.

In this patch I created an AsyncQueueIterator that iterates over pages
and AsyncQueueEntry's within the pages. The code is based on the
asyncQueueReadAllNotifications() and asyncQueueProcessPageEntries()
functions. For now I'm just implementing the iterator logic and use at
AsyncQueueMinXid() to get the min xid on the queue to make the review
more easier but I think that we can have another patch that refactor
asyncQueueReadAllNotifications() and asyncQueueProcessPageEntries()
functions to use this iterator approach and avoid duplicated code.

I'm also adding a TAP test that reproduces the issue, but I'm not sure if
it's declared on the best path.

I think that it would be good to perform some benchmark tests to see if
we have performance issues when reading very long async queues.

Thoughts?

--
Matheus Alcantara

Attachments:

v1-0001-Consider-LISTEN-NOTIFY-min-xid-during-VACUUM-FREE.patchapplication/octet-stream; name=v1-0001-Consider-LISTEN-NOTIFY-min-xid-during-VACUUM-FREE.patchDownload

From 904a7ebe13c383c13aa2a92c0a0e0a945f85dd11 Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Tue, 26 Aug 2025 10:09:01 -0300
Subject: [PATCH v1] Consider LISTEN/NOTIFY min xid during VACUUM FREEZE

Previously a listener backend that is delaying to consume notifications
(due to idle in transaction for example), if the VACUUM FREEZE is
executed during this period and drop clog files that contains
transaction information about the notification in the queue, the
listener backend can loose this notification when committing the
transaction:
ERROR:  could not access status of transaction 756
DETAIL:  Could not open file "pg_xact/0000": No such file or directory.

This commit fix this issue by iterating over the queue notifications for
each backend listener and check which is the oldest transaction xid on
the queue and then consider this value during VACUUM FREEZE execution.
---
 src/backend/commands/async.c                  | 231 ++++++++++++++++++
 src/backend/commands/vacuum.c                 |  12 +
 src/include/commands/async.h                  |   3 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_listen_notify/Makefile  |  17 ++
 .../modules/test_listen_notify/meson.build    |  13 +
 .../test_listen_notify/t/001_xid_freeze.pl    |  73 ++++++
 src/tools/pgindent/typedefs.list              |   1 +
 9 files changed, 352 insertions(+)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..91e10a5b455 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -401,6 +401,38 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*
+ * A state-based iterator for consuming notifications (AsyncQueueEntry) from the async queue.
+ *
+ * Note that the iterator will iterate over the async queue based on the
+ * "current" and "head" positions, it will start at "current" and it will read
+ * until it reach the "head" position. For example, to read notifications for a
+ * specific backend it should just use the QUEUE_BACKEND_POS as a "current"
+ * position starting point and head as QUEUE_HEAD. To read all async queue
+ * notifications just use QUEUE_HEAD as "current".
+ */
+typedef struct AsyncQueueIterator
+{
+	/* Current queue position of iteration. */
+	QueuePosition current;
+
+	/* how far it will read */
+	QueuePosition head;
+
+	/* Snapshot used to decide which xacts are still in progress. */
+	Snapshot	snapshot;
+
+	/* buffer to read pages from SLRU */
+	char		page_buffer[QUEUE_PAGESIZE];
+
+	/* Should read a page from SLRU? */
+	bool		read_page;
+
+	/* No more entries to read */
+	bool		done;
+} AsyncQueueIterator;
+
+
 static NotificationList *pendingNotifies = NULL;
 
 /*
@@ -458,6 +490,9 @@ static uint32 notification_hash(const void *key, Size keysize);
 static int	notification_match(const void *key1, const void *key2, Size keysize);
 static void ClearPendingActionsAndNotifies(void);
 
+static void AsyncQueueIterInit(AsyncQueueIterator *iter, QueuePosition current, QueuePosition head);
+static AsyncQueueEntry *AsyncQueueIterNextNotification(AsyncQueueIterator *iter);
+
 /*
  * Compute the difference between two queue page numbers.
  * Previously this function accounted for a wraparound.
@@ -2395,3 +2430,199 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
 {
 	return check_slru_buffers("notify_buffers", newval);
 }
+
+
+/*
+ * Initializes an AsyncQueueIterator.
+ *
+ * It sets up the state and gets the initial snapshot.
+ */
+static void
+AsyncQueueIterInit(AsyncQueueIterator *iter, QueuePosition current, QueuePosition head)
+{
+	/* Initialize internal state */
+	iter->read_page = true;
+	iter->done = false;
+	iter->current = current;
+	iter->head = head;
+
+	/* Get the snapshot we'll use for visibility checks */
+	iter->snapshot = RegisterSnapshot(GetLatestSnapshot());
+}
+
+/*
+ * Returns the next AsyncQueueEntry from the async queue.
+ *
+ * Returns a pointer to the entry on success, otherwise NULL if there are no
+ * more notifications to process, or if an uncommitted notification is found.
+ *
+ * It handles fetching pages from the shared SLRU as needed. The returned
+ * pointer is to a local buffer, so it's only valid until the next call to this
+ * function.
+ */
+static AsyncQueueEntry *
+AsyncQueueIterNextNotification(AsyncQueueIterator *iter)
+{
+	AsyncQueueEntry *qe;
+	QueuePosition thisentry;
+
+	/* No more entries to process. */
+	if (iter->done)
+		return NULL;
+
+	if (QUEUE_POS_EQUAL(iter->current, iter->head))
+	{
+		/* Nothing to do, the backend don't have any notification to read. */
+		iter->done = true;
+		return NULL;
+	}
+
+
+	/*
+	 * We need to process a page at a time. If we haven't read the current
+	 * page yet, or have reached the end of the previous one, read the next
+	 * page.
+	 */
+	if (iter->read_page)
+	{
+		int64		curpage = QUEUE_POS_PAGE(iter->current);
+		int			curoffset = QUEUE_POS_OFFSET(iter->current);
+		int			slotno;
+		int			copysize;
+
+
+		/*
+		 * We copy the data from SLRU into a local buffer, so as to avoid
+		 * holding the SLRU lock while we are examining the entries and
+		 * possibly transmitting them to our frontend.  Copy only the part of
+		 * the page we will actually inspect.
+		 */
+		slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage, InvalidTransactionId);
+
+		/* Determine how much of the page we need to copy */
+		if (curpage == QUEUE_POS_PAGE(iter->head))
+		{
+			/* We only want to read as far as head */
+			copysize = QUEUE_POS_OFFSET(iter->head) - curoffset;
+			if (copysize < 0)
+				copysize = 0;	/* just for safety */
+		}
+		else
+		{
+			/* fetch all the rest of the page */
+			copysize = QUEUE_PAGESIZE - curoffset;
+		}
+
+		memcpy(iter->page_buffer,
+			   NotifyCtl->shared->page_buffer[slotno] + curoffset,
+			   copysize);
+
+		/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+		LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+
+		/*
+		 * Page was read, set to false to next calls will consume the entries
+		 * on the read page.
+		 */
+		iter->read_page = false;
+	}
+
+	thisentry = iter->current;
+
+	/* Read the AsyncQueueEntry within our local buffer. */
+	qe = (AsyncQueueEntry *) (iter->page_buffer + QUEUE_POS_OFFSET(thisentry));
+
+	/*
+	 * Advance iter->current over this message. If reached the end of the page
+	 * set read_page to true to try to read the next one on next call.
+	 *
+	 * TODO(matheus): Incorporate comments from asyncQueueProcessPageEntries.
+	 */
+	if (asyncQueueAdvance(&iter->current, qe->length))
+		iter->read_page = true;
+
+	/* Check for messages destined for other databases and ignore them */
+	if (qe->dboid != MyDatabaseId)
+		return AsyncQueueIterNextNotification(iter);
+
+	/* Check for uncommitted transaction and ignore if found */
+	if (XidInMVCCSnapshot(qe->xid, iter->snapshot))
+	{
+		/*
+		 * The source transaction is still in progress, so we can't process
+		 * this message yet.  Break out of the loop, but first back up
+		 * *current so we will reprocess the message next time.  (Note: it is
+		 * unlikely but not impossible for TransactionIdDidCommit to fail, so
+		 * we can't really avoid this advance-then-back-up behavior when
+		 * dealing with an uncommitted message.)
+		 *
+		 * Note that we must test XidInMVCCSnapshot before we test
+		 * TransactionIdDidCommit, else we might return a message from a
+		 * transaction that is not yet visible to snapshots; compare the
+		 * comments at the head of heapam_visibility.c.
+		 *
+		 * Also, while our own xact won't be listed in the snapshot, we need
+		 * not check for TransactionIdIsCurrentTransactionId because our
+		 * transaction cannot (yet) have queued any messages.
+		 */
+
+		iter->current = thisentry;
+		return NULL;
+	}
+	else if (TransactionIdDidCommit(qe->xid))
+	{
+		return qe;
+	}
+	else
+	{
+		/*
+		 * The source transaction aborted or crashed, so we just ignore its
+		 * notifications and go to the next.
+		 */
+		return AsyncQueueIterNextNotification(iter);
+	}
+}
+
+/*
+ * Cleans up the iterator by unregistering the snapshot.
+ */
+static void
+AsyncQueueIterDestroy(AsyncQueueIterator *iter)
+{
+	UnregisterSnapshot(iter->snapshot);
+}
+
+TransactionId
+AsyncQueueMinXid(void)
+{
+	QueuePosition current;
+	QueuePosition head;
+	AsyncQueueEntry *qe;
+	AsyncQueueIterator iter;
+	TransactionId minXid = MaxTransactionId;
+
+	/*
+	 * First advance the global queue tail so we don't need to worry about
+	 * notifications already processed by backends.
+	 */
+	asyncQueueAdvanceTail();
+
+	/* Fetch current state */
+	LWLockAcquire(NotifyQueueLock, LW_SHARED);
+	current = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+	LWLockRelease(NotifyQueueLock);
+
+	AsyncQueueIterInit(&iter, current, head);
+
+	while ((qe = AsyncQueueIterNextNotification(&iter)) != NULL)
+	{
+		if (qe->xid < minXid)
+			minXid = qe->xid;
+	}
+
+	AsyncQueueIterDestroy(&iter);
+
+
+	return minXid;
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 733ef40ae7c..31f923c88fa 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1739,6 +1740,17 @@ vac_update_datfrozenxid(void)
 	if (bogus)
 		return;
 
+	/*
+	 * We need to check transaction status of notifications before of notify
+	 * the client, if there is lag to consume the notifications we need to
+	 * consider the older xid of notification on the queue so that the
+	 * transaction status can be accessed.
+	 *
+	 * XXX(matheus): Maybe add a GUC to prevent lazy listeners to hold the
+	 * VACUUM FREEZE newFronzenXid for too long.
+	 */
+	newFrozenXid = Min(newFrozenXid, AsyncQueueMinXid());
+
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
 
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..3592103a0da 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -46,4 +46,7 @@ extern void HandleNotifyInterrupt(void);
 /* process interrupts */
 extern void ProcessNotifyInterrupt(bool flush);
 
+
+extern TransactionId AsyncQueueMinXid(void);
+
 #endif							/* ASYNC_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 903a8ac151a..4c0160df341 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_misc \
 		  test_oat_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 93be0f57289..144379b619b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -27,6 +27,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_misc')
 subdir('test_oat_hooks')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..da1bf5bb1b7
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,17 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..8119e6c761f
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,13 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..79dcd73ed65
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,73 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_partial_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_partial_freeze > $datafronzenxid, 'datfrozenxid is partially advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+# Execute vacuum freeze on all databases again and ensure that datfrozenxid is fully advanced.
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid_partial_freeze, 'datfrozenxid is advanced after notification is consumed');
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..14684584cff 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -158,6 +158,7 @@ ArrayToken
 ArrayType
 AsyncQueueControl
 AsyncQueueEntry
+AsyncQueueIterator
 AsyncRequest
 AttInMetadata
 AttStatsSlot
-- 
2.39.5 (Apple Git-154)

#15

Jacques Combrink

jacques@quantsolutions.co.za

4 months ago

In reply to: Matheus Alcantara (#14)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

I came across this discussion after we had this problem at least twice
now at work.
I read through the discussion to see if this is the problem we are
experiencing.
At first, I thought this was not exactly our problem, because of the way
it was showed to be reproduced.

With a long running transaction that is listening, and then doing all
the other steps like like running through transactions until a second
pg_xact file is created, and then vacuuming etc etc...

But, I felt like we don't do this, so it cannot be the way we run into
the problem. Specifically, having a listener as a long running transaction.

So I set out to get it into the broken state, and here is how I can
reproduce it without long running queries:

1. Create another database

postgres=# CREATE DATABASE test;
CREATE DATABASE

2. Create a listener on the postgres database.

postgres=# LISTEN xx;
LISTEN

3. Create notifies for the test database.

Here I struggled to get a stuck notification in the queue (SELECT
pg_notification_queue_usage();).

It can happen with only a single notify from a psql connection, but I
get a higher hit rate with the following:

Create a notify.sql with the only contents being:

```
NOTIFY s, 'yappers';
```
Then run this against the test database with pgbench.

pgbench -n -c 80 -j 20 -t 1000 -f notify.sql test

4. Confirm that there is now something stuck in the queue:

postgres=# SELECT pg_notification_queue_usage();
pg_notification_queue_usage
-----------------------------
9.5367431640625e-07
(1 row)

If this still shows 0, then run step 3 again.

5. Consume xid's. I create a file consume.sql with the only contents being:
```
SELECT txid_current();
```

Then:

pgbench -n -c 80 -j 30 -t 100000 -f consume.sql postgres

6. Verify that a new file is created in the pg_xact folder.

If not, just run the previous step again.

7. Run vacuum freeze. (Remember to allow connections to template0
beforehand with `ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;`)

$ vacuumdb --all --freeze
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template0"
vacuumdb: vacuuming database "template1"
vacuumdb: vacuuming database "test"

8. Connect to the test database and try execute a listener.
test=# LISTEN anything;
ERROR: could not access status of transaction 81086
DETAIL: Could not open file "pg_xact/0000": No such file or directory.
test=#

Extra weirdness:

In step 2 create a connection to the test database and LISTEN there on
any channel, even the one you are notifying to:
And you don't have to run anything on that connection, there is no
backend_xmin on that connection, you don't start a transaction nothing,
you just run `LISTEN something;`
Then after step 7, as long as you don't close this connection, you will
not get an error when you try to set up a listener on the test database,
even on a new connection to the test database.

What I'm saying is after step 7 you can have two connections. One to
postgres database with active listener, and one to test database with
active listener.
As long as the postgres one is open the queue is stuck, ie `SELECT
pg_notification_queue_usage();` returns non-zero.
As long as the test database connection is open, new listeners do not
experience the error.

If you close the test database connection, but leave the postgres
connection, from now on any listener you try to create on test database
will error.

If you close the postgres connection, the queue clears immediately and
new LISTEN's on the test database will work.
This means it is possible to get rid of the problem without restarting
the postmaster if you can close connections until you close the one that
"caused" the problem.
Don't know if this is actually useful, but to me it seemed like everyone
believed it to be impossible to recover without restarting, so it's
something at least.

TLDR:
active listener on one database causes notify on another database to get
stuck.
At no point could I get a stuck notify if I don't have a listener on at
least one other database than the one I am notifying on. See the Extra
weirdness section.
At no point do you need to have any other queries running, there is
never an idle in transaction query needed for bad timing with the vacuum.

I hope I explained everything well enough so that one of you smart
people can find and fix the problem.

#16

Yura Sokolov

y.sokolov@postgrespro.ru

4 months ago

In reply to: Matheus Alcantara (#14)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

29.08.2025 01:31, Matheus Alcantara пишет:

On Thu Aug 21, 2025 at 8:14 PM -03, Matheus Alcantara wrote:

I'll work on this considering the initial Daniil comments at [1]

[1] /messages/by-id/CAJDiXgg1ArRB1-6wLtXfVVnQ38P9Y+CDfEc9M2TXiOf_4kfBMg@mail.gmail.com

I've been working on this on the last few days, please see the attached
patch version.

In this new version I tried to follow the suggestion from Daniil of
scanning all pages from tail to head of the async queue.

Since we still need to scan those pages, couldn't we mark finished
transactions as committed/aborted?
This way we may to not hold datfrozenxid for a long time and will allow
both safe clog truncation and safe async queue notification.
More over, most notifications could be marked as completed in
AtCommit_Notify already. So there is a need to recheck a few notifications
that were written but not marked in AtCommit_Notify.

Since AsyncQueueEntry field is used only to check visibility, looks like it
is safe to set it to FrozenTransactionId.

--
regards
Yura Sokolov aka funny-falcon

#17

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

4 months ago

In reply to: Yura Sokolov (#16)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Hi,

On Tue, Sep 2, 2025 at 10:38 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

29.08.2025 01:31, Matheus Alcantara пишет:

On Thu Aug 21, 2025 at 8:14 PM -03, Matheus Alcantara wrote:

I'll work on this considering the initial Daniil comments at [1]

[1] /messages/by-id/CAJDiXgg1ArRB1-6wLtXfVVnQ38P9Y+CDfEc9M2TXiOf_4kfBMg@mail.gmail.com

I've been working on this on the last few days, please see the attached
patch version.

In this new version I tried to follow the suggestion from Daniil of
scanning all pages from tail to head of the async queue.

Since we still need to scan those pages, couldn't we mark finished
transactions as committed/aborted?
This way we may to not hold datfrozenxid for a long time and will allow
both safe clog truncation and safe async queue notification.
More over, most notifications could be marked as completed in
AtCommit_Notify already. So there is a need to recheck a few notifications
that were written but not marked in AtCommit_Notify.

Since AsyncQueueEntry field is used only to check visibility, looks like it
is safe to set it to FrozenTransactionId.

I like the idea to make queue records more self-sufficient, so we
don't need to check a transaction's status to determine if a
notification should be sent. What if we somehow mark every record of
an aborted transaction, so listeners can skip it without checking
transaction status? Also if the record is not marked as "aborted" we
also can send it without checking transaction status. IIUC
AtAbort_Notify is called before any listener can read records of the
aborted transaction (is it correct?), so we can do such "marking" in
AtAbort_Notify. And the listen/notify subsystem already has a similar
"marking" mechanism. At the end of every queue page we have a "dummy"
record which is skipped by every listener. Listeners skip it because
record's 'dboid' equals InvalidOid (listeners ignore any record where
the 'dboid' is different from their own). In the same way aborted
transactions can set 'dboid' to InvalidOid for their records in
AtAbort_Notify.

Some benefits of this approach:
1) Only aborted transactions would need to perform extra work.
2) Listeners don't need to check notify transaction status, but ofc
they still need to wait if the notify transaction is 'in progress'.
3) Vacuum don't need to be listen/notify aware.
4) We don't need to do an extra scan of the whole queue.

What do you think?

Best regards,
Arseniy Mukhin

#18

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Jacques Combrink (#15)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Mon Sep 1, 2025 at 11:06 AM -03, Jacques Combrink wrote:

TLDR:
active listener on one database causes notify on another database to get
stuck.
At no point could I get a stuck notify if I don't have a listener on at
least one other database than the one I am notifying on. See the Extra
weirdness section.
At no point do you need to have any other queries running, there is
never an idle in transaction query needed for bad timing with the vacuum.

I hope I explained everything well enough so that one of you smart
people can find and fix the problem.

The long running transaction steps is just an example that we can lose
notifications using the first patch from Daniil that Alex has shared on
[1]: /messages/by-id/CAK98qZ3wZLE-RZJN_Y+TFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com
issue but it's similar to the steps that Alex also shared on [1]/messages/by-id/CAK98qZ3wZLE-RZJN_Y+TFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com.

All these different ways to trigger the error face the same underlying
problem: If a notification is keep for too long on the queue that vacuum
freeze can run and truncate clog files that contains transaction
information of this notification the error will happen.

The patch that I've attached on [2]https://www.postgresql.org/message-id CAFY6G8cJm73_MM9SuynZUqtqcaTuepUDgDuvS661oLW7U0dgsg%40mail.gmail.com aims to fix the issue following the
steps that you've shared, but during the tests I've found a stack
overflow bug on AsyncQueueIterNextNotification() due to the number of
notifications. I'm attaching a new version that fix this bug and I tried
to reproduce your steps with this new version and the issue seems to be
fixed.

Note that notifications that were added without any previous LISTEN will
block the xid advance during VACUUM FREEZE until we have a listener on
the database that owns these notifications. The XXX comment on vacuum.c
is about this problem.

[1]: /messages/by-id/CAK98qZ3wZLE-RZJN_Y+TFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com
[2]: https://www.postgresql.org/message-id CAFY6G8cJm73_MM9SuynZUqtqcaTuepUDgDuvS661oLW7U0dgsg%40mail.gmail.com

--
Matheus Alcantara

Attachments:

v2-0001-Consider-LISTEN-NOTIFY-min-xid-during-VACUUM-FREE.patchtext/plain; charset=utf-8; name=v2-0001-Consider-LISTEN-NOTIFY-min-xid-during-VACUUM-FREE.patchDownload

From 839b2934fe3d73b78965198d86dba3b97e3696e2 Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Tue, 26 Aug 2025 10:09:01 -0300
Subject: [PATCH v2] Consider LISTEN/NOTIFY min xid during VACUUM FREEZE

Previously a listener backend that is delaying to consume notifications
(due to idle in transaction for example), if the VACUUM FREEZE is
executed during this period and drop clog files that contains
transaction information about the notification in the queue, the
listener backend can loose this notification when committing the
transaction:
ERROR:  could not access status of transaction 756
DETAIL:  Could not open file "pg_xact/0000": No such file or directory.

This commit fix this issue by iterating over the queue notifications for
each backend listener and check which is the oldest transaction xid on
the queue and then consider this value during VACUUM FREEZE execution.
---
 src/backend/commands/async.c                  | 261 ++++++++++++++++++
 src/backend/commands/vacuum.c                 |  13 +
 src/include/commands/async.h                  |   3 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_listen_notify/Makefile  |  17 ++
 .../modules/test_listen_notify/meson.build    |  13 +
 .../test_listen_notify/t/001_xid_freeze.pl    |  73 +++++
 src/tools/pgindent/typedefs.list              |   1 +
 9 files changed, 383 insertions(+)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..e986a27eb4e 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -401,6 +401,45 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*
+ * A state-based iterator for consuming notifications (AsyncQueueEntry) from the async queue.
+ *
+ * Note that the iterator will iterate over the async queue based on the
+ * "current" and "head" positions, it will start at "current" and it will read
+ * until it reach the "head" position. For example, to read notifications for a
+ * specific backend it should just use the QUEUE_BACKEND_POS as a "current"
+ * position starting point and head as QUEUE_HEAD. To read all async queue
+ * notifications just use QUEUE_HEAD as "current".
+ */
+typedef struct AsyncQueueIterator
+{
+	/* Current queue position of iteration. */
+	QueuePosition current;
+
+	/* how far it will read */
+	QueuePosition head;
+
+	/* Current queue entry being read. */
+	AsyncQueueEntry *current_entry;
+
+	/* Snapshot used to decide which xacts are still in progress. */
+	Snapshot	snapshot;
+
+	/* buffer to read pages from SLRU */
+	union
+	{
+		char		buf[QUEUE_PAGESIZE];
+		AsyncQueueEntry align;
+	}			page_buffer;
+
+	/* Should read a page from SLRU? */
+	bool		read_next_page;
+
+	/* No more entries to read */
+	bool		done;
+} AsyncQueueIterator;
+
+
 static NotificationList *pendingNotifies = NULL;
 
 /*
@@ -458,6 +497,9 @@ static uint32 notification_hash(const void *key, Size keysize);
 static int	notification_match(const void *key1, const void *key2, Size keysize);
 static void ClearPendingActionsAndNotifies(void);
 
+static void AsyncQueueIterInit(AsyncQueueIterator *iter, QueuePosition current, QueuePosition head);
+static AsyncQueueEntry *AsyncQueueIterNextNotification(AsyncQueueIterator *iter);
+
 /*
  * Compute the difference between two queue page numbers.
  * Previously this function accounted for a wraparound.
@@ -2395,3 +2437,222 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
 {
 	return check_slru_buffers("notify_buffers", newval);
 }
+
+
+/*
+ * Initializes an AsyncQueueIterator.
+ *
+ * It sets up the state and gets the initial snapshot.
+ */
+static void
+AsyncQueueIterInit(AsyncQueueIterator *iter, QueuePosition current, QueuePosition head)
+{
+	/* Initialize internal state */
+	iter->read_next_page = true;
+	iter->done = false;
+	iter->current = current;
+	iter->head = head;
+
+	/* Get the snapshot we'll use for visibility checks */
+	iter->snapshot = RegisterSnapshot(GetLatestSnapshot());
+}
+
+/*
+ * Returns the next AsyncQueueEntry from the async queue.
+ *
+ * Returns a pointer to the entry on success, otherwise NULL if there are no
+ * more notifications to process, or if an uncommitted notification is found.
+ *
+ * It handles fetching pages from the shared SLRU as needed. The returned
+ * pointer is to a local buffer, so it's only valid until the next call to this
+ * function.
+ */
+static AsyncQueueEntry *
+AsyncQueueIterNextNotification(AsyncQueueIterator *iter)
+{
+	AsyncQueueEntry *qe;
+	QueuePosition thisentry;
+
+	/*
+	 * Loop until a valid notification is found or we reach the end of the
+	 * queue or an uncommitted transaction.
+	 */
+	do
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* No more entries to process. */
+		if (iter->done)
+			return NULL;
+
+		if (QUEUE_POS_EQUAL(iter->current, iter->head))
+		{
+			/* Nothing to do, the backend don't have any notification to read. */
+			iter->done = true;
+			return NULL;
+		}
+
+		/*
+		 * We need to process a page at a time. If we haven't read the current
+		 * page yet, or have reached the end of the previous one, read the
+		 * next page.
+		 */
+		if (iter->read_next_page)
+		{
+			int64		curpage = QUEUE_POS_PAGE(iter->current);
+			int			curoffset = QUEUE_POS_OFFSET(iter->current);
+			int			slotno;
+			int			copysize;
+
+
+			/*
+			 * We copy the data from SLRU into a local buffer, so as to avoid
+			 * holding the SLRU lock while we are examining the entries and
+			 * possibly transmitting them to our frontend.  Copy only the part
+			 * of the page we will actually inspect.
+			 */
+			slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage, InvalidTransactionId);
+
+			/* Determine how much of the page we need to copy */
+			if (curpage == QUEUE_POS_PAGE(iter->head))
+			{
+				/* We only want to read as far as head */
+				copysize = QUEUE_POS_OFFSET(iter->head) - curoffset;
+				if (copysize < 0)
+					copysize = 0;	/* just for safety */
+			}
+			else
+			{
+				/* fetch all the rest of the page */
+				copysize = QUEUE_PAGESIZE - curoffset;
+			}
+
+			memcpy(iter->page_buffer.buf,
+				   NotifyCtl->shared->page_buffer[slotno] + curoffset,
+				   copysize);
+
+			/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+			LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+
+			/*
+			 * Page was read, set to false to next calls will consume the
+			 * entries on the read page.
+			 */
+			iter->read_next_page = false;
+			iter->current_entry = (AsyncQueueEntry *) iter->page_buffer.buf;
+		}
+		else
+		{
+			/*
+			 * If we are not reading a new page, advance our local entry
+			 * pointer to the next message.
+			 */
+			iter->current_entry = (AsyncQueueEntry *) ((char *) iter->current_entry + iter->current_entry->length);
+		}
+
+		thisentry = iter->current;
+
+		/* Read the AsyncQueueEntry within our local buffer. */
+		qe = iter->current_entry;
+
+		Assert(qe->length > 0);
+
+		/*
+		 * Advance iter->current over this message. If reached the end of the
+		 * page set read_page to true to try to read the next one on next
+		 * call.
+		 *
+		 * TODO(matheus): Incorporate comments from
+		 * asyncQueueProcessPageEntries.
+		 */
+		iter->read_next_page = asyncQueueAdvance(&iter->current, qe->length);
+
+		/* Ignore messages destined for other databases */
+		if (qe->dboid == MyDatabaseId)
+		{
+			/* Check for uncommitted transaction and ignore if found */
+			if (XidInMVCCSnapshot(qe->xid, iter->snapshot))
+			{
+				/*
+				 * The source transaction is still in progress, so we can't
+				 * process this message yet.  Break out of the loop, but first
+				 * back up *current so we will reprocess the message next
+				 * time. (Note: it is unlikely but not impossible for
+				 * TransactionIdDidCommit to fail, so we can't really avoid
+				 * this advance-then-back-up behavior when dealing with an
+				 * uncommitted message.)
+				 *
+				 * Note that we must test XidInMVCCSnapshot before we test
+				 * TransactionIdDidCommit, else we might return a message from
+				 * a transaction that is not yet visible to snapshots; compare
+				 * the comments at the head of heapam_visibility.c.
+				 *
+				 * Also, while our own xact won't be listed in the snapshot,
+				 * we need not check for TransactionIdIsCurrentTransactionId
+				 * because our transaction cannot (yet) have queued any
+				 * messages.
+				 */
+
+				iter->current = thisentry;
+				return NULL;
+			}
+			else if (TransactionIdDidCommit(qe->xid))
+			{
+				/* Found a valid notification */
+				return qe;
+			}
+			else
+			{
+				/*
+				 * The source transaction aborted or crashed, so we just
+				 * ignore its notifications and go to the next.
+				 */
+				continue;
+			}
+		}
+	} while (true);
+}
+
+/*
+ * Cleans up the iterator by unregistering the snapshot.
+ */
+static void
+AsyncQueueIterDestroy(AsyncQueueIterator *iter)
+{
+	UnregisterSnapshot(iter->snapshot);
+}
+
+TransactionId
+AsyncQueueMinXid(void)
+{
+	QueuePosition current;
+	QueuePosition head;
+	AsyncQueueEntry *qe;
+	AsyncQueueIterator iter;
+	TransactionId minXid = MaxTransactionId;
+
+	/*
+	 * First advance the global queue tail so we don't need to worry about
+	 * notifications already processed by backends.
+	 */
+	asyncQueueAdvanceTail();
+
+	/* Fetch current state */
+	LWLockAcquire(NotifyQueueLock, LW_SHARED);
+	current = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+	LWLockRelease(NotifyQueueLock);
+
+	AsyncQueueIterInit(&iter, current, head);
+
+	while ((qe = AsyncQueueIterNextNotification(&iter)) != NULL)
+	{
+		if (qe->xid < minXid)
+			minXid = qe->xid;
+	}
+
+	AsyncQueueIterDestroy(&iter);
+
+
+	return minXid;
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 733ef40ae7c..d35d6fc8e8a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1739,6 +1740,18 @@ vac_update_datfrozenxid(void)
 	if (bogus)
 		return;
 
+	/*
+	 * We need to check transaction status of notifications before of notify
+	 * the client, if there is lag to consume the notifications we need to
+	 * consider the older xid of notification on the queue so that the
+	 * transaction status can be accessed.
+	 *
+	 * XXX(matheus): Maybe add a GUC to prevent lazy listeners or
+	 * notifications that were added without listeners to block the VACUUM
+	 * FREEZE newFronzenXid advance.
+	 */
+	newFrozenXid = Min(newFrozenXid, AsyncQueueMinXid());
+
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
 
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..3592103a0da 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -46,4 +46,7 @@ extern void HandleNotifyInterrupt(void);
 /* process interrupts */
 extern void ProcessNotifyInterrupt(bool flush);
 
+
+extern TransactionId AsyncQueueMinXid(void);
+
 #endif							/* ASYNC_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 903a8ac151a..4c0160df341 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_misc \
 		  test_oat_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 93be0f57289..144379b619b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -27,6 +27,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_misc')
 subdir('test_oat_hooks')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..da1bf5bb1b7
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,17 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..8119e6c761f
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,13 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..79dcd73ed65
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,73 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_partial_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_partial_freeze > $datafronzenxid, 'datfrozenxid is partially advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+# Execute vacuum freeze on all databases again and ensure that datfrozenxid is fully advanced.
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid_partial_freeze, 'datfrozenxid is advanced after notification is consumed');
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..14684584cff 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -158,6 +158,7 @@ ArrayToken
 ArrayType
 AsyncQueueControl
 AsyncQueueEntry
+AsyncQueueIterator
 AsyncRequest
 AttInMetadata
 AttStatsSlot
-- 
2.39.5 (Apple Git-154)

#19

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Yura Sokolov (#16)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Thanks for the comments!

On Tue Sep 2, 2025 at 4:37 AM -03, Yura Sokolov wrote:

29.08.2025 01:31, Matheus Alcantara пишет:

On Thu Aug 21, 2025 at 8:14 PM -03, Matheus Alcantara wrote:

I'll work on this considering the initial Daniil comments at [1]

[1] /messages/by-id/CAJDiXgg1ArRB1-6wLtXfVVnQ38P9Y+CDfEc9M2TXiOf_4kfBMg@mail.gmail.com

I've been working on this on the last few days, please see the attached
patch version.

In this new version I tried to follow the suggestion from Daniil of
scanning all pages from tail to head of the async queue.

Since we still need to scan those pages, couldn't we mark finished
transactions as committed/aborted?
This way we may to not hold datfrozenxid for a long time and will allow
both safe clog truncation and safe async queue notification.
More over, most notifications could be marked as completed in
AtCommit_Notify already. So there is a need to recheck a few notifications
that were written but not marked in AtCommit_Notify.

Since AsyncQueueEntry field is used only to check visibility, looks like it
is safe to set it to FrozenTransactionId.

Maybe we could have a boolean "committed" field on AsyncQueueEntry and
check this field before sending the notification instead of call
TransactionIdDidCommit()? Is something similar what you are proposing?

We create the AsyncQueueEntry's and add into the SLRU at
PreCommit_Notify(), so to mark these entries as committed on
AtCommit_Notify() we would need a way to find these entries on the SLRU
from the pendingNotifies->events that contains the notifications added
on the current transaction being committed.

This idea looks promising to me TBH, I'm just not sure if it's
completely safe to mark an AsyncQueueEntry on the queue as committed
without checking on TransactionIdDidCommit(). I would like to read more
opinions about this before working on the next version based on this
idea.

--
Matheus Alcantara

#20

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Arseniy Mukhin (#17)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed Sep 3, 2025 at 4:28 PM -03, Arseniy Mukhin wrote:

Since we still need to scan those pages, couldn't we mark finished
transactions as committed/aborted?
This way we may to not hold datfrozenxid for a long time and will allow
both safe clog truncation and safe async queue notification.
More over, most notifications could be marked as completed in
AtCommit_Notify already. So there is a need to recheck a few notifications
that were written but not marked in AtCommit_Notify.

Since AsyncQueueEntry field is used only to check visibility, looks like it
is safe to set it to FrozenTransactionId.

I like the idea to make queue records more self-sufficient, so we
don't need to check a transaction's status to determine if a
notification should be sent. What if we somehow mark every record of
an aborted transaction, so listeners can skip it without checking
transaction status? Also if the record is not marked as "aborted" we
also can send it without checking transaction status. IIUC
AtAbort_Notify is called before any listener can read records of the
aborted transaction (is it correct?), so we can do such "marking" in
AtAbort_Notify. And the listen/notify subsystem already has a similar
"marking" mechanism. At the end of every queue page we have a "dummy"
record which is skipped by every listener. Listeners skip it because
record's 'dboid' equals InvalidOid (listeners ignore any record where
the 'dboid' is different from their own). In the same way aborted
transactions can set 'dboid' to InvalidOid for their records in
AtAbort_Notify.

Some benefits of this approach:
1) Only aborted transactions would need to perform extra work.
2) Listeners don't need to check notify transaction status, but ofc
they still need to wait if the notify transaction is 'in progress'.
3) Vacuum don't need to be listen/notify aware.
4) We don't need to do an extra scan of the whole queue.

What do you think?

IIUC we don't store notifications of aborted transactions on the queue.
On PreCommit_Notify we add the notifications on the queue and on
Commit_Notify() we signal the backends.

Or I'm missing something here?

--
Matheus Alcantara

#21

Rishu Bagga

rishu.postgres@gmail.com

4 months ago

In reply to: Matheus Alcantara (#20)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed, Sep 3, 2025 at 2:14 PM Matheus Alcantara

<matheusssilv97@gmail.com> wrote:

IIUC we don't store notifications of aborted transactions on the

queue. On PreCommit_Notify we add the notifications on the queue

and on Commit_Notify() we signal the backends.

Or I'm missing something here?

My understanding is that something could go wrong in between

PreCommit_Notify and AtCommit_Notify, which could cause the

transaction to abort, and the notification might be in the queue at

this point, even though the transaction aborted, hence the dependency

on the commit log.

Maybe we could have a boolean "committed" field on AsyncQueueEntry

and check this field before sending the notification instead of

call TransactionIdDidCommit()? Is something similar what you are

proposing?

I like this direction, as it opens up the ability to remove the

global lock held from PreCommit_Notify to the end of the transaction.

In the same vein as this idea, I was experimenting with a patch

inspired by Tom Lane’s idea in [1]/messages/by-id/1878165.1752858390@sss.pgh.pa.us where we write the actual

notification data into the WAL, and just write the db, lsn, and xid

into the notify queue in AtCommit_Notify, so the notify queue only

contains information about committed transactions. Unfortunately,

the additional WAL write overhead in this approach seemed to outweigh

the benefits of removing the lock.

Following that idea - I think perhaps we could have two queues in the

notify system, one that stores the notification data, and another

that stores the position of the committed transaction’s notification,

which we would append to in AtCommit_Notify. Then the listener would

go through the position queue, and find / read the notifications from

the notify data queue.

[1]: /messages/by-id/1878165.1752858390@sss.pgh.pa.us

#22

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

4 months ago

In reply to: Rishu Bagga (#21)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Hi,

On Thu, Sep 4, 2025 at 2:51 AM Rishu Bagga <rishu.postgres@gmail.com> wrote:

On Wed, Sep 3, 2025 at 2:14 PM Matheus Alcantara

<matheusssilv97@gmail.com> wrote:

IIUC we don't store notifications of aborted transactions on the

queue. On PreCommit_Notify we add the notifications on the queue

and on Commit_Notify() we signal the backends.

Or I'm missing something here?

My understanding is that something could go wrong in between

PreCommit_Notify and AtCommit_Notify, which could cause the

transaction to abort, and the notification might be in the queue at

this point, even though the transaction aborted, hence the dependency

on the commit log.

I have the same understanding.

Maybe we could have a boolean "committed" field on AsyncQueueEntry

and check this field before sending the notification instead of

call TransactionIdDidCommit()? Is something similar what you are

proposing?

I like this direction, as it opens up the ability to remove the

global lock held from PreCommit_Notify to the end of the transaction.

In the same vein as this idea, I was experimenting with a patch

inspired by Tom Lane’s idea in [1] where we write the actual

notification data into the WAL, and just write the db, lsn, and xid

into the notify queue in AtCommit_Notify, so the notify queue only

contains information about committed transactions. Unfortunately,

the additional WAL write overhead in this approach seemed to outweigh

the benefits of removing the lock.

Interesting, have you shared your patch and results somewhere? IIUC
Tom's approach resolves this bug, because with it we have queue
entries produced by committed transactions only, so we don't need to
check their status and don't have dependency on clog.

Following that idea - I think perhaps we could have two queues in the

notify system, one that stores the notification data, and another

that stores the position of the committed transaction’s notification,

which we would append to in AtCommit_Notify. Then the listener would

go through the position queue, and find / read the notifications from

the notify data queue.

If I'm not wrong, by the time we got into AtCommit_Notify we no longer
hold the global lock, so we can't provide notifications in the commit
order. I think it would work if we add entries to the position queue
concurrently with adding commit entries to the wal, as in Tom's idea
3rd step.
Another question: how to truncate notification data queue. It sounds
like it will be more difficult to figure out what data in the queue is
still needed.

Best regards,
Arseniy Mukhin

#23

Rishu Bagga

rishu.postgres@gmail.com

4 months ago

In reply to: Arseniy Mukhin (#22)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Thu, Sep 4, 2025 at 7:14 AM Arseniy Mukhin
<arseniy.mukhin.dev@gmail.com> wrote:

Interesting, have you shared your patch and results somewhere? IIUC
Tom's approach resolves this bug, because with it we have queue
entries produced by committed transactions only, so we don't need to
check their status and don't have dependency on clog.

I was eventually able to get better numbers with the patch after using
a substantial number of connections, so I have now posted it here. [1]/messages/by-id/CAK80=jipUfGC+U QSzeA4oCP9daRtHZGm2SQZWLxC9NWmVTDtRQ%40mail.gmail.com.

Another question: how to truncate notification data queue. It sounds
like it will be more difficult to figure out what data in the queue is
still needed.

Yeah, I will think about this some more, and follow up with another
patch.

Thanks,
Rishu

[1]: /messages/by-id/CAK80=jipUfGC+U QSzeA4oCP9daRtHZGm2SQZWLxC9NWmVTDtRQ%40mail.gmail.com
QSzeA4oCP9daRtHZGm2SQZWLxC9NWmVTDtRQ%40mail.gmail.com

#24

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Rishu Bagga (#21)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed Sep 3, 2025 at 8:51 PM -03, Rishu Bagga wrote:

On Wed, Sep 3, 2025 at 2:14 PM Matheus Alcantara

<matheusssilv97@gmail.com> wrote:

IIUC we don't store notifications of aborted transactions on the

queue. On PreCommit_Notify we add the notifications on the queue

and on Commit_Notify() we signal the backends.

Or I'm missing something here?

My understanding is that something could go wrong in between

PreCommit_Notify and AtCommit_Notify, which could cause the

transaction to abort, and the notification might be in the queue at

this point, even though the transaction aborted, hence the dependency

on the commit log.

Ok, I agree that that this may happen but I don't see this as a common
case to fix the issue based on this behaviour. I think that we check the
transaction status also to skip notifications that were added on the
queue by transactions that are not fully committed yet, and I see this
scenario as a more common case but I could be wrong.

--
Matheus Alcantara

#25

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Rishu Bagga (#23)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Thu Sep 4, 2025 at 8:02 PM -03, Rishu Bagga wrote:

On Thu, Sep 4, 2025 at 7:14 AM Arseniy Mukhin
<arseniy.mukhin.dev@gmail.com> wrote:

Interesting, have you shared your patch and results somewhere? IIUC
Tom's approach resolves this bug, because with it we have queue
entries produced by committed transactions only, so we don't need to
check their status and don't have dependency on clog.

I was eventually able to get better numbers with the patch after using
a substantial number of connections, so I have now posted it here. [1].

Thanks for sharing the patch! I'll reserve some time to test and review.

--
Matheus Alcantara

#26

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

4 months ago

In reply to: Matheus Alcantara (#24)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri, Sep 5, 2025 at 1:44 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Wed Sep 3, 2025 at 8:51 PM -03, Rishu Bagga wrote:

On Wed, Sep 3, 2025 at 2:14 PM Matheus Alcantara

<matheusssilv97@gmail.com> wrote:

IIUC we don't store notifications of aborted transactions on the

queue. On PreCommit_Notify we add the notifications on the queue

and on Commit_Notify() we signal the backends.

Or I'm missing something here?

My understanding is that something could go wrong in between

PreCommit_Notify and AtCommit_Notify, which could cause the

transaction to abort, and the notification might be in the queue at

this point, even though the transaction aborted, hence the dependency

on the commit log.

Ok, I agree that that this may happen but I don't see this as a common
case to fix the issue based on this behaviour. I think that we check the
transaction status also to skip notifications that were added on the
queue by transactions that are not fully committed yet, and I see this
scenario as a more common case but I could be wrong.

IIUC we don't need clog to check if the notification transaction is
still in progress, we use a snapshot for that (XidInMVCCSnapshot()).
If we see that the notification transaction is still in progress, we
postpone processing of that notification. If we see that notification
transaction is not in progress, then we check its status with
TransactionIdDidCommit() (using clog) and only then fail if clog
doesn't have notification transcation data (as a quick check, we can
see in the stack trace that the failure occurs during the call to
TransactionIdDidCommit, not XidInMVCCSnapshot).

So we need to check notification transaction status only to understand
if a transaction was committed and we can send notification or it was
aborted (somewhere in between PreCommit_Notify and AtCommit_Notify)
and we should skip notification. And the solution that Yura Sokolov
suggested is to introduce a 'commited' flag, so we can check it
instead of calling TransactionIdDidCommit. Another option is to make
aborted transactions to clean up after themselves in AtAbort_Notify(),
so if listener see notification in the queue and transaction that
wrote it is not in progress, then such notification was always created
by committed transaction and we also don't need to call
TransactionIdDidCommit. One way for aborted transactions to clean up
is to set its notifications dbOid to 0, so listeners skip them.
Another option I think is to set queue HEAD back to the position where
it was when the aborted transaction started to write to the queue. It
also makes such aborted transaction notifications 'invisible' for
listeners. We have a global lock and there is always only one writer,
so it sounds possible. And there is another option which is the patch
that Rishu Bagga is working on. In the patch all transactions write to
the listen/notify queue only after they wrote a commit record to the
WAL, so all notifications in the queue are always by committed
transactions and we don't need to call TransactionIdDidCommit.

Best regards,
Arseniy Mukhin

#27

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Arseniy Mukhin (#26)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri Sep 5, 2025 at 9:56 AM -03, Arseniy Mukhin wrote:

Ok, I agree that that this may happen but I don't see this as a common
case to fix the issue based on this behaviour. I think that we check the
transaction status also to skip notifications that were added on the
queue by transactions that are not fully committed yet, and I see this
scenario as a more common case but I could be wrong.

IIUC we don't need clog to check if the notification transaction is
still in progress, we use a snapshot for that (XidInMVCCSnapshot()).
If we see that the notification transaction is still in progress, we
postpone processing of that notification. If we see that notification
transaction is not in progress, then we check its status with
TransactionIdDidCommit() (using clog) and only then fail if clog
doesn't have notification transcation data (as a quick check, we can
see in the stack trace that the failure occurs during the call to
TransactionIdDidCommit, not XidInMVCCSnapshot).

So we need to check notification transaction status only to understand
if a transaction was committed and we can send notification or it was
aborted (somewhere in between PreCommit_Notify and AtCommit_Notify)
and we should skip notification. And the solution that Yura Sokolov
suggested is to introduce a 'commited' flag, so we can check it
instead of calling TransactionIdDidCommit. Another option is to make
aborted transactions to clean up after themselves in AtAbort_Notify(),
so if listener see notification in the queue and transaction that
wrote it is not in progress, then such notification was always created
by committed transaction and we also don't need to call
TransactionIdDidCommit. One way for aborted transactions to clean up
is to set its notifications dbOid to 0, so listeners skip them.
Another option I think is to set queue HEAD back to the position where
it was when the aborted transaction started to write to the queue. It
also makes such aborted transaction notifications 'invisible' for
listeners. We have a global lock and there is always only one writer,
so it sounds possible. And there is another option which is the patch
that Rishu Bagga is working on. In the patch all transactions write to
the listen/notify queue only after they wrote a commit record to the
WAL, so all notifications in the queue are always by committed
transactions and we don't need to call TransactionIdDidCommit.

You are right, I missed the XidInMVCCSnapshot() call, sorry about that
and thanks very much for all the explanation!

I'll keep on hold the development of new patch versions for this thread
and focus on review and test the patch from Rishu at [1]/messages/by-id/CAK80=jipUfGC+UQSzeA4oCP9daRtHZGm2SQZWLxC9NWmVTDtRQ@mail.gmail.com to see if we
can make progress using the WAL approach.

[1]: /messages/by-id/CAK80=jipUfGC+UQSzeA4oCP9daRtHZGm2SQZWLxC9NWmVTDtRQ@mail.gmail.com

--
Matheus Alcantara

#28

Masahiko Sawada

sawada.mshk@gmail.com

4 months ago

In reply to: Matheus Alcantara (#27)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Sat, Sep 6, 2025 at 7:15 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Fri Sep 5, 2025 at 9:56 AM -03, Arseniy Mukhin wrote:

Ok, I agree that that this may happen but I don't see this as a common
case to fix the issue based on this behaviour. I think that we check the
transaction status also to skip notifications that were added on the
queue by transactions that are not fully committed yet, and I see this
scenario as a more common case but I could be wrong.

IIUC we don't need clog to check if the notification transaction is
still in progress, we use a snapshot for that (XidInMVCCSnapshot()).
If we see that the notification transaction is still in progress, we
postpone processing of that notification. If we see that notification
transaction is not in progress, then we check its status with
TransactionIdDidCommit() (using clog) and only then fail if clog
doesn't have notification transcation data (as a quick check, we can
see in the stack trace that the failure occurs during the call to
TransactionIdDidCommit, not XidInMVCCSnapshot).

So we need to check notification transaction status only to understand
if a transaction was committed and we can send notification or it was
aborted (somewhere in between PreCommit_Notify and AtCommit_Notify)
and we should skip notification. And the solution that Yura Sokolov
suggested is to introduce a 'commited' flag, so we can check it
instead of calling TransactionIdDidCommit. Another option is to make
aborted transactions to clean up after themselves in AtAbort_Notify(),
so if listener see notification in the queue and transaction that
wrote it is not in progress, then such notification was always created
by committed transaction and we also don't need to call
TransactionIdDidCommit. One way for aborted transactions to clean up
is to set its notifications dbOid to 0, so listeners skip them.
Another option I think is to set queue HEAD back to the position where
it was when the aborted transaction started to write to the queue. It
also makes such aborted transaction notifications 'invisible' for
listeners. We have a global lock and there is always only one writer,
so it sounds possible. And there is another option which is the patch
that Rishu Bagga is working on. In the patch all transactions write to
the listen/notify queue only after they wrote a commit record to the
WAL, so all notifications in the queue are always by committed
transactions and we don't need to call TransactionIdDidCommit.

You are right, I missed the XidInMVCCSnapshot() call, sorry about that
and thanks very much for all the explanation!

I'll keep on hold the development of new patch versions for this thread
and focus on review and test the patch from Rishu at [1] to see if we
can make progress using the WAL approach.

While the WAL-based approach discussed on another thread is promising,
I think it would not be acceptable for back branches as it requires
quite a lot of refactoring. Given that this is a long-standing bug in
listen/notify, I think we can continue discussing how to fix the issue
on backbranches on this thread.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#29

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Masahiko Sawada (#28)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Mon Sep 15, 2025 at 2:40 PM -03, Masahiko Sawada wrote:

While the WAL-based approach discussed on another thread is promising,
I think it would not be acceptable for back branches as it requires
quite a lot of refactoring. Given that this is a long-standing bug in
listen/notify, I think we can continue discussing how to fix the issue
on backbranches on this thread.

Please see the new attached patch, it has a different implementation
that I've previously posted which is based on the idea that Arseniy
posted on [1]/messages/by-id/CAE7r3M+Xwf0A_aNhu7pYQd7nDQaqaEnyCjtvqg8XsgManmPOUA@mail.gmail.com.

This new version include the "committed" field on AsyncQueueEntry struct
so that we can use this info when processing the notification instead of
call TransactionIdDidCommit()

The "committed" field is set to true when the AsyncQueueEntry is being
added on the SLRU page buffer when the PreCommit_Notify() is called. If
an error occurs between the PreCommit_Notify() and AtCommit_Notify() the
AtAbort_Notify() will be called and will set the "committed" field to
false for the notifications inside the aborted transaction.

It's a bit tricky to know at AtAbort_Notify() which notifications were
added on the SLRU page buffer by the aborted transaction, so I created a
new data structure and a global variable to keep track of this
information. See the commit message for more information.

On the previously patch that I've posted I've created a TAP test to
reproduce the issue with the VACUUM FREEZE, this new version also
include this test and also a new test case that use the injection points
extension to force an error between the PreCommit_Notify() and
AtCommit_Notify() so that we can ensure that these notifications of an
aborted transaction are not visible to other listener backends.

[1]: /messages/by-id/CAE7r3M+Xwf0A_aNhu7pYQd7nDQaqaEnyCjtvqg8XsgManmPOUA@mail.gmail.com

--
Matheus Alcantara

Attachments:

v1-0001-Make-AsyncQueueEntry-s-self-contained.patchapplication/octet-stream; name=v1-0001-Make-AsyncQueueEntry-s-self-contained.patchDownload

From 72df2b4e2f76294692ce57553349b61ff141e00c Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Sat, 6 Sep 2025 11:29:02 -0300
Subject: [PATCH v1] Make AsyncQueueEntry's self contained

Previously the asyncQueueProcessPageEntries() use the
TransactionIdDidCommit() to check if the transaction that a notification
belongs is committed or not. Although this work for almost all scenarios
we may have some cases where if a notification is keep for to long on
the queue and the VACUUM FREEZE is executed during this time it may
remove clog files that is needed to check the transaction status of
these notifications which will cause errors to listener backends when
reading the async queue.

This commit fix this issue by making the AsyncQueueEntry self contained
by adding the "committed" boolean field so asyncQueueProcessPageEntries()
can use this to check if the transaction of the notification is
committed or not.

We set committed as true when adding the entry on the SLRU page buffer
cache when PreCommit_Notify() is called and if an error occur before
AtCommit_Notify() the AtAbort_Notify() will be called which will mark
the committed field as false.

A new global List pendingAsyncQueueEntries is created to keep track of
pending notification entries that is on shared async queue page buffer
but it's not fully committed yet so we can mark these entries as not
committed in case of transaction abort.

Also this commit include TAP tests to exercise the VACUUM FREEZE issue
and also the scenario of an error being occur between the
PreCommit_Notify() and AtCommit_Notify() calls by using the injection
points extension.
---
 src/backend/access/transam/xact.c             |  3 +
 src/backend/commands/async.c                  | 92 ++++++++++++++++++-
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 src/test/modules/test_listen_notify/Makefile  | 17 ++++
 .../modules/test_listen_notify/meson.build    | 14 +++
 .../test_listen_notify/t/001_xid_freeze.pl    | 66 +++++++++++++
 .../test_listen_notify/t/002_transaction.pl   | 57 ++++++++++++
 src/tools/pgindent/typedefs.list              |  1 +
 9 files changed, 251 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl
 create mode 100644 src/test/modules/test_listen_notify/t/002_transaction.pl

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..34b46ab4d4a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -64,6 +64,7 @@
 #include "utils/builtins.h"
 #include "utils/combocid.h"
 #include "utils/guc.h"
+#include "utils/injection_point.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/relmapper.h"
@@ -2340,6 +2341,8 @@ CommitTransaction(void)
 	 */
 	PreCommit_Notify();
 
+	INJECTION_POINT("commit-transaction-pos-pre-commit-notify", NULL);
+
 	/*
 	 * Mark serializable transaction as complete for predicate locking
 	 * purposes.  This should be done as late as we can put it and still allow
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..a7c211fc09a 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -180,6 +180,8 @@ typedef struct AsyncQueueEntry
 	Oid			dboid;			/* sender's database OID */
 	TransactionId xid;			/* sender's XID */
 	int32		srcPid;			/* sender's PID */
+	bool		committed;		/* Is transaction that the entry belongs
+								 * committed? */
 	char		data[NAMEDATALEN + NOTIFY_PAYLOAD_MAX_LENGTH];
 } AsyncQueueEntry;
 
@@ -401,8 +403,30 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*
+ * Struct representing an AsyncQueueEntry that is added on global async queue
+ * but it is not yet fully committed.
+ *
+ * This struct is used to find AsyncQueueEntry's already added on the global
+ * async queue and mark then as not committed when an error between the
+ * Pre_CommitNotify() and At_CommitNotify() is raised.
+ */
+typedef struct PendingAsyncQueueEntry
+{
+	/* Page which the entry was added */
+	int64		page;
+
+	/* slot number where the entry was added */
+	int			slotno;
+
+	/* offset of the entry on page buffer */
+	int			offset;
+} PendingAsyncQueueEntry;
+
 static NotificationList *pendingNotifies = NULL;
 
+static List *pendingAsyncQueueEntries = NIL;
+
 /*
  * Inbound notifications are initially processed by HandleNotifyInterrupt(),
  * called from inside a signal handler. That just sets the
@@ -1398,12 +1422,30 @@ asyncQueueAddEntries(ListCell *nextNotify)
 	while (nextNotify != NULL)
 	{
 		Notification *n = (Notification *) lfirst(nextNotify);
+		PendingAsyncQueueEntry *pendingEntry = palloc(sizeof(PendingAsyncQueueEntry));
 
 		/* Construct a valid queue entry in local variable qe */
 		asyncQueueNotificationToEntry(n, &qe);
 
+		/*
+		 * Mark the entry as committed. If the transaction that this
+		 * notification belongs fails to commit the AtAbort_Notify() will mark
+		 * this entry as not committed.
+		 */
+		qe.committed = true;
+
 		offset = QUEUE_POS_OFFSET(queue_head);
 
+		/*
+		 * Store information from AsyncQueueEntry so AtAbort_Notify() can
+		 * lookup the entry on shared page buffer if the transaction fails to
+		 * commit
+		 */
+		pendingEntry->page = pageno;
+		pendingEntry->slotno = slotno;
+		pendingEntry->offset = offset;
+		pendingAsyncQueueEntries = lappend(pendingAsyncQueueEntries, pendingEntry);
+
 		/* Check whether the entry really fits on the current page */
 		if (offset + qe.length <= QUEUE_PAGESIZE)
 		{
@@ -1670,6 +1712,8 @@ SignalBackends(void)
 void
 AtAbort_Notify(void)
 {
+	ListCell   *lc;
+
 	/*
 	 * If we LISTEN but then roll back the transaction after PreCommit_Notify,
 	 * we have registered as a listener but have not made any entry in
@@ -1678,6 +1722,51 @@ AtAbort_Notify(void)
 	if (amRegisteredListener && listenChannels == NIL)
 		asyncQueueUnregister();
 
+	/*
+	 * At this stage if we have pendingAsyncQueueEntries we already have added
+	 * these notifications on the shared async queue, so we need to lookup
+	 * these notifications and mark then as not committed, so when a backend
+	 * is processing this notification it can skip without checking the
+	 * transaction status on clog files that may already have be truncated by
+	 * VACUUM FREEZE.
+	 */
+	if (pendingAsyncQueueEntries != NIL)
+	{
+		/*
+		 * We can not have pending async queue entries without pending
+		 * notifications
+		 */
+		Assert(pendingNotifies != NULL);
+
+		foreach(lc, pendingAsyncQueueEntries)
+		{
+			PendingAsyncQueueEntry *pendingEntry = (PendingAsyncQueueEntry *) lfirst(lc);
+			int			slotno = pendingEntry->slotno;
+			int			offset = pendingEntry->offset;
+			int64		page = pendingEntry->page;
+			LWLock	   *banklock = SimpleLruGetBankLock(NotifyCtl, page);
+			AsyncQueueEntry *qe;
+
+			LWLockAcquire(banklock, LW_EXCLUSIVE);
+
+			qe = (AsyncQueueEntry *) (NotifyCtl->shared->page_buffer[slotno] + offset);
+
+			/*
+			 * The entry should already be marked as committed when adding on
+			 * into the shared async queue.
+			 */
+			Assert(qe->committed);
+
+			/*
+			 * Mark notification entry as not committed so it's not visible to
+			 * other backends
+			 */
+			qe->committed = false;
+
+			LWLockRelease(banklock);
+		}
+	}
+
 	/* And clean up */
 	ClearPendingActionsAndNotifies();
 }
@@ -2066,7 +2155,7 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 				reachedStop = true;
 				break;
 			}
-			else if (TransactionIdDidCommit(qe->xid))
+			else if (qe->committed)
 			{
 				/* qe->data is the null-terminated channel name */
 				char	   *channel = qe->data;
@@ -2385,6 +2474,7 @@ ClearPendingActionsAndNotifies(void)
 	 */
 	pendingActions = NULL;
 	pendingNotifies = NULL;
+	pendingAsyncQueueEntries = NIL;
 }
 
 /*
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 903a8ac151a..4c0160df341 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_misc \
 		  test_oat_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 93be0f57289..144379b619b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -27,6 +27,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_misc')
 subdir('test_oat_hooks')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..da1bf5bb1b7
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,17 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..f0a2b5058e4
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,14 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+      't/002_transaction.pl',
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..0a5130a042e
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/002_transaction.pl b/src/test/modules/test_listen_notify/t/002_transaction.pl
new file mode 100644
index 00000000000..825ae9cb1a9
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_transaction.pl
@@ -0,0 +1,57 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+if (!$node->check_extension('injection_points'))
+{
+	plan skip_all => 'Extension injection_points not installed';
+}
+
+# Use injection_points extension to test that if an error occur between
+# Pre_CommitNotify() and At_CommitNotify() the notification is not processed by
+# a backend listener.
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Start session 1, create a listener and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen c1;', "Session 1 listens to 'c1'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# Start session 2 to send a notification inside a transaction block
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe('BEGIN;');
+$psql_session2->query_safe('NOTIFY c1;');
+
+# Add injection point to fail to commit the notify c1 transaction.
+$psql_session2->query_safe("SELECT injection_points_attach('commit-transaction-pos-pre-commit-notify', 'error');");
+
+# Commit the NOTIFY transaction which will raise an error due to injection point
+$psql_session2->query('COMMIT;');
+
+# detach the injection point on the open transaction to make it complete.
+$psql_session1->query_safe("SELECT injection_points_detach('commit-transaction-pos-pre-commit-notify');");
+
+# Send another NOTIFY after injection point is detached to signal the psql_session1 backend.
+$node->safe_psql('postgres', 'NOTIFY c1') ;
+
+# Commit the listener transaction - It should not see the notification that fails to commit.
+my $res = $psql_session1->query_safe('commit;', "commit listen s2;");
+
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "c1" received/);
+}
+is($notifications_count, 1, 'received only committed notifications');
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e90af5b2ad3..6228faa56a5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2148,6 +2148,7 @@ PatternInfo
 PatternInfoArray
 Pattern_Prefix_Status
 Pattern_Type
+PendingAsyncQueueEntry
 PendingFsyncEntry
 PendingRelDelete
 PendingRelSync
-- 
2.39.5 (Apple Git-154)

#30

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

4 months ago

In reply to: Matheus Alcantara (#29)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Hi,

On Fri, Sep 19, 2025 at 12:35 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Mon Sep 15, 2025 at 2:40 PM -03, Masahiko Sawada wrote:

While the WAL-based approach discussed on another thread is promising,
I think it would not be acceptable for back branches as it requires
quite a lot of refactoring. Given that this is a long-standing bug in
listen/notify, I think we can continue discussing how to fix the issue
on backbranches on this thread.

Please see the new attached patch, it has a different implementation
that I've previously posted which is based on the idea that Arseniy
posted on [1].

Thank you for the new version.

This new version include the "committed" field on AsyncQueueEntry struct
so that we can use this info when processing the notification instead of
call TransactionIdDidCommit()

The "committed" field is set to true when the AsyncQueueEntry is being
added on the SLRU page buffer when the PreCommit_Notify() is called. If
an error occurs between the PreCommit_Notify() and AtCommit_Notify() the
AtAbort_Notify() will be called and will set the "committed" field to
false for the notifications inside the aborted transaction.

It's a bit tricky to know at AtAbort_Notify() which notifications were
added on the SLRU page buffer by the aborted transaction, so I created a
new data structure and a global variable to keep track of this
information. See the commit message for more information.

I like this approach. We got rid of dependency on clog and don't limit
vacuum. Several points about the fix:

Is it correct to remember and reuse slru slots here? IIUC we can't do
it if we don't hold SLRU bank lock, because by the time we get in
AtAbort_Notify() the queue page could be already evicted. Probably we
need to use SimpleLruReadPage after we acquire the lock in
AtAbort_Notify()?

I think adding a boolean 'committed' is a good approach, but what do
you think about setting the queue head back to the position where
aborted transaction notifications start? We can do such a reset in
AtAbort_Notify(). So instead of marking notifications as
'commited=false' we completely erase them from the queue by moving the
queue head back. From listeners perspective if there is a notification
of completed transaction in the queue - it's always a committed
transaction, so again get rid of TransactionIdDidCommit() call. It
seems like a simpler approach because we don't need to remember all
notifications positions in the queue and don't need the additional
field 'committed'. All we need is to remember the head position before
we write anything to the queue, and reset it back if there is an
abort. IIUC Listeners will never send such erased notifications:
- while the aborted transaction is looking like 'in progress',
listeners can't send its notifications.
- by the time the aborted transaction is completed, the head is
already set back so erased notifications are located after the queue
head and listeners can't read it.

On the previously patch that I've posted I've created a TAP test to
reproduce the issue with the VACUUM FREEZE, this new version also
include this test and also a new test case that use the injection points
extension to force an error between the PreCommit_Notify() and
AtCommit_Notify() so that we can ensure that these notifications of an
aborted transaction are not visible to other listener backends.

I think it's a good test to have. FWIW there is a way to reproduce the
test condition without the injection point. We can use the fact that
serializable conflicts are checked after tx adds notifications to the
queue. Please find the attached patch with the example tap test. Not
sure if using injections points is more preferable?

Best regards,
Arseniy Mukhin

Attachments:

0001-Adds-LISTEN-NOTIFY-aborted_tx_notification-TAP-test.patchtext/x-patch; charset=US-ASCII; name=0001-Adds-LISTEN-NOTIFY-aborted_tx_notification-TAP-test.patchDownload

From 08b98c34d418fd40f2351ad09d216317a72665a4 Mon Sep 17 00:00:00 2001
From: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Date: Fri, 19 Sep 2025 15:24:27 +0300
Subject: [PATCH] Adds LISTEN/NOTIFY aborted_tx_notification TAP test

---
 .../modules/test_listen_notify/meson.build    |  2 +-
 .../t/002_aborted_tx_notifies.pl              | 67 +++++++++++++++++++
 2 files changed, 68 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl

diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
index f0a2b5058e4..a68052cd353 100644
--- a/src/test/modules/test_listen_notify/meson.build
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -7,7 +7,7 @@ tests += {
   'tap': {
     'tests': [
       't/001_xid_freeze.pl',
-      't/002_transaction.pl',
+      't/002_aborted_tx_notifies.pl'
     ],
   },
 }
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..ed6abf9c576
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,67 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->init;
+$node->start;
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'aborted';
+");
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send another notification after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'next_committed';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 2, 'received all committed notifications');
+like($lines[0], qr/Asynchronous notification "ch" with payload "committed" received/);
+like($lines[1], qr/Asynchronous notification "ch" with payload "next_committed" received/);
+
+done_testing();
\ No newline at end of file
-- 
2.43.0

#31

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Arseniy Mukhin (#30)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri Sep 19, 2025 at 9:34 AM -03, Arseniy Mukhin wrote:

Hi,

Thanks for taking a look at this!

I like this approach. We got rid of dependency on clog and don't limit
vacuum. Several points about the fix:

Is it correct to remember and reuse slru slots here? IIUC we can't do
it if we don't hold SLRU bank lock, because by the time we get in
AtAbort_Notify() the queue page could be already evicted. Probably we
need to use SimpleLruReadPage after we acquire the lock in
AtAbort_Notify()?

Yeah, I think it make sense. Maybe we could just store the page number
and offset and call SimpleLruReadPage() inside AtAbort_Notify() to get
the slotno?

I think adding a boolean 'committed' is a good approach, but what do
you think about setting the queue head back to the position where
aborted transaction notifications start? We can do such a reset in
AtAbort_Notify(). So instead of marking notifications as
'commited=false' we completely erase them from the queue by moving the
queue head back. From listeners perspective if there is a notification
of completed transaction in the queue - it's always a committed
transaction, so again get rid of TransactionIdDidCommit() call. It
seems like a simpler approach because we don't need to remember all
notifications positions in the queue and don't need the additional
field 'committed'. All we need is to remember the head position before
we write anything to the queue, and reset it back if there is an
abort. IIUC Listeners will never send such erased notifications:
- while the aborted transaction is looking like 'in progress',
listeners can't send its notifications.
- by the time the aborted transaction is completed, the head is
already set back so erased notifications are located after the queue
head and listeners can't read it.

I think that with this approach we may have a scenario where when we
enter the AtAbort_Notify() the queue head may not be the notification
from the transaction being aborted but it can be a notification from
another committed transaction, so resetting the queue head back to the
position before the aborted transaction would make us lose the
notification from the committed transaction. Is that make sense? What do
you think?

I think it's a good test to have. FWIW there is a way to reproduce the
test condition without the injection point. We can use the fact that
serializable conflicts are checked after tx adds notifications to the
queue. Please find the attached patch with the example tap test. Not
sure if using injections points is more preferable?

Great, I think that if we can have a way to reproduce this without an
injection point would be better. I dind't look the test yet but I'll try
to incorporate this on the next patch version. Thanks!

--
Matheus Alcantara

#32

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

4 months ago

In reply to: Matheus Alcantara (#31)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri, Sep 19, 2025 at 5:36 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Fri Sep 19, 2025 at 9:34 AM -03, Arseniy Mukhin wrote:

Hi,

Thanks for taking a look at this!

I like this approach. We got rid of dependency on clog and don't limit
vacuum. Several points about the fix:

Is it correct to remember and reuse slru slots here? IIUC we can't do
it if we don't hold SLRU bank lock, because by the time we get in
AtAbort_Notify() the queue page could be already evicted. Probably we
need to use SimpleLruReadPage after we acquire the lock in
AtAbort_Notify()?

Yeah, I think it make sense. Maybe we could just store the page number
and offset and call SimpleLruReadPage() inside AtAbort_Notify() to get
the slotno?

Yeah, I think it would work.

I think adding a boolean 'committed' is a good approach, but what do
you think about setting the queue head back to the position where
aborted transaction notifications start? We can do such a reset in
AtAbort_Notify(). So instead of marking notifications as
'commited=false' we completely erase them from the queue by moving the
queue head back. From listeners perspective if there is a notification
of completed transaction in the queue - it's always a committed
transaction, so again get rid of TransactionIdDidCommit() call. It
seems like a simpler approach because we don't need to remember all
notifications positions in the queue and don't need the additional
field 'committed'. All we need is to remember the head position before
we write anything to the queue, and reset it back if there is an
abort. IIUC Listeners will never send such erased notifications:
- while the aborted transaction is looking like 'in progress',
listeners can't send its notifications.
- by the time the aborted transaction is completed, the head is
already set back so erased notifications are located after the queue
head and listeners can't read it.

I think that with this approach we may have a scenario where when we
enter the AtAbort_Notify() the queue head may not be the notification
from the transaction being aborted but it can be a notification from
another committed transaction, so resetting the queue head back to the
position before the aborted transaction would make us lose the
notification from the committed transaction. Is that make sense? What do
you think?

I think it's impossible: before pushing anything to the queue we
acquire global lock in PreCommit_Notify():

LockSharedObject(DatabaseRelationId, InvalidOid, 0, AccessExclusiveLock);

While we are holding the lock, no writers can add anything to the
queue. Then we save head position and add pending notifications to the
queue. The moment we get in AtAbort_Notify(), we still hold the global
lock (maybe it is worth adding Assert about it if we start relying on
it), so we can be sure there are no notifications in the queue after
the saved head position except ours. So it seems safe but maybe I
missed something.

Best regards,
Arseniy Mukhin

#33

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Arseniy Mukhin (#32)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri Sep 19, 2025 at 12:13 PM -03, Arseniy Mukhin wrote:

I think it's impossible: before pushing anything to the queue we
acquire global lock in PreCommit_Notify():

LockSharedObject(DatabaseRelationId, InvalidOid, 0, AccessExclusiveLock);

While we are holding the lock, no writers can add anything to the
queue. Then we save head position and add pending notifications to the
queue. The moment we get in AtAbort_Notify(), we still hold the global
lock (maybe it is worth adding Assert about it if we start relying on
it), so we can be sure there are no notifications in the queue after
the saved head position except ours. So it seems safe but maybe I
missed something.

Thanks for the explanation! I'm just not sure if I understand why do we
need the LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE) on
PreCommit_Notify() if we already have the
LockSharedObject(DatabaseRelationId, InvalidOid, 0, AccessExclusiveLock);

See the attached patch that is based on your the previous comment of
resetting the QUEUE_HEAD at AtAbort_Notify()

--
Matheus Alcantara

Attachments:

v2-0001-Reset-LISTEN-NOTIFY-QUEUE_HEAD-for-crashed-transa.patchtext/plain; charset=utf-8; name=v2-0001-Reset-LISTEN-NOTIFY-QUEUE_HEAD-for-crashed-transa.patchDownload

From f31e096ed8eef6b6988d145ecaa0f513ec9f1042 Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Sat, 6 Sep 2025 11:29:02 -0300
Subject: [PATCH v2] Reset LISTEN/NOTIFY QUEUE_HEAD for crashed transactions

Previously the asyncQueueProcessPageEntries() use the
TransactionIdDidCommit() to check if the transaction that a notification
belongs is committed or not. Although this work for almost all scenarios
we may have some cases where if a notification is keep for to long on
the queue and the VACUUM FREEZE is executed during this time it may
remove clog files that is needed to check the transaction status of
these notifications which will cause errors to listener backends when
reading the async queue.

This commit fix this issue by preserving the QUEUE_HEAD position before
a transaction's notifications are added to the shared queue. This
ensures that if the transaction fails between the PreCommit_Notify() and
AtCommit_Notify() phases, the At_AbortNotify() will be called to reset
the QUEUE_HEAD to its prior state, removing the entries from the aborted
transaction from the shared queue.

Also this commit include TAP tests to exercise the VACUUM FREEZE issue
and also the scenario of an error being occur between the
PreCommit_Notify() and AtCommit_Notify() calls.

Co-authored-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
---
 src/backend/commands/async.c                  | 62 ++++++++++++++---
 src/backend/storage/lmgr/lmgr.c               | 19 ++++++
 src/include/storage/lmgr.h                    |  3 +
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 src/test/modules/test_listen_notify/Makefile  | 17 +++++
 .../modules/test_listen_notify/meson.build    | 14 ++++
 .../test_listen_notify/t/001_xid_freeze.pl    | 66 +++++++++++++++++++
 .../t/002_aborted_tx_notifies.pl              | 66 +++++++++++++++++++
 src/tools/pgindent/typedefs.list              |  1 +
 10 files changed, 242 insertions(+), 8 deletions(-)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..7276a1da8f0 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -401,8 +401,17 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*  Information needed by At_AbortNotify() to remove entries from the queue for crashed transactions. */
+typedef struct AtAbortNotifyInfo
+{
+	QueuePosition previousHead;
+
+} AtAbortNotifyInfo;
+
 static NotificationList *pendingNotifies = NULL;
 
+static AtAbortNotifyInfo *atAbortInfo = NULL;
+
 /*
  * Inbound notifications are initially processed by HandleNotifyInterrupt(),
  * called from inside a signal handler. That just sets the
@@ -1465,6 +1474,19 @@ asyncQueueAddEntries(ListCell *nextNotify)
 		}
 	}
 
+	/*
+	 * If 'atAbortInfo' is NULL, this is the first time we're adding
+	 * notifications for the current transaction on the shared global queue.
+	 * We save the initial QUEUE_HEAD position and if the transaction later
+	 * aborts, At_AbortNotify() will use this saved position to reset
+	 * QUEUE_HEAD, discarding the uncommitted notifications from the queue.
+	 */
+	if (atAbortInfo == NULL)
+	{
+		atAbortInfo = palloc(sizeof(AtAbortNotifyInfo));
+		atAbortInfo->previousHead = QUEUE_HEAD;
+	}
+
 	/* Success, so update the global QUEUE_HEAD */
 	QUEUE_HEAD = queue_head;
 
@@ -1678,6 +1700,25 @@ AtAbort_Notify(void)
 	if (amRegisteredListener && listenChannels == NIL)
 		asyncQueueUnregister();
 
+	/*
+	 * AtAbort_Notify information is set when we are adding entries on the
+	 * global shared queue at PreCommit_Notify(), so in case of a crash on the
+	 * transaction between the PreCommit_Notify() and AtCommit_Notify() we use
+	 * this information to remove the entries from the crashed transaction
+	 * from the queue. We remove these entries by resetting the QUEUE_HEAD to
+	 * the position that it was before the entries being added on the queue.
+	 *
+	 * We are protected by the global shared lock on the database.
+	 */
+	if (atAbortInfo != NULL)
+	{
+		Assert(CheckSharedObjectLockedByMe(DatabaseRelationId, LW_EXCLUSIVE, true));
+
+		LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+		QUEUE_HEAD = atAbortInfo->previousHead;
+		LWLockRelease(NotifyQueueLock);
+	}
+
 	/* And clean up */
 	ClearPendingActionsAndNotifies();
 }
@@ -2066,8 +2107,19 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 				reachedStop = true;
 				break;
 			}
-			else if (TransactionIdDidCommit(qe->xid))
+			else
 			{
+				/*
+				 * AsyncQueueEntry's from crashed transactions are removed
+				 * from the queue at At_AbortNotify(), so if a notification
+				 * entry is not in-progress it is fully committed and safe to
+				 * be processed by a listener backends.
+				 *
+				 * For aborted transactions the notifications are not added on
+				 * the shared global queue since the COMMIT should be executed
+				 * by the transaction to include these on the queue.
+				 */
+
 				/* qe->data is the null-terminated channel name */
 				char	   *channel = qe->data;
 
@@ -2079,13 +2131,6 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 					NotifyMyFrontEnd(channel, payload, qe->srcPid);
 				}
 			}
-			else
-			{
-				/*
-				 * The source transaction aborted or crashed, so we just
-				 * ignore its notifications.
-				 */
-			}
 		}
 
 		/* Loop back if we're not at end of page */
@@ -2385,6 +2430,7 @@ ClearPendingActionsAndNotifies(void)
 	 */
 	pendingActions = NULL;
 	pendingNotifies = NULL;
+	atAbortInfo = NULL;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 4798eb79003..12a21c51452 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -357,6 +357,25 @@ CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode, bool orstronger)
 	return LockHeldByMe(&tag, lockmode, orstronger);
 }
 
+/*
+ *  CheckSharedObjectLockedByMe
+ *
+ *  Like CheckRelationLockedByMe, but it checks for shared objects.
+ */
+bool
+CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode, bool orstronger)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_OBJECT(tag,
+					   InvalidOid,
+					   classid,
+					   InvalidOid,
+					   0);
+
+	return LockHeldByMe(&tag, lockmode, orstronger);
+}
+
 /*
  *		LockHasWaitersRelation
  *
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b7abd18397d..c119c8f4ded 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,6 +50,9 @@ extern bool CheckRelationLockedByMe(Relation relation, LOCKMODE lockmode,
 									bool orstronger);
 extern bool CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode,
 									   bool orstronger);
+extern bool CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode,
+									   bool orstronger);
+
 extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 8a3cd2afab7..4be824d7a31 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 717e85066ba..c0b0b327922 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -27,6 +27,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..da1bf5bb1b7
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,17 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..a68052cd353
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,14 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+      't/002_aborted_tx_notifies.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..0a5130a042e
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..17fcb4b786e
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'aborted';
+");
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send another notification after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'next_committed';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 2, 'received all committed notifications');
+like($lines[0], qr/Asynchronous notification "ch" with payload "committed" received/);
+like($lines[1], qr/Asynchronous notification "ch" with payload "next_committed" received/);
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e90af5b2ad3..e1c2384aa3d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -159,6 +159,7 @@ ArrayType
 AsyncQueueControl
 AsyncQueueEntry
 AsyncRequest
+AtAbortNotifyInfo
 AttInMetadata
 AttStatsSlot
 AttoptCacheEntry
-- 
2.39.5 (Apple Git-154)

#34

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

4 months ago

In reply to: Matheus Alcantara (#33)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Mon, Sep 22, 2025 at 4:09 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Fri Sep 19, 2025 at 12:13 PM -03, Arseniy Mukhin wrote:

I think it's impossible: before pushing anything to the queue we
acquire global lock in PreCommit_Notify():

LockSharedObject(DatabaseRelationId, InvalidOid, 0, AccessExclusiveLock);

While we are holding the lock, no writers can add anything to the
queue. Then we save head position and add pending notifications to the
queue. The moment we get in AtAbort_Notify(), we still hold the global
lock (maybe it is worth adding Assert about it if we start relying on
it), so we can be sure there are no notifications in the queue after
the saved head position except ours. So it seems safe but maybe I
missed something.

Thanks for the explanation! I'm just not sure if I understand why do we
need the LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE) on
PreCommit_Notify() if we already have the
LockSharedObject(DatabaseRelationId, InvalidOid, 0, AccessExclusiveLock);

Good question. It seems that it would be enough to hold
NotifyQueueLock only during the head update, so I don't understand it
either.

See the attached patch that is based on your the previous comment of
resetting the QUEUE_HEAD at AtAbort_Notify()

Thank you for the new version!

The fix looks exactly how I thought about it. But while thinking about
why we need to hold NotifyQueueLock in PreCommit_Notify I realized
there is a concurrency bug in the 'head resetting' approach. I thought
that listeners hold NotifyQueueLock lock while reading notifications
from the queue, but now I see that they hold it only while reading the
current head position. So we can end up in the next situation:

There are 2 backends: listener and writer (backend with notifications)

listener writer
----------------------------------------------------------------------
writer wants to commit, so it
adds notifications to the queue
in PreCommit_Notify() and advances
queue head.

----------------------------------------------------------------------
listener wants to read notifications. It
gets into asyncQueueReadAllNotifications()
and reads the current head (it already
contains writer's notifications):

asyncQueueReadAllNotifications()
...
LWLockAcquire(NotifyQueueLock, LW_SHARED);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
...

----------------------------------------------------------------------
writer failed to commit, so it
resets queue head in AtAbort_Notify()
and completes the transaction.

----------------------------------------------------------------------
listener gets a snapshot where the writer
is not in progress.

...
snapshot = RegisterSnapshot(GetLatestSnapshot());
...

This way the listener reads the head that includes all writer's
notifications and a snapshot where the writer is not in progress, so
nothing stops the listener from sending these notifications and it's
even possible to have the listener's position that is after the queue
head, so yes, it's bad :( Sorry about that.

Probably we can fix it (swap GetLatestSnapshot() with the head
position reading for example) but now I think that 'committed' field
approach is a more robust way to fix the bug. What do you think?

BTW one thing about 'committed' field fix version [0]/messages/by-id/CAFY6G8fui7omKfUXF+JJ82B34ExrwK6J7vKe61S-DhEmr_jBhA@mail.gmail.com. It seems that
instead of remembering all notification positions, we can remember the
head position after we acquire global lock and before we write
anything to the queue and in case of abort we can just start from the
saved position and mark all notifications until the end of the queue
as 'committed = false'. The reason is the same as with resetting the
head approach - as we hold global lock, no writers can add
notifications to the queue, so when we in AtAbort_Notify() all
notifications after the saved position are ours.

[0]: /messages/by-id/CAFY6G8fui7omKfUXF+JJ82B34ExrwK6J7vKe61S-DhEmr_jBhA@mail.gmail.com

Best regards,
Arseniy Mukhin

#35

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Arseniy Mukhin (#34)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Tue Sep 23, 2025 at 1:11 PM -03, Arseniy Mukhin wrote:

Thanks for the explanation! I'm just not sure if I understand why do we
need the LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE) on
PreCommit_Notify() if we already have the
LockSharedObject(DatabaseRelationId, InvalidOid, 0, AccessExclusiveLock);

Good question. It seems that it would be enough to hold
NotifyQueueLock only during the head update, so I don't understand it
either.

IIUC correctly we acquire the LockSharedObject(DatabaseRelationId,
InvalidOid, 0, AccessExclusiveLock) to make other COMMIT's wait and the
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE) to make listener backends
wait while we are adding the entries on the queue.

Thank you for the new version!

The fix looks exactly how I thought about it. But while thinking about
why we need to hold NotifyQueueLock in PreCommit_Notify I realized
there is a concurrency bug in the 'head resetting' approach. I thought
that listeners hold NotifyQueueLock lock while reading notifications
from the queue, but now I see that they hold it only while reading the
current head position. So we can end up in the next situation:

There are 2 backends: listener and writer (backend with notifications)

listener writer
----------------------------------------------------------------------
writer wants to commit, so it
adds notifications to the queue
in PreCommit_Notify() and advances
queue head.

----------------------------------------------------------------------
listener wants to read notifications. It
gets into asyncQueueReadAllNotifications()
and reads the current head (it already
contains writer's notifications):

asyncQueueReadAllNotifications()
...
LWLockAcquire(NotifyQueueLock, LW_SHARED);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
...

----------------------------------------------------------------------
writer failed to commit, so it
resets queue head in AtAbort_Notify()
and completes the transaction.

----------------------------------------------------------------------
listener gets a snapshot where the writer
is not in progress.

...
snapshot = RegisterSnapshot(GetLatestSnapshot());
...

This way the listener reads the head that includes all writer's
notifications and a snapshot where the writer is not in progress, so
nothing stops the listener from sending these notifications and it's
even possible to have the listener's position that is after the queue
head, so yes, it's bad :( Sorry about that.

Yeah, this is bad. I'm wondering if we could reproduce such race
conditions scenarios with some TAP tests.

Probably we can fix it (swap GetLatestSnapshot() with the head
position reading for example) but now I think that 'committed' field
approach is a more robust way to fix the bug. What do you think?

I also agree that the committed field seems a more safe approach.

BTW one thing about 'committed' field fix version [0]. It seems that
instead of remembering all notification positions, we can remember the
head position after we acquire global lock and before we write
anything to the queue and in case of abort we can just start from the
saved position and mark all notifications until the end of the queue
as 'committed = false'. The reason is the same as with resetting the
head approach - as we hold global lock, no writers can add
notifications to the queue, so when we in AtAbort_Notify() all
notifications after the saved position are ours.

See the new attached version which implements this idea of using the
committed field approach. I was just a bit concenerd about a race
condition situation where the QUEUE_HEAD is changed by another publisher
process and just iterating over all entries from the saved previous
QUEUE_HEAD position until the current QUEUE_HEAD position we could mark
successfully committed notifications as not committed by mistake, so in
this new patch version I save the QUEUE_HEAD position before we add the
entries on the shared queue and also the QUEUE_HEAD position after these
entries are added so we ensure that we only process the entries of this
range although we have the global lock
LockSharedObject(DatabaseRelationId) that may prevent this situation.

What do you think?

--
Matheus Alcantara

Attachments:

v3-0001-Make-AsyncQueueEntry-s-self-contained.patchtext/plain; charset=utf-8; name=v3-0001-Make-AsyncQueueEntry-s-self-contained.patchDownload

From 8c1267711f26a7b32b8e2b9693469b3e407bb8a8 Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Sat, 6 Sep 2025 11:29:02 -0300
Subject: [PATCH v3] Make AsyncQueueEntry's self contained

Previously the asyncQueueProcessPageEntries() use the
TransactionIdDidCommit() to check if the transaction that a notification
belongs is committed or not. Although this work for almost all scenarios
we may have some cases where if a notification is keep for to long on
the queue and the VACUUM FREEZE is executed during this time it may
remove clog files that is needed to check the transaction status of
these notifications which will cause errors to listener backends when
reading the async queue.

This commit fix this issue by making the AsyncQueueEntry self contained
by adding the "committed" boolean field so asyncQueueProcessPageEntries()
can use this to check if the transaction of the notification is
committed or not.

We set committed as true when adding the entry on the SLRU page buffer
cache when PreCommit_Notify() is called and if an error occur before
AtCommit_Notify() the AtAbort_Notify() will be called which will mark
the committed field as false. We do this by remembering the QUEUE_HEAD
position before the PreCommit_Notify() start adding entries on the
shared queue, and if the transaction crash we iterate from this saved
position until the new QUEUE_HEAD position marking the entries as not
committed.

Also this commit include TAP tests to exercise the VACUUM FREEZE issue
and also the scenario of an error being occur between the
PreCommit_Notify() and AtCommit_Notify() calls.

Co-authored-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
---
 src/backend/commands/async.c                  | 168 +++++++++++++++++-
 src/backend/storage/lmgr/lmgr.c               |  19 ++
 src/include/storage/lmgr.h                    |   3 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_listen_notify/Makefile  |  17 ++
 .../modules/test_listen_notify/meson.build    |  14 ++
 .../test_listen_notify/t/001_xid_freeze.pl    |  66 +++++++
 .../t/002_aborted_tx_notifies.pl              |  66 +++++++
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 355 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..49c9c8b0b5c 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -180,6 +180,8 @@ typedef struct AsyncQueueEntry
 	Oid			dboid;			/* sender's database OID */
 	TransactionId xid;			/* sender's XID */
 	int32		srcPid;			/* sender's PID */
+	bool		committed;		/* Is transaction that the entry belongs
+								 * committed? */
 	char		data[NAMEDATALEN + NOTIFY_PAYLOAD_MAX_LENGTH];
 } AsyncQueueEntry;
 
@@ -401,8 +403,27 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*  Information needed by At_AbortNotify() to remove entries from the queue for crashed transactions. */
+typedef struct AtAbortNotifyInfo
+{
+	/*
+	 * head position before the transaction start adding entries on the shared
+	 * queue
+	 */
+	QueuePosition previousHead;
+
+	/*
+	 * head position after the entries from the in-progress to commit
+	 * transaction were added.
+	 */
+	QueuePosition head;
+
+} AtAbortNotifyInfo;
+
 static NotificationList *pendingNotifies = NULL;
 
+static AtAbortNotifyInfo *atAbortInfo = NULL;
+
 /*
  * Inbound notifications are initially processed by HandleNotifyInterrupt(),
  * called from inside a signal handler. That just sets the
@@ -457,6 +478,7 @@ static void AddEventToPendingNotifies(Notification *n);
 static uint32 notification_hash(const void *key, Size keysize);
 static int	notification_match(const void *key1, const void *key2, Size keysize);
 static void ClearPendingActionsAndNotifies(void);
+static void asyncQueueRollbackNotifications(void);
 
 /*
  * Compute the difference between two queue page numbers.
@@ -922,6 +944,18 @@ PreCommit_Notify(void)
 		LockSharedObject(DatabaseRelationId, InvalidOid, 0,
 						 AccessExclusiveLock);
 
+		/*
+		 * Before start adding entries on the shared queue, save the current
+		 * QUEUE_HEAD so if the current in-progress to commit transaction
+		 * crash we can mark the notifications added by this crashed
+		 * transaction as not committed. See AtAbortt_Notify() for more info.
+		 */
+		Assert(atAbortInfo == NULL);
+		atAbortInfo = palloc(sizeof(AtAbortNotifyInfo));
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->previousHead = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Now push the notifications into the queue */
 		nextNotify = list_head(pendingNotifies->events);
 		while (nextNotify != NULL)
@@ -948,6 +982,17 @@ PreCommit_Notify(void)
 			LWLockRelease(NotifyQueueLock);
 		}
 
+		/*
+		 * Save the new QUEUE_HEAD position so if another publisher add
+		 * entries on the shared queue and successfully commit the transaction
+		 * we don't change the committed status of these notifications while
+		 * marking the notification from a crashed transaction as not
+		 * committed.
+		 */
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->head = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Note that we don't clear pendingNotifies; AtCommit_Notify will. */
 	}
 }
@@ -1402,6 +1447,13 @@ asyncQueueAddEntries(ListCell *nextNotify)
 		/* Construct a valid queue entry in local variable qe */
 		asyncQueueNotificationToEntry(n, &qe);
 
+		/*
+		 * Mark the entry as committed. If the transaction that this
+		 * notification belongs fails to commit the AtAbort_Notify() will mark
+		 * this entry as not committed.
+		 */
+		qe.committed = true;
+
 		offset = QUEUE_POS_OFFSET(queue_head);
 
 		/* Check whether the entry really fits on the current page */
@@ -1678,6 +1730,16 @@ AtAbort_Notify(void)
 	if (amRegisteredListener && listenChannels == NIL)
 		asyncQueueUnregister();
 
+	/*
+	 * AtAbort_Notify information is set when we are adding entries on the
+	 * global shared queue at PreCommit_Notify(), so in case of a crash on the
+	 * transaction between the PreCommit_Notify() and AtCommit_Notify() we use
+	 * this information to mark the entries from the crashed transaction as
+	 * not committed.
+	 */
+	if (atAbortInfo != NULL)
+		asyncQueueRollbackNotifications();
+
 	/* And clean up */
 	ClearPendingActionsAndNotifies();
 }
@@ -2066,7 +2128,7 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 				reachedStop = true;
 				break;
 			}
-			else if (TransactionIdDidCommit(qe->xid))
+			else if (qe->committed)
 			{
 				/* qe->data is the null-terminated channel name */
 				char	   *channel = qe->data;
@@ -2385,6 +2447,7 @@ ClearPendingActionsAndNotifies(void)
 	 */
 	pendingActions = NULL;
 	pendingNotifies = NULL;
+	atAbortInfo = NULL;
 }
 
 /*
@@ -2395,3 +2458,106 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
 {
 	return check_slru_buffers("notify_buffers", newval);
 }
+
+
+/*
+ *  Mark notifications added on an in-progress to commit transaction as not committed.
+ *
+ * Notifications added on the shared global queue are added with committed =
+ * true during PreCommit_Notify() call. If an error occur between the
+ * PreCommit_Notify() and AtCommit_Notify() the AtAbort_Notify() will be called
+ * and we need to mark these notifications added on the shared queue by the
+ * crashed transaction as not committed so that listener backends can skip
+ * these notifications when reading the queue.
+ *
+ * We previously rely on TransactionDidCommit() to check this but if a
+ * notification is keep for too long on the queue and the VACUUM FREEZE is
+ * executed during this period it can remove clog files that is needed to check
+ * the transaction status of this notification, so we make the notification
+ * entries self contained to skip this problem.
+ *
+ */
+static void
+asyncQueueRollbackNotifications(void)
+{
+	QueuePosition current = atAbortInfo->previousHead;
+	QueuePosition head = atAbortInfo->head;
+
+	/*
+	 * Iterates from the position saved at the beginning of the transaction
+	 * (previousHead) to the current head of the queue. We do this to mark all
+	 * entries within this range as uncommitted in case of a transaction
+	 * crash.
+	 */
+	for (;;)
+	{
+		int64		curpage = QUEUE_POS_PAGE(current);
+		int			curoffset = QUEUE_POS_OFFSET(current);
+		LWLock	   *lock = SimpleLruGetBankLock(NotifyCtl, curpage);
+		int			slotno;
+
+		/*
+		 * If we have reached the head, all entries from this transaction have
+		 * been marked as not committed so break the loop.
+		 */
+		if (QUEUE_POS_EQUAL(current, head))
+			break;
+
+		/*
+		 * Acquire an exclusive lock on the current SLRU page to ensure no
+		 * other process can read or write to it while we are marking the
+		 * entries.
+		 */
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		/* Fetch the page from SLRU to mark entries as not committed. */
+		slotno = SimpleLruReadPage(NotifyCtl, curpage, true, InvalidTransactionId);
+
+		/*
+		 * Loop through all entries on the current page. The loop will
+		 * continue until we reach the end of the page or the current head.
+		 */
+		for (;;)
+		{
+			AsyncQueueEntry *qe;
+			bool		reachedEndOfPage;
+
+			/*
+			 * Check again to stop processing the entries on the current page.
+			 */
+			if (QUEUE_POS_EQUAL(current, head))
+				break;
+
+			/*
+			 * Get a pointer to the current entry within the shared page
+			 * buffer.
+			 */
+			qe = (AsyncQueueEntry *) (NotifyCtl->shared->page_buffer[slotno] + curoffset);
+
+			/*
+			 * Just for sanity, all entries on the shared queue should be
+			 * marked as not committed.
+			 */
+			Assert(qe->committed);
+
+			/*
+			 * Mark the entry as uncommitted so listener backends can skip
+			 * this notification.
+			 */
+			qe->committed = false;
+
+			/* Advance our position. */
+			reachedEndOfPage = asyncQueueAdvance(&current, qe->length);
+			if (reachedEndOfPage)
+				break;
+
+			/*
+			 * Update the offset for the next iteration within the same page.
+			 */
+			curoffset = QUEUE_POS_OFFSET(current);
+		}
+
+		/* Release the exclusive lock on the page. */
+		LWLockRelease(lock);
+	}
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 4798eb79003..12a21c51452 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -357,6 +357,25 @@ CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode, bool orstronger)
 	return LockHeldByMe(&tag, lockmode, orstronger);
 }
 
+/*
+ *  CheckSharedObjectLockedByMe
+ *
+ *  Like CheckRelationLockedByMe, but it checks for shared objects.
+ */
+bool
+CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode, bool orstronger)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_OBJECT(tag,
+					   InvalidOid,
+					   classid,
+					   InvalidOid,
+					   0);
+
+	return LockHeldByMe(&tag, lockmode, orstronger);
+}
+
 /*
  *		LockHasWaitersRelation
  *
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b7abd18397d..c119c8f4ded 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,6 +50,9 @@ extern bool CheckRelationLockedByMe(Relation relation, LOCKMODE lockmode,
 									bool orstronger);
 extern bool CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode,
 									   bool orstronger);
+extern bool CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode,
+									   bool orstronger);
+
 extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a7954101..a015c961d35 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cf..6af33448d7b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..da1bf5bb1b7
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,17 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..a68052cd353
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,14 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+      't/002_aborted_tx_notifies.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..0a5130a042e
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..17fcb4b786e
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'aborted';
+");
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send another notification after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'next_committed';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 2, 'received all committed notifications');
+like($lines[0], qr/Asynchronous notification "ch" with payload "committed" received/);
+like($lines[1], qr/Asynchronous notification "ch" with payload "next_committed" received/);
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3c80d49b67e..967c9cdf704 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -159,6 +159,7 @@ ArrayType
 AsyncQueueControl
 AsyncQueueEntry
 AsyncRequest
+AtAbortNotifyInfo
 AttInMetadata
 AttStatsSlot
 AttoptCacheEntry
-- 
2.39.5 (Apple Git-154)

#36

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

4 months ago

In reply to: Matheus Alcantara (#35)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed, Sep 24, 2025 at 1:40 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Tue Sep 23, 2025 at 1:11 PM -03, Arseniy Mukhin wrote:

Thanks for the explanation! I'm just not sure if I understand why do we
need the LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE) on
PreCommit_Notify() if we already have the
LockSharedObject(DatabaseRelationId, InvalidOid, 0, AccessExclusiveLock);

Good question. It seems that it would be enough to hold
NotifyQueueLock only during the head update, so I don't understand it
either.

IIUC correctly we acquire the LockSharedObject(DatabaseRelationId,
InvalidOid, 0, AccessExclusiveLock) to make other COMMIT's wait and the
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE) to make listener backends
wait while we are adding the entries on the queue.

Yeah, but I don't get why we want to block them with NotifyQueueLock
while we are writing to the queue. There is a SLRU bank lock and they
can't read anything after the head. Maybe it's just not a big deal?

...

This way the listener reads the head that includes all writer's
notifications and a snapshot where the writer is not in progress, so
nothing stops the listener from sending these notifications and it's
even possible to have the listener's position that is after the queue
head, so yes, it's bad :( Sorry about that.

Yeah, this is bad. I'm wondering if we could reproduce such race
conditions scenarios with some TAP tests.

I agree it would be great to have more tests for such cases. As for
the 'committed field' patch, I think we can add a TAP test that shows
that listeners postpone processing of notifications until
notifications were marked as 'committed=false' in case of aborted
transactions. I tried to write one, but have not succeeded yet. Hope
to finish it soon.

Probably we can fix it (swap GetLatestSnapshot() with the head
position reading for example) but now I think that 'committed' field
approach is a more robust way to fix the bug. What do you think?

I also agree that the committed field seems a more safe approach.

BTW one thing about 'committed' field fix version [0]. It seems that
instead of remembering all notification positions, we can remember the
head position after we acquire global lock and before we write
anything to the queue and in case of abort we can just start from the
saved position and mark all notifications until the end of the queue
as 'committed = false'. The reason is the same as with resetting the
head approach - as we hold global lock, no writers can add
notifications to the queue, so when we in AtAbort_Notify() all
notifications after the saved position are ours.

See the new attached version which implements this idea of using the
committed field approach. I was just a bit concenerd about a race
condition situation where the QUEUE_HEAD is changed by another publisher
process and just iterating over all entries from the saved previous
QUEUE_HEAD position until the current QUEUE_HEAD position we could mark
successfully committed notifications as not committed by mistake, so in
this new patch version I save the QUEUE_HEAD position before we add the
entries on the shared queue and also the QUEUE_HEAD position after these
entries are added so we ensure that we only process the entries of this
range although we have the global lock
LockSharedObject(DatabaseRelationId) that may prevent this situation.

Thank you! Speaking of the scenario, my understanding is that it's
impossible as we hold the global lock, so QueuePosition head should
always be equal to QUEUE_HEAD when we get into At_AbortNotify(), but
maybe I'm wrong.

Patch looks great. Some minor points:

I have a warning when using git am with the patch:
warning: 1 line adds whitespace errors.

There is a comment in the head of the async.c file about some
listen/notify internals. Maybe it's worth adding a comment about how
aborted transactions do clean up.

What do you think about a Assert in asyncQueueRollbackNotifications()
that other backends still see us as 'in progress'? So we can be sure
that they can't process our notifications before we mark notifications
as 'committed=false'. Not sure how to do it correctly, maybe

Assert(TransactionIdIsValid(MyProc->xid));

will work? The TAP test that I tried to write also should test it.

There are several comments where the word "crash" is used. What do you
think about using "abort" instead? "Crash" sounds more like PANIC
situation where we don't care about notifications because they don't
survive restart.

Thank you!

Best regards,
Arseniy Mukhin

#37

Matheus Alcantara

matheusssilv97@gmail.com

4 months ago

In reply to: Arseniy Mukhin (#36)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed Sep 24, 2025 at 10:59 AM -03, Arseniy Mukhin wrote:

Thank you! Speaking of the scenario, my understanding is that it's
impossible as we hold the global lock, so QueuePosition head should
always be equal to QUEUE_HEAD when we get into At_AbortNotify(), but
maybe I'm wrong.

Let's see if anyone else has any thoughts about this.

Patch looks great. Some minor points:

I have a warning when using git am with the patch:
warning: 1 line adds whitespace errors.

I don't think that this is a problem? And I don't know how to remove
this warning, I've just created the patch with git format-patch @~1

There is a comment in the head of the async.c file about some
listen/notify internals. Maybe it's worth adding a comment about how
aborted transactions do clean up.

Added

What do you think about a Assert in asyncQueueRollbackNotifications()
that other backends still see us as 'in progress'? So we can be sure
that they can't process our notifications before we mark notifications
as 'committed=false'. Not sure how to do it correctly, maybe

Assert(TransactionIdIsValid(MyProc->xid));

will work? The TAP test that I tried to write also should test it.

Sounds resanable to me. I think that we could use the following assert
when iterating over the entries, what do you think?
Assert(TransactionIdIsInProgress(qe->xid));

There are several comments where the word "crash" is used. What do you
think about using "abort" instead? "Crash" sounds more like PANIC
situation where we don't care about notifications because they don't
survive restart.

Good point, fixed

Thank you!

Thanks for reviewing this!

--
Matheus Alcantara

Attachments:

v4-0001-Make-AsyncQueueEntry-s-self-contained.patchtext/plain; charset=utf-8; name=v4-0001-Make-AsyncQueueEntry-s-self-contained.patchDownload

From e7488af5d808dac1661316cdeaace63ab3fbd7df Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Sat, 6 Sep 2025 11:29:02 -0300
Subject: [PATCH v4] Make AsyncQueueEntry's self contained

Previously the asyncQueueProcessPageEntries() use the
TransactionIdDidCommit() to check if the transaction that a notification
belongs is committed or not. Although this work for almost all scenarios
we may have some cases where if a notification is keep for to long on
the queue and the VACUUM FREEZE is executed during this time it may
remove clog files that is needed to check the transaction status of
these notifications which will cause errors to listener backends when
reading the async queue.

This commit fix this issue by making the AsyncQueueEntry self contained
by adding the "committed" boolean field so asyncQueueProcessPageEntries()
can use this to check if the transaction of the notification is
committed or not.

We set committed as true when adding the entry on the SLRU page buffer
cache when PreCommit_Notify() is called and if an error occur before
AtCommit_Notify() the AtAbort_Notify() will be called which will mark
the committed field as false. We do this by remembering the QUEUE_HEAD
position before the PreCommit_Notify() start adding entries on the
shared queue, and if the transaction crash we iterate from this saved
position until the new QUEUE_HEAD position marking the entries as not
committed.

Also this commit include TAP tests to exercise the VACUUM FREEZE issue
and also the scenario of an error being occur between the
PreCommit_Notify() and AtCommit_Notify() calls.

Co-authored-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
---
 src/backend/commands/async.c                  | 185 +++++++++++++++++-
 src/backend/storage/lmgr/lmgr.c               |  19 ++
 src/include/storage/lmgr.h                    |   3 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_listen_notify/Makefile  |  17 ++
 .../modules/test_listen_notify/meson.build    |  14 ++
 .../test_listen_notify/t/001_xid_freeze.pl    |  66 +++++++
 .../t/002_aborted_tx_notifies.pl              |  66 +++++++
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 372 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..4e7c0e16d57 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -79,6 +79,19 @@
  *	  are way behind and should be kicked to make them advance their
  *	  pointers).
  *
+ *    The notification entries added to the queue are self-contained and
+ *    include a "committed" field to inform listener backends if the associated
+ *    transaction has committed. We could use the TransactionIdDidCommit() but
+ *    if a notification remain in the queue long enough for VACUUM FREEZE to
+ *    remove the necessary pg_xact/ file, the listener backend will face errors
+ *    to get the transaction status. To prevent this, the "committed" field is
+ *    set to true during PreCommit_Notify() and if the transaction aborts between
+ *    the PreCommit_Notify() and AtCommit_Notify(), the AtAbort_Notify() is
+ *    called to mark these entries as uncommitted. To enable this, we save the
+ *    queue's head position before adding new entries from the in-progress to
+ *    commit transaction. If an abort occurs, AtAbort_Notify() uses this saved
+ *    position to find and mark the entries as uncommitted.
+ *
  *	  Finally, after we are out of the transaction altogether and about to go
  *	  idle, we scan the queue for messages that need to be sent to our
  *	  frontend (which might be notifies from other backends, or self-notifies
@@ -142,6 +155,7 @@
 #include "miscadmin.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/procarray.h"
 #include "storage/procsignal.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
@@ -180,6 +194,8 @@ typedef struct AsyncQueueEntry
 	Oid			dboid;			/* sender's database OID */
 	TransactionId xid;			/* sender's XID */
 	int32		srcPid;			/* sender's PID */
+	bool		committed;		/* Is transaction that the entry belongs
+								 * committed? */
 	char		data[NAMEDATALEN + NOTIFY_PAYLOAD_MAX_LENGTH];
 } AsyncQueueEntry;
 
@@ -401,8 +417,27 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*  Information needed by At_AbortNotify() to remove entries from the queue for aborted transactions. */
+typedef struct AtAbortNotifyInfo
+{
+	/*
+	 * head position before the transaction start adding entries on the shared
+	 * queue
+	 */
+	QueuePosition previousHead;
+
+	/*
+	 * head position after the entries from the in-progress to commit
+	 * transaction were added.
+	 */
+	QueuePosition head;
+
+} AtAbortNotifyInfo;
+
 static NotificationList *pendingNotifies = NULL;
 
+static AtAbortNotifyInfo *atAbortInfo = NULL;
+
 /*
  * Inbound notifications are initially processed by HandleNotifyInterrupt(),
  * called from inside a signal handler. That just sets the
@@ -457,6 +492,7 @@ static void AddEventToPendingNotifies(Notification *n);
 static uint32 notification_hash(const void *key, Size keysize);
 static int	notification_match(const void *key1, const void *key2, Size keysize);
 static void ClearPendingActionsAndNotifies(void);
+static void asyncQueueRollbackNotifications(void);
 
 /*
  * Compute the difference between two queue page numbers.
@@ -922,6 +958,18 @@ PreCommit_Notify(void)
 		LockSharedObject(DatabaseRelationId, InvalidOid, 0,
 						 AccessExclusiveLock);
 
+		/*
+		 * Before start adding entries on the shared queue, save the current
+		 * QUEUE_HEAD so if the current in-progress to commit transaction
+		 * abort we can mark the notifications added by this aborted
+		 * transaction as not committed. See AtAbortt_Notify() for more info.
+		 */
+		Assert(atAbortInfo == NULL);
+		atAbortInfo = palloc(sizeof(AtAbortNotifyInfo));
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->previousHead = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Now push the notifications into the queue */
 		nextNotify = list_head(pendingNotifies->events);
 		while (nextNotify != NULL)
@@ -948,6 +996,17 @@ PreCommit_Notify(void)
 			LWLockRelease(NotifyQueueLock);
 		}
 
+		/*
+		 * Save the new QUEUE_HEAD position so if another publisher add
+		 * entries on the shared queue and successfully commit the transaction
+		 * we don't change the committed status of these notifications while
+		 * marking the notification from a aborted transaction as not
+		 * committed.
+		 */
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->head = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Note that we don't clear pendingNotifies; AtCommit_Notify will. */
 	}
 }
@@ -1402,6 +1461,13 @@ asyncQueueAddEntries(ListCell *nextNotify)
 		/* Construct a valid queue entry in local variable qe */
 		asyncQueueNotificationToEntry(n, &qe);
 
+		/*
+		 * Mark the entry as committed. If the transaction that this
+		 * notification belongs fails to commit the AtAbort_Notify() will mark
+		 * this entry as not committed.
+		 */
+		qe.committed = true;
+
 		offset = QUEUE_POS_OFFSET(queue_head);
 
 		/* Check whether the entry really fits on the current page */
@@ -1678,6 +1744,16 @@ AtAbort_Notify(void)
 	if (amRegisteredListener && listenChannels == NIL)
 		asyncQueueUnregister();
 
+	/*
+	 * AtAbort_Notify information is set when we are adding entries on the
+	 * global shared queue at PreCommit_Notify(), so in case of an abort on
+	 * the transaction between the PreCommit_Notify() and AtCommit_Notify() we
+	 * use this information to mark the entries from the aborted transaction
+	 * as not committed.
+	 */
+	if (atAbortInfo != NULL)
+		asyncQueueRollbackNotifications();
+
 	/* And clean up */
 	ClearPendingActionsAndNotifies();
 }
@@ -2066,7 +2142,7 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 				reachedStop = true;
 				break;
 			}
-			else if (TransactionIdDidCommit(qe->xid))
+			else if (qe->committed)
 			{
 				/* qe->data is the null-terminated channel name */
 				char	   *channel = qe->data;
@@ -2385,6 +2461,7 @@ ClearPendingActionsAndNotifies(void)
 	 */
 	pendingActions = NULL;
 	pendingNotifies = NULL;
+	atAbortInfo = NULL;
 }
 
 /*
@@ -2395,3 +2472,109 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
 {
 	return check_slru_buffers("notify_buffers", newval);
 }
+
+
+/*
+ *  Mark notifications added on an in-progress to commit transaction as not committed.
+ *
+ * Notifications added on the shared global queue are added with committed =
+ * true during PreCommit_Notify() call. If an error occur between the
+ * PreCommit_Notify() and AtCommit_Notify() the AtAbort_Notify() will be called
+ * and we need to mark these notifications added on the shared queue by the
+ * aborted transaction as not committed so that listener backends can skip
+ * these notifications when reading the queue.
+ *
+ * We previously rely on TransactionDidCommit() to check this but if a
+ * notification is keep for too long on the queue and the VACUUM FREEZE is
+ * executed during this period it can remove clog files that is needed to check
+ * the transaction status of this notification, so we make the notification
+ * entries self contained to skip this problem.
+ *
+ */
+static void
+asyncQueueRollbackNotifications(void)
+{
+	QueuePosition current = atAbortInfo->previousHead;
+	QueuePosition head = atAbortInfo->head;
+
+	/*
+	 * Iterates from the position saved at the beginning of the transaction
+	 * (previousHead) to the current head of the queue. We do this to mark all
+	 * entries within this range as uncommitted in case of a transaction
+	 * crash.
+	 */
+	for (;;)
+	{
+		int64		curpage = QUEUE_POS_PAGE(current);
+		int			curoffset = QUEUE_POS_OFFSET(current);
+		LWLock	   *lock = SimpleLruGetBankLock(NotifyCtl, curpage);
+		int			slotno;
+
+		/*
+		 * If we have reached the head, all entries from this transaction have
+		 * been marked as not committed so break the loop.
+		 */
+		if (QUEUE_POS_EQUAL(current, head))
+			break;
+
+		/*
+		 * Acquire an exclusive lock on the current SLRU page to ensure no
+		 * other process can read or write to it while we are marking the
+		 * entries.
+		 */
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		/* Fetch the page from SLRU to mark entries as not committed. */
+		slotno = SimpleLruReadPage(NotifyCtl, curpage, true, InvalidTransactionId);
+
+		/*
+		 * Loop through all entries on the current page. The loop will
+		 * continue until we reach the end of the page or the current head.
+		 */
+		for (;;)
+		{
+			AsyncQueueEntry *qe;
+			bool		reachedEndOfPage;
+
+			/*
+			 * Check again to stop processing the entries on the current page.
+			 */
+			if (QUEUE_POS_EQUAL(current, head))
+				break;
+
+			/*
+			 * Get a pointer to the current entry within the shared page
+			 * buffer.
+			 */
+			qe = (AsyncQueueEntry *) (NotifyCtl->shared->page_buffer[slotno] + curoffset);
+
+			/*
+			 * Just for sanity, all entries on the shared queue should be
+			 * marked as not committed.
+			 */
+			Assert(qe->committed);
+
+			/* Ensure that listener backends can not see these entries */
+			Assert(TransactionIdIsInProgress(qe->xid));
+
+			/*
+			 * Mark the entry as uncommitted so listener backends can skip
+			 * this notification.
+			 */
+			qe->committed = false;
+
+			/* Advance our position. */
+			reachedEndOfPage = asyncQueueAdvance(&current, qe->length);
+			if (reachedEndOfPage)
+				break;
+
+			/*
+			 * Update the offset for the next iteration within the same page.
+			 */
+			curoffset = QUEUE_POS_OFFSET(current);
+		}
+
+		/* Release the exclusive lock on the page. */
+		LWLockRelease(lock);
+	}
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 4798eb79003..12a21c51452 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -357,6 +357,25 @@ CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode, bool orstronger)
 	return LockHeldByMe(&tag, lockmode, orstronger);
 }
 
+/*
+ *  CheckSharedObjectLockedByMe
+ *
+ *  Like CheckRelationLockedByMe, but it checks for shared objects.
+ */
+bool
+CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode, bool orstronger)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_OBJECT(tag,
+					   InvalidOid,
+					   classid,
+					   InvalidOid,
+					   0);
+
+	return LockHeldByMe(&tag, lockmode, orstronger);
+}
+
 /*
  *		LockHasWaitersRelation
  *
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b7abd18397d..c119c8f4ded 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,6 +50,9 @@ extern bool CheckRelationLockedByMe(Relation relation, LOCKMODE lockmode,
 									bool orstronger);
 extern bool CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode,
 									   bool orstronger);
+extern bool CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode,
+									   bool orstronger);
+
 extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a7954101..a015c961d35 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cf..6af33448d7b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..da1bf5bb1b7
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,17 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..a68052cd353
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,14 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+      't/002_aborted_tx_notifies.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..0a5130a042e
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..17fcb4b786e
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'aborted';
+");
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send another notification after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'next_committed';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 2, 'received all committed notifications');
+like($lines[0], qr/Asynchronous notification "ch" with payload "committed" received/);
+like($lines[1], qr/Asynchronous notification "ch" with payload "next_committed" received/);
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3c80d49b67e..967c9cdf704 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -159,6 +159,7 @@ ArrayType
 AsyncQueueControl
 AsyncQueueEntry
 AsyncRequest
+AtAbortNotifyInfo
 AttInMetadata
 AttStatsSlot
 AttoptCacheEntry
-- 
2.39.5 (Apple Git-154)

#38

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

4 months ago

In reply to: Matheus Alcantara (#37)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed, Sep 24, 2025 at 10:23 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Wed Sep 24, 2025 at 10:59 AM -03, Arseniy Mukhin wrote:

...
Patch looks great. Some minor points:

I have a warning when using git am with the patch:
warning: 1 line adds whitespace errors.

I don't think that this is a problem? And I don't know how to remove
this warning, I've just created the patch with git format-patch @~1

In my experience a run of pgindent usually fixes all such warnings,
but I agree it's not a big deal.

What do you think about a Assert in asyncQueueRollbackNotifications()
that other backends still see us as 'in progress'? So we can be sure
that they can't process our notifications before we mark notifications
as 'committed=false'. Not sure how to do it correctly, maybe

Assert(TransactionIdIsValid(MyProc->xid));

will work? The TAP test that I tried to write also should test it.

Sounds resanable to me. I think that we could use the following assert
when iterating over the entries, what do you think?
Assert(TransactionIdIsInProgress(qe->xid));

Yeah, TransactionIdIsInProgress() looks like the right way to do it.
And one more 'Assert' idea... maybe we can add
Assert(TransactionIdIsCurrentTransactionId(qe->xid)) just to verify
that we are only touching our own notifications?

Thank you, the fixes look good.

Best regards,
Arseniy Mukhin

#39

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

4 months ago

In reply to: Arseniy Mukhin (#38)

2 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Hi,

On Wed, Sep 24, 2025 at 4:59 PM Arseniy Mukhin
<arseniy.mukhin.dev@gmail.com> wrote:

On Wed, Sep 24, 2025 at 1:40 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

...

This way the listener reads the head that includes all writer's
notifications and a snapshot where the writer is not in progress, so
nothing stops the listener from sending these notifications and it's
even possible to have the listener's position that is after the queue
head, so yes, it's bad :( Sorry about that.

Yeah, this is bad. I'm wondering if we could reproduce such race
conditions scenarios with some TAP tests.

I agree it would be great to have more tests for such cases. As for
the 'committed field' patch, I think we can add a TAP test that shows
that listeners postpone processing of notifications until
notifications were marked as 'committed=false' in case of aborted
transactions. I tried to write one, but have not succeeded yet. Hope
to finish it soon.

I finally managed to write a TAP test for it, so there is a new
version with the tap test.

I also realized that we can increase test coverage in
002_aborted_tx_notifies.pl if notifications of the aborted transaction
span several pages. This way we can better test
asyncQueueRollbackNotifications(). So I changed
002_aborted_tx_notifies.pl TAP test a bit.

And there is a small indentation change in lmgr.h that should fix this
git am warning.

I added all changes as a separate patch file (0002) so it was more
clear what changed. Please feel free to merge into the main patch file
/ drop any part of it that makes sense to you.

Best regards,
Arseniy Mukhin

Attachments:

v5-0001-Make-AsyncQueueEntry-s-self-contained.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Make-AsyncQueueEntry-s-self-contained.patchDownload

From 27cb1d870a8fdcdc1cf29c26586b05c3f8819cdb Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Sat, 6 Sep 2025 11:29:02 -0300
Subject: [PATCH v5 1/2] Make AsyncQueueEntry's self contained

Previously the asyncQueueProcessPageEntries() use the
TransactionIdDidCommit() to check if the transaction that a notification
belongs is committed or not. Although this work for almost all scenarios
we may have some cases where if a notification is keep for to long on
the queue and the VACUUM FREEZE is executed during this time it may
remove clog files that is needed to check the transaction status of
these notifications which will cause errors to listener backends when
reading the async queue.

This commit fix this issue by making the AsyncQueueEntry self contained
by adding the "committed" boolean field so asyncQueueProcessPageEntries()
can use this to check if the transaction of the notification is
committed or not.

We set committed as true when adding the entry on the SLRU page buffer
cache when PreCommit_Notify() is called and if an error occur before
AtCommit_Notify() the AtAbort_Notify() will be called which will mark
the committed field as false. We do this by remembering the QUEUE_HEAD
position before the PreCommit_Notify() start adding entries on the
shared queue, and if the transaction crash we iterate from this saved
position until the new QUEUE_HEAD position marking the entries as not
committed.

Also this commit include TAP tests to exercise the VACUUM FREEZE issue
and also the scenario of an error being occur between the
PreCommit_Notify() and AtCommit_Notify() calls.

Co-authored-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
---
 src/backend/commands/async.c                  | 185 +++++++++++++++++-
 src/backend/storage/lmgr/lmgr.c               |  19 ++
 src/include/storage/lmgr.h                    |   3 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_listen_notify/Makefile  |  17 ++
 .../modules/test_listen_notify/meson.build    |  14 ++
 .../test_listen_notify/t/001_xid_freeze.pl    |  66 +++++++
 .../t/002_aborted_tx_notifies.pl              |  66 +++++++
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 372 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..4e7c0e16d57 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -79,6 +79,19 @@
  *	  are way behind and should be kicked to make them advance their
  *	  pointers).
  *
+ *    The notification entries added to the queue are self-contained and
+ *    include a "committed" field to inform listener backends if the associated
+ *    transaction has committed. We could use the TransactionIdDidCommit() but
+ *    if a notification remain in the queue long enough for VACUUM FREEZE to
+ *    remove the necessary pg_xact/ file, the listener backend will face errors
+ *    to get the transaction status. To prevent this, the "committed" field is
+ *    set to true during PreCommit_Notify() and if the transaction aborts between
+ *    the PreCommit_Notify() and AtCommit_Notify(), the AtAbort_Notify() is
+ *    called to mark these entries as uncommitted. To enable this, we save the
+ *    queue's head position before adding new entries from the in-progress to
+ *    commit transaction. If an abort occurs, AtAbort_Notify() uses this saved
+ *    position to find and mark the entries as uncommitted.
+ *
  *	  Finally, after we are out of the transaction altogether and about to go
  *	  idle, we scan the queue for messages that need to be sent to our
  *	  frontend (which might be notifies from other backends, or self-notifies
@@ -142,6 +155,7 @@
 #include "miscadmin.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/procarray.h"
 #include "storage/procsignal.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
@@ -180,6 +194,8 @@ typedef struct AsyncQueueEntry
 	Oid			dboid;			/* sender's database OID */
 	TransactionId xid;			/* sender's XID */
 	int32		srcPid;			/* sender's PID */
+	bool		committed;		/* Is transaction that the entry belongs
+								 * committed? */
 	char		data[NAMEDATALEN + NOTIFY_PAYLOAD_MAX_LENGTH];
 } AsyncQueueEntry;
 
@@ -401,8 +417,27 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*  Information needed by At_AbortNotify() to remove entries from the queue for aborted transactions. */
+typedef struct AtAbortNotifyInfo
+{
+	/*
+	 * head position before the transaction start adding entries on the shared
+	 * queue
+	 */
+	QueuePosition previousHead;
+
+	/*
+	 * head position after the entries from the in-progress to commit
+	 * transaction were added.
+	 */
+	QueuePosition head;
+
+} AtAbortNotifyInfo;
+
 static NotificationList *pendingNotifies = NULL;
 
+static AtAbortNotifyInfo *atAbortInfo = NULL;
+
 /*
  * Inbound notifications are initially processed by HandleNotifyInterrupt(),
  * called from inside a signal handler. That just sets the
@@ -457,6 +492,7 @@ static void AddEventToPendingNotifies(Notification *n);
 static uint32 notification_hash(const void *key, Size keysize);
 static int	notification_match(const void *key1, const void *key2, Size keysize);
 static void ClearPendingActionsAndNotifies(void);
+static void asyncQueueRollbackNotifications(void);
 
 /*
  * Compute the difference between two queue page numbers.
@@ -922,6 +958,18 @@ PreCommit_Notify(void)
 		LockSharedObject(DatabaseRelationId, InvalidOid, 0,
 						 AccessExclusiveLock);
 
+		/*
+		 * Before start adding entries on the shared queue, save the current
+		 * QUEUE_HEAD so if the current in-progress to commit transaction
+		 * abort we can mark the notifications added by this aborted
+		 * transaction as not committed. See AtAbortt_Notify() for more info.
+		 */
+		Assert(atAbortInfo == NULL);
+		atAbortInfo = palloc(sizeof(AtAbortNotifyInfo));
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->previousHead = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Now push the notifications into the queue */
 		nextNotify = list_head(pendingNotifies->events);
 		while (nextNotify != NULL)
@@ -948,6 +996,17 @@ PreCommit_Notify(void)
 			LWLockRelease(NotifyQueueLock);
 		}
 
+		/*
+		 * Save the new QUEUE_HEAD position so if another publisher add
+		 * entries on the shared queue and successfully commit the transaction
+		 * we don't change the committed status of these notifications while
+		 * marking the notification from a aborted transaction as not
+		 * committed.
+		 */
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->head = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Note that we don't clear pendingNotifies; AtCommit_Notify will. */
 	}
 }
@@ -1402,6 +1461,13 @@ asyncQueueAddEntries(ListCell *nextNotify)
 		/* Construct a valid queue entry in local variable qe */
 		asyncQueueNotificationToEntry(n, &qe);
 
+		/*
+		 * Mark the entry as committed. If the transaction that this
+		 * notification belongs fails to commit the AtAbort_Notify() will mark
+		 * this entry as not committed.
+		 */
+		qe.committed = true;
+
 		offset = QUEUE_POS_OFFSET(queue_head);
 
 		/* Check whether the entry really fits on the current page */
@@ -1678,6 +1744,16 @@ AtAbort_Notify(void)
 	if (amRegisteredListener && listenChannels == NIL)
 		asyncQueueUnregister();
 
+	/*
+	 * AtAbort_Notify information is set when we are adding entries on the
+	 * global shared queue at PreCommit_Notify(), so in case of an abort on
+	 * the transaction between the PreCommit_Notify() and AtCommit_Notify() we
+	 * use this information to mark the entries from the aborted transaction
+	 * as not committed.
+	 */
+	if (atAbortInfo != NULL)
+		asyncQueueRollbackNotifications();
+
 	/* And clean up */
 	ClearPendingActionsAndNotifies();
 }
@@ -2066,7 +2142,7 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 				reachedStop = true;
 				break;
 			}
-			else if (TransactionIdDidCommit(qe->xid))
+			else if (qe->committed)
 			{
 				/* qe->data is the null-terminated channel name */
 				char	   *channel = qe->data;
@@ -2385,6 +2461,7 @@ ClearPendingActionsAndNotifies(void)
 	 */
 	pendingActions = NULL;
 	pendingNotifies = NULL;
+	atAbortInfo = NULL;
 }
 
 /*
@@ -2395,3 +2472,109 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
 {
 	return check_slru_buffers("notify_buffers", newval);
 }
+
+
+/*
+ *  Mark notifications added on an in-progress to commit transaction as not committed.
+ *
+ * Notifications added on the shared global queue are added with committed =
+ * true during PreCommit_Notify() call. If an error occur between the
+ * PreCommit_Notify() and AtCommit_Notify() the AtAbort_Notify() will be called
+ * and we need to mark these notifications added on the shared queue by the
+ * aborted transaction as not committed so that listener backends can skip
+ * these notifications when reading the queue.
+ *
+ * We previously rely on TransactionDidCommit() to check this but if a
+ * notification is keep for too long on the queue and the VACUUM FREEZE is
+ * executed during this period it can remove clog files that is needed to check
+ * the transaction status of this notification, so we make the notification
+ * entries self contained to skip this problem.
+ *
+ */
+static void
+asyncQueueRollbackNotifications(void)
+{
+	QueuePosition current = atAbortInfo->previousHead;
+	QueuePosition head = atAbortInfo->head;
+
+	/*
+	 * Iterates from the position saved at the beginning of the transaction
+	 * (previousHead) to the current head of the queue. We do this to mark all
+	 * entries within this range as uncommitted in case of a transaction
+	 * crash.
+	 */
+	for (;;)
+	{
+		int64		curpage = QUEUE_POS_PAGE(current);
+		int			curoffset = QUEUE_POS_OFFSET(current);
+		LWLock	   *lock = SimpleLruGetBankLock(NotifyCtl, curpage);
+		int			slotno;
+
+		/*
+		 * If we have reached the head, all entries from this transaction have
+		 * been marked as not committed so break the loop.
+		 */
+		if (QUEUE_POS_EQUAL(current, head))
+			break;
+
+		/*
+		 * Acquire an exclusive lock on the current SLRU page to ensure no
+		 * other process can read or write to it while we are marking the
+		 * entries.
+		 */
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		/* Fetch the page from SLRU to mark entries as not committed. */
+		slotno = SimpleLruReadPage(NotifyCtl, curpage, true, InvalidTransactionId);
+
+		/*
+		 * Loop through all entries on the current page. The loop will
+		 * continue until we reach the end of the page or the current head.
+		 */
+		for (;;)
+		{
+			AsyncQueueEntry *qe;
+			bool		reachedEndOfPage;
+
+			/*
+			 * Check again to stop processing the entries on the current page.
+			 */
+			if (QUEUE_POS_EQUAL(current, head))
+				break;
+
+			/*
+			 * Get a pointer to the current entry within the shared page
+			 * buffer.
+			 */
+			qe = (AsyncQueueEntry *) (NotifyCtl->shared->page_buffer[slotno] + curoffset);
+
+			/*
+			 * Just for sanity, all entries on the shared queue should be
+			 * marked as not committed.
+			 */
+			Assert(qe->committed);
+
+			/* Ensure that listener backends can not see these entries */
+			Assert(TransactionIdIsInProgress(qe->xid));
+
+			/*
+			 * Mark the entry as uncommitted so listener backends can skip
+			 * this notification.
+			 */
+			qe->committed = false;
+
+			/* Advance our position. */
+			reachedEndOfPage = asyncQueueAdvance(&current, qe->length);
+			if (reachedEndOfPage)
+				break;
+
+			/*
+			 * Update the offset for the next iteration within the same page.
+			 */
+			curoffset = QUEUE_POS_OFFSET(current);
+		}
+
+		/* Release the exclusive lock on the page. */
+		LWLockRelease(lock);
+	}
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 4798eb79003..12a21c51452 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -357,6 +357,25 @@ CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode, bool orstronger)
 	return LockHeldByMe(&tag, lockmode, orstronger);
 }
 
+/*
+ *  CheckSharedObjectLockedByMe
+ *
+ *  Like CheckRelationLockedByMe, but it checks for shared objects.
+ */
+bool
+CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode, bool orstronger)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_OBJECT(tag,
+					   InvalidOid,
+					   classid,
+					   InvalidOid,
+					   0);
+
+	return LockHeldByMe(&tag, lockmode, orstronger);
+}
+
 /*
  *		LockHasWaitersRelation
  *
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b7abd18397d..c119c8f4ded 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,6 +50,9 @@ extern bool CheckRelationLockedByMe(Relation relation, LOCKMODE lockmode,
 									bool orstronger);
 extern bool CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode,
 									   bool orstronger);
+extern bool CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode,
+									   bool orstronger);
+
 extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a7954101..a015c961d35 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cf..6af33448d7b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..da1bf5bb1b7
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,17 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..a68052cd353
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,14 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+      't/002_aborted_tx_notifies.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..0a5130a042e
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..17fcb4b786e
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'aborted';
+");
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send another notification after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'next_committed';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 2, 'received all committed notifications');
+like($lines[0], qr/Asynchronous notification "ch" with payload "committed" received/);
+like($lines[1], qr/Asynchronous notification "ch" with payload "next_committed" received/);
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e90af5b2ad3..e1c2384aa3d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -159,6 +159,7 @@ ArrayType
 AsyncQueueControl
 AsyncQueueEntry
 AsyncRequest
+AtAbortNotifyInfo
 AttInMetadata
 AttStatsSlot
 AttoptCacheEntry
-- 
2.43.0

v5-0002-tap-tests.patchtext/x-patch; charset=US-ASCII; name=v5-0002-tap-tests.patchDownload

From 00b372089d9c01c66b5e919187238ffb6ce45f0d Mon Sep 17 00:00:00 2001
From: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Date: Fri, 26 Sep 2025 13:17:59 +0300
Subject: [PATCH v5 2/2] tap tests

---
 src/backend/commands/async.c                  |   5 +
 src/include/storage/lmgr.h                    |   2 +-
 .../modules/test_listen_notify/meson.build    |   6 +-
 .../t/002_aborted_tx_notifies.pl              |  32 +++--
 ...tpone_in_progress_aborted_notifications.pl | 114 ++++++++++++++++++
 5 files changed, 146 insertions(+), 13 deletions(-)
 create mode 100644 src/test/modules/test_listen_notify/t/003_postpone_in_progress_aborted_notifications.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e7c0e16d57..f6f8fa6134c 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -164,6 +164,7 @@
 #include "utils/ps_status.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -2138,6 +2139,8 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 				 * because our transaction cannot (yet) have queued any
 				 * messages.
 				 */
+
+				INJECTION_POINT("listen-notify-in-progress-notification", NULL);
 				*current = thisentry;
 				reachedStop = true;
 				break;
@@ -2497,6 +2500,8 @@ asyncQueueRollbackNotifications(void)
 	QueuePosition current = atAbortInfo->previousHead;
 	QueuePosition head = atAbortInfo->head;
 
+	INJECTION_POINT("listen-notify-notifications-rollback", NULL);
+
 	/*
 	 * Iterates from the position saved at the beginning of the transaction
 	 * (previousHead) to the current head of the queue. We do this to mark all
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index c119c8f4ded..3d34e61772d 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -51,7 +51,7 @@ extern bool CheckRelationLockedByMe(Relation relation, LOCKMODE lockmode,
 extern bool CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode,
 									   bool orstronger);
 extern bool CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode,
-									   bool orstronger);
+										bool orstronger);
 
 extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
index a68052cd353..565f4ec8ef0 100644
--- a/src/test/modules/test_listen_notify/meson.build
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -5,9 +5,13 @@ tests += {
   'sd': meson.current_source_dir(),
   'bd': meson.current_build_dir(),
   'tap': {
+    'env': {
+      'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
+    },
     'tests': [
       't/001_xid_freeze.pl',
-      't/002_aborted_tx_notifies.pl'
+      't/002_aborted_tx_notifies.pl',
+      't/003_postpone_in_progress_aborted_notifications.pl'
     ],
   },
 }
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
index 17fcb4b786e..74c4ae9fa9d 100644
--- a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -7,15 +7,15 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
-my $node = PostgreSQL::Test::Cluster->new('node');
-$node->init;
-$node->start;
-
 # Test checks that listeners do not receive notifications from aborted
 # transaction even if notifications have been added to the listen/notify
 # queue. To reproduce it we use the fact that serializable conflicts
 # are checked after tx adds notifications to the queue.
 
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
 # Setup
 $node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
 
@@ -29,7 +29,8 @@ $psql_session1->query_safe("
 	BEGIN ISOLATION LEVEL SERIALIZABLE;
 	SELECT * FROM t1;
 	INSERT INTO t1 DEFAULT VALUES;
-	NOTIFY ch,'committed';
+	NOTIFY ch,'committed_0';
+	NOTIFY ch,'committed_1';
 ");
 
 # Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
@@ -39,9 +40,15 @@ $psql_session2->query_safe("
 	BEGIN ISOLATION LEVEL SERIALIZABLE;
 	SELECT * FROM t1;
 	INSERT INTO t1 DEFAULT VALUES;
-	NOTIFY ch,'aborted';
 ");
 
+# Send notifications that should not be eventually delivered, as session2
+# transaction will be aborted.
+my $message = 'aborted_' . 'a' x 1000;
+for (my $i = 0; $i < 10; $i++) {
+    $psql_session2->query_safe("NOTIFY ch, '$i$message'");
+}
+
 # Session1 should be committed successfully. Listeners must receive session1
 # notifications.
 $psql_session1->query_safe("COMMIT;");
@@ -51,16 +58,19 @@ $psql_session1->query_safe("COMMIT;");
 # listeners should not receive session2 notifications.
 $psql_session2->query("COMMIT;");
 
-# send another notification after aborted
-$node->safe_psql('postgres', "NOTIFY ch, 'next_committed';");
+# send more notifications after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_2';");
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_3';");
 
 # fetch notifications
 my $res = $psql_listener->query_safe('begin; commit;');
 
 # check received notifications
 my @lines = split('\n', $res);
-is(@lines, 2, 'received all committed notifications');
-like($lines[0], qr/Asynchronous notification "ch" with payload "committed" received/);
-like($lines[1], qr/Asynchronous notification "ch" with payload "next_committed" received/);
+is(@lines, 4, 'received all committed notifications');
+for (my $i = 0; $i < 4; $i++) {
+    like($lines[$i], qr/Asynchronous notification "ch" with payload "committed_$i" received/);
+}
+
 
 done_testing();
diff --git a/src/test/modules/test_listen_notify/t/003_postpone_in_progress_aborted_notifications.pl b/src/test/modules/test_listen_notify/t/003_postpone_in_progress_aborted_notifications.pl
new file mode 100644
index 00000000000..b53e6a1cdaa
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/003_postpone_in_progress_aborted_notifications.pl
@@ -0,0 +1,114 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test checks that listeners do not receive notifications from aborted
+# even if the see such notifications in the queue right before the moment
+# aborted transaction marks its notifications as 'committed = false'. To
+# reproduce it we use the fact that serializable conflicts are checked after
+# tx adds notifications to the queue.
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+    plan skip_all => 'Injection points not supported by this build';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+    plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points');
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Injection points setup
+$node->safe_psql('postgres',"SELECT injection_points_attach('listen-notify-notifications-rollback', 'wait')");
+$node->safe_psql('postgres',"SELECT injection_points_attach('listen-notify-in-progress-notification', 'notice')");
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'aborted';
+");
+
+# Session1 should be committed successfully and publish notification.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, so we have notification of
+# aborted transaction in the queue. Session2 should start sleeping on injection point right before
+# marking its notifications as 'committed = false'
+$psql_session2->query_until(
+    qr/start/, q(
+    \echo start
+    COMMIT;
+));
+
+$node->wait_for_event('client backend', 'listen-notify-notifications-rollback');
+
+# Setup listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe("SET log_min_messages = 'NOTICE'");
+
+# At the moment listener should skip the first committed notification (as it was committed before we started listening)
+# and stop on the first aborted notification as notify transaction is still in progress and backend is sleeping.
+my $log_offset = -s $node->logfile;
+my $res = $psql_listener->query('LISTEN ch;');
+
+# Check that we touched session2's pending notification and triggered injection point
+$node->wait_for_log(qr/notice triggered for injection point listen-notify-in-progress-notification/, $log_offset);
+
+# Check listener has no notifications
+my @lines = split('\n', $res);
+is(@lines, 0,  'received no notifications');
+
+# Wakeup backend with aborted transaction
+$node->safe_psql('postgres',"SELECT injection_points_wakeup('listen-notify-notifications-rollback');");
+
+# Send another notification after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'next_committed'");
+
+# Clean stderr as workaround of the current bug with background psql hanging
+$psql_listener->{stderr} = "";
+
+# Fetch the rest notifications
+$res = $psql_listener->query_safe('begin; commit;');
+
+
+# Now aborted transaction is completed, so listener must skip aborted notifications and get next_committed
+@lines = split('\n', $res);
+is(@lines, 1,  'received all committed notifications');
+like($lines[0], qr/Asynchronous notification "ch" with payload "next_committed" received/);
+
+# Injection points cleanup
+$node->safe_psql('postgres',"SELECT injection_points_detach('listen-notify-notifications-rollback');");
+$node->safe_psql('postgres',"SELECT injection_points_detach('listen-notify-in-progress-notification');");
+
+done_testing();
-- 
2.43.0

#40

Matheus Alcantara

matheusssilv97@gmail.com

3 months ago

In reply to: Arseniy Mukhin (#39)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Sun Sep 28, 2025 at 10:17 AM -03, Arseniy Mukhin wrote:

I agree it would be great to have more tests for such cases. As for
the 'committed field' patch, I think we can add a TAP test that shows
that listeners postpone processing of notifications until
notifications were marked as 'committed=false' in case of aborted
transactions. I tried to write one, but have not succeeded yet. Hope
to finish it soon.

I finally managed to write a TAP test for it, so there is a new
version with the tap test.

I also realized that we can increase test coverage in
002_aborted_tx_notifies.pl if notifications of the aborted transaction
span several pages. This way we can better test
asyncQueueRollbackNotifications(). So I changed
002_aborted_tx_notifies.pl TAP test a bit.

And there is a small indentation change in lmgr.h that should fix this
git am warning.

Thanks for the patches.

I've created a CF entry so we can get more reviews and comments:
https://commitfest.postgresql.org/patch/6095/

Let's see what other think about the approach being used to fix this
issue.

--
Matheus Alcantara

#41

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

3 months ago

In reply to: Matheus Alcantara (#40)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Hi,

On Wed, Oct 1, 2025 at 2:57 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Sun Sep 28, 2025 at 10:17 AM -03, Arseniy Mukhin wrote:

I agree it would be great to have more tests for such cases. As for
the 'committed field' patch, I think we can add a TAP test that shows
that listeners postpone processing of notifications until
notifications were marked as 'committed=false' in case of aborted
transactions. I tried to write one, but have not succeeded yet. Hope
to finish it soon.

I finally managed to write a TAP test for it, so there is a new
version with the tap test.

I also realized that we can increase test coverage in
002_aborted_tx_notifies.pl if notifications of the aborted transaction
span several pages. This way we can better test
asyncQueueRollbackNotifications(). So I changed
002_aborted_tx_notifies.pl TAP test a bit.

And there is a small indentation change in lmgr.h that should fix this
git am warning.

Thanks for the patches.

I've created a CF entry so we can get more reviews and comments:
https://commitfest.postgresql.org/patch/6095/

Thank you.

There is a test failure on CI, so please find the new patch version
with the fix (Makefile was updated a little bit). And I merged 0002
file with tap tests into 0001.

Best regards,
Arseniy Mukhin

Attachments:

v6-0001-Make-AsyncQueueEntry-s-self-contained.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Make-AsyncQueueEntry-s-self-contained.patchDownload

From 50eeae2591f0ce95b4d5ec41076d3906327da053 Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Sat, 6 Sep 2025 11:29:02 -0300
Subject: [PATCH v6] Make AsyncQueueEntry's self contained

Previously the asyncQueueProcessPageEntries() use the
TransactionIdDidCommit() to check if the transaction that a notification
belongs is committed or not. Although this work for almost all scenarios
we may have some cases where if a notification is keep for to long on
the queue and the VACUUM FREEZE is executed during this time it may
remove clog files that is needed to check the transaction status of
these notifications which will cause errors to listener backends when
reading the async queue.

This commit fix this issue by making the AsyncQueueEntry self contained
by adding the "committed" boolean field so asyncQueueProcessPageEntries()
can use this to check if the transaction of the notification is
committed or not.

We set committed as true when adding the entry on the SLRU page buffer
cache when PreCommit_Notify() is called and if an error occur before
AtCommit_Notify() the AtAbort_Notify() will be called which will mark
the committed field as false. We do this by remembering the QUEUE_HEAD
position before the PreCommit_Notify() start adding entries on the
shared queue, and if the transaction crash we iterate from this saved
position until the new QUEUE_HEAD position marking the entries as not
committed.

Also this commit include TAP tests to exercise the VACUUM FREEZE issue
and also the scenario of an error being occur between the
PreCommit_Notify() and AtCommit_Notify() calls.

Co-authored-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
---
 src/backend/commands/async.c                  | 190 +++++++++++++++++-
 src/backend/storage/lmgr/lmgr.c               |  19 ++
 src/include/storage/lmgr.h                    |   3 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_listen_notify/Makefile  |  22 ++
 .../modules/test_listen_notify/meson.build    |  18 ++
 .../test_listen_notify/t/001_xid_freeze.pl    |  74 +++++++
 .../t/002_aborted_tx_notifies.pl              |  79 ++++++++
 ...tpone_in_progress_aborted_notifications.pl | 118 +++++++++++
 src/tools/pgindent/typedefs.list              |   1 +
 11 files changed, 525 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
 create mode 100644 src/test/modules/test_listen_notify/t/003_postpone_in_progress_aborted_notifications.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..f6f8fa6134c 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -79,6 +79,19 @@
  *	  are way behind and should be kicked to make them advance their
  *	  pointers).
  *
+ *    The notification entries added to the queue are self-contained and
+ *    include a "committed" field to inform listener backends if the associated
+ *    transaction has committed. We could use the TransactionIdDidCommit() but
+ *    if a notification remain in the queue long enough for VACUUM FREEZE to
+ *    remove the necessary pg_xact/ file, the listener backend will face errors
+ *    to get the transaction status. To prevent this, the "committed" field is
+ *    set to true during PreCommit_Notify() and if the transaction aborts between
+ *    the PreCommit_Notify() and AtCommit_Notify(), the AtAbort_Notify() is
+ *    called to mark these entries as uncommitted. To enable this, we save the
+ *    queue's head position before adding new entries from the in-progress to
+ *    commit transaction. If an abort occurs, AtAbort_Notify() uses this saved
+ *    position to find and mark the entries as uncommitted.
+ *
  *	  Finally, after we are out of the transaction altogether and about to go
  *	  idle, we scan the queue for messages that need to be sent to our
  *	  frontend (which might be notifies from other backends, or self-notifies
@@ -142,6 +155,7 @@
 #include "miscadmin.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/procarray.h"
 #include "storage/procsignal.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
@@ -150,6 +164,7 @@
 #include "utils/ps_status.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -180,6 +195,8 @@ typedef struct AsyncQueueEntry
 	Oid			dboid;			/* sender's database OID */
 	TransactionId xid;			/* sender's XID */
 	int32		srcPid;			/* sender's PID */
+	bool		committed;		/* Is transaction that the entry belongs
+								 * committed? */
 	char		data[NAMEDATALEN + NOTIFY_PAYLOAD_MAX_LENGTH];
 } AsyncQueueEntry;
 
@@ -401,8 +418,27 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*  Information needed by At_AbortNotify() to remove entries from the queue for aborted transactions. */
+typedef struct AtAbortNotifyInfo
+{
+	/*
+	 * head position before the transaction start adding entries on the shared
+	 * queue
+	 */
+	QueuePosition previousHead;
+
+	/*
+	 * head position after the entries from the in-progress to commit
+	 * transaction were added.
+	 */
+	QueuePosition head;
+
+} AtAbortNotifyInfo;
+
 static NotificationList *pendingNotifies = NULL;
 
+static AtAbortNotifyInfo *atAbortInfo = NULL;
+
 /*
  * Inbound notifications are initially processed by HandleNotifyInterrupt(),
  * called from inside a signal handler. That just sets the
@@ -457,6 +493,7 @@ static void AddEventToPendingNotifies(Notification *n);
 static uint32 notification_hash(const void *key, Size keysize);
 static int	notification_match(const void *key1, const void *key2, Size keysize);
 static void ClearPendingActionsAndNotifies(void);
+static void asyncQueueRollbackNotifications(void);
 
 /*
  * Compute the difference between two queue page numbers.
@@ -922,6 +959,18 @@ PreCommit_Notify(void)
 		LockSharedObject(DatabaseRelationId, InvalidOid, 0,
 						 AccessExclusiveLock);
 
+		/*
+		 * Before start adding entries on the shared queue, save the current
+		 * QUEUE_HEAD so if the current in-progress to commit transaction
+		 * abort we can mark the notifications added by this aborted
+		 * transaction as not committed. See AtAbortt_Notify() for more info.
+		 */
+		Assert(atAbortInfo == NULL);
+		atAbortInfo = palloc(sizeof(AtAbortNotifyInfo));
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->previousHead = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Now push the notifications into the queue */
 		nextNotify = list_head(pendingNotifies->events);
 		while (nextNotify != NULL)
@@ -948,6 +997,17 @@ PreCommit_Notify(void)
 			LWLockRelease(NotifyQueueLock);
 		}
 
+		/*
+		 * Save the new QUEUE_HEAD position so if another publisher add
+		 * entries on the shared queue and successfully commit the transaction
+		 * we don't change the committed status of these notifications while
+		 * marking the notification from a aborted transaction as not
+		 * committed.
+		 */
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->head = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Note that we don't clear pendingNotifies; AtCommit_Notify will. */
 	}
 }
@@ -1402,6 +1462,13 @@ asyncQueueAddEntries(ListCell *nextNotify)
 		/* Construct a valid queue entry in local variable qe */
 		asyncQueueNotificationToEntry(n, &qe);
 
+		/*
+		 * Mark the entry as committed. If the transaction that this
+		 * notification belongs fails to commit the AtAbort_Notify() will mark
+		 * this entry as not committed.
+		 */
+		qe.committed = true;
+
 		offset = QUEUE_POS_OFFSET(queue_head);
 
 		/* Check whether the entry really fits on the current page */
@@ -1678,6 +1745,16 @@ AtAbort_Notify(void)
 	if (amRegisteredListener && listenChannels == NIL)
 		asyncQueueUnregister();
 
+	/*
+	 * AtAbort_Notify information is set when we are adding entries on the
+	 * global shared queue at PreCommit_Notify(), so in case of an abort on
+	 * the transaction between the PreCommit_Notify() and AtCommit_Notify() we
+	 * use this information to mark the entries from the aborted transaction
+	 * as not committed.
+	 */
+	if (atAbortInfo != NULL)
+		asyncQueueRollbackNotifications();
+
 	/* And clean up */
 	ClearPendingActionsAndNotifies();
 }
@@ -2062,11 +2139,13 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 				 * because our transaction cannot (yet) have queued any
 				 * messages.
 				 */
+
+				INJECTION_POINT("listen-notify-in-progress-notification", NULL);
 				*current = thisentry;
 				reachedStop = true;
 				break;
 			}
-			else if (TransactionIdDidCommit(qe->xid))
+			else if (qe->committed)
 			{
 				/* qe->data is the null-terminated channel name */
 				char	   *channel = qe->data;
@@ -2385,6 +2464,7 @@ ClearPendingActionsAndNotifies(void)
 	 */
 	pendingActions = NULL;
 	pendingNotifies = NULL;
+	atAbortInfo = NULL;
 }
 
 /*
@@ -2395,3 +2475,111 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
 {
 	return check_slru_buffers("notify_buffers", newval);
 }
+
+
+/*
+ *  Mark notifications added on an in-progress to commit transaction as not committed.
+ *
+ * Notifications added on the shared global queue are added with committed =
+ * true during PreCommit_Notify() call. If an error occur between the
+ * PreCommit_Notify() and AtCommit_Notify() the AtAbort_Notify() will be called
+ * and we need to mark these notifications added on the shared queue by the
+ * aborted transaction as not committed so that listener backends can skip
+ * these notifications when reading the queue.
+ *
+ * We previously rely on TransactionDidCommit() to check this but if a
+ * notification is keep for too long on the queue and the VACUUM FREEZE is
+ * executed during this period it can remove clog files that is needed to check
+ * the transaction status of this notification, so we make the notification
+ * entries self contained to skip this problem.
+ *
+ */
+static void
+asyncQueueRollbackNotifications(void)
+{
+	QueuePosition current = atAbortInfo->previousHead;
+	QueuePosition head = atAbortInfo->head;
+
+	INJECTION_POINT("listen-notify-notifications-rollback", NULL);
+
+	/*
+	 * Iterates from the position saved at the beginning of the transaction
+	 * (previousHead) to the current head of the queue. We do this to mark all
+	 * entries within this range as uncommitted in case of a transaction
+	 * crash.
+	 */
+	for (;;)
+	{
+		int64		curpage = QUEUE_POS_PAGE(current);
+		int			curoffset = QUEUE_POS_OFFSET(current);
+		LWLock	   *lock = SimpleLruGetBankLock(NotifyCtl, curpage);
+		int			slotno;
+
+		/*
+		 * If we have reached the head, all entries from this transaction have
+		 * been marked as not committed so break the loop.
+		 */
+		if (QUEUE_POS_EQUAL(current, head))
+			break;
+
+		/*
+		 * Acquire an exclusive lock on the current SLRU page to ensure no
+		 * other process can read or write to it while we are marking the
+		 * entries.
+		 */
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		/* Fetch the page from SLRU to mark entries as not committed. */
+		slotno = SimpleLruReadPage(NotifyCtl, curpage, true, InvalidTransactionId);
+
+		/*
+		 * Loop through all entries on the current page. The loop will
+		 * continue until we reach the end of the page or the current head.
+		 */
+		for (;;)
+		{
+			AsyncQueueEntry *qe;
+			bool		reachedEndOfPage;
+
+			/*
+			 * Check again to stop processing the entries on the current page.
+			 */
+			if (QUEUE_POS_EQUAL(current, head))
+				break;
+
+			/*
+			 * Get a pointer to the current entry within the shared page
+			 * buffer.
+			 */
+			qe = (AsyncQueueEntry *) (NotifyCtl->shared->page_buffer[slotno] + curoffset);
+
+			/*
+			 * Just for sanity, all entries on the shared queue should be
+			 * marked as not committed.
+			 */
+			Assert(qe->committed);
+
+			/* Ensure that listener backends can not see these entries */
+			Assert(TransactionIdIsInProgress(qe->xid));
+
+			/*
+			 * Mark the entry as uncommitted so listener backends can skip
+			 * this notification.
+			 */
+			qe->committed = false;
+
+			/* Advance our position. */
+			reachedEndOfPage = asyncQueueAdvance(&current, qe->length);
+			if (reachedEndOfPage)
+				break;
+
+			/*
+			 * Update the offset for the next iteration within the same page.
+			 */
+			curoffset = QUEUE_POS_OFFSET(current);
+		}
+
+		/* Release the exclusive lock on the page. */
+		LWLockRelease(lock);
+	}
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 4798eb79003..12a21c51452 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -357,6 +357,25 @@ CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode, bool orstronger)
 	return LockHeldByMe(&tag, lockmode, orstronger);
 }
 
+/*
+ *  CheckSharedObjectLockedByMe
+ *
+ *  Like CheckRelationLockedByMe, but it checks for shared objects.
+ */
+bool
+CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode, bool orstronger)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_OBJECT(tag,
+					   InvalidOid,
+					   classid,
+					   InvalidOid,
+					   0);
+
+	return LockHeldByMe(&tag, lockmode, orstronger);
+}
+
 /*
  *		LockHasWaitersRelation
  *
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index b7abd18397d..3d34e61772d 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,6 +50,9 @@ extern bool CheckRelationLockedByMe(Relation relation, LOCKMODE lockmode,
 									bool orstronger);
 extern bool CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode,
 									   bool orstronger);
+extern bool CheckSharedObjectLockedByMe(Oid classid, LOCKMODE lockmode,
+										bool orstronger);
+
 extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a7954101..a015c961d35 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cf..6af33448d7b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..01f396005f9
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,22 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+EXTRA_INSTALL=src/test/modules/injection_points \
+	src/test/modules/xid_wraparound
+
+export enable_injection_points
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..565f4ec8ef0
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'env': {
+      'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
+    },
+    'tests': [
+      't/001_xid_freeze.pl',
+      't/002_aborted_tx_notifies.pl',
+      't/003_postpone_in_progress_aborted_notifications.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..a8bbd268c0f
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,74 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Check if the extension xid_wraparound is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('xid_wraparound'))
+{
+	plan skip_all => 'Extension xid_wraparound not installed';
+}
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..dae7a24f5b2
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,79 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed_0';
+	NOTIFY ch,'committed_1';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+");
+
+# Send notifications that should not be eventually delivered, as session2
+# transaction will be aborted.
+my $message = 'aborted_' . 'a' x 1000;
+for (my $i = 0; $i < 10; $i++) {
+    $psql_session2->query_safe("NOTIFY ch, '$i$message'");
+}
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send more notifications after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_2';");
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_3';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 4, 'received all committed notifications');
+for (my $i = 0; $i < 4; $i++) {
+    like($lines[$i], qr/Asynchronous notification "ch" with payload "committed_$i" received/);
+}
+
+ok($psql_listener->quit);
+ok($psql_session1->quit);
+ok($psql_session2->quit);
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/003_postpone_in_progress_aborted_notifications.pl b/src/test/modules/test_listen_notify/t/003_postpone_in_progress_aborted_notifications.pl
new file mode 100644
index 00000000000..8bfcae9cbf3
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/003_postpone_in_progress_aborted_notifications.pl
@@ -0,0 +1,118 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test checks that listeners do not receive notifications from aborted transactions
+# even if they see such notifications in the queue right before the moment
+# aborted transaction marks its notifications as 'committed = false'. To
+# reproduce it we use the fact that serializable conflicts are checked after
+# tx adds notifications to the queue.
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+    plan skip_all => 'Injection points not supported by this build';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+    plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points');
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Injection points setup
+$node->safe_psql('postgres',"SELECT injection_points_attach('listen-notify-notifications-rollback', 'wait')");
+$node->safe_psql('postgres',"SELECT injection_points_attach('listen-notify-in-progress-notification', 'notice')");
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'aborted';
+");
+
+# Session1 should be committed successfully and publish notification.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, so we have notification of
+# aborted transaction in the queue. Session2 should start sleeping on injection point right before
+# marking its notifications as 'committed = false'
+$psql_session2->query_until(
+    qr/start/, q(
+    \echo start
+    COMMIT;
+));
+
+$node->wait_for_event('client backend', 'listen-notify-notifications-rollback');
+
+# Setup listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe("SET log_min_messages = 'NOTICE'");
+
+# At the moment listener should skip the first committed notification (as it was committed before we started listening)
+# and stop on the first aborted notification as notify transaction is still in progress and backend is sleeping.
+my $log_offset = -s $node->logfile;
+my $res = $psql_listener->query('LISTEN ch;');
+
+# Check that we touched session2's pending notification and triggered injection point
+$node->wait_for_log(qr/notice triggered for injection point listen-notify-in-progress-notification/, $log_offset);
+
+# Check listener has no notifications
+my @lines = split('\n', $res);
+is(@lines, 0,  'received no notifications');
+
+# Wakeup backend with aborted transaction
+$node->safe_psql('postgres',"SELECT injection_points_wakeup('listen-notify-notifications-rollback');");
+
+# Send another notification after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'next_committed'");
+
+# Clean stderr as workaround of the current bug with background psql hanging
+$psql_listener->{stderr} = "";
+
+# Fetch the rest notifications
+$res = $psql_listener->query_safe('begin; commit;');
+
+
+# Now aborted transaction is completed, so listener must skip aborted notifications and get next_committed
+@lines = split('\n', $res);
+is(@lines, 1,  'received all committed notifications');
+like($lines[0], qr/Asynchronous notification "ch" with payload "next_committed" received/);
+
+# Injection points cleanup
+$node->safe_psql('postgres',"SELECT injection_points_detach('listen-notify-notifications-rollback');");
+$node->safe_psql('postgres',"SELECT injection_points_detach('listen-notify-in-progress-notification');");
+
+ok($psql_listener->quit);
+ok($psql_session1->quit);
+ok($psql_session2->quit);
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 37f26f6c6b7..3490195f45d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -159,6 +159,7 @@ ArrayType
 AsyncQueueControl
 AsyncQueueEntry
 AsyncRequest
+AtAbortNotifyInfo
 AttInMetadata
 AttStatsSlot
 AttoptCacheEntry
-- 
2.43.0

#42

Álvaro Herrera

alvherre@kurilemu.de

3 months ago

In reply to: Arseniy Mukhin (#41)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

I have the impression that this thread has lost focus on the idea of
producing a backpatchable bugfix. The last proposed patch has a lot of
new mechanism that doesn't seem suitable for backpatch. I could be
wrong of course.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere." (Lamar Owen)

#43

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

3 months ago

In reply to: Álvaro Herrera (#42)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri, Oct 17, 2025 at 7:31 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

I have the impression that this thread has lost focus on the idea of
producing a backpatchable bugfix. The last proposed patch has a lot of
new mechanism that doesn't seem suitable for backpatch. I could be
wrong of course.

Oops, I guess the TAP test that I added in the last version and that
uses injection points is one of those things. PFA the new version
without it. I also noticed that CheckSharedObjectLockedByMe() which
was introduced in one of previous versions is unused, so I removed it
too.

Best regards,
Arseniy Mukhin

Attachments:

v7-0001-Make-AsyncQueueEntry-s-self-contained.patchtext/x-patch; charset=US-ASCII; name=v7-0001-Make-AsyncQueueEntry-s-self-contained.patchDownload

From c3c50d4e5a8ce80cabf43f3e3bd603c0d26010ac Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Sat, 6 Sep 2025 11:29:02 -0300
Subject: [PATCH v7] Make AsyncQueueEntry's self contained

Previously the asyncQueueProcessPageEntries() use the
TransactionIdDidCommit() to check if the transaction that a notification
belongs is committed or not. Although this work for almost all scenarios
we may have some cases where if a notification is keep for to long on
the queue and the VACUUM FREEZE is executed during this time it may
remove clog files that is needed to check the transaction status of
these notifications which will cause errors to listener backends when
reading the async queue.

This commit fix this issue by making the AsyncQueueEntry self contained
by adding the "committed" boolean field so asyncQueueProcessPageEntries()
can use this to check if the transaction of the notification is
committed or not.

We set committed as true when adding the entry on the SLRU page buffer
cache when PreCommit_Notify() is called and if an error occur before
AtCommit_Notify() the AtAbort_Notify() will be called which will mark
the committed field as false. We do this by remembering the QUEUE_HEAD
position before the PreCommit_Notify() start adding entries on the
shared queue, and if the transaction crash we iterate from this saved
position until the new QUEUE_HEAD position marking the entries as not
committed.

Also this commit include TAP tests to exercise the VACUUM FREEZE issue
and also the scenario of an error being occur between the
PreCommit_Notify() and AtCommit_Notify() calls.

Co-authored-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
---
 src/backend/commands/async.c                  | 186 +++++++++++++++++-
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_listen_notify/Makefile  |  19 ++
 .../modules/test_listen_notify/meson.build    |  14 ++
 .../test_listen_notify/t/001_xid_freeze.pl    |  74 +++++++
 .../t/002_aborted_tx_notifies.pl              |  79 ++++++++
 src/tools/pgindent/typedefs.list              |   1 +
 8 files changed, 374 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..c09fb6e5cd9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -79,6 +79,19 @@
  *	  are way behind and should be kicked to make them advance their
  *	  pointers).
  *
+ *    The notification entries added to the queue are self-contained and
+ *    include a "committed" field to inform listener backends if the associated
+ *    transaction has committed. We could use the TransactionIdDidCommit() but
+ *    if a notification remain in the queue long enough for VACUUM FREEZE to
+ *    remove the necessary pg_xact/ file, the listener backend will face errors
+ *    to get the transaction status. To prevent this, the "committed" field is
+ *    set to true during PreCommit_Notify() and if the transaction aborts between
+ *    the PreCommit_Notify() and AtCommit_Notify(), the AtAbort_Notify() is
+ *    called to mark these entries as uncommitted. To enable this, we save the
+ *    queue's head position before adding new entries from the in-progress to
+ *    commit transaction. If an abort occurs, AtAbort_Notify() uses this saved
+ *    position to find and mark the entries as uncommitted.
+ *
  *	  Finally, after we are out of the transaction altogether and about to go
  *	  idle, we scan the queue for messages that need to be sent to our
  *	  frontend (which might be notifies from other backends, or self-notifies
@@ -142,6 +155,7 @@
 #include "miscadmin.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/procarray.h"
 #include "storage/procsignal.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
@@ -180,6 +194,8 @@ typedef struct AsyncQueueEntry
 	Oid			dboid;			/* sender's database OID */
 	TransactionId xid;			/* sender's XID */
 	int32		srcPid;			/* sender's PID */
+	bool		committed;		/* Is transaction that the entry belongs
+								 * committed? */
 	char		data[NAMEDATALEN + NOTIFY_PAYLOAD_MAX_LENGTH];
 } AsyncQueueEntry;
 
@@ -401,8 +417,27 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*  Information needed by At_AbortNotify() to remove entries from the queue for aborted transactions. */
+typedef struct AtAbortNotifyInfo
+{
+	/*
+	 * head position before the transaction start adding entries on the shared
+	 * queue
+	 */
+	QueuePosition previousHead;
+
+	/*
+	 * head position after the entries from the in-progress to commit
+	 * transaction were added.
+	 */
+	QueuePosition head;
+
+} AtAbortNotifyInfo;
+
 static NotificationList *pendingNotifies = NULL;
 
+static AtAbortNotifyInfo *atAbortInfo = NULL;
+
 /*
  * Inbound notifications are initially processed by HandleNotifyInterrupt(),
  * called from inside a signal handler. That just sets the
@@ -457,6 +492,7 @@ static void AddEventToPendingNotifies(Notification *n);
 static uint32 notification_hash(const void *key, Size keysize);
 static int	notification_match(const void *key1, const void *key2, Size keysize);
 static void ClearPendingActionsAndNotifies(void);
+static void asyncQueueRollbackNotifications(void);
 
 /*
  * Compute the difference between two queue page numbers.
@@ -922,6 +958,18 @@ PreCommit_Notify(void)
 		LockSharedObject(DatabaseRelationId, InvalidOid, 0,
 						 AccessExclusiveLock);
 
+		/*
+		 * Before start adding entries on the shared queue, save the current
+		 * QUEUE_HEAD so if the current in-progress to commit transaction
+		 * abort we can mark the notifications added by this aborted
+		 * transaction as not committed. See AtAbortt_Notify() for more info.
+		 */
+		Assert(atAbortInfo == NULL);
+		atAbortInfo = palloc(sizeof(AtAbortNotifyInfo));
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->previousHead = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Now push the notifications into the queue */
 		nextNotify = list_head(pendingNotifies->events);
 		while (nextNotify != NULL)
@@ -948,6 +996,17 @@ PreCommit_Notify(void)
 			LWLockRelease(NotifyQueueLock);
 		}
 
+		/*
+		 * Save the new QUEUE_HEAD position so if another publisher add
+		 * entries on the shared queue and successfully commit the transaction
+		 * we don't change the committed status of these notifications while
+		 * marking the notification from a aborted transaction as not
+		 * committed.
+		 */
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->head = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Note that we don't clear pendingNotifies; AtCommit_Notify will. */
 	}
 }
@@ -1402,6 +1461,13 @@ asyncQueueAddEntries(ListCell *nextNotify)
 		/* Construct a valid queue entry in local variable qe */
 		asyncQueueNotificationToEntry(n, &qe);
 
+		/*
+		 * Mark the entry as committed. If the transaction that this
+		 * notification belongs fails to commit the AtAbort_Notify() will mark
+		 * this entry as not committed.
+		 */
+		qe.committed = true;
+
 		offset = QUEUE_POS_OFFSET(queue_head);
 
 		/* Check whether the entry really fits on the current page */
@@ -1678,6 +1744,16 @@ AtAbort_Notify(void)
 	if (amRegisteredListener && listenChannels == NIL)
 		asyncQueueUnregister();
 
+	/*
+	 * AtAbort_Notify information is set when we are adding entries on the
+	 * global shared queue at PreCommit_Notify(), so in case of an abort on
+	 * the transaction between the PreCommit_Notify() and AtCommit_Notify() we
+	 * use this information to mark the entries from the aborted transaction
+	 * as not committed.
+	 */
+	if (atAbortInfo != NULL)
+		asyncQueueRollbackNotifications();
+
 	/* And clean up */
 	ClearPendingActionsAndNotifies();
 }
@@ -2062,11 +2138,12 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 				 * because our transaction cannot (yet) have queued any
 				 * messages.
 				 */
+
 				*current = thisentry;
 				reachedStop = true;
 				break;
 			}
-			else if (TransactionIdDidCommit(qe->xid))
+			else if (qe->committed)
 			{
 				/* qe->data is the null-terminated channel name */
 				char	   *channel = qe->data;
@@ -2385,6 +2462,7 @@ ClearPendingActionsAndNotifies(void)
 	 */
 	pendingActions = NULL;
 	pendingNotifies = NULL;
+	atAbortInfo = NULL;
 }
 
 /*
@@ -2395,3 +2473,109 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
 {
 	return check_slru_buffers("notify_buffers", newval);
 }
+
+
+/*
+ *  Mark notifications added on an in-progress to commit transaction as not committed.
+ *
+ * Notifications added on the shared global queue are added with committed =
+ * true during PreCommit_Notify() call. If an error occur between the
+ * PreCommit_Notify() and AtCommit_Notify() the AtAbort_Notify() will be called
+ * and we need to mark these notifications added on the shared queue by the
+ * aborted transaction as not committed so that listener backends can skip
+ * these notifications when reading the queue.
+ *
+ * We previously rely on TransactionDidCommit() to check this but if a
+ * notification is keep for too long on the queue and the VACUUM FREEZE is
+ * executed during this period it can remove clog files that is needed to check
+ * the transaction status of this notification, so we make the notification
+ * entries self contained to skip this problem.
+ *
+ */
+static void
+asyncQueueRollbackNotifications(void)
+{
+	QueuePosition current = atAbortInfo->previousHead;
+	QueuePosition head = atAbortInfo->head;
+
+	/*
+	 * Iterates from the position saved at the beginning of the transaction
+	 * (previousHead) to the current head of the queue. We do this to mark all
+	 * entries within this range as uncommitted in case of a transaction
+	 * crash.
+	 */
+	for (;;)
+	{
+		int64		curpage = QUEUE_POS_PAGE(current);
+		int			curoffset = QUEUE_POS_OFFSET(current);
+		LWLock	   *lock = SimpleLruGetBankLock(NotifyCtl, curpage);
+		int			slotno;
+
+		/*
+		 * If we have reached the head, all entries from this transaction have
+		 * been marked as not committed so break the loop.
+		 */
+		if (QUEUE_POS_EQUAL(current, head))
+			break;
+
+		/*
+		 * Acquire an exclusive lock on the current SLRU page to ensure no
+		 * other process can read or write to it while we are marking the
+		 * entries.
+		 */
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		/* Fetch the page from SLRU to mark entries as not committed. */
+		slotno = SimpleLruReadPage(NotifyCtl, curpage, true, InvalidTransactionId);
+
+		/*
+		 * Loop through all entries on the current page. The loop will
+		 * continue until we reach the end of the page or the current head.
+		 */
+		for (;;)
+		{
+			AsyncQueueEntry *qe;
+			bool		reachedEndOfPage;
+
+			/*
+			 * Check again to stop processing the entries on the current page.
+			 */
+			if (QUEUE_POS_EQUAL(current, head))
+				break;
+
+			/*
+			 * Get a pointer to the current entry within the shared page
+			 * buffer.
+			 */
+			qe = (AsyncQueueEntry *) (NotifyCtl->shared->page_buffer[slotno] + curoffset);
+
+			/*
+			 * Just for sanity, all entries on the shared queue should be
+			 * marked as not committed.
+			 */
+			Assert(qe->committed);
+
+			/* Ensure that listener backends can not see these entries */
+			Assert(TransactionIdIsInProgress(qe->xid));
+
+			/*
+			 * Mark the entry as uncommitted so listener backends can skip
+			 * this notification.
+			 */
+			qe->committed = false;
+
+			/* Advance our position. */
+			reachedEndOfPage = asyncQueueAdvance(&current, qe->length);
+			if (reachedEndOfPage)
+				break;
+
+			/*
+			 * Update the offset for the next iteration within the same page.
+			 */
+			curoffset = QUEUE_POS_OFFSET(current);
+		}
+
+		/* Release the exclusive lock on the page. */
+		LWLockRelease(lock);
+	}
+}
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a7954101..a015c961d35 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cf..6af33448d7b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..c1eb4fde370
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,19 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+EXTRA_INSTALL=src/test/modules/xid_wraparound
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..a68052cd353
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,14 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+      't/002_aborted_tx_notifies.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..a8bbd268c0f
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,74 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Check if the extension xid_wraparound is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('xid_wraparound'))
+{
+	plan skip_all => 'Extension xid_wraparound not installed';
+}
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..dae7a24f5b2
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,79 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed_0';
+	NOTIFY ch,'committed_1';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+");
+
+# Send notifications that should not be eventually delivered, as session2
+# transaction will be aborted.
+my $message = 'aborted_' . 'a' x 1000;
+for (my $i = 0; $i < 10; $i++) {
+    $psql_session2->query_safe("NOTIFY ch, '$i$message'");
+}
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send more notifications after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_2';");
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_3';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 4, 'received all committed notifications');
+for (my $i = 0; $i < 4; $i++) {
+    like($lines[$i], qr/Asynchronous notification "ch" with payload "committed_$i" received/);
+}
+
+ok($psql_listener->quit);
+ok($psql_session1->quit);
+ok($psql_session2->quit);
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..14b3e176dbf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -160,6 +160,7 @@ ArrayType
 AsyncQueueControl
 AsyncQueueEntry
 AsyncRequest
+AtAbortNotifyInfo
 AttInMetadata
 AttStatsSlot
 AttoptCacheEntry
-- 
2.43.0

#44

Joel Jacobson

joel@compiler.org

3 months ago

In reply to: Arseniy Mukhin (#43)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri, Oct 17, 2025, at 22:50, Arseniy Mukhin wrote:

On Fri, Oct 17, 2025 at 7:31 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

I have the impression that this thread has lost focus on the idea of
producing a backpatchable bugfix. The last proposed patch has a lot of
new mechanism that doesn't seem suitable for backpatch. I could be
wrong of course.

Oops, I guess the TAP test that I added in the last version and that
uses injection points is one of those things. PFA the new version
without it. I also noticed that CheckSharedObjectLockedByMe() which
was introduced in one of previous versions is unused, so I removed it
too.

What a funny coincidence that the approach in this patch,
has one similarity with the "Direct advancement" approach
in the patch in the "Optimize LISTEN/NOTIFY" [1]/messages/by-id/6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com thread,
namely that we're both interested in QUEUE_HEAD before/after
we push the notifications into the queue, in PreCommit_Notify().

It looks to me like our new data structures are Interchangeable,
so I guess we probably want both patches to eventually settle
on one and the same?

The differences I note between our queue head before/after code are:

- In this patch, you are palloc'ing a struct with two fields.
In [1]/messages/by-id/6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com, we're using two separate static QueuePosition variables.

- In this patch, you are taking/releasing a shared lock before/after
the loop to read QUEUE_HEAD and set previousHead/head.
In [1]/messages/by-id/6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com, we avoid the need of the shared lock, by doing the reads
within the existing exclusive lock inside the loop, but instead
therefore need a firstIteration bool, to know which is the first
iteration, and need to overwrite the after-var in each iteration.

I don't think the noted differences above matter, both seems fine.

Another thing I noticed in your patch that made me wonder,
is the naming of the new AsyncQueueEntry bool field,
which is given the name "committed".

I think this name is not entirely faithful, since when set to true,
the entry has not been committed yet.

How about negating the meaning of this boolean field?
To instead indicate when the entry has been rollbacked.
Then, it would clearly communicate just that.

Maybe naming it something like "rollbacked" or "aborted"?

/Joel

[1]: /messages/by-id/6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com

#45

Joel Jacobson

joel@compiler.org

3 months ago

In reply to: Álvaro Herrera (#42)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri, Oct 17, 2025, at 15:51, Álvaro Herrera wrote:

I have the impression that this thread has lost focus on the idea of
producing a backpatchable bugfix. The last proposed patch has a lot of
new mechanism that doesn't seem suitable for backpatch. I could be
wrong of course.

I've tried to create a minimal isolated fix, hopefully suitable for
backpatching, with no new mechanisms, other than the added
GetOldestQueuedNotifyXid used by vac_update_datfrozenxid.

It's based on the approach discussed earlier in this thread, that just
goes through the notification queue from QUEUE_TAIL to QUEUE_HEAD, to
find the oldestXid in the current database.

Implementation:

* Break out SLRU read page code from asyncQueueReadAllNotifications into
new helper-function asyncQueueReadPageToBuffer.

* Add GetOldestQueuedNotifyXid which uses the new helper-function
asyncQueueReadPageToBuffer.

It passes the 001_xid_freeze.pl test, not included in this patch.

/Joel

Attachments:

fix_notify_datfrozenxid.txttext/plain; name=fix_notify_datfrozenxid.txtDownload

From 31ff0b7c35320afacf30685c006f17d6de179421 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <joel@compiler.org>
Date: Sun, 19 Oct 2025 18:55:25 +0200
Subject: [PATCH] Prevent VACUUM from truncating XIDs still present in
 notification queue

VACUUM's computation of datfrozenxid did not account for transaction IDs
in the LISTEN/NOTIFY queue.  This allowed VACUUM to truncate clog
entries for XIDs that were still referenced by queued notifications,
causing backends to fail in TransactionIdDidCommit when later processing
those notifications.

Fix by adding GetOldestQueuedNotifyXid to find the oldest XID in queued
notifications for the current database, and constraining datfrozenxid to
not pass that.  The function scans from QUEUE_TAIL, since notifications
may have been written before any listeners existed.

To avoid code duplication, refactor SLRU page-reading code into a new
helper function asyncQueueReadPageToBuffer.
---
 src/backend/commands/async.c  | 139 ++++++++++++++++++++++++++++------
 src/backend/commands/vacuum.c |  14 ++++
 src/include/commands/async.h  |   3 +
 3 files changed, 132 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..7c9d7831c9f 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -1841,6 +1841,44 @@ ProcessNotifyInterrupt(bool flush)
 		ProcessIncomingNotify(flush);
 }
 
+/*
+ * Read a page from the SLRU queue into a local buffer.
+ *
+ * Reads the page containing 'pos', copying the data from the current offset
+ * either to the end of the page or up to 'head' (whichever comes first)
+ * into page_buffer.
+ */
+static void
+asyncQueueReadPageToBuffer(QueuePosition pos, QueuePosition head,
+						   char *page_buffer)
+{
+	int64		curpage = QUEUE_POS_PAGE(pos);
+	int			curoffset = QUEUE_POS_OFFSET(pos);
+	int			slotno;
+	int			copysize;
+
+	slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+										InvalidTransactionId);
+
+	if (curpage == QUEUE_POS_PAGE(head))
+	{
+		/* we only want to read as far as head */
+		copysize = QUEUE_POS_OFFSET(head) - curoffset;
+		if (copysize < 0)
+			copysize = 0;		/* just for safety */
+	}
+	else
+	{
+		/* fetch all the rest of the page */
+		copysize = QUEUE_PAGESIZE - curoffset;
+	}
+
+	memcpy(page_buffer + curoffset,
+		   NotifyCtl->shared->page_buffer[slotno] + curoffset,
+		   copysize);
+	/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+	LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+}
 
 /*
  * Read all pending notifications from the queue, and deliver appropriate
@@ -1932,36 +1970,13 @@ asyncQueueReadAllNotifications(void)
 
 		do
 		{
-			int64		curpage = QUEUE_POS_PAGE(pos);
-			int			curoffset = QUEUE_POS_OFFSET(pos);
-			int			slotno;
-			int			copysize;
-
 			/*
 			 * We copy the data from SLRU into a local buffer, so as to avoid
 			 * holding the SLRU lock while we are examining the entries and
 			 * possibly transmitting them to our frontend.  Copy only the part
 			 * of the page we will actually inspect.
 			 */
-			slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
-												InvalidTransactionId);
-			if (curpage == QUEUE_POS_PAGE(head))
-			{
-				/* we only want to read as far as head */
-				copysize = QUEUE_POS_OFFSET(head) - curoffset;
-				if (copysize < 0)
-					copysize = 0;	/* just for safety */
-			}
-			else
-			{
-				/* fetch all the rest of the page */
-				copysize = QUEUE_PAGESIZE - curoffset;
-			}
-			memcpy(page_buffer.buf + curoffset,
-				   NotifyCtl->shared->page_buffer[slotno] + curoffset,
-				   copysize);
-			/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
-			LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+			asyncQueueReadPageToBuffer(pos, head, page_buffer.buf);
 
 			/*
 			 * Process messages up to the stop position, end of page, or an
@@ -2097,6 +2112,82 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 	return reachedStop;
 }
 
+/*
+ * Get the oldest XID in the notification queue that has not yet been
+ * processed by all listening backends.
+ *
+ * Returns InvalidTransactionId if there are no unprocessed notifications.
+ */
+TransactionId
+GetOldestQueuedNotifyXid(void)
+{
+	QueuePosition pos;
+	QueuePosition head;
+	TransactionId oldestXid = InvalidTransactionId;
+
+	/* page_buffer must be adequately aligned, so use a union */
+	union
+	{
+		char		buf[QUEUE_PAGESIZE];
+		AsyncQueueEntry align;
+	}			page_buffer;
+
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+	/*
+	 * We must start at QUEUE_TAIL since notification data might have been
+	 * written before there were any listening backends.
+	 */
+	pos = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+
+	/* If the queue is empty, no XIDs need protection */
+	if (QUEUE_POS_EQUAL(pos, head))
+	{
+		LWLockRelease(NotifyQueueLock);
+		return InvalidTransactionId;
+	}
+
+	while (!QUEUE_POS_EQUAL(pos, head))
+	{
+		int			curoffset;
+		AsyncQueueEntry *qe;
+
+		/* Read the current page from SLRU into our local buffer */
+		asyncQueueReadPageToBuffer(pos, head, page_buffer.buf);
+
+		curoffset = QUEUE_POS_OFFSET(pos);
+
+		/* Process all entries on this page up to head */
+		while (curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <= QUEUE_PAGESIZE &&
+			   !QUEUE_POS_EQUAL(pos, head))
+		{
+			qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
+
+			/*
+			 * Check if this entry is for our database and has a valid XID.
+			 * Only entries for our database matter for our datfrozenxid.
+			 */
+			if (qe->dboid == MyDatabaseId && TransactionIdIsValid(qe->xid))
+			{
+				if (!TransactionIdIsValid(oldestXid) ||
+					TransactionIdPrecedes(qe->xid, oldestXid))
+					oldestXid = qe->xid;
+			}
+
+			/* Advance to next entry */
+			if (asyncQueueAdvance(&pos, qe->length))
+				break;			/* advanced to next page */
+
+			curoffset = QUEUE_POS_OFFSET(pos);
+		}
+	}
+
+	LWLockRelease(NotifyQueueLock);
+
+	return oldestXid;
+}
+
 /*
  * Advance the shared queue tail variable to the minimum of all the
  * per-backend tail pointers.  Truncate pg_notify space if possible.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..4f278c6b988 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1733,6 +1734,19 @@ vac_update_datfrozenxid(void)
 	if (bogus)
 		return;
 
+	/*
+	 * Also consider the oldest XID in the notification queue, since
+	 * backends will need to call TransactionIdDidCommit() on those
+	 * XIDs when processing the notifications.
+	 */
+	{
+		TransactionId oldestNotifyXid = GetOldestQueuedNotifyXid();
+
+		if (TransactionIdIsValid(oldestNotifyXid) &&
+			TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
+			newFrozenXid = oldestNotifyXid;
+	}
+
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
 
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..d707f516316 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -26,6 +26,9 @@ extern void NotifyMyFrontEnd(const char *channel,
 							 const char *payload,
 							 int32 srcPid);
 
+/* get oldest XID in the notification queue for vacuum */
+extern TransactionId GetOldestQueuedNotifyXid(void);
+
 /* notify-related SQL statements */
 extern void Async_Notify(const char *channel, const char *payload);
 extern void Async_Listen(const char *channel);
-- 
2.50.1

#46

Matheus Alcantara

matheusssilv97@gmail.com

3 months ago

In reply to: Joel Jacobson (#44)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Sat Oct 18, 2025 at 2:43 AM -03, Joel Jacobson wrote:

On Fri, Oct 17, 2025, at 22:50, Arseniy Mukhin wrote:
What a funny coincidence that the approach in this patch,
has one similarity with the "Direct advancement" approach
in the patch in the "Optimize LISTEN/NOTIFY" [1] thread,
namely that we're both interested in QUEUE_HEAD before/after
we push the notifications into the queue, in PreCommit_Notify().

It looks to me like our new data structures are Interchangeable,
so I guess we probably want both patches to eventually settle
on one and the same?

The differences I note between our queue head before/after code are:

- In this patch, you are palloc'ing a struct with two fields.
In [1], we're using two separate static QueuePosition variables.

- In this patch, you are taking/releasing a shared lock before/after
the loop to read QUEUE_HEAD and set previousHead/head.
In [1], we avoid the need of the shared lock, by doing the reads
within the existing exclusive lock inside the loop, but instead
therefore need a firstIteration bool, to know which is the first
iteration, and need to overwrite the after-var in each iteration.

I don't think the noted differences above matter, both seems fine.

Yeah, I also think that both approach seems fine. I keep the v8 version
with the palloc, if someone has any concern about this I'm open to
switch to another approach.

Another thing I noticed in your patch that made me wonder,
is the naming of the new AsyncQueueEntry bool field,
which is given the name "committed".

I think this name is not entirely faithful, since when set to true,
the entry has not been committed yet.

How about negating the meaning of this boolean field?
To instead indicate when the entry has been rollbacked.
Then, it would clearly communicate just that.

Maybe naming it something like "rollbacked" or "aborted"?

Good point. I've renamed this field on the attached v8 version.

--
Matheus Alcantara

Attachments:

v8-0001-Make-AsyncQueueEntry-s-self-contained.patchtext/plain; charset=utf-8; name=v8-0001-Make-AsyncQueueEntry-s-self-contained.patchDownload

From 103d83c69ff0870f812e91b7f6491719bf7f39c6 Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Sat, 6 Sep 2025 11:29:02 -0300
Subject: [PATCH v8] Make AsyncQueueEntry's self contained

Previously the asyncQueueProcessPageEntries() use the
TransactionIdDidCommit() to check if the transaction that a notification
belongs is committed or not. Although this work for almost all scenarios
we may have some cases where if a notification is keep for to long on
the queue and the VACUUM FREEZE is executed during this time it may
remove clog files that is needed to check the transaction status of
these notifications which will cause errors to listener backends when
reading the async queue.

This commit fix this issue by making the AsyncQueueEntry self contained
by adding the "rollbacked" boolean field so asyncQueueProcessPageEntries()
can use this to check if the transaction of the notification is
rollbacked or not.

We set rollbacked as false when adding the entry on the SLRU page buffer
cache when PreCommit_Notify() is called and if an error occur before
AtCommit_Notify() the AtAbort_Notify() will be called which will mark
the rollbacked field as true. We do this by remembering the QUEUE_HEAD
position before the PreCommit_Notify() start adding entries on the
shared queue, and if the transaction crash we iterate from this saved
position until the new QUEUE_HEAD position marking the entries as not
committed.

Also this commit include TAP tests to exercise the VACUUM FREEZE issue
and also the scenario of an error being occur between the
PreCommit_Notify() and AtCommit_Notify() calls.

Author: Matheus Alcantara <mths.dev@pm.me>
Co-authored-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
---
 src/backend/commands/async.c                  | 186 +++++++++++++++++-
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_listen_notify/Makefile  |  19 ++
 .../modules/test_listen_notify/meson.build    |  14 ++
 .../test_listen_notify/t/001_xid_freeze.pl    |  74 +++++++
 .../t/002_aborted_tx_notifies.pl              |  79 ++++++++
 src/tools/pgindent/typedefs.list              |   1 +
 8 files changed, 374 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..7e4aa3a6487 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -79,6 +79,19 @@
  *	  are way behind and should be kicked to make them advance their
  *	  pointers).
  *
+ *    The notification entries added to the queue are self-contained and
+ *    include a "committed" field to inform listener backends if the associated
+ *    transaction has committed. We could use the TransactionIdDidCommit() but
+ *    if a notification remain in the queue long enough for VACUUM FREEZE to
+ *    remove the necessary pg_xact/ file, the listener backend will face errors
+ *    to get the transaction status. To prevent this, the "committed" field is
+ *    set to true during PreCommit_Notify() and if the transaction aborts between
+ *    the PreCommit_Notify() and AtCommit_Notify(), the AtAbort_Notify() is
+ *    called to mark these entries as uncommitted. To enable this, we save the
+ *    queue's head position before adding new entries from the in-progress to
+ *    commit transaction. If an abort occurs, AtAbort_Notify() uses this saved
+ *    position to find and mark the entries as uncommitted.
+ *
  *	  Finally, after we are out of the transaction altogether and about to go
  *	  idle, we scan the queue for messages that need to be sent to our
  *	  frontend (which might be notifies from other backends, or self-notifies
@@ -142,6 +155,7 @@
 #include "miscadmin.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/procarray.h"
 #include "storage/procsignal.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
@@ -180,6 +194,8 @@ typedef struct AsyncQueueEntry
 	Oid			dboid;			/* sender's database OID */
 	TransactionId xid;			/* sender's XID */
 	int32		srcPid;			/* sender's PID */
+	bool		rollbacked;		/* Is transaction that the entry belongs
+								 * committed? */
 	char		data[NAMEDATALEN + NOTIFY_PAYLOAD_MAX_LENGTH];
 } AsyncQueueEntry;
 
@@ -401,8 +417,27 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*  Information needed by At_AbortNotify() to remove entries from the queue for aborted transactions. */
+typedef struct AtAbortNotifyInfo
+{
+	/*
+	 * head position before the transaction start adding entries on the shared
+	 * queue
+	 */
+	QueuePosition previousHead;
+
+	/*
+	 * head position after the entries from the in-progress to commit
+	 * transaction were added.
+	 */
+	QueuePosition head;
+
+} AtAbortNotifyInfo;
+
 static NotificationList *pendingNotifies = NULL;
 
+static AtAbortNotifyInfo *atAbortInfo = NULL;
+
 /*
  * Inbound notifications are initially processed by HandleNotifyInterrupt(),
  * called from inside a signal handler. That just sets the
@@ -457,6 +492,7 @@ static void AddEventToPendingNotifies(Notification *n);
 static uint32 notification_hash(const void *key, Size keysize);
 static int	notification_match(const void *key1, const void *key2, Size keysize);
 static void ClearPendingActionsAndNotifies(void);
+static void asyncQueueRollbackNotifications(void);
 
 /*
  * Compute the difference between two queue page numbers.
@@ -922,6 +958,18 @@ PreCommit_Notify(void)
 		LockSharedObject(DatabaseRelationId, InvalidOid, 0,
 						 AccessExclusiveLock);
 
+		/*
+		 * Before start adding entries on the shared queue, save the current
+		 * QUEUE_HEAD so if the current in-progress to commit transaction
+		 * abort we can mark the notifications added by this aborted
+		 * transaction as not committed. See AtAbortt_Notify() for more info.
+		 */
+		Assert(atAbortInfo == NULL);
+		atAbortInfo = palloc(sizeof(AtAbortNotifyInfo));
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->previousHead = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Now push the notifications into the queue */
 		nextNotify = list_head(pendingNotifies->events);
 		while (nextNotify != NULL)
@@ -948,6 +996,17 @@ PreCommit_Notify(void)
 			LWLockRelease(NotifyQueueLock);
 		}
 
+		/*
+		 * Save the new QUEUE_HEAD position so if another publisher add
+		 * entries on the shared queue and successfully commit the transaction
+		 * we don't change the committed status of these notifications while
+		 * marking the notification from a aborted transaction as not
+		 * committed.
+		 */
+		LWLockAcquire(NotifyQueueLock, LW_SHARED);
+		atAbortInfo->head = QUEUE_HEAD;
+		LWLockRelease(NotifyQueueLock);
+
 		/* Note that we don't clear pendingNotifies; AtCommit_Notify will. */
 	}
 }
@@ -1402,6 +1461,13 @@ asyncQueueAddEntries(ListCell *nextNotify)
 		/* Construct a valid queue entry in local variable qe */
 		asyncQueueNotificationToEntry(n, &qe);
 
+		/*
+		 * Mark the entry as not rollbacked. If the transaction that this
+		 * notification belongs fails to commit the AtAbort_Notify() will mark
+		 * this entry as rollbacked.
+		 */
+		qe.rollbacked = false;
+
 		offset = QUEUE_POS_OFFSET(queue_head);
 
 		/* Check whether the entry really fits on the current page */
@@ -1678,6 +1744,16 @@ AtAbort_Notify(void)
 	if (amRegisteredListener && listenChannels == NIL)
 		asyncQueueUnregister();
 
+	/*
+	 * AtAbort_Notify information is set when we are adding entries on the
+	 * global shared queue at PreCommit_Notify(), so in case of an abort on
+	 * the transaction between the PreCommit_Notify() and AtCommit_Notify() we
+	 * use this information to mark the entries from the aborted transaction
+	 * as not committed.
+	 */
+	if (atAbortInfo != NULL)
+		asyncQueueRollbackNotifications();
+
 	/* And clean up */
 	ClearPendingActionsAndNotifies();
 }
@@ -2062,11 +2138,12 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 				 * because our transaction cannot (yet) have queued any
 				 * messages.
 				 */
+
 				*current = thisentry;
 				reachedStop = true;
 				break;
 			}
-			else if (TransactionIdDidCommit(qe->xid))
+			else if (!qe->rollbacked)
 			{
 				/* qe->data is the null-terminated channel name */
 				char	   *channel = qe->data;
@@ -2385,6 +2462,7 @@ ClearPendingActionsAndNotifies(void)
 	 */
 	pendingActions = NULL;
 	pendingNotifies = NULL;
+	atAbortInfo = NULL;
 }
 
 /*
@@ -2395,3 +2473,109 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
 {
 	return check_slru_buffers("notify_buffers", newval);
 }
+
+
+/*
+ *  Mark notifications added on an in-progress to commit transaction as not committed.
+ *
+ * Notifications added on the shared global queue are added with committed =
+ * true during PreCommit_Notify() call. If an error occur between the
+ * PreCommit_Notify() and AtCommit_Notify() the AtAbort_Notify() will be called
+ * and we need to mark these notifications added on the shared queue by the
+ * aborted transaction as not committed so that listener backends can skip
+ * these notifications when reading the queue.
+ *
+ * We previously rely on TransactionDidCommit() to check this but if a
+ * notification is keep for too long on the queue and the VACUUM FREEZE is
+ * executed during this period it can remove clog files that is needed to check
+ * the transaction status of this notification, so we make the notification
+ * entries self contained to skip this problem.
+ *
+ */
+static void
+asyncQueueRollbackNotifications(void)
+{
+	QueuePosition current = atAbortInfo->previousHead;
+	QueuePosition head = atAbortInfo->head;
+
+	/*
+	 * Iterates from the position saved at the beginning of the transaction
+	 * (previousHead) to the current head of the queue. We do this to mark all
+	 * entries within this range as uncommitted in case of a transaction
+	 * crash.
+	 */
+	for (;;)
+	{
+		int64		curpage = QUEUE_POS_PAGE(current);
+		int			curoffset = QUEUE_POS_OFFSET(current);
+		LWLock	   *lock = SimpleLruGetBankLock(NotifyCtl, curpage);
+		int			slotno;
+
+		/*
+		 * If we have reached the head, all entries from this transaction have
+		 * been marked as not committed so break the loop.
+		 */
+		if (QUEUE_POS_EQUAL(current, head))
+			break;
+
+		/*
+		 * Acquire an exclusive lock on the current SLRU page to ensure no
+		 * other process can read or write to it while we are marking the
+		 * entries.
+		 */
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		/* Fetch the page from SLRU to mark entries as not committed. */
+		slotno = SimpleLruReadPage(NotifyCtl, curpage, true, InvalidTransactionId);
+
+		/*
+		 * Loop through all entries on the current page. The loop will
+		 * continue until we reach the end of the page or the current head.
+		 */
+		for (;;)
+		{
+			AsyncQueueEntry *qe;
+			bool		reachedEndOfPage;
+
+			/*
+			 * Check again to stop processing the entries on the current page.
+			 */
+			if (QUEUE_POS_EQUAL(current, head))
+				break;
+
+			/*
+			 * Get a pointer to the current entry within the shared page
+			 * buffer.
+			 */
+			qe = (AsyncQueueEntry *) (NotifyCtl->shared->page_buffer[slotno] + curoffset);
+
+			/*
+			 * Just for sanity, all entries on the shared queue should be
+			 * marked as not committed.
+			 */
+			Assert(!qe->rollbacked);
+
+			/* Ensure that listener backends can not see these entries */
+			Assert(TransactionIdIsInProgress(qe->xid));
+
+			/*
+			 * Mark the entry as rollbacked so listener backends can skip this
+			 * notification.
+			 */
+			qe->rollbacked = true;
+
+			/* Advance our position. */
+			reachedEndOfPage = asyncQueueAdvance(&current, qe->length);
+			if (reachedEndOfPage)
+				break;
+
+			/*
+			 * Update the offset for the next iteration within the same page.
+			 */
+			curoffset = QUEUE_POS_OFFSET(current);
+		}
+
+		/* Release the exclusive lock on the page. */
+		LWLockRelease(lock);
+	}
+}
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a7954101..a015c961d35 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cf..6af33448d7b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..c1eb4fde370
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,19 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+EXTRA_INSTALL=src/test/modules/xid_wraparound
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..a68052cd353
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,14 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+      't/002_aborted_tx_notifies.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..a8bbd268c0f
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,74 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Check if the extension xid_wraparound is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('xid_wraparound'))
+{
+	plan skip_all => 'Extension xid_wraparound not installed';
+}
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..dae7a24f5b2
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,79 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed_0';
+	NOTIFY ch,'committed_1';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+");
+
+# Send notifications that should not be eventually delivered, as session2
+# transaction will be aborted.
+my $message = 'aborted_' . 'a' x 1000;
+for (my $i = 0; $i < 10; $i++) {
+    $psql_session2->query_safe("NOTIFY ch, '$i$message'");
+}
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send more notifications after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_2';");
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_3';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 4, 'received all committed notifications');
+for (my $i = 0; $i < 4; $i++) {
+    like($lines[$i], qr/Asynchronous notification "ch" with payload "committed_$i" received/);
+}
+
+ok($psql_listener->quit);
+ok($psql_session1->quit);
+ok($psql_session2->quit);
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 377a7946585..385bbb16d64 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -160,6 +160,7 @@ ArrayType
 AsyncQueueControl
 AsyncQueueEntry
 AsyncRequest
+AtAbortNotifyInfo
 AttInMetadata
 AttStatsSlot
 AttoptCacheEntry
-- 
2.51.0

#47

Matheus Alcantara

matheusssilv97@gmail.com

3 months ago

In reply to: Joel Jacobson (#45)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Sun Oct 19, 2025 at 2:14 PM -03, Joel Jacobson wrote:

On Fri, Oct 17, 2025, at 15:51, Álvaro Herrera wrote:

I have the impression that this thread has lost focus on the idea of
producing a backpatchable bugfix. The last proposed patch has a lot of
new mechanism that doesn't seem suitable for backpatch. I could be
wrong of course.

I've tried to create a minimal isolated fix, hopefully suitable for
backpatching, with no new mechanisms, other than the added
GetOldestQueuedNotifyXid used by vac_update_datfrozenxid.

It's based on the approach discussed earlier in this thread, that just
goes through the notification queue from QUEUE_TAIL to QUEUE_HEAD, to
find the oldestXid in the current database.

Implementation:

* Break out SLRU read page code from asyncQueueReadAllNotifications into
new helper-function asyncQueueReadPageToBuffer.

* Add GetOldestQueuedNotifyXid which uses the new helper-function
asyncQueueReadPageToBuffer.

It passes the 001_xid_freeze.pl test, not included in this patch.

This is similar to what was already proposed at [1]/messages/by-id/CAFY6G8cJm73_MM9SuynZUqtqcaTuepUDgDuvS661oLW7U0dgsg@mail.gmail.com. This approach was
abandoned because a notification on the queue may block datfrozenxid
advance and clog truncation which can cause other issues for the users [2]/messages/by-id/d186fba0-dc65-4274-aa96-3906bbb2e530@postgrespro.ru.

[1]: /messages/by-id/CAFY6G8cJm73_MM9SuynZUqtqcaTuepUDgDuvS661oLW7U0dgsg@mail.gmail.com
[2]: /messages/by-id/d186fba0-dc65-4274-aa96-3906bbb2e530@postgrespro.ru

--
Matheus Alcantara

#48

Álvaro Herrera

alvherre@kurilemu.de

3 months ago

In reply to: Matheus Alcantara (#47)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 2025-Oct-20, Matheus Alcantara wrote:

This is similar to what was already proposed at [1]. This approach was
abandoned because a notification on the queue may block datfrozenxid
advance and clog truncation which can cause other issues for the users [2].

Well, I think that this is the right solution for backpatching, and that
you were wrong to abandon it. You can continue to design a better
mechanism for the master branch, but in old branches we cannot really do
all those things you're proposing to do.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)

#49

Matheus Alcantara

matheusssilv97@gmail.com

3 months ago

In reply to: Álvaro Herrera (#48)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Mon Oct 20, 2025 at 11:18 AM -03, Álvaro Herrera wrote:

On 2025-Oct-20, Matheus Alcantara wrote:

This is similar to what was already proposed at [1]. This approach was
abandoned because a notification on the queue may block datfrozenxid
advance and clog truncation which can cause other issues for the users [2].

Well, I think that this is the right solution for backpatching, and that
you were wrong to abandon it. You can continue to design a better
mechanism for the master branch, but in old branches we cannot really do
all those things you're proposing to do.

I actually would prefer this approach TBH, but since this can cause
other issues like transaction wraparound due to not consumed
notifications we would need other mechanisms to prevent that and I'm not
sure if users should expect this kind of behavior changes on minor
version updates?

I think that to go with this solution we would need some way to drop too
old notifications from the queue to advance the datfrozenxid, so I
imagine that we would need some GUC to make this configurable and we can
configure a default value of course but some use cases may not be the
best configuration, this is something that users should expected to deal
on minor version updates?

Going with the "self contained" idea sound more easier to backpatch
actually, so this is the main reason that I abandoned this other
approach. Could you please point what make the v8 version not visible
for bachpatching?

Thanks for the comments!

--
Matheus Alcantara

#50

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: Matheus Alcantara (#46)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Mon, Oct 20, 2025 at 5:37 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Sat Oct 18, 2025 at 2:43 AM -03, Joel Jacobson wrote:

On Fri, Oct 17, 2025, at 22:50, Arseniy Mukhin wrote:
What a funny coincidence that the approach in this patch,
has one similarity with the "Direct advancement" approach
in the patch in the "Optimize LISTEN/NOTIFY" [1] thread,
namely that we're both interested in QUEUE_HEAD before/after
we push the notifications into the queue, in PreCommit_Notify().

It looks to me like our new data structures are Interchangeable,
so I guess we probably want both patches to eventually settle
on one and the same?

The differences I note between our queue head before/after code are:

- In this patch, you are palloc'ing a struct with two fields.
In [1], we're using two separate static QueuePosition variables.

- In this patch, you are taking/releasing a shared lock before/after
the loop to read QUEUE_HEAD and set previousHead/head.
In [1], we avoid the need of the shared lock, by doing the reads
within the existing exclusive lock inside the loop, but instead
therefore need a firstIteration bool, to know which is the first
iteration, and need to overwrite the after-var in each iteration.

I don't think the noted differences above matter, both seems fine.

Yeah, I also think that both approach seems fine. I keep the v8 version
with the palloc, if someone has any concern about this I'm open to
switch to another approach.

Another thing I noticed in your patch that made me wonder,
is the naming of the new AsyncQueueEntry bool field,
which is given the name "committed".

I think this name is not entirely faithful, since when set to true,
the entry has not been committed yet.

How about negating the meaning of this boolean field?
To instead indicate when the entry has been rollbacked.
Then, it would clearly communicate just that.

Maybe naming it something like "rollbacked" or "aborted"?

Good point. I've renamed this field on the attached v8 version.

I've reviewed the v8 patch and I'm not sure it's a bullet-proof
approach. The basic idea of the v8 patch is to add async entries with
rollbacked=false and we set rollbacked=true in
asyncQueueRollbackNotifications called from AtAbort_Notify() if an
error happens between PreCommit_Notify() and AtCommit_Notify(). The
asyncQueueRollbackNotifications() reads SLRU pages to mark entries
'rollbacked' but reading SLRU pages could fail for some reason. In
this case, we would end up with a PANIC for recursive errors, or even
if we skip marking entries we would end up leaving entries as
committed. Also, if there are a lot of notifications across multiple
SLRU pages we need to mark as 'rollbacked',
asyncQueueRollbackNotifications() could take time to complete but it's
not interruptible. I don't think it's a good idea to introduce such an
operation in abort paths.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#51

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

3 months ago

In reply to: Masahiko Sawada (#50)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Tue, Oct 21, 2025 at 4:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 20, 2025 at 5:37 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Sat Oct 18, 2025 at 2:43 AM -03, Joel Jacobson wrote:

On Fri, Oct 17, 2025, at 22:50, Arseniy Mukhin wrote:
What a funny coincidence that the approach in this patch,
has one similarity with the "Direct advancement" approach
in the patch in the "Optimize LISTEN/NOTIFY" [1] thread,
namely that we're both interested in QUEUE_HEAD before/after
we push the notifications into the queue, in PreCommit_Notify().

It looks to me like our new data structures are Interchangeable,
so I guess we probably want both patches to eventually settle
on one and the same?

The differences I note between our queue head before/after code are:

- In this patch, you are palloc'ing a struct with two fields.
In [1], we're using two separate static QueuePosition variables.

- In this patch, you are taking/releasing a shared lock before/after
the loop to read QUEUE_HEAD and set previousHead/head.
In [1], we avoid the need of the shared lock, by doing the reads
within the existing exclusive lock inside the loop, but instead
therefore need a firstIteration bool, to know which is the first
iteration, and need to overwrite the after-var in each iteration.

I don't think the noted differences above matter, both seems fine.

Yeah, I also think that both approach seems fine. I keep the v8 version
with the palloc, if someone has any concern about this I'm open to
switch to another approach.

Another thing I noticed in your patch that made me wonder,
is the naming of the new AsyncQueueEntry bool field,
which is given the name "committed".

I think this name is not entirely faithful, since when set to true,
the entry has not been committed yet.

How about negating the meaning of this boolean field?
To instead indicate when the entry has been rollbacked.
Then, it would clearly communicate just that.

Maybe naming it something like "rollbacked" or "aborted"?

Good point. I've renamed this field on the attached v8 version.

I've reviewed the v8 patch and I'm not sure it's a bullet-proof
approach. The basic idea of the v8 patch is to add async entries with
rollbacked=false and we set rollbacked=true in
asyncQueueRollbackNotifications called from AtAbort_Notify() if an
error happens between PreCommit_Notify() and AtCommit_Notify(). The
asyncQueueRollbackNotifications() reads SLRU pages to mark entries
'rollbacked' but reading SLRU pages could fail for some reason. In
this case, we would end up with a PANIC for recursive errors, or even
if we skip marking entries we would end up leaving entries as
committed. Also, if there are a lot of notifications across multiple
SLRU pages we need to mark as 'rollbacked',
asyncQueueRollbackNotifications() could take time to complete but it's
not interruptible. I don't think it's a good idea to introduce such an
operation in abort paths.

Thanks for the explanation! Now I see why using AtAbort_Notify() was a
bad idea. It might be possible to solve the second problem and make
the amount of work in AtAbort_Notify independent of the number of
notifications. But I have no idea how to solve the SLRU page issue.

Best regards,
Arseniy Mukhin

#52

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: Matheus Alcantara (#49)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Mon, Oct 20, 2025 at 11:19 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Mon Oct 20, 2025 at 11:18 AM -03, Álvaro Herrera wrote:

On 2025-Oct-20, Matheus Alcantara wrote:

This is similar to what was already proposed at [1]. This approach was
abandoned because a notification on the queue may block datfrozenxid
advance and clog truncation which can cause other issues for the users [2].

Well, I think that this is the right solution for backpatching, and that
you were wrong to abandon it. You can continue to design a better
mechanism for the master branch, but in old branches we cannot really do
all those things you're proposing to do.

I actually would prefer this approach TBH, but since this can cause
other issues like transaction wraparound due to not consumed
notifications we would need other mechanisms to prevent that and I'm not
sure if users should expect this kind of behavior changes on minor
version updates?

True, unconsumed notifications could cause transaction wraparound by
preventing datfrozenxid from advancing. However, this risk only
applies when users have long-term unconsumed notifications, which is
uncommon. That said, we should note that, as I mentioned
previously[1]/messages/by-id/CAD21AoCD+HXoc2QZCAS9d8ahDeikNqbnU0i6cQzpMFOEurkPPg@mail.gmail.com, a process can accumulate unconsumed notifications
simply by being in idle-in-transaction state, even without
backend_xmin and backend_xid, which prevents datfrozenxid from
advancing. While this might not be problematic in practice if it's
rare, I find it concerning that we have no way to check the age of
unconsumed notifications.

I think that to go with this solution we would need some way to drop too
old notifications from the queue to advance the datfrozenxid, so I
imagine that we would need some GUC to make this configurable and we can
configure a default value of course but some use cases may not be the
best configuration, this is something that users should expected to deal
on minor version updates?

I think adding a new GUC would be overkill for this fix. As for
dropping old notifications from the queue, we probably don't need to
make it configurable - we could simply drop notifications whose commit
status is no longer available (instead of raising an error).

Going with the "self contained" idea sound more easier to backpatch
actually, so this is the main reason that I abandoned this other
approach. Could you please point what make the v8 version not visible
for bachpatching?

Regarding the v8 patch, it introduces a fundamentally new way of
managing notification entries (adding entries with 'committed' state
and marking them 'aborted' in abort paths). This affects all use
cases, not just those involving very old unconsumed notifications, and
could introduce more serious bugs like PANIC or SEGV. For
backpatching, I prefer targeting just the problematic behavior while
leaving unrelated parts unchanged. Though Álvaro might have a
different perspective on this.

Regards,

[1]: /messages/by-id/CAD21AoCD+HXoc2QZCAS9d8ahDeikNqbnU0i6cQzpMFOEurkPPg@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#53

Matheus Alcantara

matheusssilv97@gmail.com

3 months ago

In reply to: Masahiko Sawada (#52)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 21/10/25 18:42, Masahiko Sawada wrote:

On Mon, Oct 20, 2025 at 11:19 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Mon Oct 20, 2025 at 11:18 AM -03, Álvaro Herrera wrote:

On 2025-Oct-20, Matheus Alcantara wrote:

This is similar to what was already proposed at [1]. This approach was
abandoned because a notification on the queue may block datfrozenxid
advance and clog truncation which can cause other issues for the users [2].

Well, I think that this is the right solution for backpatching, and that
you were wrong to abandon it. You can continue to design a better
mechanism for the master branch, but in old branches we cannot really do
all those things you're proposing to do.

I actually would prefer this approach TBH, but since this can cause
other issues like transaction wraparound due to not consumed
notifications we would need other mechanisms to prevent that and I'm not
sure if users should expect this kind of behavior changes on minor
version updates?

True, unconsumed notifications could cause transaction wraparound by
preventing datfrozenxid from advancing. However, this risk only
applies when users have long-term unconsumed notifications, which is
uncommon. That said, we should note that, as I mentioned
previously[1], a process can accumulate unconsumed notifications
simply by being in idle-in-transaction state, even without
backend_xmin and backend_xid, which prevents datfrozenxid from
advancing. While this might not be problematic in practice if it's
rare, I find it concerning that we have no way to check the age of
unconsumed notifications.

Ok, I think that I was too conservative when thinking about the
transaction wraparound issue that it could happen. I agree that this
seems a uncommon scenario.

I think that to go with this solution we would need some way to drop too
old notifications from the queue to advance the datfrozenxid, so I
imagine that we would need some GUC to make this configurable and we can
configure a default value of course but some use cases may not be the
best configuration, this is something that users should expected to deal
on minor version updates?

I think adding a new GUC would be overkill for this fix. As for
dropping old notifications from the queue, we probably don't need to
make it configurable - we could simply drop notifications whose commit
status is no longer available (instead of raising an error).

IIUC this is about not making the vacuum freeze considering the oldest
xid on the queue but just remove notifications whose transaction status
is no longer available right? Since currently when the error happens we
already can't process the notifications it seems a reasonable way to go
IMO.

Going with the "self contained" idea sound more easier to backpatch
actually, so this is the main reason that I abandoned this other
approach. Could you please point what make the v8 version not visible
for bachpatching?

Regarding the v8 patch, it introduces a fundamentally new way of
managing notification entries (adding entries with 'committed' state
and marking them 'aborted' in abort paths). This affects all use
cases, not just those involving very old unconsumed notifications, and
could introduce more serious bugs like PANIC or SEGV. For
backpatching, I prefer targeting just the problematic behavior while
leaving unrelated parts unchanged. Though Álvaro might have a
different perspective on this.

Thanks very much for this explanation and for what you've previously
wrote on [1]/messages/by-id/CAD21AoCFZxXCBy+5DoarfG9LC9VdNwWRDpDHE5sdTh5Ym0EcqQ@mail.gmail.com. It's clear to me now that the v8 architecture is not a
good way to go.

[1]: /messages/by-id/CAD21AoCFZxXCBy+5DoarfG9LC9VdNwWRDpDHE5sdTh5Ym0EcqQ@mail.gmail.com

--
Matheus Alcantara

#54

Joel Jacobson

joel@compiler.org

3 months ago

In reply to: Matheus Alcantara (#53)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed, Oct 22, 2025, at 02:16, Matheus Alcantara wrote:

Regarding the v8 patch, it introduces a fundamentally new way of
managing notification entries (adding entries with 'committed' state
and marking them 'aborted' in abort paths). This affects all use
cases, not just those involving very old unconsumed notifications, and
could introduce more serious bugs like PANIC or SEGV. For
backpatching, I prefer targeting just the problematic behavior while
leaving unrelated parts unchanged. Though Álvaro might have a
different perspective on this.

Thanks very much for this explanation and for what you've previously
wrote on [1]. It's clear to me now that the v8 architecture is not a
good way to go.

How about doing some more work in vac_update_datfrozenxid()?

Pseudo-code sketch:

```
void
vac_update_datfrozenxid(void)
{

/* After computing newFrozenXid from all known sources... */

TransactionId oldestNotifyXid = GetOldestQueuedNotifyXid();

if (TransactionIdIsValid(oldestNotifyXid) &&
TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
{
/*
* The async queue has XIDs older than our proposed freeze point.
* Attempt cleanup, then back off and let the next VACUUM benefit.
*/

if (asyncQueueHasListeners())
{
/*
* Wake all listening backends across *all* databases
* that are not already at QUEUE_HEAD.
* They'll hopefully process notifications and advance
* their pointers, allowing the next VACUUM to freeze further.
*/
asyncQueueWakeAllListeners();
}
else
{
/*
* No listeners exist - discard all unread notifications.
* The next VACUUM should succeed in advancing datfrozenxid.
* asyncQueueAdvanceTailNoListeners() would take exclusive lock
* on NotifyQueueLock before checking
* QUEUE_FIRST_LISTENER == INVALID_PROC_NUMBER
*/
asyncQueueAdvanceTailNoListeners();
}

/*
* Back off datfrozenxid to protect the old XIDs.
* The cleanup we just performed should allow the next VACUUM
* to freeze further.
*/
newFrozenXid = oldestNotifyXid;
}
}
```

Maybe it wouldn't solve all problematic situations, but to me it seems
like these measures could help many of them, or am I missing some
crucial insight here?

/Joel

#55

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: Matheus Alcantara (#53)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Tue, Oct 21, 2025 at 4:16 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On 21/10/25 18:42, Masahiko Sawada wrote:

On Mon, Oct 20, 2025 at 11:19 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Mon Oct 20, 2025 at 11:18 AM -03, Álvaro Herrera wrote:

On 2025-Oct-20, Matheus Alcantara wrote:

This is similar to what was already proposed at [1]. This approach was
abandoned because a notification on the queue may block datfrozenxid
advance and clog truncation which can cause other issues for the users [2].

Well, I think that this is the right solution for backpatching, and that
you were wrong to abandon it. You can continue to design a better
mechanism for the master branch, but in old branches we cannot really do
all those things you're proposing to do.

I actually would prefer this approach TBH, but since this can cause
other issues like transaction wraparound due to not consumed
notifications we would need other mechanisms to prevent that and I'm not
sure if users should expect this kind of behavior changes on minor
version updates?

True, unconsumed notifications could cause transaction wraparound by
preventing datfrozenxid from advancing. However, this risk only
applies when users have long-term unconsumed notifications, which is
uncommon. That said, we should note that, as I mentioned
previously[1], a process can accumulate unconsumed notifications
simply by being in idle-in-transaction state, even without
backend_xmin and backend_xid, which prevents datfrozenxid from
advancing. While this might not be problematic in practice if it's
rare, I find it concerning that we have no way to check the age of
unconsumed notifications.

Ok, I think that I was too conservative when thinking about the
transaction wraparound issue that it could happen. I agree that this
seems a uncommon scenario.

I think that to go with this solution we would need some way to drop too
old notifications from the queue to advance the datfrozenxid, so I
imagine that we would need some GUC to make this configurable and we can
configure a default value of course but some use cases may not be the
best configuration, this is something that users should expected to deal
on minor version updates?

I think adding a new GUC would be overkill for this fix. As for
dropping old notifications from the queue, we probably don't need to
make it configurable - we could simply drop notifications whose commit
status is no longer available (instead of raising an error).

IIUC this is about not making the vacuum freeze considering the oldest
xid on the queue but just remove notifications whose transaction status
is no longer available right? Since currently when the error happens we
already can't process the notifications it seems a reasonable way to go
IMO.

On second thought, simply hiding the error would be worse than our
current behavior. Users wouldn't know their notifications are being
dropped, as they often don't check WARNINGs. The more frequently they
try to freeze XIDs, the more notifications they'd lose. To avoid
silent discards, they would need to increase
autovacuum_vacuum_max_freeze_age to accommodate more clog entries, but
this increases the risk of XID wraparound. I think the proposed
approach modifying the vacuum freeze to consider the oldest XID on the
queue would be better. This has a downside as I mentioned: processes
in idle-in-transaction state even without backend_xmin and backend_xid
can still accumulate unconsumed notifications. However, leaving
transactions in idle-in-transaction state for a long time is bad
practice anyway. While we might want to consider adding a safeguard
for this case, I guess it would rarely occur in practice.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#56

Matheus Alcantara

matheusssilv97@gmail.com

3 months ago

In reply to: Masahiko Sawada (#55)

2 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed Oct 22, 2025 at 1:31 AM -03, Masahiko Sawada wrote:

On Tue, Oct 21, 2025 at 4:16 PM Matheus Alcantara

I think adding a new GUC would be overkill for this fix. As for
dropping old notifications from the queue, we probably don't need to
make it configurable - we could simply drop notifications whose commit
status is no longer available (instead of raising an error).

IIUC this is about not making the vacuum freeze considering the oldest
xid on the queue but just remove notifications whose transaction status
is no longer available right? Since currently when the error happens we
already can't process the notifications it seems a reasonable way to go
IMO.

On second thought, simply hiding the error would be worse than our
current behavior. Users wouldn't know their notifications are being
dropped, as they often don't check WARNINGs. The more frequently they
try to freeze XIDs, the more notifications they'd lose. To avoid
silent discards, they would need to increase
autovacuum_vacuum_max_freeze_age to accommodate more clog entries, but
this increases the risk of XID wraparound. I think the proposed
approach modifying the vacuum freeze to consider the oldest XID on the
queue would be better. This has a downside as I mentioned: processes
in idle-in-transaction state even without backend_xmin and backend_xid
can still accumulate unconsumed notifications. However, leaving
transactions in idle-in-transaction state for a long time is bad
practice anyway. While we might want to consider adding a safeguard
for this case, I guess it would rarely occur in practice.

I'm attaching a v9 patch which is based on the idea of changing the
vacuum freeze to consider the oldest xid on the listen/notify queue. The
0001 patch is from Joel that it was previously sent on [1]/messages/by-id/25651193-da4e-4185-a564-f2efa6b0c8a4@app.fastmail.com with some
small tweaks and the 0002 is the TAP tests introduced on the previously
versions by me and by Arseniy. I keep it separate because I'm not sure
if it's all suitable for back-pacthing.

I'm wondering if the 002_aborted_tx_notifies.pl is still needed with
this architecture being used. I think that it's not, but perhaps is a
good test to keep it?

[1]: /messages/by-id/25651193-da4e-4185-a564-f2efa6b0c8a4@app.fastmail.com

--
Matheus Alcantara

Attachments:

v9-0001-Prevent-VACUUM-from-truncating-XIDs-still-present.patchtext/plain; charset=utf-8; name=v9-0001-Prevent-VACUUM-from-truncating-XIDs-still-present.patchDownload

From a7e3f43f3091520ff6dcaa7d0b0a3b5a74729f19 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <joel@compiler.org>
Date: Sun, 19 Oct 2025 18:55:25 +0200
Subject: [PATCH v9 1/2] Prevent VACUUM from truncating XIDs still present in
 notification queue

VACUUM's computation of datfrozenxid did not account for transaction IDs
in the LISTEN/NOTIFY queue.  This allowed VACUUM to truncate clog
entries for XIDs that were still referenced by queued notifications,
causing backends to fail in TransactionIdDidCommit when later processing
those notifications.

Fix by adding GetOldestQueuedNotifyXid to find the oldest XID in queued
notifications for the current database, and constraining datfrozenxid to
not pass that.  The function scans from QUEUE_TAIL, since notifications
may have been written before any listeners existed.

To avoid code duplication, refactor SLRU page-reading code into a new
helper function asyncQueueReadPageToBuffer.
---
 src/backend/commands/async.c  | 139 ++++++++++++++++++++++++++++------
 src/backend/commands/vacuum.c |  12 +++
 src/include/commands/async.h  |   3 +
 3 files changed, 130 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..7c9d7831c9f 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -1841,6 +1841,44 @@ ProcessNotifyInterrupt(bool flush)
 		ProcessIncomingNotify(flush);
 }
 
+/*
+ * Read a page from the SLRU queue into a local buffer.
+ *
+ * Reads the page containing 'pos', copying the data from the current offset
+ * either to the end of the page or up to 'head' (whichever comes first)
+ * into page_buffer.
+ */
+static void
+asyncQueueReadPageToBuffer(QueuePosition pos, QueuePosition head,
+						   char *page_buffer)
+{
+	int64		curpage = QUEUE_POS_PAGE(pos);
+	int			curoffset = QUEUE_POS_OFFSET(pos);
+	int			slotno;
+	int			copysize;
+
+	slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+										InvalidTransactionId);
+
+	if (curpage == QUEUE_POS_PAGE(head))
+	{
+		/* we only want to read as far as head */
+		copysize = QUEUE_POS_OFFSET(head) - curoffset;
+		if (copysize < 0)
+			copysize = 0;		/* just for safety */
+	}
+	else
+	{
+		/* fetch all the rest of the page */
+		copysize = QUEUE_PAGESIZE - curoffset;
+	}
+
+	memcpy(page_buffer + curoffset,
+		   NotifyCtl->shared->page_buffer[slotno] + curoffset,
+		   copysize);
+	/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+	LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+}
 
 /*
  * Read all pending notifications from the queue, and deliver appropriate
@@ -1932,36 +1970,13 @@ asyncQueueReadAllNotifications(void)
 
 		do
 		{
-			int64		curpage = QUEUE_POS_PAGE(pos);
-			int			curoffset = QUEUE_POS_OFFSET(pos);
-			int			slotno;
-			int			copysize;
-
 			/*
 			 * We copy the data from SLRU into a local buffer, so as to avoid
 			 * holding the SLRU lock while we are examining the entries and
 			 * possibly transmitting them to our frontend.  Copy only the part
 			 * of the page we will actually inspect.
 			 */
-			slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
-												InvalidTransactionId);
-			if (curpage == QUEUE_POS_PAGE(head))
-			{
-				/* we only want to read as far as head */
-				copysize = QUEUE_POS_OFFSET(head) - curoffset;
-				if (copysize < 0)
-					copysize = 0;	/* just for safety */
-			}
-			else
-			{
-				/* fetch all the rest of the page */
-				copysize = QUEUE_PAGESIZE - curoffset;
-			}
-			memcpy(page_buffer.buf + curoffset,
-				   NotifyCtl->shared->page_buffer[slotno] + curoffset,
-				   copysize);
-			/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
-			LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+			asyncQueueReadPageToBuffer(pos, head, page_buffer.buf);
 
 			/*
 			 * Process messages up to the stop position, end of page, or an
@@ -2097,6 +2112,82 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 	return reachedStop;
 }
 
+/*
+ * Get the oldest XID in the notification queue that has not yet been
+ * processed by all listening backends.
+ *
+ * Returns InvalidTransactionId if there are no unprocessed notifications.
+ */
+TransactionId
+GetOldestQueuedNotifyXid(void)
+{
+	QueuePosition pos;
+	QueuePosition head;
+	TransactionId oldestXid = InvalidTransactionId;
+
+	/* page_buffer must be adequately aligned, so use a union */
+	union
+	{
+		char		buf[QUEUE_PAGESIZE];
+		AsyncQueueEntry align;
+	}			page_buffer;
+
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+	/*
+	 * We must start at QUEUE_TAIL since notification data might have been
+	 * written before there were any listening backends.
+	 */
+	pos = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+
+	/* If the queue is empty, no XIDs need protection */
+	if (QUEUE_POS_EQUAL(pos, head))
+	{
+		LWLockRelease(NotifyQueueLock);
+		return InvalidTransactionId;
+	}
+
+	while (!QUEUE_POS_EQUAL(pos, head))
+	{
+		int			curoffset;
+		AsyncQueueEntry *qe;
+
+		/* Read the current page from SLRU into our local buffer */
+		asyncQueueReadPageToBuffer(pos, head, page_buffer.buf);
+
+		curoffset = QUEUE_POS_OFFSET(pos);
+
+		/* Process all entries on this page up to head */
+		while (curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <= QUEUE_PAGESIZE &&
+			   !QUEUE_POS_EQUAL(pos, head))
+		{
+			qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
+
+			/*
+			 * Check if this entry is for our database and has a valid XID.
+			 * Only entries for our database matter for our datfrozenxid.
+			 */
+			if (qe->dboid == MyDatabaseId && TransactionIdIsValid(qe->xid))
+			{
+				if (!TransactionIdIsValid(oldestXid) ||
+					TransactionIdPrecedes(qe->xid, oldestXid))
+					oldestXid = qe->xid;
+			}
+
+			/* Advance to next entry */
+			if (asyncQueueAdvance(&pos, qe->length))
+				break;			/* advanced to next page */
+
+			curoffset = QUEUE_POS_OFFSET(pos);
+		}
+	}
+
+	LWLockRelease(NotifyQueueLock);
+
+	return oldestXid;
+}
+
 /*
  * Advance the shared queue tail variable to the minimum of all the
  * per-backend tail pointers.  Truncate pg_notify space if possible.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..6c601ce81aa 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1617,6 +1618,7 @@ vac_update_datfrozenxid(void)
 	bool		dirty = false;
 	ScanKeyData key[1];
 	void	   *inplace_state;
+	TransactionId oldestNotifyXid;
 
 	/*
 	 * Restrict this task to one backend per database.  This avoids race
@@ -1733,6 +1735,16 @@ vac_update_datfrozenxid(void)
 	if (bogus)
 		return;
 
+	/*
+	 * Also consider the oldest XID in the notification queue, since backends
+	 * will need to call TransactionIdDidCommit() on those XIDs when
+	 * processing the notifications.
+	 */
+	oldestNotifyXid = GetOldestQueuedNotifyXid();
+	if (TransactionIdIsValid(oldestNotifyXid) &&
+		TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
+		newFrozenXid = oldestNotifyXid;
+
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
 
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..ac323ada492 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -26,6 +26,9 @@ extern void NotifyMyFrontEnd(const char *channel,
 							 const char *payload,
 							 int32 srcPid);
 
+/* get oldest XID in the notification queue for vacuum freeze */
+extern TransactionId GetOldestQueuedNotifyXid(void);
+
 /* notify-related SQL statements */
 extern void Async_Notify(const char *channel, const char *payload);
 extern void Async_Listen(const char *channel);
-- 
2.51.0

v9-0002-Add-tap-tests-for-listen-notify-vacuum-freeze.patchtext/plain; charset=utf-8; name=v9-0002-Add-tap-tests-for-listen-notify-vacuum-freeze.patchDownload

From b4c7c43a6c3f2b573f66f4e5e72f6e485446af3b Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Wed, 22 Oct 2025 11:06:18 -0300
Subject: [PATCH v9 2/2] Add tap tests for listen notify vacuum freeze

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 src/test/modules/test_listen_notify/Makefile  | 19 +++++
 .../modules/test_listen_notify/meson.build    | 14 ++++
 .../test_listen_notify/t/001_xid_freeze.pl    | 74 +++++++++++++++++
 .../t/002_aborted_tx_notifies.pl              | 79 +++++++++++++++++++
 6 files changed, 188 insertions(+)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a7954101..a015c961d35 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cf..6af33448d7b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..c1eb4fde370
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,19 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+EXTRA_INSTALL=src/test/modules/xid_wraparound
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..a68052cd353
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,14 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+      't/002_aborted_tx_notifies.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..a8bbd268c0f
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,74 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Check if the extension xid_wraparound is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('xid_wraparound'))
+{
+	plan skip_all => 'Extension xid_wraparound not installed';
+}
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..dae7a24f5b2
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,79 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed_0';
+	NOTIFY ch,'committed_1';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+");
+
+# Send notifications that should not be eventually delivered, as session2
+# transaction will be aborted.
+my $message = 'aborted_' . 'a' x 1000;
+for (my $i = 0; $i < 10; $i++) {
+    $psql_session2->query_safe("NOTIFY ch, '$i$message'");
+}
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send more notifications after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_2';");
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_3';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 4, 'received all committed notifications');
+for (my $i = 0; $i < 4; $i++) {
+    like($lines[$i], qr/Asynchronous notification "ch" with payload "committed_$i" received/);
+}
+
+ok($psql_listener->quit);
+ok($psql_session1->quit);
+ok($psql_session2->quit);
+
+done_testing();
-- 
2.51.0

#57

Álvaro Herrera

alvherre@kurilemu.de

3 months ago

In reply to: Matheus Alcantara (#56)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 2025-Oct-22, Matheus Alcantara wrote:

I'm wondering if the 002_aborted_tx_notifies.pl is still needed with
this architecture being used. I think that it's not, but perhaps is a
good test to keep it?

I'd rather have tests than not, but I'd think it needs to be behind
PG_TEST_EXTRA because of things like

+$node->safe_psql('postgres', 'select consume_xids(10000000);');

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)

#58

Matheus Alcantara

matheusssilv97@gmail.com

3 months ago

In reply to: Álvaro Herrera (#57)

2 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed Oct 22, 2025 at 12:49 PM -03, Álvaro Herrera wrote:

On 2025-Oct-22, Matheus Alcantara wrote:

I'm wondering if the 002_aborted_tx_notifies.pl is still needed with
this architecture being used. I think that it's not, but perhaps is a
good test to keep it?

I'd rather have tests than not, but I'd think it needs to be behind
PG_TEST_EXTRA because of things like

+$node->safe_psql('postgres', 'select consume_xids(10000000);');

Attached v10 with wrapping into PG_TEST_EXTRA. Should we enable this
somewhere to be executed on build farm?

--
Matheus Alcantara

Attachments:

v10-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patchtext/plain; charset=utf-8; name=v10-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patchDownload

From a7e3f43f3091520ff6dcaa7d0b0a3b5a74729f19 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <joel@compiler.org>
Date: Sun, 19 Oct 2025 18:55:25 +0200
Subject: [PATCH v10 1/2] Prevent VACUUM from truncating XIDs still present in
 notification queue

VACUUM's computation of datfrozenxid did not account for transaction IDs
in the LISTEN/NOTIFY queue.  This allowed VACUUM to truncate clog
entries for XIDs that were still referenced by queued notifications,
causing backends to fail in TransactionIdDidCommit when later processing
those notifications.

Fix by adding GetOldestQueuedNotifyXid to find the oldest XID in queued
notifications for the current database, and constraining datfrozenxid to
not pass that.  The function scans from QUEUE_TAIL, since notifications
may have been written before any listeners existed.

To avoid code duplication, refactor SLRU page-reading code into a new
helper function asyncQueueReadPageToBuffer.
---
 src/backend/commands/async.c  | 139 ++++++++++++++++++++++++++++------
 src/backend/commands/vacuum.c |  12 +++
 src/include/commands/async.h  |   3 +
 3 files changed, 130 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..7c9d7831c9f 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -1841,6 +1841,44 @@ ProcessNotifyInterrupt(bool flush)
 		ProcessIncomingNotify(flush);
 }
 
+/*
+ * Read a page from the SLRU queue into a local buffer.
+ *
+ * Reads the page containing 'pos', copying the data from the current offset
+ * either to the end of the page or up to 'head' (whichever comes first)
+ * into page_buffer.
+ */
+static void
+asyncQueueReadPageToBuffer(QueuePosition pos, QueuePosition head,
+						   char *page_buffer)
+{
+	int64		curpage = QUEUE_POS_PAGE(pos);
+	int			curoffset = QUEUE_POS_OFFSET(pos);
+	int			slotno;
+	int			copysize;
+
+	slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+										InvalidTransactionId);
+
+	if (curpage == QUEUE_POS_PAGE(head))
+	{
+		/* we only want to read as far as head */
+		copysize = QUEUE_POS_OFFSET(head) - curoffset;
+		if (copysize < 0)
+			copysize = 0;		/* just for safety */
+	}
+	else
+	{
+		/* fetch all the rest of the page */
+		copysize = QUEUE_PAGESIZE - curoffset;
+	}
+
+	memcpy(page_buffer + curoffset,
+		   NotifyCtl->shared->page_buffer[slotno] + curoffset,
+		   copysize);
+	/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+	LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+}
 
 /*
  * Read all pending notifications from the queue, and deliver appropriate
@@ -1932,36 +1970,13 @@ asyncQueueReadAllNotifications(void)
 
 		do
 		{
-			int64		curpage = QUEUE_POS_PAGE(pos);
-			int			curoffset = QUEUE_POS_OFFSET(pos);
-			int			slotno;
-			int			copysize;
-
 			/*
 			 * We copy the data from SLRU into a local buffer, so as to avoid
 			 * holding the SLRU lock while we are examining the entries and
 			 * possibly transmitting them to our frontend.  Copy only the part
 			 * of the page we will actually inspect.
 			 */
-			slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
-												InvalidTransactionId);
-			if (curpage == QUEUE_POS_PAGE(head))
-			{
-				/* we only want to read as far as head */
-				copysize = QUEUE_POS_OFFSET(head) - curoffset;
-				if (copysize < 0)
-					copysize = 0;	/* just for safety */
-			}
-			else
-			{
-				/* fetch all the rest of the page */
-				copysize = QUEUE_PAGESIZE - curoffset;
-			}
-			memcpy(page_buffer.buf + curoffset,
-				   NotifyCtl->shared->page_buffer[slotno] + curoffset,
-				   copysize);
-			/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
-			LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+			asyncQueueReadPageToBuffer(pos, head, page_buffer.buf);
 
 			/*
 			 * Process messages up to the stop position, end of page, or an
@@ -2097,6 +2112,82 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 	return reachedStop;
 }
 
+/*
+ * Get the oldest XID in the notification queue that has not yet been
+ * processed by all listening backends.
+ *
+ * Returns InvalidTransactionId if there are no unprocessed notifications.
+ */
+TransactionId
+GetOldestQueuedNotifyXid(void)
+{
+	QueuePosition pos;
+	QueuePosition head;
+	TransactionId oldestXid = InvalidTransactionId;
+
+	/* page_buffer must be adequately aligned, so use a union */
+	union
+	{
+		char		buf[QUEUE_PAGESIZE];
+		AsyncQueueEntry align;
+	}			page_buffer;
+
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+	/*
+	 * We must start at QUEUE_TAIL since notification data might have been
+	 * written before there were any listening backends.
+	 */
+	pos = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+
+	/* If the queue is empty, no XIDs need protection */
+	if (QUEUE_POS_EQUAL(pos, head))
+	{
+		LWLockRelease(NotifyQueueLock);
+		return InvalidTransactionId;
+	}
+
+	while (!QUEUE_POS_EQUAL(pos, head))
+	{
+		int			curoffset;
+		AsyncQueueEntry *qe;
+
+		/* Read the current page from SLRU into our local buffer */
+		asyncQueueReadPageToBuffer(pos, head, page_buffer.buf);
+
+		curoffset = QUEUE_POS_OFFSET(pos);
+
+		/* Process all entries on this page up to head */
+		while (curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <= QUEUE_PAGESIZE &&
+			   !QUEUE_POS_EQUAL(pos, head))
+		{
+			qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
+
+			/*
+			 * Check if this entry is for our database and has a valid XID.
+			 * Only entries for our database matter for our datfrozenxid.
+			 */
+			if (qe->dboid == MyDatabaseId && TransactionIdIsValid(qe->xid))
+			{
+				if (!TransactionIdIsValid(oldestXid) ||
+					TransactionIdPrecedes(qe->xid, oldestXid))
+					oldestXid = qe->xid;
+			}
+
+			/* Advance to next entry */
+			if (asyncQueueAdvance(&pos, qe->length))
+				break;			/* advanced to next page */
+
+			curoffset = QUEUE_POS_OFFSET(pos);
+		}
+	}
+
+	LWLockRelease(NotifyQueueLock);
+
+	return oldestXid;
+}
+
 /*
  * Advance the shared queue tail variable to the minimum of all the
  * per-backend tail pointers.  Truncate pg_notify space if possible.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..6c601ce81aa 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1617,6 +1618,7 @@ vac_update_datfrozenxid(void)
 	bool		dirty = false;
 	ScanKeyData key[1];
 	void	   *inplace_state;
+	TransactionId oldestNotifyXid;
 
 	/*
 	 * Restrict this task to one backend per database.  This avoids race
@@ -1733,6 +1735,16 @@ vac_update_datfrozenxid(void)
 	if (bogus)
 		return;
 
+	/*
+	 * Also consider the oldest XID in the notification queue, since backends
+	 * will need to call TransactionIdDidCommit() on those XIDs when
+	 * processing the notifications.
+	 */
+	oldestNotifyXid = GetOldestQueuedNotifyXid();
+	if (TransactionIdIsValid(oldestNotifyXid) &&
+		TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
+		newFrozenXid = oldestNotifyXid;
+
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
 
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..ac323ada492 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -26,6 +26,9 @@ extern void NotifyMyFrontEnd(const char *channel,
 							 const char *payload,
 							 int32 srcPid);
 
+/* get oldest XID in the notification queue for vacuum freeze */
+extern TransactionId GetOldestQueuedNotifyXid(void);
+
 /* notify-related SQL statements */
 extern void Async_Notify(const char *channel, const char *payload);
 extern void Async_Listen(const char *channel);
-- 
2.51.0

v10-0002-Add-tap-tests-for-listen-notify-vacuum-freeze.patchtext/plain; charset=utf-8; name=v10-0002-Add-tap-tests-for-listen-notify-vacuum-freeze.patchDownload

From 0a644d3d7061ff316b11a8245118b7d70f7e986d Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Wed, 22 Oct 2025 11:06:18 -0300
Subject: [PATCH v10 2/2] Add tap tests for listen notify vacuum freeze

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 src/test/modules/test_listen_notify/Makefile  | 19 +++++
 .../modules/test_listen_notify/meson.build    | 14 ++++
 .../test_listen_notify/t/001_xid_freeze.pl    | 74 ++++++++++++++++
 .../t/002_aborted_tx_notifies.pl              | 84 +++++++++++++++++++
 6 files changed, 193 insertions(+)
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a7954101..a015c961d35 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cf..6af33448d7b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..c1eb4fde370
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,19 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+EXTRA_INSTALL=src/test/modules/xid_wraparound
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..a68052cd353
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,14 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl',
+      't/002_aborted_tx_notifies.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..a8bbd268c0f
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,74 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Check if the extension xid_wraparound is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('xid_wraparound'))
+{
+	plan skip_all => 'Extension xid_wraparound not installed';
+}
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,100000) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..464c959a5a9
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,84 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\blisten_notify\b/)
+{
+	plan skip_all => "test listen_notify not enabled in PG_TEST_EXTRA";
+}
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed_0';
+	NOTIFY ch,'committed_1';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+");
+
+# Send notifications that should not be eventually delivered, as session2
+# transaction will be aborted.
+my $message = 'aborted_' . 'a' x 1000;
+for (my $i = 0; $i < 10; $i++) {
+    $psql_session2->query_safe("NOTIFY ch, '$i$message'");
+}
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send more notifications after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_2';");
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_3';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 4, 'received all committed notifications');
+for (my $i = 0; $i < 4; $i++) {
+    like($lines[$i], qr/Asynchronous notification "ch" with payload "committed_$i" received/);
+}
+
+ok($psql_listener->quit);
+ok($psql_session1->quit);
+ok($psql_session2->quit);
+
+done_testing();
-- 
2.51.0

#59

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: Matheus Alcantara (#58)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed, Oct 22, 2025 at 10:25 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Wed Oct 22, 2025 at 12:49 PM -03, Álvaro Herrera wrote:

On 2025-Oct-22, Matheus Alcantara wrote:

I'm wondering if the 002_aborted_tx_notifies.pl is still needed with
this architecture being used. I think that it's not, but perhaps is a
good test to keep it?

I'd rather have tests than not, but I'd think it needs to be behind
PG_TEST_EXTRA because of things like

+$node->safe_psql('postgres', 'select consume_xids(10000000);');

Attached v10 with wrapping into PG_TEST_EXTRA.

Thank you for updating the patches!

I've reviewed the patches and here are the comments.

v10-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patch:

+ *
+ * Returns InvalidTransactionId if there are no unprocessed notifications.

This comment is not accurate since the function returns
InvalidTransactionId even if all unprocessed notifications are created
on other databases.

---
+TransactionId
+GetOldestQueuedNotifyXid(void)
+{

How about renaming it to something like GetOldestNotifyTransactionId()?

---
+   /* page_buffer must be adequately aligned, so use a union */
+   union
+   {
+       char        buf[QUEUE_PAGESIZE];
+       AsyncQueueEntry align;
+   }           page_buffer;

asyncQueueReadAllNotifications() uses this union too, so how about
define this type as AlignedQueueEntryPage or something and use it in
both functions?

---
+
+   LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+

I don't think we need an exclusive lock here.

---
In GetOldestQueuedNotifyXid() why do we keep holding NotifyQueueLock
while calculating the oldest XID in the queue?

---
+
+       /* Process all entries on this page up to head */
+       while (curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <=
QUEUE_PAGESIZE &&
+              !QUEUE_POS_EQUAL(pos, head))
+       {
+           qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
(snip)
+           /* Advance to next entry */
+           if (asyncQueueAdvance(&pos, qe->length))
+               break;          /* advanced to next page */
+

I think the check "curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <=
QUEUE_PAGESIZE" is essentially the same as the one we do in
asyncQueueAdvance(). I think we can refactor the inner loop in
GetOldestQueuedNotifyXid() to something like:

for (;;)
{
bool reachedEndOfPage;

qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);

// check qe->xid here...

reachedEndOfPage = asyncQueueAdvance(&pos, qe->length);

if (reachedEndOfPage || QUEUE_POS_EQUAL(pos, head))
break;
}

---
v10-0002-Add-tap-tests-for-listen-notify-vacuum-freeze.patch:

This new test directory needs to have .gitignore.

---
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+

It's better to have a short description of this test here.

---
+use File::Path qw(mkpath);

It seems not necessary.

---
+$node->safe_psql('postgres',
+   'CREATE TABLE t AS SELECT g AS a, g+2 AS b from
generate_series(1,100000) g;'

Why does it need to insert many rows?

---
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+   $node->safe_psql(
+       'postgres', "
+       BEGIN;
+       NOTIFY s, '$i';
+       COMMIT;");
+}

Why does it need to send a notification 10 times?

---
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+

I guess it consumes XID too much to trigger truncation. Given the one
clog segment file is 256kB in size, it's enough to consume 1,050,000
XIDs to move to the next segment file.

---
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select
datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');

I think this test passes in all cases, so it is meaningless. Instead,
what we need to check in terms of datfrozenxid is that its value
doesn't get greater than the XID we used for the notification, even
after vacuum freeze. I think we remember the XID used for NOTIFY and
check if the old/new datfrozenxid values are older than that value.

---
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();

After consuming the unconsumed notifications, let's do vacuum freeze
and check if datfrozenxid now can be advanced.

---
I would expect to add 002_aborted_tx_notifies.pl in a separate patch
since it's not related to this bug fix.

---
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.

I wonder if we could implement this test using the isolation test
instead of the tap test. Is there any reason why you used a tap test
for that?

Regards,

Should we enable this
somewhere to be executed on build farm?

Yeah, I hope some buildfarm animals enable it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#60

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

3 months ago

In reply to: Masahiko Sawada (#59)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri, Oct 24, 2025 at 2:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Oct 22, 2025 at 10:25 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Wed Oct 22, 2025 at 12:49 PM -03, Álvaro Herrera wrote:

On 2025-Oct-22, Matheus Alcantara wrote:

I'm wondering if the 002_aborted_tx_notifies.pl is still needed with
this architecture being used. I think that it's not, but perhaps is a
good test to keep it?

I'd rather have tests than not, but I'd think it needs to be behind
PG_TEST_EXTRA because of things like

+$node->safe_psql('postgres', 'select consume_xids(10000000);');

Attached v10 with wrapping into PG_TEST_EXTRA.

Thank you for updating the patches!

I've reviewed the patches and here are the comments.

Thank you for the review!

I would expect to add 002_aborted_tx_notifies.pl in a separate patch
since it's not related to this bug fix.
---
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
I wonder if we could implement this test using the isolation test
instead of the tap test. Is there any reason why you used a tap test
for that?

I agree it's less relevant to the patch now than it was with the new
'committed' field approach. And there is no particular reason why it
was implemented as a TAP test actually.. So +1 to move it to separate
patch (does it mean to separate thread as well or just separate patch
file?) and rewrite as an isolation test (IIUC it's better to use
isolation test infrastructure if it's possible). I can try to do it if
nobody else does it earlier.

Best regards,
Arseniy Mukhin

#61

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

3 months ago

In reply to: Arseniy Mukhin (#60)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed, Oct 22, 2025 at 3:02 AM Joel Jacobson <joel@compiler.org> wrote:

On Wed, Oct 22, 2025, at 02:16, Matheus Alcantara wrote:

Regarding the v8 patch, it introduces a fundamentally new way of
managing notification entries (adding entries with 'committed' state
and marking them 'aborted' in abort paths). This affects all use
cases, not just those involving very old unconsumed notifications, and
could introduce more serious bugs like PANIC or SEGV. For
backpatching, I prefer targeting just the problematic behavior while
leaving unrelated parts unchanged. Though Álvaro might have a
different perspective on this.

Thanks very much for this explanation and for what you've previously
wrote on [1]. It's clear to me now that the v8 architecture is not a
good way to go.

How about doing some more work in vac_update_datfrozenxid()?

Pseudo-code sketch:

```
void
vac_update_datfrozenxid(void)
{

/* After computing newFrozenXid from all known sources... */

TransactionId oldestNotifyXid = GetOldestQueuedNotifyXid();

if (TransactionIdIsValid(oldestNotifyXid) &&
TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
{
/*
* The async queue has XIDs older than our proposed freeze point.
* Attempt cleanup, then back off and let the next VACUUM benefit.
*/

if (asyncQueueHasListeners())
{
/*
* Wake all listening backends across *all* databases
* that are not already at QUEUE_HEAD.
* They'll hopefully process notifications and advance
* their pointers, allowing the next VACUUM to freeze further.
*/
asyncQueueWakeAllListeners();
}
else
{
/*
* No listeners exist - discard all unread notifications.
* The next VACUUM should succeed in advancing datfrozenxid.
* asyncQueueAdvanceTailNoListeners() would take exclusive lock
* on NotifyQueueLock before checking
* QUEUE_FIRST_LISTENER == INVALID_PROC_NUMBER
*/
asyncQueueAdvanceTailNoListeners();
}

/*
* Back off datfrozenxid to protect the old XIDs.
* The cleanup we just performed should allow the next VACUUM
* to freeze further.
*/
newFrozenXid = oldestNotifyXid;
}
}
```

Maybe it wouldn't solve all problematic situations, but to me it seems
like these measures could help many of them, or am I missing some
crucial insight here?

I agree we need to add something like this. Looks like with v10 it's
possible for the listen/notify queue to block datfrozenxid advancing
even without extreme circumstances (without hanging listeners etc).

I see two thing we should take care of in v10:

1) Currently asyncQueueAdvanceTail is called regularly only if we have
a constant flow of notifications. We try to advance the tail every
time when the head reaches every 4th page. So if we don't have new
notifications, the tail will stay where it is forever. It means that
if we have at least 1 notification in the queue without constant flow
of notifications then GetOldestQueuedNotifyXid will constantly return
the same result which will block the advancement of datfrozenxid. So
it looks like we need something that will advance the tail in this
case.

2) Currently we wake up all listeners regularly only if we have a
constant flow of notifications (again). Even if we have a listener and
we never wake up such a listener because of the new notifications (for
example there are no new notifications in the listener's database), we
still signal such a listener sometimes as its position lags too much
from the head. But again, if we don't have new notifications, it's
possible that such a listener will never process some notification
(from another database) and tail advancement will be blocked. As a
result, datfrozenxid advancement also will be blocked. So probably we
need something that will wake up lagging listeners if they block tail
advancement.

Best regards,
Arseniy Mukhin

#62

Joel Jacobson

joel@compiler.org

3 months ago

In reply to: Arseniy Mukhin (#61)

2 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri, Oct 24, 2025, at 11:55, Arseniy Mukhin wrote:

On Wed, Oct 22, 2025 at 3:02 AM Joel Jacobson <joel@compiler.org> wrote:

How about doing some more work in vac_update_datfrozenxid()?

Pseudo-code sketch:

...

I agree we need to add something like this. Looks like with v10 it's
possible for the listen/notify queue to block datfrozenxid advancing
even without extreme circumstances (without hanging listeners etc).

Attached, two implementations of the sketched out idea, which can be
applied on top of the v10 patch. They should both be functionally
equivalent.

vacuum_notify_queue_cleanup-with-code-dup.txt:

This version doesn't try at all to avoid code duplication;
asyncQueueAdvanceTailNoListeners is very similar to SignalBackends, and
asyncQueueAdvanceTailNoListeners is very similar to
asyncQueueAdvanceTail. I think this might be preferable, if the channel
hash optimization that we're working on in the other thread, as a bonus
solves these fundamental problems, so that these added safety
functionality can be eliminated. I think it's quite likely we can
achieve that, but not certain.

vacuum_notify_queue_cleanup-without-code-dup.txt:

This version instead equips SignalBackends and asyncQueueAdvanceTail
with a new boolean input parameter to control their behavior, where
passing false gives the current behavior, used at the current call
sites, and vacuum would pass true, to signal all non-caught-up backends
in all databases and forcibly advance the tail when there are no
listening backends.

I have no strong preference for one or the other.

/Joel

Attachments:

vacuum_notify_queue_cleanup-without-code-dup.txttext/plain; name="=?UTF-8?Q?vacuum=5Fnotify=5Fqueue=5Fcleanup-without-code-dup.txt?="Download

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 7c9d7831c9f..1805ddb36ce 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -444,13 +444,11 @@ static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
 static ListCell *asyncQueueAddEntries(ListCell *nextNotify);
 static double asyncQueueUsage(void);
 static void asyncQueueFillWarning(void);
-static void SignalBackends(void);
 static void asyncQueueReadAllNotifications(void);
 static bool asyncQueueProcessPageEntries(volatile QueuePosition *current,
 										 QueuePosition stop,
 										 char *page_buffer,
 										 Snapshot snapshot);
-static void asyncQueueAdvanceTail(void);
 static void ProcessIncomingNotify(bool flush);
 static bool AsyncExistsPendingNotify(Notification *n);
 static void AddEventToPendingNotifies(Notification *n);
@@ -1011,7 +1009,7 @@ AtCommit_Notify(void)
 	 * PreCommit_Notify().
 	 */
 	if (pendingNotifies != NULL)
-		SignalBackends();
+		SignalBackends(false);
 
 	/*
 	 * If it's time to try to advance the global tail pointer, do that.
@@ -1025,7 +1023,7 @@ AtCommit_Notify(void)
 	if (tryAdvanceTail)
 	{
 		tryAdvanceTail = false;
-		asyncQueueAdvanceTail();
+		asyncQueueAdvanceTail(false);
 	}
 
 	/* And clean up */
@@ -1483,7 +1481,7 @@ pg_notification_queue_usage(PG_FUNCTION_ARGS)
 	double		usage;
 
 	/* Advance the queue tail so we don't report a too-large result */
-	asyncQueueAdvanceTail();
+	asyncQueueAdvanceTail(false);
 
 	LWLockAcquire(NotifyQueueLock, LW_SHARED);
 	usage = asyncQueueUsage();
@@ -1576,9 +1574,16 @@ asyncQueueFillWarning(void)
  *
  * This is called during CommitTransaction(), so it's important for it
  * to have very low probability of failure.
+ *
+ * If all_databases is false (normal NOTIFY), we signal listeners in our own
+ * database unless they're caught up, and listeners in other databases only
+ * if they are far behind (QUEUE_CLEANUP_DELAY pages).
+ *
+ * If all_databases is true (VACUUM cleanup), we signal all listeners across
+ * all databases that aren't already caught up, with no distance filtering.
  */
-static void
-SignalBackends(void)
+void
+SignalBackends(bool all_databases)
 {
 	int32	   *pids;
 	ProcNumber *procnos;
@@ -1604,25 +1609,21 @@ SignalBackends(void)
 
 		Assert(pid != InvalidPid);
 		pos = QUEUE_BACKEND_POS(i);
-		if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
-		{
-			/*
-			 * Always signal listeners in our own database, unless they're
-			 * already caught up (unlikely, but possible).
-			 */
-			if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
-				continue;
-		}
-		else
-		{
-			/*
-			 * Listeners in other databases should be signaled only if they
-			 * are far behind.
-			 */
-			if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
-								   QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
-				continue;
-		}
+
+		/*
+		 * Always skip backends that are already caught up.
+		 */
+		if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+			continue;
+
+		/*
+		 * Skip if we're not signaling all databases AND this is a different
+		 * database AND the listener is not far behind.
+		 */
+		if (!all_databases && QUEUE_BACKEND_DBOID(i) != MyDatabaseId &&
+			asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+							   QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+			continue;
 		/* OK, need to signal this one */
 		pids[count] = pid;
 		procnos[count] = i;
@@ -2188,15 +2189,38 @@ GetOldestQueuedNotifyXid(void)
 	return oldestXid;
 }
 
+/*
+ * Check if there are any active listeners in the notification queue.
+ *
+ * Returns true if at least one backend is registered as a listener,
+ * false otherwise.
+ */
+bool
+asyncQueueHasListeners(void)
+{
+	bool		hasListeners;
+
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+	hasListeners = (QUEUE_FIRST_LISTENER != INVALID_PROC_NUMBER);
+	LWLockRelease(NotifyQueueLock);
+
+	return hasListeners;
+}
+
 /*
  * Advance the shared queue tail variable to the minimum of all the
  * per-backend tail pointers.  Truncate pg_notify space if possible.
  *
- * This is (usually) called during CommitTransaction(), so it's important for
- * it to have very low probability of failure.
+ * If force_to_head is false (normal case), we compute the new tail as the
+ * minimum of all listener positions.  This is (usually) called during
+ * CommitTransaction(), so it's important for it to have very low probability
+ * of failure.
+ *
+ * If force_to_head is true (VACUUM cleanup), we advance the tail directly to
+ * the head, discarding all notifications, but only if there are no listeners.
  */
-static void
-asyncQueueAdvanceTail(void)
+void
+asyncQueueAdvanceTail(bool force_to_head)
 {
 	QueuePosition min;
 	int64		oldtailpage;
@@ -2224,12 +2248,38 @@ asyncQueueAdvanceTail(void)
 	 * to access the pages we are in the midst of truncating.
 	 */
 	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
-	min = QUEUE_HEAD;
-	for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+
+	if (force_to_head)
 	{
-		Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
-		min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+		/*
+		 * Verify that there are still no listeners.  It's possible
+		 * that a listener appeared since VACUUM checked.
+		 */
+		if (QUEUE_FIRST_LISTENER != INVALID_PROC_NUMBER)
+		{
+			LWLockRelease(NotifyQueueLock);
+			LWLockRelease(NotifyQueueTailLock);
+			return;
+		}
+
+		/*
+		 * Advance the logical tail to the head, discarding all notifications.
+		 */
+		min = QUEUE_HEAD;
 	}
+	else
+	{
+		/*
+		 * Normal case: compute minimum position from all listeners.
+		 */
+		min = QUEUE_HEAD;
+		for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+		{
+			Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
+			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+		}
+	}
+
 	QUEUE_TAIL = min;
 	oldtailpage = QUEUE_STOP_PAGE;
 	LWLockRelease(NotifyQueueLock);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6c601ce81aa..e4bc292ebd9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1739,11 +1739,24 @@ vac_update_datfrozenxid(void)
 	 * Also consider the oldest XID in the notification queue, since backends
 	 * will need to call TransactionIdDidCommit() on those XIDs when
 	 * processing the notifications.
+	 *
+	 * If the queue is blocking datfrozenxid advancement, attempt to clean it
+	 * up.  If listeners exist, wake them to process their pending
+	 * notifications.  If no listeners exist, discard all notifications.
+	 * Either way, we back off datfrozenxid for this VACUUM cycle; the next
+	 * VACUUM will benefit from the cleanup we've triggered.
 	 */
 	oldestNotifyXid = GetOldestQueuedNotifyXid();
 	if (TransactionIdIsValid(oldestNotifyXid) &&
 		TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
+	{
+		if (asyncQueueHasListeners())
+			SignalBackends(true);
+		else
+			asyncQueueAdvanceTail(true);
+
 		newFrozenXid = oldestNotifyXid;
+	}
 
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index ac323ada492..bb442940c29 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -29,6 +29,11 @@ extern void NotifyMyFrontEnd(const char *channel,
 /* get oldest XID in the notification queue for vacuum freeze */
 extern TransactionId GetOldestQueuedNotifyXid(void);
 
+/* functions for vacuum to manage notification queue */
+extern bool asyncQueueHasListeners(void);
+extern void SignalBackends(bool all_databases);
+extern void asyncQueueAdvanceTail(bool force_to_head);
+
 /* notify-related SQL statements */
 extern void Async_Notify(const char *channel, const char *payload);
 extern void Async_Listen(const char *channel);

vacuum_notify_queue_cleanup-with-code-dup.txttext/plain; name=vacuum_notify_queue_cleanup-with-code-dup.txtDownload

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 7c9d7831c9f..c8c0ab66b66 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2188,6 +2188,164 @@ GetOldestQueuedNotifyXid(void)
 	return oldestXid;
 }
 
+/*
+ * Check if there are any active listeners in the notification queue.
+ *
+ * Returns true if at least one backend is registered as a listener,
+ * false otherwise.
+ */
+bool
+asyncQueueHasListeners(void)
+{
+	bool		hasListeners;
+
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+	hasListeners = (QUEUE_FIRST_LISTENER != INVALID_PROC_NUMBER);
+	LWLockRelease(NotifyQueueLock);
+
+	return hasListeners;
+}
+
+/*
+ * Wake all listening backends to process notifications.
+ *
+ * This is called by VACUUM when it needs to advance datfrozenxid but the
+ * notification queue has old XIDs.  We signal all listeners across all
+ * databases that aren't already caught up, so they can process their
+ * pending notifications and advance the queue tail.
+ */
+void
+asyncQueueWakeAllListeners(void)
+{
+	int32	   *pids;
+	ProcNumber *procnos;
+	int			count;
+
+	/*
+	 * Identify backends that we need to signal.  We don't want to send
+	 * signals while holding the NotifyQueueLock, so this loop just builds a
+	 * list of target PIDs.
+	 *
+	 * XXX in principle these pallocs could fail, which would be bad. Maybe
+	 * preallocate the arrays?  They're not that large, though.
+	 */
+	pids = (int32 *) palloc(MaxBackends * sizeof(int32));
+	procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+	count = 0;
+
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+	for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+	{
+		int32		pid = QUEUE_BACKEND_PID(i);
+		QueuePosition pos;
+
+		Assert(pid != InvalidPid);
+		pos = QUEUE_BACKEND_POS(i);
+
+		/*
+		 * Signal listeners unless they're already caught up.
+		 */
+		if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+			continue;
+
+		/* OK, need to signal this one */
+		pids[count] = pid;
+		procnos[count] = i;
+		count++;
+	}
+	LWLockRelease(NotifyQueueLock);
+
+	/* Now send signals */
+	for (int i = 0; i < count; i++)
+	{
+		int32		pid = pids[i];
+
+		/*
+		 * If we are signaling our own process, no need to involve the kernel;
+		 * just set the flag directly.
+		 */
+		if (pid == MyProcPid)
+		{
+			notifyInterruptPending = true;
+			continue;
+		}
+
+		/*
+		 * Note: assuming things aren't broken, a signal failure here could
+		 * only occur if the target backend exited since we released
+		 * NotifyQueueLock; which is unlikely but certainly possible. So we
+		 * just log a low-level debug message if it happens.
+		 */
+		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
+			elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
+	}
+
+	pfree(pids);
+	pfree(procnos);
+}
+
+/*
+ * Discard all notifications in the queue when there are no listeners.
+ *
+ * This is called by VACUUM when the notification queue has old XIDs but no
+ * active listeners exist.  We advance the tail to the head, effectively
+ * discarding all queued notifications, and truncate the SLRU segments.
+ */
+void
+asyncQueueAdvanceTailNoListeners(void)
+{
+	QueuePosition min;
+	int64		oldtailpage;
+	int64		newtailpage;
+	int64		boundary;
+
+	/* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
+	LWLockAcquire(NotifyQueueTailLock, LW_EXCLUSIVE);
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+	/*
+	 * Verify that there are still no listeners.
+	 */
+	if (QUEUE_FIRST_LISTENER != INVALID_PROC_NUMBER)
+	{
+		LWLockRelease(NotifyQueueLock);
+		LWLockRelease(NotifyQueueTailLock);
+		return;
+	}
+
+	/*
+	 * Advance the logical tail to the head, discarding all notifications.
+	 */
+	min = QUEUE_HEAD;
+	QUEUE_TAIL = min;
+	oldtailpage = QUEUE_STOP_PAGE;
+	LWLockRelease(NotifyQueueLock);
+
+	/*
+	 * We can truncate something if the global tail advanced across an SLRU
+	 * segment boundary.
+	 *
+	 * XXX it might be better to truncate only once every several segments, to
+	 * reduce the number of directory scans.
+	 */
+	newtailpage = QUEUE_POS_PAGE(min);
+	boundary = newtailpage - (newtailpage % SLRU_PAGES_PER_SEGMENT);
+	if (asyncQueuePagePrecedes(oldtailpage, boundary))
+	{
+		/*
+		 * SimpleLruTruncate() will ask for SLRU bank locks but will also
+		 * release the lock again.
+		 */
+		SimpleLruTruncate(NotifyCtl, newtailpage);
+
+		LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+		QUEUE_STOP_PAGE = newtailpage;
+		LWLockRelease(NotifyQueueLock);
+	}
+
+	LWLockRelease(NotifyQueueTailLock);
+}
+
 /*
  * Advance the shared queue tail variable to the minimum of all the
  * per-backend tail pointers.  Truncate pg_notify space if possible.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6c601ce81aa..f93f82e9040 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1739,11 +1739,24 @@ vac_update_datfrozenxid(void)
 	 * Also consider the oldest XID in the notification queue, since backends
 	 * will need to call TransactionIdDidCommit() on those XIDs when
 	 * processing the notifications.
+	 *
+	 * If the queue is blocking datfrozenxid advancement, attempt to clean it
+	 * up.  If listeners exist, wake them to process their pending
+	 * notifications.  If no listeners exist, discard all notifications.
+	 * Either way, we back off datfrozenxid for this VACUUM cycle; the next
+	 * VACUUM will benefit from the cleanup we've triggered.
 	 */
 	oldestNotifyXid = GetOldestQueuedNotifyXid();
 	if (TransactionIdIsValid(oldestNotifyXid) &&
 		TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
+	{
+		if (asyncQueueHasListeners())
+			asyncQueueWakeAllListeners();
+		else
+			asyncQueueAdvanceTailNoListeners();
+
 		newFrozenXid = oldestNotifyXid;
+	}
 
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index ac323ada492..f9dccf342b5 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -29,6 +29,11 @@ extern void NotifyMyFrontEnd(const char *channel,
 /* get oldest XID in the notification queue for vacuum freeze */
 extern TransactionId GetOldestQueuedNotifyXid(void);
 
+/* functions for vacuum to manage notification queue */
+extern bool asyncQueueHasListeners(void);
+extern void asyncQueueWakeAllListeners(void);
+extern void asyncQueueAdvanceTailNoListeners(void);
+
 /* notify-related SQL statements */
 extern void Async_Notify(const char *channel, const char *payload);
 extern void Async_Listen(const char *channel);

#63

Matheus Alcantara

matheusssilv97@gmail.com

3 months ago

In reply to: Masahiko Sawada (#59)

2 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Thanks for the review!

On Thu Oct 23, 2025 at 8:42 PM -03, Masahiko Sawada wrote:

On Wed, Oct 22, 2025 at 10:25 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Wed Oct 22, 2025 at 12:49 PM -03, Álvaro Herrera wrote:

On 2025-Oct-22, Matheus Alcantara wrote:

I'm wondering if the 002_aborted_tx_notifies.pl is still needed with
this architecture being used. I think that it's not, but perhaps is a
good test to keep it?

I'd rather have tests than not, but I'd think it needs to be behind
PG_TEST_EXTRA because of things like

+$node->safe_psql('postgres', 'select consume_xids(10000000);');

Attached v10 with wrapping into PG_TEST_EXTRA.

Thank you for updating the patches!

I've reviewed the patches and here are the comments.

v10-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patch:
+ *
+ * Returns InvalidTransactionId if there are no unprocessed notifications.
This comment is not accurate since the function returns
InvalidTransactionId even if all unprocessed notifications are created
on other databases.

Fixed

---
+TransactionId
+GetOldestQueuedNotifyXid(void)
+{
How about renaming it to something like GetOldestNotifyTransactionId()?

Yeap, sounds better, fixed.

---
+   /* page_buffer must be adequately aligned, so use a union */
+   union
+   {
+       char        buf[QUEUE_PAGESIZE];
+       AsyncQueueEntry align;
+   }           page_buffer;
asyncQueueReadAllNotifications() uses this union too, so how about
define this type as AlignedQueueEntryPage or something and use it in
both functions?

Good point, fixed.

---
+
+   LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
I don't think we need an exclusive lock here.

Fixed.

---
In GetOldestQueuedNotifyXid() why do we keep holding NotifyQueueLock
while calculating the oldest XID in the queue?

Yeah, I don't think that it's necessary. The
asyncQueueReadAllNotifications() for example only hold the lock when
reading the QUEUE_BACKEND_POS and QUEUE_HEAD. I think that it's a
similar case here, fixed.

---
+
+       /* Process all entries on this page up to head */
+       while (curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <=
QUEUE_PAGESIZE &&
+              !QUEUE_POS_EQUAL(pos, head))
+       {
+           qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
(snip)
+           /* Advance to next entry */
+           if (asyncQueueAdvance(&pos, qe->length))
+               break;          /* advanced to next page */
+
I think the check "curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <=
QUEUE_PAGESIZE" is essentially the same as the one we do in
asyncQueueAdvance(). I think we can refactor the inner loop in
GetOldestQueuedNotifyXid() to something like:

for (;;)
{
bool reachedEndOfPage;

qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);

// check qe->xid here...

reachedEndOfPage = asyncQueueAdvance(&pos, qe->length);

if (reachedEndOfPage || QUEUE_POS_EQUAL(pos, head))
break;
}

Fixed

---
v10-0002-Add-tap-tests-for-listen-notify-vacuum-freeze.patch:

This new test directory needs to have .gitignore.

Fixed

---
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
It's better to have a short description of this test here.

Fixed

---
+use File::Path qw(mkpath);
It seems not necessary.

Fixed

---
+$node->safe_psql('postgres',
+   'CREATE TABLE t AS SELECT g AS a, g+2 AS b from
generate_series(1,100000) g;'

Why does it need to insert many rows?

I think that this value was left when I was writing the first versions of
this test, I don't think that it's necessary. I reduced to just 10 rows.

---
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+   $node->safe_psql(
+       'postgres', "
+       BEGIN;
+       NOTIFY s, '$i';
+       COMMIT;");
+}

Why does it need to send a notification 10 times?

I just wanted to test with multiple notifications, we can reduce to two
or three, there is not specific reason to send 10 notifications.

---
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+
I guess it consumes XID too much to trigger truncation. Given the one
clog segment file is 256kB in size, it's enough to consume 1,050,000
XIDs to move to the next segment file.

Fixed

---
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
+# we can still get the notification status of the notification
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select
datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
I think this test passes in all cases, so it is meaningless. Instead,
what we need to check in terms of datfrozenxid is that its value
doesn't get greater than the XID we used for the notification, even
after vacuum freeze. I think we remember the XID used for NOTIFY and
check if the old/new datfrozenxid values are older than that value.

I think that was the goal of this test. I've swap the values to make it
more clear. Please let me know if I misunderstood your point here.

---
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
After consuming the unconsumed notifications, let's do vacuum freeze
and check if datfrozenxid now can be advanced.

Good point, during this I realize that the datafrozenxid was not being
advanced even after the queue being consumed. This was because when a
backend listener is consuming the notifications it only update their own
queue tail pointer and not the global shared tail pointer, so I've
included a call to asyncQueueAdvanceTail() at the beginning of
GetOldestNotifyTransactionId() to do this.

---
I would expect to add 002_aborted_tx_notifies.pl in a separate patch
since it's not related to this bug fix.

On the new attached version I've moved this test to a separated patch.

---
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
I wonder if we could implement this test using the isolation test
instead of the tap test. Is there any reason why you used a tap test
for that?

On this new version I just moved the test into a separated patch but I
agree that we can implement this using the isolation test,

I don't know how we should handle the follow-up patches that Joel have
sent on [1]/messages/by-id/b6f6cbb8-1903-4ca0-ae64-01d84d1e12a3@app.fastmail.com, if we should incoportate on 0001 or not. I still need to
review and test them.

[1]: /messages/by-id/b6f6cbb8-1903-4ca0-ae64-01d84d1e12a3@app.fastmail.com

--
Matheus Alcantara

Attachments:

v11-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patchtext/plain; charset=utf-8; name=v11-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patchDownload

From 37675f6feff30a365de3b26719002b65a81dabb3 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <joel@compiler.org>
Date: Sun, 19 Oct 2025 18:55:25 +0200
Subject: [PATCH v11 1/2] Prevent VACUUM from truncating XIDs still present in
 notification queue

VACUUM's computation of datfrozenxid did not account for transaction IDs
in the LISTEN/NOTIFY queue.  This allowed VACUUM to truncate clog
entries for XIDs that were still referenced by queued notifications,
causing backends to fail in TransactionIdDidCommit when later processing
those notifications.

Fix by adding GetOldestQueuedNotifyXid to find the oldest XID in queued
notifications for the current database, and constraining datfrozenxid to
not pass that.  The function scans from QUEUE_TAIL, since notifications
may have been written before any listeners existed.

To avoid code duplication, refactor SLRU page-reading code into a new
helper function asyncQueueReadPageToBuffer.

Co-authored-by: Matheus Alcantara <matheusssilv97@gmail.com>
---
 src/backend/commands/async.c                  | 156 ++++++++++++++----
 src/backend/commands/vacuum.c                 |  12 ++
 src/include/commands/async.h                  |   3 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 .../modules/test_listen_notify/.gitignore     |   4 +
 src/test/modules/test_listen_notify/Makefile  |  19 +++
 .../modules/test_listen_notify/meson.build    |  13 ++
 .../test_listen_notify/t/001_xid_freeze.pl    |  89 ++++++++++
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 268 insertions(+), 31 deletions(-)
 create mode 100644 src/test/modules/test_listen_notify/.gitignore
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..8fc1bbba7a2 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -401,6 +401,17 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*
+ * Page buffers used to read from SLRU cache must be adequately aligned,
+ * so use a union.
+ */
+typedef union
+{
+	char		buf[QUEUE_PAGESIZE];
+	AsyncQueueEntry align;
+} AlignedQueueEntryPage;
+
+
 static NotificationList *pendingNotifies = NULL;
 
 /*
@@ -1841,6 +1852,44 @@ ProcessNotifyInterrupt(bool flush)
 		ProcessIncomingNotify(flush);
 }
 
+/*
+ * Read a page from the SLRU queue into a local buffer.
+ *
+ * Reads the page containing 'pos', copying the data from the current offset
+ * either to the end of the page or up to 'head' (whichever comes first)
+ * into page_buffer.
+ */
+static void
+asyncQueueReadPageToBuffer(QueuePosition pos, QueuePosition head,
+						   char *page_buffer)
+{
+	int64		curpage = QUEUE_POS_PAGE(pos);
+	int			curoffset = QUEUE_POS_OFFSET(pos);
+	int			slotno;
+	int			copysize;
+
+	slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+										InvalidTransactionId);
+
+	if (curpage == QUEUE_POS_PAGE(head))
+	{
+		/* we only want to read as far as head */
+		copysize = QUEUE_POS_OFFSET(head) - curoffset;
+		if (copysize < 0)
+			copysize = 0;		/* just for safety */
+	}
+	else
+	{
+		/* fetch all the rest of the page */
+		copysize = QUEUE_PAGESIZE - curoffset;
+	}
+
+	memcpy(page_buffer + curoffset,
+		   NotifyCtl->shared->page_buffer[slotno] + curoffset,
+		   copysize);
+	/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+	LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+}
 
 /*
  * Read all pending notifications from the queue, and deliver appropriate
@@ -1853,13 +1902,7 @@ asyncQueueReadAllNotifications(void)
 	volatile QueuePosition pos;
 	QueuePosition head;
 	Snapshot	snapshot;
-
-	/* page_buffer must be adequately aligned, so use a union */
-	union
-	{
-		char		buf[QUEUE_PAGESIZE];
-		AsyncQueueEntry align;
-	}			page_buffer;
+	AlignedQueueEntryPage page_buffer;
 
 	/* Fetch current state */
 	LWLockAcquire(NotifyQueueLock, LW_SHARED);
@@ -1932,36 +1975,13 @@ asyncQueueReadAllNotifications(void)
 
 		do
 		{
-			int64		curpage = QUEUE_POS_PAGE(pos);
-			int			curoffset = QUEUE_POS_OFFSET(pos);
-			int			slotno;
-			int			copysize;
-
 			/*
 			 * We copy the data from SLRU into a local buffer, so as to avoid
 			 * holding the SLRU lock while we are examining the entries and
 			 * possibly transmitting them to our frontend.  Copy only the part
 			 * of the page we will actually inspect.
 			 */
-			slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
-												InvalidTransactionId);
-			if (curpage == QUEUE_POS_PAGE(head))
-			{
-				/* we only want to read as far as head */
-				copysize = QUEUE_POS_OFFSET(head) - curoffset;
-				if (copysize < 0)
-					copysize = 0;	/* just for safety */
-			}
-			else
-			{
-				/* fetch all the rest of the page */
-				copysize = QUEUE_PAGESIZE - curoffset;
-			}
-			memcpy(page_buffer.buf + curoffset,
-				   NotifyCtl->shared->page_buffer[slotno] + curoffset,
-				   copysize);
-			/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
-			LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+			asyncQueueReadPageToBuffer(pos, head, page_buffer.buf);
 
 			/*
 			 * Process messages up to the stop position, end of page, or an
@@ -2097,6 +2117,80 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 	return reachedStop;
 }
 
+/*
+ * Get the oldest XID in the notification queue that has not yet been
+ * processed by all listening backends.
+ *
+ * Returns InvalidTransactionId if there are no unprocessed notifications or if
+ * all unprocessed notifications are created on other databases different from
+ * MyDatabaseId.
+ */
+TransactionId
+GetOldestNotifyTransactionId(void)
+{
+	QueuePosition pos;
+	QueuePosition head;
+	AlignedQueueEntryPage page_buffer;
+	TransactionId oldestXid = InvalidTransactionId;
+
+	/* First advance the shared queue tail pointer */
+	asyncQueueAdvanceTail();
+
+	/*
+	 * We must start at QUEUE_TAIL since notification data might have been
+	 * written before there were any listening backends.
+	 */
+	LWLockAcquire(NotifyQueueLock, LW_SHARED);
+	pos = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+	LWLockRelease(NotifyQueueLock);
+
+	/* If the queue is empty, no XIDs need protection */
+	if (QUEUE_POS_EQUAL(pos, head))
+		return InvalidTransactionId;
+
+	while (!QUEUE_POS_EQUAL(pos, head))
+	{
+		int			curoffset;
+		AsyncQueueEntry *qe;
+
+		/* Read the current page from SLRU into our local buffer */
+		asyncQueueReadPageToBuffer(pos, head, page_buffer.buf);
+
+		curoffset = QUEUE_POS_OFFSET(pos);
+
+		/* Process all entries on this page up to head */
+		for (;;)
+		{
+			bool		reachedEndOfPage;
+
+			qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
+
+			/*
+			 * Check if this entry is for our database and has a valid XID.
+			 * Only entries for our database matter for our datfrozenxid.
+			 */
+			if (qe->dboid == MyDatabaseId && TransactionIdIsValid(qe->xid))
+			{
+				if (!TransactionIdIsValid(oldestXid) ||
+					TransactionIdPrecedes(qe->xid, oldestXid))
+					oldestXid = qe->xid;
+			}
+
+			/* Advance to next entry */
+			reachedEndOfPage = asyncQueueAdvance(&pos, qe->length);
+
+			if (reachedEndOfPage || QUEUE_POS_EQUAL(pos, head))
+				break;
+
+
+			curoffset = QUEUE_POS_OFFSET(pos);
+		}
+	}
+
+	return oldestXid;
+}
+
 /*
  * Advance the shared queue tail variable to the minimum of all the
  * per-backend tail pointers.  Truncate pg_notify space if possible.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..e5fedfb3238 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1617,6 +1618,7 @@ vac_update_datfrozenxid(void)
 	bool		dirty = false;
 	ScanKeyData key[1];
 	void	   *inplace_state;
+	TransactionId oldestNotifyXid;
 
 	/*
 	 * Restrict this task to one backend per database.  This avoids race
@@ -1733,6 +1735,16 @@ vac_update_datfrozenxid(void)
 	if (bogus)
 		return;
 
+	/*
+	 * Also consider the oldest XID in the notification queue, since backends
+	 * will need to call TransactionIdDidCommit() on those XIDs when
+	 * processing the notifications.
+	 */
+	oldestNotifyXid = GetOldestNotifyTransactionId();
+	if (TransactionIdIsValid(oldestNotifyXid) &&
+		TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
+		newFrozenXid = oldestNotifyXid;
+
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
 
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..0f8f17ad22b 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -26,6 +26,9 @@ extern void NotifyMyFrontEnd(const char *channel,
 							 const char *payload,
 							 int32 srcPid);
 
+/* get oldest XID in the notification queue for vacuum freeze */
+extern TransactionId GetOldestNotifyTransactionId(void);
+
 /* notify-related SQL statements */
 extern void Async_Notify(const char *channel, const char *payload);
 extern void Async_Listen(const char *channel);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a7954101..a015c961d35 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cf..6af33448d7b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/.gitignore b/src/test/modules/test_listen_notify/.gitignore
new file mode 100644
index 00000000000..5dcb3ff9723
--- /dev/null
+++ b/src/test/modules/test_listen_notify/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..c1eb4fde370
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,19 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+EXTRA_INSTALL=src/test/modules/xid_wraparound
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..ebf9444108a
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,13 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..c5553d6c792
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,89 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+# Test that VACUUM FREEZE don't remove clog files that are needed to check the
+# transaction status of notifications that are on the LISTEN/NOTIFY queue
+# during its execution. The VACUUM FREEZE operation should check the oldest xid
+# on the queue during execution.
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Check if the extension xid_wraparound is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('xid_wraparound'))
+{
+	plan skip_all => 'Extension xid_wraparound not installed';
+}
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,10) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(1050000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is not advanced
+# to a value greater than the xid used to send the notifications.
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+print("\n\n$datafronzenxid < $datafronzenxid_freeze\n\n");
+ok($datafronzenxid < $datafronzenxid_freeze, 'datfrozenxid is not fully advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+# Execute vacuum freeze on all databases and ensure that the datafrozenxid is advanced
+
+$datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+$datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+ok($datafronzenxid_freeze > $datafronzenxid, "datfrozenxid is advanced: $datafronzenxid_freeze > $datafronzenxid");
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 43fe3bcd593..a8aa1365382 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -58,6 +58,7 @@ Aggref
 AggregateInstrumentation
 AlenState
 Alias
+AlignedQueueEntryPage
 AllocBlock
 AllocFreeListLink
 AllocPointer
-- 
2.51.0

v11-0002-Add-more-tests-for-listen-notify-vacuum-freeze.patchtext/plain; charset=utf-8; name=v11-0002-Add-more-tests-for-listen-notify-vacuum-freeze.patchDownload

From cad38eb3c267ac1d46dfbd1da546500c0985c28b Mon Sep 17 00:00:00 2001
From: Matheus Alcantara <mths.dev@pm.me>
Date: Fri, 24 Oct 2025 21:14:38 -0300
Subject: [PATCH v11 2/2] Add more tests for listen notify vacuum freeze

Author: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com
---
 .../modules/test_listen_notify/meson.build    |  3 +-
 .../t/002_aborted_tx_notifies.pl              | 84 +++++++++++++++++++
 2 files changed, 86 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl

diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
index ebf9444108a..a68052cd353 100644
--- a/src/test/modules/test_listen_notify/meson.build
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -6,7 +6,8 @@ tests += {
   'bd': meson.current_build_dir(),
   'tap': {
     'tests': [
-      't/001_xid_freeze.pl'
+      't/001_xid_freeze.pl',
+      't/002_aborted_tx_notifies.pl'
     ],
   },
 }
diff --git a/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
new file mode 100644
index 00000000000..464c959a5a9
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/002_aborted_tx_notifies.pl
@@ -0,0 +1,84 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use File::Path qw(mkpath);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\blisten_notify\b/)
+{
+	plan skip_all => "test listen_notify not enabled in PG_TEST_EXTRA";
+}
+
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Setup
+$node->safe_psql('postgres', 'CREATE TABLE t1 (a bigserial);');
+
+# Listener
+my $psql_listener = $node->background_psql('postgres');
+$psql_listener->query_safe('LISTEN ch;');
+
+# Session1. Start SERIALIZABLE tx and add a notification.
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+	NOTIFY ch,'committed_0';
+	NOTIFY ch,'committed_1';
+");
+
+# Session2. Start SERIALIZABLE tx, add a notification and introduce a conflict
+# with session1.
+my $psql_session2 = $node->background_psql('postgres', on_error_stop => 0);
+$psql_session2->query_safe("
+	BEGIN ISOLATION LEVEL SERIALIZABLE;
+	SELECT * FROM t1;
+	INSERT INTO t1 DEFAULT VALUES;
+");
+
+# Send notifications that should not be eventually delivered, as session2
+# transaction will be aborted.
+my $message = 'aborted_' . 'a' x 1000;
+for (my $i = 0; $i < 10; $i++) {
+    $psql_session2->query_safe("NOTIFY ch, '$i$message'");
+}
+
+# Session1 should be committed successfully. Listeners must receive session1
+# notifications.
+$psql_session1->query_safe("COMMIT;");
+
+# Session2 should be aborted due to the conflict with session1. Transaction
+# is aborted after adding notifications to the listen/notify queue, but
+# listeners should not receive session2 notifications.
+$psql_session2->query("COMMIT;");
+
+# send more notifications after aborted
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_2';");
+$node->safe_psql('postgres', "NOTIFY ch, 'committed_3';");
+
+# fetch notifications
+my $res = $psql_listener->query_safe('begin; commit;');
+
+# check received notifications
+my @lines = split('\n', $res);
+is(@lines, 4, 'received all committed notifications');
+for (my $i = 0; $i < 4; $i++) {
+    like($lines[$i], qr/Asynchronous notification "ch" with payload "committed_$i" received/);
+}
+
+ok($psql_listener->quit);
+ok($psql_session1->quit);
+ok($psql_session2->quit);
+
+done_testing();
-- 
2.51.0

#64

Matheus Alcantara

matheusssilv97@gmail.com

3 months ago

In reply to: Arseniy Mukhin (#60)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri Oct 24, 2025 at 4:36 AM -03, Arseniy Mukhin wrote:

I would expect to add 002_aborted_tx_notifies.pl in a separate patch
since it's not related to this bug fix.
---
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
I wonder if we could implement this test using the isolation test
instead of the tap test. Is there any reason why you used a tap test
for that?
I agree it's less relevant to the patch now than it was with the new
'committed' field approach. And there is no particular reason why it
was implemented as a TAP test actually.. So +1 to move it to separate
patch (does it mean to separate thread as well or just separate patch
file?) and rewrite as an isolation test (IIUC it's better to use
isolation test infrastructure if it's possible). I can try to do it if
nobody else does it earlier.

On the v11 version that I've sent on [1]/messages/by-id/DDQZB2AD34V4.3RH2USCA72AS8@gmail.com I've move this test into a
separate patch, please feel free to implement it as an isolation test if
you want it.

[1]: /messages/by-id/DDQZB2AD34V4.3RH2USCA72AS8@gmail.com

--
Matheus Alcantara

#65

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

3 months ago

In reply to: Matheus Alcantara (#64)

2 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Sat, Oct 25, 2025 at 3:33 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:

On Fri Oct 24, 2025 at 4:36 AM -03, Arseniy Mukhin wrote:
I would expect to add 002_aborted_tx_notifies.pl in a separate patch
since it's not related to this bug fix.
---
+# Test checks that listeners do not receive notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
I wonder if we could implement this test using the isolation test
instead of the tap test. Is there any reason why you used a tap test
for that?
I agree it's less relevant to the patch now than it was with the new
'committed' field approach. And there is no particular reason why it
was implemented as a TAP test actually.. So +1 to move it to separate
patch (does it mean to separate thread as well or just separate patch
file?) and rewrite as an isolation test (IIUC it's better to use
isolation test infrastructure if it's possible). I can try to do it if
nobody else does it earlier.
On the v11 version that I've sent on [1] I've move this test into a
separate patch, please feel free to implement it as an isolation test if
you want it.

Thank you!

I reimplemented the test in 0002 as an isolation test and added the
commit message. PFA the new version.

Best regards,
Arseniy Mukhin

Attachments:

v12-0002-Add-test-for-listen-notify.patchtext/x-patch; charset=US-ASCII; name=v12-0002-Add-test-for-listen-notify.patchDownload

From 494975f96d1295b05baf3063ef7aa05ecd2e11f1 Mon Sep 17 00:00:00 2001
From: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Date: Sat, 25 Oct 2025 15:15:51 +0300
Subject: [PATCH v12 2/2] Add test for listen/notify

This commit adds an isolation test showing that listeners ignore
notifications from aborted transactions even if they were added
to the listen/notify queue
---
 src/test/isolation/expected/async-notify.out | 33 +++++++++++++++++++-
 src/test/isolation/specs/async-notify.spec   | 24 ++++++++++++++
 2 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..20d5763f319 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 4 sessions
 
 starting permutation: listenc notify1 notify2 notify3 notifyf
 step listenc: LISTEN c1; LISTEN c2;
@@ -104,6 +104,37 @@ step l2commit: COMMIT;
 listener2: NOTIFY "c1" with payload "" from notifier
 step l2stop: UNLISTEN *;
 
+starting permutation: llisten n1begins n1select n1insert notify1 n2begins n2select n2insert n2notify1 n1commit n2commit notify1 lcheck
+step llisten: LISTEN c1; LISTEN c2;
+step n1begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step n1select: SELECT * FROM t1;
+a
+-
+(0 rows)
+
+step n1insert: INSERT INTO t1 DEFAULT VALUES;
+step notify1: NOTIFY c1;
+step n2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step n2select: SELECT * FROM t1;
+a
+-
+(0 rows)
+
+step n2insert: INSERT INTO t1 DEFAULT VALUES;
+step n2notify1: NOTIFY c1, 'n2_payload';
+step n1commit: COMMIT;
+step n2commit: COMMIT;
+ERROR:  could not serialize access due to read/write dependencies among transactions
+step notify1: NOTIFY c1;
+step lcheck: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener: NOTIFY "c1" with payload "" from notifier
+listener: NOTIFY "c1" with payload "" from notifier
+
 starting permutation: llisten lbegin usage bignotify usage
 step llisten: LISTEN c1; LISTEN c2;
 step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..51b7ad43849 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -5,6 +5,10 @@
 # Note we assume that each step is delivered to the backend as a single Query
 # message so it will run as one transaction.
 
+# t1 table is used for serializable conflict
+setup { CREATE TABLE t1 (a bigserial); }
+teardown { DROP TABLE t1; }
+
 session notifier
 step listenc	{ LISTEN c1; LISTEN c2; }
 step notify1	{ NOTIFY c1; }
@@ -33,8 +37,21 @@ step notifys1	{
 }
 step usage		{ SELECT pg_notification_queue_usage() > 0 AS nonzero; }
 step bignotify	{ SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
+step n1begins   { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step n1select   { SELECT * FROM t1; }
+step n1insert   { INSERT INTO t1 DEFAULT VALUES; }
+step n1commit   { COMMIT; }
 teardown		{ UNLISTEN *; }
 
+# notifier2 session is used to reproduce serializable conflict with notifier
+
+session notifier2
+step n2begins    { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step n2select    { SELECT * FROM t1; }
+step n2insert    { INSERT INTO t1 DEFAULT VALUES; }
+step n2commit    { COMMIT; }
+step n2notify1   { NOTIFY c1, 'n2_payload';  }
+
 # The listener session is used for cross-backend notify checks.
 
 session listener
@@ -73,6 +90,13 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
 # and notify queue is not empty
 permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
 
+# Test checks that listeners ignore notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+permutation llisten n1begins n1select n1insert notify1 n2begins n2select n2insert n2notify1 n1commit n2commit notify1 lcheck
+
 # Verify that pg_notification_queue_usage correctly reports a non-zero result,
 # after submitting notifications while another connection is listening for
 # those notifications and waiting inside an active transaction.  We have to
-- 
2.43.0

v12-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patchtext/x-patch; charset=US-ASCII; name=v12-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patchDownload

From eaf6da2921d647766a9c2398d59810ab042427ee Mon Sep 17 00:00:00 2001
From: Joel Jacobson <joel@compiler.org>
Date: Sun, 19 Oct 2025 18:55:25 +0200
Subject: [PATCH v12 1/2] Prevent VACUUM from truncating XIDs still present in
 notification queue

VACUUM's computation of datfrozenxid did not account for transaction IDs
in the LISTEN/NOTIFY queue.  This allowed VACUUM to truncate clog
entries for XIDs that were still referenced by queued notifications,
causing backends to fail in TransactionIdDidCommit when later processing
those notifications.

Fix by adding GetOldestQueuedNotifyXid to find the oldest XID in queued
notifications for the current database, and constraining datfrozenxid to
not pass that.  The function scans from QUEUE_TAIL, since notifications
may have been written before any listeners existed.

To avoid code duplication, refactor SLRU page-reading code into a new
helper function asyncQueueReadPageToBuffer.

Co-authored-by: Matheus Alcantara <matheusssilv97@gmail.com>
---
 src/backend/commands/async.c                  | 156 ++++++++++++++----
 src/backend/commands/vacuum.c                 |  12 ++
 src/include/commands/async.h                  |   3 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 .../modules/test_listen_notify/.gitignore     |   4 +
 src/test/modules/test_listen_notify/Makefile  |  19 +++
 .../modules/test_listen_notify/meson.build    |  13 ++
 .../test_listen_notify/t/001_xid_freeze.pl    |  89 ++++++++++
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 268 insertions(+), 31 deletions(-)
 create mode 100644 src/test/modules/test_listen_notify/.gitignore
 create mode 100644 src/test/modules/test_listen_notify/Makefile
 create mode 100644 src/test/modules/test_listen_notify/meson.build
 create mode 100644 src/test/modules/test_listen_notify/t/001_xid_freeze.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..8fc1bbba7a2 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -401,6 +401,17 @@ struct NotificationHash
 	Notification *event;		/* => the actual Notification struct */
 };
 
+/*
+ * Page buffers used to read from SLRU cache must be adequately aligned,
+ * so use a union.
+ */
+typedef union
+{
+	char		buf[QUEUE_PAGESIZE];
+	AsyncQueueEntry align;
+} AlignedQueueEntryPage;
+
+
 static NotificationList *pendingNotifies = NULL;
 
 /*
@@ -1841,6 +1852,44 @@ ProcessNotifyInterrupt(bool flush)
 		ProcessIncomingNotify(flush);
 }
 
+/*
+ * Read a page from the SLRU queue into a local buffer.
+ *
+ * Reads the page containing 'pos', copying the data from the current offset
+ * either to the end of the page or up to 'head' (whichever comes first)
+ * into page_buffer.
+ */
+static void
+asyncQueueReadPageToBuffer(QueuePosition pos, QueuePosition head,
+						   char *page_buffer)
+{
+	int64		curpage = QUEUE_POS_PAGE(pos);
+	int			curoffset = QUEUE_POS_OFFSET(pos);
+	int			slotno;
+	int			copysize;
+
+	slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+										InvalidTransactionId);
+
+	if (curpage == QUEUE_POS_PAGE(head))
+	{
+		/* we only want to read as far as head */
+		copysize = QUEUE_POS_OFFSET(head) - curoffset;
+		if (copysize < 0)
+			copysize = 0;		/* just for safety */
+	}
+	else
+	{
+		/* fetch all the rest of the page */
+		copysize = QUEUE_PAGESIZE - curoffset;
+	}
+
+	memcpy(page_buffer + curoffset,
+		   NotifyCtl->shared->page_buffer[slotno] + curoffset,
+		   copysize);
+	/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+	LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+}
 
 /*
  * Read all pending notifications from the queue, and deliver appropriate
@@ -1853,13 +1902,7 @@ asyncQueueReadAllNotifications(void)
 	volatile QueuePosition pos;
 	QueuePosition head;
 	Snapshot	snapshot;
-
-	/* page_buffer must be adequately aligned, so use a union */
-	union
-	{
-		char		buf[QUEUE_PAGESIZE];
-		AsyncQueueEntry align;
-	}			page_buffer;
+	AlignedQueueEntryPage page_buffer;
 
 	/* Fetch current state */
 	LWLockAcquire(NotifyQueueLock, LW_SHARED);
@@ -1932,36 +1975,13 @@ asyncQueueReadAllNotifications(void)
 
 		do
 		{
-			int64		curpage = QUEUE_POS_PAGE(pos);
-			int			curoffset = QUEUE_POS_OFFSET(pos);
-			int			slotno;
-			int			copysize;
-
 			/*
 			 * We copy the data from SLRU into a local buffer, so as to avoid
 			 * holding the SLRU lock while we are examining the entries and
 			 * possibly transmitting them to our frontend.  Copy only the part
 			 * of the page we will actually inspect.
 			 */
-			slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
-												InvalidTransactionId);
-			if (curpage == QUEUE_POS_PAGE(head))
-			{
-				/* we only want to read as far as head */
-				copysize = QUEUE_POS_OFFSET(head) - curoffset;
-				if (copysize < 0)
-					copysize = 0;	/* just for safety */
-			}
-			else
-			{
-				/* fetch all the rest of the page */
-				copysize = QUEUE_PAGESIZE - curoffset;
-			}
-			memcpy(page_buffer.buf + curoffset,
-				   NotifyCtl->shared->page_buffer[slotno] + curoffset,
-				   copysize);
-			/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
-			LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+			asyncQueueReadPageToBuffer(pos, head, page_buffer.buf);
 
 			/*
 			 * Process messages up to the stop position, end of page, or an
@@ -2097,6 +2117,80 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 	return reachedStop;
 }
 
+/*
+ * Get the oldest XID in the notification queue that has not yet been
+ * processed by all listening backends.
+ *
+ * Returns InvalidTransactionId if there are no unprocessed notifications or if
+ * all unprocessed notifications are created on other databases different from
+ * MyDatabaseId.
+ */
+TransactionId
+GetOldestNotifyTransactionId(void)
+{
+	QueuePosition pos;
+	QueuePosition head;
+	AlignedQueueEntryPage page_buffer;
+	TransactionId oldestXid = InvalidTransactionId;
+
+	/* First advance the shared queue tail pointer */
+	asyncQueueAdvanceTail();
+
+	/*
+	 * We must start at QUEUE_TAIL since notification data might have been
+	 * written before there were any listening backends.
+	 */
+	LWLockAcquire(NotifyQueueLock, LW_SHARED);
+	pos = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+	LWLockRelease(NotifyQueueLock);
+
+	/* If the queue is empty, no XIDs need protection */
+	if (QUEUE_POS_EQUAL(pos, head))
+		return InvalidTransactionId;
+
+	while (!QUEUE_POS_EQUAL(pos, head))
+	{
+		int			curoffset;
+		AsyncQueueEntry *qe;
+
+		/* Read the current page from SLRU into our local buffer */
+		asyncQueueReadPageToBuffer(pos, head, page_buffer.buf);
+
+		curoffset = QUEUE_POS_OFFSET(pos);
+
+		/* Process all entries on this page up to head */
+		for (;;)
+		{
+			bool		reachedEndOfPage;
+
+			qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
+
+			/*
+			 * Check if this entry is for our database and has a valid XID.
+			 * Only entries for our database matter for our datfrozenxid.
+			 */
+			if (qe->dboid == MyDatabaseId && TransactionIdIsValid(qe->xid))
+			{
+				if (!TransactionIdIsValid(oldestXid) ||
+					TransactionIdPrecedes(qe->xid, oldestXid))
+					oldestXid = qe->xid;
+			}
+
+			/* Advance to next entry */
+			reachedEndOfPage = asyncQueueAdvance(&pos, qe->length);
+
+			if (reachedEndOfPage || QUEUE_POS_EQUAL(pos, head))
+				break;
+
+
+			curoffset = QUEUE_POS_OFFSET(pos);
+		}
+	}
+
+	return oldestXid;
+}
+
 /*
  * Advance the shared queue tail variable to the minimum of all the
  * per-backend tail pointers.  Truncate pg_notify space if possible.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..e5fedfb3238 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1617,6 +1618,7 @@ vac_update_datfrozenxid(void)
 	bool		dirty = false;
 	ScanKeyData key[1];
 	void	   *inplace_state;
+	TransactionId oldestNotifyXid;
 
 	/*
 	 * Restrict this task to one backend per database.  This avoids race
@@ -1733,6 +1735,16 @@ vac_update_datfrozenxid(void)
 	if (bogus)
 		return;
 
+	/*
+	 * Also consider the oldest XID in the notification queue, since backends
+	 * will need to call TransactionIdDidCommit() on those XIDs when
+	 * processing the notifications.
+	 */
+	oldestNotifyXid = GetOldestNotifyTransactionId();
+	if (TransactionIdIsValid(oldestNotifyXid) &&
+		TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
+		newFrozenXid = oldestNotifyXid;
+
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
 
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..0f8f17ad22b 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -26,6 +26,9 @@ extern void NotifyMyFrontEnd(const char *channel,
 							 const char *payload,
 							 int32 srcPid);
 
+/* get oldest XID in the notification queue for vacuum freeze */
+extern TransactionId GetOldestNotifyTransactionId(void);
+
 /* notify-related SQL statements */
 extern void Async_Notify(const char *channel, const char *payload);
 extern void Async_Listen(const char *channel);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 902a7954101..a015c961d35 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
 		  test_int128 \
 		  test_integerset \
 		  test_json_parser \
+		  test_listen_notify \
 		  test_lfind \
 		  test_lwlock_tranches \
 		  test_misc \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 14fc761c4cf..6af33448d7b 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,6 +28,7 @@ subdir('test_ginpostinglist')
 subdir('test_int128')
 subdir('test_integerset')
 subdir('test_json_parser')
+subdir('test_listen_notify')
 subdir('test_lfind')
 subdir('test_lwlock_tranches')
 subdir('test_misc')
diff --git a/src/test/modules/test_listen_notify/.gitignore b/src/test/modules/test_listen_notify/.gitignore
new file mode 100644
index 00000000000..5dcb3ff9723
--- /dev/null
+++ b/src/test/modules/test_listen_notify/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_listen_notify/Makefile b/src/test/modules/test_listen_notify/Makefile
new file mode 100644
index 00000000000..c1eb4fde370
--- /dev/null
+++ b/src/test/modules/test_listen_notify/Makefile
@@ -0,0 +1,19 @@
+# src/test/modules/test_listen_notify/Makefile
+
+MODULE = test_listen_notify
+PGFILEDESC = "test_listen_notify - regression testing for LISTEN/NOTIFY support"
+
+TAP_TESTS = 1
+
+EXTRA_INSTALL=src/test/modules/xid_wraparound
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_listen_notify
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_listen_notify/meson.build b/src/test/modules/test_listen_notify/meson.build
new file mode 100644
index 00000000000..ebf9444108a
--- /dev/null
+++ b/src/test/modules/test_listen_notify/meson.build
@@ -0,0 +1,13 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_listen_notify',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_xid_freeze.pl'
+    ],
+  },
+}
+
diff --git a/src/test/modules/test_listen_notify/t/001_xid_freeze.pl b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
new file mode 100644
index 00000000000..c5553d6c792
--- /dev/null
+++ b/src/test/modules/test_listen_notify/t/001_xid_freeze.pl
@@ -0,0 +1,89 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+# Test that VACUUM FREEZE don't remove clog files that are needed to check the
+# transaction status of notifications that are on the LISTEN/NOTIFY queue
+# during its execution. The VACUUM FREEZE operation should check the oldest xid
+# on the queue during execution.
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+# Check if the extension xid_wraparound is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('xid_wraparound'))
+{
+	plan skip_all => 'Extension xid_wraparound not installed';
+}
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'CREATE TABLE t AS SELECT g AS a, g+2 AS b from generate_series(1,10) g;'
+);
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Session 2, multiple notify's, and commit ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation
+$node->safe_psql('postgres', 'select consume_xids(1050000);');
+
+# Execute update so the frozen xid of "t" table is updated to a xid greater
+# than consume_xids() result
+$node->safe_psql('postgres', 'UPDATE t SET a = a+b;');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is not advanced
+# to a value greater than the xid used to send the notifications.
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+print("\n\n$datafronzenxid < $datafronzenxid_freeze\n\n");
+ok($datafronzenxid < $datafronzenxid_freeze, 'datfrozenxid is not fully advanced');
+
+# On Session 1, commit and ensure that the all notifications is received
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+# Execute vacuum freeze on all databases and ensure that the datafrozenxid is advanced
+
+$datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+$datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+ok($datafronzenxid_freeze > $datafronzenxid, "datfrozenxid is advanced: $datafronzenxid_freeze > $datafronzenxid");
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c13f92988ee..23f85b73114 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -58,6 +58,7 @@ Aggref
 AggregateInstrumentation
 AlenState
 Alias
+AlignedQueueEntryPage
 AllocBlock
 AllocFreeListLink
 AllocPointer
-- 
2.43.0

#66

Joel Jacobson

joel@compiler.org

3 months ago

In reply to: Arseniy Mukhin (#65)

3 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Sat, Oct 25, 2025, at 15:08, Arseniy Mukhin wrote:

I reimplemented the test in 0002 as an isolation test and added the
commit message. PFA the new version.

I've rebased the patches so they can be applied on top of v12:
- v12-vacuum_notify_queue_cleanup-without-code-dup.txt
- v12-vacuum_notify_queue_cleanup-with-code-dup.txt

I also want to share a new idea that Heikki Linnakangas came up with
during PgConf25 in Riga.

The idea is basically to scan the notification queue from tail to head,
and update xids that precedes newFrozenXid, to either
FrozenTransactionId if committed, or InvalidTransactionId if not.

I've implemented this by adding a new extern function
AsyncNotifyFreezeXids to async.h, called from vac_update_datfrozenxid.

This way, we don't need to hold back the advancement of newFrozenXid in
vac_update_datfrozenxid.

Attempt of implementing Heikki's idea:
- async_notify_freeze_xids.txt

This idea has the benefit of never holding back newFrozenXid,
however, I see some problems with it:

- If there is a bug in the code, we could accidentally overwrite
xid values we didn't intend to overwrite, and maybe we would never
find out that we did.

- We wouldn't know for sure that we've understood the cause of
this bug.

With v12-vacuum_notify_queue_cleanup, we instead have the downside of
possibly holding back newFrozenXid, but with the advantage of giving us
higher confidence in that we've correctly identified the cause of the
bug.

/Joel

Attachments:

v12-vacuum_notify_queue_cleanup-without-code-dup.txttext/plain; name="=?UTF-8?Q?v12-vacuum=5Fnotify=5Fqueue=5Fcleanup-without-code-dup.txt?="Download

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8fc1bbba7a..088215acdc 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -455,13 +455,11 @@ static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
 static ListCell *asyncQueueAddEntries(ListCell *nextNotify);
 static double asyncQueueUsage(void);
 static void asyncQueueFillWarning(void);
-static void SignalBackends(void);
 static void asyncQueueReadAllNotifications(void);
 static bool asyncQueueProcessPageEntries(volatile QueuePosition *current,
 										 QueuePosition stop,
 										 char *page_buffer,
 										 Snapshot snapshot);
-static void asyncQueueAdvanceTail(void);
 static void ProcessIncomingNotify(bool flush);
 static bool AsyncExistsPendingNotify(Notification *n);
 static void AddEventToPendingNotifies(Notification *n);
@@ -1022,7 +1020,7 @@ AtCommit_Notify(void)
 	 * PreCommit_Notify().
 	 */
 	if (pendingNotifies != NULL)
-		SignalBackends();
+		SignalBackends(false);
 
 	/*
 	 * If it's time to try to advance the global tail pointer, do that.
@@ -1036,7 +1034,7 @@ AtCommit_Notify(void)
 	if (tryAdvanceTail)
 	{
 		tryAdvanceTail = false;
-		asyncQueueAdvanceTail();
+		asyncQueueAdvanceTail(false);
 	}
 
 	/* And clean up */
@@ -1494,7 +1492,7 @@ pg_notification_queue_usage(PG_FUNCTION_ARGS)
 	double		usage;
 
 	/* Advance the queue tail so we don't report a too-large result */
-	asyncQueueAdvanceTail();
+	asyncQueueAdvanceTail(false);
 
 	LWLockAcquire(NotifyQueueLock, LW_SHARED);
 	usage = asyncQueueUsage();
@@ -1587,9 +1585,16 @@ asyncQueueFillWarning(void)
  *
  * This is called during CommitTransaction(), so it's important for it
  * to have very low probability of failure.
+ *
+ * If all_databases is false (normal NOTIFY), we signal listeners in our own
+ * database unless they're caught up, and listeners in other databases only
+ * if they are far behind (QUEUE_CLEANUP_DELAY pages).
+ *
+ * If all_databases is true (VACUUM cleanup), we signal all listeners across
+ * all databases that aren't already caught up, with no distance filtering.
  */
-static void
-SignalBackends(void)
+void
+SignalBackends(bool all_databases)
 {
 	int32	   *pids;
 	ProcNumber *procnos;
@@ -1615,25 +1620,21 @@ SignalBackends(void)
 
 		Assert(pid != InvalidPid);
 		pos = QUEUE_BACKEND_POS(i);
-		if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
-		{
-			/*
-			 * Always signal listeners in our own database, unless they're
-			 * already caught up (unlikely, but possible).
-			 */
-			if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
-				continue;
-		}
-		else
-		{
-			/*
-			 * Listeners in other databases should be signaled only if they
-			 * are far behind.
-			 */
-			if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
-								   QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
-				continue;
-		}
+
+		/*
+		 * Always skip backends that are already caught up.
+		 */
+		if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+			continue;
+
+		/*
+		 * Skip if we're not signaling all databases AND this is a different
+		 * database AND the listener is not far behind.
+		 */
+		if (!all_databases && QUEUE_BACKEND_DBOID(i) != MyDatabaseId &&
+			asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+							   QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+			continue;
 		/* OK, need to signal this one */
 		pids[count] = pid;
 		procnos[count] = i;
@@ -2134,7 +2135,7 @@ GetOldestNotifyTransactionId(void)
 	TransactionId oldestXid = InvalidTransactionId;
 
 	/* First advance the shared queue tail pointer */
-	asyncQueueAdvanceTail();
+	asyncQueueAdvanceTail(false);
 
 	/*
 	 * We must start at QUEUE_TAIL since notification data might have been
@@ -2191,15 +2192,38 @@ GetOldestNotifyTransactionId(void)
 	return oldestXid;
 }
 
+/*
+ * Check if there are any active listeners in the notification queue.
+ *
+ * Returns true if at least one backend is registered as a listener,
+ * false otherwise.
+ */
+bool
+asyncQueueHasListeners(void)
+{
+	bool		hasListeners;
+
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+	hasListeners = (QUEUE_FIRST_LISTENER != INVALID_PROC_NUMBER);
+	LWLockRelease(NotifyQueueLock);
+
+	return hasListeners;
+}
+
 /*
  * Advance the shared queue tail variable to the minimum of all the
  * per-backend tail pointers.  Truncate pg_notify space if possible.
  *
- * This is (usually) called during CommitTransaction(), so it's important for
- * it to have very low probability of failure.
+ * If force_to_head is false (normal case), we compute the new tail as the
+ * minimum of all listener positions.  This is (usually) called during
+ * CommitTransaction(), so it's important for it to have very low probability
+ * of failure.
+ *
+ * If force_to_head is true (VACUUM cleanup), we advance the tail directly to
+ * the head, discarding all notifications, but only if there are no listeners.
  */
-static void
-asyncQueueAdvanceTail(void)
+void
+asyncQueueAdvanceTail(bool force_to_head)
 {
 	QueuePosition min;
 	int64		oldtailpage;
@@ -2227,12 +2251,38 @@ asyncQueueAdvanceTail(void)
 	 * to access the pages we are in the midst of truncating.
 	 */
 	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
-	min = QUEUE_HEAD;
-	for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+
+	if (force_to_head)
 	{
-		Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
-		min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+		/*
+		 * Verify that there are still no listeners.  It's possible
+		 * that a listener appeared since VACUUM checked.
+		 */
+		if (QUEUE_FIRST_LISTENER != INVALID_PROC_NUMBER)
+		{
+			LWLockRelease(NotifyQueueLock);
+			LWLockRelease(NotifyQueueTailLock);
+			return;
+		}
+
+		/*
+		 * Advance the logical tail to the head, discarding all notifications.
+		 */
+		min = QUEUE_HEAD;
 	}
+	else
+	{
+		/*
+		 * Normal case: compute minimum position from all listeners.
+		 */
+		min = QUEUE_HEAD;
+		for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+		{
+			Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
+			min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+		}
+	}
+
 	QUEUE_TAIL = min;
 	oldtailpage = QUEUE_STOP_PAGE;
 	LWLockRelease(NotifyQueueLock);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e5fedfb323..4ee06f233b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1739,11 +1739,24 @@ vac_update_datfrozenxid(void)
 	 * Also consider the oldest XID in the notification queue, since backends
 	 * will need to call TransactionIdDidCommit() on those XIDs when
 	 * processing the notifications.
+	 *
+	 * If the queue is blocking datfrozenxid advancement, attempt to clean it
+	 * up.  If listeners exist, wake them to process their pending
+	 * notifications.  If no listeners exist, discard all notifications.
+	 * Either way, we back off datfrozenxid for this VACUUM cycle; the next
+	 * VACUUM will benefit from the cleanup we've triggered.
 	 */
 	oldestNotifyXid = GetOldestNotifyTransactionId();
 	if (TransactionIdIsValid(oldestNotifyXid) &&
 		TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
+	{
+		if (asyncQueueHasListeners())
+			SignalBackends(true);
+		else
+			asyncQueueAdvanceTail(true);
+
 		newFrozenXid = oldestNotifyXid;
+	}
 
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index 0f8f17ad22..8ae9c4dfae 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -29,6 +29,11 @@ extern void NotifyMyFrontEnd(const char *channel,
 /* get oldest XID in the notification queue for vacuum freeze */
 extern TransactionId GetOldestNotifyTransactionId(void);
 
+/* functions for vacuum to manage notification queue */
+extern bool asyncQueueHasListeners(void);
+extern void SignalBackends(bool all_databases);
+extern void asyncQueueAdvanceTail(bool force_to_head);
+
 /* notify-related SQL statements */
 extern void Async_Notify(const char *channel, const char *payload);
 extern void Async_Listen(const char *channel);

v12-vacuum_notify_queue_cleanup-with-code-dup.txttext/plain; name="=?UTF-8?Q?v12-vacuum=5Fnotify=5Fqueue=5Fcleanup-with-code-dup.txt?="Download

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8fc1bbba7a..72bd52ae8d 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2191,6 +2191,164 @@ GetOldestNotifyTransactionId(void)
 	return oldestXid;
 }
 
+/*
+ * Check if there are any active listeners in the notification queue.
+ *
+ * Returns true if at least one backend is registered as a listener,
+ * false otherwise.
+ */
+bool
+asyncQueueHasListeners(void)
+{
+	bool		hasListeners;
+
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+	hasListeners = (QUEUE_FIRST_LISTENER != INVALID_PROC_NUMBER);
+	LWLockRelease(NotifyQueueLock);
+
+	return hasListeners;
+}
+
+/*
+ * Wake all listening backends to process notifications.
+ *
+ * This is called by VACUUM when it needs to advance datfrozenxid but the
+ * notification queue has old XIDs.  We signal all listeners across all
+ * databases that aren't already caught up, so they can process their
+ * pending notifications and advance the queue tail.
+ */
+void
+asyncQueueWakeAllListeners(void)
+{
+	int32	   *pids;
+	ProcNumber *procnos;
+	int			count;
+
+	/*
+	 * Identify backends that we need to signal.  We don't want to send
+	 * signals while holding the NotifyQueueLock, so this loop just builds a
+	 * list of target PIDs.
+	 *
+	 * XXX in principle these pallocs could fail, which would be bad. Maybe
+	 * preallocate the arrays?  They're not that large, though.
+	 */
+	pids = (int32 *) palloc(MaxBackends * sizeof(int32));
+	procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+	count = 0;
+
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+	for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+	{
+		int32		pid = QUEUE_BACKEND_PID(i);
+		QueuePosition pos;
+
+		Assert(pid != InvalidPid);
+		pos = QUEUE_BACKEND_POS(i);
+
+		/*
+		 * Signal listeners unless they're already caught up.
+		 */
+		if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+			continue;
+
+		/* OK, need to signal this one */
+		pids[count] = pid;
+		procnos[count] = i;
+		count++;
+	}
+	LWLockRelease(NotifyQueueLock);
+
+	/* Now send signals */
+	for (int i = 0; i < count; i++)
+	{
+		int32		pid = pids[i];
+
+		/*
+		 * If we are signaling our own process, no need to involve the kernel;
+		 * just set the flag directly.
+		 */
+		if (pid == MyProcPid)
+		{
+			notifyInterruptPending = true;
+			continue;
+		}
+
+		/*
+		 * Note: assuming things aren't broken, a signal failure here could
+		 * only occur if the target backend exited since we released
+		 * NotifyQueueLock; which is unlikely but certainly possible. So we
+		 * just log a low-level debug message if it happens.
+		 */
+		if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
+			elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
+	}
+
+	pfree(pids);
+	pfree(procnos);
+}
+
+/*
+ * Discard all notifications in the queue when there are no listeners.
+ *
+ * This is called by VACUUM when the notification queue has old XIDs but no
+ * active listeners exist.  We advance the tail to the head, effectively
+ * discarding all queued notifications, and truncate the SLRU segments.
+ */
+void
+asyncQueueAdvanceTailNoListeners(void)
+{
+	QueuePosition min;
+	int64		oldtailpage;
+	int64		newtailpage;
+	int64		boundary;
+
+	/* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
+	LWLockAcquire(NotifyQueueTailLock, LW_EXCLUSIVE);
+	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+	/*
+	 * Verify that there are still no listeners.
+	 */
+	if (QUEUE_FIRST_LISTENER != INVALID_PROC_NUMBER)
+	{
+		LWLockRelease(NotifyQueueLock);
+		LWLockRelease(NotifyQueueTailLock);
+		return;
+	}
+
+	/*
+	 * Advance the logical tail to the head, discarding all notifications.
+	 */
+	min = QUEUE_HEAD;
+	QUEUE_TAIL = min;
+	oldtailpage = QUEUE_STOP_PAGE;
+	LWLockRelease(NotifyQueueLock);
+
+	/*
+	 * We can truncate something if the global tail advanced across an SLRU
+	 * segment boundary.
+	 *
+	 * XXX it might be better to truncate only once every several segments, to
+	 * reduce the number of directory scans.
+	 */
+	newtailpage = QUEUE_POS_PAGE(min);
+	boundary = newtailpage - (newtailpage % SLRU_PAGES_PER_SEGMENT);
+	if (asyncQueuePagePrecedes(oldtailpage, boundary))
+	{
+		/*
+		 * SimpleLruTruncate() will ask for SLRU bank locks but will also
+		 * release the lock again.
+		 */
+		SimpleLruTruncate(NotifyCtl, newtailpage);
+
+		LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+		QUEUE_STOP_PAGE = newtailpage;
+		LWLockRelease(NotifyQueueLock);
+	}
+
+	LWLockRelease(NotifyQueueTailLock);
+}
+
 /*
  * Advance the shared queue tail variable to the minimum of all the
  * per-backend tail pointers.  Truncate pg_notify space if possible.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e5fedfb323..18e7e3b145 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1739,11 +1739,24 @@ vac_update_datfrozenxid(void)
 	 * Also consider the oldest XID in the notification queue, since backends
 	 * will need to call TransactionIdDidCommit() on those XIDs when
 	 * processing the notifications.
+	 *
+	 * If the queue is blocking datfrozenxid advancement, attempt to clean it
+	 * up.  If listeners exist, wake them to process their pending
+	 * notifications.  If no listeners exist, discard all notifications.
+	 * Either way, we back off datfrozenxid for this VACUUM cycle; the next
+	 * VACUUM will benefit from the cleanup we've triggered.
 	 */
 	oldestNotifyXid = GetOldestNotifyTransactionId();
 	if (TransactionIdIsValid(oldestNotifyXid) &&
 		TransactionIdPrecedes(oldestNotifyXid, newFrozenXid))
+	{
+		if (asyncQueueHasListeners())
+			asyncQueueWakeAllListeners();
+		else
+			asyncQueueAdvanceTailNoListeners();
+
 		newFrozenXid = oldestNotifyXid;
+	}
 
 	Assert(TransactionIdIsNormal(newFrozenXid));
 	Assert(MultiXactIdIsValid(newMinMulti));
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index 0f8f17ad22..94fe2d0796 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -29,6 +29,11 @@ extern void NotifyMyFrontEnd(const char *channel,
 /* get oldest XID in the notification queue for vacuum freeze */
 extern TransactionId GetOldestNotifyTransactionId(void);
 
+/* functions for vacuum to manage notification queue */
+extern bool asyncQueueHasListeners(void);
+extern void asyncQueueWakeAllListeners(void);
+extern void asyncQueueAdvanceTailNoListeners(void);
+
 /* notify-related SQL statements */
 extern void Async_Notify(const char *channel, const char *payload);
 extern void Async_Listen(const char *channel);

async_notify_freeze_xids.txttext/plain; name=async_notify_freeze_xids.txtDownload

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..ba06234dc8e 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2168,6 +2168,120 @@ asyncQueueAdvanceTail(void)
 	LWLockRelease(NotifyQueueTailLock);
 }
 
+/*
+ * AsyncNotifyFreezeXids
+ *
+ * Prepare the async notification queue for CLOG truncation by freezing
+ * transaction IDs that are about to become inaccessible.
+ *
+ * This function is called by VACUUM before advancing datfrozenxid. It scans
+ * the notification queue and replaces XIDs that would become inaccessible
+ * after CLOG truncation with special markers:
+ * - Committed transactions are set to FrozenTransactionId
+ * - Aborted/crashed transactions are set to InvalidTransactionId
+ *
+ * Only XIDs < newFrozenXid are processed, as those are the ones whose CLOG
+ * pages will be truncated. If XID < newFrozenXid, it cannot still be running
+ * (or it would have held back newFrozenXid through ProcArray).
+ * Therefore, if TransactionIdDidCommit returns false, we know the transaction
+ * either aborted explicitly or crashed, and we can safely mark it invalid.
+ */
+void
+AsyncNotifyFreezeXids(TransactionId newFrozenXid)
+{
+	QueuePosition pos;
+	QueuePosition head;
+	int64		curpage = -1;
+	int			slotno = -1;
+	char	   *page_buffer = NULL;
+	bool		page_dirty = false;
+
+	/*
+	 * Acquire locks in the correct order to avoid deadlocks. As per the
+	 * locking protocol: NotifyQueueTailLock, then NotifyQueueLock, then SLRU
+	 * bank locks.
+	 *
+	 * We only need SHARED mode since we're just reading the head/tail
+	 * positions, not modifying them.
+	 */
+	LWLockAcquire(NotifyQueueTailLock, LW_SHARED);
+	LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+	pos = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+
+	/* Release NotifyQueueLock early, we only needed to read the positions */
+	LWLockRelease(NotifyQueueLock);
+
+	/*
+	 * Scan the queue from tail to head, freezing XIDs as needed. We hold
+	 * NotifyQueueTailLock throughout to ensure the tail doesn't move while
+	 * we're working.
+	 */
+	while (!QUEUE_POS_EQUAL(pos, head))
+	{
+		AsyncQueueEntry *qe;
+		TransactionId xid;
+		int64		pageno = QUEUE_POS_PAGE(pos);
+		int			offset = QUEUE_POS_OFFSET(pos);
+
+		/* If we need a different page, release old lock and get new one */
+		if (pageno != curpage)
+		{
+			LWLock	   *lock;
+
+			/* Release previous page if any */
+			if (slotno >= 0)
+			{
+				if (page_dirty)
+				{
+					NotifyCtl->shared->page_dirty[slotno] = true;
+					page_dirty = false;
+				}
+				LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+			}
+
+			lock = SimpleLruGetBankLock(NotifyCtl, pageno);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			slotno = SimpleLruReadPage(NotifyCtl, pageno, true,
+									   InvalidTransactionId);
+			page_buffer = NotifyCtl->shared->page_buffer[slotno];
+			curpage = pageno;
+		}
+
+		qe = (AsyncQueueEntry *) (page_buffer + offset);
+		xid = qe->xid;
+
+		if (TransactionIdIsNormal(xid) &&
+			TransactionIdPrecedes(xid, newFrozenXid))
+		{
+			if (TransactionIdDidCommit(xid))
+			{
+				qe->xid = FrozenTransactionId;
+				page_dirty = true;
+			}
+			else
+			{
+				qe->xid = InvalidTransactionId;
+				page_dirty = true;
+			}
+		}
+
+		/* Advance to next entry */
+		asyncQueueAdvance(&pos, qe->length);
+	}
+
+	/* Release final page lock if we acquired one */
+	if (slotno >= 0)
+	{
+		if (page_dirty)
+			NotifyCtl->shared->page_dirty[slotno] = true;
+		LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+	}
+
+	LWLockRelease(NotifyQueueTailLock);
+}
+
 /*
  * ProcessIncomingNotify
  *
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..2e8dbf03bd5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1758,6 +1759,13 @@ vac_update_datfrozenxid(void)
 
 	dbform = (Form_pg_database) GETSTRUCT(tuple);
 
+	/*
+	 * Before advancing datfrozenxid, freeze any old transaction IDs in the
+	 * async notification queue to prevent "could not access status of
+	 * transaction" errors when LISTEN is called after CLOG truncation.
+	 */
+	AsyncNotifyFreezeXids(newFrozenXid);
+
 	/*
 	 * As in vac_update_relstats(), we ordinarily don't want to let
 	 * datfrozenxid go backward; but if it's "in the future" then it must be
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..aaec7314c10 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -46,4 +46,7 @@ extern void HandleNotifyInterrupt(void);
 /* process interrupts */
 extern void ProcessNotifyInterrupt(bool flush);
 
+/* freeze old transaction IDs in notify queue (called by VACUUM) */
+extern void AsyncNotifyFreezeXids(TransactionId newFrozenXid);
+
 #endif							/* ASYNC_H */

#67

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Joel Jacobson (#66)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 25/10/2025 21:01, Joel Jacobson wrote:

On Sat, Oct 25, 2025, at 15:08, Arseniy Mukhin wrote:

I reimplemented the test in 0002 as an isolation test and added the
commit message. PFA the new version.

I've rebased the patches so they can be applied on top of v12:
- v12-vacuum_notify_queue_cleanup-without-code-dup.txt
- v12-vacuum_notify_queue_cleanup-with-code-dup.txt

I also want to share a new idea that Heikki Linnakangas came up with
during PgConf25 in Riga.

The idea is basically to scan the notification queue from tail to head,
and update xids that precedes newFrozenXid, to either
FrozenTransactionId if committed, or InvalidTransactionId if not.

I've implemented this by adding a new extern function
AsyncNotifyFreezeXids to async.h, called from vac_update_datfrozenxid.

This way, we don't need to hold back the advancement of newFrozenXid in
vac_update_datfrozenxid.

Attempt of implementing Heikki's idea:
- async_notify_freeze_xids.txt

Thanks!

This idea has the benefit of never holding back newFrozenXid,
however, I see some problems with it:

- If there is a bug in the code, we could accidentally overwrite
xid values we didn't intend to overwrite, and maybe we would never
find out that we did.

- We wouldn't know for sure that we've understood the cause of
this bug.

With v12-vacuum_notify_queue_cleanup, we instead have the downside of
possibly holding back newFrozenXid, but with the advantage of giving us
higher confidence in that we've correctly identified the cause of the
bug.

Meh. Robustness is good and all, and in heap tuples we don't overwrite
the xmin/xmax but just set a FROZEN flag, precisely so that we preserve
evidence if something goes wrong. But I can't get too worked up about
that for the async notification queue.

That said, we could add COMMITTED/INVALID hint bits to AsyncQueueEntry,
similar to heap tuples, and set the hint bit instead of replacing the
original xid. That might be good for performance too: If the first
backend to call TransactionIdDidCommit() on an AsyncQueueEntry would set
the hint bit, that would save the effort for other listening backends.

- Heikki

#68

Joel Jacobson

joel@compiler.org

2 months ago

In reply to: Heikki Linnakangas (#67)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed, Oct 29, 2025, at 12:40, Heikki Linnakangas wrote:

On 25/10/2025 21:01, Joel Jacobson wrote:

I also want to share a new idea that Heikki Linnakangas came up with
during PgConf25 in Riga.

The idea is basically to scan the notification queue from tail to head,
and update xids that precedes newFrozenXid, to either
FrozenTransactionId if committed, or InvalidTransactionId if not.

I've implemented this by adding a new extern function
AsyncNotifyFreezeXids to async.h, called from vac_update_datfrozenxid.

This way, we don't need to hold back the advancement of newFrozenXid in
vac_update_datfrozenxid.

Attempt of implementing Heikki's idea:
- async_notify_freeze_xids.txt

Thanks!

Thanks for the idea!

This idea has the benefit of never holding back newFrozenXid,
however, I see some problems with it:

- If there is a bug in the code, we could accidentally overwrite
xid values we didn't intend to overwrite, and maybe we would never
find out that we did.

- We wouldn't know for sure that we've understood the cause of
this bug.

With v12-vacuum_notify_queue_cleanup, we instead have the downside of
possibly holding back newFrozenXid, but with the advantage of giving us
higher confidence in that we've correctly identified the cause of the
bug.

Meh. Robustness is good and all, and in heap tuples we don't overwrite
the xmin/xmax but just set a FROZEN flag, precisely so that we preserve
evidence if something goes wrong. But I can't get too worked up about
that for the async notification queue.

Having slept on this for some days, I'm less worried about this
approach. I like the simplicity of it, and that we don't bolt on complexity
to another subsystem, just for the sake of improved debugging
capabilities of a different subsystem.

I think a different way of looking at this, is that we wouldn't conceal
a bug in async, but rather we would let vacuum do part of its job, of
checking the commit status of the xids, when needed, and be okay with
that split responsibility.

If we can prove that the optimizations we're working on, will render
that new code unnecessary, then I guess we can just remove it.

That said, we could add COMMITTED/INVALID hint bits to AsyncQueueEntry,
similar to heap tuples, and set the hint bit instead of replacing the
original xid. That might be good for performance too: If the first
backend to call TransactionIdDidCommit() on an AsyncQueueEntry would set
the hint bit, that would save the effort for other listening backends.

Interesting, sounds like a really promising approach.

/Joel

#69

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Joel Jacobson (#68)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 29/10/2025 14:30, Joel Jacobson wrote:

On Wed, Oct 29, 2025, at 12:40, Heikki Linnakangas wrote:

On 25/10/2025 21:01, Joel Jacobson wrote:

I also want to share a new idea that Heikki Linnakangas came up with
during PgConf25 in Riga.

The idea is basically to scan the notification queue from tail to head,
and update xids that precedes newFrozenXid, to either
FrozenTransactionId if committed, or InvalidTransactionId if not.

I've implemented this by adding a new extern function
AsyncNotifyFreezeXids to async.h, called from vac_update_datfrozenxid.

This way, we don't need to hold back the advancement of newFrozenXid in
vac_update_datfrozenxid.

Attempt of implementing Heikki's idea:
- async_notify_freeze_xids.txt

Thanks!

Thanks for the idea!

For the record, Yura suggested the same approach earlier in this thread
[1]: . That line of discussion led to more complicated patches but I think the complications came from trying to set the flag as part of commit/abort. The changes are more straightforward and backpatchable if we only do it at vacuum.
the complications came from trying to set the flag as part of
commit/abort. The changes are more straightforward and backpatchable if
we only do it at vacuum.

This idea has the benefit of never holding back newFrozenXid,
however, I see some problems with it:

- If there is a bug in the code, we could accidentally overwrite
xid values we didn't intend to overwrite, and maybe we would never
find out that we did.

- We wouldn't know for sure that we've understood the cause of
this bug.

With v12-vacuum_notify_queue_cleanup, we instead have the downside of
possibly holding back newFrozenXid, but with the advantage of giving us
higher confidence in that we've correctly identified the cause of the
bug.

Meh. Robustness is good and all, and in heap tuples we don't overwrite
the xmin/xmax but just set a FROZEN flag, precisely so that we preserve
evidence if something goes wrong. But I can't get too worked up about
that for the async notification queue.

Having slept on this for some days, I'm less worried about this
approach. I like the simplicity of it, and that we don't bolt on complexity
to another subsystem, just for the sake of improved debugging
capabilities of a different subsystem.

I think a different way of looking at this, is that we wouldn't conceal
a bug in async, but rather we would let vacuum do part of its job, of
checking the commit status of the xids, when needed, and be okay with
that split responsibility.

Ok, at quick glance I think this patch is in pretty good shape. I'll
review it more thoroughly, and see if I can incorporate the test from
Matheus Alcantara's or Arseniy Mukhin's latest patches, and commit and
backpatch this.

- Heikki

#70

Tom Lane

tgl@sss.pgh.pa.us

2 months ago

In reply to: Joel Jacobson (#68)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

"Joel Jacobson" <joel@compiler.org> writes:

On Wed, Oct 29, 2025, at 12:40, Heikki Linnakangas wrote:

That said, we could add COMMITTED/INVALID hint bits to AsyncQueueEntry,
similar to heap tuples, and set the hint bit instead of replacing the
original xid. That might be good for performance too: If the first
backend to call TransactionIdDidCommit() on an AsyncQueueEntry would set
the hint bit, that would save the effort for other listening backends.

Interesting, sounds like a really promising approach.

I don't see any room for adding hint bits without enlarging
AsyncQueueEntry, which I'd rather not do. But we could mechanize that
without a separate hint bit by replacing the tested XID with FrozenXid
or InvalidXid, as the patch already does. We already require writing
an XID to shared memory to be atomic, so that seems okay.

However ... that won't actually work, the reason being that
asyncQueueProcessPageEntries() doesn't work directly from an SLRU page
but from a local copy. Even if it were to modify the state of that
copy, no other backend would see the effects.

The reason it's like that is stated in the comments:

* The current page must have been fetched into page_buffer from shared
* memory. (We could access the page right in shared memory, but that
* would imply holding the SLRU bank lock throughout this routine.)

The patch proposed here likewise appears to involve holding an SLRU
bank lock throughout what could be a significant number of
TransactionIdDidCommit tests. That seems like it could result in a
pretty bad "burp" in NOTIFY throughput. That problem is ameliorated
by only doing it when VACUUM is trying to advance datfrozenxid, but
still I wonder if we can't find a less concurrency-unfriendly answer.

The local-copy behavior also means that this patch isn't quite a 100%
fix. We could have a race condition like so:

1. Backend A grabs a copy of some SLRU page and begins running
asyncQueueProcessPageEntries(), but then loses the CPU.

2. Backend B completes a VACUUM, finds it can advance datfrozenxid,
marks some relevant XIDs as frozen in the notify SLRU, and truncates
clog.

3. Backend A gets control back, tries to discover the state of
some XID that's still present in its local copy of the XID page,
and fails.

Step 2 will take long enough that this isn't very plausible
timing-wise, but it's still theoretically a hole.

All of this is a problem mainly because of the presumption that
holding an SLRU bank lock for a long time is bad. I wonder how
dangerous that really is.

regards, tom lane

#71

Joel Jacobson

joel@compiler.org

2 months ago

In reply to: Tom Lane (#70)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed, Oct 29, 2025, at 17:48, Tom Lane wrote:

However ... that won't actually work, the reason being that
asyncQueueProcessPageEntries() doesn't work directly from an SLRU page
but from a local copy. Even if it were to modify the state of that
copy, no other backend would see the effects.

The reason it's like that is stated in the comments:

* The current page must have been fetched into page_buffer from shared
* memory. (We could access the page right in shared memory, but that
* would imply holding the SLRU bank lock throughout this routine.)

The patch proposed here likewise appears to involve holding an SLRU
bank lock throughout what could be a significant number of
TransactionIdDidCommit tests. That seems like it could result in a
pretty bad "burp" in NOTIFY throughput. That problem is ameliorated
by only doing it when VACUUM is trying to advance datfrozenxid, but
still I wonder if we can't find a less concurrency-unfriendly answer.

...

All of this is a problem mainly because of the presumption that
holding an SLRU bank lock for a long time is bad. I wonder how
dangerous that really is.

Ops. Sounds scary.

I don't know if others have looked at the v12-vacuum_notify_queue_cleanup
approach; if it's bad, it would be helpful to understand why.

/Joel

#72

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Tom Lane (#70)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 29/10/2025 18:48, Tom Lane wrote:

However ... that won't actually work, the reason being that
asyncQueueProcessPageEntries() doesn't work directly from an SLRU page
but from a local copy. Even if it were to modify the state of that
copy, no other backend would see the effects.

Ah, I missed that local copy.

The reason it's like that is stated in the comments:

* The current page must have been fetched into page_buffer from shared
* memory. (We could access the page right in shared memory, but that
* would imply holding the SLRU bank lock throughout this routine.)

The patch proposed here likewise appears to involve holding an SLRU
bank lock throughout what could be a significant number of
TransactionIdDidCommit tests. That seems like it could result in a
pretty bad "burp" in NOTIFY throughput. That problem is ameliorated
by only doing it when VACUUM is trying to advance datfrozenxid, but
still I wonder if we can't find a less concurrency-unfriendly answer.

Hmm. I thought the original comment was more worried about sending the
message to the client being slow. But I guess the
TransactionIdDidCommit() calls are not free either.

The local-copy behavior also means that this patch isn't quite a 100%
fix. We could have a race condition like so:

1. Backend A grabs a copy of some SLRU page and begins running
asyncQueueProcessPageEntries(), but then loses the CPU.

2. Backend B completes a VACUUM, finds it can advance datfrozenxid,
marks some relevant XIDs as frozen in the notify SLRU, and truncates
clog.

3. Backend A gets control back, tries to discover the state of
some XID that's still present in its local copy of the XID page,
and fails.

Step 2 will take long enough that this isn't very plausible
timing-wise, but it's still theoretically a hole.

It would be better than what we have now, but sure would be nice to fix
the problem fully..

A straightforward fix would be to introduce another lock,
NotifyQueueVacuumLock, that's held in shared mode across
asyncQueueReadAllNotifications(), and acquired in exclusive mode by
freezing. That way freezing would wait out the concurrent readers that
might have local copies. Or some other similar mechanism for waiting
them out with e.g. condition variables.

All of this is a problem mainly because of the presumption that
holding an SLRU bank lock for a long time is bad. I wonder how
dangerous that really is.

It was worse in older versions before the SLRU banks were introduced.

Holding the lock while we send the notification to the client is not
acceptable. But perhaps we could split asyncQueueProcessPageEntries()
into two parts:

1. Do the TransactionIdDidCommit() checks on each entry and copy the
visible entries to local memory.
2. Release lock and send the notifications to the backend.

Joel, since you've been working on some optimizations in this area too,
would you happen to have some suitable performance test scripts for this?

- Heikki

#73

Joel Jacobson

joel@compiler.org

2 months ago

In reply to: Heikki Linnakangas (#72)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Thu, Oct 30, 2025, at 14:25, Heikki Linnakangas wrote:

Joel, since you've been working on some optimizations in this area too,
would you happen to have some suitable performance test scripts for this?

Glad you asked. I'm actually working on a benchmark+correctness tester.
It's very much work-in-progress though, don't look too much at the code,
or your eyes will bleed.

It's a combined benchmark + correctness tester, that verifies that only
the expected notifications are received on the expected connections,
while at the same time doing timing measurements.

```
% ./pg_bench_lino --help
Usage: ./pg_bench_lino [OPTIONS]

PostgreSQL LISTEN/NOTIFY Benchmark Tool (CLI version)

Options:
-c, --connections N Number of database connections (default: 1)
-n, --channels N Number of notification channels (default: 1)
-t, --tick-ms N Tick interval in milliseconds (default: 1)
-l, --listen-prob N LISTEN probability % (default: 0.1, 0 to disable)
-u, --unlisten-prob N UNLISTEN probability % (default: 0.05, 0 to disable)
-p, --notify-prob N NOTIFY probability % (default: 1.0, 0 to disable)
-a, --unlisten-all-prob N UNLISTEN * probability % (default: 0.01, 0 to disable)
-T, --ticks N Number of ticks to run (REQUIRED)
-s, --seed N Random seed for reproducibility (default: current time)
-h, --help Show this help message
```

Example run on my MacBook M3 Max

--- master:

% ./pg_bench_lino -t 0 -c 100 -n 1000 -l 10 -u 0 -p 100 -s 42 -T 10000
Initializing 100 connections and 1000 channels...
Initialization complete. Starting benchmark...

========================================================
PostgreSQL LISTEN/NOTIFY Benchmark Results
========================================================

Configuration:
Connections: 100
Channels: 1000
Tick interval: 0 ms
LISTEN probability: 10.0000%
UNLISTEN probability: 0.0000%
NOTIFY probability: 100.0000%
UNLISTEN * probability: 0.0100%
Ticks executed: 10000
Random seed: 42

Final State:
Active connections: 100
Active channels: 1000
Listening pairs: 989
Correctness errors: 0

========================================================
Operation Statistics:
Operation Count Min(ms) Avg(ms) Max(ms)
--------------------------------------------------------
LISTEN 1009 0.016 0.029 0.291
UNLISTEN 0 - - -
UNLISTEN * 1 0.071 0.071 0.071
NOTIFY 9989 0.064 0.247 2.163
NOTIFY delivery 9989 0.072 0.277 2.170

========================================================
NOTIFY Delivery Time Distribution:

0.05-0.1ms # 4 (0.0%)
0.1-0.15ms ## 263 (2.6%)
0.15-0.2ms ####### 881 (8.8%)
0.2-0.3ms ################################################## 5907 (59.1%)
0.3-0.4ms #################### 2388 (23.9%)
0.4-0.5ms #### 525 (5.3%)
0.5-0.75ms # 18 (0.2%)
1-2ms # 2 (0.0%)
2-5ms # 1 (0.0%)

-- Optimization v22 patch:

% ./pg_bench_lino -t 0 -c 100 -n 1000 -l 10 -u 0 -p 100 -s 42 -T 10000
Initializing 100 connections and 1000 channels...
Initialization complete. Starting benchmark...

========================================================
PostgreSQL LISTEN/NOTIFY Benchmark Results
========================================================

Final State:
Active connections: 100
Active channels: 1000
Listening pairs: 989
Correctness errors: 0

========================================================
Operation Statistics:
Operation Count Min(ms) Avg(ms) Max(ms)
--------------------------------------------------------
LISTEN 1009 0.015 0.023 0.340
UNLISTEN 0 - - -
UNLISTEN * 1 0.056 0.056 0.056
NOTIFY 9989 0.018 0.031 2.037
NOTIFY delivery 9989 0.022 0.062 2.056

========================================================
NOTIFY Delivery Time Distribution:

0-0.05ms ################################################## 5002 (50.1%)
0.05-0.1ms ######################################### 4130 (41.3%)
0.1-0.15ms # 77 (0.8%)
0.15-0.2ms #### 427 (4.3%)
0.2-0.3ms ### 338 (3.4%)
0.3-0.4ms # 11 (0.1%)
0.4-0.5ms # 1 (0.0%)
2-5ms # 3 (0.0%)

To compile:
gcc -Wall -Wextra -g -I"$(pg_config --includedir)" -c pg_bench_lino.c -o pg_bench_lino.o
gcc pg_bench_lino.o -L"$(pg_config --libdir)" -lpq -o pg_bench_lino

/Joel

#74

Joel Jacobson

joel@compiler.org

2 months ago

In reply to: Joel Jacobson (#73)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri, Oct 31, 2025, at 00:08, Joel Jacobson wrote:

On Thu, Oct 30, 2025, at 14:25, Heikki Linnakangas wrote:

Joel, since you've been working on some optimizations in this area too,
would you happen to have some suitable performance test scripts for this?

Glad you asked. I'm actually working on a benchmark+correctness tester.
It's very much work-in-progress though, don't look too much at the code,
or your eyes will bleed.

It's a combined benchmark + correctness tester, that verifies that only
the expected notifications are received on the expected connections,
while at the same time doing timing measurements.

To run multiple pg_bench_lino processes in parallell to simulate
concurrent workloads, I realized the randomization of the channel names
and payloads were not random enough to avoid collissions. New version
attached that uses real UUIDs for channel names and payloads.

/Joel

#75

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Joel Jacobson (#74)

2 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 31/10/2025 01:27, Joel Jacobson wrote:

On Fri, Oct 31, 2025, at 00:08, Joel Jacobson wrote:

On Thu, Oct 30, 2025, at 14:25, Heikki Linnakangas wrote:

Joel, since you've been working on some optimizations in this area too,
would you happen to have some suitable performance test scripts for this?

Glad you asked. I'm actually working on a benchmark+correctness tester.
It's very much work-in-progress though, don't look too much at the code,
or your eyes will bleed.

It's a combined benchmark + correctness tester, that verifies that only
the expected notifications are received on the expected connections,
while at the same time doing timing measurements.

To run multiple pg_bench_lino processes in parallell to simulate
concurrent workloads, I realized the randomization of the channel names
and payloads were not random enough to avoid collissions. New version
attached that uses real UUIDs for channel names and payloads.

Thanks! Here's a sketch for holding the bank lock across
TransactionIdDidCommit() calls. In quick testing with your test program,
I can't see any performance difference. However, I'm not quite sure what
options I should be using to stress this. My gut feeling is that it's
fine, but it'd be nice to do construct a real worst case test case to be
sure.

There are some opportunities for micro-optimizations here:

* IsListeningOn() is kind of expensive if a backend is listening
multiple channels. We really should turn that into a hash table. As the
patch stands, I'm doing the IsListeningOn() calls while holding the bank
lock, but unless we speed up IsListeningOn() in those degenerate cases,
we should perhaps call it only after copying and releasing the lock.

* If IsListeningOn() is made fast, it might make sense to call it before
TransactionIdDidCommit().

* Implement the hint bits.

I don't know how much those matter. Again, a test case would be nice.

I'll work more on performance testing next week of this, if no one else
picks that up.

- Heikki

Attachments:

0001-Fix-bug-where-we-truncated-CLOG-that-was-still-neede.patchtext/x-patch; charset=UTF-8; name=0001-Fix-bug-where-we-truncated-CLOG-that-was-still-neede.patchDownload

From 74384374a24334616ac427fdcfeed5199ee1fc43 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 31 Oct 2025 18:27:25 +0200
Subject: [PATCH 1/2] Fix bug where we truncated CLOG that was still needed by
 LISTEN/NOTIFY
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The async notification queue contains the XID of the sender, and wnen
processing notifications we call TransactionIdDidCommit() on the
XID. But we had no safeguards to prevent the CLOG segments containing
those XIDs from being truncated. As a result, if a backend didn't for
some reason process the notifications for a long time, or when a new
backend issued LISTEN, you could get an error like:

test=# listen c21;
ERROR:  58P01: could not access status of transaction 14279685
DETAIL:  Could not open file "pg_xact/000D": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:1087

This was first reported by Sergey Zhuravlev in 2021, with many other
people hitting the same issue later. I believe the bug goes back all
the way to commit d1e027221d, which introduced the SLRU-based async
notification queue.

Thanks to:
- Alexandra Wang, Daniil Davydov, Andrei Varashen and Jacques Combrink for
  investigating and providing reproducable test cases,
- Matheus Alcantara and Arseniy Mukhin for earlier proposed patches to
  fix this,
- Álvaro Herrera and Masahiko Sawada for reviewing said earlier patches,
- Yura Sokolov aka funny-falcon for the idea of marking transactions as
  committed in the notification queue, and
- Joel Jacobson for the final patch version. I hope I didn't forget anyone.

Author: Joel Jacobson, Arseniy Mukhin
Discussion: https://www.postgresql.org/message-id/16961-25f29f95b3604a8a@postgresql.org
Discussion: https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2@postgresql.org
Discussion: https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com
---
 src/backend/commands/async.c                  | 114 ++++++++++++++++++
 src/backend/commands/vacuum.c                 |   7 ++
 src/include/commands/async.h                  |   3 +
 src/test/isolation/expected/async-notify.out  |  33 ++++-
 src/test/isolation/specs/async-notify.spec    |  24 ++++
 src/test/modules/xid_wraparound/meson.build   |   1 +
 .../xid_wraparound/t/004_notify_freeze.pl     |  67 ++++++++++
 7 files changed, 248 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/xid_wraparound/t/004_notify_freeze.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..ba06234dc8e 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2168,6 +2168,120 @@ asyncQueueAdvanceTail(void)
 	LWLockRelease(NotifyQueueTailLock);
 }
 
+/*
+ * AsyncNotifyFreezeXids
+ *
+ * Prepare the async notification queue for CLOG truncation by freezing
+ * transaction IDs that are about to become inaccessible.
+ *
+ * This function is called by VACUUM before advancing datfrozenxid. It scans
+ * the notification queue and replaces XIDs that would become inaccessible
+ * after CLOG truncation with special markers:
+ * - Committed transactions are set to FrozenTransactionId
+ * - Aborted/crashed transactions are set to InvalidTransactionId
+ *
+ * Only XIDs < newFrozenXid are processed, as those are the ones whose CLOG
+ * pages will be truncated. If XID < newFrozenXid, it cannot still be running
+ * (or it would have held back newFrozenXid through ProcArray).
+ * Therefore, if TransactionIdDidCommit returns false, we know the transaction
+ * either aborted explicitly or crashed, and we can safely mark it invalid.
+ */
+void
+AsyncNotifyFreezeXids(TransactionId newFrozenXid)
+{
+	QueuePosition pos;
+	QueuePosition head;
+	int64		curpage = -1;
+	int			slotno = -1;
+	char	   *page_buffer = NULL;
+	bool		page_dirty = false;
+
+	/*
+	 * Acquire locks in the correct order to avoid deadlocks. As per the
+	 * locking protocol: NotifyQueueTailLock, then NotifyQueueLock, then SLRU
+	 * bank locks.
+	 *
+	 * We only need SHARED mode since we're just reading the head/tail
+	 * positions, not modifying them.
+	 */
+	LWLockAcquire(NotifyQueueTailLock, LW_SHARED);
+	LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+	pos = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+
+	/* Release NotifyQueueLock early, we only needed to read the positions */
+	LWLockRelease(NotifyQueueLock);
+
+	/*
+	 * Scan the queue from tail to head, freezing XIDs as needed. We hold
+	 * NotifyQueueTailLock throughout to ensure the tail doesn't move while
+	 * we're working.
+	 */
+	while (!QUEUE_POS_EQUAL(pos, head))
+	{
+		AsyncQueueEntry *qe;
+		TransactionId xid;
+		int64		pageno = QUEUE_POS_PAGE(pos);
+		int			offset = QUEUE_POS_OFFSET(pos);
+
+		/* If we need a different page, release old lock and get new one */
+		if (pageno != curpage)
+		{
+			LWLock	   *lock;
+
+			/* Release previous page if any */
+			if (slotno >= 0)
+			{
+				if (page_dirty)
+				{
+					NotifyCtl->shared->page_dirty[slotno] = true;
+					page_dirty = false;
+				}
+				LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+			}
+
+			lock = SimpleLruGetBankLock(NotifyCtl, pageno);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			slotno = SimpleLruReadPage(NotifyCtl, pageno, true,
+									   InvalidTransactionId);
+			page_buffer = NotifyCtl->shared->page_buffer[slotno];
+			curpage = pageno;
+		}
+
+		qe = (AsyncQueueEntry *) (page_buffer + offset);
+		xid = qe->xid;
+
+		if (TransactionIdIsNormal(xid) &&
+			TransactionIdPrecedes(xid, newFrozenXid))
+		{
+			if (TransactionIdDidCommit(xid))
+			{
+				qe->xid = FrozenTransactionId;
+				page_dirty = true;
+			}
+			else
+			{
+				qe->xid = InvalidTransactionId;
+				page_dirty = true;
+			}
+		}
+
+		/* Advance to next entry */
+		asyncQueueAdvance(&pos, qe->length);
+	}
+
+	/* Release final page lock if we acquired one */
+	if (slotno >= 0)
+	{
+		if (page_dirty)
+			NotifyCtl->shared->page_dirty[slotno] = true;
+		LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+	}
+
+	LWLockRelease(NotifyQueueTailLock);
+}
+
 /*
  * ProcessIncomingNotify
  *
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..e785dd55ce5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1941,6 +1942,12 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 	}
 
+	/*
+	 * Freeze any old transaction IDs in the async notification queue before
+	 * CLOG truncation.
+	 */
+	AsyncNotifyFreezeXids(frozenXID);
+
 	/*
 	 * Advance the oldest value for commit timestamps before truncating, so
 	 * that if a user requests a timestamp for a transaction we're truncating
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..aaec7314c10 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -46,4 +46,7 @@ extern void HandleNotifyInterrupt(void);
 /* process interrupts */
 extern void ProcessNotifyInterrupt(bool flush);
 
+/* freeze old transaction IDs in notify queue (called by VACUUM) */
+extern void AsyncNotifyFreezeXids(TransactionId newFrozenXid);
+
 #endif							/* ASYNC_H */
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..20d5763f319 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 4 sessions
 
 starting permutation: listenc notify1 notify2 notify3 notifyf
 step listenc: LISTEN c1; LISTEN c2;
@@ -104,6 +104,37 @@ step l2commit: COMMIT;
 listener2: NOTIFY "c1" with payload "" from notifier
 step l2stop: UNLISTEN *;
 
+starting permutation: llisten n1begins n1select n1insert notify1 n2begins n2select n2insert n2notify1 n1commit n2commit notify1 lcheck
+step llisten: LISTEN c1; LISTEN c2;
+step n1begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step n1select: SELECT * FROM t1;
+a
+-
+(0 rows)
+
+step n1insert: INSERT INTO t1 DEFAULT VALUES;
+step notify1: NOTIFY c1;
+step n2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step n2select: SELECT * FROM t1;
+a
+-
+(0 rows)
+
+step n2insert: INSERT INTO t1 DEFAULT VALUES;
+step n2notify1: NOTIFY c1, 'n2_payload';
+step n1commit: COMMIT;
+step n2commit: COMMIT;
+ERROR:  could not serialize access due to read/write dependencies among transactions
+step notify1: NOTIFY c1;
+step lcheck: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener: NOTIFY "c1" with payload "" from notifier
+listener: NOTIFY "c1" with payload "" from notifier
+
 starting permutation: llisten lbegin usage bignotify usage
 step llisten: LISTEN c1; LISTEN c2;
 step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..51b7ad43849 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -5,6 +5,10 @@
 # Note we assume that each step is delivered to the backend as a single Query
 # message so it will run as one transaction.
 
+# t1 table is used for serializable conflict
+setup { CREATE TABLE t1 (a bigserial); }
+teardown { DROP TABLE t1; }
+
 session notifier
 step listenc	{ LISTEN c1; LISTEN c2; }
 step notify1	{ NOTIFY c1; }
@@ -33,8 +37,21 @@ step notifys1	{
 }
 step usage		{ SELECT pg_notification_queue_usage() > 0 AS nonzero; }
 step bignotify	{ SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
+step n1begins   { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step n1select   { SELECT * FROM t1; }
+step n1insert   { INSERT INTO t1 DEFAULT VALUES; }
+step n1commit   { COMMIT; }
 teardown		{ UNLISTEN *; }
 
+# notifier2 session is used to reproduce serializable conflict with notifier
+
+session notifier2
+step n2begins    { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step n2select    { SELECT * FROM t1; }
+step n2insert    { INSERT INTO t1 DEFAULT VALUES; }
+step n2commit    { COMMIT; }
+step n2notify1   { NOTIFY c1, 'n2_payload';  }
+
 # The listener session is used for cross-backend notify checks.
 
 session listener
@@ -73,6 +90,13 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
 # and notify queue is not empty
 permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
 
+# Test checks that listeners ignore notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+permutation llisten n1begins n1select n1insert notify1 n2begins n2select n2insert n2notify1 n1commit n2commit notify1 lcheck
+
 # Verify that pg_notification_queue_usage correctly reports a non-zero result,
 # after submitting notifications while another connection is listening for
 # those notifications and waiting inside an active transaction.  We have to
diff --git a/src/test/modules/xid_wraparound/meson.build b/src/test/modules/xid_wraparound/meson.build
index f7dada67f67..3aec430df8c 100644
--- a/src/test/modules/xid_wraparound/meson.build
+++ b/src/test/modules/xid_wraparound/meson.build
@@ -30,6 +30,7 @@ tests += {
       't/001_emergency_vacuum.pl',
       't/002_limits.pl',
       't/003_wraparounds.pl',
+      't/004_notify_freeze.pl',
     ],
   },
 }
diff --git a/src/test/modules/xid_wraparound/t/004_notify_freeze.pl b/src/test/modules/xid_wraparound/t/004_notify_freeze.pl
new file mode 100644
index 00000000000..50824e7b5d7
--- /dev/null
+++ b/src/test/modules/xid_wraparound/t/004_notify_freeze.pl
@@ -0,0 +1,67 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+#
+# Test freezing the XIDs in the async notification queue
+#
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+if (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\bxid_wraparound\b/)
+{
+	plan skip_all => "test xid_wraparound not enabled in PG_TEST_EXTRA";
+}
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# --- Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# --- Send multiple notify's from other sessions ---
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation, and one more with 'txid_current' to
+# bump up the freeze horizon.
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+$node->safe_psql('postgres', 'select txid_current()');
+
+# Remember current datfrozenxid before vacuum freeze to ensure that it is advanced.
+my $datafronzenxid = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Get the new datfrozenxid after vacuum freeze to ensure that is advanced.
+my $datafronzenxid_freeze = $node->safe_psql('postgres', "select datfrozenxid from pg_database where datname = 'postgres'");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
+
+# On Session 1, commit and ensure that the all the notifications are received. (This depends
+# on correctly freezing the XIDs in the pending notification entries.)
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i, qr/Asynchronous notification "s" with payload "$notifications_count" received/);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
-- 
2.47.3

0002-Hold-SLRU-bank-lock-across-TransactionIdDidCommit-in.patchtext/x-patch; charset=UTF-8; name=0002-Hold-SLRU-bank-lock-across-TransactionIdDidCommit-in.patchDownload

From 7b5052f4b6add84e78b39ef59ac7cb26e315c238 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 31 Oct 2025 18:30:26 +0200
Subject: [PATCH 2/2] Hold SLRU bank lock across TransactionIdDidCommit in
 NOTIFY processing

Per Tom Lane's idea
---
 src/backend/commands/async.c | 101 ++++++++++++++++-------------------
 1 file changed, 47 insertions(+), 54 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index ba06234dc8e..b5dee75af48 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -448,7 +448,6 @@ static void SignalBackends(void);
 static void asyncQueueReadAllNotifications(void);
 static bool asyncQueueProcessPageEntries(volatile QueuePosition *current,
 										 QueuePosition stop,
-										 char *page_buffer,
 										 Snapshot snapshot);
 static void asyncQueueAdvanceTail(void);
 static void ProcessIncomingNotify(bool flush);
@@ -1854,13 +1853,6 @@ asyncQueueReadAllNotifications(void)
 	QueuePosition head;
 	Snapshot	snapshot;
 
-	/* page_buffer must be adequately aligned, so use a union */
-	union
-	{
-		char		buf[QUEUE_PAGESIZE];
-		AsyncQueueEntry align;
-	}			page_buffer;
-
 	/* Fetch current state */
 	LWLockAcquire(NotifyQueueLock, LW_SHARED);
 	/* Assert checks that we have a valid state entry */
@@ -1932,37 +1924,6 @@ asyncQueueReadAllNotifications(void)
 
 		do
 		{
-			int64		curpage = QUEUE_POS_PAGE(pos);
-			int			curoffset = QUEUE_POS_OFFSET(pos);
-			int			slotno;
-			int			copysize;
-
-			/*
-			 * We copy the data from SLRU into a local buffer, so as to avoid
-			 * holding the SLRU lock while we are examining the entries and
-			 * possibly transmitting them to our frontend.  Copy only the part
-			 * of the page we will actually inspect.
-			 */
-			slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
-												InvalidTransactionId);
-			if (curpage == QUEUE_POS_PAGE(head))
-			{
-				/* we only want to read as far as head */
-				copysize = QUEUE_POS_OFFSET(head) - curoffset;
-				if (copysize < 0)
-					copysize = 0;	/* just for safety */
-			}
-			else
-			{
-				/* fetch all the rest of the page */
-				copysize = QUEUE_PAGESIZE - curoffset;
-			}
-			memcpy(page_buffer.buf + curoffset,
-				   NotifyCtl->shared->page_buffer[slotno] + curoffset,
-				   copysize);
-			/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
-			LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
-
 			/*
 			 * Process messages up to the stop position, end of page, or an
 			 * uncommitted message.
@@ -1978,9 +1939,7 @@ asyncQueueReadAllNotifications(void)
 			 * rewrite pages under us. Especially we don't want to hold a lock
 			 * while sending the notifications to the frontend.
 			 */
-			reachedStop = asyncQueueProcessPageEntries(&pos, head,
-													   page_buffer.buf,
-													   snapshot);
+			reachedStop = asyncQueueProcessPageEntries(&pos, head, snapshot);
 		} while (!reachedStop);
 	}
 	PG_FINALLY();
@@ -2000,12 +1959,9 @@ asyncQueueReadAllNotifications(void)
  * Fetch notifications from the shared queue, beginning at position current,
  * and deliver relevant ones to my frontend.
  *
- * The current page must have been fetched into page_buffer from shared
- * memory.  (We could access the page right in shared memory, but that
- * would imply holding the SLRU bank lock throughout this routine.)
- *
- * We stop if we reach the "stop" position, or reach a notification from an
- * uncommitted transaction, or reach the end of the page.
+ * This function processes the notifications on one page, the page that
+ * 'current' points to.  We stop if we reach the "stop" position, or reach a
+ * notification from an uncommitted transaction, or reach the end of the page.
  *
  * The function returns true once we have reached the stop position or an
  * uncommitted notification, and false if we have finished with the page.
@@ -2015,16 +1971,35 @@ asyncQueueReadAllNotifications(void)
 static bool
 asyncQueueProcessPageEntries(volatile QueuePosition *current,
 							 QueuePosition stop,
-							 char *page_buffer,
 							 Snapshot snapshot)
 {
+	int64		curpage = QUEUE_POS_PAGE(*current);
+	int			slotno;
+	char	   *page_buffer;
 	bool		reachedStop = false;
 	bool		reachedEndOfPage;
-	AsyncQueueEntry *qe;
+
+	/*
+	 * We copy the entries into a local buffer, so as to avoid holding the
+	 * SLRU lock while we transmit them to our frontend.
+	 *
+	 * The local buffer must be adequately aligned, so use a union.
+	 */
+	union
+	{
+		char		buf[QUEUE_PAGESIZE];
+		AsyncQueueEntry align;
+	}			scratch;
+	char	   *scratch_end = scratch.buf;
+
+	slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+										InvalidTransactionId);
+	page_buffer = NotifyCtl->shared->page_buffer[slotno];
 
 	do
 	{
 		QueuePosition thisentry = *current;
+		AsyncQueueEntry *qe;
 
 		if (QUEUE_POS_EQUAL(thisentry, stop))
 			break;
@@ -2073,10 +2048,8 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 
 				if (IsListeningOn(channel))
 				{
-					/* payload follows channel name */
-					char	   *payload = qe->data + strlen(channel) + 1;
-
-					NotifyMyFrontEnd(channel, payload, qe->srcPid);
+					memcpy(scratch_end, qe, qe->length);
+					scratch_end += qe->length;
 				}
 			}
 			else
@@ -2091,6 +2064,26 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 		/* Loop back if we're not at end of page */
 	} while (!reachedEndOfPage);
 
+	/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+	LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+
+	/*
+	 * Now that we have let go of the SLRU bank lock, send the notifications
+	 * to our backend
+	 */
+	Assert(scratch_end - scratch.buf <= BLCKSZ);
+	for (char *p = scratch.buf; p < scratch_end;)
+	{
+		AsyncQueueEntry *qe = (AsyncQueueEntry *) p;
+		char	   *channel = qe->data;
+		/* payload follows channel name */
+		char	   *payload = qe->data + strlen(channel) + 1;
+
+		NotifyMyFrontEnd(channel, payload, qe->srcPid);
+
+		p += qe->length;
+	}
+
 	if (QUEUE_POS_EQUAL(*current, stop))
 		reachedStop = true;
 
-- 
2.47.3

#76

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Heikki Linnakangas (#75)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 31/10/2025 18:40, Heikki Linnakangas wrote:

On 31/10/2025 01:27, Joel Jacobson wrote:

On Fri, Oct 31, 2025, at 00:08, Joel Jacobson wrote:

On Thu, Oct 30, 2025, at 14:25, Heikki Linnakangas wrote:

Joel, since you've been working on some optimizations in this area too,
would you happen to have some suitable performance test scripts for
this?

Glad you asked. I'm actually working on a benchmark+correctness tester.
It's very much work-in-progress though, don't look too much at the code,
or your eyes will bleed.

It's a combined benchmark + correctness tester, that verifies that only
the expected notifications are received on the expected connections,
while at the same time doing timing measurements.

To run multiple pg_bench_lino processes in parallell to simulate
concurrent workloads, I realized the randomization of the channel names
and payloads were not random enough to avoid collissions. New version
attached that uses real UUIDs for channel names and payloads.

Thanks! Here's a sketch for holding the bank lock across
TransactionIdDidCommit() calls. In quick testing with your test program,
I can't see any performance difference. However, I'm not quite sure what
options I should be using to stress this. My gut feeling is that it's
fine, but it'd be nice to do construct a real worst case test case to be
sure.

I wrote another little stand-alone performance test program for this,
attached. It launches N connections that send NOTIFYs to a single
channel as fast as possible, and M threads that listen for the
notifications. I ran it with different combinations of N and M, on
'master' and on REL_14_STABLE (which didn't have SLRU banks) and I
cannot discern any performance difference from these patches. So it
seems that holding the SLRU (bank) lock across the
TransactionIdDidCommit() calls is fine.

- Heikki

#77

Joel Jacobson

joel@compiler.org

2 months ago

In reply to: Heikki Linnakangas (#76)

3 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Mon, Nov 3, 2025, at 12:02, Heikki Linnakangas wrote:

I wrote another little stand-alone performance test program for this,
attached. It launches N connections that send NOTIFYs to a single
channel as fast as possible, and M threads that listen for the
notifications. I ran it with different combinations of N and M, on
'master' and on REL_14_STABLE (which didn't have SLRU banks) and I
cannot discern any performance difference from these patches. So it
seems that holding the SLRU (bank) lock across the
TransactionIdDidCommit() calls is fine.

Nice! That for the benchmark code! I took the liberty of hacking a bit
on it, and added support for multiple channels, with separate listener
and notifier threads per channel. Each notification now carries the
notifier ID, a sequence number, and a send timestamp. Listeners verify
that sequence numbers arrive in order and record delivery latency. The
program collects latency measurements into fixed buckets and reports
them once per second together with total and per-second send/receive
counts.

Also added a short delay before starting notifiers so that listeners
have time to issue their LISTEN commands, and a new --channels option,
and the meaning of --listeners and --notifiers was changed to apply per
channel.

Also fixed so the code could be compiled outside of the PostgreSQL
source code repo, if wanting to build this as stand-alone tool.

I've benchmarked master vs 0001+0002 and can't notice any differences;
see attached output from benchmark runs.

/Joel

#78

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Joel Jacobson (#77)

3 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 03/11/2025 23:45, Joel Jacobson wrote:

On Mon, Nov 3, 2025, at 12:02, Heikki Linnakangas wrote:

I wrote another little stand-alone performance test program for this,
attached. It launches N connections that send NOTIFYs to a single
channel as fast as possible, and M threads that listen for the
notifications. I ran it with different combinations of N and M, on
'master' and on REL_14_STABLE (which didn't have SLRU banks) and I
cannot discern any performance difference from these patches. So it
seems that holding the SLRU (bank) lock across the
TransactionIdDidCommit() calls is fine.

Nice! That for the benchmark code! I took the liberty of hacking a bit
on it, and added support for multiple channels, with separate listener
and notifier threads per channel. Each notification now carries the
notifier ID, a sequence number, and a send timestamp. Listeners verify
that sequence numbers arrive in order and record delivery latency. The
program collects latency measurements into fixed buckets and reports
them once per second together with total and per-second send/receive
counts.

Also added a short delay before starting notifiers so that listeners
have time to issue their LISTEN commands, and a new --channels option,
and the meaning of --listeners and --notifiers was changed to apply per
channel.

Also fixed so the code could be compiled outside of the PostgreSQL
source code repo, if wanting to build this as stand-alone tool.

I've benchmarked master vs 0001+0002 and can't notice any differences;
see attached output from benchmark runs.

Thanks. After some further testing, I was able to find a scenario where
this patch significantly reduces performance: if the listening backends
subscribe to a massive number of channels, like 10000, they spend a lot
of time scanning the linked list of subscribed channels in
IsListeningOn(). With the patch, those checks were performed while
holding the SLRU lock, and it started to show up as lock contention
between notifiers and listeners. To demonstrate that, attached is
another version of the test program that adds an --extra-channels=N
argument. If you set it to e.g. 10000, each listener backends calls
LISTEN on 10000 additional channels that are never notified. They just
make the listenChannels list longer. With that and the patches I posted
previously, I'm getting:

$ PGHOST=localhost PGDB=postgres://localhost/postgres
./async-notify-test-3 --listeners=50 --notifiers=4 --channels=1
--extra-channels=10000
10 s: 12716 sent (1274/s), 635775 received (63737/s)
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms # 1915 (0.3%) avg: 0.807ms
1.00-10.00ms ######### 633550 (99.7%) avg: 3.502ms
10.00-100.00ms # 310 (0.0%) avg: 11.423ms

100.00ms 0 (0.0%) avg: 0.000ms

Whereas on 'master', I see about 2-3x more notifies/s:

$ PGHOST=localhost PGDB=postgres://localhost/postgres
./async-notify-test-3 --listeners=50 --notifiers=4 --channels=1
--extra-channels=10000
10 s: 32057 sent (3296/s), 1602995 received (164896/s)
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 11574 (0.7%) avg: 0.078ms
0.10-1.00ms ###### 1082960 (67.6%) avg: 0.577ms
1.00-10.00ms ### 508199 (31.7%) avg: 1.489ms
10.00-100.00ms # 262 (0.0%) avg: 16.178ms

100.00ms 0 (0.0%) avg: 0.000ms

Fortunately that's easy to fix: We can move the IsListeningOn() check
after releasing the lock. See attached.

The elephant in the room of course is that a lookup in a linked list is
O(n) and it would be very straightforward to replace it with e.g. a hash
table. We should do that irrespective of this bug fix. But I'm inclined
to do it as a separate followup patch.

- Heikki

Attachments:

async-notify-test-3.ctext/x-csrc; charset=UTF-8; name=async-notify-test-3.cDownload

v2-0001-Fix-bug-where-we-truncated-CLOG-that-was-still-ne.patchtext/x-patch; charset=UTF-8; name=v2-0001-Fix-bug-where-we-truncated-CLOG-that-was-still-ne.patchDownload

From 7c342e6efffc8d59c2e7658f6f2f3b138d02e0bb Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 4 Nov 2025 13:22:08 +0200
Subject: [PATCH v2 1/2] Fix bug where we truncated CLOG that was still needed
 by LISTEN/NOTIFY
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The async notification queue contains the XID of the sender, and when
processing notifications we call TransactionIdDidCommit() on the
XID. But we had no safeguards to prevent the CLOG segments containing
those XIDs from being truncated away. As a result, if a backend didn't
for some reason process its notifications for a long time, or when a
new backend issued LISTEN, you could get an error like:

test=# listen c21;
ERROR:  58P01: could not access status of transaction 14279685
DETAIL:  Could not open file "pg_xact/000D": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:1087

Note: This commit is not a full fix. A race condition remains, where a
backend is executing asyncQueueReadAllNotifications() and has just
made a local copy of an async SLRU page which contains old XIDs, while
vacuum concurrently truncates the CLOG covering those XIDs. When the
backend then calls TransactionIdDidCommit() on those XIDs from the
local copy, you still get the error. The next commit will fix that
remaining race condition.

This was first reported by Sergey Zhuravlev in 2021, with many other
people hitting the same issue later. Thanks to:
- Alexandra Wang, Daniil Davydov, Andrei Varashen and Jacques Combrink for
  investigating and providing reproducable test cases,
- Matheus Alcantara and Arseniy Mukhin for earlier proposed patches to
  fix this,
- Álvaro Herrera and Masahiko Sawada for reviewing said earlier patches,
- Yura Sokolov aka funny-falcon for the idea of marking transactions as
  committed in the notification queue, and
- Joel Jacobson for the final patch version. I hope I didn't forget anyone.

Backpatch to all supported versions. I believe the bug goes back all
the way to commit d1e027221d, which introduced the SLRU-based async
notification queue.

Discussion: https://www.postgresql.org/message-id/16961-25f29f95b3604a8a@postgresql.org
Discussion: https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2@postgresql.org
Discussion: https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com
---
 src/backend/commands/async.c                  | 114 ++++++++++++++++++
 src/backend/commands/vacuum.c                 |   7 ++
 src/include/commands/async.h                  |   3 +
 src/test/isolation/expected/async-notify.out  |  33 ++++-
 src/test/isolation/specs/async-notify.spec    |  24 ++++
 src/test/modules/xid_wraparound/meson.build   |   1 +
 .../xid_wraparound/t/004_notify_freeze.pl     |  75 ++++++++++++
 7 files changed, 256 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/xid_wraparound/t/004_notify_freeze.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..ba06234dc8e 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2168,6 +2168,120 @@ asyncQueueAdvanceTail(void)
 	LWLockRelease(NotifyQueueTailLock);
 }
 
+/*
+ * AsyncNotifyFreezeXids
+ *
+ * Prepare the async notification queue for CLOG truncation by freezing
+ * transaction IDs that are about to become inaccessible.
+ *
+ * This function is called by VACUUM before advancing datfrozenxid. It scans
+ * the notification queue and replaces XIDs that would become inaccessible
+ * after CLOG truncation with special markers:
+ * - Committed transactions are set to FrozenTransactionId
+ * - Aborted/crashed transactions are set to InvalidTransactionId
+ *
+ * Only XIDs < newFrozenXid are processed, as those are the ones whose CLOG
+ * pages will be truncated. If XID < newFrozenXid, it cannot still be running
+ * (or it would have held back newFrozenXid through ProcArray).
+ * Therefore, if TransactionIdDidCommit returns false, we know the transaction
+ * either aborted explicitly or crashed, and we can safely mark it invalid.
+ */
+void
+AsyncNotifyFreezeXids(TransactionId newFrozenXid)
+{
+	QueuePosition pos;
+	QueuePosition head;
+	int64		curpage = -1;
+	int			slotno = -1;
+	char	   *page_buffer = NULL;
+	bool		page_dirty = false;
+
+	/*
+	 * Acquire locks in the correct order to avoid deadlocks. As per the
+	 * locking protocol: NotifyQueueTailLock, then NotifyQueueLock, then SLRU
+	 * bank locks.
+	 *
+	 * We only need SHARED mode since we're just reading the head/tail
+	 * positions, not modifying them.
+	 */
+	LWLockAcquire(NotifyQueueTailLock, LW_SHARED);
+	LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+	pos = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+
+	/* Release NotifyQueueLock early, we only needed to read the positions */
+	LWLockRelease(NotifyQueueLock);
+
+	/*
+	 * Scan the queue from tail to head, freezing XIDs as needed. We hold
+	 * NotifyQueueTailLock throughout to ensure the tail doesn't move while
+	 * we're working.
+	 */
+	while (!QUEUE_POS_EQUAL(pos, head))
+	{
+		AsyncQueueEntry *qe;
+		TransactionId xid;
+		int64		pageno = QUEUE_POS_PAGE(pos);
+		int			offset = QUEUE_POS_OFFSET(pos);
+
+		/* If we need a different page, release old lock and get new one */
+		if (pageno != curpage)
+		{
+			LWLock	   *lock;
+
+			/* Release previous page if any */
+			if (slotno >= 0)
+			{
+				if (page_dirty)
+				{
+					NotifyCtl->shared->page_dirty[slotno] = true;
+					page_dirty = false;
+				}
+				LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+			}
+
+			lock = SimpleLruGetBankLock(NotifyCtl, pageno);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			slotno = SimpleLruReadPage(NotifyCtl, pageno, true,
+									   InvalidTransactionId);
+			page_buffer = NotifyCtl->shared->page_buffer[slotno];
+			curpage = pageno;
+		}
+
+		qe = (AsyncQueueEntry *) (page_buffer + offset);
+		xid = qe->xid;
+
+		if (TransactionIdIsNormal(xid) &&
+			TransactionIdPrecedes(xid, newFrozenXid))
+		{
+			if (TransactionIdDidCommit(xid))
+			{
+				qe->xid = FrozenTransactionId;
+				page_dirty = true;
+			}
+			else
+			{
+				qe->xid = InvalidTransactionId;
+				page_dirty = true;
+			}
+		}
+
+		/* Advance to next entry */
+		asyncQueueAdvance(&pos, qe->length);
+	}
+
+	/* Release final page lock if we acquired one */
+	if (slotno >= 0)
+	{
+		if (page_dirty)
+			NotifyCtl->shared->page_dirty[slotno] = true;
+		LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+	}
+
+	LWLockRelease(NotifyQueueTailLock);
+}
+
 /*
  * ProcessIncomingNotify
  *
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..e785dd55ce5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1941,6 +1942,12 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 	}
 
+	/*
+	 * Freeze any old transaction IDs in the async notification queue before
+	 * CLOG truncation.
+	 */
+	AsyncNotifyFreezeXids(frozenXID);
+
 	/*
 	 * Advance the oldest value for commit timestamps before truncating, so
 	 * that if a user requests a timestamp for a transaction we're truncating
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..aaec7314c10 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -46,4 +46,7 @@ extern void HandleNotifyInterrupt(void);
 /* process interrupts */
 extern void ProcessNotifyInterrupt(bool flush);
 
+/* freeze old transaction IDs in notify queue (called by VACUUM) */
+extern void AsyncNotifyFreezeXids(TransactionId newFrozenXid);
+
 #endif							/* ASYNC_H */
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..20d5763f319 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 4 sessions
 
 starting permutation: listenc notify1 notify2 notify3 notifyf
 step listenc: LISTEN c1; LISTEN c2;
@@ -104,6 +104,37 @@ step l2commit: COMMIT;
 listener2: NOTIFY "c1" with payload "" from notifier
 step l2stop: UNLISTEN *;
 
+starting permutation: llisten n1begins n1select n1insert notify1 n2begins n2select n2insert n2notify1 n1commit n2commit notify1 lcheck
+step llisten: LISTEN c1; LISTEN c2;
+step n1begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step n1select: SELECT * FROM t1;
+a
+-
+(0 rows)
+
+step n1insert: INSERT INTO t1 DEFAULT VALUES;
+step notify1: NOTIFY c1;
+step n2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step n2select: SELECT * FROM t1;
+a
+-
+(0 rows)
+
+step n2insert: INSERT INTO t1 DEFAULT VALUES;
+step n2notify1: NOTIFY c1, 'n2_payload';
+step n1commit: COMMIT;
+step n2commit: COMMIT;
+ERROR:  could not serialize access due to read/write dependencies among transactions
+step notify1: NOTIFY c1;
+step lcheck: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener: NOTIFY "c1" with payload "" from notifier
+listener: NOTIFY "c1" with payload "" from notifier
+
 starting permutation: llisten lbegin usage bignotify usage
 step llisten: LISTEN c1; LISTEN c2;
 step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..51b7ad43849 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -5,6 +5,10 @@
 # Note we assume that each step is delivered to the backend as a single Query
 # message so it will run as one transaction.
 
+# t1 table is used for serializable conflict
+setup { CREATE TABLE t1 (a bigserial); }
+teardown { DROP TABLE t1; }
+
 session notifier
 step listenc	{ LISTEN c1; LISTEN c2; }
 step notify1	{ NOTIFY c1; }
@@ -33,8 +37,21 @@ step notifys1	{
 }
 step usage		{ SELECT pg_notification_queue_usage() > 0 AS nonzero; }
 step bignotify	{ SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
+step n1begins   { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step n1select   { SELECT * FROM t1; }
+step n1insert   { INSERT INTO t1 DEFAULT VALUES; }
+step n1commit   { COMMIT; }
 teardown		{ UNLISTEN *; }
 
+# notifier2 session is used to reproduce serializable conflict with notifier
+
+session notifier2
+step n2begins    { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step n2select    { SELECT * FROM t1; }
+step n2insert    { INSERT INTO t1 DEFAULT VALUES; }
+step n2commit    { COMMIT; }
+step n2notify1   { NOTIFY c1, 'n2_payload';  }
+
 # The listener session is used for cross-backend notify checks.
 
 session listener
@@ -73,6 +90,13 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
 # and notify queue is not empty
 permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
 
+# Test checks that listeners ignore notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+permutation llisten n1begins n1select n1insert notify1 n2begins n2select n2insert n2notify1 n1commit n2commit notify1 lcheck
+
 # Verify that pg_notification_queue_usage correctly reports a non-zero result,
 # after submitting notifications while another connection is listening for
 # those notifications and waiting inside an active transaction.  We have to
diff --git a/src/test/modules/xid_wraparound/meson.build b/src/test/modules/xid_wraparound/meson.build
index f7dada67f67..3aec430df8c 100644
--- a/src/test/modules/xid_wraparound/meson.build
+++ b/src/test/modules/xid_wraparound/meson.build
@@ -30,6 +30,7 @@ tests += {
       't/001_emergency_vacuum.pl',
       't/002_limits.pl',
       't/003_wraparounds.pl',
+      't/004_notify_freeze.pl',
     ],
   },
 }
diff --git a/src/test/modules/xid_wraparound/t/004_notify_freeze.pl b/src/test/modules/xid_wraparound/t/004_notify_freeze.pl
new file mode 100644
index 00000000000..e0386afe26a
--- /dev/null
+++ b/src/test/modules/xid_wraparound/t/004_notify_freeze.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+#
+# Test freezing XIDs in the async notification queue. This isn't
+# really wraparound-related, but depends on the consume_xids() helper
+# function.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+if (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\bxid_wraparound\b/)
+{
+	plan skip_all => "test xid_wraparound not enabled in PG_TEST_EXTRA";
+}
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# Send multiple notify's from other sessions
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation, and one more with
+# 'txid_current' to bump up the freeze horizon.
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+$node->safe_psql('postgres', 'select txid_current()');
+
+# Remember current datfrozenxid before vacuum freeze so that we can
+# check that it is advanced. (Taking the min() this way assumes that
+# XID wraparound doesn't happen.)
+my $datafronzenxid = $node->safe_psql('postgres',
+	"select min(datfrozenxid::text::bigint) from pg_database");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Check that vacuumdb advanced datfrozenxid
+my $datafronzenxid_freeze = $node->safe_psql('postgres',
+	"select min(datfrozenxid::text::bigint) from pg_database");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid advanced');
+
+# On Session 1, commit and ensure that the all the notifications are
+# received. This depends on correctly freezing the XIDs in the pending
+# notification entries.
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i,
+		qr/Asynchronous notification "s" with payload "$notifications_count" received/
+	);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
-- 
2.47.3

v2-0002-Fix-remaining-race-condition-with-CLOG-truncation.patchtext/x-patch; charset=UTF-8; name=v2-0002-Fix-remaining-race-condition-with-CLOG-truncation.patchDownload

From 6b416d7812ca8db0830307f29cd151af95e3fd92 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 4 Nov 2025 13:24:30 +0200
Subject: [PATCH v2 2/2] Fix remaining race condition with CLOG truncation and
 LISTEN/NOTIFY

Previous commit fixed a bug where VACUUM can truncate the CLOG that's
still needed to check the commit status of XIDs in the async notify
queue, but as mentioned in the commit message, it wasn't a full
fix. If a backend is executing asyncQueueReadAllNotifications() and
has just made a local copy of an async SLRU page which contains old
XIDs, vacuum can concurrently truncate the CLOG covering those XIDs,
and we will still get an error when backend then calls
TransactionIdDidCommit() on those XIDs in the local copy. This commit
fixes that race condition.

To fix, hold the SLRU bank lock across the TransactionIdDidCommit()
calls in NOTIFY processing.

Per Tom Lane's idea. Backpatch to all supported versions.

Reviewed-by: Joel Jacobson
Discussion: https://www.postgresql.org/message-id/2759499.1761756503@sss.pgh.pa.us
---
 src/backend/commands/async.c | 109 +++++++++++++++--------------------
 1 file changed, 48 insertions(+), 61 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index ba06234dc8e..8ac7d989641 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -448,7 +448,6 @@ static void SignalBackends(void);
 static void asyncQueueReadAllNotifications(void);
 static bool asyncQueueProcessPageEntries(volatile QueuePosition *current,
 										 QueuePosition stop,
-										 char *page_buffer,
 										 Snapshot snapshot);
 static void asyncQueueAdvanceTail(void);
 static void ProcessIncomingNotify(bool flush);
@@ -1854,13 +1853,6 @@ asyncQueueReadAllNotifications(void)
 	QueuePosition head;
 	Snapshot	snapshot;
 
-	/* page_buffer must be adequately aligned, so use a union */
-	union
-	{
-		char		buf[QUEUE_PAGESIZE];
-		AsyncQueueEntry align;
-	}			page_buffer;
-
 	/* Fetch current state */
 	LWLockAcquire(NotifyQueueLock, LW_SHARED);
 	/* Assert checks that we have a valid state entry */
@@ -1932,37 +1924,6 @@ asyncQueueReadAllNotifications(void)
 
 		do
 		{
-			int64		curpage = QUEUE_POS_PAGE(pos);
-			int			curoffset = QUEUE_POS_OFFSET(pos);
-			int			slotno;
-			int			copysize;
-
-			/*
-			 * We copy the data from SLRU into a local buffer, so as to avoid
-			 * holding the SLRU lock while we are examining the entries and
-			 * possibly transmitting them to our frontend.  Copy only the part
-			 * of the page we will actually inspect.
-			 */
-			slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
-												InvalidTransactionId);
-			if (curpage == QUEUE_POS_PAGE(head))
-			{
-				/* we only want to read as far as head */
-				copysize = QUEUE_POS_OFFSET(head) - curoffset;
-				if (copysize < 0)
-					copysize = 0;	/* just for safety */
-			}
-			else
-			{
-				/* fetch all the rest of the page */
-				copysize = QUEUE_PAGESIZE - curoffset;
-			}
-			memcpy(page_buffer.buf + curoffset,
-				   NotifyCtl->shared->page_buffer[slotno] + curoffset,
-				   copysize);
-			/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
-			LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
-
 			/*
 			 * Process messages up to the stop position, end of page, or an
 			 * uncommitted message.
@@ -1978,9 +1939,7 @@ asyncQueueReadAllNotifications(void)
 			 * rewrite pages under us. Especially we don't want to hold a lock
 			 * while sending the notifications to the frontend.
 			 */
-			reachedStop = asyncQueueProcessPageEntries(&pos, head,
-													   page_buffer.buf,
-													   snapshot);
+			reachedStop = asyncQueueProcessPageEntries(&pos, head, snapshot);
 		} while (!reachedStop);
 	}
 	PG_FINALLY();
@@ -2000,13 +1959,6 @@ asyncQueueReadAllNotifications(void)
  * Fetch notifications from the shared queue, beginning at position current,
  * and deliver relevant ones to my frontend.
  *
- * The current page must have been fetched into page_buffer from shared
- * memory.  (We could access the page right in shared memory, but that
- * would imply holding the SLRU bank lock throughout this routine.)
- *
- * We stop if we reach the "stop" position, or reach a notification from an
- * uncommitted transaction, or reach the end of the page.
- *
  * The function returns true once we have reached the stop position or an
  * uncommitted notification, and false if we have finished with the page.
  * In other words: once it returns true there is no need to look further.
@@ -2015,16 +1967,34 @@ asyncQueueReadAllNotifications(void)
 static bool
 asyncQueueProcessPageEntries(volatile QueuePosition *current,
 							 QueuePosition stop,
-							 char *page_buffer,
 							 Snapshot snapshot)
 {
+	int64		curpage = QUEUE_POS_PAGE(*current);
+	int			slotno;
+	char	   *page_buffer;
 	bool		reachedStop = false;
 	bool		reachedEndOfPage;
-	AsyncQueueEntry *qe;
+
+	/*
+	 * We copy the entries into a local buffer to avoid holding the SLRU lock
+	 * while we transmit them to our frontend.  The local buffer must be
+	 * adequately aligned, so use a union.
+	 */
+	union
+	{
+		char		buf[QUEUE_PAGESIZE];
+		AsyncQueueEntry align;
+	}			local_buf;
+	char	   *local_buf_end = local_buf.buf;
+
+	slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+										InvalidTransactionId);
+	page_buffer = NotifyCtl->shared->page_buffer[slotno];
 
 	do
 	{
 		QueuePosition thisentry = *current;
+		AsyncQueueEntry *qe;
 
 		if (QUEUE_POS_EQUAL(thisentry, stop))
 			break;
@@ -2068,16 +2038,8 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 			}
 			else if (TransactionIdDidCommit(qe->xid))
 			{
-				/* qe->data is the null-terminated channel name */
-				char	   *channel = qe->data;
-
-				if (IsListeningOn(channel))
-				{
-					/* payload follows channel name */
-					char	   *payload = qe->data + strlen(channel) + 1;
-
-					NotifyMyFrontEnd(channel, payload, qe->srcPid);
-				}
+				memcpy(local_buf_end, qe, qe->length);
+				local_buf_end += qe->length;
 			}
 			else
 			{
@@ -2091,6 +2053,31 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 		/* Loop back if we're not at end of page */
 	} while (!reachedEndOfPage);
 
+	/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+	LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+
+	/*
+	 * Now that we have let go of the SLRU bank lock, send the notifications
+	 * to our backend
+	 */
+	Assert(local_buf_end - local_buf.buf <= BLCKSZ);
+	for (char *p = local_buf.buf; p < local_buf_end;)
+	{
+		AsyncQueueEntry *qe = (AsyncQueueEntry *) p;
+		/* qe->data is the null-terminated channel name */
+		char	   *channel = qe->data;
+
+		if (IsListeningOn(channel))
+		{
+			/* payload follows channel name */
+			char	   *payload = qe->data + strlen(channel) + 1;
+
+			NotifyMyFrontEnd(channel, payload, qe->srcPid);
+		}
+
+		p += qe->length;
+	}
+
 	if (QUEUE_POS_EQUAL(*current, stop))
 		reachedStop = true;
 
-- 
2.47.3

#79

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

2 months ago

In reply to: Heikki Linnakangas (#78)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Hi,

On Tue, Nov 4, 2025 at 3:10 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 03/11/2025 23:45, Joel Jacobson wrote:

On Mon, Nov 3, 2025, at 12:02, Heikki Linnakangas wrote:

I wrote another little stand-alone performance test program for this,
attached. It launches N connections that send NOTIFYs to a single
channel as fast as possible, and M threads that listen for the
notifications. I ran it with different combinations of N and M, on
'master' and on REL_14_STABLE (which didn't have SLRU banks) and I
cannot discern any performance difference from these patches. So it
seems that holding the SLRU (bank) lock across the
TransactionIdDidCommit() calls is fine.

Nice! That for the benchmark code! I took the liberty of hacking a bit
on it, and added support for multiple channels, with separate listener
and notifier threads per channel. Each notification now carries the
notifier ID, a sequence number, and a send timestamp. Listeners verify
that sequence numbers arrive in order and record delivery latency. The
program collects latency measurements into fixed buckets and reports
them once per second together with total and per-second send/receive
counts.

Also added a short delay before starting notifiers so that listeners
have time to issue their LISTEN commands, and a new --channels option,
and the meaning of --listeners and --notifiers was changed to apply per
channel.

Also fixed so the code could be compiled outside of the PostgreSQL
source code repo, if wanting to build this as stand-alone tool.

I've benchmarked master vs 0001+0002 and can't notice any differences;
see attached output from benchmark runs.

Thanks. After some further testing, I was able to find a scenario where
this patch significantly reduces performance: if the listening backends
subscribe to a massive number of channels, like 10000, they spend a lot
of time scanning the linked list of subscribed channels in
IsListeningOn(). With the patch, those checks were performed while
holding the SLRU lock, and it started to show up as lock contention
between notifiers and listeners. To demonstrate that, attached is
another version of the test program that adds an --extra-channels=N
argument. If you set it to e.g. 10000, each listener backends calls
LISTEN on 10000 additional channels that are never notified. They just
make the listenChannels list longer. With that and the patches I posted
previously, I'm getting:

$ PGHOST=localhost PGDB=postgres://localhost/postgres
./async-notify-test-3 --listeners=50 --notifiers=4 --channels=1
--extra-channels=10000
10 s: 12716 sent (1274/s), 635775 received (63737/s)
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms # 1915 (0.3%) avg: 0.807ms
1.00-10.00ms ######### 633550 (99.7%) avg: 3.502ms
10.00-100.00ms # 310 (0.0%) avg: 11.423ms

100.00ms 0 (0.0%) avg: 0.000ms

^C

Whereas on 'master', I see about 2-3x more notifies/s:

$ PGHOST=localhost PGDB=postgres://localhost/postgres
./async-notify-test-3 --listeners=50 --notifiers=4 --channels=1
--extra-channels=10000
10 s: 32057 sent (3296/s), 1602995 received (164896/s)
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 11574 (0.7%) avg: 0.078ms
0.10-1.00ms ###### 1082960 (67.6%) avg: 0.577ms
1.00-10.00ms ### 508199 (31.7%) avg: 1.489ms
10.00-100.00ms # 262 (0.0%) avg: 16.178ms

100.00ms 0 (0.0%) avg: 0.000ms

^C

Fortunately that's easy to fix: We can move the IsListeningOn() check
after releasing the lock. See attached.

Thank you for working on this!

It seems that 0002 handles errors during NotifyMyFrontEnd a little differently.

With master, in case of a failure during NotifyMyFrontEnd, the
listener's position in PG_FINALLY is set to the beginning of the next
notification, since we advance the "current" position only if the
previous notification was successfully sent.

With 0002, we advance the "current" position while copying
notifications to the local buffer, and begin sending them after the
position has already been advanced for all copied notifications. So in
case of a failure, the listener's position in PG_FINALLY is set to the
beginning of the next page or queue head. This means we can lose
notifications that were copied but were not sent.

If we want to preserve the previous behavior, maybe we could use a new
local position while copying notifications and only advance the
"current" position while sending notifications to the frontend?

Best regards,
Arseniy Mukhin

#80

Tom Lane

tgl@sss.pgh.pa.us

2 months ago

In reply to: Heikki Linnakangas (#78)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Heikki Linnakangas <hlinnaka@iki.fi> writes:

Fortunately that's easy to fix: We can move the IsListeningOn() check
after releasing the lock. See attached.
The elephant in the room of course is that a lookup in a linked list is
O(n) and it would be very straightforward to replace it with e.g. a hash
table. We should do that irrespective of this bug fix. But I'm inclined
to do it as a separate followup patch.

There is a different patch series that's concerned with improving
NOTIFY throughput [1]https://commitfest.postgresql.org/patch/6078/. I think this one should just focus on
fixing the XID problem.

regards, tom lane

[1]: https://commitfest.postgresql.org/patch/6078/

#81

Álvaro Herrera

alvherre@kurilemu.de

2 months ago

In reply to: Heikki Linnakangas (#78)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 2025-Nov-04, Heikki Linnakangas wrote:

From 7c342e6efffc8d59c2e7658f6f2f3b138d02e0bb Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 4 Nov 2025 13:22:08 +0200
Subject: [PATCH v2 1/2] Fix bug where we truncated CLOG that was still needed
by LISTEN/NOTIFY

The async notification queue contains the XID of the sender, and when
processing notifications we call TransactionIdDidCommit() on the
XID. But we had no safeguards to prevent the CLOG segments containing
those XIDs from being truncated away. As a result, if a backend didn't
for some reason process its notifications for a long time, or when a
new backend issued LISTEN, you could get an error like:

test=# listen c21;
ERROR: 58P01: could not access status of transaction 14279685
DETAIL: Could not open file "pg_xact/000D": No such file or directory.
LOCATION: SlruReportIOError, slru.c:1087

Note: This commit is not a full fix. [...]

The commit message doesn't say just _what_ does the patch do to fix the
bug though. Looking through the code, I think you're setting XIDs in
the async queue to frozenXid, but I think the commit message should
explain that.

Thanks for spending time on this.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)

#82

Joel Jacobson

joel@compiler.org

2 months ago

In reply to: Heikki Linnakangas (#78)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Tue, Nov 4, 2025, at 13:09, Heikki Linnakangas wrote:

With that and the patches I posted
previously, I'm getting:

$ PGHOST=localhost PGDB=postgres://localhost/postgres
./async-notify-test-3 --listeners=50 --notifiers=4 --channels=1
--extra-channels=10000
10 s: 12716 sent (1274/s), 635775 received (63737/s)
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms # 1915 (0.3%) avg: 0.807ms
1.00-10.00ms ######### 633550 (99.7%) avg: 3.502ms
10.00-100.00ms # 310 (0.0%) avg: 11.423ms

100.00ms 0 (0.0%) avg: 0.000ms

^C

Whereas on 'master', I see about 2-3x more notifies/s:

$ PGHOST=localhost PGDB=postgres://localhost/postgres
./async-notify-test-3 --listeners=50 --notifiers=4 --channels=1
--extra-channels=10000
10 s: 32057 sent (3296/s), 1602995 received (164896/s)
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 11574 (0.7%) avg: 0.078ms
0.10-1.00ms ###### 1082960 (67.6%) avg: 0.577ms
1.00-10.00ms ### 508199 (31.7%) avg: 1.489ms
10.00-100.00ms # 262 (0.0%) avg: 16.178ms

100.00ms 0 (0.0%) avg: 0.000ms

Nice with the --extra-channels addition!

Below results on my MacBook Pro M3 Max.
I ran them for 30s to get more stable distributions.

% ninja install
C compiler for the host machine: cc (clang 17.0.0 "Apple clang version 17.0.0 (clang-1700.0.13.5)")
C linker for the host machine: cc ld64 1167.5
Host machine cpu family: aarch64
Host machine cpu: aarch64
buildtype: release

% gcc -Wall -Wextra -O2 -pthread -I/Users/joel/pg19/include/postgresql/server -I/Users/joel/pg19/include -o async-notify-test-3 async-notify-test-3.c -L/Users/joel/pg19/lib -lpq -pthread -lm

% ./async-notify-test-3 --listeners=50 --notifiers=4 --channels=1 --extra-channels=10000

# master (c98dffcb7c7010d216dc16d22cb594ef7d65fde1)
30 s: 293329 sent (9656/s), 14653182 received (480181/s)
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 298195 (2.0%) avg: 0.078ms
0.10-1.00ms ## 3060085 (20.9%) avg: 0.390ms
1.00-10.00ms ## 3310362 (22.6%) avg: 4.242ms
10.00-100.00ms ##### 7482595 (51.1%) avg: 44.441ms

100.00ms # 501945 (3.4%) avg: 119.257ms

# 0001-Fix-bug-where-we-truncated-CLOG-that-was-still-neede.patch
# 0002-Hold-SLRU-bank-lock-across-TransactionIdDidCommit-in.patch
30 s: 87462 sent (3174/s), 4373100 received (158698/s)
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 16 (0.0%) avg: 0.091ms
0.10-1.00ms # 576554 (13.2%) avg: 0.798ms
1.00-10.00ms ######## 3796530 (86.8%) avg: 1.589ms
10.00-100.00ms 0 (0.0%) avg: 0.000ms

100.00ms 0 (0.0%) avg: 0.000ms

# v2-0001-Fix-bug-where-we-truncated-CLOG-that-was-still-ne.patch
# v2-0002-Fix-remaining-race-condition-with-CLOG-truncation.patch
30 s: 274342 sent (10134/s), 13708284 received (525274/s)
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 241981 (1.8%) avg: 0.079ms
0.10-1.00ms ### 4213484 (30.7%) avg: 0.399ms
1.00-10.00ms # 2727917 (19.9%) avg: 4.033ms
10.00-100.00ms #### 5920198 (43.2%) avg: 46.500ms

100.00ms # 604712 (4.4%) avg: 118.336ms

/Joel

#83

Joel Jacobson

joel@compiler.org

2 months ago

In reply to: Heikki Linnakangas (#78)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Tue, Nov 4, 2025, at 13:09, Heikki Linnakangas wrote:

The elephant in the room of course is that a lookup in a linked list is
O(n) and it would be very straightforward to replace it with e.g. a hash
table. We should do that irrespective of this bug fix. But I'm inclined
to do it as a separate followup patch.

Thanks for the idea of replacing it with a local hash table.
I've implemented it in the patch set we're working on:
https://commitfest.postgresql.org/patch/6078/
/messages/by-id/12c08e29-c21c-4a3d-a269-a48a1a26b18d@app.fastmail.com

/Joel

#84

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Arseniy Mukhin (#79)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 04/11/2025 16:40, Arseniy Mukhin wrote:

It seems that 0002 handles errors during NotifyMyFrontEnd a little differently.

With master, in case of a failure during NotifyMyFrontEnd, the
listener's position in PG_FINALLY is set to the beginning of the next
notification, since we advance the "current" position only if the
previous notification was successfully sent.

With 0002, we advance the "current" position while copying
notifications to the local buffer, and begin sending them after the
position has already been advanced for all copied notifications. So in
case of a failure, the listener's position in PG_FINALLY is set to the
beginning of the next page or queue head. This means we can lose
notifications that were copied but were not sent.

True.

If we want to preserve the previous behavior, maybe we could use a new
local position while copying notifications and only advance the
"current" position while sending notifications to the frontend?

That's not good. The loop advances 'current' before calling
TransactionIdDidCommit() to ensure that if there's a broken entry in the
queue for which TransactionIdDidCommit() throws an error, we advance
'current' past that point. Otherwise we would get back later to try to
process the same broken entry again and again.

We could put the NotifyMyFrontEnd() calls in a PG_FINALLY block, so that
the copied notifications get sent even on error. But I'm a reluctant to
put that in PG_FINALLY, it's not clear to me if NotifyMyFrontEnd() is
safe to call during error processing. And if the client is not
processing the notifications, the abort processing could be delayed for
a long time.

One idea is to close the client connection on a TransactionIdDidCommit
error, i.e. make the error FATAL. The current behavior where we skip the
notification seems problematic already. Closing the connection would be
a clear signal that some notifications have been lost.

Or we can just accept new behavior that if TransactionIdDidCommit()
fails, we might lose more notifications than the one that caused the error.

- Heikki

#85

Matheus Alcantara

matheusssilv97@gmail.com

2 months ago

In reply to: Heikki Linnakangas (#84)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed Nov 5, 2025 at 6:21 AM -03, Heikki Linnakangas wrote:

On 04/11/2025 16:40, Arseniy Mukhin wrote:

It seems that 0002 handles errors during NotifyMyFrontEnd a little differently.

With master, in case of a failure during NotifyMyFrontEnd, the
listener's position in PG_FINALLY is set to the beginning of the next
notification, since we advance the "current" position only if the
previous notification was successfully sent.

With 0002, we advance the "current" position while copying
notifications to the local buffer, and begin sending them after the
position has already been advanced for all copied notifications. So in
case of a failure, the listener's position in PG_FINALLY is set to the
beginning of the next page or queue head. This means we can lose
notifications that were copied but were not sent.

True.

If we want to preserve the previous behavior, maybe we could use a new
local position while copying notifications and only advance the
"current" position while sending notifications to the frontend?

That's not good. The loop advances 'current' before calling
TransactionIdDidCommit() to ensure that if there's a broken entry in the
queue for which TransactionIdDidCommit() throws an error, we advance
'current' past that point. Otherwise we would get back later to try to
process the same broken entry again and again.

We could put the NotifyMyFrontEnd() calls in a PG_FINALLY block, so that
the copied notifications get sent even on error. But I'm a reluctant to
put that in PG_FINALLY, it's not clear to me if NotifyMyFrontEnd() is
safe to call during error processing. And if the client is not
processing the notifications, the abort processing could be delayed for
a long time.

One idea is to close the client connection on a TransactionIdDidCommit
error, i.e. make the error FATAL. The current behavior where we skip the
notification seems problematic already. Closing the connection would be
a clear signal that some notifications have been lost.

Or we can just accept new behavior that if TransactionIdDidCommit()
fails, we might lose more notifications than the one that caused the error.

I think that we may have an error on TransactionIdDidCommit() and on
NotifyMyFrontEnd() right?

In case of an error on TransactionIdDidCommit() I think that we will
have the same behavior as we advance the "current" position before of
DidCommit call the PG_FINALLY block will set the backend position past
the failing notification entry.

I think that the problem is on NotifyMyFrontEnd() call that if it fails
we will set the current backend position to the head of the queue
or the beginning of the next page possibly losing notifications.

How bad would be to store the notification entries within a list and
store the next position with the notification entry and then wrap the
NotifyMyFrontEnd() call within a PG_TRY and update the "current" to the
saved "next position" on PG_CATCH? Something like this:
do
{
ListEntry le;
qe = (AsyncQueueEntry *) (page_buffer + QUEUE_POS_OFFSET(thisentry));
reachedEndOfPage = asyncQueueAdvance(current, qe->length);
...
else if (TransactionIdDidCommit(qe->xid))
{
le->entry = qe;
le->nextPos = current;
}
} while (!reachedEndOfPage);

foreach(lc, notificationEntries)
{
ListEntry *le = lfirst(lc);
AsyncQueueEntry *qe = le->entry;
char *channel = qe->data;

PG_TRY()
{
if (IsListeningOn(channel))
{
char *payload = qe->data + strlen(channel) + 1;
NotifyMyFrontEnd(channel, payload, qe->srcPid);
}
}
PG_CATCH()
{
current = le->nextPos;
PG_RE_THROW();
}
}

My concern with this idea is about the overhead of using a list to store
the notification entries and also the PG_TRY block but perhaps it's
worth it to prevent losing notifications?

--
Matheus Alcantara
EDB: http://www.enterprisedb.com

#86

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Matheus Alcantara (#85)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 05/11/2025 14:40, Matheus Alcantara wrote:

On Wed Nov 5, 2025 at 6:21 AM -03, Heikki Linnakangas wrote:

On 04/11/2025 16:40, Arseniy Mukhin wrote:

If we want to preserve the previous behavior, maybe we could use a new
local position while copying notifications and only advance the
"current" position while sending notifications to the frontend?

That's not good. The loop advances 'current' before calling
TransactionIdDidCommit() to ensure that if there's a broken entry in the
queue for which TransactionIdDidCommit() throws an error, we advance
'current' past that point. Otherwise we would get back later to try to
process the same broken entry again and again.

We could put the NotifyMyFrontEnd() calls in a PG_FINALLY block, so that
the copied notifications get sent even on error. But I'm a reluctant to
put that in PG_FINALLY, it's not clear to me if NotifyMyFrontEnd() is
safe to call during error processing. And if the client is not
processing the notifications, the abort processing could be delayed for
a long time.

One idea is to close the client connection on a TransactionIdDidCommit
error, i.e. make the error FATAL. The current behavior where we skip the
notification seems problematic already. Closing the connection would be
a clear signal that some notifications have been lost.

Or we can just accept new behavior that if TransactionIdDidCommit()
fails, we might lose more notifications than the one that caused the error.

I think that we may have an error on TransactionIdDidCommit() and on
NotifyMyFrontEnd() right?

True.

In case of an error on TransactionIdDidCommit() I think that we will
have the same behavior as we advance the "current" position before of
DidCommit call the PG_FINALLY block will set the backend position past
the failing notification entry.

With my patch, if TransactionIdDidCommit() fails, we will lose all the
notifications that we have buffered in the local buffer but haven't
passed to NotifyMyFrontEnd() yet. On 'master', you only lose a single
notification, the one where TransactionIdDidCommit() failed.

I think that the problem is on NotifyMyFrontEnd() call that if it fails
we will set the current backend position to the head of the queue
or the beginning of the next page possibly losing notifications.

True, a failure on NotifyMyFrontEnd can happen too, and it will also
lead to losing all the other notifications in the local buffer.

How bad would be to store the notification entries within a list and
store the next position with the notification entry and then wrap the
NotifyMyFrontEnd() call within a PG_TRY and update the "current" to the
saved "next position" on PG_CATCH? Something like this:
do
{
ListEntry le;
qe = (AsyncQueueEntry *) (page_buffer + QUEUE_POS_OFFSET(thisentry));
reachedEndOfPage = asyncQueueAdvance(current, qe->length);
...
else if (TransactionIdDidCommit(qe->xid))
{
le->entry = qe;
le->nextPos = current;
}
} while (!reachedEndOfPage);

foreach(lc, notificationEntries)
{
ListEntry *le = lfirst(lc);
AsyncQueueEntry *qe = le->entry;
char *channel = qe->data;

PG_TRY()
{
if (IsListeningOn(channel))
{
char *payload = qe->data + strlen(channel) + 1;
NotifyMyFrontEnd(channel, payload, qe->srcPid);
}
}
PG_CATCH()
{
current = le->nextPos;
PG_RE_THROW();
}
}

That addresses the failure on NotifyMyFrontEnd, but not on
TransactionIdDidCommit.

IMHO we should just make these errors FATAL. TransactionIdDidCommit()
should not really fail, after fixing the bug we're discussing.
NotifyMyFrontEnd() could fail on OOM, but that seems pretty unlikely on
an otherwise idle connection. Or it could fail if the client connection
is lost, but then the backend is about to die anyway. And arguably
closing the connection is better than losing even a single notification,
anyway.

- Heikki

#87

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

2 months ago

In reply to: Heikki Linnakangas (#86)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed, Nov 5, 2025 at 12:21 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/11/2025 16:40, Arseniy Mukhin wrote:

It seems that 0002 handles errors during NotifyMyFrontEnd a little differently.

With master, in case of a failure during NotifyMyFrontEnd, the
listener's position in PG_FINALLY is set to the beginning of the next
notification, since we advance the "current" position only if the
previous notification was successfully sent.

With 0002, we advance the "current" position while copying
notifications to the local buffer, and begin sending them after the
position has already been advanced for all copied notifications. So in
case of a failure, the listener's position in PG_FINALLY is set to the
beginning of the next page or queue head. This means we can lose
notifications that were copied but were not sent.

True.

If we want to preserve the previous behavior, maybe we could use a new
local position while copying notifications and only advance the
"current" position while sending notifications to the frontend?

That's not good. The loop advances 'current' before calling
TransactionIdDidCommit() to ensure that if there's a broken entry in the
queue for which TransactionIdDidCommit() throws an error, we advance
'current' past that point. Otherwise we would get back later to try to
process the same broken entry again and again.

Ouch, I failed to realise that this try/catch saves us from
TransactionIdDidCommit failure too.

The comment around PG_TRY says:

/*
* It is possible that we fail while trying to send a message to our
* frontend (for example, because of encoding conversion failure). If

So I thought that it's the main reason we have try/catch here.

Best regards,
Arseniy Mukhin

#88

Matheus Alcantara

matheusssilv97@gmail.com

2 months ago

In reply to: Heikki Linnakangas (#86)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Wed Nov 5, 2025 at 9:59 AM -03, Heikki Linnakangas wrote:

In case of an error on TransactionIdDidCommit() I think that we will
have the same behavior as we advance the "current" position before of
DidCommit call the PG_FINALLY block will set the backend position past
the failing notification entry.

With my patch, if TransactionIdDidCommit() fails, we will lose all the
notifications that we have buffered in the local buffer but haven't
passed to NotifyMyFrontEnd() yet. On 'master', you only lose a single
notification, the one where TransactionIdDidCommit() failed.

Yeah, that's true.

How bad would be to store the notification entries within a list and
store the next position with the notification entry and then wrap the
NotifyMyFrontEnd() call within a PG_TRY and update the "current" to the
saved "next position" on PG_CATCH? Something like this:

[ ...]

That addresses the failure on NotifyMyFrontEnd, but not on
TransactionIdDidCommit.

IMHO we should just make these errors FATAL. TransactionIdDidCommit()
should not really fail, after fixing the bug we're discussing.
NotifyMyFrontEnd() could fail on OOM, but that seems pretty unlikely on
an otherwise idle connection. Or it could fail if the client connection
is lost, but then the backend is about to die anyway. And arguably
closing the connection is better than losing even a single notification,
anyway.

My only concern with making these errors FATAL is that if a notification
entry causes a different, recoverable error, all subsequent messages
will be lost. This is because if backend die and the user open a new
connection and execute LISTEN on the channel it will not see these
notifications past the one that caused the error. I'm not sure if we are
completely safe from this case of a recoverable error, what do you
think?

--
Matheus Alcantara
EDB: http://www.enterprisedb.com

#89

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Matheus Alcantara (#88)

3 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 05/11/2025 21:02, Matheus Alcantara wrote:

On Wed Nov 5, 2025 at 9:59 AM -03, Heikki Linnakangas wrote:

In case of an error on TransactionIdDidCommit() I think that we will
have the same behavior as we advance the "current" position before of
DidCommit call the PG_FINALLY block will set the backend position past
the failing notification entry.

With my patch, if TransactionIdDidCommit() fails, we will lose all the
notifications that we have buffered in the local buffer but haven't
passed to NotifyMyFrontEnd() yet. On 'master', you only lose a single
notification, the one where TransactionIdDidCommit() failed.

Yeah, that's true.

How bad would be to store the notification entries within a list and
store the next position with the notification entry and then wrap the
NotifyMyFrontEnd() call within a PG_TRY and update the "current" to the
saved "next position" on PG_CATCH? Something like this:

[ ...]

That addresses the failure on NotifyMyFrontEnd, but not on
TransactionIdDidCommit.

IMHO we should just make these errors FATAL. TransactionIdDidCommit()
should not really fail, after fixing the bug we're discussing.
NotifyMyFrontEnd() could fail on OOM, but that seems pretty unlikely on
an otherwise idle connection. Or it could fail if the client connection
is lost, but then the backend is about to die anyway. And arguably
closing the connection is better than losing even a single notification,
anyway.

My only concern with making these errors FATAL is that if a notification
entry causes a different, recoverable error, all subsequent messages
will be lost. This is because if backend die and the user open a new
connection and execute LISTEN on the channel it will not see these
notifications past the one that caused the error. I'm not sure if we are
completely safe from this case of a recoverable error, what do you
think?

I did some more testing of the current behavior, using the encoding
conversion to cause an error:

In backend A:

SET client_encoding='latin1';
LISTEN foo;

Backend b:

NOTIFY foo, 'ハ';

Observations:

- If the connection is idle when the notification is received, the ERROR
is turned into FATAL anyway:

postgres=# SET client_encoding='latin1';
SET
postgres=# LISTEN foo;
LISTEN
postgres=# select 1; -- do the NOTIFY in another connection before this
ERROR: character with byte sequence 0xe3 0x83 0x8f in encoding "UTF8"
has no equivalent in encoding "LATIN1"
FATAL: terminating connection because protocol synchronization was lost
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

- If there are multiple notifications pending, we stop processing the
subsequent notifications on the first error. The subsequent
notifications will only be processed when another notify interrupt is
received, i.e. when a backend sends yet another notification.

I'm getting more and more convinced that escalating all ERRORs to FATALs
during notify processing is the right way to go. Attached is a new patch
set that does that.

- Heikki

Attachments:

v3-0001-Escalate-ERRORs-during-async-notify-processing-to.patchtext/x-patch; charset=UTF-8; name=v3-0001-Escalate-ERRORs-during-async-notify-processing-to.patchDownload

From c1e72ebc3ca840cb75d3fd004abba1944a028304 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 6 Nov 2025 11:21:39 +0200
Subject: [PATCH v3 1/3] Escalate ERRORs during async notify processing to
 FATAL

Previously, if async notify processing encountered an error, we would
report the error to the client and advance our read position past the
offending entry to prevent trying to process it over and over
again. Trying to continue after an error has a few problems however:

- We have no way of telling the client that a notification was
  lost. It's not clear if keeping the connection alive after losing a
  notification is a good thing. Depending on the application logic,
  missing a notification could for example cause the application to
  get stuck waiting.

- If the connection is idle, PqCommReadingMsg is set and any ERROR is
  turned into FATAL anyway.

- We bailed out of the notification processing loop on first error
  without processing any subsequent notifications, until another
  notify interrupt arrives. For example, if there were two
  notifications pending, and processing the first one caused an ERROR,
  the second notification would not be processed until someone sent a
  new NOTIFY.

This commit changes the behavior so that any ERROR while processing
async notifications is turned into FATAL, causing the client
connection to be dropped. That makes the behavior more consistent as
that's what happened in idle state already, and dropping the
connection is a clear signal to the application that it might've
missed some notifications.

The reason to do this now is that the next commits will change the
notification processing code in a way that would make it harder to
skip over just the offending notification entry on error.

Discussion: https://www.postgresql.org/message-id/fedbd908-4571-4bbe-b48e-63bfdcc38f64@iki.fi
---
 src/backend/commands/async.c | 34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..6b844808ef3 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -446,7 +446,7 @@ static double asyncQueueUsage(void);
 static void asyncQueueFillWarning(void);
 static void SignalBackends(void);
 static void asyncQueueReadAllNotifications(void);
-static bool asyncQueueProcessPageEntries(volatile QueuePosition *current,
+static bool asyncQueueProcessPageEntries(QueuePosition *current,
 										 QueuePosition stop,
 										 char *page_buffer,
 										 Snapshot snapshot);
@@ -1850,7 +1850,7 @@ ProcessNotifyInterrupt(bool flush)
 static void
 asyncQueueReadAllNotifications(void)
 {
-	volatile QueuePosition pos;
+	QueuePosition pos;
 	QueuePosition head;
 	Snapshot	snapshot;
 
@@ -1920,16 +1920,25 @@ asyncQueueReadAllNotifications(void)
 	 * It is possible that we fail while trying to send a message to our
 	 * frontend (for example, because of encoding conversion failure).  If
 	 * that happens it is critical that we not try to send the same message
-	 * over and over again.  Therefore, we place a PG_TRY block here that will
-	 * forcibly advance our queue position before we lose control to an error.
-	 * (We could alternatively retake NotifyQueueLock and move the position
-	 * before handling each individual message, but that seems like too much
-	 * lock traffic.)
+	 * over and over again.  Therefore, we set ExitOnAnyError to upgrade any
+	 * ERRORs to FATAL, causing the client connection to be closed on error.
+	 *
+	 * We used to only skip over only the offending message and try to soldier
+	 * on, but it was a little questionable to lose a notification and give
+	 * the client ERRORs instead.  A client application would not be prepared
+	 * for that and can't tell that a notification was missed.  It was also
+	 * not very useful in practice because notifications are often processed
+	 * while a connection is idle and reading a message from the client, and
+	 * in that state, any error is upgraded to FATAL anyway.  Closing the
+	 * connection is a clear signal to the application that it might have
+	 * missed notifications.
 	 */
-	PG_TRY();
 	{
+		bool		save_ExitOnAnyError = ExitOnAnyError;
 		bool		reachedStop;
 
+		ExitOnAnyError = true;
+
 		do
 		{
 			int64		curpage = QUEUE_POS_PAGE(pos);
@@ -1982,15 +1991,14 @@ asyncQueueReadAllNotifications(void)
 													   page_buffer.buf,
 													   snapshot);
 		} while (!reachedStop);
-	}
-	PG_FINALLY();
-	{
+
 		/* Update shared state */
 		LWLockAcquire(NotifyQueueLock, LW_SHARED);
 		QUEUE_BACKEND_POS(MyProcNumber) = pos;
 		LWLockRelease(NotifyQueueLock);
+
+		ExitOnAnyError = save_ExitOnAnyError;
 	}
-	PG_END_TRY();
 
 	/* Done with snapshot */
 	UnregisterSnapshot(snapshot);
@@ -2013,7 +2021,7 @@ asyncQueueReadAllNotifications(void)
  * The QueuePosition *current is advanced past all processed messages.
  */
 static bool
-asyncQueueProcessPageEntries(volatile QueuePosition *current,
+asyncQueueProcessPageEntries(QueuePosition *current,
 							 QueuePosition stop,
 							 char *page_buffer,
 							 Snapshot snapshot)
-- 
2.47.3

v3-0002-Fix-bug-where-we-truncated-CLOG-that-was-still-ne.patchtext/x-patch; charset=UTF-8; name=v3-0002-Fix-bug-where-we-truncated-CLOG-that-was-still-ne.patchDownload

From 57217873c5758c7bd0c6a82050cbbbea14445c43 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 4 Nov 2025 13:22:08 +0200
Subject: [PATCH v3 2/3] Fix bug where we truncated CLOG that was still needed
 by LISTEN/NOTIFY
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The async notification queue contains the XID of the sender, and when
processing notifications we call TransactionIdDidCommit() on the
XID. But we had no safeguards to prevent the CLOG segments containing
those XIDs from being truncated away. As a result, if a backend didn't
for some reason process its notifications for a long time, or when a
new backend issued LISTEN, you could get an error like:

test=# listen c21;
ERROR:  58P01: could not access status of transaction 14279685
DETAIL:  Could not open file "pg_xact/000D": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:1087

Note: This commit is not a full fix. A race condition remains, where a
backend is executing asyncQueueReadAllNotifications() and has just
made a local copy of an async SLRU page which contains old XIDs, while
vacuum concurrently truncates the CLOG covering those XIDs. When the
backend then calls TransactionIdDidCommit() on those XIDs from the
local copy, you still get the error. The next commit will fix that
remaining race condition.

This was first reported by Sergey Zhuravlev in 2021, with many other
people hitting the same issue later. Thanks to:
- Alexandra Wang, Daniil Davydov, Andrei Varashen and Jacques Combrink for
  investigating and providing reproducable test cases,
- Matheus Alcantara and Arseniy Mukhin for earlier proposed patches to
  fix this,
- Álvaro Herrera and Masahiko Sawada for reviewing said earlier patches,
- Yura Sokolov aka funny-falcon for the idea of marking transactions as
  committed in the notification queue, and
- Joel Jacobson for the final patch version. I hope I didn't forget anyone.

Backpatch to all supported versions. I believe the bug goes back all
the way to commit d1e027221d, which introduced the SLRU-based async
notification queue.

Discussion: https://www.postgresql.org/message-id/16961-25f29f95b3604a8a@postgresql.org
Discussion: https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2@postgresql.org
Discussion: https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com
---
 src/backend/commands/async.c                  | 114 ++++++++++++++++++
 src/backend/commands/vacuum.c                 |   7 ++
 src/include/commands/async.h                  |   3 +
 src/test/isolation/expected/async-notify.out  |  33 ++++-
 src/test/isolation/specs/async-notify.spec    |  24 ++++
 src/test/modules/xid_wraparound/meson.build   |   1 +
 .../xid_wraparound/t/004_notify_freeze.pl     |  75 ++++++++++++
 7 files changed, 256 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/xid_wraparound/t/004_notify_freeze.pl

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 6b844808ef3..5b81e34e340 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2176,6 +2176,120 @@ asyncQueueAdvanceTail(void)
 	LWLockRelease(NotifyQueueTailLock);
 }
 
+/*
+ * AsyncNotifyFreezeXids
+ *
+ * Prepare the async notification queue for CLOG truncation by freezing
+ * transaction IDs that are about to become inaccessible.
+ *
+ * This function is called by VACUUM before advancing datfrozenxid. It scans
+ * the notification queue and replaces XIDs that would become inaccessible
+ * after CLOG truncation with special markers:
+ * - Committed transactions are set to FrozenTransactionId
+ * - Aborted/crashed transactions are set to InvalidTransactionId
+ *
+ * Only XIDs < newFrozenXid are processed, as those are the ones whose CLOG
+ * pages will be truncated. If XID < newFrozenXid, it cannot still be running
+ * (or it would have held back newFrozenXid through ProcArray).
+ * Therefore, if TransactionIdDidCommit returns false, we know the transaction
+ * either aborted explicitly or crashed, and we can safely mark it invalid.
+ */
+void
+AsyncNotifyFreezeXids(TransactionId newFrozenXid)
+{
+	QueuePosition pos;
+	QueuePosition head;
+	int64		curpage = -1;
+	int			slotno = -1;
+	char	   *page_buffer = NULL;
+	bool		page_dirty = false;
+
+	/*
+	 * Acquire locks in the correct order to avoid deadlocks. As per the
+	 * locking protocol: NotifyQueueTailLock, then NotifyQueueLock, then SLRU
+	 * bank locks.
+	 *
+	 * We only need SHARED mode since we're just reading the head/tail
+	 * positions, not modifying them.
+	 */
+	LWLockAcquire(NotifyQueueTailLock, LW_SHARED);
+	LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+	pos = QUEUE_TAIL;
+	head = QUEUE_HEAD;
+
+	/* Release NotifyQueueLock early, we only needed to read the positions */
+	LWLockRelease(NotifyQueueLock);
+
+	/*
+	 * Scan the queue from tail to head, freezing XIDs as needed. We hold
+	 * NotifyQueueTailLock throughout to ensure the tail doesn't move while
+	 * we're working.
+	 */
+	while (!QUEUE_POS_EQUAL(pos, head))
+	{
+		AsyncQueueEntry *qe;
+		TransactionId xid;
+		int64		pageno = QUEUE_POS_PAGE(pos);
+		int			offset = QUEUE_POS_OFFSET(pos);
+
+		/* If we need a different page, release old lock and get new one */
+		if (pageno != curpage)
+		{
+			LWLock	   *lock;
+
+			/* Release previous page if any */
+			if (slotno >= 0)
+			{
+				if (page_dirty)
+				{
+					NotifyCtl->shared->page_dirty[slotno] = true;
+					page_dirty = false;
+				}
+				LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+			}
+
+			lock = SimpleLruGetBankLock(NotifyCtl, pageno);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			slotno = SimpleLruReadPage(NotifyCtl, pageno, true,
+									   InvalidTransactionId);
+			page_buffer = NotifyCtl->shared->page_buffer[slotno];
+			curpage = pageno;
+		}
+
+		qe = (AsyncQueueEntry *) (page_buffer + offset);
+		xid = qe->xid;
+
+		if (TransactionIdIsNormal(xid) &&
+			TransactionIdPrecedes(xid, newFrozenXid))
+		{
+			if (TransactionIdDidCommit(xid))
+			{
+				qe->xid = FrozenTransactionId;
+				page_dirty = true;
+			}
+			else
+			{
+				qe->xid = InvalidTransactionId;
+				page_dirty = true;
+			}
+		}
+
+		/* Advance to next entry */
+		asyncQueueAdvance(&pos, qe->length);
+	}
+
+	/* Release final page lock if we acquired one */
+	if (slotno >= 0)
+	{
+		if (page_dirty)
+			NotifyCtl->shared->page_dirty[slotno] = true;
+		LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+	}
+
+	LWLockRelease(NotifyQueueTailLock);
+}
+
 /*
  * ProcessIncomingNotify
  *
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ed03e3bd50d..e785dd55ce5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits.h"
+#include "commands/async.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
 #include "commands/progress.h"
@@ -1941,6 +1942,12 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 	}
 
+	/*
+	 * Freeze any old transaction IDs in the async notification queue before
+	 * CLOG truncation.
+	 */
+	AsyncNotifyFreezeXids(frozenXID);
+
 	/*
 	 * Advance the oldest value for commit timestamps before truncating, so
 	 * that if a user requests a timestamp for a transaction we're truncating
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..aaec7314c10 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -46,4 +46,7 @@ extern void HandleNotifyInterrupt(void);
 /* process interrupts */
 extern void ProcessNotifyInterrupt(bool flush);
 
+/* freeze old transaction IDs in notify queue (called by VACUUM) */
+extern void AsyncNotifyFreezeXids(TransactionId newFrozenXid);
+
 #endif							/* ASYNC_H */
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..20d5763f319 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 4 sessions
 
 starting permutation: listenc notify1 notify2 notify3 notifyf
 step listenc: LISTEN c1; LISTEN c2;
@@ -104,6 +104,37 @@ step l2commit: COMMIT;
 listener2: NOTIFY "c1" with payload "" from notifier
 step l2stop: UNLISTEN *;
 
+starting permutation: llisten n1begins n1select n1insert notify1 n2begins n2select n2insert n2notify1 n1commit n2commit notify1 lcheck
+step llisten: LISTEN c1; LISTEN c2;
+step n1begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step n1select: SELECT * FROM t1;
+a
+-
+(0 rows)
+
+step n1insert: INSERT INTO t1 DEFAULT VALUES;
+step notify1: NOTIFY c1;
+step n2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step n2select: SELECT * FROM t1;
+a
+-
+(0 rows)
+
+step n2insert: INSERT INTO t1 DEFAULT VALUES;
+step n2notify1: NOTIFY c1, 'n2_payload';
+step n1commit: COMMIT;
+step n2commit: COMMIT;
+ERROR:  could not serialize access due to read/write dependencies among transactions
+step notify1: NOTIFY c1;
+step lcheck: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener: NOTIFY "c1" with payload "" from notifier
+listener: NOTIFY "c1" with payload "" from notifier
+
 starting permutation: llisten lbegin usage bignotify usage
 step llisten: LISTEN c1; LISTEN c2;
 step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..51b7ad43849 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -5,6 +5,10 @@
 # Note we assume that each step is delivered to the backend as a single Query
 # message so it will run as one transaction.
 
+# t1 table is used for serializable conflict
+setup { CREATE TABLE t1 (a bigserial); }
+teardown { DROP TABLE t1; }
+
 session notifier
 step listenc	{ LISTEN c1; LISTEN c2; }
 step notify1	{ NOTIFY c1; }
@@ -33,8 +37,21 @@ step notifys1	{
 }
 step usage		{ SELECT pg_notification_queue_usage() > 0 AS nonzero; }
 step bignotify	{ SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
+step n1begins   { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step n1select   { SELECT * FROM t1; }
+step n1insert   { INSERT INTO t1 DEFAULT VALUES; }
+step n1commit   { COMMIT; }
 teardown		{ UNLISTEN *; }
 
+# notifier2 session is used to reproduce serializable conflict with notifier
+
+session notifier2
+step n2begins    { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step n2select    { SELECT * FROM t1; }
+step n2insert    { INSERT INTO t1 DEFAULT VALUES; }
+step n2commit    { COMMIT; }
+step n2notify1   { NOTIFY c1, 'n2_payload';  }
+
 # The listener session is used for cross-backend notify checks.
 
 session listener
@@ -73,6 +90,13 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
 # and notify queue is not empty
 permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
 
+# Test checks that listeners ignore notifications from aborted
+# transaction even if notifications have been added to the listen/notify
+# queue. To reproduce it we use the fact that serializable conflicts
+# are checked after tx adds notifications to the queue.
+
+permutation llisten n1begins n1select n1insert notify1 n2begins n2select n2insert n2notify1 n1commit n2commit notify1 lcheck
+
 # Verify that pg_notification_queue_usage correctly reports a non-zero result,
 # after submitting notifications while another connection is listening for
 # those notifications and waiting inside an active transaction.  We have to
diff --git a/src/test/modules/xid_wraparound/meson.build b/src/test/modules/xid_wraparound/meson.build
index f7dada67f67..3aec430df8c 100644
--- a/src/test/modules/xid_wraparound/meson.build
+++ b/src/test/modules/xid_wraparound/meson.build
@@ -30,6 +30,7 @@ tests += {
       't/001_emergency_vacuum.pl',
       't/002_limits.pl',
       't/003_wraparounds.pl',
+      't/004_notify_freeze.pl',
     ],
   },
 }
diff --git a/src/test/modules/xid_wraparound/t/004_notify_freeze.pl b/src/test/modules/xid_wraparound/t/004_notify_freeze.pl
new file mode 100644
index 00000000000..e0386afe26a
--- /dev/null
+++ b/src/test/modules/xid_wraparound/t/004_notify_freeze.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+#
+# Test freezing XIDs in the async notification queue. This isn't
+# really wraparound-related, but depends on the consume_xids() helper
+# function.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+$node->start;
+
+if (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\bxid_wraparound\b/)
+{
+	plan skip_all => "test xid_wraparound not enabled in PG_TEST_EXTRA";
+}
+
+# Setup
+$node->safe_psql('postgres', 'CREATE EXTENSION xid_wraparound');
+$node->safe_psql('postgres',
+	'ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true');
+
+# Start Session 1 and leave it idle in transaction
+my $psql_session1 = $node->background_psql('postgres');
+$psql_session1->query_safe('listen s;', "Session 1 listens to 's'");
+$psql_session1->query_safe('begin;', "Session 1 starts a transaction");
+
+# Send multiple notify's from other sessions
+for my $i (1 .. 10)
+{
+	$node->safe_psql(
+		'postgres', "
+		BEGIN;
+		NOTIFY s, '$i';
+		COMMIT;");
+}
+
+# Consume enough XIDs to trigger truncation, and one more with
+# 'txid_current' to bump up the freeze horizon.
+$node->safe_psql('postgres', 'select consume_xids(10000000);');
+$node->safe_psql('postgres', 'select txid_current()');
+
+# Remember current datfrozenxid before vacuum freeze so that we can
+# check that it is advanced. (Taking the min() this way assumes that
+# XID wraparound doesn't happen.)
+my $datafronzenxid = $node->safe_psql('postgres',
+	"select min(datfrozenxid::text::bigint) from pg_database");
+
+# Execute vacuum freeze on all databases
+$node->command_ok([ 'vacuumdb', '--all', '--freeze', '--port', $node->port ],
+	"vacuumdb --all --freeze");
+
+# Check that vacuumdb advanced datfrozenxid
+my $datafronzenxid_freeze = $node->safe_psql('postgres',
+	"select min(datfrozenxid::text::bigint) from pg_database");
+ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid advanced');
+
+# On Session 1, commit and ensure that the all the notifications are
+# received. This depends on correctly freezing the XIDs in the pending
+# notification entries.
+my $res = $psql_session1->query_safe('commit;', "commit listen s;");
+my $notifications_count = 0;
+foreach my $i (split('\n', $res))
+{
+	$notifications_count++;
+	like($i,
+		qr/Asynchronous notification "s" with payload "$notifications_count" received/
+	);
+}
+is($notifications_count, 10, 'received all committed notifications');
+
+done_testing();
-- 
2.47.3

v3-0003-Fix-remaining-race-condition-with-CLOG-truncation.patchtext/x-patch; charset=UTF-8; name=v3-0003-Fix-remaining-race-condition-with-CLOG-truncation.patchDownload

From 4403f2f2119fb3caed9e4daf26785aed221924f0 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 4 Nov 2025 13:24:30 +0200
Subject: [PATCH v3 3/3] Fix remaining race condition with CLOG truncation and
 LISTEN/NOTIFY

Previous commit fixed a bug where VACUUM can truncate the CLOG that's
still needed to check the commit status of XIDs in the async notify
queue, but as mentioned in the commit message, it wasn't a full
fix. If a backend is executing asyncQueueReadAllNotifications() and
has just made a local copy of an async SLRU page which contains old
XIDs, vacuum can concurrently truncate the CLOG covering those XIDs,
and we will still get an error when backend then calls
TransactionIdDidCommit() on those XIDs in the local copy. This commit
fixes that race condition.

To fix, hold the SLRU bank lock across the TransactionIdDidCommit()
calls in NOTIFY processing.

Per Tom Lane's idea. Backpatch to all supported versions.

Reviewed-by: Joel Jacobson
Discussion: https://www.postgresql.org/message-id/2759499.1761756503@sss.pgh.pa.us
---
 src/backend/commands/async.c | 109 +++++++++++++++--------------------
 1 file changed, 48 insertions(+), 61 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 5b81e34e340..3a6765a55fa 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -448,7 +448,6 @@ static void SignalBackends(void);
 static void asyncQueueReadAllNotifications(void);
 static bool asyncQueueProcessPageEntries(QueuePosition *current,
 										 QueuePosition stop,
-										 char *page_buffer,
 										 Snapshot snapshot);
 static void asyncQueueAdvanceTail(void);
 static void ProcessIncomingNotify(bool flush);
@@ -1854,13 +1853,6 @@ asyncQueueReadAllNotifications(void)
 	QueuePosition head;
 	Snapshot	snapshot;
 
-	/* page_buffer must be adequately aligned, so use a union */
-	union
-	{
-		char		buf[QUEUE_PAGESIZE];
-		AsyncQueueEntry align;
-	}			page_buffer;
-
 	/* Fetch current state */
 	LWLockAcquire(NotifyQueueLock, LW_SHARED);
 	/* Assert checks that we have a valid state entry */
@@ -1941,37 +1933,6 @@ asyncQueueReadAllNotifications(void)
 
 		do
 		{
-			int64		curpage = QUEUE_POS_PAGE(pos);
-			int			curoffset = QUEUE_POS_OFFSET(pos);
-			int			slotno;
-			int			copysize;
-
-			/*
-			 * We copy the data from SLRU into a local buffer, so as to avoid
-			 * holding the SLRU lock while we are examining the entries and
-			 * possibly transmitting them to our frontend.  Copy only the part
-			 * of the page we will actually inspect.
-			 */
-			slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
-												InvalidTransactionId);
-			if (curpage == QUEUE_POS_PAGE(head))
-			{
-				/* we only want to read as far as head */
-				copysize = QUEUE_POS_OFFSET(head) - curoffset;
-				if (copysize < 0)
-					copysize = 0;	/* just for safety */
-			}
-			else
-			{
-				/* fetch all the rest of the page */
-				copysize = QUEUE_PAGESIZE - curoffset;
-			}
-			memcpy(page_buffer.buf + curoffset,
-				   NotifyCtl->shared->page_buffer[slotno] + curoffset,
-				   copysize);
-			/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
-			LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
-
 			/*
 			 * Process messages up to the stop position, end of page, or an
 			 * uncommitted message.
@@ -1987,9 +1948,7 @@ asyncQueueReadAllNotifications(void)
 			 * rewrite pages under us. Especially we don't want to hold a lock
 			 * while sending the notifications to the frontend.
 			 */
-			reachedStop = asyncQueueProcessPageEntries(&pos, head,
-													   page_buffer.buf,
-													   snapshot);
+			reachedStop = asyncQueueProcessPageEntries(&pos, head, snapshot);
 		} while (!reachedStop);
 
 		/* Update shared state */
@@ -2008,13 +1967,6 @@ asyncQueueReadAllNotifications(void)
  * Fetch notifications from the shared queue, beginning at position current,
  * and deliver relevant ones to my frontend.
  *
- * The current page must have been fetched into page_buffer from shared
- * memory.  (We could access the page right in shared memory, but that
- * would imply holding the SLRU bank lock throughout this routine.)
- *
- * We stop if we reach the "stop" position, or reach a notification from an
- * uncommitted transaction, or reach the end of the page.
- *
  * The function returns true once we have reached the stop position or an
  * uncommitted notification, and false if we have finished with the page.
  * In other words: once it returns true there is no need to look further.
@@ -2023,16 +1975,34 @@ asyncQueueReadAllNotifications(void)
 static bool
 asyncQueueProcessPageEntries(QueuePosition *current,
 							 QueuePosition stop,
-							 char *page_buffer,
 							 Snapshot snapshot)
 {
+	int64		curpage = QUEUE_POS_PAGE(*current);
+	int			slotno;
+	char	   *page_buffer;
 	bool		reachedStop = false;
 	bool		reachedEndOfPage;
-	AsyncQueueEntry *qe;
+
+	/*
+	 * We copy the entries into a local buffer to avoid holding the SLRU lock
+	 * while we transmit them to our frontend.  The local buffer must be
+	 * adequately aligned, so use a union.
+	 */
+	union
+	{
+		char		buf[QUEUE_PAGESIZE];
+		AsyncQueueEntry align;
+	}			local_buf;
+	char	   *local_buf_end = local_buf.buf;
+
+	slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+										InvalidTransactionId);
+	page_buffer = NotifyCtl->shared->page_buffer[slotno];
 
 	do
 	{
 		QueuePosition thisentry = *current;
+		AsyncQueueEntry *qe;
 
 		if (QUEUE_POS_EQUAL(thisentry, stop))
 			break;
@@ -2076,16 +2046,8 @@ asyncQueueProcessPageEntries(QueuePosition *current,
 			}
 			else if (TransactionIdDidCommit(qe->xid))
 			{
-				/* qe->data is the null-terminated channel name */
-				char	   *channel = qe->data;
-
-				if (IsListeningOn(channel))
-				{
-					/* payload follows channel name */
-					char	   *payload = qe->data + strlen(channel) + 1;
-
-					NotifyMyFrontEnd(channel, payload, qe->srcPid);
-				}
+				memcpy(local_buf_end, qe, qe->length);
+				local_buf_end += qe->length;
 			}
 			else
 			{
@@ -2099,6 +2061,31 @@ asyncQueueProcessPageEntries(QueuePosition *current,
 		/* Loop back if we're not at end of page */
 	} while (!reachedEndOfPage);
 
+	/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+	LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+
+	/*
+	 * Now that we have let go of the SLRU bank lock, send the notifications
+	 * to our backend
+	 */
+	Assert(local_buf_end - local_buf.buf <= BLCKSZ);
+	for (char *p = local_buf.buf; p < local_buf_end;)
+	{
+		AsyncQueueEntry *qe = (AsyncQueueEntry *) p;
+		/* qe->data is the null-terminated channel name */
+		char	   *channel = qe->data;
+
+		if (IsListeningOn(channel))
+		{
+			/* payload follows channel name */
+			char	   *payload = qe->data + strlen(channel) + 1;
+
+			NotifyMyFrontEnd(channel, payload, qe->srcPid);
+		}
+
+		p += qe->length;
+	}
+
 	if (QUEUE_POS_EQUAL(*current, stop))
 		reachedStop = true;
 
-- 
2.47.3

#90

Álvaro Herrera

alvherre@kurilemu.de

2 months ago

In reply to: Matheus Alcantara (#88)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 2025-Nov-05, Matheus Alcantara wrote:

My only concern with making these errors FATAL is that if a notification
entry causes a different, recoverable error, all subsequent messages
will be lost. This is because if backend die and the user open a new
connection and execute LISTEN on the channel it will not see these
notifications past the one that caused the error.

I don't think things are supposed to work that way -- I understand that
an application that connects is supposed to read the current state of
things, and then the notifies ensure that any changes from that point on
can be processed. So if a connection dies on a FATAL, then you have to
establish your LISTENs again and obtain the current state of the world
at that point. It doesn't miss anything, because any NOTIFYies that
occurred while the connection was down should be obtained when current
state is read.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL. Most of the surprises
are of the "oh wow! That's cool" Not the "oh shit!" kind. :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php

#91

Álvaro Herrera

alvherre@kurilemu.de

2 months ago

In reply to: Heikki Linnakangas (#78)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Hello,

I've been studying the code for freezing of items at vacuum-truncate
time, and one thing that jumps at me is that perhaps we ought to be more
proactive in freezing items immediately as they are processed. We can
do that for any transactions that precede RecentXmin, because by that
point, every snapshot in the system should have that committed
transaction as no longer in progress anyway. Something like

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8ac7d989641..88d5ed2b461 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2038,6 +2038,11 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 			}
 			else if (TransactionIdDidCommit(qe->xid))
 			{
+				if (TransactionIdPrecedes(RecentXmin, qe->xid))
+				{
+					elog(WARNING, "freezing an entry for transaction %u", qe->xid);
+					qe->xid = FrozenTransactionId;
+				}
 				memcpy(local_buf_end, qe, qe->length);
 				local_buf_end += qe->length;
 			}

If we do this, then the chances that we need to freeze items at vacuum
time are much lower, and we're not creating any danger that
TransactionIdDidCommit() errors any more than we have in the current
code.

Is there any reason this wouldn't work?

I noticed that async-notify-test-3 doesn't actually freeze any items by
itself ... you need to do "ALTER TABLE template0 allow_connections" and
then do vacuumdb -a in a loop in order for this to happen (or something
similar, I guess). Otherwise the new code in AsyncNotifyFreezeXids is
never executed.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Those who use electric razors are infidels destined to burn in hell while
we drink from rivers of beer, download free vids and mingle with naked
well shaved babes." (http://slashdot.org/comments.pl?sid=44793&cid=4647152)

#92

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

2 months ago

In reply to: Álvaro Herrera (#91)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Thu, Nov 6, 2025 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/11/2025 21:02, Matheus Alcantara wrote:

On Wed Nov 5, 2025 at 9:59 AM -03, Heikki Linnakangas wrote:

In case of an error on TransactionIdDidCommit() I think that we will
have the same behavior as we advance the "current" position before of
DidCommit call the PG_FINALLY block will set the backend position past
the failing notification entry.

With my patch, if TransactionIdDidCommit() fails, we will lose all the
notifications that we have buffered in the local buffer but haven't
passed to NotifyMyFrontEnd() yet. On 'master', you only lose a single
notification, the one where TransactionIdDidCommit() failed.

Yeah, that's true.

How bad would be to store the notification entries within a list and
store the next position with the notification entry and then wrap the
NotifyMyFrontEnd() call within a PG_TRY and update the "current" to the
saved "next position" on PG_CATCH? Something like this:

[ ...]

That addresses the failure on NotifyMyFrontEnd, but not on
TransactionIdDidCommit.

IMHO we should just make these errors FATAL. TransactionIdDidCommit()
should not really fail, after fixing the bug we're discussing.
NotifyMyFrontEnd() could fail on OOM, but that seems pretty unlikely on
an otherwise idle connection. Or it could fail if the client connection
is lost, but then the backend is about to die anyway. And arguably
closing the connection is better than losing even a single notification,
anyway.

My only concern with making these errors FATAL is that if a notification
entry causes a different, recoverable error, all subsequent messages
will be lost. This is because if backend die and the user open a new
connection and execute LISTEN on the channel it will not see these
notifications past the one that caused the error. I'm not sure if we are
completely safe from this case of a recoverable error, what do you
think?

I did some more testing of the current behavior, using the encoding
conversion to cause an error:

In backend A:

SET client_encoding='latin1';
LISTEN foo;

Backend b:

NOTIFY foo, 'ハ';

Observations:

- If the connection is idle when the notification is received, the ERROR
is turned into FATAL anyway:

postgres=# SET client_encoding='latin1';
SET
postgres=# LISTEN foo;
LISTEN
postgres=# select 1; -- do the NOTIFY in another connection before this
ERROR: character with byte sequence 0xe3 0x83 0x8f in encoding "UTF8"
has no equivalent in encoding "LATIN1"
FATAL: terminating connection because protocol synchronization was lost
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

- If there are multiple notifications pending, we stop processing the
subsequent notifications on the first error. The subsequent
notifications will only be processed when another notify interrupt is
received, i.e. when a backend sends yet another notification.

I'm getting more and more convinced that escalating all ERRORs to FATALs
during notify processing is the right way to go. Attached is a new patch
set that does that.

One point about removing try/catch: when we execute LISTEN for the
very first time, we read the queue from the min(listener's pos) up to
the head. listenChannels is empty at the moment, so we never call
NotifyMyFrontEnd. But we do call TransactionIdDidCommit. So if we have
some problematic entry in the queue that we need to process during
initial LISTEN, it could block creation of new listeners. Is it
something we need to worry about?

Best regards,
Arseniy Mukhin

#93

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Álvaro Herrera (#91)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

Thanks for reviewing!

On 06/11/2025 12:24, Álvaro Herrera wrote:

I've been studying the code for freezing of items at vacuum-truncate
time, and one thing that jumps at me is that perhaps we ought to be more
proactive in freezing items immediately as they are processed. We can
do that for any transactions that precede RecentXmin, because by that
point, every snapshot in the system should have that committed
transaction as no longer in progress anyway. Something like
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8ac7d989641..88d5ed2b461 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2038,6 +2038,11 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
}
else if (TransactionIdDidCommit(qe->xid))
{
+				if (TransactionIdPrecedes(RecentXmin, qe->xid))
+				{
+					elog(WARNING, "freezing an entry for transaction %u", qe->xid);
+					qe->xid = FrozenTransactionId;
+				}
memcpy(local_buf_end, qe, qe->length);
local_buf_end += qe->length;
}
If we do this, then the chances that we need to freeze items at vacuum
time are much lower, and we're not creating any danger that
TransactionIdDidCommit() errors any more than we have in the current
code.

Is there any reason this wouldn't work?

We could do that and it would be good for performance anyway. This is
the "hint bit" idea that's been discussed in this thread. I think it
would be better to do it with additional hint bits rather than by
overwriting 'xid'. Having a separate COMMIITTED/ABORTED and FROZEN bits
would allow setting the COMMITTED hint bit even when the xid is newer
than RecentXmin.

Would need to mark the SLRU page dirty if we do that. Not sure if that
requires holding an exclusive lock. And we should also replace aborted
xids with InvalidTransactionId.

I feel that that should be a separate patch, and not backpatched. We'd
still need all the other changes from this patch series even if we did
that, it would be just an optimization.

I noticed that async-notify-test-3 doesn't actually freeze any items by
itself ... you need to do "ALTER TABLE template0 allow_connections" and
then do vacuumdb -a in a loop in order for this to happen (or something
similar, I guess). Otherwise the new code in AsyncNotifyFreezeXids is
never executed.

Right, async-notify-test-3 was just a performance test to verify that
the holding the SLRU lock longer doesn't degrade performance. I haven't
tried performance testing CLOG truncation itself, but it happens so
rarely that I'm not too worried about it. See
src/test/modules/xid_wraparound/t/004_notify_freeze.pl for a correctness
test.

- Heikki

#94

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Arseniy Mukhin (#92)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 06/11/2025 13:20, Arseniy Mukhin wrote:

On Thu, Nov 6, 2025 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/11/2025 21:02, Matheus Alcantara wrote:

On Wed Nov 5, 2025 at 9:59 AM -03, Heikki Linnakangas wrote:

In case of an error on TransactionIdDidCommit() I think that we will
have the same behavior as we advance the "current" position before of
DidCommit call the PG_FINALLY block will set the backend position past
the failing notification entry.

With my patch, if TransactionIdDidCommit() fails, we will lose all the
notifications that we have buffered in the local buffer but haven't
passed to NotifyMyFrontEnd() yet. On 'master', you only lose a single
notification, the one where TransactionIdDidCommit() failed.

Yeah, that's true.

How bad would be to store the notification entries within a list and
store the next position with the notification entry and then wrap the
NotifyMyFrontEnd() call within a PG_TRY and update the "current" to the
saved "next position" on PG_CATCH? Something like this:

[ ...]

That addresses the failure on NotifyMyFrontEnd, but not on
TransactionIdDidCommit.

IMHO we should just make these errors FATAL. TransactionIdDidCommit()
should not really fail, after fixing the bug we're discussing.
NotifyMyFrontEnd() could fail on OOM, but that seems pretty unlikely on
an otherwise idle connection. Or it could fail if the client connection
is lost, but then the backend is about to die anyway. And arguably
closing the connection is better than losing even a single notification,
anyway.

My only concern with making these errors FATAL is that if a notification
entry causes a different, recoverable error, all subsequent messages
will be lost. This is because if backend die and the user open a new
connection and execute LISTEN on the channel it will not see these
notifications past the one that caused the error. I'm not sure if we are
completely safe from this case of a recoverable error, what do you
think?

I did some more testing of the current behavior, using the encoding
conversion to cause an error:

In backend A:

SET client_encoding='latin1';
LISTEN foo;

Backend b:

NOTIFY foo, 'ハ';

Observations:

- If the connection is idle when the notification is received, the ERROR
is turned into FATAL anyway:

postgres=# SET client_encoding='latin1';
SET
postgres=# LISTEN foo;
LISTEN
postgres=# select 1; -- do the NOTIFY in another connection before this
ERROR: character with byte sequence 0xe3 0x83 0x8f in encoding "UTF8"
has no equivalent in encoding "LATIN1"
FATAL: terminating connection because protocol synchronization was lost
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

- If there are multiple notifications pending, we stop processing the
subsequent notifications on the first error. The subsequent
notifications will only be processed when another notify interrupt is
received, i.e. when a backend sends yet another notification.

I'm getting more and more convinced that escalating all ERRORs to FATALs
during notify processing is the right way to go. Attached is a new patch
set that does that.

One point about removing try/catch: when we execute LISTEN for the
very first time, we read the queue from the min(listener's pos) up to
the head. listenChannels is empty at the moment, so we never call
NotifyMyFrontEnd. But we do call TransactionIdDidCommit. So if we have
some problematic entry in the queue that we need to process during
initial LISTEN, it could block creation of new listeners. Is it
something we need to worry about?

Hmm, I don't see how it could block the creation of new listeners.
Exec_ListenPreCommit() starts from "max(listener's pos)" over existing
listeners, not min(). Am I missing something?

That said, Exec_ListenPreCommit() could start from 'min' among all
listeners, if there are no listeners for the same backend. Per attached
patch. I don't think it makes much difference though.

Another small change we could make is to check for listenChannels == NIL
before calling TransactionIdDidCommit.

- Heikki

Attachments:

start-listen-from-min.patch.txttext/plain; charset=UTF-8; name=start-listen-from-min.patch.txtDownload

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..586d5609c51 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -1041,6 +1041,8 @@ static void
 Exec_ListenPreCommit(void)
 {
 	QueuePosition head;
+	QueuePosition startPos;
+	QueuePosition min;
 	QueuePosition max;
 	ProcNumber	prevListener;
 
@@ -1085,18 +1087,20 @@ Exec_ListenPreCommit(void)
 	 * and manipulate the list links.
 	 */
 	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
-	head = QUEUE_HEAD;
+	head = min = QUEUE_HEAD;
 	max = QUEUE_TAIL;
 	prevListener = INVALID_PROC_NUMBER;
 	for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
 	{
+		min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
 		if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
 			max = QUEUE_POS_MAX(max, QUEUE_BACKEND_POS(i));
 		/* Also find last listening backend before this one */
 		if (i < MyProcNumber)
 			prevListener = i;
 	}
-	QUEUE_BACKEND_POS(MyProcNumber) = max;
+	startPos = QUEUE_POS_MAX(min, max);
+	QUEUE_BACKEND_POS(MyProcNumber) = startPos;
 	QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
 	QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
 	/* Insert backend into list of listeners at correct position */
@@ -1123,7 +1127,7 @@ Exec_ListenPreCommit(void)
 	 * our transaction might have executed NOTIFY, those message(s) aren't
 	 * queued yet so we won't skip them here.
 	 */
-	if (!QUEUE_POS_EQUAL(max, head))
+	if (!QUEUE_POS_EQUAL(startPos, head))
 		asyncQueueReadAllNotifications();
 }

#95

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

2 months ago

In reply to: Heikki Linnakangas (#94)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Thu, Nov 6, 2025 at 3:06 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/11/2025 13:20, Arseniy Mukhin wrote:

On Thu, Nov 6, 2025 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/11/2025 21:02, Matheus Alcantara wrote:

On Wed Nov 5, 2025 at 9:59 AM -03, Heikki Linnakangas wrote:

In case of an error on TransactionIdDidCommit() I think that we will
have the same behavior as we advance the "current" position before of
DidCommit call the PG_FINALLY block will set the backend position past
the failing notification entry.

With my patch, if TransactionIdDidCommit() fails, we will lose all the
notifications that we have buffered in the local buffer but haven't
passed to NotifyMyFrontEnd() yet. On 'master', you only lose a single
notification, the one where TransactionIdDidCommit() failed.

Yeah, that's true.

How bad would be to store the notification entries within a list and
store the next position with the notification entry and then wrap the
NotifyMyFrontEnd() call within a PG_TRY and update the "current" to the
saved "next position" on PG_CATCH? Something like this:

[ ...]

That addresses the failure on NotifyMyFrontEnd, but not on
TransactionIdDidCommit.

IMHO we should just make these errors FATAL. TransactionIdDidCommit()
should not really fail, after fixing the bug we're discussing.
NotifyMyFrontEnd() could fail on OOM, but that seems pretty unlikely on
an otherwise idle connection. Or it could fail if the client connection
is lost, but then the backend is about to die anyway. And arguably
closing the connection is better than losing even a single notification,
anyway.

My only concern with making these errors FATAL is that if a notification
entry causes a different, recoverable error, all subsequent messages
will be lost. This is because if backend die and the user open a new
connection and execute LISTEN on the channel it will not see these
notifications past the one that caused the error. I'm not sure if we are
completely safe from this case of a recoverable error, what do you
think?

I did some more testing of the current behavior, using the encoding
conversion to cause an error:

In backend A:

SET client_encoding='latin1';
LISTEN foo;

Backend b:

NOTIFY foo, 'ハ';

Observations:

- If the connection is idle when the notification is received, the ERROR
is turned into FATAL anyway:

postgres=# SET client_encoding='latin1';
SET
postgres=# LISTEN foo;
LISTEN
postgres=# select 1; -- do the NOTIFY in another connection before this
ERROR: character with byte sequence 0xe3 0x83 0x8f in encoding "UTF8"
has no equivalent in encoding "LATIN1"
FATAL: terminating connection because protocol synchronization was lost
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

- If there are multiple notifications pending, we stop processing the
subsequent notifications on the first error. The subsequent
notifications will only be processed when another notify interrupt is
received, i.e. when a backend sends yet another notification.

I'm getting more and more convinced that escalating all ERRORs to FATALs
during notify processing is the right way to go. Attached is a new patch
set that does that.

One point about removing try/catch: when we execute LISTEN for the
very first time, we read the queue from the min(listener's pos) up to
the head. listenChannels is empty at the moment, so we never call
NotifyMyFrontEnd. But we do call TransactionIdDidCommit. So if we have
some problematic entry in the queue that we need to process during
initial LISTEN, it could block creation of new listeners. Is it
something we need to worry about?

Hmm, I don't see how it could block the creation of new listeners.
Exec_ListenPreCommit() starts from "max(listener's pos)" over existing
listeners, not min(). Am I missing something?

My bad, you are right, we use max(listener's pos). But it doesn't
resolve the issue if I'm not missing something.

Let's say we have the queue:

(tail ... pos1 ... bad_entry_pos ... head)

bad_entry_pos - position of the entry where TransactionIdDidCommit fails.

We have the listener L1 with pos = pos1. It means every new listener
should process the queue from pos1 (as it's max(listener's pos)) up to
the queue head and when they try to do it, they will fail on
'bad_entry'.

That said, Exec_ListenPreCommit() could start from 'min' among all
listeners, if there are no listeners for the same backend. Per attached
patch. I don't think it makes much difference though.

Yes, if there are no listeners, every new listener will start the
initial reading from the QUEUE_TAIL and will fail on the bad_entry
too.

But this example is valid only if we think we can fail on
TransactionIdDidCommit (even after the fix).

Another small change we could make is to check for listenChannels == NIL
before calling TransactionIdDidCommit.

There is a comment that describes why we need this initial reading in
Exec_ListenPreCommit:

/*
* This is our first LISTEN, so establish our pointer.
*
* We set our pointer to the global tail pointer and then move it forward
* over already-committed notifications. This ensures we cannot miss any
* not-yet-committed notifications. We might get a few more but that
* doesn't hurt.

It seems that with such a check we will skip not-yet-committed
notifications and always get to the HEAD. If we decide that it's ok,
maybe we can just set 'pos' of every new listener to the HEAD without
reading?

Best regards,
Arseniy Mukhin

#96

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Arseniy Mukhin (#95)

1 attachment(s)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 06/11/2025 17:13, Arseniy Mukhin wrote:

Let's say we have the queue:

(tail ... pos1 ... bad_entry_pos ... head)

bad_entry_pos - position of the entry where TransactionIdDidCommit fails.

We have the listener L1 with pos = pos1. It means every new listener
should process the queue from pos1 (as it's max(listener's pos)) up to
the queue head and when they try to do it, they will fail on
'bad_entry'.

Gotcha.

Another small change we could make is to check for listenChannels == NIL
before calling TransactionIdDidCommit.

There is a comment that describes why we need this initial reading in
Exec_ListenPreCommit:

/*
* This is our first LISTEN, so establish our pointer.
*
* We set our pointer to the global tail pointer and then move it forward
* over already-committed notifications. This ensures we cannot miss any
* not-yet-committed notifications. We might get a few more but that
* doesn't hurt.

It seems that with such a check we will skip not-yet-committed
notifications and always get to the HEAD. If we decide that it's ok,
maybe we can just set 'pos' of every new listener to the HEAD without
reading?

Right, I didn't mean skipping asyncQueueReadAllNotifications()
altogether. Just the TransactionIdDidCommit() calls in it, when
listenChannels == NIL, see attached patch. The idea is that when
'listenChannels == NIL', we're not going send the notification to the
frontend regardless of what TransactionIdDidCommit says. If we don't
call TransactionIdDidCommit, we won't fail on bad entries.

I'm not sure how much this matters. We really shouldn't have bad entries
in the queue to begin with. And if we do, a broken entry could cause all
kinds of trouble. For example, if the broken entry's (bogus) XID is
higher than current XID, XidInMVCCSnapshot() will return true and we'll
get stuck on that.

- Heikki

Attachments:

skip-TransactionIdDidCommit-on-first-listen.patch.txttext/plain; charset=UTF-8; name=skip-TransactionIdDidCommit-on-first-listen.patch.txtDownload

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..58656b53e5d 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2066,7 +2066,20 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
 				reachedStop = true;
 				break;
 			}
-			else if (TransactionIdDidCommit(qe->xid))
+
+			/*
+			 * Quick check for the case that we're not listening on any
+			 * channels, before calling TransactionIdDidCommit().  This might
+			 * be marginally faster, but perhaps more importantly, it ensures
+			 * that if there's a bad entry in the queue for some reason for
+			 * which TransactionIdDidCommit() fails, we can skip over it on
+			 * the first LISTEN in a session, and not get stuck on it
+			 * indefinitely.
+			 */
+			if (listenChannels == NIL)
+				continue;
+
+			if (TransactionIdDidCommit(qe->xid))
 			{
 				/* qe->data is the null-terminated channel name */
 				char	   *channel = qe->data;

#97

Arseniy Mukhin

arseniy.mukhin.dev@gmail.com

2 months ago

In reply to: Heikki Linnakangas (#96)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On Fri, Nov 7, 2025 at 11:10 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/11/2025 17:13, Arseniy Mukhin wrote:

Let's say we have the queue:

(tail ... pos1 ... bad_entry_pos ... head)

bad_entry_pos - position of the entry where TransactionIdDidCommit fails.

We have the listener L1 with pos = pos1. It means every new listener
should process the queue from pos1 (as it's max(listener's pos)) up to
the queue head and when they try to do it, they will fail on
'bad_entry'.

Gotcha.

Another small change we could make is to check for listenChannels == NIL
before calling TransactionIdDidCommit.

There is a comment that describes why we need this initial reading in
Exec_ListenPreCommit:

/*
* This is our first LISTEN, so establish our pointer.
*
* We set our pointer to the global tail pointer and then move it forward
* over already-committed notifications. This ensures we cannot miss any
* not-yet-committed notifications. We might get a few more but that
* doesn't hurt.

It seems that with such a check we will skip not-yet-committed
notifications and always get to the HEAD. If we decide that it's ok,
maybe we can just set 'pos' of every new listener to the HEAD without
reading?

Right, I didn't mean skipping asyncQueueReadAllNotifications()
altogether. Just the TransactionIdDidCommit() calls in it, when
listenChannels == NIL, see attached patch. The idea is that when
'listenChannels == NIL', we're not going send the notification to the
frontend regardless of what TransactionIdDidCommit says. If we don't
call TransactionIdDidCommit, we won't fail on bad entries.

Ahh, now I see, thank you. WFM.

Best regards,
Arseniy Mukhin

#98

Álvaro Herrera

alvherre@kurilemu.de

2 months ago

In reply to: Heikki Linnakangas (#89)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

One thing I noticed while testing this is that asyncQueueAddEntries()
fills the end of a page with a dummy entry, when the next notify doesn't
fit. However, this dummy entry contains a very valid TransactionId,
which the new freezing code will try to look up and freeze. I think
this is somewhat bogus -- we shouldn't even try to look up that XID in
the first place. I propose to clear it like this

@@ -1419,6 +1424,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
*/
qe.length = QUEUE_PAGESIZE - offset;
qe.dboid = InvalidOid;
+ qe.xid = InvalidTransactionId;
qe.data[0] = '\0'; /* empty channel */
qe.data[1] = '\0'; /* empty payload */
}

(Line numbers do not match, because I have other local changes.)

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"I'm always right, but sometimes I'm more right than other times."
(Linus Torvalds)
https://lore.kernel.org/git/Pine.LNX.4.58.0504150753440.7211@ppc970.osdl.org/

#99

Heikki Linnakangas

hlinnaka@iki.fi

2 months ago

In reply to: Álvaro Herrera (#98)

Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

On 07/11/2025 16:14, Álvaro Herrera wrote:

One thing I noticed while testing this is that asyncQueueAddEntries()
fills the end of a page with a dummy entry, when the next notify doesn't
fit. However, this dummy entry contains a very valid TransactionId,
which the new freezing code will try to look up and freeze. I think
this is somewhat bogus -- we shouldn't even try to look up that XID in
the first place. I propose to clear it like this

@@ -1419,6 +1424,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
*/
qe.length = QUEUE_PAGESIZE - offset;
qe.dboid = InvalidOid;
+ qe.xid = InvalidTransactionId;
qe.data[0] = '\0'; /* empty channel */
qe.data[1] = '\0'; /* empty payload */
}

(Line numbers do not match, because I have other local changes.)

Committed. I committed the above separately, because I forgot to include
it in the main commit. Oops.

Just to summarize what was committed, out of all the different variants
discussed:

* Any ERROR while processing an async notification is now turned into FATAL
* Vacuum scans the async notification queue and freezes xids before
truncating CLOG
* The TransactionIdDidCommit() calls are now made while holding the SLRU
lock. NotifyMyFrontEnd() calls are still made after releasing the lock
* listenChannels == NIL special case is checked before
TransactionIdDidCommit(). This avoids the problem that no backend can
LISTEN to anything anymore, if there's one broken entry in the queue for
some reason
* 'xid' field on dummy entries is now set to InvalidTransactionId so
that they don't need to be frozen

Thanks everyone!

- Heikki