Optimize LISTEN/NOTIFY

Started by Joel Jacobson, 8 months ago. 125 messages.
#1 Joel Jacobson
joel@compiler.org

Hi hackers,

The current LISTEN/NOTIFY implementation is well-suited for use-cases like
cache invalidation where many backends listen on the same channel. However,
its scalability is limited when many backends listen on distinct
channels. The root of the problem is that Async_Notify must signal every
listening backend in the database, as it lacks central knowledge of which
backend is interested in which channel. This results in an O(N) number of
kill(pid, SIGUSR1) syscalls as the listener count grows.

The attached proof-of-concept patch proposes a straightforward
optimization for the single-listener case. It introduces a shared-memory
hash table mapping (dboid, channelname) to the ProcNumber of a single
listener. When NOTIFY is issued, we first check this table. If a single
listener is found, we signal only that backend. Otherwise, we fall back to
the existing broadcast behavior.

The performance impact for this pattern is significant. A benchmark [1]
measuring a NOTIFY "ping-pong" between two connections, while adding a
variable number of idle listeners, shows the following:

master (8893c3a):
0 extra listeners: 9126 TPS
10 extra listeners: 6233 TPS
100 extra listeners: 2020 TPS
1000 extra listeners: 238 TPS

0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener.patch:
0 extra listeners: 9152 TPS
10 extra listeners: 9352 TPS
100 extra listeners: 9320 TPS
1000 extra listeners: 8937 TPS

As you can see, the patched version's performance is near O(1) with respect
to the number of idle listeners, while the current implementation shows the
expected O(N) degradation.

This patch is a first step. It uses a simple boolean has_multiple_listeners
flag in the hash entry. Once a channel gets a second listener, this flag is
set and, crucially, never cleared. The entry will then permanently indicate
"multiple listeners", even after all backends on that channel disconnect.

A more complete solution would likely use reference counting for each
channel's listeners. This would solve the "stuck entry" problem and could
also enable a further optimization: targeted signaling to all listeners of a
multi-user channel, avoiding the database-wide broadcast entirely.

The patch also includes a "wake only tail" optimization (contributed by
Marko Tikkaja) to help prevent backends from falling too far behind.
Instead of waking all lagging backends at once and creating a "thundering
herd", this logic signals only the single backend that is currently at the
queue tail. This ensures the global queue tail can always advance, relying
on a chain reaction to get backends caught up efficiently. This seems like
a sensible improvement in its own right.

Thoughts?

/Joel

[1]: Benchmark tool and full results: https://github.com/joelonsql/pg-bench-listen-notify

Attachments:

0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener.patch (application/octet-stream, +537 -36)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#1)
Re: Optimize LISTEN/NOTIFY

"Joel Jacobson" <joel@compiler.org> writes:

> The attached proof-of-concept patch proposes a straightforward
> optimization for the single-listener case. It introduces a shared-memory
> hash table mapping (dboid, channelname) to the ProcNumber of a single
> listener.

What does that do to the cost and parallelizability of LISTEN/UNLISTEN?

> The patch also includes a "wake only tail" optimization (contributed by
> Marko Tikkaja) to help prevent backends from falling too far behind.

Coulda sworn we dealt with that case some years ago. In any case,
if it's independent of the other idea it should probably get its
own thread.

regards, tom lane

#3 Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#2)
Re: Optimize LISTEN/NOTIFY

On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:

> "Joel Jacobson" <joel@compiler.org> writes:
>
>> The attached proof-of-concept patch proposes a straightforward
>> optimization for the single-listener case. It introduces a shared-memory
>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>> listener.
>
> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?

Good point. The previous patch would effectively force all LISTEN/UNLISTEN
to be serialized, which would at least hurt parallelizability.

New benchmarks confirm this hypothesis.

New patch attached that combines two complementary approaches that together
seem to scale well for both common-channel and unique-channel scenarios:

1. Partitioned Hash Locking

The Channel Hash now uses HASH_PARTITION, with an array of NUM_NOTIFY_PARTITIONS
lightweight locks. A given channel is mapped to a partition lock using
a custom hash function on (dboid, channelname).

This allows LISTEN/UNLISTEN operations on different channels to proceed
concurrently without fighting over a single global lock, addressing the
"many distinct channels" use-case.

2. Optimistic Read-Locking

For the "many backends on one channel" use-case, lock acquisition now follows
a read-then-upgrade pattern. We first acquire a LW_SHARED lock, to check the
channel's state. If the channel is already marked as has_multiple_listeners,
we can return immediately without any need for a write.

Only if we are the first or second listener on a channel do we release
the shared lock and acquire an LW_EXCLUSIVE lock to modify the hash entry.
After getting the exclusive lock, we re-verify the state to guard against
race conditions. This avoids serializing the third and all subsequent
listeners for a popular channel.

BENCHMARK

https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_connections_equal_jobs.png

https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_fixed_connections.png

I didn't want to attach the images to this email because they are quite large,
due to all the detail in the images.

However, since it's important that this mailing list contains all relevant data
discussed, I've also included all data from the graphs, formatted as ASCII/Markdown:

performance_overview.md

I've also included the raw parsed data from the pgbench output,
which has been used as input to create performance_overview.md
as well as the images:

pgbench_results_combined.csv

I've benchmarked five times per measurement, in random order.
All raw measurements have been included in the Markdown document
within { curly braces } sorted, next to the average values, to get an idea
of the variance. Stddev felt potentially misleading since, this being
benchmarking data, I'm not sure the data points are normally distributed.

I've run the benchmarks on my MacBook Pro Apple M3 Max,
using `caffeinate -dims pgbench ...`.

>> The patch also includes a "wake only tail" optimization (contributed by
>> Marko Tikkaja) to help prevent backends from falling too far behind.
>
> Coulda sworn we dealt with that case some years ago. In any case,
> if it's independent of the other idea it should probably get its
> own thread.

Maybe it's been dealt with by some other part of the system, but I can't
find any such code anywhere, it's only async.c that currently sends
PROCSIG_NOTIFY_INTERRUPT.

The wake only tail mechanism seems almost perfect, but I can think of at least
one edge case that could still cause problems:

With lots of idle backends, the rate of this one-by-one catch-up may not be fast
enough to outpace the queue's advancement, causing other idle backends
to eventually lag by more than the QUEUE_CLEANUP_DELAY threshold.

To ensure all backends are eventually processed without re-introducing
the thundering herd problem, an additional mechanism seems necessary:

I see two main options:

1. Extend the chain reaction
Once woken, a backend could signal the next backend at the queue tail,
propagating the catch-up process. This would need to be managed carefully,
perhaps with some kind of global advisory lock, to prevent multiple
cascades from running at once.

2. Centralize the work
We already have the autovacuum daemon, maybe it could also be made responsible
for kicking lagging backends?

Other ideas?

/Joel

Attached:

* pgbench-scripts.tar.gz
pgbench scripts to reproduce the results, report and images.

* performance_overview.md
Same results as in the images, but in ASCII/Markdown format.

* pgbench_results_combined.csv
Parsed output from pgbench runs, used to create performance_overview.md as well as the linked images.

* 0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener-v2.patch
Old patch just renamed to -v2

* 0002-Partition-channel-hash-to-improve-LISTEN-UNLISTEN-v2.patch
New patch with the approach explained above.

Attachments:

pgbench_results_combined.csv (text/csv)
pgbench-scripts.tar.gz (application/x-gzip)
performance_overview.md (application/octet-stream)
0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener-v2.patch (application/octet-stream, +537 -36)
0002-Partition-channel-hash-to-improve-LISTEN-UNLISTEN-v2.patch (application/octet-stream, +241 -158)
#4 Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#3)
Re: Optimize LISTEN/NOTIFY

On Tue, Jul 15, 2025, at 09:20, Joel Jacobson wrote:

> On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
>
>> "Joel Jacobson" <joel@compiler.org> writes:
>>
>>> The attached proof-of-concept patch proposes a straightforward
>>> optimization for the single-listener case. It introduces a shared-memory
>>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>>> listener.
>>
>> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
>
> Good point. The previous patch would effectively force all LISTEN/UNLISTEN
> to be serialized, which would at least hurt parallelizability.
>
> New benchmarks confirm this hypothesis.
>
> New patch attached that combines two complementary approaches that together
> seem to scale well for both common-channel and unique-channel scenarios:

Thanks to the FreeBSD animal failing, I see I made a shared memory blunder.
New squashed patch attached.

/Joel

Attachments:

0001-Subject-Optimize-LISTEN-NOTIFY-signaling-for-scalabi-v3.patch (application/octet-stream, +641 -34)
#5 Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#4)
Re: Optimize LISTEN/NOTIFY

On Tue, Jul 15, 2025, at 22:56, Joel Jacobson wrote:

> On Tue, Jul 15, 2025, at 09:20, Joel Jacobson wrote:
>
>> On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
>>
>>> "Joel Jacobson" <joel@compiler.org> writes:
>>>
>>>> The attached proof-of-concept patch proposes a straightforward
>>>> optimization for the single-listener case. It introduces a shared-memory
>>>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>>>> listener.
>>>
>>> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
>>
>> Good point. The previous patch would effectively force all LISTEN/UNLISTEN
>> to be serialized, which would at least hurt parallelizability.
>>
>> New benchmarks confirm this hypothesis.
>>
>> New patch attached that combines two complementary approaches that together
>> seem to scale well for both common-channel and unique-channel scenarios:
>
> Thanks to the FreeBSD animal failing, I see I made a shared memory blunder.
> New squashed patch attached.
>
> /Joel
>
> Attachments:
> * 0001-Subject-Optimize-LISTEN-NOTIFY-signaling-for-scalabi-v3.patch

(cfbot is not picking up my patch; I wonder if some filename length limit is
exceeded. Trying a shorter filename; apologies for spamming.)

/Joel

Attachments:

0001-optimize_listen_notify-v3.patch (application/octet-stream, +641 -34)
#6 Rishu Bagga
rishu.postgres@gmail.com
In reply to: Joel Jacobson (#5)
Re: Optimize LISTEN/NOTIFY

Hi Joel,

Thanks for sharing the patch.
I have a few questions based on a cursory first look.

> If a single listener is found, we signal only that backend.
> Otherwise, we fall back to the existing broadcast behavior.

The idea of not wanting to wake up all backends makes sense to me,
but I don’t understand why we want this optimization only for the case
where there is a single backend listening on a channel.

Is there a pattern of usage in LISTEN/NOTIFY where users typically
have either just one or several backends listening on a channel?

If we are doing this optimization, why not maintain a list of backends
for each channel, and only wake up those channels?

Thanks,
Rishu

#7 Joel Jacobson
joel@compiler.org
In reply to: Rishu Bagga (#6)
Re: Optimize LISTEN/NOTIFY

On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:

> Hi Joel,
>
> Thanks for sharing the patch.
> I have a few questions based on a cursory first look.
>
>> If a single listener is found, we signal only that backend.
>> Otherwise, we fall back to the existing broadcast behavior.
>
> The idea of not wanting to wake up all backends makes sense to me,
> but I don’t understand why we want this optimization only for the case
> where there is a single backend listening on a channel.
>
> Is there a pattern of usage in LISTEN/NOTIFY where users typically
> have either just one or several backends listening on a channel?
>
> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those channels?

Thanks for the thoughtful question. You've hit on the central design trade-off
in this optimization: how to provide targeted signaling for some workloads
without degrading performance for others.

While we don't have telemetry on real-world usage patterns of LISTEN/NOTIFY,
it seems likely that most applications fall into one of three categories,
which I've been thinking of in networking terms:

1. Broadcast-style ("hub mode")

Many backends listening on the *same* channel (e.g., for cache invalidation).
The current implementation is already well-optimized for this, behaving like
an Ethernet hub that broadcasts to all ports. Waking all listeners is efficient
because they all need the message.

2. Targeted notifications ("switch mode")

Each backend listens on its own private channel (e.g., for session events or
worker queues). This is where the current implementation scales poorly, as every
NOTIFY wakes up all listeners regardless of relevance. My patch is designed
to make this behave like an efficient Ethernet switch.

3. Selective multicast-style ("group mode")

A subset of backends shares a channel, but not all. This is the tricky middle
ground. Your question, "why not maintain a list of backends for each channel,
and only wake up those channels?" is exactly the right one to ask.
A full listener list seems like the obvious path to optimizing for *all* cases.
However, the devil is in the details of concurrency and performance. Managing
such a list would require heavier locking, which would create a new bottleneck
and degrade the scalability of LISTEN/UNLISTEN operations—especially for
the "hub mode" case where many backends rapidly subscribe to the same popular
channel.

This patch makes a deliberate architectural choice:
Prioritize a massive, low-risk win for "switch mode" while rigorously protecting
the performance of "hub mode".

It introduces a targeted fast path for single-listener channels and cleanly
falls back to the existing, well-performing broadcast model for everything else.

This brings us back to "group mode", which remains an open optimization problem.
A possible approach could be to track listeners up to a small threshold *K*
(e.g., store up to four ProcNumbers in the hash entry). If the count exceeds *K*,
we would flip a "broadcast" flag and revert to hub-mode behavior.

However, this path has a critical drawback:

1. Performance Penalty for Hub Mode

With the current patch, after the second listener joins a channel,
the has_multiple_listeners flag is set. Every subsequent listener can acquire
a shared lock, see the flag is true, and immediately continue. This is
a highly concurrent, read-only operation that does not require mutating shared
state.

In contrast, the K-listener approach would force every new listener (from the
third up to the K-th) to acquire an exclusive lock to mutate the shared
listener array**. This would serialize LISTEN operations on popular channels,
creating the very contention point this patch successfully avoids and directly
harming the hub-mode use case that currently works well.

2. Uncertainty

Compounding this, without clear data on typical "group" sizes, choosing a value
for *K* is a shot in the dark. A small *K* might not help much, while
a large *K* would increase the shared memory footprint and worsen the
serialization penalty.

For these reasons, attempting to build a switch that also optimizes for
multicast risks undermining the architectural clarity and performance of
both the switch and hub models.

This patch, therefore, draws a clean line. It provides a precise,
low-cost path for switch-mode workloads and preserves the existing,
well-performing path for hub-mode workloads. While this leaves "group mode"
unoptimized for now, it ensures we make two common use cases better without
making any use case worse. The new infrastructure is flexible, leaving
the door open should a better approach for "group mode" emerge in
the future—one that doesn't compromise the other two.

Benchmarks updated showing master vs 0001-optimize_listen_notify-v3.patch:
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equal_jobs.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connections.png

I've not included the benchmark CSV data in this mail, since it's quite heavy,
160kB, and I couldn't see any significant performance changes since v2.

/Joel

#8 Joel Jacobson
joel@compiler.org
In reply to: Rishu Bagga (#6)
Re: Optimize LISTEN/NOTIFY

On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:

> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those channels?

Thanks for contributing a great idea; it actually turned out to work
really well in practice!

The attached new v4 of the patch implements your multicast idea:

---

Improve NOTIFY scalability with multicast signaling

Previously, NOTIFY would signal all listening backends in a database for
any channel with more than one listener. This broadcast approach scales
poorly for workloads that rely on targeted notifications to small groups
of backends, as every NOTIFY could wake up many unrelated processes.

This commit introduces a multicast signaling optimization to improve
scalability for such use-cases. A new GUC, `notify_multicast_threshold`,
is added to control the maximum number of listeners to track per
channel. When a NOTIFY is issued, if the number of listeners is at or
below this threshold, only those specific backends are signaled. If the
limit is exceeded, the system falls back to the original broadcast
behavior.

The default for this threshold is set to 16. Benchmarks show this
provides a good balance, with significant performance gains for small to
medium-sized listener groups and diminishing returns for higher values.
Setting the threshold to 0 disables multicast signaling, forcing a
fallback to the broadcast path for all notifications.

To implement this, a new partitioned hash table is introduced in shared
memory to track listeners. Locking is managed with an optimistic
read-then-upgrade pattern. This allows concurrent LISTEN/UNLISTEN
operations on *different* channels to proceed in parallel, as they will
only acquire locks on their respective partitions.

For correctness and to prevent deadlocks, a strict lock ordering
hierarchy (NotifyQueueLock before any partition lock) is observed. The
signaling path in NOTIFY must acquire the global NotifyQueueLock first
before consulting the partitioned hash table, which serializes
concurrent NOTIFYs. The primary concurrency win is for LISTEN/UNLISTEN
operations, which are now much more scalable.

The "wake only tail" optimization, which signals backends that are far
behind in the queue, is also included to ensure the global queue tail
can always advance.

Thanks to Rishu Bagga for the multicast idea.

---

BENCHMARK

To find the optimal default notify_multicast_threshold value,
I created a new benchmark tool that spawns one "ping" worker that sends
notifications to a channel, and multiple "pong" workers that listen on channels
and all immediately reply back to the "ping" worker, and when all replies
have been received, the cycle repeats.

By measuring how many complete round-trips can be performed per second,
it evaluates the impact of different multicast threshold settings.

The results below show the effect of setting notify_multicast_threshold
just below, or exactly at, the number N of backends per channel, to compare
broadcast vs multicast for different sizes of multicast groups (where 1
corresponds to the old targeted mode that earlier patches specifically
optimized for).

K = notify_multicast_threshold

With 2 backends per channel (32 channels total):
patch-v4 (K=1): 8,477 TPS
patch-v4 (K=2): 27,748 TPS (3.3x improvement)

With 4 backends per channel (16 channels total):
patch-v4 (K=1): 7,367 TPS
patch-v4 (K=4): 18,777 TPS (2.6x improvement)

With 8 backends per channel (8 channels total):
patch-v4 (K=1): 5,892 TPS
patch-v4 (K=8): 8,620 TPS (1.5x improvement)

With 16 backends per channel (4 channels total):
patch-v4 (K=1): 4,202 TPS
patch-v4 (K=16): 4,750 TPS (1.1x improvement)

I also reran the old ping-pong as well as the pgbench benchmarks,
and I couldn't detect any negative impact, testing with
notify_multicast_threshold {1, 8, 16}.

Ping-pong benchmark:

Extra Connections: 0
--------------------------------------------------------------------------------
Version Max TPS vs Master All Values (sorted)
-------------------------------------------------------------------------------------
master 9119 baseline {9088, 9095, 9119}
patch-v4 (t=1) 9116 -0.0% {9082, 9090, 9116}
patch-v4 (t=8) 9106 -0.2% {9086, 9102, 9106}
patch-v4 (t=16) 9134 +0.2% {9082, 9116, 9134}

Extra Connections: 10
--------------------------------------------------------------------------------
Version Max TPS vs Master All Values (sorted)
-------------------------------------------------------------------------------------
master 6237 baseline {6224, 6227, 6237}
patch-v4 (t=1) 9358 +50.0% {9302, 9345, 9358}
patch-v4 (t=8) 9348 +49.9% {9266, 9312, 9348}
patch-v4 (t=16) 9408 +50.8% {9339, 9407, 9408}

Extra Connections: 100
--------------------------------------------------------------------------------
Version Max TPS vs Master All Values (sorted)
-------------------------------------------------------------------------------------
master 2028 baseline {2026, 2027, 2028}
patch-v4 (t=1) 9278 +357.3% {9222, 9235, 9278}
patch-v4 (t=8) 9227 +354.8% {9184, 9207, 9227}
patch-v4 (t=16) 9250 +355.9% {9180, 9243, 9250}

Extra Connections: 1000
--------------------------------------------------------------------------------
Version Max TPS vs Master All Values (sorted)
-------------------------------------------------------------------------------------
master 239 baseline {239, 239, 239}
patch-v4 (t=1) 8841 +3594.1% {8819, 8840, 8841}
patch-v4 (t=8) 8835 +3591.7% {8802, 8826, 8835}
patch-v4 (t=16) 8855 +3599.8% {8787, 8843, 8855}

Among my pgbench benchmarks, results seem unaffected in these cases:
listen_unique.sql
listen_common.sql
listen_unlisten_unique.sql
listen_unlisten_common.sql

The listen_notify_unique.sql benchmark shows similar improvements for all
notify_multicast_threshold values tested. This is expected: the benchmark
uses unique channels, so a higher notify_multicast_threshold shouldn't
affect the results, and indeed it didn't:

# TEST `listen_notify_unique.sql`

```sql
LISTEN channel_:client_id;
NOTIFY channel_:client_id;
```

## 1 Connection, 1 Job

- **master**: 63696 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 63377 TPS (-0.5%)
- **optimize_listen_notify_v4 (t=8.0)**: 62890 TPS (-1.3%)
- **optimize_listen_notify_v4 (t=16.0)**: 63114 TPS (-0.9%)

## 2 Connections, 2 Jobs

- **master**: 90967 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 109423 TPS (+20.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 109107 TPS (+19.9%)
- **optimize_listen_notify_v4 (t=16.0)**: 109608 TPS (+20.5%)

## 4 Connections, 4 Jobs

- **master**: 114333 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 140986 TPS (+23.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 141263 TPS (+23.6%)
- **optimize_listen_notify_v4 (t=16.0)**: 141327 TPS (+23.6%)

## 8 Connections, 8 Jobs

- **master**: 64429 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 93787 TPS (+45.6%)
- **optimize_listen_notify_v4 (t=8.0)**: 93828 TPS (+45.6%)
- **optimize_listen_notify_v4 (t=16.0)**: 93875 TPS (+45.7%)

## 16 Connections, 16 Jobs

- **master**: 41704 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 84791 TPS (+103.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 88330 TPS (+111.8%)
- **optimize_listen_notify_v4 (t=16.0)**: 84827 TPS (+103.4%)

## 32 Connections, 32 Jobs

- **master**: 25988 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 83197 TPS (+220.1%)
- **optimize_listen_notify_v4 (t=8.0)**: 83453 TPS (+221.1%)
- **optimize_listen_notify_v4 (t=16.0)**: 83576 TPS (+221.6%)

## 1000 Connections, 1 Job

- **master**: 105 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3097 TPS (+2852.1%)
- **optimize_listen_notify_v4 (t=8.0)**: 3079 TPS (+2835.1%)
- **optimize_listen_notify_v4 (t=16.0)**: 3080 TPS (+2835.9%)

## 1000 Connections, 2 Jobs

- **master**: 108 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2981 TPS (+2671.7%)
- **optimize_listen_notify_v4 (t=8.0)**: 3091 TPS (+2774.4%)
- **optimize_listen_notify_v4 (t=16.0)**: 3097 TPS (+2779.6%)

## 1000 Connections, 4 Jobs

- **master**: 105 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2947 TPS (+2705.5%)
- **optimize_listen_notify_v4 (t=8.0)**: 2994 TPS (+2751.0%)
- **optimize_listen_notify_v4 (t=16.0)**: 2992 TPS (+2748.7%)

## 1000 Connections, 8 Jobs

- **master**: 107 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3064 TPS (+2777.0%)
- **optimize_listen_notify_v4 (t=8.0)**: 2981 TPS (+2698.5%)
- **optimize_listen_notify_v4 (t=16.0)**: 2979 TPS (+2696.8%)

## 1000 Connections, 16 Jobs

- **master**: 101 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3068 TPS (+2923.2%)
- **optimize_listen_notify_v4 (t=8.0)**: 2950 TPS (+2806.4%)
- **optimize_listen_notify_v4 (t=16.0)**: 2940 TPS (+2796.8%)

## 1000 Connections, 32 Jobs

- **master**: 102 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2980 TPS (+2815.0%)
- **optimize_listen_notify_v4 (t=8.0)**: 3034 TPS (+2867.9%)
- **optimize_listen_notify_v4 (t=16.0)**: 2962 TPS (+2798.0%)

Here are some plots that includes the above results:

https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot-v4.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equal_jobs-v4.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connections-v4.png

/Joel

Attachments:

0001-optimize_listen_notify-v4.patch (application/octet-stream, +808 -34)
#9 Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#8)
Re: Optimize LISTEN/NOTIFY

On Thu, Jul 17, 2025, at 09:43, Joel Jacobson wrote:

> On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
>
>> If we are doing this optimization, why not maintain a list of backends
>> for each channel, and only wake up those channels?
>
> Thanks for contributing a great idea; it actually turned out to work
> really well in practice!
>
> The attached new v4 of the patch implements your multicast idea:

Hi hackers,

While my previous attempts at $subject have focused only on optimizing
the multi-channel scenario, I thought it would be really nice if
LISTEN/NOTIFY could be optimized in the general case, benefiting all
users, including those who just listen on a single channel.

To my surprise, this was not only possible, but actually quite simple.

The main idea in this patch is to introduce an atomic state machine
with three states, IDLE, SIGNALLED, and PROCESSING, so that we don't
interrupt backends that are already in the process of catching up.

Thanks to Thomas Munro for making me aware of his, Heikki Linnakangas's,
and others' work in the "Interrupts vs signals" [1] thread.

Maybe my patch is redundant given their patch set; I'm not really sure.

Their patch seems to refactor the underlying wakeup mechanism. It
replaces the old, complex chain of events (SIGUSR1 signal -> handler ->
flag -> latch) with a single, direct function call: SendInterrupt(). For
async.c, this seems to be a low-level plumbing change that simplifies
how a notification wakeup is delivered.

My patch optimizes the high-level notification protocol. It introduces a
state machine (IDLE, SIGNALLED, PROCESSING) to only signal backends when
needed.

In their patch, in async.c's SignalBackends(), they do
SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
seem to check if the backend is already signalled or not, but maybe
SendInterrupt() has signal coalescing built-in so it would be a noop
with almost no cost?

I'm happy to rebase my LISTEN/NOTIFY work on top of [1], but I could
also see benefits of doing the opposite.

I'm also happy to help with benchmarking of your work in [1].

Note that this patch doesn't contain the hash table to keep track of
listeners per backend, as proposed in earlier patches. I will propose
such a patch again later, but first we need to figure out if I should
rebase onto [1] or master (HEAD).

--- PATCH ---

Optimize NOTIFY signaling to avoid redundant backend signals

Previously, a NOTIFY would send SIGUSR1 to all listening backends, which
could lead to a "thundering herd" of redundant signals under high
traffic. To address this inefficiency, this patch replaces the simple
volatile notifyInterruptPending flag with a per-backend atomic state
machine, stored in asyncQueueControl->backend[i].state. This state
variable can be in one of three states: IDLE (awaiting signal),
SIGNALLED (signal received, work pending), or PROCESSING (actively
reading the queue).

From the notifier's perspective, SignalBackends now uses an atomic
compare-and-swap (CAS) to transition a listener from IDLE to SIGNALLED.
Only on a successful transition is a signal sent. If the listener is
already SIGNALLED or another notifier wins the race, no redundant signal
is sent. If the listener is in the PROCESSING state, the notifier will
also transition it to SIGNALLED to ensure the listener re-scans the
queue after its current work is done.

On the listener side, ProcessIncomingNotify first transitions its state
from SIGNALLED to PROCESSING. After reading notifications, it attempts
to transition from PROCESSING back to IDLE. If this CAS fails, it means
a new notification arrived during processing and a notifier has already
set the state back to SIGNALLED. The listener then simply re-latches
itself to process the new notifications, avoiding a tight loop.

The primary benefit is a significant reduction in syscall overhead and
unnecessary kernel wakeups in high-traffic scenarios. This dramatically
improves performance for workloads with many concurrent notifiers.
Benchmarks show a substantial increase in NOTIFY-only transaction
throughput, with gains exceeding 200% at higher
concurrency levels.

src/backend/commands/async.c | 209 +++++++++++++++++++++++++++++--------
src/backend/tcop/postgres.c | 4 ++--
src/include/commands/async.h | 4 +++-
3 files changed, 185 insertions(+), 32 deletions(-)

--- BENCHMARK ---

The attached benchmark script does LISTEN on one connection,
and then uses pgbench to send NOTIFY on a varying number of
connections and jobs, to cause a high procsignal load.

I've run the benchmark on my MacBook Pro M3 Max,
10 seconds per run, 3 runs.

(I reused the same benchmark script as in the other thread, "Optimize ProcSignal to avoid redundant SIGUSR1 signals")

Connections=Jobs | TPS (master) | TPS (patch) | Relative Diff (%) | StdDev (master) | StdDev (patch)
------------------+--------------+-------------+-------------------+-----------------+----------------
1 | 118833 | 151510 | 27.50% | 484 | 923
2 | 156005 | 239051 | 53.23% | 3145 | 1596
4 | 177351 | 250910 | 41.48% | 4305 | 4891
8 | 116597 | 171944 | 47.47% | 1549 | 2752
16 | 40835 | 165482 | 305.25% | 2695 | 2825
32 | 37940 | 145150 | 282.58% | 2533 | 1566
64 | 35495 | 131836 | 271.42% | 1837 | 573
128 | 40193 | 121333 | 201.88% | 2254 | 874
(8 rows)

/Joel

[1]: /messages/by-id/CA+hUKG+3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ@mail.gmail.com

Attachments:

0001-Optimize-NOTIFY-signaling-to-avoid-redundant-backend.patch
#10Thomas Munro
thomas.munro@gmail.com
In reply to: Joel Jacobson (#9)
Re: Optimize LISTEN/NOTIFY

On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <joel@compiler.org> wrote:

In their patch, in async.c's SignalBackends(), they do
SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
seem to check if the backend is already signalled or not, but maybe
SendInterrupt() has signal coalescing built-in so it would be a noop
with almost no cost?

Yeah:

+ old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, interruptMask);
+
+ /*
+ * If the process is currently blocked waiting for an interrupt to arrive,
+ * and the interrupt wasn't already pending, wake it up.
+ */
+ if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) ==
+     SLEEPING_ON_INTERRUPTS)
+     WakeupOtherProc(proc);
#11Joel Jacobson
joel@compiler.org
In reply to: Thomas Munro (#10)
Re: Optimize LISTEN/NOTIFY

On Wed, Jul 23, 2025, at 04:44, Thomas Munro wrote:

On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <joel@compiler.org> wrote:

In their patch, in async.c's SignalBackends(), they do
SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
seem to check if the backend is already signalled or not, but maybe
SendInterrupt() has signal coalescing built-in so it would be a noop
with almost no cost?

Yeah:

+ old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, interruptMask);
+
+ /*
+ * If the process is currently blocked waiting for an interrupt to arrive,
+ * and the interrupt wasn't already pending, wake it up.
+ */
+ if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) ==
+     SLEEPING_ON_INTERRUPTS)
+     WakeupOtherProc(proc);

Thanks for confirming the coalescing logic in SendInterrupt. That's a
great low-level optimization. It's clear we're both targeting the same
problem of redundant wake-ups under contention, but approaching it from
different architectural levels.

The core difference, as I see it, is *where* the state management
resides. The "Interrupts vs signals" patch set creates a unified
machinery where the 'pending' state for all subsystems is combined into
a single atomic bitmask. This is a valid approach.

However, I've been exploring an alternative pattern that decouples the
state management from the signaling machinery, allowing each subsystem
to manage its own state independently. I believe this leads to a
simpler, more modular migration path. I've developed a two-patch series
for `async.c` to demonstrate this concept.

1. The first patch introduces a lock-free, atomic finite state machine
(FSM) entirely within async.c. By using a subsystem-specific atomic
integer and CAS operations, async.c can now robustly manage its own
listener states (IDLE, SIGNALLED, PROCESSING). This solves the
redundant signal problem at the source, as notifiers can now observe
a listener's state and refrain from sending a wakeup if one is
already pending.

2. The second patch demonstrates that once state is managed locally, the
wakeup mechanism becomes trivial. The expensive `SendProcSignal`
call is replaced with a direct `SetLatch`. This leverages the
existing, highly-optimized `WaitEventSet` infrastructure as a simple,
efficient "poke."

This suggests a powerful, incremental migration pattern: first, fix a
subsystem's state management internally; second, replace its wakeup
mechanism. This vertical, module-by-module approach seems complementary
to the horizontal, layer-by-layer refactoring in the "Interrupts vs
signals" thread.

I'll post a more detailed follow-up in that thread to discuss the
broader architectural implications. Attached are the two patches,
reframed to better illustrate this two-step pattern.

/Joel

Attachments:

pgbench-script.txt
pgbench-results.txt
0001-Optimize-LISTEN-NOTIFY-signaling-with-a-lock-free-at.patch
0002-Optimize-LISTEN-NOTIFY-wakeup-by-replacing-signal-wi.patch
#12Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#11)
Re: Optimize LISTEN/NOTIFY

On Thu, Jul 24, 2025, at 23:03, Joel Jacobson wrote:

* 0001-Optimize-LISTEN-NOTIFY-signaling-with-a-lock-free-at.patch
* 0002-Optimize-LISTEN-NOTIFY-wakeup-by-replacing-signal-wi.patch

I'm withdrawing the latest patches, since they won't fix the scalability
problems, but only provide some performance improvements by eliminating
redundant IPC signalling. This could also be improved outside of
async.c, by optimizing ProcSignal [1] or by removing ProcSignal, as the
"Interrupts vs Signals" thread [2] is working on.

There seem to be two different scalability problems, which appear to be
orthogonal:

First, there is the thundering-herd problem that I initially tried to
solve in this thread, by introducing a shared-memory hash table that
keeps track of which backends listen on which channels, so that every
notification need not immediately wake up all listening backends.

Second, it's the heavyweight lock in PreCommit_Notify(), which prevents
parallelism of NOTIFY. Tom Lane has an idea [3] on how to improve this.

My perf+pgbench experiments indicate that which of these two scalability
problems is the bottleneck depends on the workload.

I think the idea of keeping track of channels per backend has merit,
but I want to take a step back and see what others think about the idea first.

I guess my main question is whether we should fix one problem first and
then the other, both at the same time, or only one of them?

I've attached some benchmarks using pgbench and running postgres under
perf, which I hope can provide some insights.

/Joel

[1]: /messages/by-id/a0b12a70-8200-4bd4-9e24-56796314bdce@app.fastmail.com
[2]: /messages/by-id/CA+hUKG+3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ@mail.gmail.com
[3]: /messages/by-id/1878165.1752858390@sss.pgh.pa.us

Attachments:

listen_notify_pgbench_perf.md
listen_notify_pgbench_perf.pdf
#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#12)
Re: Optimize LISTEN/NOTIFY

[ getting back to this... ]

"Joel Jacobson" <joel@compiler.org> writes:

I'm withdrawing the latest patches, since they won't fix the scalability
problems, but only provide some performance improvements by eliminating
redundant IPC signalling. This could also be improved outside of
async.c, by optimizing ProcSignal [1] or removing ProcSignal as
"Interrupts vs Signals" [2] is working on.

There seem to be two different scalability problems, which appear to be
orthogonal:

First, there is the thundering-herd problem that I initially tried to
solve in this thread, by introducing a shared-memory hash table that
keeps track of which backends listen on which channels, so that every
notification need not immediately wake up all listening backends.

Second, it's the heavyweight lock in PreCommit_Notify(), which prevents
parallelism of NOTIFY. Tom Lane has an idea [3] on how to improve this.

I concur that these are orthogonal issues, but I don't understand
why you withdrew your patches --- don't they constitute a solution
to the first scalability bottleneck?

I guess my main question is if we think we should fix one problem first,
then the other, both at the same time, or only one or the other?

I imagine we'd eventually want to fix both, but it doesn't have to
be done in the same patch.

regards, tom lane

#14Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#13)
Re: Optimize LISTEN/NOTIFY

On Tue, Sep 23, 2025, at 18:27, Tom Lane wrote:

I concur that these are orthogonal issues, but I don't understand
why you withdrew your patches --- don't they constitute a solution
to the first scalability bottleneck?

Thanks for getting back to this thread. I was unhappy with not finding a
solution that would improve all use-cases; I had a feeling it would be
possible to find one, and I think I've now done so.

I guess my main question is if we think we should fix one problem first,
then the other, both at the same time, or only one or the other?

I imagine we'd eventually want to fix both, but it doesn't have to
be done in the same patch.

I've attached a new patch with a pragmatic approach that specifically
addresses the context-switching cost.

The patch is based upon the assumption that some extra LISTEN/NOTIFY
latency would be acceptable to most users, as a trade-off, in order to
improve throughput.

One nice thing with this approach is that it has the potential to
improve throughput both for users with just a single listening backend
and for users with many listening backends.

More details in the commit message of the patch.

Curious to hear thoughts on this approach.

/Joel

Attachments:

0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-.patch
#15Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#14)
Re: Optimize LISTEN/NOTIFY

Hi Joel,

Thanks for the patch. After reviewing it, I have a few comments.

On Sep 25, 2025, at 04:34, Joel Jacobson <joel@compiler.org> wrote:

Curious to hear thoughts on this approach.

/Joel
<0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-.patch>

1.
```
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
 	IDLE_SESSION_TIMEOUT,
 	IDLE_STATS_UPDATE_TIMEOUT,
 	CLIENT_CONNECTION_CHECK_TIMEOUT,
+	NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
 	STARTUP_PROGRESS_TIMEOUT,
```

Can we define the new one after STARTUP_PROGRESS_TIMEOUT, to try to preserve the existing enum values?

2.
```
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -766,6 +766,7 @@ autovacuum_worker_slots = 16	# autovacuum worker slots to allocate
 #lock_timeout = 0				# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0			# in milliseconds, 0 is disabled
+#notify_latency_target = 0	# in milliseconds, 0 is disabled
 #bytea_output = 'hex'			# hex, escape
```

I think we should add one more tab so that the comment aligns with the previous line's comment.

3.
```
/* GUC parameters */
bool Trace_notify = false;
+int notify_latency_target;
```

I know the compiler will automatically initialize notify_latency_target to 0, but all the other global and static variables nearby are explicitly initialized, so it would look better to assign 0 to it, which keeps the coding style consistent.

4.
```
+	/*
+	 * Throttling check: if we were last active too recently, defer. This
+	 * check is safe without a lock because it's based on a backend-local
+	 * timestamp.
+	 */
+	if (notify_latency_target > 0 &&
+		!TimestampDifferenceExceeds(last_wakeup_start_time,
+									GetCurrentTimestamp(),
+									notify_latency_target))
+	{
+		/*
+		 * Too soon. We leave wakeup_pending_flag untouched (it must be true,
+		 * or we wouldn't have been signaled) to tell senders we are
+		 * intentionally delaying. Arm a timer to re-awaken and process the
+		 * backlog later.
+		 */
+		enable_timeout_after(NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
+							 notify_latency_target);
+		return;
+	}
+
```

Should we avoid enabling a duplicate timeout? Currently, whenever a duplicate notification is avoided, a new timeout is enabled. I think we could add another variable to remember whether a timeout has already been enabled.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#16Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#15)
Re: Optimize LISTEN/NOTIFY

On Thu, Sep 25, 2025, at 10:25, Chao Li wrote:

Hi Joel,

Thanks for the patch. After reviewing it, I got a few comments.

Thanks for reviewing!

On Sep 25, 2025, at 04:34, Joel Jacobson <joel@compiler.org> wrote:

1.

...

Can we define the new one after STARTUP_PROGRESS_TIMEOUT to try to
preserve the existing enum value?

Fixed.

2.

...

I think we should add one more tab so that the comment aligns with the
previous line's comment.

Fixed.

3.

...

I know the compiler will automatically initialize notify_latency_target
to 0, but all the other global and static variables nearby are
explicitly initialized, so it would look better to assign 0 to it,
which keeps the coding style consistent.

Fixed.

4.

...

Should we avoid enabling a duplicate timeout? Currently, whenever a
duplicate notification is avoided, a new timeout is enabled. I think we
could add another variable to remember whether a timeout has already been enabled.

Hmm, I don't see how a duplicate timeout could happen?

Once we decide to defer the wakeup, wakeup_pending_flag remains set,
which avoids further signals from notifiers, so I don't see how we could
re-enter ProcessIncomingNotify(), since notifyInterruptPending is reset
when ProcessIncomingNotify() is called, and notifyInterruptPending is
only set when a signal is received (or set directly when in same
process).

New patch attached with 1-3 fixed.

/Joel

Attachments:

0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-v2.patch
#17Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#16)
Re: Optimize LISTEN/NOTIFY

On Sep 26, 2025, at 05:13, Joel Jacobson <joel@compiler.org> wrote:

Hmm, I don't see how duplicate timeout could happen?

Once we decide to defer the wakeup, wakeup_pending_flag remains set,
which avoids further signals from notifiers, so I don't see how we could
re-enter ProcessIncomingNotify(), since notifyInterruptPending is reset
when ProcessIncomingNotify() is called, and notifyInterruptPending is
only set when a signal is received (or set directly when in same
process).

I think what you explained is partially correct.

Based on my understanding, any backend process may call SignalBackends(), which means that it’s possible that multiple backend processes may call SignalBackends() concurrently.

Looking at your code, between checking QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true, there is a block of code (the "if-else") to run. It is therefore possible that multiple backend processes pass the QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, in which case multiple signals will be sent to one process, leading to duplicate timeouts being enabled in the receiver process.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#18Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#17)
Re: Optimize LISTEN/NOTIFY

On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:

I think what you explained is partially correct.

Based on my understanding, any backend process may call
SignalBackends(), which means that it’s possible that multiple backend
processes may call SignalBackends() concurrently.

Looking at your code, between checking
QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true,
there is a block of code (the "if-else") to run. It is therefore
possible that multiple backend processes pass the
QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, in which case multiple
signals will be sent to one process, leading to duplicate timeouts
being enabled in the receiver process.

I don't see how that can happen; we check wakeup_pending_flag while
holding an exclusive lock, so multiple backend processes can't be
inside the region where we check/set wakeup_pending_flag at the same
time.

/Joel

#19Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#18)
Re: Optimize LISTEN/NOTIFY

On Sep 26, 2025, at 17:32, Joel Jacobson <joel@compiler.org> wrote:

On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:

I think what you explained is partially correct.

Based on my understanding, any backend process may call
SignalBackends(), which means that it’s possible that multiple backend
processes may call SignalBackends() concurrently.

Looking at your code, between checking
QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and set the flag to true, there is
a block of code (the “if-else”) to run, so that it’s possible that
multiple backend processes have passed the
QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, then multiple signals will
be sent to a process, which will lead to duplicate timeout enabled in
the receiver process.

I don't see how that can happen; we're checking wakeup_pending_flag
while holding an exclusive lock, so I don't see how multiple backend
processes could be within the region where we check/set
wakeup_pending_flag, at the same time?

/Joel

I might have missed the fact that an exclusive lock is held. I will revisit that part.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#20Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#19)
Re: Optimize LISTEN/NOTIFY

On Fri, Sep 26, 2025, at 11:44, Chao Li wrote:

On Sep 26, 2025, at 17:32, Joel Jacobson <joel@compiler.org> wrote:

On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:

I think what you explained is partially correct.

Based on my understanding, any backend process may call
SignalBackends(), which means that it’s possible that multiple backend
processes may call SignalBackends() concurrently.

Looking at your code, between checking
QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and set the flag to true, there is
a block of code (the “if-else”) to run, so that it’s possible that
multiple backend processes have passed the
QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, then multiple signals will
be sent to a process, which will lead to duplicate timeout enabled in
the receiver process.

I don't see how that can happen; we're checking wakeup_pending_flag
while holding an exclusive lock, so I don't see how multiple backend
processes could be within the region where we check/set
wakeup_pending_flag, at the same time?

/Joel

I might have missed the fact that an exclusive lock is held. I will
revisit that part.

I've re-read this entire thread, and I actually think my original
approach is more promising, that is, the
0001-optimize_listen_notify-v4.patch, which does multicast targeted
signaling.

Therefore, please consider the latest patch merely as a PoC with some
possibly interesting ideas.

Before this patch, I had never used PostgreSQL's timeout mechanism, so
I didn't consider it when thinking about how to solve the remaining
problems with 0001-optimize_listen_notify-v4.patch, which currently
can't guarantee that all listening backends will eventually catch up,
since it just kicks one of the most lagging ones for each notification.
This could be a problem in practice if there is a long period with no
notifications coming in: some listening backends could end up never
being signaled and would stay behind, preventing the queue tail from
advancing.

I'm thinking maybe somehow we can use the timeout mechanism here, but
I'm not sure how yet. Any ideas?

/Joel

#21Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#20)
#22Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#21)
#23Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#22)
#25Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#24)
#26Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#25)
#27Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#27)
#29Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#28)
#30Matheus Alcantara
matheusssilv97@gmail.com
In reply to: Joel Jacobson (#25)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matheus Alcantara (#30)
#32Joel Jacobson
joel@compiler.org
In reply to: Matheus Alcantara (#30)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#32)
#34Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#34)
#36Matheus Alcantara
matheusssilv97@gmail.com
In reply to: Tom Lane (#31)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matheus Alcantara (#36)
#38Matheus Alcantara
matheusssilv97@gmail.com
In reply to: Tom Lane (#37)
#39Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#34)
#40Chao Li
li.evan.chao@gmail.com
In reply to: Chao Li (#39)
#41Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#35)
#42Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#39)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#41)
#44Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#42)
#45Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#44)
#46Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#45)
#47Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#43)
#48Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#47)
#49Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#48)
#50Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#49)
#51Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#50)
#52Chao Li
li.evan.chao@gmail.com
In reply to: Tom Lane (#51)
#53Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#51)
#54Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Joel Jacobson (#53)
#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Arseniy Mukhin (#54)
#56Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#52)
#57Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Tom Lane (#55)
#58Joel Jacobson
joel@compiler.org
In reply to: Arseniy Mukhin (#57)
#59Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#55)
#60Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#59)
#61Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#56)
#62Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#55)
#63Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#61)
#64Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#63)
#65Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#64)
#66Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Tom Lane (#65)
#67Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#65)
#68Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#67)
#69Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#68)
#70Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Joel Jacobson (#67)
#71Chao Li
li.evan.chao@gmail.com
In reply to: Arseniy Mukhin (#70)
#72Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Chao Li (#71)
#73Chao Li
li.evan.chao@gmail.com
In reply to: Arseniy Mukhin (#72)
#74Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#73)
#75Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#74)
#76Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#74)
#77Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#76)
#78Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#77)
#79Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#78)
#80Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#79)
#81Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#80)
#82Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#81)
#83Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#82)
#84Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#83)
#85Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#84)
#86Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Chao Li (#85)
#87Joel Jacobson
joel@compiler.org
In reply to: Arseniy Mukhin (#86)
#88Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#87)
#89Chao Li
li.evan.chao@gmail.com
In reply to: Arseniy Mukhin (#86)
#90Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Chao Li (#89)
#91Chao Li
li.evan.chao@gmail.com
In reply to: Arseniy Mukhin (#90)
#92Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#91)
#93Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#92)
#94Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#93)
#95Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#94)
#96Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#95)
#97Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Joel Jacobson (#96)
#98Joel Jacobson
joel@compiler.org
In reply to: Arseniy Mukhin (#97)
#99Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#98)
#100Joel Jacobson
joel@compiler.org
In reply to: Arseniy Mukhin (#97)
#101Arseniy Mukhin
arseniy.mukhin.dev@gmail.com
In reply to: Joel Jacobson (#100)
#102Joel Jacobson
joel@compiler.org
In reply to: Arseniy Mukhin (#101)
#103Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#102)
#104Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#103)
#105Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#104)
#106Chao Li
li.evan.chao@gmail.com
In reply to: Joel Jacobson (#104)
#107Joel Jacobson
joel@compiler.org
In reply to: Chao Li (#106)
#108Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#107)
#109Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#108)
#110Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#109)
#111Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#110)
#112Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#108)
#113Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#112)
#114Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#113)
#115Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#114)
#116Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#115)
#117Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#116)
#118Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#117)
#119Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#118)
#120Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#119)
#121Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#119)
#122Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#121)
#123Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#122)
#124Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#123)
#125Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#124)