[PATCH] Fix LISTEN startup race with direct advancement
Hi hackers,
I had another pass over the async.c rework committed in 282b1cd,
and found a race that can cause a notification committed after
the listener registered its queue position to be missed entirely.
This can happen in the small time window between PreCommit_Notify(),
where the first LISTEN registers the backend and records its queue
position, and AtCommit_Notify(), where the staged listen action is
made active in the shared channel map by setting listening = true.
If a concurrent NOTIFY commits in that window, SignalBackends() can
see the staged listener entry with listening = false and conclude that
the backend is not interested in the channel. With direct advancement,
that can move the backend's queue pointer past the notification instead
of waking it.
This is distinct from the documented LISTEN startup race in
listen.sgml. The documented race can produce false positives: after
LISTEN returns, an application may receive a notification for work
already observed by its initial database scan. That is harmless. This
race is a false negative: a notification can be missed entirely.
The fix is just to treat staged LISTEN entries as possible listeners
when deciding whom to wake:
```diff
- if (!listeners[j].listening)
- continue; /* ignore not-yet-committed listeners */
```
The attached patches split the report into tests and fix:
0001 Test missed LISTEN startup notification
0002 Test LISTEN startup notification for already-seen work
0003 Fix LISTEN startup race with direct advancement
/Joel
Attachments:
0001-Test-missed-LISTEN-startup-notification.patchapplication/octet-stream; name="=?UTF-8?Q?0001-Test-missed-LISTEN-startup-notification.patch?="Download+62-1
0003-Fix-LISTEN-startup-race-with-direct-advancement.patchapplication/octet-stream; name="=?UTF-8?Q?0003-Fix-LISTEN-startup-race-with-direct-advancement.patch?="Download+71-9
0002-Test-LISTEN-startup-notification-for-already-seen-wo.patchapplication/octet-stream; name="=?UTF-8?Q?0002-Test-LISTEN-startup-notification-for-already-seen-wo.patc?= =?UTF-8?Q?h?="Download+69-1
On Tue, May 19, 2026 at 11:39 PM Joel Jacobson <joel@compiler.org> wrote:
Hi hackers,
I had another pass over the async.c rework committed in 282b1cd,
and found a race that can cause a notification committed after
the listener registered its queue position to be missed entirely.This can happen in the small time window between PreCommit_Notify(),
where the first LISTEN registers the backend and records its queue
position, and AtCommit_Notify(), where the staged listen action is
made active in the shared channel map by setting listening = true.If a concurrent NOTIFY commits in that window, SignalBackends() can
see the staged listener entry with listening = false and conclude that
the backend is not interested in the channel. With direct advancement,
that can move the backend's queue pointer past the notification instead
of waking it.This is distinct from the documented LISTEN startup race in
listen.sgml. The documented race can produce false positives: after
LISTEN returns, an application may receive a notification for work
already observed by its initial database scan. That is harmless. This
race is a false negative: a notification can be missed entirely.The fix is just to treat staged LISTEN entries as possible listeners
when deciding whom to wake:```diff
- if (!listeners[j].listening)
- continue; /* ignore not-yet-committed listeners */
```The attached patches split the report into tests and fix:
0001 Test missed LISTEN startup notification
0002 Test LISTEN startup notification for already-seen work
0003 Fix LISTEN startup race with direct advancement/Joel
Hi,
Thank you for the fix! I tried the test and managed to reproduce it. I
think what is definitely wrong here is that we can do direct
advancement over messages that the listener is interested in, when it
has set the position already. So +1 for the fix. I think the fix is
correct and it's IIUC really harmless as the worst thing that can
happen is an unnecessary signal.
One point - looks like the 0003 contains the same test as 0001.
Also while I was trying to understand the issue, another bad schedule
came to my mind: it's basically the same scenario as you reproduced,
but now we have a message in between the listener's position and
notification that the listener is interested in. In this case the
notifier can't do direct advancement, so the message is not lost, but
the notifier still doesn't signal about a new message, so the listener
doesn't know that there is something to read until something triggers
it (like another message). PFA the reproducer (there is a schedule
diagram in the head of the file). It depends on 0001. Your patch fixes
it too.
Thank you!
Best regards,
Arseniy Mukhin
Attachments:
untriggered-notification.patch.nocfbotapplication/octet-stream; name=untriggered-notification.patch.nocfbotDownload+110-2
On Wed, May 20, 2026, at 04:01, Arseniy Mukhin wrote:
One point - looks like the 0003 contains the same test as 0001.
They are similar, but if you look carefully you'll see that they use
different injection points:
0001 uses async-notify-before-listen-commit
0002 uses async-notify-after-listen-commit
Another different is that if you only apply 0001 and run the tests,
that test will fail without the 0003 fix, whereas 0002 pass both
with and without 0003, since it just tests and demonstrates the
already documented quite harmless false positive race condition.
Also while I was trying to understand the issue
...
diagram in the head of the file). It depends on 0001. Your patch fixes
it too.
OK, I didn't look in detail at untriggered-notification.patch.nocfbot,
but good to hear 0003 fixes it too.
Thanks for reviewing.
/Joel
On Wed, May 20, 2026 at 2:20 PM Joel Jacobson <joel@compiler.org> wrote:
On Wed, May 20, 2026, at 04:01, Arseniy Mukhin wrote:
One point - looks like the 0003 contains the same test as 0001.
They are similar, but if you look carefully you'll see that they use
different injection points:0001 uses async-notify-before-listen-commit
0002 uses async-notify-after-listen-commit
Yes, but I meant 0001 and 0003. There are two specs:
- async-notify-listen-startup.spec (added in 0001)
- async-notify-listen-race.spec (added in 0003)
Looks identical, unless I missed something.
Best regards,
Arseniy Mukhin
On Wed, May 20, 2026, at 10:27, Arseniy Mukhin wrote:
Yes, but I meant 0001 and 0003. There are two specs:
- async-notify-listen-startup.spec (added in 0001)
- async-notify-listen-race.spec (added in 0003)
Looks identical, unless I missed something.
Ohh, right, thanks for spotting. New version attached, 0001 and 0002 identical,
and 0003 now only contain the fix.
/Joel
Attachments:
v2-0001-Test-missed-LISTEN-startup-notification.patchapplication/octet-stream; name="=?UTF-8?Q?v2-0001-Test-missed-LISTEN-startup-notification.patch?="Download+62-1
v2-0003-Fix-LISTEN-startup-race-with-direct-advancement.patchapplication/octet-stream; name="=?UTF-8?Q?v2-0003-Fix-LISTEN-startup-race-with-direct-advancement.patch?="Download+13-9
v2-0002-Test-LISTEN-startup-notification-for-already-seen-wo.patchapplication/octet-stream; name="=?UTF-8?Q?v2-0002-Test-LISTEN-startup-notification-for-already-seen-wo.p?= =?UTF-8?Q?atch?="Download+69-1
"Joel Jacobson" <joel@compiler.org> writes:
Ohh, right, thanks for spotting. New version attached, 0001 and 0002 identical,
and 0003 now only contain the fix.
I agree with this fix, but it seems to me that it changes the meaning
of the ListenerEntry.listening flag in a rather fundamental way.
I'm tempted to rename that flag to "committed" or something like that.
(Both "listening" and "committed" appear in dozens of places in this
file that are not references to this flag, so TBH I'd rather use a
flag name that is not either of those words. But I can't think of
a better name.)
Also, while the proposed test cases are good for showing that there's
a bug here, I'm disinclined to commit them. I do not think there is
value in them proportionate to the cost of two new isolation-test
instances.
regards, tom lane
On Tue, May 26, 2026, at 17:40, Tom Lane wrote:
I agree with this fix, but it seems to me that it changes the meaning
of the ListenerEntry.listening flag in a rather fundamental way.
I'm tempted to rename that flag to "committed" or something like that.(Both "listening" and "committed" appear in dozens of places in this
file that are not references to this flag, so TBH I'd rather use a
flag name that is not either of those words. But I can't think of
a better name.)
How about renaming listening to removeOnAbort and negating its meaning?
Also, while the proposed test cases are good for showing that there's
a bug here, I'm disinclined to commit them. I do not think there is
value in them proportionate to the cost of two new isolation-test
instances.
I agree. I should have said that feel free to remove them.
(Would be nice with a way to attach tests that are meant to be thrown away,
but still let cfbot include them in testing.)
/Joel
Attachments:
v3-0001-Fix-NOTIFY-wakeups-for-pre-commit-LISTEN-entries.patchapplication/octet-stream; name="=?UTF-8?Q?v3-0001-Fix-NOTIFY-wakeups-for-pre-commit-LISTEN-entries.patch?="Download+23-32
"Joel Jacobson" <joel@compiler.org> writes:
On Tue, May 26, 2026, at 17:40, Tom Lane wrote:
(Both "listening" and "committed" appear in dozens of places in this
file that are not references to this flag, so TBH I'd rather use a
flag name that is not either of those words. But I can't think of
a better name.)
How about renaming listening to removeOnAbort and negating its meaning?
Okay, or at least I haven't a better idea. Pushed after some more
fiddling with the comments.
regards, tom lane