Potential deadlock in pgaio_io_wait()
Hi,
I think this sequence can deadlock if there is a ->wait_one() function
needed for progress:
    P1                          P2
    acquire lock
                                stage io
    wait for io cv...
                                submit io
                                acquire lock...
It's OK for the waiter to assume that the state will advance to
SUBMITTED, but not any further, which is what the current coding
assumes. And even without a deadlock, P2 might just take a long time
to get around to completing the IO, when P1 should ideally handle it.
A naive solution would be to broadcast on ioh->cv after
pgaio_io_prepare_submit()'s state change, with a case to wait for
that. But that's expensive.
It might be slightly better to have a dedicated per-backend submit_cv,
so the owner only has to broadcast once per batch. If ->submit() does
significant per-IO work then maybe that's a bit later than it needs to
be, but it's really just one system call anyway, so I doubt it
matters. Sketch attached for illustration only.
A real improvement would be to avoid the broadcast when there are no
waiters, but I got a bit too confused about flags and memory barriers
so I'm just sharing this problem report and illustration-grade patch
for now.
I doubt it's very easy to reproduce with simple queries, but I assume
that if you had a SQL function that acquires a central LWLock and you
ran the query SELECT COUNT(*) FROM t WHERE locking_function(x) in a
tight loop in two concurrent sessions, given t big enough to require a
BufferAccessStrategy so that we keep evicting the buffers and
streaming them back in, you should be able to reach it. It might be
easier to write test functions that submit and wait for IO handles
directly across two sessions or something.
Attachments:
0001-aio-Fix-pgaio_io_wait-for-DEFINED-and-STAGED.patch (application/octet-stream, +19/-4)
On Mon, Aug 4, 2025 at 5:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I doubt it's very easy to reproduce with simple queries, but I assume
if you had a SQL function that acquires a central LWLock and you ran
concurrent queries SELECT COUNT(*) FROM t WHERE locking_function(x)
Hmm, that's a bad example as it has the wrong lock scope. Probably
would need a dedicated test to demonstrate with low level functions.
What I was trying to convey is that it's not a problem that can be hit
in practice without great effort as far as I know, but it does break
an intended pgaio architectural principle as I understand it.
Also I accidentally sent that to -bugs by fat fingering an
autocompletion. Moving to -hackers.
I discussed this off-list with Andres who provided the following review:
* +1 for the analysis
* +1 for the solution
* his benchmark machine shows no regression under heavy IO submission workload
* needs better comments
I had expected that patch to be rejected as too slow. I was thinking
that it should be enough to insert a memory barrier and then do an
unlocked check for an empty waitlist in ConditionVariableBroadcast()
to avoid a spinlock acquire/release. That caused a few weird hangs I
didn't understand, which is why I didn't post that version last time,
but he pointed out that the other side also needs a memory barrier in
ConditionVariablePrepareToSleep() and the existing spinlock
acquire/release is not enough. Still processing that, but given that
ConditionVariableBroadcast() performance is already sufficient, there
is no need for fancy tricks for this problem and my "naive" patch is
actually fine.
So here's a version that just adds some comments and corrects a minor thinko.
I also thought of a small optimisation, presented in the -B patch.
It's a bit of a shame to wait for backend->submit_cv and then also
ioh->cv in io_method=worker. It's just a bunch of IPC ping-pong for
nothing. So I figured it should be allowed to fall all the way
through based on its lack of ->wait_one. Worth bothering with?
On a superficial note:
AIO_IO_COMPLETION "Waiting for another process to complete IO."
+AIO_IO_SUBMIT "Waiting for another process to submit IO."
AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring."
AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."
We're inconsistent in our choice of noun or verb. I went with _SUBMIT
to match the following line, rather than _SUBMISSION to match the
preceding line. Shrug.
Attachments:
v2-0001-aio-Fix-pgaio_io_wait-for-staged-IOs.patch (application/octet-stream, +32/-5)
v2-0001-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch (application/octet-stream, +51/-10)
Hi,
On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
I discussed this off-list with Andres who provided the following review:
* +1 for the analysis
* +1 for the solution
* his benchmark machine shows no regression under heavy IO submission workload
* needs better comments

I had expected that patch to be rejected as too slow. I was thinking
that it should be enough to insert a memory barrier and then do an
unlocked check for an empty waitlist in ConditionVariableBroadcast()
to avoid a spinlock acquire/release. That caused a few weird hangs I
didn't understand, which is why I didn't post that version last time,
but he pointed out that the other side also needs a memory barrier in
ConditionVariablePrepareToSleep() and the existing spinlock
acquire/release is not enough.
FTR, the reason that we would need a barrier there is that we define
SpinLockRelease() as a non-atomic volatile store on x86. That means that there
is no guarantee that another backend sees an up-to-date view of the memory if
reading a memory location *without* acquiring the spinlock. The reason this is
correct for spinlocks is that x86 has strongly ordered stores, and the
*acquisition* of a spinlock will only succeed once the release of the spinlock
has become visible. But if you read something *without* the spinlock, you're
not guaranteed to see all the changes done under the spinlock.
I do wonder if we accidentally rely on SpinLockRelease() being a barrier
anywhere...
I also thought of a small optimisation, presented in the -B patch.
It's a bit of a shame to wait for backend->submit_cv and then also
ioh->cv in io_method=worker. It's just a bunch of IPC ping-pong for
nothing. So I figured it should be allowed to fall all the way
through based on its lack of ->wait_one. Worth bothering with?
I don't think I'd go there without concrete evidence that it helps - it makes
it harder to understand when to wait for what. Not terribly so, but enough
that I'd not go there without some measurable benefit.
On a superficial note:
AIO_IO_COMPLETION "Waiting for another process to complete IO."
+AIO_IO_SUBMIT "Waiting for another process to submit IO."
AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring."
AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."
We're inconsistent in our choice of noun or verb. I went with _SUBMIT
to match the following line, rather than _SUBMISSION to match the
preceding line. Shrug.
FWIW, I think so far it's a verb when the process is doing the work (e.g. the
process is calling io_uring_enter(to_submit = ..). It's a noun if we wait for
somebody *else* to do the completion, since we're not doing work ourselves.
From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 3 Aug 2025 23:07:56 +1200
Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.

Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
PGAIO_HS_STAGED fell through to waiting for completion. The owner only
promises to advance it to PGAIO_HS_SUBMITTED. The waiter needs to be
prepared to call ->wait_one() itself once the IO is submitted in order
to guarantee progress and avoid deadlocks on IO methods that provide
->wait_one().

Introduce a new per-backend condition variable submit_cv, woken by
pgaio_submit_staged(), and use it to wait for the state to advance. The
new broadcast doesn't seem to cause any measurable slowdown, so ideas
for optimizing the common no-waiters case were abandoned for now.

It may not be possible to reach any real deadlock with existing AIO
users, but that situation could change. There's also no reason the
waiter shouldn't begin to wait via the IO method as soon as possible
even without a deadlock.

Picked up by testing a proposed IO method that has ->wait_one(), like
io_method=io_uring, and code review.
LGTM.
Greetings,
Andres Freund
On 19/08/2025 03:07, Andres Freund wrote:
On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 3 Aug 2025 23:07:56 +1200
Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.

Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
PGAIO_HS_STAGED fell through to waiting for completion. The owner only
promises to advance it to PGAIO_HS_SUBMITTED. The waiter needs to be
prepared to call ->wait_one() itself once the IO is submitted in order
to guarantee progress and avoid deadlocks on IO methods that provide
->wait_one().

Introduce a new per-backend condition variable submit_cv, woken by
pgaio_submit_staged(), and use it to wait for the state to advance. The
new broadcast doesn't seem to cause any measurable slowdown, so ideas
for optimizing the common no-waiters case were abandoned for now.

It may not be possible to reach any real deadlock with existing AIO
users, but that situation could change. There's also no reason the
waiter shouldn't begin to wait via the IO method as soon as possible
even without a deadlock.

Picked up by testing a proposed IO method that has ->wait_one(), like
io_method=io_uring, and code review.

LGTM.
I just independently noticed this same issue, wrote a little test to
reproduce it, and was about to report it, when I noticed that you found
this already. Attached is the repro script.
Both of the proposed patches seem fine to me. I'm inclined to go with
the first patch (v2-0001-aio-Fix-pgaio_io_wait-for-staged-IOs.patch),
without the extra optimization, unless we can actually measure a
performance difference.
- Heikki
Attachments:
0001-Repro-backend-stuck-waiting-on-submitted-IO-with-io_.patch (text/x-patch, +68/-2)
On Tue, Sep 23, 2025 at 2:53 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I just independently noticed this same issue, wrote a little test to
reproduce it, and was about to report it, when I noticed that you found
this already. Attached is the repro script.
Nice. Thanks!
Both of the proposed patches seem fine to me. I'm inclined to go with
the first patch (v2-0001-aio-Fix-pgaio_io_wait-for-staged-IOs.patch),
without the extra optimization, unless we can actually measure a
performance difference.
I like the second one slightly more, but I'm also happy to be
outvoted. I'm working on committing the first one, but I ran into a
couple of CI failures[1] that I haven't understood yet. It could be
my patch's fault or turn out to be something already in the tree...
looking into that...

[1] https://cirrus-ci.com/task/4737677834059776
Hello Thomas,
10.10.2025 04:44, Thomas Munro wrote:
I like the second one slightly more, but I'm also happy to be
outvoted. I'm working on committing the first one, but I ran into a
couple of CI failures[1] that I haven't understood yet. It could be
my patch's fault or turn out to be something already in the tree...
looking into that...
I wonder whether it may be related to the issue we encountered on
OpenBSD:
/messages/by-id/db2773a2-aca0-43d0-99c1-060efcd9954e@gmail.com
(I've found the same hanging "SELECT * FROM writetest;" in my notes from
2024-12...)
Best regards,
Alexander