Potential deadlock in pgaio_io_wait()
Hi,
I think this sequence can deadlock if there is a ->wait_one() function
needed for progress:
P1                              P2
acquire lock
                                stage io
wait for io cv...
                                submit io
                                acquire lock...
It's OK to assume that the state will advance to SUBMITTED, but it's
not OK to assume it will go any further, which is what the current
coding does by waiting for completion. And even without a deadlock,
P2 might just take a long time to get around to completing the IO,
when P1 should ideally handle it.
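For reference, the relevant part of pgaio_io_wait() currently looks
roughly like this (a heavily simplified sketch, not the literal code):

    /* simplified sketch of the current wait loop */
    while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
    {
        switch (state)
        {
            case PGAIO_HS_SUBMITTED:
                /* drive the IO ourselves if the method needs it */
                if (pgaio_method_ops->wait_one &&
                    !(ioh->flags & PGAIO_HF_SYNCHRONOUS))
                {
                    pgaio_method_ops->wait_one(ioh, ref_generation);
                    continue;
                }
                /* fallthrough */

            /* waiting for owner to submit */
            case PGAIO_HS_DEFINED:
            case PGAIO_HS_STAGED:
                /* fallthrough */

            /* waiting for reaper to complete */
            case PGAIO_HS_COMPLETED_IO:
                /*
                 * Sleep on ioh->cv until the IO is complete.  For DEFINED/
                 * STAGED that's wrong: if the owner is blocked on a lock we
                 * hold and the IO method needs ->wait_one() for progress,
                 * nobody ever completes the IO.
                 */
                ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_IO_COMPLETION);
                break;
        }
    }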
A naive solution would be to broadcast on ioh->cv after
pgaio_io_prepare_submit()'s state change, with a case to wait for
that. But that's expensive.
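That is, something like this in pgaio_io_prepare_submit() (sketch only,
details elided):

    /* hypothetical per-IO variant, NOT the attached patch */
    void
    pgaio_io_prepare_submit(PgAioHandle *ioh)
    {
        /* ... existing code advancing ioh to PGAIO_HS_SUBMITTED ... */

        /*
         * Wake anyone who found the IO still DEFINED/STAGED and went to
         * sleep on ioh->cv; pgaio_io_wait() would need a matching case.
         * One broadcast per IO is the expensive part.
         */
        ConditionVariableBroadcast(&ioh->cv);
    }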
It might be slightly better to have a dedicated per-backend submit_cv,
so the owner only has to broadcast once per batch. If ->submit() does
significant per-IO work then maybe that's a bit later than it needs to
be, but it's really just one system call anyway, so I doubt it
matters. Sketch attached for illustration only.
A real improvement would be to avoid the broadcast when there are no
waiters, but I got a bit too confused about flags and memory barriers
so I'm just sharing this problem report and illustration-grade patch
for now.
I doubt it's very easy to reproduce with simple queries, but I assume
if you had a SQL function that acquires a central LWLock and you ran
concurrent queries SELECT COUNT(*) FROM t WHERE locking_function(x)
in a tight loop in two sessions, with t big enough to require a
BufferAccessStrategy so that we keep evicting the buffers and
streaming them back in, you should be able to reach it. It might be
easier to write test functions that submit and wait for IO handles
directly across two sessions or something.
Attachments:
0001-aio-Fix-pgaio_io_wait-for-DEFINED-and-STAGED.patch
From fcd741083fdd02b541770ddb728db496102548ca Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 3 Aug 2025 23:07:56 +1200
Subject: [PATCH] aio: Fix pgaio_io_wait() for DEFINED and STAGED.
Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED or
PGAIO_HS_STAGED fell through to waiting for completion. The owner only
promises to advance it to PGAIO_HS_SUBMITTED, so the waiter needs to be
prepared to call wait_one() itself.
Introduce a per-backend condition variable submit_cv, woken by
pgaio_submit_staged(), and use it to wait for progress out of these
states.
XXX just a sketch
XXX expensive and pessimistic... is there a way to avoid the broadcast
or at least spinlock when no one is listening?
---
src/backend/storage/aio/aio.c | 17 ++++++++++++++---
src/backend/storage/aio/aio_init.c | 1 +
src/backend/utils/activity/wait_event_names.txt | 1 +
src/include/storage/aio_internal.h | 3 +++
4 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 3643f27ad6e..2a78f684f84 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -607,6 +607,18 @@ pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
elog(ERROR, "IO in wrong state: %d", state);
break;
+ /* waiting for owner to submit */
+ case PGAIO_HS_DEFINED:
+ case PGAIO_HS_STAGED:
+ ConditionVariablePrepareToSleep(&ioh->cv);
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state) &&
+ (state == PGAIO_HS_DEFINED ||
+ state == PGAIO_HS_STAGED))
+ ConditionVariableSleep(&pgaio_ctl->backend_state[ioh->owner_procno].submit_cv,
+ WAIT_EVENT_AIO_IO_SUBMIT);
+ ConditionVariableCancelSleep();
+ continue;
+
case PGAIO_HS_SUBMITTED:
/*
@@ -621,9 +633,6 @@ pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
}
/* fallthrough */
- /* waiting for owner to submit */
- case PGAIO_HS_DEFINED:
- case PGAIO_HS_STAGED:
/* waiting for reaper to complete */
/* fallthrough */
case PGAIO_HS_COMPLETED_IO:
@@ -1139,6 +1148,8 @@ pgaio_submit_staged(void)
pgaio_my_backend->num_staged_ios = 0;
+ ConditionVariableBroadcast(&pgaio_my_backend->submit_cv);
+
pgaio_debug(DEBUG4,
"aio: submitted %d IOs",
total_submitted);
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 885c3940c66..a065ad1463f 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -188,6 +188,7 @@ AioShmemInit(void)
dclist_init(&bs->idle_ios);
memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+ ConditionVariableInit(&bs->submit_cv);
dclist_init(&bs->in_flight_ios);
/* initialize per-backend IOs */
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0be307d2ca0..7d1dff2d6da 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -194,6 +194,7 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
AIO_IO_COMPLETION "Waiting for another process to complete IO."
+AIO_IO_SUBMIT "Waiting for another process to submit IO."
AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring."
AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."
BASEBACKUP_READ "Waiting for base backup to read from a file."
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 2d37a243abe..5447d7c2bcf 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -208,6 +208,9 @@ typedef struct PgAioBackend
uint16 num_staged_ios;
PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ /* Other backends can wait for this backend's IOs to be submitted. */
+ ConditionVariable submit_cv;
+
/*
* List of in-flight IOs. Also contains IOs that aren't strictly speaking
* in-flight anymore, but have been waited-for and completed by another
--
2.39.5 (Apple Git-154)
On Mon, Aug 4, 2025 at 5:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I doubt it's very easy to reproduce with simple queries, but I assume
if you had a SQL function that acquires a central LWLock and you ran
concurrent queries SELECT COUNT(*) FROM t WHERE locking_function(x)
Hmm, that's a bad example as it has the wrong lock scope. Probably
would need a dedicated test to demonstrate with low level functions.
What I was trying to convey is that it's not a problem that can be hit
in practice without great effort as far as I know, but it does break
an intended pgaio architectural principle as I understand it.
Also I accidentally sent that to -bugs by fat fingering an
autocompletion. Moving to -hackers.
I discussed this off-list with Andres who provided the following review:
* +1 for the analysis
* +1 for the solution
* his benchmark machine shows no regression under heavy IO submission workload
* needs better comments
I had expected that patch to be rejected as too slow. I was thinking
that it should be enough to insert a memory barrier and then do an
unlocked check for an empty waitlist in ConditionVariableBroadcast()
to avoid a spinlock acquire/release. That caused a few weird hangs I
didn't understand, which is why I didn't post that version last time,
but he pointed out that the other side also needs a memory barrier in
ConditionVariablePrepareToSleep() and the existing spinlock
acquire/release is not enough. Still processing that, but given that
ConditionVariableBroadcast() performance is already sufficient, there
is no need for fancy tricks for this problem and my "naive" patch is
actually fine.
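For the record, the abandoned fast path looked roughly like this (sketch
only; as discussed above it isn't correct without a matching barrier on
the sleeping side):

    void
    ConditionVariableBroadcast(ConditionVariable *cv)
    {
        /*
         * Abandoned idea: skip the spinlock when nobody is waiting.  The
         * barrier orders our earlier shared-memory writes (e.g. the IO
         * state change) before the unlocked read of the wait list, but
         * the sleeper side would need a barrier too.
         */
        pg_memory_barrier();
        if (proclist_is_empty(&cv->wakeup))
            return;

        /* ... existing spinlock-protected wakeup of all waiters ... */
    }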
So here's a version that just adds some comments and corrects a minor thinko.
I also thought of a small optimisation, presented in the -B patch.
It's a bit of a shame to wait for backend->submit_cv and then also
ioh->cv in io_method=worker. It's just a bunch of IPC ping-pong for
nothing. So I figured it should be allowed to fall all the way
through based on its lack of ->wait_one. Worth bothering with?
On a superficial note:
AIO_IO_COMPLETION "Waiting for another process to complete IO."
+AIO_IO_SUBMIT "Waiting for another process to submit IO."
AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring."
AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."
We're inconsistent in our choice of noun or verb. I went with _SUBMIT
to match the following line, rather than _SUBMISSION to match the
preceding line. Shrug.
Attachments:
v2-0001-aio-Fix-pgaio_io_wait-for-staged-IOs.patch
From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 3 Aug 2025 23:07:56 +1200
Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.
Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
PGAIO_HS_STAGED fell through to waiting for completion. The owner only
promises to advance it to PGAIO_HS_SUBMITTED. The waiter needs to be
prepared to call ->wait_one() itself once the IO is submitted in order
to guarantee progress and avoid deadlocks on IO methods that provide
->wait_one().
Introduce a new per-backend condition variable submit_cv, woken by
pgaio_submit_staged(), and use it to wait for the state to advance. The
new broadcast doesn't seem to cause any measurable slowdown, so ideas
for optimizing the common no-waiters case were abandoned for now.
It may not be possible to reach any real deadlock with existing AIO
users, but that situation could change. There's also no reason the
waiter shouldn't begin to wait via the IO method as soon as possible
even without a deadlock.
Picked up by testing a proposed IO method that has ->wait_one(), like
io_method=io_uring, and code review.
Backpatch-through: 18
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2BmZYrSdnhk-XrBYO18H829K77S9gMKUsykOiTJtqB43g%40mail.gmail.com
---
src/backend/storage/aio/aio.c | 26 ++++++++++++++++---
src/backend/storage/aio/aio_init.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/aio_internal.h | 7 +++++
4 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 3643f27ad6e..87eb1f71705 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -607,6 +607,23 @@ pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
elog(ERROR, "IO in wrong state: %d", state);
break;
+ case PGAIO_HS_DEFINED:
+ case PGAIO_HS_STAGED:
+
+ /*
+ * The owner hasn't submitted the IO yet. Wait for it to do
+ * so. Only then can this backend wait via the IO method if
+ * required, or otherwise ioh->cv.
+ */
+ ConditionVariablePrepareToSleep(&pgaio_ctl->backend_state[ioh->owner_procno].submit_cv);
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state) &&
+ (state == PGAIO_HS_DEFINED ||
+ state == PGAIO_HS_STAGED))
+ ConditionVariableSleep(&pgaio_ctl->backend_state[ioh->owner_procno].submit_cv,
+ WAIT_EVENT_AIO_IO_SUBMIT);
+ ConditionVariableCancelSleep();
+ continue;
+
case PGAIO_HS_SUBMITTED:
/*
@@ -621,9 +638,6 @@ pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
}
/* fallthrough */
- /* waiting for owner to submit */
- case PGAIO_HS_DEFINED:
- case PGAIO_HS_STAGED:
/* waiting for reaper to complete */
/* fallthrough */
case PGAIO_HS_COMPLETED_IO:
@@ -1139,6 +1153,12 @@ pgaio_submit_staged(void)
pgaio_my_backend->num_staged_ios = 0;
+ /*
+ * Wake any backend that started waiting for any of these IOs before
+ * submission, so that it can call ->wait_one() if necessary.
+ */
+ ConditionVariableBroadcast(&pgaio_my_backend->submit_cv);
+
pgaio_debug(DEBUG4,
"aio: submitted %d IOs",
total_submitted);
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 885c3940c66..a065ad1463f 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -188,6 +188,7 @@ AioShmemInit(void)
dclist_init(&bs->idle_ios);
memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+ ConditionVariableInit(&bs->submit_cv);
dclist_init(&bs->in_flight_ios);
/* initialize per-backend IOs */
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0be307d2ca0..7d1dff2d6da 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -194,6 +194,7 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
AIO_IO_COMPLETION "Waiting for another process to complete IO."
+AIO_IO_SUBMIT "Waiting for another process to submit IO."
AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring."
AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."
BASEBACKUP_READ "Waiting for base backup to read from a file."
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 2d37a243abe..3e438b587d9 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -208,6 +208,13 @@ typedef struct PgAioBackend
uint16 num_staged_ios;
PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ /*
+ * Other backends sometimes need to wait for the owning backend to submit.
+ * The per-IO CV would work for that purpose, but a per-backend CV allows
+ * for just one broadcast per submitted batch.
+ */
+ ConditionVariable submit_cv;
+
/*
* List of in-flight IOs. Also contains IOs that aren't strictly speaking
* in-flight anymore, but have been waited-for and completed by another
--
2.39.5 (Apple Git-154)
v2-0001-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch
From a804e6044514078db7c7917a885b19e05eac962b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 3 Aug 2025 23:07:56 +1200
Subject: [PATCH v2] aio: Fix pgaio_io_wait() for staged IOs (B).
Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
PGAIO_HS_STAGED fell through to waiting for completion. The owner only
promises to advance it to PGAIO_HS_SUBMITTED. The waiter needs to be
prepared to call ->wait_one() itself once the IO is submitted in order
to guarantee progress and avoid deadlocks on IO methods that provide
->wait_one().
Introduce a new per-backend condition variable submit_cv, woken by
pgaio_submit_staged(), and use it to wait for the state to advance. The
new broadcast doesn't seem to cause any measurable slowdown, so ideas
for optimizing the common no-waiters case were abandoned for now.
It may not be possible to reach any real deadlock with existing AIO
users, but that situation could change. There's also no reason the
waiter shouldn't begin to wait via the IO method as soon as possible
even without a deadlock.
Picked up by testing a proposed IO method that has ->wait_one(), like
io_method=io_uring, and code review.
Backpatch-through: 18
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2BmZYrSdnhk-XrBYO18H829K77S9gMKUsykOiTJtqB43g%40mail.gmail.com
---
src/backend/storage/aio/aio.c | 50 ++++++++++++++++---
src/backend/storage/aio/aio_init.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/aio_internal.h | 7 +++
4 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 3643f27ad6e..dea1721254b 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -568,6 +568,16 @@ pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState
return ioh->generation != ref_generation;
}
+/*
+ * Whether we need to wait via the IO method. Don't check via the IO method if
+ * the issuing backend is executing the IO synchronously.
+ */
+static bool
+pgaio_io_needs_wait_one(PgAioHandle *ioh)
+{
+ return pgaio_method_ops->wait_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS);
+}
+
/*
* Wait for IO to complete. External code should never use this, outside of
* the AIO subsystem waits are only allowed via pgaio_wref_wait().
@@ -607,23 +617,38 @@ pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
elog(ERROR, "IO in wrong state: %d", state);
break;
- case PGAIO_HS_SUBMITTED:
+ case PGAIO_HS_DEFINED:
+ case PGAIO_HS_STAGED:
/*
- * If we need to wait via the IO method, do so now. Don't
- * check via the IO method if the issuing backend is executing
- * the IO synchronously.
+ * The owner hasn't submitted the IO yet. If we need to wait
+ * via the IO method, wait for submission, giving this backend
+ * the chance to call ->wait_one().
*/
- if (pgaio_method_ops->wait_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS))
+ if (pgaio_io_needs_wait_one(ioh))
+ {
+ PgAioBackend *backend = &pgaio_ctl->backend_state[ioh->owner_procno];
+
+ ConditionVariablePrepareToSleep(&backend->submit_cv);
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state) &&
+ (state == PGAIO_HS_DEFINED ||
+ state == PGAIO_HS_STAGED))
+ ConditionVariableSleep(&backend->submit_cv, WAIT_EVENT_AIO_IO_SUBMIT);
+ ConditionVariableCancelSleep();
+ continue;
+ }
+ /* fallthrough */
+
+ case PGAIO_HS_SUBMITTED:
+
+ /* If we need to wait via the IO method, do so now. */
+ if (pgaio_io_needs_wait_one(ioh))
{
pgaio_method_ops->wait_one(ioh, ref_generation);
continue;
}
/* fallthrough */
- /* waiting for owner to submit */
- case PGAIO_HS_DEFINED:
- case PGAIO_HS_STAGED:
/* waiting for reaper to complete */
/* fallthrough */
case PGAIO_HS_COMPLETED_IO:
@@ -1139,6 +1164,15 @@ pgaio_submit_staged(void)
pgaio_my_backend->num_staged_ios = 0;
+ /*
+ * Wake any backend that started waiting for any of these IOs before
+ * submission, if it is necessary to call ->wait_one() to guarantee
+ * progress with the configured IO method. On its side, pgaio_io_wait()
+ * only waits for submit_cv on IO methods needing that.
+ */
+ if (pgaio_method_ops->wait_one)
+ ConditionVariableBroadcast(&pgaio_my_backend->submit_cv);
+
pgaio_debug(DEBUG4,
"aio: submitted %d IOs",
total_submitted);
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 885c3940c66..a065ad1463f 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -188,6 +188,7 @@ AioShmemInit(void)
dclist_init(&bs->idle_ios);
memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+ ConditionVariableInit(&bs->submit_cv);
dclist_init(&bs->in_flight_ios);
/* initialize per-backend IOs */
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0be307d2ca0..7d1dff2d6da 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -194,6 +194,7 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
AIO_IO_COMPLETION "Waiting for another process to complete IO."
+AIO_IO_SUBMIT "Waiting for another process to submit IO."
AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring."
AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."
BASEBACKUP_READ "Waiting for base backup to read from a file."
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 2d37a243abe..3e438b587d9 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -208,6 +208,13 @@ typedef struct PgAioBackend
uint16 num_staged_ios;
PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ /*
+ * Other backends sometimes need to wait for the owning backend to submit.
+ * The per-IO CV would work for that purpose, but a per-backend CV allows
+ * for just one broadcast per submitted batch.
+ */
+ ConditionVariable submit_cv;
+
/*
* List of in-flight IOs. Also contains IOs that aren't strictly speaking
* in-flight anymore, but have been waited-for and completed by another
--
2.39.5 (Apple Git-154)
Hi,
On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
I discussed this off-list with Andres who provided the following review:
* +1 for the analysis
* +1 for the solution
* his benchmark machine shows no regression under heavy IO submission workload
* needs better comments
I had expected that patch to be rejected as too slow. I was thinking
that it should be enough to insert a memory barrier and then do an
unlocked check for an empty waitlist in ConditionVariableBroadcast()
to avoid a spinlock acquire/release. That caused a few weird hangs I
didn't understand, which is why I didn't post that version last time,
but he pointed out that the other side also needs a memory barrier in
ConditionVariablePrepareToSleep() and the existing spinlock
acquire/release is not enough.
FTR, the reason that we would need a barrier there is that we define
SpinLockRelease() as a non-atomic volatile store on x86. That means that there
is no guarantee that another backend sees an up2date view of the memory if
reading a memory location *without* acquiring the spinlock. The reason this is
correct for spinlocks is that x86 has strongly ordered stores, and the
*acquisition* of a spinlock will only succeed once the release of the spinlock
has become visible. But if you read something *without* the spinlock, you're
not guaranteed to see all the changes done under the spinlock.
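Spelled out for this case (simplified), the unlocked wait-list check runs
into the classic store-then-load reordering hazard:

    /*
     * Waiter (pgaio_io_wait)               Submitter (broadcast side)
     * ------------------------------------ ----------------------------------
     * add self to cv->wakeup (under mutex) ioh->state = PGAIO_HS_SUBMITTED
     * S_UNLOCK() -- plain store on x86     pg_memory_barrier()
     * load ioh->state                      unlocked check of cv->wakeup
     *
     * Without a barrier on the waiter side, its load of ioh->state can be
     * satisfied before its earlier stores become visible.  Then both sides
     * read stale data: the waiter still sees STAGED and goes to sleep, while
     * the submitter sees an empty wait list and skips the wakeup.
     */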
I do wonder if we accidentally rely on SpinLockRelease() being a barrier
anywhere...
I also thought of a small optimisation, presented in the -B patch.
It's a bit of a shame to wait for backend->submit_cv and then also
ioh->cv in io_method=worker. It's just a bunch of IPC ping-pong for
nothing. So I figured it should be allowed to fall all the way
through based on its lack of ->wait_one. Worth bothering with?
I don't think I'd go there without concrete evidence that it helps - it makes
it harder to understand when to wait for what. Not terribly so, but enough
that I'd not go there without some measurable benefit.
On a superficial note:
AIO_IO_COMPLETION "Waiting for another process to complete IO."
+AIO_IO_SUBMIT "Waiting for another process to submit IO."
AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring."
AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."
We're inconsistent in our choice of noun or verb. I went with _SUBMIT
to match the following line, rather than _SUBMISSION to match the
preceding line. Shrug.
FWIW, I think so far it's a verb when the process is doing the work (e.g. the
process is calling io_uring_enter(to_submit = ..). It's a noun if we wait for
somebody *else* to do the completion, since we're not doing work ourselves.
From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 3 Aug 2025 23:07:56 +1200
Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.
Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
PGAIO_HS_STAGED fell through to waiting for completion. The owner only
promises to advance it to PGAIO_HS_SUBMITTED. The waiter needs to be
prepared to call ->wait_one() itself once the IO is submitted in order
to guarantee progress and avoid deadlocks on IO methods that provide
->wait_one().
Introduce a new per-backend condition variable submit_cv, woken by
pgaio_submit_staged(), and use it to wait for the state to advance. The
new broadcast doesn't seem to cause any measurable slowdown, so ideas
for optimizing the common no-waiters case were abandoned for now.
It may not be possible to reach any real deadlock with existing AIO
users, but that situation could change. There's also no reason the
waiter shouldn't begin to wait via the IO method as soon as possible
even without a deadlock.
Picked up by testing a proposed IO method that has ->wait_one(), like
io_method=io_uring, and code review.
LGTM.
Greetings,
Andres Freund
On 19/08/2025 03:07, Andres Freund wrote:
On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 3 Aug 2025 23:07:56 +1200
Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.
Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
PGAIO_HS_STAGED fell through to waiting for completion. The owner only
promises to advance it to PGAIO_HS_SUBMITTED. The waiter needs to be
prepared to call ->wait_one() itself once the IO is submitted in order
to guarantee progress and avoid deadlocks on IO methods that provide
->wait_one().
Introduce a new per-backend condition variable submit_cv, woken by
pgaio_submit_staged(), and use it to wait for the state to advance. The
new broadcast doesn't seem to cause any measurable slowdown, so ideas
for optimizing the common no-waiters case were abandoned for now.
It may not be possible to reach any real deadlock with existing AIO
users, but that situation could change. There's also no reason the
waiter shouldn't begin to wait via the IO method as soon as possible
even without a deadlock.
Picked up by testing a proposed IO method that has ->wait_one(), like
io_method=io_uring, and code review.
LGTM.
I just independently noticed this same issue, wrote a little test to
reproduce it, and was about to report it, when I noticed that you found
this already. Attached is the repro script.
Both of the proposed patches seem fine to me. I'm inclined to go with
the first patch (v2-0001-aio-Fix-pgaio_io_wait-for-staged-IOs.patch),
without the extra optimization, unless we can actually measure a
performance difference.
- Heikki
Attachments:
0001-Repro-backend-stuck-waiting-on-submitted-IO-with-io_.patch
From 98704151ae72b3cfa4a5fb85706fe5bca6c7998c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Sep 2025 17:08:58 +0300
Subject: [PATCH 1/1] Repro backend stuck waiting on submitted IO with io_uring
---
src/test/modules/test_aio/t/001_aio.pl | 58 ++++++++++++++++++++++++++
src/test/modules/test_aio/test_aio.c | 11 ++++-
2 files changed, 68 insertions(+), 1 deletion(-)
diff --git a/src/test/modules/test_aio/t/001_aio.pl b/src/test/modules/test_aio/t/001_aio.pl
index 3f0453619e8..1fcf02aafdb 100644
--- a/src/test/modules/test_aio/t/001_aio.pl
+++ b/src/test/modules/test_aio/t/001_aio.pl
@@ -673,6 +673,63 @@ sub test_complete_foreign
$psql_b->quit();
}
+# 1. Backend A stages an IO
+# 2. Backend B blocks waiting on it
+# 3. Backend A submits the IO, but does not wait for it to complete
+# 4. Backend B should complete the IO
+sub test_wait_staged
+{
+ my $io_method = shift;
+ my $node = shift;
+ my ($ret, $output);
+
+ my $psql_a = $node->background_psql('postgres', on_error_stop => 0);
+ # 10 s timeout to make the test faster if backend B gets stuck waiting for IO
+ my $psql_b = $node->background_psql('postgres', on_error_stop => 0, timeout => 10);
+
+ # To warm the catalog caches
+ $psql_a->query_safe(
+ qq(SELECT read_rel_block_ll('tbl_ok', 2, wait_complete=>true);));
+
+ # Start a new batch
+ $psql_a->query_safe(
+ qq(begin));
+ $psql_a->query_safe(
+ qq(SELECT WHERE batch_start() IS NULL));
+
+ # Stage IO without submitting it yet
+ $psql_a->query_safe(
+ qq(SELECT read_rel_block_ll('tbl_ok', 1, batchmode_enter=>false, wait_complete=>false);));
+
+ # Start reading the same block in another backend. It sees the IO as "staged" by backend
+ # A, and blocks waiting on it.
+ $psql_b->query_until(
+ qr/start/, q{
+ \echo start
+ BEGIN;
+ SELECT read_rel_block_ll('tbl_ok', 1, wait_complete=>true);
+ SELECT 'SUCCESS';
+ COMMIT;
+});
+
+ # make sure psql_b has blocked on the IO
+ sleep(1);
+
+ # Cause the IO we staged in backend A to be submitted, but don't wait for its completion
+ $psql_a->query_safe(
+ qq(rollback));
+
+ # Wait for the IO on backend B to complete.
+ pump_until(
+ $psql_b->{run}, $psql_b->{timeout},
+ \$psql_b->{stdout}, qr/SUCCESS/);
+# $psql_b->{stderr} = '';
+ ok(1, "$io_method: backend B sees the io staged and later submitted by backend A as completed");
+
+ $psql_a->quit();
+ $psql_b->quit();
+}
+
# Test that we deal correctly with FDs being closed while IO is in progress
sub test_close_fd
{
@@ -1525,6 +1582,7 @@ CHECKPOINT;
test_checksum($io_method, $node);
test_ignore_checksum($io_method, $node);
test_checksum_createdb($io_method, $node);
+ test_wait_staged($io_method, $node);
SKIP:
{
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
index c55cf6c0aac..63ca3499815 100644
--- a/src/test/modules/test_aio/test_aio.c
+++ b/src/test/modules/test_aio/test_aio.c
@@ -374,11 +374,18 @@ read_rel_block_ll(PG_FUNCTION_ARGS)
PgAioHandle *ioh;
PgAioWaitRef iow;
SMgrRelation smgr;
+ ResourceOwner save_resowner = CurrentResourceOwner;
+
+ /*
+ * Associate the IO with top-transaction owner, so that if you call
+ * batch_start() before this, the IO gets staged in the batch.
+ */
+ CurrentResourceOwner = TopTransactionResourceOwner;
if (nblocks <= 0 || nblocks > PG_IOV_MAX)
elog(ERROR, "nblocks is out of range");
- rel = relation_open(relid, AccessExclusiveLock);
+ rel = relation_open(relid, AccessShareLock);
for (int i = 0; i < nblocks; i++)
{
@@ -449,6 +456,8 @@ read_rel_block_ll(PG_FUNCTION_ARGS)
relation_close(rel, NoLock);
+ CurrentResourceOwner = save_resowner;
+
PG_RETURN_VOID();
}
--
2.47.3
On Tue, Sep 23, 2025 at 2:53 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I just independently noticed this same issue, wrote a little test to
reproduce it, and was about to report it, when I noticed that you found
this already. Attached is the repro script.
Nice. Thanks!
Both of the proposed patches seem fine to me. I'm inclined to go with
the first patch (v2-0001-aio-Fix-pgaio_io_wait-for-staged-IOs.patch),
without the extra optimization, unless we can actually measure a
performance difference.
I like the second one slightly more, but I'm also happy to be
outvoted. I'm working on committing the first one, but I ran into a
couple of CI failures[1] that I haven't understood yet. It could be
my patch's fault or turn out to be something already in the tree...
looking into that...
[1] https://cirrus-ci.com/task/4737677834059776
Hello Thomas,
10.10.2025 04:44, Thomas Munro wrote:
I like the second one slightly more, but I'm also happy to be
outvoted. I'm working on committing the first one, but I ran into a
couple of CI failures[1] that I haven't understood yet. It could be
my patch's fault or turn out to be something already in the tree...
looking into that...
I wonder whether it may be related to the issue we encountered with
OpenBSD?:
https://www.postgresql.org/message-id/db2773a2-aca0-43d0-99c1-060efcd9954e@gmail.com
(I've found the same hanging "SELECT * FROM writetest;" in my notes from
2024-12...)
Best regards,
Alexander