AIO v2.0
Hi,
It's been quite a while since the last version of the AIO patchset that I
have posted. Of course parts of the larger project have since gone
upstream [1].
A lot of the time since the last version was spent understanding the
performance characteristics of using AIO for WAL, and chasing some other odd
performance behavior that I couldn't explain. I think I mostly understand
that now, and what the design implications for an AIO subsystem are.
The prototype I had been working on unfortunately suffered from a few design
issues that weren't trivial to fix.
The biggest was that each backend could essentially have hard references to
unbounded numbers of "AIO handles" and that these references prevented these
handles from being reused. Because "AIO handles" have to live in shared
memory (so that other backends can wait on them, IO workers can perform
them, etc.), that's obviously an issue. There was always a way to just run out of AIO
handles. I went through quite a few iterations of a design for how to resolve
that - I think I finally got there.
Another significant issue was that when I wrote the AIO prototype,
bufmgr.c/smgr.c/md.c only issued IOs in BLCKSZ increments, with the AIO
subsystem merging them into larger IOs. Thomas et al's work on streaming
reads makes bufmgr.c issue larger IOs - which is good for performance, but
was surprisingly hard to fit into my older design.
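To illustrate the shape of that change (a toy sketch, not the actual
bufmgr.c/smgr.c code - the function name and signature are made up): instead
of handing BLCKSZ-sized IOs to the AIO subsystem and merging them there, the
caller now computes up front how many consecutive blocks can be combined into
one larger IO:

```c
#include <assert.h>

/*
 * Hypothetical sketch: starting at blocknums[0], count how many of the
 * requested blocks are physically consecutive and can therefore be
 * combined into a single larger IO, capped by a combine limit.
 */
static int
combinable_blocks(const unsigned *blocknums, int nblocks, int combine_limit)
{
	int			n = 1;

	while (n < nblocks && n < combine_limit &&
		   blocknums[n] == blocknums[0] + n)
		n++;
	return n;
}
```

With this approach the issuer decides the IO size before submission, which
is what the rewritten patchset has to accommodate.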
It took me much longer than I had hoped to address these issues in the
prototype. In the end I made progress by rewriting the patchset from
scratch (well, with a bit of copy & paste).
The main reason I had previously implemented WAL AIO etc was to know the
design implications - but now that they're somewhat understood, I'm planning
to keep the patchset much smaller, with the goal of making it upstreamable.
While making v2 somewhat presentable I unfortunately found a few more design
issues - they're now mostly resolved, I think. But I only resolved the last
one a few hours ago; who knows what a few nights of sleeping on it will
bring. Unfortunately, that prevented me from doing some of the polishing I
had wanted to finish...
Because of the aforementioned move [2], I currently do not have access to my
workstation. I just have access to my laptop - which has enough thermal issues
to make benchmarks not particularly reliable.
So here are just a few teaser numbers, on a PCIe v4 NVMe SSD. Note however
that this is with the BAS_BULKREAD ring size increased - with the default
256kB, we can only keep one IO in flight at a time (because io_combine_limit
builds larger IOs). We'll need to do something better there, but that's yet
another separate discussion.
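As a back-of-the-envelope illustration (assuming the default BLCKSZ of 8kB
and the default io_combine_limit of 16 blocks, i.e. 128kB): a 256kB
BAS_BULKREAD ring only has room for two fully combined IOs, and while the
buffers of one are pinned and being consumed, only one other IO can be in
flight:

```c
#include <assert.h>

/* Back-of-the-envelope numbers, assuming default build options. */
#define BLCKSZ				8192
#define BAS_BULKREAD_SIZE	(256 * 1024)	/* default bulk-read ring */
#define IO_COMBINE_LIMIT	(16 * BLCKSZ)	/* default io_combine_limit */

/* How many fully combined IOs fit into the ring at once? */
static int
ios_in_ring(int ring_size, int combined_io_size)
{
	return ring_size / combined_io_size;
}
```

With only two slots, effectively one IO is in flight while the previous one
is consumed - hence the need to increase the ring size for these numbers.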
Workload: pg_prewarm('pgbench_accounts') of a scale 5k database, which is
bigger than memory:
                             time
master:                      59.097
aio v2.0, worker:            11.211
aio v2.0, uring *:           19.991
aio v2.0, direct, worker:    09.617
aio v2.0, direct, uring *:   09.802
Workload: SELECT sum(abalance) FROM pgbench_accounts;
                            0 workers   1 worker   2 workers   4 workers
master:                        65.753     33.246      21.095      12.918
aio v2.0, worker:              21.519     12.636      10.450      10.004
aio v2.0, uring *:             31.446     17.745      12.889      10.395
aio v2.0, uring **:            23.497     13.824      10.881      10.589
aio v2.0, direct, worker:      22.377     11.989      09.915      09.772
aio v2.0, direct, uring *:     24.502     12.603      10.058      09.759
* the reason io_uring is slower here is that worker mode effectively
  parallelizes the memcpys, at the cost of increased CPU usage
** a simple heuristic to use IOSQE_ASYNC to force some parallelism of memcpys
Workload: checkpointing ~20GB of dirty data, mostly sequential:
                             time
master:                      10.209
aio v2.0, worker:            05.391
aio v2.0, uring:             04.593
aio v2.0, direct, worker:    07.745
aio v2.0, direct, uring:     03.351
To solve the issue of an unbounded number of AIO handle references, there
are a few changes compared to the prior approach:
1) Only one AIO handle can be "handed out" to a backend at a time, without
being defined. Previously the process of getting an AIO handle wasn't super
lightweight, which made it appealing to cache AIO handles - one part of why
it was possible to run out of AIO handles.
2) Nothing in a backend can force a "defined" AIO handle (i.e. one that is a
valid operation) to stay around; it's always possible to execute the AIO
operation and then reuse the handle. This provides a forward progress
guarantee, by ensuring that completing AIOs can free up handles (previously
they could not be reused until the backend-local reference was released).
3) Callbacks on AIOs are not allowed to error out anymore, unless it's ok to
take the server down.
4) Obviously some code needs to know the result of an AIO operation and be
able to error out. To allow for that, the issuer of an AIO can provide a pointer
to local memory that'll receive the result of an AIO, including details
about what kind of errors occurred (possible errors are e.g. a read failing
or a buffer's checksum validation failing).
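Roughly, the handle lifecycle those rules imply can be modeled like this (a
toy sketch, not the patchset's actual API - all names, types, and the pool
size are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define NUM_AIO_HANDLES 8		/* size of the shared handle pool (made up) */

typedef enum { AH_FREE, AH_HANDED_OUT, AH_DEFINED } AioHandleState;

typedef struct AioHandle
{
	AioHandleState state;
	int		   *result;			/* issuer-provided result destination */
} AioHandle;

static AioHandle pool[NUM_AIO_HANDLES];
static AioHandle *handed_out;	/* rule 1: at most one undefined handle */

static AioHandle *
aio_get_handle(void)
{
	assert(handed_out == NULL);	/* only one handle handed out at a time */
	for (int i = 0; i < NUM_AIO_HANDLES; i++)
	{
		if (pool[i].state == AH_FREE)
		{
			pool[i].state = AH_HANDED_OUT;
			handed_out = &pool[i];
			return handed_out;
		}
	}
	return NULL;				/* real code would execute a defined IO */
}

/* Rule 4: the issuer says where the result should be reported. */
static void
aio_define(AioHandle *h, int *result)
{
	h->state = AH_DEFINED;
	h->result = result;
	handed_out = NULL;			/* backend no longer pins this handle */
}

/* Rule 2: a defined handle can always be executed and then reused. */
static void
aio_execute(AioHandle *h)
{
	*h->result = 0;				/* 0 = success; errors reported here too */
	h->state = AH_FREE;
	h->result = NULL;
}
```

The key point is that once a handle is defined, completing it both reports
the result through the issuer's pointer and immediately frees the handle for
reuse.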
In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).
Besides that, I am planning to introduce "io_method=sync", which will just
execute IO synchronously. In addition to being a good capability to have,
it'll also make it more sensible to split off worker mode support into its
own commit(s).
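Very roughly sketched (names invented for illustration, not the patchset's
actual code), the appeal of such a mode is that the submission path stays the
same for all methods, with the sync method simply performing the IO
immediately in the issuing backend:

```c
#include <assert.h>

typedef enum { IOMETHOD_SYNC, IOMETHOD_WORKER, IOMETHOD_IO_URING } IoMethod;

static int	ios_completed;		/* stand-in for real completion tracking */

static void
io_perform(void)
{
	ios_completed++;			/* would be the actual pread()/pwrite() */
}

static void
io_submit(IoMethod method)
{
	switch (method)
	{
		case IOMETHOD_SYNC:
			/* no queue, no workers: execute right here, synchronously */
			io_perform();
			break;
		case IOMETHOD_WORKER:
		case IOMETHOD_IO_URING:
			/* would hand the IO off to an IO worker / io_uring and return */
			break;
	}
}
```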
Greetings,
Andres Freund
[1]: bulk relation extension, streaming read
[2]: personal health challenges, family health challenges and now moving from the US West Coast to the East Coast, ...
Attachments:
v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch
From e05cf468cab4003baa510053ff921063ca32c19a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 27 Jul 2023 18:59:25 -0700
Subject: [PATCH v2.0 01/17] bufmgr: Return early in
ScheduleBufferTagForWriteback() if fsync=off
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/bufmgr.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5cdd2f10fc8..ec957635f2a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5926,7 +5926,12 @@ ScheduleBufferTagForWriteback(WritebackContext *wb_context, IOContext io_context
{
PendingWriteback *pending;
- if (io_direct_flags & IO_DIRECT_DATA)
+ /*
+ * As pg_flush_data() doesn't do anything with fsync disabled, there's no
+ * point in tracking in that case.
+ */
+ if (io_direct_flags & IO_DIRECT_DATA ||
+ !enableFsync)
return;
/*
--
2.45.2.827.g557ae147e6
v2.0-0002-Allow-lwlocks-to-be-unowned.patch
From a1f0fd69a34d146294bd4398bd5a5712cdc002ce Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 5 Jan 2021 10:10:36 -0800
Subject: [PATCH v2.0 02/17] Allow lwlocks to be unowned
This is required for AIO, so that a lock held during a write can be released
in another backend. That in turn is required to avoid the potential for
deadlocks.
---
src/include/storage/lwlock.h | 2 +
src/backend/storage/lmgr/lwlock.c | 96 +++++++++++++++++++++----------
2 files changed, 68 insertions(+), 30 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e09..00e8022fbad 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
+extern LWLockMode LWLockReleaseOwnership(LWLock *l);
+extern void LWLockReleaseUnowned(LWLock *l, LWLockMode mode);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockAnyHeldByMe(LWLock *lock, int nlocks, size_t stride);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e765754d805..f3d3435b1f5 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1773,52 +1773,36 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
}
}
-
-/*
- * LWLockRelease - release a previously acquired lock
- */
-void
-LWLockRelease(LWLock *lock)
+static void
+LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
{
- LWLockMode mode;
uint32 oldstate;
bool check_waiters;
- int i;
-
- /*
- * Remove lock from list of locks held. Usually, but not always, it will
- * be the latest-acquired lock; so search array backwards.
- */
- for (i = num_held_lwlocks; --i >= 0;)
- if (lock == held_lwlocks[i].lock)
- break;
-
- if (i < 0)
- elog(ERROR, "lock %s is not held", T_NAME(lock));
-
- mode = held_lwlocks[i].mode;
-
- num_held_lwlocks--;
- for (; i < num_held_lwlocks; i++)
- held_lwlocks[i] = held_lwlocks[i + 1];
-
- PRINT_LWDEBUG("LWLockRelease", lock, mode);
/*
* Release my hold on lock, after that it can immediately be acquired by
* others, even if we still have to wakeup other waiters.
*/
if (mode == LW_EXCLUSIVE)
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_EXCLUSIVE);
else
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
/* nobody else can have that kind of lock */
- Assert(!(oldstate & LW_VAL_EXCLUSIVE));
+ if (mode == LW_EXCLUSIVE)
+ Assert((oldstate & LW_LOCK_MASK) == LW_VAL_EXCLUSIVE);
+ else
+ Assert((oldstate & LW_LOCK_MASK) < LW_VAL_EXCLUSIVE &&
+ (oldstate & LW_LOCK_MASK) >= LW_VAL_SHARED);
if (TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED())
TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+ if (mode == LW_EXCLUSIVE)
+ oldstate -= LW_VAL_EXCLUSIVE;
+ else
+ oldstate -= LW_VAL_SHARED;
+
/*
* We're still waiting for backends to get scheduled, don't wake them up
* again.
@@ -1841,6 +1825,58 @@ LWLockRelease(LWLock *lock)
LWLockWakeup(lock);
}
+ TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+}
+
+void
+LWLockReleaseUnowned(LWLock *lock, LWLockMode mode)
+{
+ LWLockReleaseInternal(lock, mode);
+}
+
+/*
+ * XXX: this doesn't do a RESUME_INTERRUPTS(), responsibility of the caller.
+ */
+LWLockMode
+LWLockReleaseOwnership(LWLock *lock)
+{
+ LWLockMode mode;
+ int i;
+
+ /*
+ * Remove lock from list of locks held. Usually, but not always, it will
+ * be the latest-acquired lock; so search array backwards.
+ */
+ for (i = num_held_lwlocks; --i >= 0;)
+ if (lock == held_lwlocks[i].lock)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "lock %s is not held", T_NAME(lock));
+
+ mode = held_lwlocks[i].mode;
+
+ num_held_lwlocks--;
+ for (; i < num_held_lwlocks; i++)
+ held_lwlocks[i] = held_lwlocks[i + 1];
+
+ return mode;
+}
+
+/*
+ * LWLockRelease - release a previously acquired lock
+ */
+void
+LWLockRelease(LWLock *lock)
+{
+ LWLockMode mode;
+
+ mode = LWLockReleaseOwnership(lock);
+
+ PRINT_LWDEBUG("LWLockRelease", lock, mode);
+
+ LWLockReleaseInternal(lock, mode);
+
/*
* Now okay to allow cancel/die interrupts.
*/
--
2.45.2.827.g557ae147e6
v2.0-0003-Use-aux-process-resource-owner-in-walsender.patch
From 97e621ddc5fb3b7f60b8dd5517c45fac16e1f6f7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 31 Aug 2021 12:16:28 -0700
Subject: [PATCH v2.0 03/17] Use aux process resource owner in walsender
AIO will need a resource owner to do IO. Right now we create a resowner
on-demand during basebackup, and we could do the same for AIO. But it seems
easier to just always create an aux process resowner.
---
src/include/replication/walsender.h | 1 -
src/backend/backup/basebackup.c | 8 ++++--
src/backend/replication/walsender.c | 44 ++++++-----------------------
3 files changed, 13 insertions(+), 40 deletions(-)
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index f2d8297f016..aff0f7a51ca 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -38,7 +38,6 @@ extern PGDLLIMPORT bool log_replication_commands;
extern void InitWalSender(void);
extern bool exec_replication_command(const char *cmd_string);
extern void WalSndErrorCleanup(void);
-extern void WalSndResourceCleanup(bool isCommit);
extern void PhysicalWakeupLogicalWalSnd(void);
extern XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
extern void WalSndSignals(void);
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index de16afac749..23bf8bf2db0 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -250,8 +250,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
state.bytes_total_is_valid = false;
/* we're going to use a BufFile, so we need a ResourceOwner */
- Assert(CurrentResourceOwner == NULL);
- CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+ Assert(AuxProcessResourceOwner != NULL);
+ Assert(CurrentResourceOwner == AuxProcessResourceOwner ||
+ CurrentResourceOwner == NULL);
+ CurrentResourceOwner = AuxProcessResourceOwner;
backup_started_in_recovery = RecoveryInProgress();
@@ -672,7 +674,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
FreeBackupManifest(&manifest);
/* clean up the resource owner we created */
- WalSndResourceCleanup(true);
+ ReleaseAuxProcessResources(true);
basebackup_progress_done();
}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c5f1009f370..0e847535a64 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -282,10 +282,8 @@ InitWalSender(void)
/* Create a per-walsender data structure in shared memory */
InitWalSenderSlot();
- /*
- * We don't currently need any ResourceOwner in a walsender process, but
- * if we did, we could call CreateAuxProcessResourceOwner here.
- */
+ /* need resource owner for e.g. basebackups */
+ CreateAuxProcessResourceOwner();
/*
* Let postmaster know that we're a WAL sender. Once we've declared us as
@@ -346,7 +344,7 @@ WalSndErrorCleanup(void)
* without a transaction, we've got to clean that up now.
*/
if (!IsTransactionOrTransactionBlock())
- WalSndResourceCleanup(false);
+ ReleaseAuxProcessResources(false);
if (got_STOPPING || got_SIGUSR2)
proc_exit(0);
@@ -355,34 +353,6 @@ WalSndErrorCleanup(void)
WalSndSetState(WALSNDSTATE_STARTUP);
}
-/*
- * Clean up any ResourceOwner we created.
- */
-void
-WalSndResourceCleanup(bool isCommit)
-{
- ResourceOwner resowner;
-
- if (CurrentResourceOwner == NULL)
- return;
-
- /*
- * Deleting CurrentResourceOwner is not allowed, so we must save a pointer
- * in a local variable and clear it first.
- */
- resowner = CurrentResourceOwner;
- CurrentResourceOwner = NULL;
-
- /* Now we can release resources and delete it. */
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_BEFORE_LOCKS, isCommit, true);
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_LOCKS, isCommit, true);
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_AFTER_LOCKS, isCommit, true);
- ResourceOwnerDelete(resowner);
-}
-
/*
* Handle a client's connection abort in an orderly manner.
*/
@@ -685,8 +655,10 @@ UploadManifest(void)
* parsing the manifest will use the cryptohash stuff, which requires a
* resource owner
*/
- Assert(CurrentResourceOwner == NULL);
- CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+ Assert(AuxProcessResourceOwner != NULL);
+ Assert(CurrentResourceOwner == AuxProcessResourceOwner ||
+ CurrentResourceOwner == NULL);
+ CurrentResourceOwner = AuxProcessResourceOwner;
/* Prepare to read manifest data into a temporary context. */
mcxt = AllocSetContextCreate(CurrentMemoryContext,
@@ -723,7 +695,7 @@ UploadManifest(void)
uploaded_manifest_mcxt = mcxt;
/* clean up the resource owner we created */
- WalSndResourceCleanup(true);
+ ReleaseAuxProcessResources(true);
}
/*
--
2.45.2.827.g557ae147e6
v2.0-0004-Ensure-a-resowner-exists-for-all-paths-that-may.patch
From 6e9b170059b75642e348e93e4a83b332ef9b3f99 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 1 Aug 2024 09:56:36 -0700
Subject: [PATCH v2.0 04/17] Ensure a resowner exists for all paths that may
perform AIO
---
src/backend/bootstrap/bootstrap.c | 7 +++++++
src/backend/replication/logical/logical.c | 6 ++++++
src/backend/utils/init/postinit.c | 3 ++-
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7637581a184..234fdc57ca7 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -331,8 +331,15 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
BaseInit();
bootstrap_signals();
+
+ /* need a resowner for IO during BootStrapXLOG() */
+ CreateAuxProcessResourceOwner();
+
BootStrapXLOG(bootstrap_data_checksum_version);
+ ReleaseAuxProcessResources(true);
+ CurrentResourceOwner = NULL;
+
/*
* To ensure that src/common/link-canary.c is linked into the backend, we
* must call it from somewhere. Here is as good as anywhere.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3fe1774a1e9..be0c7846d00 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -386,6 +386,12 @@ CreateInitDecodingContext(const char *plugin,
slot->data.plugin = plugin_name;
SpinLockRelease(&slot->mutex);
+ if (CurrentResourceOwner == NULL)
+ {
+ Assert(am_walsender);
+ CurrentResourceOwner = AuxProcessResourceOwner;
+ }
+
if (XLogRecPtrIsInvalid(restart_lsn))
ReplicationSlotReserveWal();
else
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 3b50ce19a2c..11128ea461c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -719,7 +719,8 @@ InitPostgres(const char *in_dbname, Oid dboid,
* and ShutdownXLOG will need one. Hence, create said resource owner
* (and register a callback to clean it up after ShutdownXLOG runs).
*/
- CreateAuxProcessResourceOwner();
+ if (!bootstrap)
+ CreateAuxProcessResourceOwner();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
--
2.45.2.827.g557ae147e6
v2.0-0005-bufmgr-smgr-Don-t-cross-segment-boundaries-in-S.patch
From 7d58cc85191c96d8dc731b62810b64c5b366743b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 22:10:35 -0400
Subject: [PATCH v2.0 05/17] bufmgr/smgr: Don't cross segment boundaries in
StartReadBuffers()
With real AIO it doesn't make sense to cross segment boundaries with one
IO. Add smgrmaxcombine() to allow upper layers to query which buffers can be
merged.
---
src/include/storage/md.h | 2 ++
src/include/storage/smgr.h | 2 ++
src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
src/backend/storage/smgr/md.c | 17 +++++++++++++++++
src/backend/storage/smgr/smgr.c | 16 ++++++++++++++++
5 files changed, 55 insertions(+)
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 620f10abdeb..b72293c79a5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -32,6 +32,8 @@ extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index e15b20a566a..899d0d681c5 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,8 @@ extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ec957635f2a..f2e608f597d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1286,6 +1286,7 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
{
int actual_nblocks = *nblocks;
int io_buffers_len = 0;
+ int maxcombine = 0;
Assert(*nblocks > 0);
Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
@@ -1317,6 +1318,23 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
{
/* Extend the readable range to cover this block. */
io_buffers_len++;
+
+ /*
+ * Check how many blocks we can cover with the same IO. The smgr
+ * implementation might e.g. be limited due to a segment boundary.
+ */
+ if (i == 0 && actual_nblocks > 1)
+ {
+ maxcombine = smgrmaxcombine(operation->smgr,
+ operation->forknum,
+ blockNum);
+ if (maxcombine < actual_nblocks)
+ {
+ elog(DEBUG2, "limiting nblocks at %u from %u to %u",
+ blockNum, actual_nblocks, maxcombine);
+ actual_nblocks = maxcombine;
+ }
+ }
}
}
*nblocks = actual_nblocks;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6796756358f..6cd81a61faa 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -803,6 +803,17 @@ buffers_to_iovec(struct iovec *iov, void **buffers, int nblocks)
return iovcnt;
}
+uint32
+mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum)
+{
+ BlockNumber segoff;
+
+ segoff = blocknum % ((BlockNumber) RELSEG_SIZE);
+
+ return RELSEG_SIZE - segoff;
+}
+
/*
* mdreadv() -- Read the specified blocks from a relation.
*/
@@ -833,6 +844,9 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
size_this_segment = nblocks_this_segment * BLCKSZ;
transferred_this_segment = 0;
@@ -956,6 +970,9 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
iovcnt = buffers_to_iovec(iov, (void **) buffers, nblocks_this_segment);
size_this_segment = nblocks_this_segment * BLCKSZ;
transferred_this_segment = 0;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7b9fa103eff..ee31db85eec 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -88,6 +88,8 @@ typedef struct f_smgr
BlockNumber blocknum, int nblocks, bool skipFsync);
bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+ uint32 (*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
@@ -117,6 +119,7 @@ static const f_smgr smgrsw[] = {
.smgr_extend = mdextend,
.smgr_zeroextend = mdzeroextend,
.smgr_prefetch = mdprefetch,
+ .smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
.smgr_writev = mdwritev,
.smgr_writeback = mdwriteback,
@@ -588,6 +591,19 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
}
+/*
+ * smgrmaxcombine() - Return the maximum number of total blocks that can be
+ * combined with an IO starting at blocknum.
+ *
+ * The returned value includes the io for blocknum itself.
+ */
+uint32
+smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum)
+{
+ return smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum);
+}
+
/*
* smgrreadv() -- read a particular block range from a relation into the
* supplied buffers.
--
2.45.2.827.g557ae147e6
v2.0-0006-aio-Add-liburing-dependency.patch
From 1e1c9d880f71f4548c9616d57d6071e3f90d8f70 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 5 Jun 2024 19:37:25 -0700
Subject: [PATCH v2.0 06/17] aio: Add liburing dependency
Not yet used.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/pg_config.h.in | 3 +
src/makefiles/meson.build | 3 +
configure | 138 +++++++++++++++++++++++++++++++++++++
configure.ac | 11 +++
meson.build | 14 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 4 ++
7 files changed, 176 insertions(+)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 979925cc2e2..397133b51ac 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -708,6 +708,9 @@
/* Define to 1 to build with LDAP support. (--with-ldap) */
#undef USE_LDAP
+/* Define to build with io-uring support. (--with-liburing) */
+#undef USE_LIBURING
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 850e9275845..cca689b2028 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'gssapi': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/configure b/configure
index 537366945c0..317a462f610 100755
--- a/configure
+++ b/configure
@@ -654,6 +654,8 @@ LIBOBJS
OPENSSL
ZSTD
LZ4
+LIBURING_LIBS
+LIBURING_CFLAGS
UUID_LIBS
LDAP_LIBS_BE
LDAP_LIBS_FE
@@ -712,6 +714,7 @@ XML2_CFLAGS
XML2_CONFIG
with_libxml
with_uuid
+with_liburing
with_readline
with_systemd
with_selinux
@@ -865,6 +868,7 @@ with_selinux
with_systemd
with_readline
with_libedit_preferred
+with_liburing
with_uuid
with_ossp_uuid
with_libxml
@@ -907,6 +911,8 @@ LDFLAGS_EX
LDFLAGS_SL
PERL
PYTHON
+LIBURING_CFLAGS
+LIBURING_LIBS
MSGFMT
TCLSH'
@@ -1574,6 +1580,7 @@ Optional Packages:
--without-readline do not use GNU Readline nor BSD Libedit for editing
--with-libedit-preferred
prefer BSD Libedit over GNU Readline
+ --with-liburing use liburing for async io
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libxml build with XML support
@@ -1617,6 +1624,10 @@ Some influential environment variables:
LDFLAGS_SL extra linker flags for linking shared libraries only
PERL Perl program
PYTHON Python program
+ LIBURING_CFLAGS
+ C compiler flags for LIBURING, overriding pkg-config
+ LIBURING_LIBS
+ linker flags for LIBURING, overriding pkg-config
MSGFMT msgfmt program for NLS
TCLSH Tcl interpreter program (tclsh)
@@ -8664,6 +8675,40 @@ fi
+#
+# liburing
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with liburing support" >&5
+$as_echo_n "checking whether to build with liburing support... " >&6; }
+
+
+
+# Check whether --with-liburing was given.
+if test "${with_liburing+set}" = set; then :
+ withval=$with_liburing;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBURING 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-liburing option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_liburing=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_liburing" >&5
+$as_echo "$with_liburing" >&6; }
+
#
# UUID library
@@ -13222,6 +13267,99 @@ fi
fi
+if test "$with_liburing" = yes; then
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for liburing" >&5
+$as_echo_n "checking for liburing... " >&6; }
+
+if test -n "$LIBURING_CFLAGS"; then
+ pkg_cv_LIBURING_CFLAGS="$LIBURING_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_CFLAGS=`$PKG_CONFIG --cflags "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBURING_LIBS"; then
+ pkg_cv_LIBURING_LIBS="$LIBURING_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_LIBS=`$PKG_CONFIG --libs "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "liburing" 2>&1`
+ else
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "liburing" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBURING_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (liburing) were not met:
+
+$LIBURING_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBURING_CFLAGS=$pkg_cv_LIBURING_CFLAGS
+ LIBURING_LIBS=$pkg_cv_LIBURING_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
##
## Header files
diff --git a/configure.ac b/configure.ac
index 4e279c4bd66..fa634ecf9e0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -970,6 +970,14 @@ AC_SUBST(with_readline)
PGAC_ARG_BOOL(with, libedit-preferred, no,
[prefer BSD Libedit over GNU Readline])
+#
+# liburing
+#
+AC_MSG_CHECKING([whether to build with liburing support])
+PGAC_ARG_BOOL(with, liburing, no, [use liburing for async io],
+ [AC_DEFINE([USE_LIBURING], 1, [Define to build with io-uring support. (--with-liburing)])])
+AC_MSG_RESULT([$with_liburing])
+AC_SUBST(with_liburing)
#
# UUID library
@@ -1430,6 +1438,9 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+if test "$with_liburing" = yes; then
+ PKG_CHECK_MODULES(LIBURING, liburing)
+fi
##
## Header files
diff --git a/meson.build b/meson.build
index ea07126f78e..71200f4cb8f 100644
--- a/meson.build
+++ b/meson.build
@@ -848,6 +848,18 @@ endif
+###############################################################
+# Library: liburing
+###############################################################
+
+liburingopt = get_option('liburing')
+liburing = dependency('liburing', required: liburingopt)
+if liburing.found()
+ cdata.set('USE_LIBURING', 1)
+endif
+
+
+
###############################################################
# Library: libxml
###############################################################
@@ -3103,6 +3115,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ liburing,
libxml,
lz4,
pam,
@@ -3747,6 +3760,7 @@ if meson.version().version_compare('>=0.57')
'gss': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index b9421557606..084eebe72d7 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -103,6 +103,9 @@ option('ldap', type: 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('liburing', type : 'feature', value: 'auto',
+ description: 'Use liburing for async io')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 42f50b49761..a8ff18faed6 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -190,6 +190,7 @@ with_systemd = @with_systemd@
with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
+with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
@@ -216,6 +217,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBURING_CFLAGS = @LIBURING_CFLAGS@
+LIBURING_LIBS = @LIBURING_LIBS@
+
TCLSH = @TCLSH@
TCL_LIBS = @TCL_LIBS@
TCL_LIB_SPEC = @TCL_LIB_SPEC@
--
2.45.2.827.g557ae147e6
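For context, with the build-system patch above applied, liburing support would be enabled roughly like this. These invocations are a hypothetical sketch; only the option names (--with-liburing, the meson 'liburing' feature) come from the configure.ac / meson_options.txt hunks above.

```shell
# Hypothetical build invocations; option names taken from the patch above.

# autoconf: defines USE_LIBURING and checks for liburing via pkg-config
./configure --with-liburing

# meson: 'liburing' is a feature option defaulting to 'auto'; force-enable
# it so a missing liburing fails configuration instead of silently
# disabling io_uring support
meson setup build -Dliburing=enabled
```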
v2.0-0007-aio-Basic-subsystem-initialization.patch (text/x-diff)
From 8b3dabb0ec36a6aea6b5f9d30fadefc8748bfb9c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 10 Jun 2024 13:42:58 -0700
Subject: [PATCH v2.0 07/17] aio: Basic subsystem initialization
This is split out into a separate commit to make it easier to review the
tendrils into various places.
---
src/include/storage/aio.h | 42 +++++++++++++++++
src/include/storage/aio_init.h | 26 +++++++++++
src/backend/postmaster/postmaster.c | 8 ++++
src/backend/storage/aio/Makefile | 2 +
src/backend/storage/aio/aio.c | 35 ++++++++++++++
src/backend/storage/aio/aio_init.c | 46 +++++++++++++++++++
src/backend/storage/aio/meson.build | 2 +
src/backend/storage/ipc/ipci.c | 3 ++
src/backend/tcop/postgres.c | 7 +++
src/backend/utils/init/miscinit.c | 3 ++
src/backend/utils/init/postinit.c | 3 ++
src/backend/utils/misc/guc_tables.c | 11 +++++
src/backend/utils/misc/postgresql.conf.sample | 7 +++
src/tools/pgindent/typedefs.list | 1 +
14 files changed, 196 insertions(+)
create mode 100644 src/include/storage/aio.h
create mode 100644 src/include/storage/aio_init.h
create mode 100644 src/backend/storage/aio/aio.c
create mode 100644 src/backend/storage/aio/aio_init.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
new file mode 100644
index 00000000000..98fafcf9bc4
--- /dev/null
+++ b/src/include/storage/aio.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.h
+ * Main AIO interface
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_H
+#define AIO_H
+
+
+#include "utils/guc_tables.h"
+
+
+/* GUC related */
+extern void assign_io_method(int newval, void *extra);
+
+
+/* Enum for io_method GUC. */
+typedef enum IoMethod
+{
+ IOMETHOD_WORKER = 0,
+ IOMETHOD_IO_URING,
+} IoMethod;
+
+
+/* We'll default to bgworker. */
+#define DEFAULT_IO_METHOD IOMETHOD_WORKER
+
+
+/* GUCs */
+extern const struct config_enum_entry io_method_options[];
+extern int io_method;
+
+
+#endif /* AIO_H */
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
new file mode 100644
index 00000000000..5bcfb8a9d58
--- /dev/null
+++ b/src/include/storage/aio_init.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.h
+ * AIO initialization - kept separate as initialization sites don't need to
+ * know about AIO itself and AIO users don't need to know about initialization.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_init.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INIT_H
+#define AIO_INIT_H
+
+
+extern Size AioShmemSize(void);
+extern void AioShmemInit(void);
+
+extern void pgaio_postmaster_init(void);
+extern void pgaio_postmaster_child_init_local(void);
+extern void pgaio_postmaster_child_init(void);
+
+#endif /* AIO_INIT_H */
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a6fff93db34..921073a2ca4 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -111,6 +111,7 @@
#include "replication/logicallauncher.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
@@ -941,6 +942,13 @@ PostmasterMain(int argc, char *argv[])
ExitPostmaster(0);
}
+ /*
+	 * As AIO might create internal FDs and will trigger shared memory
+	 * allocations, this needs to happen before reset_shared() and
+	 * set_max_safe_fds().
+ */
+ pgaio_postmaster_init();
+
/*
* Set up shared memory and semaphores.
*
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 2f29a9ec4d1..eaeaeeee8e3 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -9,6 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = \
+ aio.o \
+ aio_init.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
new file mode 100644
index 00000000000..67f6b52de91
--- /dev/null
+++ b/src/backend/storage/aio/aio.c
@@ -0,0 +1,35 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.c
+ *	  Asynchronous I/O subsystem.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+
+
+/* Options for io_method. */
+const struct config_enum_entry io_method_options[] = {
+ {"worker", IOMETHOD_WORKER, false},
+#ifdef USE_LIBURING
+ {"io_uring", IOMETHOD_IO_URING, false},
+#endif
+ {NULL, 0, false}
+};
+
+int io_method = IOMETHOD_WORKER;
+
+
+void
+assign_io_method(int newval, void *extra)
+{
+}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
new file mode 100644
index 00000000000..1c277a7eb3b
--- /dev/null
+++ b/src/backend/storage/aio/aio_init.c
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.c
+ *	  Asynchronous I/O subsystem - Initialization
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_init.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio_init.h"
+
+
+Size
+AioShmemSize(void)
+{
+ Size sz = 0;
+
+ return sz;
+}
+
+void
+AioShmemInit(void)
+{
+}
+
+void
+pgaio_postmaster_init(void)
+{
+}
+
+void
+pgaio_postmaster_child_init(void)
+{
+}
+
+void
+pgaio_postmaster_child_init_local(void)
+{
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 10e1aa3b20b..8d20759ebf8 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -1,5 +1,7 @@
# Copyright (c) 2024, PostgreSQL Global Development Group
backend_sources += files(
+ 'aio.c',
+ 'aio_init.c',
'read_stream.c',
)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 6caeca3a8e6..f0227a12a7d 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -39,6 +39,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/dsm_registry.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, WaitLSNShmemSize());
+ size = add_size(size, AioShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -339,6 +341,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
WaitLSNShmemInit();
+ AioShmemInit();
}
/*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8bc6bea1135..4dc46b17b41 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -61,6 +61,7 @@
#include "replication/slot.h"
#include "replication/walsender.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
@@ -4198,6 +4199,12 @@ PostgresSingleUserMain(int argc, char *argv[],
*/
InitProcess();
+ /* AIO is needed during InitPostgres() */
+ pgaio_postmaster_init();
+ pgaio_postmaster_child_init_local();
+
+ set_max_safe_fds();
+
/*
* Now that sufficient infrastructure has been initialized, PostgresMain()
* can do the rest.
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 537d92c0cfd..b8fa2e64ffe 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -40,6 +40,7 @@
#include "postmaster/interrupt.h"
#include "postmaster/postmaster.h"
#include "replication/slotsync.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/latch.h"
@@ -137,6 +138,8 @@ InitPostmasterChild(void)
InitProcessLocalLatch();
InitializeLatchWaitSet();
+ pgaio_postmaster_child_init_local();
+
/*
* If possible, make this process a group leader, so that the postmaster
* can signal any child processes too. Not all processes will have
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 11128ea461c..f1151645242 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -43,6 +43,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -589,6 +590,8 @@ BaseInit(void)
*/
pgstat_initialize();
+ pgaio_postmaster_child_init();
+
/* Do local initialization of storage and buffer managers */
InitSync();
smgrinit();
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 521ec5591c8..4961a5f4b16 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -71,6 +71,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/large_object.h"
@@ -5196,6 +5197,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"io_method", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Selects the method of asynchronous I/O to use."),
+ NULL
+ },
+ &io_method,
+ DEFAULT_IO_METHOD, io_method_options,
+ NULL, assign_io_method, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a2..e904c3fea30 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -835,6 +835,13 @@
#include = '...' # include file
+#------------------------------------------------------------------------------
+# WIP AIO GUC docs
+#------------------------------------------------------------------------------
+
+#io_method = worker
+
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e951a9e6f3..309686627e7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1257,6 +1257,7 @@ IntervalAggState
IntoClause
InvalMessageArray
InvalidationMsgsGroup
+IoMethod
IpcMemoryId
IpcMemoryKey
IpcMemoryState
--
2.45.2.827.g557ae147e6
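For reference, the io_method GUC added in this patch is PGC_POSTMASTER, so it can only be changed with a server restart. A hypothetical session might look like this (the data directory path is a placeholder; the GUC name and values come from the patch above):

```shell
# "io_uring" is only listed in io_method_options when built with liburing
# support (USE_LIBURING); "worker" is the default.
echo "io_method = worker" >> "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" restart
psql -c 'SHOW io_method'
```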
v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch (text/x-diff)
From 4f6f260ff706c769d5e4f40e5fc23c2c3105afa2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 28 Aug 2024 14:28:36 -0400
Subject: [PATCH v2.0 08/17] aio: Skeleton IO worker infrastructure
This doesn't do anything useful on its own, but the code that needs to be
touched is independent of other changes.
Remarks:
- should completely get rid of ID assignment logic in postmaster.c
- postmaster.c badly needs a refactoring.
- dynamic increase / decrease of workers based on IO load
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/miscadmin.h | 2 +
src/include/postmaster/postmaster.h | 1 +
src/include/storage/aio_init.h | 2 +
src/include/storage/io_worker.h | 22 +++
src/include/storage/proc.h | 4 +-
src/backend/postmaster/launch_backend.c | 2 +
src/backend/postmaster/postmaster.c | 186 ++++++++++++++++--
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_worker.c | 84 ++++++++
src/backend/tcop/postgres.c | 2 +
src/backend/utils/activity/pgstat_io.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 3 +-
16 files changed, 312 insertions(+), 16 deletions(-)
create mode 100644 src/include/storage/io_worker.h
create mode 100644 src/backend/storage/aio/method_worker.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 25348e71eb9..d043445b544 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -352,6 +352,7 @@ typedef enum BackendType
B_ARCHIVER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_IO_WORKER,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SUMMARIZER,
@@ -380,6 +381,7 @@ extern PGDLLIMPORT BackendType MyBackendType;
#define AmWalReceiverProcess() (MyBackendType == B_WAL_RECEIVER)
#define AmWalSummarizerProcess() (MyBackendType == B_WAL_SUMMARIZER)
#define AmWalWriterProcess() (MyBackendType == B_WAL_WRITER)
+#define AmIoWorkerProcess() (MyBackendType == B_IO_WORKER)
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 63c12917cfe..4cc000df79e 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -62,6 +62,7 @@ extern void InitProcessGlobals(void);
extern int MaxLivePostmasterChildren(void);
extern bool PostmasterMarkPIDForWorkerNotify(int);
+extern void assign_io_workers(int newval, void *extra);
#ifdef WIN32
extern void pgwin32_register_deadchild_callback(HANDLE procHandle, DWORD procId);
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
index 5bcfb8a9d58..a38dd982fbe 100644
--- a/src/include/storage/aio_init.h
+++ b/src/include/storage/aio_init.h
@@ -23,4 +23,6 @@ extern void pgaio_postmaster_init(void);
extern void pgaio_postmaster_child_init_local(void);
extern void pgaio_postmaster_child_init(void);
+extern bool pgaio_workers_enabled(void);
+
#endif /* AIO_INIT_H */
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
new file mode 100644
index 00000000000..ba5dcb9e6e4
--- /dev/null
+++ b/src/include/storage/io_worker.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_worker.h
+ * IO worker for implementing AIO "ourselves"
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  src/include/storage/io_worker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_WORKER_H
+#define IO_WORKER_H
+
+
+extern void IoWorkerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
+
+extern int io_workers;
+
+#endif /* IO_WORKER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index deeb06c9e01..b466ba843d6 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -442,7 +442,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 6
+#define MAX_IO_WORKERS 32
+#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
+
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index 0ae23fdf55e..78429b2af2f 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -55,6 +55,7 @@
#include "replication/walreceiver.h"
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -199,6 +200,7 @@ static child_process_kind child_process_kinds[] = {
[B_ARCHIVER] = {"archiver", PgArchiverMain, true},
[B_BG_WRITER] = {"bgwriter", BackgroundWriterMain, true},
[B_CHECKPOINTER] = {"checkpointer", CheckpointerMain, true},
+ [B_IO_WORKER] = {"io_worker", IoWorkerMain, true},
[B_STARTUP] = {"startup", StartupProcessMain, true},
[B_WAL_RECEIVER] = {"wal_receiver", WalReceiverMain, true},
[B_WAL_SUMMARIZER] = {"wal_summarizer", WalSummarizerMain, true},
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 921073a2ca4..fc3901d5347 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -113,6 +113,7 @@
#include "replication/walsender.h"
#include "storage/aio_init.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
@@ -321,6 +322,7 @@ typedef enum
* ckpt */
PM_SHUTDOWN_2, /* waiting for archiver and walsenders to
* finish */
+ PM_SHUTDOWN_IO, /* waiting for io workers to exit */
PM_WAIT_DEAD_END, /* waiting for dead_end children to exit */
PM_NO_CHILDREN, /* all important children have exited */
} PMState;
@@ -382,6 +384,10 @@ bool LoadedSSL = false;
static DNSServiceRef bonjour_sdref = NULL;
#endif
+/* State for IO worker management. */
+static int io_worker_count = 0;
+static pid_t io_worker_pids[MAX_IO_WORKERS];
+
/*
* postmaster.c - function prototypes
*/
@@ -420,6 +426,9 @@ static int CountChildren(int target);
static Backend *assign_backendlist_entry(void);
static void LaunchMissingBackgroundProcesses(void);
static void maybe_start_bgworkers(void);
+static bool maybe_reap_io_worker(int pid);
+static void maybe_adjust_io_workers(void);
+static void signal_io_workers(int signal);
static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(BackendType type);
static void StartAutovacuumWorker(void);
@@ -1334,6 +1343,11 @@ PostmasterMain(int argc, char *argv[])
*/
AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
/* Start bgwriter and checkpointer so they can help with recovery */
if (CheckpointerPID == 0)
CheckpointerPID = StartChildProcess(B_CHECKPOINTER);
@@ -1346,7 +1360,6 @@ PostmasterMain(int argc, char *argv[])
StartupPID = StartChildProcess(B_STARTUP);
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -1995,6 +2008,7 @@ process_pm_reload_request(void)
signal_child(SysLoggerPID, SIGHUP);
if (SlotSyncWorkerPID != 0)
signal_child(SlotSyncWorkerPID, SIGHUP);
+ signal_io_workers(SIGHUP);
/* Reload authentication config files too */
if (!load_hba())
@@ -2527,6 +2541,22 @@ process_pm_child_exit(void)
}
}
+ /* Was it an IO worker? */
+ if (maybe_reap_io_worker(pid))
+ {
+ if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ HandleChildCrash(pid, exitstatus, _("io worker"));
+
+ maybe_adjust_io_workers();
+
+ if (io_worker_count == 0 &&
+ pmState >= PM_SHUTDOWN_IO)
+ {
+ pmState = PM_WAIT_DEAD_END;
+ }
+ continue;
+ }
+
/*
* We don't know anything about this child process. That's highly
* unexpected, as we do track all the child processes that we fork.
@@ -2763,6 +2793,9 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
if (SlotSyncWorkerPID != 0)
sigquit_child(SlotSyncWorkerPID);
+ /* Take care of io workers too */
+ signal_io_workers(SIGQUIT);
+
/* We do NOT restart the syslogger */
}
@@ -2986,10 +3019,11 @@ PostmasterStateMachine(void)
FatalError = true;
pmState = PM_WAIT_DEAD_END;
- /* Kill the walsenders and archiver too */
+ /* Kill walsenders, archiver and aio workers too */
SignalChildren(SIGQUIT);
if (PgArchPID != 0)
signal_child(PgArchPID, SIGQUIT);
+ signal_io_workers(SIGQUIT);
}
}
}
@@ -2999,16 +3033,26 @@ PostmasterStateMachine(void)
{
/*
* PM_SHUTDOWN_2 state ends when there's no other children than
- * dead_end children left. There shouldn't be any regular backends
- * left by now anyway; what we're really waiting for is walsenders and
- * archiver.
+ * dead_end children and aio workers left. There shouldn't be any
+ * regular backends left by now anyway; what we're really waiting for
+ * is walsenders and archiver.
*/
if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0)
{
- pmState = PM_WAIT_DEAD_END;
+ pmState = PM_SHUTDOWN_IO;
+ signal_io_workers(SIGUSR2);
}
}
+ if (pmState == PM_SHUTDOWN_IO)
+ {
+ /*
+	 * PM_SHUTDOWN_IO state ends when there are only dead_end children left.
+ */
+ if (io_worker_count == 0)
+ pmState = PM_WAIT_DEAD_END;
+ }
+
if (pmState == PM_WAIT_DEAD_END)
{
/* Don't allow any new socket connection events. */
@@ -3016,17 +3060,22 @@ PostmasterStateMachine(void)
/*
* PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
- * (ie, no dead_end children remain), and the archiver is gone too.
+ * (ie, no dead_end children remain), and the archiver and aio workers
+ * are all gone too.
*
- * The reason we wait for those two is to protect them against a new
+ * We need to wait for those because we might have transitioned
+ * directly to PM_WAIT_DEAD_END due to immediate shutdown or fatal
+ * error. Note that they have already been sent appropriate shutdown
+ * signals, either during a normal state transition leading up to
+ * PM_WAIT_DEAD_END, or during FatalError processing.
+ *
+ * The reason we wait for those is to protect them against a new
* postmaster starting conflicting subprocesses; this isn't an
* ironclad protection, but it at least helps in the
- * shutdown-and-immediately-restart scenario. Note that they have
- * already been sent appropriate shutdown signals, either during a
- * normal state transition leading up to PM_WAIT_DEAD_END, or during
- * FatalError processing.
+ * shutdown-and-immediately-restart scenario.
*/
- if (dlist_is_empty(&BackendList) && PgArchPID == 0)
+ if (dlist_is_empty(&BackendList) && io_worker_count == 0
+ && PgArchPID == 0)
{
/* These other guys should be dead already */
Assert(StartupPID == 0);
@@ -3119,10 +3168,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
StartupPID = StartChildProcess(B_STARTUP);
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* crash recovery started, reset SIGKILL flag */
AbortStartTime = 0;
@@ -3374,6 +3427,7 @@ TerminateChildren(int signal)
signal_child(PgArchPID, signal);
if (SlotSyncWorkerPID != 0)
signal_child(SlotSyncWorkerPID, signal);
+ signal_io_workers(signal);
}
/*
@@ -3955,6 +4009,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
{
case PM_NO_CHILDREN:
case PM_WAIT_DEAD_END:
+ case PM_SHUTDOWN_IO:
case PM_SHUTDOWN_2:
case PM_SHUTDOWN:
case PM_WAIT_BACKENDS:
@@ -4148,6 +4203,109 @@ maybe_start_bgworkers(void)
}
}
+static bool
+maybe_reap_io_worker(int pid)
+{
+ for (int id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_pids[id] == pid)
+ {
+ --io_worker_count;
+ io_worker_pids[id] = 0;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void
+maybe_adjust_io_workers(void)
+{
+ /* ATODO: This will need to check if io_method == worker */
+
+ /*
+ * If we're in final shutting down state, then we're just waiting for all
+ * processes to exit.
+ */
+ if (pmState >= PM_SHUTDOWN_IO)
+ return;
+
+ /* Don't start new workers during an immediate shutdown either. */
+ if (Shutdown >= ImmediateShutdown)
+ return;
+
+ /*
+ * Don't start new workers if we're in the shutdown phase of a crash
+ * restart. But we *do* need to start if we're already starting up again.
+ */
+ if (FatalError && pmState >= PM_STOP_BACKENDS)
+ return;
+
+ /* Not enough running? */
+ while (io_worker_count < io_workers)
+ {
+ int pid;
+ int id;
+
+ /* Find the lowest unused IO worker ID. */
+
+ /*
+ * AFIXME: This logic doesn't work right now, the ids aren't
+ * transported to workers anymore.
+ */
+ for (id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_pids[id] == 0)
+ break;
+ }
+ if (id == MAX_IO_WORKERS)
+ elog(ERROR, "could not find a free IO worker ID");
+
+ Assert(pmState < PM_SHUTDOWN_IO);
+
+ /* Try to launch one. */
+ pid = StartChildProcess(B_IO_WORKER);
+ if (pid > 0)
+ {
+ io_worker_pids[id] = pid;
+ ++io_worker_count;
+ }
+ else
+ break; /* XXX try again soon? */
+ }
+
+ /* Too many running? */
+ if (io_worker_count > io_workers)
+ {
+ /* Ask the highest used IO worker ID to exit. */
+ for (int id = MAX_IO_WORKERS - 1; id >= 0; --id)
+ {
+ if (io_worker_pids[id] != 0)
+ {
+ kill(io_worker_pids[id], SIGUSR2);
+ break;
+ }
+ }
+ }
+}
+
+static void
+signal_io_workers(int signal)
+{
+ for (int i = 0; i < MAX_IO_WORKERS; ++i)
+ if (io_worker_pids[i] != 0)
+ signal_child(io_worker_pids[i], signal);
+}
+
+void
+assign_io_workers(int newval, void *extra)
+{
+ io_workers = newval;
+ if (!IsUnderPostmaster && pmState > PM_INIT)
+ maybe_adjust_io_workers();
+}
+
+
/*
* When a backend asks to be notified about worker state changes, we
* set a flag in its backend entry. The background worker machinery needs
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index eaeaeeee8e3..824682e7354 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
aio_init.o \
+ method_worker.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8d20759ebf8..e13728b73da 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -3,5 +3,6 @@
backend_sources += files(
'aio.c',
'aio_init.c',
+ 'method_worker.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
new file mode 100644
index 00000000000..5df2eea4a03
--- /dev/null
+++ b/src/backend/storage/aio/method_worker.c
@@ -0,0 +1,84 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_worker.c
+ * AIO implementation using workers
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_worker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/interrupt.h"
+#include "storage/io_worker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "tcop/tcopprot.h"
+#include "utils/wait_event.h"
+
+
+int io_workers = 3;
+
+
+void
+IoWorkerMain(char *startup_data, size_t startup_data_len)
+{
+ sigjmp_buf local_sigjmp_buf;
+
+ MyBackendType = B_IO_WORKER;
+
+ /* TODO review all signals */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, die); /* to allow manually triggering worker restart */
+
+ /*
+ * Ignore SIGTERM, will get explicit shutdown via SIGUSR2 later in the
+ * shutdown sequence, similar to checkpointer.
+ */
+ pqsignal(SIGTERM, SIG_IGN);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* see PostgresMain() */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ error_context_stack = NULL;
+ HOLD_INTERRUPTS();
+
+ /*
+ * We normally shouldn't get errors here. Need to do just enough error
+ * recovery so that we can mark the IO as failed and then exit.
+ */
+ LWLockReleaseAll();
+
+ /* TODO: recover from IO errors */
+
+ EmitErrorReport();
+ proc_exit(1);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ while (!ShutdownRequestPending)
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ proc_exit(0);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 4dc46b17b41..d42546db195 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3294,6 +3294,8 @@ ProcessInterrupts(void)
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating background worker \"%s\" due to administrator command",
MyBgworkerEntry->bgw_type)));
+ else if (AmIoWorkerProcess())
+ proc_exit(0);
else
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 8af55989eed..a750caa9b2a 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -335,6 +335,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
{
case B_INVALID:
case B_ARCHIVER:
+ case B_IO_WORKER:
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6f..47a2c4d126b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN "Waiting in main loop of autovacuum launcher process."
BGWRITER_HIBERNATE "Waiting in background writer process, hibernating."
BGWRITER_MAIN "Waiting in main loop of background writer process."
CHECKPOINTER_MAIN "Waiting in main loop of checkpointer process."
+IO_WORKER_MAIN "Waiting in main loop of IO worker process."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index b8fa2e64ffe..bedeed588d3 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_IO_WORKER:
+ backendDesc = "io worker";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4961a5f4b16..5670f40478a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -74,6 +74,7 @@
#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
+#include "storage/io_worker.h"
#include "storage/large_object.h"
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
@@ -3201,6 +3202,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_workers",
+ PGC_SIGHUP,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO worker processes, for io_method=worker."),
+ NULL,
+ },
+ &io_workers,
+ 3, 1, MAX_IO_WORKERS,
+ NULL, assign_io_workers, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e904c3fea30..90430381efa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -839,7 +839,8 @@
# WIP AIO GUC docs
#------------------------------------------------------------------------------
-#io_method = worker
+#io_method = worker # (change requires restart)
+#io_workers = 3 # 1-32
#------------------------------------------------------------------------------
--
2.45.2.827.g557ae147e6
Attachment: v2.0-0009-aio-Basic-AIO-implementation.patch (text/x-diff)
From 0de554082f3ff6468ff352000774245b337d6d64 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:23:37 -0400
Subject: [PATCH v2.0 09/17] aio: Basic AIO implementation
At this point nothing can use AIO - this commit does not include any
implementation of aio subjects / callbacks. That will come in later commits.
Todo:
- implement "synchronous" AIO method
- split worker, io_uring methods out into separate commits
- lots of cleanup
---
src/include/storage/aio.h | 308 ++++++
src/include/storage/aio_internal.h | 274 +++++
src/include/storage/aio_ref.h | 24 +
src/include/storage/lwlock.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/utils/resowner.h | 7 +
src/backend/access/transam/xact.c | 9 +
src/backend/postmaster/postmaster.c | 3 +-
src/backend/storage/aio/Makefile | 3 +
src/backend/storage/aio/aio.c | 963 +++++++++++++++++-
src/backend/storage/aio/aio_init.c | 318 ++++++
src/backend/storage/aio/aio_io.c | 111 ++
src/backend/storage/aio/aio_subject.c | 170 ++++
src/backend/storage/aio/meson.build | 3 +
src/backend/storage/aio/method_io_uring.c | 393 +++++++
src/backend/storage/aio/method_worker.c | 413 +++++++-
src/backend/storage/lmgr/lwlock.c | 1 +
.../utils/activity/wait_event_names.txt | 4 +
src/backend/utils/misc/guc_tables.c | 25 +
src/backend/utils/misc/postgresql.conf.sample | 6 +
src/backend/utils/resowner/resowner.c | 51 +
src/tools/pgindent/typedefs.list | 23 +
22 files changed, 3104 insertions(+), 7 deletions(-)
create mode 100644 src/include/storage/aio_internal.h
create mode 100644 src/include/storage/aio_ref.h
create mode 100644 src/backend/storage/aio/aio_io.c
create mode 100644 src/backend/storage/aio/aio_subject.c
create mode 100644 src/backend/storage/aio/method_io_uring.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 98fafcf9bc4..65052462b02 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -15,9 +15,315 @@
#define AIO_H
+#include "storage/aio_ref.h"
+#include "storage/procnumber.h"
#include "utils/guc_tables.h"
+typedef struct PgAioHandle PgAioHandle;
+
+typedef enum PgAioOp
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ PGAIO_OP_INVALID = 0,
+
+ PGAIO_OP_READ,
+ PGAIO_OP_WRITE,
+
+ PGAIO_OP_FSYNC,
+
+ PGAIO_OP_FLUSH_RANGE,
+
+ PGAIO_OP_NOP,
+
+ /**
+ * Eventually we'll additionally want at least:
+ * - send
+ * - recv
+ * - accept
+ **/
+} PgAioOp;
+
+#define PGAIO_OP_COUNT (PGAIO_OP_NOP + 1)
+
+
+/*
+ * What is the IO being performed on?
+ *
+ * Subject specific behaviour should be implemented in
+ * aio_subject.c.
+ */
+typedef enum PgAioSubjectID
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ ASI_INVALID = 0,
+} PgAioSubjectID;
+
+#define ASI_COUNT (ASI_INVALID + 1)
+
+/*
+ * Flags for an IO that can be set with pgaio_io_set_flag().
+ */
+typedef enum PgAioHandleFlags
+{
+ AHF_REFERENCES_LOCAL = 1 << 0,
+} PgAioHandleFlags;
+
+
+/*
+ * IDs for callbacks that can be registered on an IO.
+ *
+ * Callbacks are identified by an ID rather than a function pointer. There are
+ * two main reasons:
+ *
+ * 1) Memory within PgAioHandle is precious, due to the number of PgAioHandle
+ * structs in pre-allocated shared memory.
+ *
+ * 2) Due to EXEC_BACKEND, function pointers are not necessarily stable between
+ * different backends, therefore function pointers cannot directly be stored in
+ * shared memory.
+ *
+ * Without 2), we could fairly easily allow adding new callbacks, by filling an
+ * ID->pointer mapping table on demand. In the presence of 2), that's still
+ * doable, but harder, because every process has to re-register the pointers
+ * so that a local ID->"backend local pointer" mapping can be maintained.
+ */
+typedef enum PgAioHandleSharedCallbackID
+{
+ ASC_PLACEHOLDER /* empty enums are invalid */ ,
+} PgAioHandleSharedCallbackID;
+
+
+/*
+ * Data necessary for basic IO types (PgAioOp).
+ *
+ * NB: The FDs in here may *not* be relied upon for re-issuing
+ * requests (e.g. for partial reads/writes) - the FD might be from another
+ * process, or closed since. That's not a problem for IOs waiting to be issued
+ * only because the queue is flushed when closing an FD.
+ */
+typedef union
+{
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } read;
+
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } write;
+
+ struct
+ {
+ int fd;
+ bool datasync;
+ } fsync;
+
+ struct
+ {
+ int fd;
+ uint32 nbytes;
+ uint64 offset;
+ } flush_range;
+} PgAioOpData;
+
+
+/* XXX: Perhaps it's worth moving this to a dedicated file? */
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+typedef union PgAioSubjectData
+{
+ /* just as an example placeholder for later */
+ struct
+ {
+ uint32 queue_id;
+ } wal;
+} PgAioSubjectData;
+
+
+
+typedef enum PgAioResultStatus
+{
+ ARS_UNKNOWN,
+ ARS_OK,
+ ARS_PARTIAL,
+ ARS_ERROR,
+} PgAioResultStatus;
+
+typedef struct PgAioResult
+{
+ PgAioHandleSharedCallbackID id:8;
+ PgAioResultStatus status:2;
+ uint32 error_data:22;
+ int32 result;
+} PgAioResult;
+
+typedef struct PgAioReturn
+{
+ PgAioResult result;
+ PgAioSubjectData subject_data;
+} PgAioReturn;
+
+
+typedef struct PgAioSubjectInfo
+{
+ void (*reopen) (PgAioHandle *ioh);
+
+#ifdef NOT_YET
+ char *(*describe_identity) (PgAioHandle *ioh);
+#endif
+
+ const char *name;
+} PgAioSubjectInfo;
+
+
+typedef PgAioResult (*PgAioHandleSharedCallbackComplete) (PgAioHandle *ioh, PgAioResult prior_result);
+typedef void (*PgAioHandleSharedCallbackPrepare) (PgAioHandle *ioh);
+typedef void (*PgAioHandleSharedCallbackError) (PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+typedef struct PgAioHandleSharedCallbacks
+{
+ PgAioHandleSharedCallbackPrepare prepare;
+ PgAioHandleSharedCallbackComplete complete;
+ PgAioHandleSharedCallbackError error;
+} PgAioHandleSharedCallbacks;
+
+
+
+typedef struct PgAioBounceBuffer PgAioBounceBuffer;
+
+
+/*
+ * How many callbacks can be registered for one IO handle. Currently we only
+ * need two, but it's not hard to imagine needing a few more.
+ */
+#define AIO_MAX_SHARED_CALLBACKS 4
+
+
+
+/* AIO API */
+
+
+/* --------------------------------------------------------------------------------
+ * IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+struct ResourceOwnerData;
+extern PgAioHandle *pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+extern PgAioHandle *pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+
+extern void pgaio_io_release(PgAioHandle *ioh);
+extern void pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error);
+
+extern void pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior);
+
+extern void pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid);
+extern void pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag);
+
+extern void pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid);
+
+extern void pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len);
+extern void pgaio_io_set_io_data_64(PgAioHandle *ioh, uint64 *data, uint8 len);
+extern uint64 *pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len);
+
+extern void pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op);
+
+extern int pgaio_io_get_id(PgAioHandle *ioh);
+struct iovec;
+extern int pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov);
+extern bool pgaio_io_has_subject(PgAioHandle *ioh);
+
+extern PgAioSubjectData *pgaio_io_get_subject_data(PgAioHandle *ioh);
+extern PgAioOpData *pgaio_io_get_op_data(PgAioHandle *ioh);
+extern ProcNumber pgaio_io_get_owner(PgAioHandle *ioh);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO References
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_ref_clear(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_valid(PgAioHandleRef *ior);
+extern int pgaio_io_ref_get_id(PgAioHandleRef *ior);
+
+
+extern void pgaio_io_ref_wait(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_check_done(PgAioHandleRef *ior);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data,
+ int elevel);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+extern PgAioBounceBuffer *pgaio_bounce_buffer_get(void);
+extern void pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb);
+extern uint32 pgaio_bounce_buffer_id(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release(PgAioBounceBuffer *bb);
+extern char *pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_submit_staged(void);
+extern bool pgaio_have_staged(void);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Low level IO preparation routines
+ *
+ * These will often be called by the code at the lowest level of initiating an
+ * IO. E.g. bufmgr.c may initiate IO for a buffer, but pgaio_io_prep_readv()
+ * will be called from within fd.c.
+ *
+ * Implemented in aio_io.c
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+extern void pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_closing_fd(int fd);
+extern void pgaio_at_xact_end(bool is_subxact, bool is_commit);
+extern void pgaio_at_error(void);
+
+
/* GUC related */
extern void assign_io_method(int newval, void *extra);
@@ -37,6 +343,8 @@ typedef enum IoMethod
/* GUCs */
extern const struct config_enum_entry io_method_options[];
extern int io_method;
+extern int io_max_concurrency;
+extern int io_bounce_buffers;
#endif /* AIO_H */
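[Editor's aside: the comment above PgAioHandleSharedCallbackID explains why callbacks are identified by a small integer ID rather than a function pointer. A minimal sketch of that dispatch scheme, outside the patch: only the ID is stored in (what would be) shared memory, and each process resolves it through a table in its own address space, which is what keeps it safe under EXEC_BACKEND. All names below are illustrative, not part of the patch.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the callback-ID scheme: the ID (a small
 * integer, cheap to store in a shared struct) is looked up in a
 * process-local table, so differing function addresses across
 * EXEC_BACKEND processes don't matter. */
typedef int (*example_complete_cb) (int raw_result);

static int
example_ok_cb(int raw_result)
{
	/* treat any non-negative raw result as success */
	return raw_result >= 0;
}

/* process-local table, indexed by the ID that lives in shared memory */
static const example_complete_cb example_cb_table[] = {
	NULL,						/* 0: invalid, catches zeroed shared memory */
	example_ok_cb,				/* 1 */
};

static int
example_dispatch(int cb_id, int raw_result)
{
	assert(cb_id > 0 &&
		   cb_id < (int) (sizeof(example_cb_table) / sizeof(example_cb_table[0])));
	return example_cb_table[cb_id] (raw_result);
}
```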
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
new file mode 100644
index 00000000000..67d994cc0b1
--- /dev/null
+++ b/src/include/storage/aio_internal.h
@@ -0,0 +1,274 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_internal.h
+ *    AIO related declarations that should only be used by the AIO subsystem.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INTERNAL_H
+#define AIO_INTERNAL_H
+
+
+#include "lib/ilist.h"
+#include "port/pg_iovec.h"
+#include "storage/aio.h"
+#include "storage/condition_variable.h"
+
+
+#define PGAIO_VERBOSE
+
+
+/* AFIXME */
+#define PGAIO_SUBMIT_BATCH_SIZE 32
+
+
+
+typedef enum PgAioHandleState
+{
+ /* not in use */
+ AHS_IDLE = 0,
+
+ /* returned by pgaio_io_get() */
+ AHS_HANDED_OUT,
+
+ /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+ AHS_DEFINED,
+
+ /* subject's prepare() callback has been called */
+ AHS_PREPARED,
+
+ /* IO is being executed */
+ AHS_IN_FLIGHT,
+
+ /* IO finished, but result has not yet been processed */
+ AHS_REAPED,
+
+ /* IO completed, shared completion has been called */
+ AHS_COMPLETED_SHARED,
+
+ /* IO completed, local completion has been called */
+ AHS_COMPLETED_LOCAL,
+} PgAioHandleState;
+
+
+struct ResourceOwnerData;
+
+/* typedef is in public header */
+struct PgAioHandle
+{
+ PgAioHandleState state:8;
+
+ /* what are we operating on */
+ PgAioSubjectID subject:8;
+
+ /* which operation */
+ PgAioOp op:8;
+
+ /* bitfield of PgAioHandleFlags */
+ uint8 flags;
+
+ uint8 num_shared_callbacks;
+
+ /* using the proper type here would use more space */
+ uint8 shared_callbacks[AIO_MAX_SHARED_CALLBACKS];
+
+ uint8 iovec_data_len;
+
+ /* XXX: could be optimized out with some pointer math */
+ int32 owner_procno;
+
+ /* FIXME: remove in favor of distilled_result */
+ /* raw result of the IO operation */
+ int32 result;
+
+ /* index into PgAioCtl->iovecs */
+ uint32 iovec_off;
+
+ /*
+ * List of bounce_buffers owned by the IO. It would suffice to use an
+ * index-based linked list here.
+ */
+ slist_head bounce_buffers;
+
+ /**
+ * In which list the handle is registered, depends on the state:
+ * - IDLE, in per-backend list
+ * - HANDED_OUT - not in a list
+ * - DEFINED - in per-backend staged list
+ * - PREPARED - in per-backend staged list
+ * - IN_FLIGHT - not in any list
+ * - REAPED - in per-reap context list
+ * - COMPLETED_SHARED - not in any list
+ * - COMPLETED_LOCAL - not in any list
+ *
+ * XXX: It probably makes sense to optimize this out to save on per-io
+ * memory at the cost of per-backend memory.
+ **/
+ dlist_node node;
+
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+
+ /* incremented every time the IO handle is reused */
+ uint64 generation;
+
+ ConditionVariable cv;
+
+ /* result of shared callback, passed to issuer callback */
+ PgAioResult distilled_result;
+
+ PgAioReturn *report_return;
+
+ PgAioOpData op_data;
+
+ /*
+ * Data necessary for shared completions. Needs to be sufficient to allow
+ * another backend to retry an IO.
+ */
+ PgAioSubjectData scb_data;
+};
+
+
+struct PgAioBounceBuffer
+{
+ slist_node node;
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+ char *buffer;
+};
+
+
+typedef struct PgAioPerBackend
+{
+ /* index into PgAioCtl->io_handles */
+ uint32 io_handle_off;
+
+ /* index into PgAioCtl->bounce_buffers */
+ uint32 bounce_buffers_off;
+
+ /* IO Handles that currently are not used */
+ dclist_head idle_ios;
+
+ /*
+ * Only one IO may be handed out by pgaio_io_get()/pgaio_io_get_nb() without
+ * having been either defined (by actually associating it with an IO) or
+ * released (with pgaio_io_release()). This restriction is necessary to
+ * guarantee that we can always acquire an IO. ->handed_out_io is used to
+ * enforce that rule.
+ */
+ PgAioHandle *handed_out_io;
+
+ /*
+ * IOs that are defined, but not yet submitted.
+ */
+ dclist_head staged_ios;
+
+ /* Bounce Buffers that currently are not used */
+ slist_head idle_bbs;
+
+ /* see handed_out_io */
+ PgAioBounceBuffer *handed_out_bb;
+} PgAioPerBackend;
+
+
+typedef struct PgAioCtl
+{
+ int backend_state_count;
+ PgAioPerBackend *backend_state;
+
+ /*
+ * Array of iovec structs. Each iovec is owned by a specific backend. The
+ * allocation is in PgAioCtl to allow the maximum number of iovecs for
+ * individual IOs to be configurable with a PGC_POSTMASTER GUC.
+ */
+ uint64 iovec_count;
+ struct iovec *iovecs;
+
+ /*
+ * For, e.g., an IO covering multiple buffers in shared / temp buffers, we
+ * need to get Buffer IDs during completion to be able to change the
+ * BufferDesc state accordingly. This space can be used to store e.g.
+ * Buffer IDs. Note that the actual iovec might be shorter than this,
+ * because we combine neighboring pages into one larger iovec entry.
+ */
+ uint64 *iovecs_data;
+
+ /*
+ * To perform AIO on data that is not located in shared memory, either
+ * because it isn't in shared memory at all, or because we need to operate
+ * on a copy, as e.g. is the case for writes when checksums are in use.
+ */
+ uint64 bounce_buffers_count;
+ PgAioBounceBuffer *bounce_buffers;
+ char *bounce_buffers_data;
+
+ uint64 io_handle_count;
+ PgAioHandle *io_handles;
+} PgAioCtl;
+
+
+
+/*
+ * The set of callbacks that each IO method must implement.
+ */
+typedef struct IoMethodOps
+{
+ /* initialization */
+ size_t (*shmem_size) (void);
+ void (*shmem_init) (bool first_time);
+
+ void (*postmaster_init) (void);
+ void (*postmaster_child_init_local) (void);
+ void (*postmaster_child_init) (void);
+
+ /* teardown */
+ void (*postmaster_before_child_exit) (void);
+
+ /* handling of IOs */
+ int (*submit) (void);
+
+ void (*wait_one) (PgAioHandle *ioh,
+ uint64 ref_generation);
+
+ /* properties */
+ bool can_scatter_gather_direct;
+ bool can_scatter_gather_buffered;
+} IoMethodOps;
+
+
+extern bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state);
+
+extern void pgaio_io_prepare_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion(PgAioHandle *ioh, int result);
+extern void pgaio_io_prepare_submit(PgAioHandle *ioh);
+
+extern bool pgaio_io_needs_synchronously(PgAioHandle *ioh);
+extern void pgaio_io_perform_synchronously(PgAioHandle *ioh);
+
+extern void pgaio_io_reopen(PgAioHandle *ioh);
+
+extern const char *pgaio_io_get_subject_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+
+/* Declarations for the tables of function pointers exposed by each IO method. */
+extern const IoMethodOps pgaio_worker_ops;
+#ifdef USE_LIBURING
+extern const IoMethodOps pgaio_uring_ops;
+#endif
+
+extern const IoMethodOps *pgaio_impl;
+extern PgAioCtl *aio_ctl;
+extern PgAioPerBackend *my_aio;
+
+
+
+#endif /* AIO_INTERNAL_H */
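[Editor's aside: PGAIO_SUBMIT_BATCH_SIZE above bounds how many staged-but-unsubmitted IOs a backend accumulates; pgaio_io_get_nb() flushes the staged list once that limit is reached. A toy model of the trigger, with plain counters standing in for the real handle lists; the names and the "submit" stand-in are illustrative, not from the patch.]

```c
#include <assert.h>

/* toy stand-in for PGAIO_SUBMIT_BATCH_SIZE */
#define EXAMPLE_BATCH_SIZE 32

static int	example_staged = 0;		/* stands in for dclist_count(&my_aio->staged_ios) */
static int	example_submissions = 0;	/* counts stand-in pgaio_submit_staged() calls */

static void
example_stage_one(void)
{
	/* flush the accumulated batch before staging another IO */
	if (example_staged >= EXAMPLE_BATCH_SIZE)
	{
		example_submissions++;	/* would call pgaio_submit_staged() here */
		example_staged = 0;
	}
	example_staged++;
}
```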
diff --git a/src/include/storage/aio_ref.h b/src/include/storage/aio_ref.h
new file mode 100644
index 00000000000..ad7e9ad34f3
--- /dev/null
+++ b/src/include/storage/aio_ref.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_ref.h
+ *    Definition of PgAioHandleRef, which sometimes needs to be used in
+ *    headers.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_ref.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_REF_H
+#define AIO_REF_H
+
+typedef struct PgAioHandleRef
+{
+ uint32 aio_index;
+ uint32 generation_upper;
+ uint32 generation_lower;
+} PgAioHandleRef;
+
+#endif /* AIO_REF_H */
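[Editor's aside: splitting the 64-bit generation into generation_upper/generation_lower presumably keeps PgAioHandleRef free of 8-byte alignment requirements and the padding they would bring. A sketch of packing and reassembling such a split generation; the helper names are illustrative, not from the patch.]

```c
#include <assert.h>
#include <stdint.h>

/* mirrors the layout of PgAioHandleRef: three uint32 fields, no
 * 64-bit member, so the struct only needs 4-byte alignment */
typedef struct ExampleRef
{
	uint32_t	aio_index;
	uint32_t	generation_upper;
	uint32_t	generation_lower;
} ExampleRef;

static void
ref_set_generation(ExampleRef *ref, uint64_t generation)
{
	/* split the 64-bit counter into two 32-bit halves */
	ref->generation_upper = (uint32_t) (generation >> 32);
	ref->generation_lower = (uint32_t) generation;
}

static uint64_t
ref_get_generation(const ExampleRef *ref)
{
	/* reassemble the halves for comparison with the handle's generation */
	return ((uint64_t) ref->generation_upper) << 32 | ref->generation_lower;
}
```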
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 00e8022fbad..f4e6abce327 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -217,6 +217,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SUBTRANS_SLRU,
LWTRANCHE_XACT_SLRU,
LWTRANCHE_PARALLEL_VACUUM_DSA,
+ LWTRANCHE_AIO_URING_COMPLETION,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 88dc79b2bd6..7aaccf69d1e 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, WaitLSN)
+PG_LWLOCK(54, AioWorkerSubmissionQueue)
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index 4e534bc3e70..0cdd0c13ffb 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -164,4 +164,11 @@ struct LOCALLOCK;
extern void ResourceOwnerRememberLock(ResourceOwner owner, struct LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *locallock);
+/* special support for AIO */
+struct dlist_node;
+extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+extern void ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+
#endif /* RESOWNER_H */
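[Editor's aside: these hooks hand the resource owner only the embedded dlist_node; pgaio_io_release_resowner() later recovers the containing handle with dlist_container(), i.e. offsetof arithmetic. A toy version of that embedded-node pattern with hypothetical types, to show why passing just the node is enough.]

```c
#include <assert.h>
#include <stddef.h>

/* minimal stand-ins for dlist_node and a handle embedding one */
typedef struct toy_node
{
	struct toy_node *prev;
	struct toy_node *next;
} toy_node;

typedef struct toy_handle
{
	int			id;
	toy_node	resowner_node;	/* what the resource owner remembers */
} toy_handle;

/* same trick as dlist_container(): subtract the field offset to get
 * back from the embedded node to the enclosing struct */
#define toy_container(type, field, ptr) \
	((type *) ((char *) (ptr) - offsetof(type, field)))
```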
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0fe1630fca8..cb4ee5dfd1f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -52,6 +52,7 @@
#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -2462,6 +2463,8 @@ CommitTransaction(void)
AtEOXact_LogicalRepWorkers(true);
pgstat_report_xact_timestamp(0);
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ true);
+
ResourceOwnerDelete(TopTransactionResourceOwner);
s->curTransactionOwner = NULL;
CurTransactionResourceOwner = NULL;
@@ -2976,6 +2979,8 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ false);
+
/*
* State remains TRANS_ABORT until CleanupTransaction().
*/
@@ -5185,6 +5190,8 @@ CommitSubTransaction(void)
AtEOSubXact_PgStat(true, s->nestingLevel);
AtSubCommit_Snapshot(s->nestingLevel);
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ true);
+
/*
* We need to restore the upper transaction's read-only state, in case the
* upper is read-write while the child is read-only; GUC will incorrectly
@@ -5350,6 +5357,8 @@ AbortSubTransaction(void)
AtSubAbort_Snapshot(s->nestingLevel);
}
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ false);
+
/*
* Restore the upper transaction's read-only state, too. This should be
* redundant with GUC's cleanup but we may as well do it for consistency
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fc3901d5347..71930094309 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4221,7 +4221,8 @@ maybe_reap_io_worker(int pid)
static void
maybe_adjust_io_workers(void)
{
- /* ATODO: This will need to check if io_method == worker */
+ if (!pgaio_workers_enabled())
+ return;
/*
* If we're in final shutting down state, then we're just waiting for all
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 824682e7354..2a5e72a8024 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -10,8 +10,11 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
+ aio_io.o \
aio_init.o \
+ aio_subject.o \
method_worker.o \
+ method_io_uring.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 67f6b52de91..d6f9f658b97 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -14,7 +14,23 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "utils/resowner.h"
+#include "utils/wait_event_types.h"
+
+
+
+static void pgaio_io_reclaim(PgAioHandle *ioh);
+static void pgaio_io_resowner_register(PgAioHandle *ioh);
+static void pgaio_io_wait_for_free(void);
+static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation);
+
+static void pgaio_bounce_buffer_wait_for_free(void);
+
/* Options for io_method. */
@@ -26,10 +42,955 @@ const struct config_enum_entry io_method_options[] = {
{NULL, 0, false}
};
-int io_method = IOMETHOD_WORKER;
+int io_method = DEFAULT_IO_METHOD;
+int io_max_concurrency = -1;
+int io_bounce_buffers = -1;
+
+
+/* global control for AIO */
+PgAioCtl *aio_ctl;
+
+/* current backend's per-backend state */
+PgAioPerBackend *my_aio;
+
+
+static const IoMethodOps *pgaio_ops_table[] = {
+ [IOMETHOD_WORKER] = &pgaio_worker_ops,
+#ifdef USE_LIBURING
+ [IOMETHOD_IO_URING] = &pgaio_uring_ops,
+#endif
+};
+
+
+const IoMethodOps *pgaio_impl;
+
+
+
+/* --------------------------------------------------------------------------------
+ * "Core" IO Api
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * AFIXME: rewrite
+ *
+ * Shared completion callbacks can be executed by any backend (otherwise there
+ * would be deadlocks). Therefore they cannot update state for the issuer of
+ * the IO. That can be done with issuer callbacks.
+ *
+ * Note that issuer callbacks are effectively executed in a critical
+ * section. This is necessary as we need to be able to execute IO in critical
+ * sections (consider e.g. WAL logging) and to be able to execute IOs we need
+ * to acquire an IO, which in turn requires executing issuer callbacks. An
+ * alternative scheme could be to defer local callback execution until a later
+ * point, but that gets complicated quickly.
+ *
+ * Therefore the typical pattern is to use an issuer callback to set some
+ * flags in backend local memory, which can then be used to error out at a
+ * later time.
+ *
+ * NB: The issuer callback is cleared when the resowner owning the IO goes out
+ * of scope.
+ */
+PgAioHandle *
+pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ PgAioHandle *h;
+
+ while (true)
+ {
+ h = pgaio_io_get_nb(resowner, ret);
+
+ if (h != NULL)
+ return h;
+
+ /*
+ * Evidently all IO handles issued by this backend are in use. Just
+ * wait for some to complete.
+ */
+ pgaio_io_wait_for_free();
+ }
+}
+
+PgAioHandle *
+pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ if (dclist_count(&my_aio->staged_ios) >= PGAIO_SUBMIT_BATCH_SIZE)
+ {
+ pgaio_submit_staged();
+ }
+
+ if (my_aio->handed_out_io)
+ {
+ ereport(ERROR,
+ errmsg("API violation: only one IO can be handed out"));
+ }
+
+ if (!dclist_is_empty(&my_aio->idle_ios))
+ {
+ dlist_node *ion = dclist_pop_head_node(&my_aio->idle_ios);
+ PgAioHandle *ioh = dclist_container(PgAioHandle, node, ion);
+
+ Assert(ioh->state == AHS_IDLE);
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ ioh->state = AHS_HANDED_OUT;
+ my_aio->handed_out_io = ioh;
+
+ if (resowner)
+ pgaio_io_resowner_register(ioh);
+
+ if (ret)
+ ioh->report_return = ret;
+
+ return ioh;
+ }
+
+ return NULL;
+}
+
+void
+pgaio_io_release(PgAioHandle *ioh)
+{
+ if (ioh == my_aio->handed_out_io)
+ {
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(ioh->resowner);
+
+ my_aio->handed_out_io = NULL;
+ pgaio_io_reclaim(ioh);
+ }
+ else
+ {
+ elog(ERROR, "release in unexpected state");
+ }
+}
+
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+ PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+ Assert(ioh->resowner);
+
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ elog(ERROR, "unexpected");
+ break;
+ case AHS_HANDED_OUT:
+ Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
+
+ if (ioh == my_aio->handed_out_io)
+ {
+ my_aio->handed_out_io = NULL;
+ if (!on_error)
+ elog(WARNING, "leaked AIO handle");
+ }
+
+ pgaio_io_reclaim(ioh);
+ break;
+ case AHS_DEFINED:
+ case AHS_PREPARED:
+ /* XXX: Should we warn about this when is_commit? */
+ pgaio_submit_staged();
+ break;
+ case AHS_IN_FLIGHT:
+ case AHS_REAPED:
+ case AHS_COMPLETED_SHARED:
+ /* this is expected to happen */
+ break;
+ case AHS_COMPLETED_LOCAL:
+ /* XXX: unclear if this ought to be possible? */
+ pgaio_io_reclaim(ioh);
+ break;
+ }
+
+ /*
+ * Need to unregister the reporting of the IO's result, the memory it's
+ * referencing likely has gone away.
+ */
+ if (ioh->report_return)
+ ioh->report_return = NULL;
+}
+
+int
+pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* AFIXME: Needs to be the value at startup time */
+ return io_combine_limit;
+}
+
+PgAioSubjectData *
+pgaio_io_get_subject_data(PgAioHandle *ioh)
+{
+ return &ioh->scb_data;
+}
+
+PgAioOpData *
+pgaio_io_get_op_data(PgAioHandle *ioh)
+{
+ return &ioh->op_data;
+}
+
+ProcNumber
+pgaio_io_get_owner(PgAioHandle *ioh)
+{
+ return ioh->owner_procno;
+}
+
+bool
+pgaio_io_has_subject(PgAioHandle *ioh)
+{
+ return ioh->subject != ASI_INVALID;
+}
+
+void
+pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->flags |= flag;
+}
+
+void
+pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ for (int i = 0; i < len; i++)
+ aio_ctl->iovecs_data[ioh->iovec_off + i] = data[i];
+ ioh->iovec_data_len = len;
+}
+
+uint64 *
+pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len)
+{
+ Assert(ioh->iovec_data_len > 0);
+
+ *len = ioh->iovec_data_len;
+
+ return &aio_ctl->iovecs_data[ioh->iovec_off];
+}
+
+void
+pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->subject = subjid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, set subject",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh));
+}
+
+void
+pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior)
+{
+ Assert(ioh->state == AHS_HANDED_OUT ||
+ ioh->state == AHS_DEFINED ||
+ ioh->state == AHS_PREPARED);
+ Assert(ioh->generation != 0);
+
+ ior->aio_index = ioh - aio_ctl->io_handles;
+ ior->generation_upper = (uint32) (ioh->generation >> 32);
+ ior->generation_lower = (uint32) ioh->generation;
+}
+
+void
+pgaio_io_ref_clear(PgAioHandleRef *ior)
+{
+ ior->aio_index = PG_UINT32_MAX;
+}
+
+bool
+pgaio_io_ref_valid(PgAioHandleRef *ior)
+{
+ return ior->aio_index != PG_UINT32_MAX;
+}
+
+int
+pgaio_io_ref_get_id(PgAioHandleRef *ior)
+{
+ Assert(pgaio_io_ref_valid(ior));
+ return ior->aio_index;
+}
+
+bool
+pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
+{
+ *state = ioh->state;
+ pg_read_barrier();
+
+ return ioh->generation != ref_generation;
+}
+
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ if (am_owner)
+ {
+ if (state == AHS_DEFINED || state == AHS_PREPARED)
+ {
+ /* XXX: Arguably this should be prevented by callers? */
+ pgaio_submit_staged();
+ }
+ else if (state != AHS_IN_FLIGHT && state != AHS_REAPED &&
+ state != AHS_COMPLETED_SHARED && state != AHS_COMPLETED_LOCAL)
+ {
+ elog(PANIC, "waiting for own IO in wrong state: %d",
+ state);
+ }
+
+ /*
+ * Somebody else completed the IO, need to execute issuer callback, so
+ * reclaim eagerly.
+ */
+ if (state == AHS_COMPLETED_LOCAL)
+ {
+ pgaio_io_reclaim(ioh);
+
+ return;
+ }
+ }
+
+ while (true)
+ {
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ switch (state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ elog(ERROR, "IO in wrong state: %d", state);
+ break;
+
+ case AHS_IN_FLIGHT:
+ if (pgaio_impl->wait_one)
+ {
+ pgaio_impl->wait_one(ioh, ref_generation);
+ continue;
+ }
+ /* fallthrough */
+
+ /* waiting for owner to submit */
+ case AHS_PREPARED:
+ case AHS_DEFINED:
+ /* waiting for reaper to complete */
+ case AHS_REAPED:
+ /* shouldn't be able to hit this otherwise */
+ Assert(IsUnderPostmaster);
+ /* ensure we're going to get woken up */
+ ConditionVariablePrepareToSleep(&ioh->cv);
+
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
+ {
+ if (state != AHS_REAPED && state != AHS_DEFINED &&
+ state != AHS_IN_FLIGHT)
+ break;
+ ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_COMPLETION);
+ }
+
+ ConditionVariableCancelSleep();
+ break;
+
+ case AHS_COMPLETED_SHARED:
+ /* see above */
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return;
+ case AHS_COMPLETED_LOCAL:
+ return;
+ }
+ }
+}
+
+bool
+pgaio_io_ref_check_done(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return true;
+
+
+ if (state == AHS_IDLE)
+ return true;
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (state == AHS_COMPLETED_SHARED || state == AHS_COMPLETED_LOCAL)
+ {
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return true;
+ }
+
+ return false;
+}
+
+int
+pgaio_io_get_id(PgAioHandle *ioh)
+{
+ Assert(ioh >= aio_ctl->io_handles &&
+ ioh < (aio_ctl->io_handles + aio_ctl->io_handle_count));
+ return ioh - aio_ctl->io_handles;
+}
+
+const char *
+pgaio_io_get_state_name(PgAioHandle *ioh)
+{
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ return "idle";
+ case AHS_HANDED_OUT:
+ return "handed_out";
+ case AHS_DEFINED:
+ return "defined";
+ case AHS_PREPARED:
+ return "prepared";
+ case AHS_IN_FLIGHT:
+ return "in_flight";
+ case AHS_REAPED:
+ return "reaped";
+ case AHS_COMPLETED_SHARED:
+ return "completed_shared";
+ case AHS_COMPLETED_LOCAL:
+ return "completed_local";
+ }
+ pg_unreachable();
+}
+
+/*
+ * Internal, should only be called from pgaio_io_prep_*().
+ */
+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+
+ ioh->op = op;
+ ioh->state = AHS_DEFINED;
+ ioh->result = 0;
+
+ /* allow a new IO to be staged */
+ my_aio->handed_out_io = NULL;
+
+ dclist_push_tail(&my_aio->staged_ios, &ioh->node);
+
+ pgaio_io_prepare_subject(ioh);
+
+ ioh->state = AHS_PREPARED;
+
+ elog(DEBUG3, "io:%d: prepared %s",
+ pgaio_io_get_id(ioh), pgaio_io_get_op_name(ioh));
+}
+
+/*
+ * Handle IO getting completed by a method.
+ */
+void
+pgaio_io_process_completion(PgAioHandle *ioh, int result)
+{
+ ioh->result = result;
+
+ pg_write_barrier();
+
+ /* FIXME: should be done in separate function */
+ ioh->state = AHS_REAPED;
+
+ pgaio_io_process_completion_subject(ioh);
+
+ /* ensure results of completion are visible before the new state */
+ pg_write_barrier();
+
+ ioh->state = AHS_COMPLETED_SHARED;
+
+ /* condition variable broadcast ensures state is visible before wakeup */
+ ConditionVariableBroadcast(&ioh->cv);
+
+ if (ioh->owner_procno == MyProcNumber)
+ pgaio_io_reclaim(ioh);
+}
+
+/*
+ * Handle IO being processed by IO method.
+ */
+void
+pgaio_io_prepare_submit(PgAioHandle *ioh)
+{
+ ioh->state = AHS_IN_FLIGHT;
+ pg_write_barrier();
+
+ dclist_delete_from(&my_aio->staged_ios, &ioh->node);
+}
+
+static PgAioHandle *
+pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation)
+{
+ PgAioHandle *ioh;
+
+ Assert(ior->aio_index < aio_ctl->io_handle_count);
+
+ ioh = &aio_ctl->io_handles[ior->aio_index];
+
+ *ref_generation = ((uint64) ior->generation_upper) << 32 |
+ ior->generation_lower;
+
+ Assert(*ref_generation != 0);
+
+ return ioh;
+}
+
+static void
+pgaio_io_resowner_register(PgAioHandle *ioh)
+{
+ Assert(!ioh->resowner);
+ Assert(CurrentResourceOwner);
+
+ ResourceOwnerRememberAioHandle(CurrentResourceOwner, &ioh->resowner_node);
+ ioh->resowner = CurrentResourceOwner;
+}
+
+static void
+pgaio_io_reclaim(PgAioHandle *ioh)
+{
+ /* This is only ok if it's our IO */
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ ereport(DEBUG3,
+ errmsg("reclaiming io:%d, state: %s, op %s, subject %s, result: %d, distilled_result: AFIXME, report to: %p",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_state_name(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->result,
+ ioh->report_return
+ ),
+ errhidestmt(true), errhidecontext(true));
+
+ if (ioh->report_return)
+ {
+ if (ioh->state != AHS_HANDED_OUT)
+ {
+ ioh->report_return->result = ioh->distilled_result;
+ ioh->report_return->subject_data = ioh->scb_data;
+ }
+ }
+
+ /* reclaim all associated bounce buffers */
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ slist_mutable_iter it;
+
+ slist_foreach_modify(it, &ioh->bounce_buffers)
+ {
+ PgAioBounceBuffer *bb = slist_container(PgAioBounceBuffer, node, it.cur);
+
+ slist_delete_current(&it);
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ }
+ }
+
+ if (ioh->resowner)
+ {
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+ }
+
+ Assert(!ioh->resowner);
+
+ ioh->num_shared_callbacks = 0;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->flags = 0;
+
+ pg_write_barrier();
+ ioh->generation++;
+ pg_write_barrier();
+ ioh->state = AHS_IDLE;
+ pg_write_barrier();
+
+ dclist_push_tail(&my_aio->idle_ios, &ioh->node);
+}
+
+static void
+pgaio_io_wait_for_free(void)
+{
+ bool found_handed_out = false;
+ int reclaimed = 0;
+ static uint32 lastpos = 0;
+
+ elog(DEBUG2,
+ "waiting for self: %d pending",
+ dclist_count(&my_aio->staged_ios));
+
+ /*
+ * First check whether any of our IOs have already completed - when using
+ * worker mode, that will often be the case. We could do this as part of
+ * the loop below, but that could make us wait for IOs submitted earlier.
+ */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[my_aio->io_handle_off + i];
+
+ if (ioh->state == AHS_COMPLETED_SHARED)
+ {
+ pgaio_io_reclaim(ioh);
+ reclaimed++;
+ }
+ }
+
+ if (reclaimed > 0)
+ return;
+
+ if (!dclist_is_empty(&my_aio->staged_ios))
+ {
+ elog(DEBUG2, "submitting while acquiring free io");
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = my_aio->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &aio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+
+ /*
+ * While one might think that pgaio_io_get_nb() should have
+ * succeeded, this is reachable because the IO could have
+ * completed during the submission above.
+ */
+ return;
+ case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARED:
+ case AHS_COMPLETED_LOCAL:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case AHS_HANDED_OUT:
+ if (found_handed_out)
+ elog(ERROR, "more than one handed out IO");
+ found_handed_out = true;
+ continue;
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d",
+ pgaio_io_get_id(ioh));
+ lastpos = i;
+ return;
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+ lastpos = i;
+ return;
+ }
+ }
+
+ elog(PANIC, "could not reclaim any handles");
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+PgAioBounceBuffer *
+pgaio_bounce_buffer_get(void)
+{
+ PgAioBounceBuffer *bb = NULL;
+ slist_node *node;
+
+ if (my_aio->handed_out_bb != NULL)
+ elog(ERROR, "can only hand out one BB");
+
+ /*
+ * FIXME It probably is not correct to have bounce buffers be per backend,
+ * they use too much memory.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ {
+ pgaio_bounce_buffer_wait_for_free();
+ }
+
+ node = slist_pop_head_node(&my_aio->idle_bbs);
+ bb = slist_container(PgAioBounceBuffer, node, node);
+
+ my_aio->handed_out_bb = bb;
+
+ bb->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberAioBounceBuffer(bb->resowner, &bb->resowner_node);
+
+ return bb;
+}
+
+void
+pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only assign handed out BB");
+ my_aio->handed_out_bb = NULL;
+
+ /*
+ * There can be many bounce buffers assigned in case of vectorized IOs.
+ */
+ slist_push_head(&ioh->bounce_buffers, &bb->node);
+
+ /* once associated with an IO, the IO has ownership */
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+uint32
+pgaio_bounce_buffer_id(PgAioBounceBuffer *bb)
+{
+ return bb - aio_ctl->bounce_buffers;
+}
+
+void
+pgaio_bounce_buffer_release(PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only release handed out BB");
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ my_aio->handed_out_bb = NULL;
+
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+void
+pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error)
+{
+ PgAioBounceBuffer *bb = dlist_container(PgAioBounceBuffer, resowner_node, bb_node);
+
+ Assert(bb->resowner);
+
+ if (!on_error)
+ elog(WARNING, "leaked AIO bounce buffer");
+
+ pgaio_bounce_buffer_release(bb);
+}
+
+char *
+pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb)
+{
+ return bb->buffer;
+}
+
+static void
+pgaio_bounce_buffer_wait_for_free(void)
+{
+ static uint32 lastpos = 0;
+
+ if (!dclist_is_empty(&my_aio->staged_ios))
+ {
+ elog(DEBUG2, "submitting while acquiring free bb");
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = my_aio->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &aio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ continue;
+ case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARED:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d to reclaim BB",
+ pgaio_io_get_id(ioh));
+
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(WARNING, "empty after wait");
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ case AHS_COMPLETED_LOCAL:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ break;
+ }
+ }
+
+ /*
+ * The submission above could have caused the IO to complete at any time.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(PANIC, "no more bbs");
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_submit_staged(void)
+{
+ int total_submitted = 0;
+
+ if (dclist_is_empty(&my_aio->staged_ios))
+ return;
+
+ while (!dclist_is_empty(&my_aio->staged_ios))
+ {
+ int staged_count PG_USED_FOR_ASSERTS_ONLY = dclist_count(&my_aio->staged_ios);
+ int did_submit;
+
+ Assert(staged_count > 0);
+
+ START_CRIT_SECTION();
+ END_CRIT_SECTION();
+
+ did_submit = pgaio_impl->submit();
+
+ total_submitted += did_submit;
+ }
+
+#ifdef PGAIO_VERBOSE
+ ereport(DEBUG2,
+ errmsg("submitted %d",
+ total_submitted),
+ errhidestmt(true),
+ errhidecontext(true));
+#endif
+}
+
+bool
+pgaio_have_staged(void)
+{
+ return !dclist_is_empty(&my_aio->staged_ios);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Staged-but-not-yet-submitted IOs using the fd need to be submitted before
+ * the fd is closed, otherwise the IO would end up targeting something bogus
+ * once the fd number is reused.
+ */
+void
+pgaio_closing_fd(int fd)
+{
+ /*
+ * Might be called before AIO is initialized or in a subprocess that
+ * doesn't use AIO.
+ */
+ if (!my_aio)
+ return;
+
+ /*
+ * For now just submit all staged IOs - we could be more selective, but
+ * it's probably not worth it.
+ */
+ pgaio_submit_staged();
+}
+
+void
+pgaio_at_xact_end(bool is_subxact, bool is_commit)
+{
+ Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
+}
+
+/*
+ * Similar to pgaio_at_xact_end(..., is_commit = false), but for cases where
+ * errors happen outside of transactions.
+ */
+void
+pgaio_at_error(void)
+{
+ Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
+}
void
assign_io_method(int newval, void *extra)
{
+ pgaio_impl = pgaio_ops_table[newval];
}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 1c277a7eb3b..cf3512f79fc 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -14,33 +14,351 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/aio_init.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/io_worker.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+static Size
+AioCtlShmemSize(void)
+{
+ Size sz;
+
+ /* aio_ctl itself */
+ sz = offsetof(PgAioCtl, io_handles);
+
+ return sz;
+}
+
+static uint32
+AioProcs(void)
+{
+ /*
+ * While AIO workers don't need their own AIO context, we can't currently
+ * guarantee that nothing gets assigned to the ProcNumber of an IO worker if
+ * we just subtracted MAX_IO_WORKERS.
+ */
+ return MaxBackends + NUM_AUXILIARY_PROCS;
+}
+
+static Size
+AioBackendShmemSize(void)
+{
+ return mul_size(AioProcs(), sizeof(PgAioPerBackend));
+}
+
+static Size
+AioHandleShmemSize(void)
+{
+ Size sz;
+
+ /* ios */
+ sz = mul_size(AioProcs(),
+ mul_size(io_max_concurrency, sizeof(PgAioHandle)));
+
+ return sz;
+}
+
+static Size
+AioIOVShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(struct iovec),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioIOVDataShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(uint64),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioBounceBufferDescShmemSize(void)
+{
+ Size sz;
+
+ /* PgAioBounceBuffer itself */
+ sz = mul_size(sizeof(PgAioBounceBuffer),
+ mul_size(AioProcs(), io_bounce_buffers));
+
+ return sz;
+}
+
+static Size
+AioBounceBufferDataShmemSize(void)
+{
+ Size sz;
+
+ /* and the associated buffer */
+ sz = mul_size(BLCKSZ,
+ mul_size(io_bounce_buffers, AioProcs()));
+ /* memory for alignment */
+ sz += BLCKSZ;
+
+ return sz;
+}
+
+/*
+ * Choose a suitable value for io_max_concurrency.
+ *
+ * It's unlikely that we could have more IOs in flight than buffers that we
+ * would be allowed to pin.
+ *
+ * On the upper end, apply a cap too - just because shared_buffers is large,
+ * it doesn't make sense to have millions of buffers undergo IO concurrently.
+ */
+static int
+AioChooseMaxConccurrency(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 64);
+}
+
+/*
+ * Choose a suitable value for io_bounce_buffers.
+ *
+ * It's very unlikely to be useful to allocate more bounce buffers for each
+ * backend than the backend is allowed to pin. Additionally, bounce buffers
+ * are currently only used for writes, and it seems very uncommon for more
+ * than 10% of shared_buffers to be written out concurrently.
+ *
+ * XXX: This can quickly take up significant amounts of memory, the logic
+ * should probably be fine-tuned.
+ */
+static int
+AioChooseBounceBuffers(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = (NBuffers / 10) / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 256);
+}
+
Size
AioShmemSize(void)
{
Size sz = 0;
+ /*
+ * If io_max_concurrency is -1, we automatically choose a suitable value.
+ *
+ * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
+ * However, if the DBA explicitly set io_max_concurrency = -1 in the config
+ * file, then PGC_S_DYNAMIC_DEFAULT will fail to override that and we must
+ * force the matter with PGC_S_OVERRIDE.
+ */
+ if (io_max_concurrency == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseMaxConccurrency());
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_max_concurrency == -1) /* failed to apply it? */
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+
+ /*
+ * If io_bounce_buffers is -1, we automatically choose a suitable value.
+ *
+ * See also comment above.
+ */
+ if (io_bounce_buffers == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseBounceBuffers());
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_bounce_buffers == -1) /* failed to apply it? */
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+ sz = add_size(sz, AioCtlShmemSize());
+ sz = add_size(sz, AioBackendShmemSize());
+ sz = add_size(sz, AioHandleShmemSize());
+ sz = add_size(sz, AioIOVShmemSize());
+ sz = add_size(sz, AioIOVDataShmemSize());
+ sz = add_size(sz, AioBounceBufferDescShmemSize());
+ sz = add_size(sz, AioBounceBufferDataShmemSize());
+
+ if (pgaio_impl->shmem_size)
+ sz = add_size(sz, pgaio_impl->shmem_size());
+
return sz;
}
void
AioShmemInit(void)
{
+ bool found;
+ uint32 io_handle_off = 0;
+ uint32 iovec_off = 0;
+ uint32 bounce_buffers_off = 0;
+ uint32 per_backend_iovecs = io_max_concurrency * io_combine_limit;
+ uint32 per_backend_bb = io_bounce_buffers;
+ char *bounce_buffers_data;
+
+ aio_ctl = (PgAioCtl *)
+ ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
+
+ if (found)
+ goto out;
+
+ memset(aio_ctl, 0, AioCtlShmemSize());
+
+ aio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
+ aio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+ aio_ctl->bounce_buffers_count = AioProcs() * per_backend_bb;
+
+ aio_ctl->backend_state = (PgAioPerBackend *)
+ ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
+
+ aio_ctl->io_handles = (PgAioHandle *)
+ ShmemInitStruct("AioHandle", AioHandleShmemSize(), &found);
+
+ aio_ctl->iovecs = ShmemInitStruct("AioIOV", AioIOVShmemSize(), &found);
+ aio_ctl->iovecs_data = ShmemInitStruct("AioIOVData", AioIOVDataShmemSize(), &found);
+
+ aio_ctl->bounce_buffers = ShmemInitStruct("AioBounceBufferDesc", AioBounceBufferDescShmemSize(), &found);
+
+ bounce_buffers_data = ShmemInitStruct("AioBounceBufferData", AioBounceBufferDataShmemSize(), &found);
+ bounce_buffers_data = (char *) TYPEALIGN(BLCKSZ, (uintptr_t) bounce_buffers_data);
+ aio_ctl->bounce_buffers_data = bounce_buffers_data;
+
+
+ /* Initialize IO handles. */
+ for (uint64 i = 0; i < aio_ctl->io_handle_count; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[i];
+
+ ioh->op = PGAIO_OP_INVALID;
+ ioh->subject = ASI_INVALID;
+ ioh->state = AHS_IDLE;
+
+ slist_init(&ioh->bounce_buffers);
+ }
+
+ /* Initialize Bounce Buffers. */
+ for (uint64 i = 0; i < aio_ctl->bounce_buffers_count; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[i];
+
+ bb->buffer = bounce_buffers_data;
+ bounce_buffers_data += BLCKSZ;
+ }
+
+
+ for (int procno = 0; procno < AioProcs(); procno++)
+ {
+ PgAioPerBackend *bs = &aio_ctl->backend_state[procno];
+
+ bs->io_handle_off = io_handle_off;
+ io_handle_off += io_max_concurrency;
+
+ bs->bounce_buffers_off = bounce_buffers_off;
+ bounce_buffers_off += per_backend_bb;
+
+ dclist_init(&bs->idle_ios);
+ dclist_init(&bs->staged_ios);
+ slist_init(&bs->idle_bbs);
+
+ /* initialize per-backend IOs */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[bs->io_handle_off + i];
+
+ ioh->generation = 1;
+ ioh->owner_procno = procno;
+ ioh->iovec_off = iovec_off;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->resowner = NULL;
+ ioh->num_shared_callbacks = 0;
+ ioh->distilled_result.status = ARS_UNKNOWN;
+ ioh->flags = 0;
+
+ ConditionVariableInit(&ioh->cv);
+
+ dclist_push_tail(&bs->idle_ios, &ioh->node);
+ iovec_off += io_combine_limit;
+ }
+
+ /* initialize per-backend bounce buffers */
+ for (int i = 0; i < per_backend_bb; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[bs->bounce_buffers_off + i];
+
+ slist_push_head(&bs->idle_bbs, &bb->node);
+ }
+ }
+
+out:
+ /* Initialize IO method specific resources. */
+ pgaio_impl->shmem_init(!found);
}
void
pgaio_postmaster_init(void)
{
+ if (pgaio_impl->postmaster_init)
+ pgaio_impl->postmaster_init();
}
void
pgaio_postmaster_child_init(void)
{
+ /* shouldn't be initialized twice */
+ Assert(!my_aio);
+
+ if (MyBackendType == B_IO_WORKER)
+ return;
+
+ if (MyProc == NULL || MyProcNumber >= AioProcs())
+ elog(ERROR, "aio requires a normal PGPROC");
+
+ my_aio = &aio_ctl->backend_state[MyProcNumber];
+
+ if (pgaio_impl->postmaster_child_init)
+ pgaio_impl->postmaster_child_init();
}
void
pgaio_postmaster_child_init_local(void)
{
+ if (pgaio_impl->postmaster_child_init_local)
+ pgaio_impl->postmaster_child_init_local();
+}
+
+bool
+pgaio_workers_enabled(void)
+{
+ return io_method == IOMETHOD_WORKER;
}
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
new file mode 100644
index 00000000000..5b2f9ee3ba6
--- /dev/null
+++ b/src/backend/storage/aio/aio_io.c
@@ -0,0 +1,111 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_io.c
+ * Asynchronous I/O subsystem.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_io.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "utils/wait_event.h"
+
+
+static void
+pgaio_io_before_prep(PgAioHandle *ioh)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+}
+
+const char *
+pgaio_io_get_op_name(PgAioHandle *ioh)
+{
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_INVALID:
+ return "invalid";
+ case PGAIO_OP_READ:
+ return "read";
+ case PGAIO_OP_WRITE:
+ return "write";
+ case PGAIO_OP_FSYNC:
+ return "fsync";
+ case PGAIO_OP_FLUSH_RANGE:
+ return "flush_range";
+ case PGAIO_OP_NOP:
+ return "nop";
+ }
+
+ pg_unreachable();
+}
+
+void
+pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.read.fd = fd;
+ ioh->op_data.read.offset = offset;
+ ioh->op_data.read.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_READ);
+}
+
+void
+pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.write.fd = fd;
+ ioh->op_data.write.offset = offset;
+ ioh->op_data.write.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_WRITE);
+}
+
+
+void
+pgaio_io_perform_synchronously(PgAioHandle *ioh)
+{
+ ssize_t result = 0;
+ struct iovec *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* Perform IO. */
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READ:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_READ);
+ result = pg_preadv(ioh->op_data.read.fd, iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_WRITE:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_WRITE);
+ result = pg_pwritev(ioh->op_data.write.fd, iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ pgstat_report_wait_end();
+ break;
+ default:
+ elog(ERROR, "not yet");
+ }
+
+ ioh->result = result < 0 ? -errno : result;
+
+ pgaio_io_process_completion(ioh, ioh->result);
+}
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
new file mode 100644
index 00000000000..68e9e80074c
--- /dev/null
+++ b/src/backend/storage/aio/aio_subject.c
@@ -0,0 +1,170 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_subject.c
+ * IO completion handling for IOs on different subjects
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_subject.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+
+static const PgAioSubjectInfo *aio_subject_info[] = {
+ [ASI_INVALID] = &(PgAioSubjectInfo) {
+ .name = "invalid",
+ },
+};
+
+static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
+};
+
+
+void
+pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid)
+{
+ if (cbid >= lengthof(aio_shared_cbs))
+ elog(ERROR, "callback %d is out of range", cbid);
+ if (aio_shared_cbs[cbid]->complete == NULL)
+ elog(ERROR, "callback %d is undefined", cbid);
+ if (ioh->num_shared_callbacks >= AIO_MAX_SHARED_CALLBACKS)
+ elog(PANIC, "too many callbacks, the max is %d", AIO_MAX_SHARED_CALLBACKS);
+ ioh->shared_callbacks[ioh->num_shared_callbacks] = cbid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, adding cbid num %d, id %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->num_shared_callbacks + 1, cbid);
+
+ ioh->num_shared_callbacks++;
+}
+
+const char *
+pgaio_io_get_subject_name(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+
+ return aio_subject_info[ioh->subject]->name;
+}
+
+void
+pgaio_io_prepare_subject(PgAioHandle *ioh)
+{
+ Assert(ioh->subject > ASI_INVALID && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleSharedCallbacks *cbs = aio_shared_cbs[cbid];
+
+ if (!cbs->prepare)
+ continue;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cbid num %d, id %d: prepare",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i, cbid);
+ cbs->prepare(ioh);
+ }
+}
+
+void
+pgaio_io_process_completion_subject(PgAioHandle *ioh)
+{
+ PgAioResult result;
+
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ result.status = ARS_OK; /* low level IO is always considered OK */
+ result.result = ioh->result;
+ result.id = 0; /* FIXME */
+ result.error_data = 0;
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid;
+
+ cbid = ioh->shared_callbacks[i - 1];
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cbid num %d, id %d with distilled result status %d, id %u, error_data: %d, result: %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i, cbid,
+ result.status,
+ result.id,
+ result.error_data,
+ result.result);
+ result = aio_shared_cbs[cbid]->complete(ioh, result);
+ }
+
+ ioh->distilled_result = result;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, distilled result status %d, id %u, error_data: %d, result: %d, raw_result %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ result.status,
+ result.id,
+ result.error_data,
+ result.result,
+ ioh->result);
+}
+
+void
+pgaio_io_reopen(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ aio_subject_info[ioh->subject]->reopen(ioh);
+}
+
+bool
+pgaio_io_needs_synchronously(PgAioHandle *ioh)
+{
+ if (aio_subject_info[ioh->subject]->reopen == NULL)
+ return true;
+
+ return false;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ const PgAioHandleSharedCallbacks *scb;
+
+ Assert(result.status != ARS_UNKNOWN);
+ Assert(result.status != ARS_OK);
+
+ scb = aio_shared_cbs[result.id];
+
+ if (scb->error == NULL)
+ elog(ERROR, "scb id %d does not have error callback", result.id);
+
+ scb->error(result, subject_data, elevel);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index e13728b73da..8960223194a 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -2,7 +2,10 @@
backend_sources += files(
'aio.c',
+ 'aio_io.c',
'aio_init.c',
+ 'aio_subject.c',
+ 'method_io_uring.c',
'method_worker.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
new file mode 100644
index 00000000000..f76533b4cdc
--- /dev/null
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -0,0 +1,393 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_io_uring.c
+ * AIO implementation using io_uring on Linux
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_io_uring.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#ifdef USE_LIBURING
+
+#include <liburing.h>
+
+#include "pgstat.h"
+#include "port/pg_iovec.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+
+/* Entry points for IoMethodOps. */
+static size_t pgaio_uring_shmem_size(void);
+static void pgaio_uring_postmaster_init(void);
+static void pgaio_uring_shmem_init(bool first_time);
+static void pgaio_uring_postmaster_child_init(void);
+static void pgaio_uring_postmaster_child_init_local(void);
+
+static int pgaio_uring_submit(void);
+static void pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation);
+
+static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe);
+
+
+const IoMethodOps pgaio_uring_ops = {
+ .shmem_size = pgaio_uring_shmem_size,
+ .shmem_init = pgaio_uring_shmem_init,
+ .postmaster_init = pgaio_uring_postmaster_init,
+ .postmaster_child_init = pgaio_uring_postmaster_child_init,
+ .postmaster_child_init_local = pgaio_uring_postmaster_child_init_local,
+
+ .submit = pgaio_uring_submit,
+ .wait_one = pgaio_uring_wait_one,
+#if 0
+ .retry = pgaio_uring_io_retry,
+ .wait_one = pgaio_uring_wait_one,
+ .drain = pgaio_uring_drain,
+#endif
+ .can_scatter_gather_direct = true,
+ .can_scatter_gather_buffered = true
+};
+
+typedef struct PgAioUringContext
+{
+ LWLock completion_lock;
+
+ struct io_uring io_uring_ring;
+ /* XXX: probably worth padding to a cacheline boundary here */
+} PgAioUringContext;
+
+
+static PgAioUringContext *aio_uring_contexts;
+static PgAioUringContext *my_shared_uring_context;
+
+/* io_uring local state */
+static struct io_uring local_ring;
+
+
+
+static Size
+AioContextShmemSize(void)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ return mul_size(TotalProcs, sizeof(PgAioUringContext));
+}
+
+static size_t
+pgaio_uring_shmem_size(void)
+{
+ return AioContextShmemSize();
+}
+
+static void
+pgaio_uring_shmem_init(bool first_time)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+ bool found;
+
+ aio_uring_contexts = (PgAioUringContext *)
+ ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+ if (found)
+ return;
+
+ for (int contextno = 0; contextno < TotalProcs; contextno++)
+ {
+ PgAioUringContext *context = &aio_uring_contexts[contextno];
+ int ret;
+
+ /*
+ * XXX: Probably worth sharing the WQ between the different rings,
+ * when supported by the kernel. Could also cause additional
+ * contention, I guess?
+ */
+#if 0
+ if (!AcquireExternalFD())
+ elog(ERROR, "No external FD available");
+#endif
+ ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+
+ LWLockInitialize(&context->completion_lock, LWTRANCHE_AIO_URING_COMPLETION);
+ }
+}
+
+static void
+pgaio_uring_postmaster_init(void)
+{
+ uint32 TotalProcs =
+ MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ for (int i = 0; i < TotalProcs; i++)
+ ReserveExternalFD();
+}
+
+static void
+pgaio_uring_postmaster_child_init(void)
+{
+ my_shared_uring_context = &aio_uring_contexts[MyProcNumber];
+}
+
+static void
+pgaio_uring_postmaster_child_init_local(void)
+{
+ int ret;
+
+ ret = io_uring_queue_init(32, &local_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+}
+
+static int
+pgaio_uring_submit(void)
+{
+ PgAioHandle *ios[PGAIO_SUBMIT_BATCH_SIZE];
+ struct io_uring_sqe *sqe[PGAIO_SUBMIT_BATCH_SIZE];
+ struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
+ int nios = 0;
+
+ while (!dclist_is_empty(&my_aio->staged_ios))
+ {
+ dlist_node *node;
+ PgAioHandle *ioh;
+
+ node = dclist_head_node(&my_aio->staged_ios);
+ ioh = dlist_container(PgAioHandle, node, node);
+
+ sqe[nios] = io_uring_get_sqe(uring_instance);
+ ios[nios] = ioh;
+
+ pgaio_io_prepare_submit(ioh);
+ pgaio_uring_sq_from_io(ios[nios], sqe[nios]);
+
+ nios++;
+
+ if (nios == PGAIO_SUBMIT_BATCH_SIZE)
+ break;
+ }
+
+ while (true)
+ {
+ int ret;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+ ret = io_uring_submit(uring_instance);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ elog(DEBUG3, "submit EINTR, nios: %d", nios);
+ continue;
+ }
+ if (ret < 0)
+ elog(PANIC, "io_uring_submit() failed: %d/%s",
+ ret, strerror(-ret));
+ else if (ret != nios)
+ {
+ /* likely unreachable, but if it is, we would need to re-submit */
+ elog(PANIC, "submitted only %d of %d",
+ ret, nios);
+ }
+ else
+ {
+ elog(DEBUG3, "submit nios: %d", nios);
+ }
+ break;
+ }
+
+ return nios;
+}
+
+
+#define PGAIO_MAX_LOCAL_REAPED 16
+
+static void
+pgaio_uring_drain_locked(PgAioUringContext *context)
+{
+ int ready;
+ int orig_ready;
+
+ /*
+ * Don't drain more events than available right now. Otherwise it's
+ * plausible that one backend could get stuck, for a while, receiving CQEs
+ * without actually processing them.
+ */
+ orig_ready = ready = io_uring_cq_ready(&context->io_uring_ring);
+
+ while (ready > 0)
+ {
+ struct io_uring_cqe *reaped_cqes[PGAIO_MAX_LOCAL_REAPED];
+ uint32 reaped;
+
+ START_CRIT_SECTION();
+ reaped =
+ io_uring_peek_batch_cqe(&context->io_uring_ring,
+ reaped_cqes,
+ Min(PGAIO_MAX_LOCAL_REAPED, ready));
+ Assert(reaped <= ready);
+
+ ready -= reaped;
+
+ for (int i = 0; i < reaped; i++)
+ {
+ struct io_uring_cqe *cqe = reaped_cqes[i];
+ PgAioHandle *ioh;
+
+ ioh = io_uring_cqe_get_data(cqe);
+ io_uring_cqe_seen(&context->io_uring_ring, cqe);
+
+ pgaio_io_process_completion(ioh, cqe->res);
+ }
+
+ END_CRIT_SECTION();
+
+ ereport(DEBUG3,
+ errmsg("drained %d/%d, now expecting %d",
+ reaped, orig_ready, io_uring_cq_ready(&context->io_uring_ring)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ }
+}
+
+static void
+pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation)
+{
+ PgAioHandleState state;
+ ProcNumber owner_procno = ioh->owner_procno;
+ PgAioUringContext *owner_context = &aio_uring_contexts[owner_procno];
+ bool expect_cqe;
+ int waited = 0;
+
+ /*
+ * We ought to have a smarter locking scheme, nearly all the time the
+ * backend owning the ring will reap the completions, making the locking
+ * unnecessarily expensive.
+ */
+ LWLockAcquire(&owner_context->completion_lock, LW_EXCLUSIVE);
+
+ while (true)
+ {
+ ereport(DEBUG3,
+ errmsg("wait_one for io:%d in state %s, cycle %d",
+ pgaio_io_get_id(ioh), pgaio_io_get_state_name(ioh), waited),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state) ||
+ state != AHS_IN_FLIGHT)
+ {
+ break;
+ }
+ else if (io_uring_cq_ready(&owner_context->io_uring_ring))
+ {
+ expect_cqe = true;
+ }
+ else
+ {
+ int ret;
+ struct io_uring_cqe *cqes;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_DRAIN);
+ ret = io_uring_wait_cqes(&owner_context->io_uring_ring, &cqes, 1, NULL, NULL);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ continue;
+ }
+ else if (ret != 0)
+ {
+ elog(PANIC, "io_uring_wait_cqes() failed: %d/%s", ret, strerror(-ret));
+ }
+ else
+ {
+ Assert(cqes != NULL);
+ expect_cqe = true;
+ waited++;
+ }
+ }
+
+ if (expect_cqe)
+ {
+ pgaio_uring_drain_locked(owner_context);
+ }
+ }
+
+ LWLockRelease(&owner_context->completion_lock);
+
+ ereport(DEBUG3,
+ errmsg("wait_one with %d sleeps",
+ waited),
+ errhidestmt(true),
+ errhidecontext(true));
+}
+
+static void
+pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
+{
+ struct iovec *iov;
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READ:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.read.iov_length == 1)
+ {
+ io_uring_prep_read(sqe,
+ ioh->op_data.read.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.read.offset);
+ }
+ else
+ {
+ io_uring_prep_readv(sqe,
+ ioh->op_data.read.fd,
+ iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+
+ }
+ break;
+
+ case PGAIO_OP_WRITE:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.write.iov_length == 1)
+ {
+ io_uring_prep_write(sqe,
+ ioh->op_data.write.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.write.offset);
+ }
+ else
+ {
+ io_uring_prep_writev(sqe,
+ ioh->op_data.write.fd,
+ iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ }
+ break;
+
+ default:
+ elog(ERROR, "not implemented");
+ }
+
+ io_uring_sqe_set_data(sqe, ioh);
+}
+
+#endif /* USE_LIBURING */
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 5df2eea4a03..cd79bf1fba6 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -3,6 +3,21 @@
* method_worker.c
* AIO implementation using workers
*
+ * Worker processes consume IOs from a shared memory submission queue, run
+ * traditional synchronous system calls, and perform the shared completion
+ * handling immediately. Client code submits most requests by pushing IOs
+ * into the submission queue, and waits (if necessary) using condition
+ * variables. Some IOs cannot be performed in another process due to lack of
+ * infrastructure for reopening the file, and must be processed synchronously by
+ * the client code when submitted.
+ *
+ * So that the submitter can make just one system call when submitting a batch
+ * of IOs, wakeups "fan out"; each woken worker can wake two more. XXX This
+ * could be improved by using futexes instead of latches to wake N waiters.
+ *
+ * This method of AIO is available in all builds on all operating systems, and
+ * is the default.
+ *
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -16,24 +31,299 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "postmaster/auxprocess.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "tcop/tcopprot.h"
#include "utils/wait_event.h"
+#include "utils/ps_status.h"
+
+
+/* How many workers should each worker wake up if needed? */
+#define IO_WORKER_WAKEUP_FANOUT 2
+
+
+typedef struct AioWorkerSubmissionQueue
+{
+ uint32 size;
+ uint32 mask;
+ uint32 head;
+ uint32 tail;
+ uint32 ios[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerSubmissionQueue;
+
+typedef struct AioWorkerSlot
+{
+ Latch *latch;
+ bool in_use;
+} AioWorkerSlot;
+
+typedef struct AioWorkerControl
+{
+ uint64 idle_worker_mask;
+ AioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerControl;
+
+
+static size_t pgaio_worker_shmem_size(void);
+static void pgaio_worker_shmem_init(bool first_time);
+static void pgaio_worker_postmaster_child_init_local(void);
+
+static int pgaio_worker_submit(void);
+
+
+const IoMethodOps pgaio_worker_ops = {
+ .shmem_size = pgaio_worker_shmem_size,
+ .shmem_init = pgaio_worker_shmem_init,
+ .postmaster_child_init_local = pgaio_worker_postmaster_child_init_local,
+ .submit = pgaio_worker_submit,
+#if 0
+ .wait_one = pgaio_worker_wait_one,
+ .retry = pgaio_worker_io_retry,
+ .drain = pgaio_worker_drain,
+#endif
+
+ .can_scatter_gather_direct = true,
+ .can_scatter_gather_buffered = true
+};
int io_workers = 3;
+static int io_worker_queue_size = 64;
+static int MyIoWorkerId;
+
+
+static AioWorkerSubmissionQueue *io_worker_submission_queue;
+static AioWorkerControl *io_worker_control;
+
+
+static size_t
+pgaio_worker_shmem_size(void)
+{
+ return
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * pg_nextpower2_32(io_worker_queue_size) +
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers;
+}
+
+static void
+pgaio_worker_shmem_init(bool first_time)
+{
+ bool found;
+ int size;
+
+ /* Round size up to next power of two so we can make a mask. */
+ size = pg_nextpower2_32(io_worker_queue_size);
+
+ io_worker_submission_queue =
+ ShmemInitStruct("AioWorkerSubmissionQueue",
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * size,
+ &found);
+ if (!found)
+ {
+ io_worker_submission_queue->size = size;
+ io_worker_submission_queue->head = 0;
+ io_worker_submission_queue->tail = 0;
+ }
+
+ io_worker_control =
+ ShmemInitStruct("AioWorkerControl",
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers,
+ &found);
+ if (!found)
+ {
+ io_worker_control->idle_worker_mask = 0;
+ for (int i = 0; i < io_workers; ++i)
+ {
+ io_worker_control->workers[i].latch = NULL;
+ io_worker_control->workers[i].in_use = false;
+ }
+ }
+}
+
+static void
+pgaio_worker_postmaster_child_init_local(void)
+{
+}
+
+
+static int
+pgaio_choose_idle_worker(void)
+{
+ int worker;
+
+ if (io_worker_control->idle_worker_mask == 0)
+ return -1;
+
+ /* Find the lowest bit position, and clear it. */
+ worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
+
+ return worker;
+}
+
+static bool
+pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 new_head;
+
+ queue = io_worker_submission_queue;
+ new_head = (queue->head + 1) & (queue->size - 1);
+ if (new_head == queue->tail)
+ {
+ elog(DEBUG1, "AIO worker submission queue full");
+ return false; /* full */
+ }
+
+ queue->ios[queue->head] = pgaio_io_get_id(ioh);
+ queue->head = new_head;
+
+ return true;
+}
+
+static uint32
+pgaio_worker_submission_queue_consume(void)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 result;
+
+ queue = io_worker_submission_queue;
+ if (queue->tail == queue->head)
+ return UINT32_MAX; /* empty */
+
+ result = queue->ios[queue->tail];
+ queue->tail = (queue->tail + 1) & (queue->size - 1);
+
+ return result;
+}
+
+static uint32
+pgaio_worker_submission_queue_depth(void)
+{
+ uint32 head;
+ uint32 tail;
+
+ head = io_worker_submission_queue->head;
+ tail = io_worker_submission_queue->tail;
+
+ if (tail > head)
+ head += io_worker_submission_queue->size;
+
+ Assert(head >= tail);
+
+ return head - tail;
+}
+
+static bool
+pgaio_worker_need_synchronous(PgAioHandle *ioh)
+{
+ return
+ !IsUnderPostmaster
+ || ioh->flags & AHF_REFERENCES_LOCAL
+ || pgaio_io_needs_synchronously(ioh);
+}
+
+static void
+pgaio_worker_submit_internal(PgAioHandle *ios[], int nios)
+{
+ PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ int nsync = 0;
+ Latch *wakeup = NULL;
+ int worker;
+
+ Assert(nios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ for (int i = 0; i < nios; ++i)
+ {
+ if (pgaio_worker_need_synchronous(ios[i]) ||
+ !pgaio_worker_submission_queue_insert(ios[i]))
+ {
+ /*
+ * We'll do it synchronously, but only after we've sent as many as
+ * we can to workers, to maximize concurrency.
+ */
+ synchronous_ios[nsync++] = ios[i];
+ continue;
+ }
+
+ if (wakeup == NULL)
+ {
+ /* Choose an idle worker to wake up if we haven't already. */
+ worker = pgaio_choose_idle_worker();
+ if (worker >= 0)
+ wakeup = io_worker_control->workers[worker].latch;
+
+ ereport(DEBUG3,
+ errmsg("submission for io:%d choosing worker %d, latch %p",
+ pgaio_io_get_id(ios[i]), worker, wakeup),
+ errhidestmt(true), errhidecontext(true));
+ }
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ if (wakeup)
+ SetLatch(wakeup);
+
+ /* Run whatever is left synchronously. */
+ if (nsync > 0)
+ {
+ for (int i = 0; i < nsync; ++i)
+ {
+ pgaio_io_perform_synchronously(synchronous_ios[i]);
+ }
+ }
+}
+
+static int
+pgaio_worker_submit(void)
+{
+ PgAioHandle *ios[PGAIO_SUBMIT_BATCH_SIZE];
+ int nios = 0;
+
+ while (!dclist_is_empty(&my_aio->staged_ios))
+ {
+ dlist_node *node;
+ PgAioHandle *ioh;
+
+ node = dclist_head_node(&my_aio->staged_ios);
+ ioh = dlist_container(PgAioHandle, node, node);
+
+ pgaio_io_prepare_submit(ioh);
+
+ Assert(nios < PGAIO_SUBMIT_BATCH_SIZE);
+
+ ios[nios++] = ioh;
+
+ if (nios == PGAIO_SUBMIT_BATCH_SIZE)
+ break;
+ }
+
+ pgaio_worker_submit_internal(ios, nios);
+
+ return nios;
+}
void
IoWorkerMain(char *startup_data, size_t startup_data_len)
{
sigjmp_buf local_sigjmp_buf;
+ volatile PgAioHandle *ioh = NULL;
+ char cmd[128];
MyBackendType = B_IO_WORKER;
+ AuxiliaryProcessMainCommon();
/* TODO review all signals */
pqsignal(SIGHUP, SignalHandlerForConfigReload);
@@ -49,7 +339,34 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGUSR1, procsignal_sigusr1_handler);
pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
- sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* FIXME: locking */
+ MyIoWorkerId = -1;
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+
+ for (int i = 0; i < io_workers; ++i)
+ {
+ if (!io_worker_control->workers[i].in_use)
+ {
+ Assert(io_worker_control->workers[i].latch == NULL);
+ io_worker_control->workers[i].in_use = true;
+ MyIoWorkerId = i;
+ break;
+ }
+ else
+ Assert(io_worker_control->workers[i].latch != NULL);
+ }
+
+ if (MyIoWorkerId == -1)
+ elog(ERROR, "couldn't find a free worker slot");
+
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ snprintf(cmd, sizeof(cmd), "worker: %d", MyIoWorkerId);
+ set_ps_display(cmd);
/* see PostgresMain() */
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
@@ -64,21 +381,107 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
LWLockReleaseAll();
/* TODO: recover from IO errors */
+ if (ioh != NULL)
+ {
+#if 0
+ /* EINTR is treated as a retryable error */
+ pgaio_process_io_completion(unvolatize(PgAioInProgress *, io),
+ EINTR);
+#endif
+ }
EmitErrorReport();
+
+ /* FIXME: should probably be a before-shmem-exit instead */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
proc_exit(1);
}
/* We can now handle ereport(ERROR) */
PG_exception_stack = &local_sigjmp_buf;
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
while (!ShutdownRequestPending)
{
- WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
- WAIT_EVENT_IO_WORKER_MAIN);
- ResetLatch(MyLatch);
- CHECK_FOR_INTERRUPTS();
+ uint32 io_index;
+ Latch *latches[IO_WORKER_WAKEUP_FANOUT];
+ int nlatches = 0;
+ int nwakeups = 0;
+ int worker;
+
+ /* Try to get a job to do. */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
+ {
+ /* Nothing to do. Mark self idle. */
+ /*
+ * XXX: Invent some kind of back pressure to reduce useless
+ * wakeups?
+ */
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ }
+ else
+ {
+ /* Got one. Clear idle flag. */
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+
+ /* See if we can wake up some peers. */
+ nwakeups = Min(pgaio_worker_submission_queue_depth(),
+ IO_WORKER_WAKEUP_FANOUT);
+ for (int i = 0; i < nwakeups; ++i)
+ {
+ if ((worker = pgaio_choose_idle_worker()) < 0)
+ break;
+ latches[nlatches++] = io_worker_control->workers[worker].latch;
+ }
+#if 0
+ if (nwakeups > 0)
+ elog(LOG, "wake %d", nwakeups);
+#endif
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ for (int i = 0; i < nlatches; ++i)
+ SetLatch(latches[i]);
+
+ if (io_index != UINT32_MAX)
+ {
+ ioh = &aio_ctl->io_handles[io_index];
+
+ ereport(DEBUG3,
+ errmsg("worker processing io:%d",
+ pgaio_io_get_id(unvolatize(PgAioHandle *, ioh))),
+ errhidestmt(true), errhidecontext(true));
+
+ pgaio_io_reopen(unvolatize(PgAioHandle *, ioh));
+ pgaio_io_perform_synchronously(unvolatize(PgAioHandle *, ioh));
+
+ ioh = NULL;
+ }
+ else
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
}
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
proc_exit(0);
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index f3d3435b1f5..63d1f905554 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -166,6 +166,7 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
[LWTRANCHE_XACT_SLRU] = "XactSLRU",
[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+ [LWTRANCHE_AIO_URING_COMPLETION] = "AioUringCompletion",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 47a2c4d126b..3678f2b3e43 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -192,6 +192,9 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
+AIO_SUBMIT "Waiting for AIO submission."
+AIO_DRAIN "Waiting for IOs to finish."
+AIO_COMPLETION "Waiting for completion callback."
BASEBACKUP_READ "Waiting for base backup to read from a file."
BASEBACKUP_SYNC "Waiting for data written by a base backup to reach durable storage."
BASEBACKUP_WRITE "Waiting for base backup to write to a file."
@@ -348,6 +351,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
WaitLSN "Waiting to read or update shared Wait-for-LSN state."
+AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5670f40478a..5828072a48e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3214,6 +3214,31 @@ struct config_int ConfigureNamesInt[] =
NULL, assign_io_workers, NULL
},
+ {
+ {"io_max_concurrency",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IOs that may be in flight in one backend."),
+ NULL,
+ },
+ &io_max_concurrency,
+ -1, -1, 1024,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"io_bounce_buffers",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO bounce buffers reserved for each backend."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &io_bounce_buffers,
+ -1, -1, 4096,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 90430381efa..1fc8336496c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -842,6 +842,12 @@
#io_method = worker # (change requires restart)
#io_workers = 3 # 1-32;
+#io_max_concurrency = 32 # Max number of IOs that may be in
+ # flight at the same time in one backend
+ # (change requires restart)
+#io_bounce_buffers = -1 # -1 sets based on shared_buffers
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 505534ee8d3..d1932b7393c 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -47,6 +47,8 @@
#include "common/hashfn.h"
#include "common/int.h"
+#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -155,6 +157,13 @@ struct ResourceOwnerData
/* The local locks cache. */
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+
+ /*
+ * AIO handles & bounce buffers need to be registered in critical sections
+ * and therefore cannot use the normal ResourceElem mechanism.
+ */
+ dlist_head aio_handles;
+ dlist_head aio_bounce_buffers;
};
@@ -425,6 +434,9 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
parent->firstchild = owner;
}
+ dlist_init(&owner->aio_handles);
+ dlist_init(&owner->aio_bounce_buffers);
+
return owner;
}
@@ -725,6 +737,21 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* so issue warnings. In the abort case, just clean up quietly.
*/
ResourceOwnerReleaseAll(owner, phase, isCommit);
+
+ /* XXX: Could probably be a later phase? */
+ while (!dlist_is_empty(&owner->aio_handles))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_handles);
+
+ pgaio_io_release_resowner(node, !isCommit);
+ }
+
+ while (!dlist_is_empty(&owner->aio_bounce_buffers))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_bounce_buffers);
+
+ pgaio_bounce_buffer_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1082,3 +1109,27 @@ ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
elog(ERROR, "lock reference %p is not owned by resource owner %s",
locallock, owner->name);
}
+
+void
+ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_bounce_buffers, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_bounce_buffers, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 309686627e7..be8be9fbff0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -54,6 +54,9 @@ AggStrategy
AggTransInfo
Aggref
AggregateInstrumentation
+AioWorkerControl
+AioWorkerSlot
+AioWorkerSubmissionQueue
AlenState
Alias
AllocBlock
@@ -1258,6 +1261,7 @@ IntoClause
InvalMessageArray
InvalidationMsgsGroup
IoMethod
+IoMethodOps
IpcMemoryId
IpcMemoryKey
IpcMemoryState
@@ -2093,6 +2097,24 @@ Permutation
PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
+PgAioBounceBuffer
+PgAioCtl
+PgAioHandle
+PgAioHandleFlags
+PgAioHandleRef
+PgAioHandleSharedCallbackID
+PgAioHandleSharedCallbacks
+PgAioHandleState
+PgAioOp
+PgAioOpData
+PgAioPerBackend
+PgAioResult
+PgAioResultStatus
+PgAioReturn
+PgAioSubjectData
+PgAioSubjectID
+PgAioSubjectInfo
+PgAioUringContext
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.45.2.827.g557ae147e6
Attachment: v2.0-0010-aio-Implement-smgr-md.c-aio-methods.patch (text/x-diff)
From 03723ac0d170aba51febc975921296d814af7765 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 22:33:30 -0400
Subject: [PATCH v2.0 10/17] aio: Implement smgr/md.c aio methods
---
src/include/storage/aio.h | 17 +-
src/include/storage/fd.h | 6 +
src/include/storage/md.h | 12 ++
src/include/storage/smgr.h | 21 +++
src/backend/storage/aio/aio_subject.c | 4 +
src/backend/storage/file/fd.c | 68 ++++++++
src/backend/storage/smgr/md.c | 217 ++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 91 +++++++++++
8 files changed, 434 insertions(+), 2 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 65052462b02..acfd50c587c 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -57,9 +57,10 @@ typedef enum PgAioSubjectID
{
/* intentionally the zero value, to help catch zeroed memory etc */
ASI_INVALID = 0,
+ ASI_SMGR,
} PgAioSubjectID;
-#define ASI_COUNT (ASI_INVALID + 1)
+#define ASI_COUNT (ASI_SMGR + 1)
/*
* Flags for an IO that can be set with pgaio_io_set_flag().
@@ -90,7 +91,8 @@ typedef enum PgAioHandleFlags
*/
typedef enum PgAioHandleSharedCallbackID
{
- ASC_PLACEHOLDER /* empty enums are invalid */ ,
+ ASC_MD_READV,
+ ASC_MD_WRITEV,
} PgAioHandleSharedCallbackID;
@@ -139,6 +141,17 @@ typedef union
typedef union PgAioSubjectData
{
+ struct
+ {
+ RelFileLocator rlocator; /* physical relation identifier */
+ BlockNumber blockNum; /* blknum relative to begin of reln */
+ int nblocks;
+ ForkNumber forkNum:8; /* don't waste 4 bytes for four values */
+ bool is_temp; /* proc can be inferred by owning AIO */
+ bool release_lock;
+ int8 mode;
+ } smgr;
+
/* just as an example placeholder for later */
struct
{
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 1456ab383a4..e993e1b671f 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -101,6 +101,8 @@ extern PGDLLIMPORT int max_safe_fds;
* prototypes for functions in fd.c
*/
+struct PgAioHandle;
+
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -109,6 +111,8 @@ extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartWriteV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index b72293c79a5..ede77695853 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
#include "storage/smgr.h"
#include "storage/sync.h"
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_md_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_md_writev_cb;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
@@ -36,9 +40,16 @@ extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void mdstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks, bool skipFsync);
+extern void mdstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync);
extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -46,6 +57,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 899d0d681c5..66730bc24fa 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -73,6 +73,11 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
+struct PgAioHandle;
+struct PgAioSubjectInfo;
+
+extern const struct PgAioSubjectInfo aio_smgr_subject_info;
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -97,10 +102,19 @@ extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+extern void smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -109,6 +123,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
@@ -126,4 +141,10 @@ smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
}
+extern void pgaio_io_set_subject_smgr(struct PgAioHandle *ioh,
+ SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks);
+
#endif /* SMGR_H */
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 68e9e80074c..12ab1730f49 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -20,6 +20,7 @@
#include "storage/aio_internal.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -28,9 +29,12 @@ static const PgAioSubjectInfo *aio_subject_info[] = {
[ASI_INVALID] = &(PgAioSubjectInfo) {
.name = "invalid",
},
+ [ASI_SMGR] = &aio_smgr_subject_info,
};
static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
+ [ASC_MD_READV] = &aio_md_readv_cb,
+ [ASC_MD_WRITEV] = &aio_md_writev_cb,
};
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 368cc9455cf..35bf3c1e7bd 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -95,6 +95,7 @@
#include "pgstat.h"
#include "portability/mem.h"
#include "postmaster/startup.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
@@ -1295,6 +1296,8 @@ LruDelete(File file)
vfdP = &VfdCache[file];
+ pgaio_closing_fd(vfdP->fd);
+
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1988,6 +1991,8 @@ FileClose(File file)
if (!FileIsNotOpen(file))
{
+ pgaio_closing_fd(vfdP->fd);
+
/* close the file */
if (close(vfdP->fd) != 0)
{
@@ -2211,6 +2216,32 @@ retry:
return returnCode;
}
+int
+FileStartReadV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
ssize_t
FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
uint32 wait_event_info)
@@ -2316,6 +2347,34 @@ retry:
return returnCode;
}
+int
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ /* FIXME: think about / reimplement temp_file_limit */
+
+ pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
int
FileSync(File file, uint32 wait_event_info)
{
@@ -2499,6 +2558,12 @@ FilePathName(File file)
int
FileGetRawDesc(File file)
{
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
Assert(FileIsValid(file));
return VfdCache[file].fd;
}
@@ -2779,6 +2844,7 @@ FreeDesc(AllocateDesc *desc)
result = closedir(desc->desc.dir);
break;
case AllocateDescRawFD:
+ pgaio_closing_fd(desc->desc.fd);
result = close(desc->desc.fd);
break;
default:
@@ -2847,6 +2913,8 @@ CloseTransientFile(int fd)
/* Only get here if someone passes us a file not in allocatedDescs */
elog(WARNING, "fd passed to CloseTransientFile was not obtained from OpenTransientFile");
+ pgaio_closing_fd(fd);
+
return close(fd);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6cd81a61faa..f96308490d9 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,6 +31,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/md.h"
@@ -931,6 +932,49 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartreadv(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
+
+ FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+}
+
/*
* mdwritev() -- Write the supplied blocks at the appropriate location.
*
@@ -1036,6 +1080,49 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartwritev(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, unconstify(void **, buffers), nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_WRITEV);
+
+ FileStartWriteV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+}
+
/*
* mdwriteback() -- Tell the kernel to write pages back to storage.
@@ -1357,6 +1444,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
+int
+mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ MdfdVec *v;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL);
+
+ *off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ return FileGetRawDesc(v->mdfd_vfd);
+}
+
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1832,3 +1934,118 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
*/
return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
}
+
+
+
+static PgAioResult md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static PgAioResult md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+const struct PgAioHandleSharedCallbacks aio_md_readv_cb = {
+ .complete = md_readv_complete,
+ .error = md_readv_error,
+};
+
+const struct PgAioHandleSharedCallbacks aio_md_writev_cb = {
+ .complete = md_writev_complete,
+};
+
+static PgAioResult
+md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioResult result = prior_result;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_readv_error(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks read a failure */
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ result.error_data = 0;
+
+ md_readv_error(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial reads should be retried at upper level */
+ result.id = ASC_MD_READV;
+ result.status = ARS_PARTIAL;
+ }
+
+ /* AFIXME: post-read portion of mdreadv() */
+
+ return result;
+}
+
+static void
+md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not read blocks %u..%u in file \"%s\": %m",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks - 1,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ }
+ else
+ {
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read blocks %u..%u in file \"%s\": read only %zu of %zu bytes",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks - 1,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum),
+ result.result * (size_t) BLCKSZ,
+ subject_data->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ MemoryContextSwitchTo(oldContext);
+}
+
+
+static PgAioResult
+md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ if (prior_result.status == ARS_ERROR)
+ {
+ /* AFIXME: complain */
+ return prior_result;
+ }
+
+ prior_result.result /= BLCKSZ;
+
+ return prior_result;
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ee31db85eec..2dacb361a4f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,6 +53,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
@@ -93,10 +94,19 @@ typedef struct f_smgr
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+ void (*smgr_startreadv) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
void (*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+ void (*smgr_startwritev) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -104,6 +114,7 @@ typedef struct f_smgr
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+ int (*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -121,12 +132,15 @@ static const f_smgr smgrsw[] = {
.smgr_prefetch = mdprefetch,
.smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
+ .smgr_startreadv = mdstartreadv,
.smgr_writev = mdwritev,
+ .smgr_startwritev = mdstartwritev,
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
.smgr_registersync = mdregistersync,
+ .smgr_fd = mdfd,
}
};
@@ -145,6 +159,14 @@ static void smgrshutdown(int code, Datum arg);
static void smgrdestroy(SMgrRelation reln);
+static void smgr_aio_reopen(PgAioHandle *ioh);
+
+const struct PgAioSubjectInfo aio_smgr_subject_info = {
+ .name = "smgr",
+ .reopen = smgr_aio_reopen,
+};
+
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
* managers.
@@ -620,6 +642,19 @@ smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nblocks);
}
+/*
+ * smgrstartreadv() -- asynchronous version of smgrreadv()
+ */
+void
+smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ smgrsw[reln->smgr_which].smgr_startreadv(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks);
+}
+
/*
* smgrwritev() -- Write the supplied buffers out.
*
@@ -651,6 +686,16 @@ smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
buffers, nblocks, skipFsync);
}
+void
+smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_startwritev(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks, skipFsync);
+}
+
/*
* smgrwriteback() -- Trigger kernel writeback for the supplied range of
* blocks.
@@ -807,6 +852,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+int
+smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ return smgrsw[reln->smgr_which].smgr_fd(reln, forknum, blocknum, off);
+}
+
/*
* AtEOXact_SMgr
*
@@ -835,3 +886,43 @@ ProcessBarrierSmgrRelease(void)
smgrreleaseall();
return true;
}
+
+void
+pgaio_io_set_subject_smgr(PgAioHandle *ioh,
+ struct SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+
+ pgaio_io_set_subject(ioh, ASI_SMGR);
+
+ /* backend is implied via IO owner */
+ sd->smgr.rlocator = smgr->smgr_rlocator.locator;
+ sd->smgr.forkNum = forknum;
+ sd->smgr.blockNum = blocknum;
+ sd->smgr.nblocks = nblocks;
+ sd->smgr.is_temp = SmgrIsTemp(smgr);
+ sd->smgr.release_lock = false;
+ sd->smgr.mode = RBM_NORMAL;
+}
+
+static void
+smgr_aio_reopen(PgAioHandle *ioh)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioOpData *od = pgaio_io_get_op_data(ioh);
+ SMgrRelation reln;
+ ProcNumber procno;
+ uint32 off;
+
+ if (sd->smgr.is_temp)
+ procno = pgaio_io_get_owner(ioh);
+ else
+ procno = INVALID_PROC_NUMBER;
+
+ reln = smgropen(sd->smgr.rlocator, procno);
+ od->read.fd = smgrfd(reln, sd->smgr.forkNum, sd->smgr.blockNum, &off);
+ Assert(off == od->read.offset);
+}
--
2.45.2.827.g557ae147e6
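Aside: the mdstartreadv()/mdstartwritev() additions above rely on the existing buffers_to_iovec() helper to turn the array of buffer pointers into as few iovec entries as possible before handing the IO to FileStartReadV()/FileStartWriteV(), which is why they can assert `iovcnt <= nblocks_this_segment`. As a rough illustration of that merging step, here is a self-contained sketch; the name buffers_to_iovec_sketch() and the exact merging rule are assumptions based on how the patch uses the helper, not the actual PostgreSQL implementation.

```c
#include <stddef.h>
#include <sys/uio.h>

#define BLCKSZ 8192

/*
 * Sketch of a buffers_to_iovec()-style helper: collapse an array of
 * BLCKSZ-sized buffer pointers into iovec entries, merging buffers that
 * happen to be adjacent in memory so that a single preadv()/pwritev()
 * call can cover them.  Assumed behavior, not the real implementation.
 */
static int
buffers_to_iovec_sketch(struct iovec *iov, void **buffers, int nblocks)
{
	int iovcnt = 0;

	for (int i = 0; i < nblocks; i++)
	{
		char *buf = buffers[i];

		if (iovcnt > 0 &&
			(char *) iov[iovcnt - 1].iov_base + iov[iovcnt - 1].iov_len == buf)
		{
			/* physically contiguous with the previous entry: extend it */
			iov[iovcnt - 1].iov_len += BLCKSZ;
		}
		else
		{
			/* start a new iovec entry */
			iov[iovcnt].iov_base = buf;
			iov[iovcnt].iov_len = BLCKSZ;
			iovcnt++;
		}
	}

	return iovcnt;
}
```

Four buffers carved out of one contiguous allocation collapse into a single iovec entry, while scattered buffers keep one entry each; the returned count is therefore at most nblocks, matching the assertion pattern in the md.c code above.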
Attachment: v2.0-0011-bufmgr-Implement-AIO-support.patch (text/x-diff, us-ascii)
From 9d8c6210e3a5e39d585d0a8ebebeac8a9e9b62a2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:01 -0400
Subject: [PATCH v2.0 11/17] bufmgr: Implement AIO support
As of this commit there are no users of these AIO facilities, that'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 6 +
src/include/storage/buf_internals.h | 6 +
src/include/storage/bufmgr.h | 10 +
src/backend/storage/aio/aio_subject.c | 5 +
src/backend/storage/buffer/buf_init.c | 3 +
src/backend/storage/buffer/bufmgr.c | 431 +++++++++++++++++++++++++-
src/backend/storage/buffer/localbuf.c | 65 ++++
7 files changed, 519 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index acfd50c587c..40c80a2fed4 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -93,6 +93,12 @@ typedef enum PgAioHandleSharedCallbackID
{
ASC_MD_READV,
ASC_MD_WRITEV,
+
+ ASC_SHARED_BUFFER_READ,
+ ASC_SHARED_BUFFER_WRITE,
+
+ ASC_LOCAL_BUFFER_READ,
+ ASC_LOCAL_BUFFER_WRITE,
} PgAioHandleSharedCallbackID;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f190e6e5e46..5cfa7dbd1f1 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -17,6 +17,7 @@
#include "pgstat.h"
#include "port/atomics.h"
+#include "storage/aio_ref.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -252,6 +253,8 @@ typedef struct BufferDesc
int wait_backend_pgprocno; /* backend of pin-count waiter */
int freeNext; /* link in freelist chain */
+
+ PgAioHandleRef io_in_progress;
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
@@ -465,4 +468,7 @@ extern void DropRelationLocalBuffers(RelFileLocator rlocator,
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
extern void AtEOXact_LocalBuffers(bool isCommit);
+
+extern bool ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..6cd64b8c2b3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -177,6 +177,14 @@ extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
extern PGDLLIMPORT int32 *LocalRefCount;
+
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_read_cb;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_write_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_read_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_write_cb;
+
+
/* upper limit for effective_io_concurrency */
#define MAX_IO_CONCURRENCY 1000
@@ -194,6 +202,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+struct PgAioHandle;
+
extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 12ab1730f49..0676f3d3a66 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -35,6 +35,11 @@ static const PgAioSubjectInfo *aio_subject_info[] = {
static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
[ASC_MD_READV] = &aio_md_readv_cb,
[ASC_MD_WRITEV] = &aio_md_writev_cb,
+
+ [ASC_SHARED_BUFFER_READ] = &aio_shared_buffer_read_cb,
+ [ASC_SHARED_BUFFER_WRITE] = &aio_shared_buffer_write_cb,
+ [ASC_LOCAL_BUFFER_READ] = &aio_local_buffer_read_cb,
+ [ASC_LOCAL_BUFFER_WRITE] = &aio_local_buffer_write_cb,
};
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 09bec6449b6..059a80dfb13 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
@@ -126,6 +127,8 @@ BufferManagerShmemInit(void)
buf->buf_id = i;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Initially link all the buffers together as unused. Subsequent
* management of this list is done by freelist.c.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f2e608f597d..8feafd6e53c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -58,6 +59,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -541,7 +543,8 @@ static int SyncOneBuffer(int buf_id, bool skip_recently_used,
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
- uint32 set_flag_bits, bool forget_owner);
+ uint32 set_flag_bits, bool forget_owner,
+ bool syncio);
static void AbortBufferIO(Buffer buffer);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -1108,7 +1111,7 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
}
else if (!isLocalBuf)
@@ -1593,7 +1596,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
/* Report I/Os as completing individually. */
@@ -2477,7 +2480,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
if (lock)
LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
- TerminateBufferIO(buf_hdr, false, BM_VALID, true);
+ TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
}
pgBufferUsage.shared_blks_written += extend_by;
@@ -3926,7 +3929,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
* end the BM_IO_IN_PROGRESS state.
*/
- TerminateBufferIO(buf, true, 0, true);
+ TerminateBufferIO(buf, true, 0, true, true);
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
buf->tag.blockNum,
@@ -5541,6 +5544,7 @@ WaitIO(BufferDesc *buf)
for (;;)
{
uint32 buf_state;
+ PgAioHandleRef ior;
/*
* It may not be necessary to acquire the spinlock to check the flag
@@ -5548,10 +5552,19 @@ WaitIO(BufferDesc *buf)
* play it safe.
*/
buf_state = LockBufHdr(buf);
+ ior = buf->io_in_progress;
UnlockBufHdr(buf, buf_state);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
+
+ if (pgaio_io_ref_valid(&ior))
+ {
+ pgaio_io_ref_wait(&ior);
+ ConditionVariablePrepareToSleep(cv);
+ continue;
+ }
+
ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
}
ConditionVariableCancelSleep();
@@ -5640,7 +5653,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
*/
static void
TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
- bool forget_owner)
+ bool forget_owner, bool syncio)
{
uint32 buf_state;
@@ -5652,6 +5665,13 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
+ if (!syncio)
+ {
+ /* release ownership by the AIO subsystem */
+ buf_state -= BUF_REFCOUNT_ONE;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+ }
+
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
@@ -5660,6 +5680,40 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
+
+ /*
+ * If we just released a pin, need to do BM_PIN_COUNT_WAITER handling.
+ * Most of the time the current backend will hold another pin preventing
+ * that from happening, but that's e.g. not the case when completing an IO
+ * another backend started.
+ *
+ * AFIXME: Deduplicate with UnpinBufferNoOwner() or just replace
+ * BM_PIN_COUNT_WAITER with something saner.
+ */
+ /* Support LockBufferForCleanup() */
+ if (buf_state & BM_PIN_COUNT_WAITER)
+ {
+ /*
+ * Acquire the buffer header lock, re-check that there's a waiter.
+ * Another backend could have unpinned this buffer, and already woken
+ * up the waiter. There's no danger of the buffer being replaced
+ * after we unpinned it above, as it's pinned by the waiter.
+ */
+ buf_state = LockBufHdr(buf);
+
+ if ((buf_state & BM_PIN_COUNT_WAITER) &&
+ BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* we just released the last pin other than the waiter's */
+ int wait_backend_pgprocno = buf->wait_backend_pgprocno;
+
+ buf_state &= ~BM_PIN_COUNT_WAITER;
+ UnlockBufHdr(buf, buf_state);
+ ProcSendSignal(wait_backend_pgprocno);
+ }
+ else
+ UnlockBufHdr(buf, buf_state);
+ }
}
/*
@@ -5711,7 +5765,7 @@ AbortBufferIO(Buffer buffer)
}
}
- TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false);
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
}
/*
@@ -6170,3 +6224,366 @@ EvictUnpinnedBuffer(Buffer buf)
return result;
}
+
+static bool
+ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *bufHdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ blockno = bufHdr->tag.blockNum;
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_VALID));
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(!(buf_state & BM_DIRTY));
+ }
+#endif
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);
+ BlockNumber forkNum = bufHdr->tag.forkNum;
+
+ /* AFIXME: relpathperm allocates memory */
+ MemoryContextSwitchTo(ErrorContext);
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ TerminateBufferIO(bufHdr, false,
+ failed ? BM_IO_ERROR : BM_VALID,
+ false, false);
+
+ /* Report I/Os as completing individually. */
+
+ /* FIXME: Should we do TRACE_POSTGRESQL_BUFFER_READ_DONE here? */
+ return buf_failed;
+}
+
+static bool
+ReadBufferCompleteWriteShared(Buffer buffer, bool release_lock, bool failed)
+{
+ BufferDesc *bufHdr;
+ bool result = false;
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(buf_state & BM_DIRTY);
+ }
+#endif
+
+ /* AFIXME: implement track_io_timing */
+
+ TerminateBufferIO(bufHdr, /* clear_dirty = */ true,
+ failed ? BM_IO_ERROR : 0,
+ /* forget_owner = */ false,
+ /* syncio = */ false);
+
+ /*
+ * The initiator of the IO is no longer managing the content lock (it
+ * called LWLockReleaseOwnership()); we are, so release it here.
+ */
+ if (release_lock)
+ LWLockReleaseUnowned(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+
+ return result;
+}
+
+static void
+shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+ BufferTag first PG_USED_FOR_ASSERTS_ONLY = {0};
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(buf - 1);
+
+ if (i == 0)
+ first = bufHdr->tag;
+ else
+ {
+ Assert(bufHdr->tag.relNumber == first.relNumber);
+ Assert(bufHdr->tag.blockNum == first.blockNum + i);
+ }
+
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(buf_state & BM_TAG_VALID);
+ if (is_write)
+ {
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_DIRTY);
+ }
+ else
+ Assert(!(buf_state & BM_VALID));
+
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
+
+ buf_state += BUF_REFCOUNT_ONE;
+ bufHdr->io_in_progress = io_ref;
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ if (is_write)
+ {
+ LWLock *content_lock;
+
+ content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+ Assert(LWLockHeldByMe(content_lock));
+
+ /*
+ * Lock now owned by IO.
+ */
+ LWLockReleaseOwnership(content_lock);
+ }
+
+ /*
+ * Stop tracking this buffer via the resowner - the AIO system now
+ * keeps track.
+ */
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner, buf);
+ }
+}
+
+static void
+shared_buffer_read_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, false);
+}
+
+static void
+shared_buffer_write_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, true);
+}
+
+
+static PgAioResult
+shared_buffer_read_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ elog(DEBUG3, "calling ReadBufferCompleteReadShared for buffer %d with failed %d, error: %d, result: %d, data_off: %d",
+ buf, failed, prior_result.status, prior_result.result, io_data_off);
+
+ /*
+ * AFIXME: It'd probably be better to not set BM_IO_ERROR (which is
+ * what failed = true leads to) when it's just a short read...
+ */
+ buf_failed = ReadBufferCompleteReadShared(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_SHARED_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+shared_buffer_read_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ subject_data->smgr.blockNum + result.error_data,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ MemoryContextSwitchTo(oldContext);
+}
+
+static PgAioResult
+shared_buffer_write_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ /* FIXME: handle outright errors */
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+
+ /* FIXME: handle short writes / failures */
+ /* FIXME: ioh->scb_data.shared_buffer.release_lock */
+ ReadBufferCompleteWriteShared(buf,
+ true,
+ false);
+
+ }
+
+ return result;
+}
+
+static void
+local_buffer_read_prepare(PgAioHandle *ioh)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetLocalBufferDescriptor(-buf - 1);
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ bufHdr->io_in_progress = io_ref;
+ LocalRefCount[-buf - 1] += 1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
+static PgAioResult
+local_buffer_read_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ /* FIXME: error handling */
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+
+ buf_failed = ReadBufferCompleteReadLocal(buf,
+ mode,
+ false);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_LOCAL_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+local_buffer_write_prepare(PgAioHandle *ioh)
+{
+ elog(ERROR, "not yet");
+}
+
+
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_read_cb = {
+ .prepare = shared_buffer_read_prepare,
+ .complete = shared_buffer_read_complete,
+ .error = shared_buffer_read_error,
+};
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_write_cb = {
+ .prepare = shared_buffer_write_prepare,
+ .complete = shared_buffer_write_complete,
+};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_read_cb = {
+ .prepare = local_buffer_read_prepare,
+ .complete = local_buffer_read_complete,
+};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_write_cb = {
+ .prepare = local_buffer_write_prepare,
+};
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8da7dd6c98a..a7eb723f1e9 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "executor/instrument.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -620,6 +621,8 @@ InitLocalBuffers(void)
*/
buf->buf_id = -i - 2;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Intentionally do not initialize the buffer's atomic variable
* (besides zeroing the underlying memory above). That way we get
@@ -836,3 +839,65 @@ AtProcExit_LocalBuffers(void)
*/
CheckForLocalBufferLeaks();
}
+
+bool
+ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *buf_hdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ buf_hdr = GetLocalBufferDescriptor(-buffer - 1);
+ blockno = buf_hdr->tag.blockNum;
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&buf_hdr->tag);
+ BlockNumber forkNum = buf_hdr->tag.forkNum;
+
+ MemoryContextSwitchTo(ErrorContext);
+
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ pgaio_io_ref_clear(&buf_hdr->io_in_progress);
+
+ {
+ uint32 buf_state;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ /* release pin held by IO subsystem */
+ LocalRefCount[-buffer - 1] -= 1;
+
+ return buf_failed;
+}
--
2.45.2.827.g557ae147e6

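An aside on the completion callbacks in the patch above: for a combined, multi-buffer read, `shared_buffer_read_complete()` remembers only the first failing buffer, storing its offset plus one in `error_data` (presumably so that zero can mean "no per-buffer failure"). A simplified, self-contained sketch of that encoding - helper names are mine, not the patch's, and the decode step assumes the `+ 1` really is a sentinel offset:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the per-buffer error encoding in the read completion callback:
 * a combined read covers N buffers; if the buffer at offset `off` fails
 * verification, error_data records off + 1, so 0 can mean "no failure".
 * Only the first failure is remembered, matching the
 * result.status != ARS_ERROR check in the patch.
 */
typedef struct
{
	int			status;			/* 0 = ok, 1 = error */
	uint32_t	error_data;		/* failing offset + 1, or 0 */
} SketchResult;

static void
sketch_mark_failed(SketchResult *result, int io_data_off)
{
	if (result->status == 0)	/* first failure wins */
	{
		result->status = 1;
		result->error_data = (uint32_t) io_data_off + 1;
	}
}

/*
 * Recover the failing block from the operation's start block, assuming
 * the + 1 is a "zero means none" sentinel that must be subtracted again.
 */
static uint32_t
sketch_failing_block(uint32_t start_block, const SketchResult *result)
{
	assert(result->error_data > 0);
	return start_block + (result->error_data - 1);
}
```

Note that the error callback in the patch adds `error_data` to `blockNum` without subtracting the sentinel, which would report the block after the failing one; the sketch above shows the decode I would have expected.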
Attachment: v2.0-0012-bufmgr-Use-aio-for-StartReadBuffers.patch (text/x-diff; charset=us-ascii)
From fe6df768de29f124263f6fe250017f04454ca699 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:55:59 -0400
Subject: [PATCH v2.0 12/17] bufmgr: Use aio for StartReadBuffers()
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/bufmgr.h | 25 ++-
src/backend/storage/buffer/bufmgr.c | 259 +++++++++++++++++-----------
2 files changed, 182 insertions(+), 102 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6cd64b8c2b3..a075a40b2ed 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -15,6 +15,7 @@
#define BUFMGR_H
#include "port/pg_iovec.h"
+#include "storage/aio_ref.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -107,11 +108,22 @@ typedef struct BufferManagerRelation
#define BMR_REL(p_rel) ((BufferManagerRelation){.rel = p_rel})
#define BMR_SMGR(p_smgr, p_relpersistence) ((BufferManagerRelation){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+
+
/* Zero out page if reading fails. */
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/*
+ * FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
+ * need to include that here. Perhaps this could live in a separate header?
+ */
+#include "storage/aio.h"
+
struct ReadBuffersOperation
{
/* The following members should be set by the caller. */
@@ -131,6 +143,17 @@ struct ReadBuffersOperation
int flags;
int16 nblocks;
int16 io_buffers_len;
+
+ /*
+ * In some rare-ish cases one operation causes multiple reads (e.g. if a
+ * buffer was concurrently read by another backend). It'd be much better
+ * if we ensured that each ReadBuffersOperation covered only one IO - but
+ * that's not entirely trivial, due to having pinned victim buffers before
+ * starting IOs.
+ */
+ int16 nios;
+ PgAioHandleRef refs[MAX_IO_COMBINE_LIMIT];
+ PgAioReturn returns[MAX_IO_COMBINE_LIMIT];
};
typedef struct ReadBuffersOperation ReadBuffersOperation;
@@ -161,8 +184,6 @@ extern PGDLLIMPORT bool track_io_timing;
extern PGDLLIMPORT int effective_io_concurrency;
extern PGDLLIMPORT int maintenance_io_concurrency;
-#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
-#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
extern PGDLLIMPORT int io_combine_limit;
extern PGDLLIMPORT int checkpoint_flush_after;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8feafd6e53c..90e873d278f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1280,6 +1280,12 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+static bool AsyncReadBuffers(ReadBuffersOperation *operation,
+ Buffer *buffers,
+ BlockNumber blockNum,
+ int *nblocks,
+ int flags);
+
static pg_attribute_always_inline bool
StartReadBuffersImpl(ReadBuffersOperation *operation,
Buffer *buffers,
@@ -1315,6 +1321,12 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
* so we stop here.
*/
actual_nblocks = i + 1;
+
+ ereport(DEBUG3,
+ errmsg("found buf %d, idx %i: %s, data %p",
+ buffers[i], i, DebugPrintBufferRefcount(buffers[i]),
+ BufferGetBlock(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
break;
}
else
@@ -1352,27 +1364,18 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
operation->nblocks = actual_nblocks;
operation->io_buffers_len = io_buffers_len;
- if (flags & READ_BUFFERS_ISSUE_ADVICE)
- {
- /*
- * In theory we should only do this if PinBufferForBlock() had to
- * allocate new buffers above. That way, if two calls to
- * StartReadBuffers() were made for the same blocks before
- * WaitReadBuffers(), only the first would issue the advice. That'd be
- * a better simulation of true asynchronous I/O, which would only
- * start the I/O once, but isn't done here for simplicity. Note also
- * that the following call might actually issue two advice calls if we
- * cross a segment boundary; in a true asynchronous version we might
- * choose to process only one real I/O at a time in that case.
- */
- smgrprefetch(operation->smgr,
- operation->forknum,
- blockNum,
- operation->io_buffers_len);
- }
+ operation->nios = 0;
- /* Indicate that WaitReadBuffers() should be called. */
- return true;
+ /*
+ * TODO: When called for synchronous IO execution, we probably should
+ * enter a dedicated fastpath here.
+ */
+
+ /* initiate the IO */
+ return AsyncReadBuffers(operation,
+ buffers,
+ blockNum,
+ nblocks, flags);
}
/*
@@ -1424,12 +1427,31 @@ StartReadBuffer(ReadBuffersOperation *operation,
}
static inline bool
-WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+ReadBuffersCanStartIO(Buffer buffer, bool nowait)
{
if (BufferIsLocal(buffer))
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ /*
+ * The buffer could have IO in progress by another scan. Right now
+ * localbuf.c doesn't use IO_IN_PROGRESS, which is why we need this
+ * hack.
+ *
+ * AFIXME: localbuf.c should use IO_IN_PROGRESS / have an equivalent
+ * of StartBufferIO().
+ */
+ if (pgaio_io_ref_valid(&bufHdr->io_in_progress))
+ {
+ PgAioHandleRef ior = bufHdr->io_in_progress;
+
+ ereport(DEBUG3,
+ errmsg("waiting for temp buffer IO in CSIO"),
+ errhidestmt(true), errhidecontext(true));
+ pgaio_io_ref_wait(&ior);
+ return false;
+ }
+
return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
else
@@ -1439,12 +1461,7 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
void
WaitReadBuffers(ReadBuffersOperation *operation)
{
- Buffer *buffers;
int nblocks;
- BlockNumber blocknum;
- ForkNumber forknum;
- IOContext io_context;
- IOObject io_object;
char persistence;
/*
@@ -1460,11 +1477,65 @@ WaitReadBuffers(ReadBuffersOperation *operation)
if (nblocks == 0)
return; /* nothing to do */
+ persistence = operation->persistence;
+
+ Assert(operation->nios > 0);
+
+ for (int i = 0; i < operation->nios; i++)
+ {
+ PgAioReturn *aio_ret;
+
+ pgaio_io_ref_wait(&operation->refs[i]);
+
+ aio_ret = &operation->returns[i];
+
+ if (aio_ret->result.status != ARS_OK)
+ pgaio_result_log(aio_ret->result, &aio_ret->subject_data, ERROR);
+ }
+
+ /*
+ * We count all these blocks as read by this backend. This is traditional
+ * behavior, but might turn out to be not true if we find that someone
+ * else has beaten us and completed the read of some of these blocks. In
+ * that case the system globally double-counts, but we traditionally don't
+ * count this as a "hit", and we don't have a separate counter for "miss,
+ * but another backend completed the read".
+ */
+ if (persistence == RELPERSISTENCE_TEMP)
+ pgBufferUsage.local_blks_read += nblocks;
+ else
+ pgBufferUsage.shared_blks_read += nblocks;
+
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * nblocks;
+
+ /* FIXME: io timing */
+ /* FIXME: READ_DONE tracepoint */
+}
+
+static bool
+AsyncReadBuffers(ReadBuffersOperation *operation,
+ Buffer *buffers,
+ BlockNumber blockNum,
+ int *nblocks,
+ int flags)
+{
+ int io_buffers_len = 0;
+ BlockNumber blocknum;
+ ForkNumber forknum;
+ IOContext io_context;
+ IOObject io_object;
+ char persistence;
+ bool did_start_io_overall = false;
+ PgAioHandle *ioh = NULL;
+
buffers = &operation->buffers[0];
blocknum = operation->blocknum;
forknum = operation->forknum;
- persistence = operation->persistence;
+ persistence = operation->rel
+ ? operation->rel->rd_rel->relpersistence
+ : RELPERSISTENCE_PERMANENT;
if (persistence == RELPERSISTENCE_TEMP)
{
io_context = IOCONTEXT_NORMAL;
@@ -1485,25 +1556,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* but another backend completed the read".
*/
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += nblocks;
+ pgBufferUsage.local_blks_read += *nblocks;
else
- pgBufferUsage.shared_blks_read += nblocks;
+ pgBufferUsage.shared_blks_read += *nblocks;
- for (int i = 0; i < nblocks; ++i)
+ for (int i = 0; i < *nblocks; ++i)
{
- int io_buffers_len;
- Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
void *io_pages[MAX_IO_COMBINE_LIMIT];
- instr_time io_start;
+ Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
BlockNumber io_first_block;
+ bool did_start_io_this = false;
+
+ /*
+ * Get IO before ReadBuffersCanStartIO, as pgaio_io_get() might block,
+ * which we don't want after setting IO_IN_PROGRESS.
+ */
+ if (likely(!ioh))
+ ioh = pgaio_io_get(CurrentResourceOwner, &operation->returns[operation->nios]);
/*
* Skip this block if someone else has already completed it. If an
* I/O is already in progress in another backend, this will wait for
* the outcome: either done, or something went wrong and we will
* retry.
+ *
+ * ATODO: Should we wait if we already submitted another IO?
*/
- if (!WaitReadBuffersCanStartIO(buffers[i], false))
+ if (!ReadBuffersCanStartIO(buffers[i], did_start_io_overall))
{
/*
* Report this as a 'hit' for this backend, even though it must
@@ -1515,6 +1594,10 @@ WaitReadBuffers(ReadBuffersOperation *operation)
operation->smgr->smgr_rlocator.locator.relNumber,
operation->smgr->smgr_rlocator.backend,
true);
+
+ ereport(DEBUG3,
+ errmsg("can't start io for first buffer %u", buffers[i]),
+ errhidestmt(true), errhidecontext(true));
continue;
}
@@ -1524,6 +1607,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_first_block = blocknum + i;
io_buffers_len = 1;
+ ereport(DEBUG3,
+ errmsg("first prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(io_buffers[0]), i),
+ errhidestmt(true), errhidecontext(true));
+
/*
* How many neighboring-on-disk blocks can we can scatter-read into
* other buffers at the same time? In this case we don't wait if we
@@ -1531,86 +1619,57 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* for the head block, so we should get on with that I/O as soon as
* possible. We'll come back to this block again, above.
*/
- while ((i + 1) < nblocks &&
- WaitReadBuffersCanStartIO(buffers[i + 1], true))
+ while ((i + 1) < *nblocks &&
+ ReadBuffersCanStartIO(buffers[i + 1], true))
{
/* Must be consecutive block numbers. */
Assert(BufferGetBlockNumber(buffers[i + 1]) ==
BufferGetBlockNumber(buffers[i]) + 1);
+ ereport(DEBUG3,
+ errmsg("seq prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(buffers[i + 1]),
+ i + 1),
+ errhidestmt(true), errhidecontext(true));
+
io_buffers[io_buffers_len] = buffers[++i];
io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
}
- io_start = pgstat_prepare_io_time(track_io_timing);
- smgrreadv(operation->smgr, forknum, io_first_block, io_pages, io_buffers_len);
- pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
- io_buffers_len);
+ pgaio_io_get_ref(ioh, &operation->refs[operation->nios]);
- /* Verify each block we read, and terminate the I/O. */
- for (int j = 0; j < io_buffers_len; ++j)
+ pgaio_io_set_io_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
+ if (persistence == RELPERSISTENCE_TEMP)
{
- BufferDesc *bufHdr;
- Block bufBlock;
-
- if (persistence == RELPERSISTENCE_TEMP)
- {
- bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
- bufBlock = LocalBufHdrGetBlock(bufHdr);
- }
- else
- {
- bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
- bufBlock = BufHdrGetBlock(bufHdr);
- }
-
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- memset(bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- }
-
- /* Terminate I/O and set BM_VALID. */
- if (persistence == RELPERSISTENCE_TEMP)
- {
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
- }
-
- /* Report I/Os as completing individually. */
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
- operation->smgr->smgr_rlocator.locator.spcOid,
- operation->smgr->smgr_rlocator.locator.dbOid,
- operation->smgr->smgr_rlocator.locator.relNumber,
- operation->smgr->smgr_rlocator.backend,
- false);
+ pgaio_io_add_shared_cb(ioh, ASC_LOCAL_BUFFER_READ);
+ pgaio_io_set_flag(ioh, AHF_REFERENCES_LOCAL);
}
+ else
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ did_start_io_overall = did_start_io_this = true;
+ smgrstartreadv(ioh, operation->smgr, forknum, io_first_block,
+ io_pages, io_buffers_len);
+ ioh = NULL;
+ operation->nios++;
+
+ /* not obvious what we'd use for time */
+ pgstat_count_io_op_n(io_object, io_context, IOOP_READ, io_buffers_len);
}
+
+ if (ioh)
+ {
+ pgaio_io_release(ioh);
+ ioh = NULL;
+ }
+
+ if (did_start_io_overall)
+ {
+ pgaio_submit_staged();
+ return true;
+ }
+ else
+ return false;
}
/*
--
2.45.2.827.g557ae147e6
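To make the new `AsyncReadBuffers()` loop above easier to follow: it walks the operation's buffers, skips any block another backend already completed, and merges maximal runs of consecutively startable blocks into single IOs. A simplified model of just the grouping logic (the real code is additionally bounded by the arrays sized `MAX_IO_COMBINE_LIMIT` and by the caller; names here are mine):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the IO-combining loop in AsyncReadBuffers(): skip blocks that
 * can't start IO (someone else completed them), merge runs of startable
 * consecutive blocks into one IO each, capped at combine_limit blocks.
 * Returns the number of IOs that would be started.
 */
static int
sketch_count_ios(const bool *can_start, int nblocks, int combine_limit)
{
	int			nios = 0;

	for (int i = 0; i < nblocks; i++)
	{
		int			len;

		if (!can_start[i])
			continue;			/* counted as a "hit" in the patch */

		/* extend the IO while neighboring blocks are startable too */
		len = 1;
		while (i + 1 < nblocks && len < combine_limit && can_start[i + 1])
		{
			i++;
			len++;
		}
		nios++;
	}
	return nios;
}
```

In the patch each such run gets its own AIO handle, the IO data is attached via `pgaio_io_set_io_data_32()`, and `smgrstartreadv()` stages it; staged IOs are then submitted in one `pgaio_submit_staged()` call at the end unless the caller asked to batch further.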
Attachment: v2.0-0013-aio-Very-WIP-read_stream.c-adjustments-for-real.patch (text/x-diff; charset=us-ascii)
From 6fcd84b237df81097a4271198e380dd82c76757b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:30 -0400
Subject: [PATCH v2.0 13/17] aio: Very-WIP: read_stream.c adjustments for real
AIO
---
src/include/storage/bufmgr.h | 2 ++
src/backend/storage/aio/read_stream.c | 29 +++++++++++++++++++++------
src/backend/storage/buffer/bufmgr.c | 3 ++-
3 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a075a40b2ed..ac6496bb1eb 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -117,6 +117,8 @@ typedef struct BufferManagerRelation
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 2)
/*
* FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 93cdd35fea0..42b2434918b 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -90,6 +90,7 @@
#include "catalog/pg_tablespace.h"
#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/smgr.h"
#include "storage/read_stream.h"
@@ -223,14 +224,18 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
/*
* If advice hasn't been suppressed, this system supports it, and this
* isn't a strictly sequential pattern, then we'll issue advice.
+ *
+ * XXX: Used to also check stream->pending_read_blocknum !=
+ * stream->seq_blocknum
*/
if (!suppress_advice &&
- stream->advice_enabled &&
- stream->pending_read_blocknum != stream->seq_blocknum)
+ stream->advice_enabled)
flags = READ_BUFFERS_ISSUE_ADVICE;
else
flags = 0;
+ flags |= READ_BUFFERS_MORE_MORE_MORE;
+
/* We say how many blocks we want to read, but may be smaller on return. */
buffer_index = stream->next_buffer_index;
io_index = stream->next_io_index;
@@ -289,6 +294,14 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
static void
read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
+ if (stream->distance > (io_combine_limit * 8))
+ {
+ if (stream->pinned_buffers + stream->pending_read_nblocks > ((stream->distance * 3) / 4))
+ {
+ return;
+ }
+ }
+
while (stream->ios_in_progress < stream->max_ios &&
stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
{
@@ -338,6 +351,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
/* And we've hit the limit. Rewind, and stop here. */
read_stream_unget_block(stream, blocknum);
+ pgaio_submit_staged();
return;
}
}
@@ -362,6 +376,8 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
stream->distance == 0) &&
stream->ios_in_progress < stream->max_ios)
read_stream_start_pending_read(stream, suppress_advice);
+
+ pgaio_submit_staged();
}
/*
@@ -476,10 +492,11 @@ read_stream_begin_impl(int flags,
* direct I/O isn't enabled, the caller hasn't promised sequential access
* (overriding our detection heuristics), and max_ios hasn't been set to
* zero.
+ *
+ * FIXME: Used to also check (io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ * (flags & READ_STREAM_SEQUENTIAL) == 0
*/
- if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
- (flags & READ_STREAM_SEQUENTIAL) == 0 &&
- max_ios > 0)
+ if (max_ios > 0)
stream->advice_enabled = true;
#endif
@@ -710,7 +727,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
if (++stream->oldest_io_index == stream->max_ios)
stream->oldest_io_index = 0;
- if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+ if (stream->ios[io_index].op.flags & (READ_BUFFERS_ISSUE_ADVICE | READ_BUFFERS_MORE_MORE_MORE))
{
/* Distance ramps up fast (behavior C). */
distance = stream->distance * 2;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90e873d278f..59f4b22457d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1665,7 +1665,8 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
if (did_start_io_overall)
{
- pgaio_submit_staged();
+ if (!(flags & READ_BUFFERS_MORE_MORE_MORE))
+ pgaio_submit_staged();
return true;
}
else
--
2.45.2.827.g557ae147e6
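The throttle added at the top of `read_stream_look_ahead()` above is easy to miss in the diff: once the stream's distance exceeds 8x the combine limit, look-ahead is deferred while the work already in flight (pinned buffers plus the pending read) covers more than 3/4 of the distance. The constants mirror the patch; the standalone predicate is my restatement:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the look-ahead throttle in read_stream_look_ahead(): with a
 * large distance, don't queue yet more reads while most of the distance
 * is already covered by pinned buffers and the pending read.
 */
static bool
sketch_should_defer_lookahead(int distance, int pinned_buffers,
							  int pending_read_nblocks, int io_combine_limit)
{
	return distance > io_combine_limit * 8 &&
		pinned_buffers + pending_read_nblocks > (distance * 3) / 4;
}
```

With the default `io_combine_limit` of 16 (128kB / 8kB blocks), the throttle only engages once the distance exceeds 128 blocks.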
Attachment: v2.0-0014-aio-Add-IO-queue-helper.patch (text/x-diff; charset=us-ascii)
From 2df34d8ac4fa381da607358ad3d214aadd05fdc7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 23 Jul 2024 10:00:06 -0700
Subject: [PATCH v2.0 14/17] aio: Add IO queue helper
This is likely never going anywhere - Thomas Munro is working on something
more complete. But I needed a way to exercise AIO for checkpointer / bgwriter.
---
src/include/storage/io_queue.h | 33 +++++
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/io_queue.c | 195 ++++++++++++++++++++++++++++
src/backend/storage/aio/meson.build | 1 +
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 232 insertions(+)
create mode 100644 src/include/storage/io_queue.h
create mode 100644 src/backend/storage/aio/io_queue.c
diff --git a/src/include/storage/io_queue.h b/src/include/storage/io_queue.h
new file mode 100644
index 00000000000..28077158d6d
--- /dev/null
+++ b/src/include/storage/io_queue.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.h
+ * Mechanism for tracking many IOs
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_queue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_QUEUE_H
+#define IO_QUEUE_H
+
+#include "storage/bufmgr.h"
+
+struct IOQueue;
+typedef struct IOQueue IOQueue;
+
+struct PgAioHandleRef;
+
+extern IOQueue *io_queue_create(int depth, int flags);
+extern void io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior);
+extern void io_queue_wait_one(IOQueue *ioq);
+extern void io_queue_wait_all(IOQueue *ioq);
+extern bool io_queue_is_empty(IOQueue *ioq);
+extern void io_queue_reserve(IOQueue *ioq);
+extern struct PgAioHandle *io_queue_get_io(IOQueue *ioq);
+extern void io_queue_free(IOQueue *ioq);
+
+#endif /* IO_QUEUE_H */
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 2a5e72a8024..3fb527ed0d1 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -13,6 +13,7 @@ OBJS = \
aio_io.o \
aio_init.o \
aio_subject.o \
+ io_queue.o \
method_worker.o \
method_io_uring.o \
read_stream.o
diff --git a/src/backend/storage/aio/io_queue.c b/src/backend/storage/aio/io_queue.c
new file mode 100644
index 00000000000..4dda2f4e20e
--- /dev/null
+++ b/src/backend/storage/aio/io_queue.c
@@ -0,0 +1,195 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.c
+ * Mechanism for tracking many IOs
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/io_queue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/io_queue.h"
+
+#include "storage/aio.h"
+
+
+typedef struct TrackedIO
+{
+ PgAioHandleRef ior;
+ dlist_node node;
+} TrackedIO;
+
+struct IOQueue
+{
+ int depth;
+ int unsubmitted;
+
+ bool has_reserved;
+
+ dclist_head idle;
+ dclist_head in_progress;
+
+ TrackedIO tracked_ios[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
+IOQueue *
+io_queue_create(int depth, int flags)
+{
+ size_t sz;
+ IOQueue *ioq;
+
+ sz = offsetof(IOQueue, tracked_ios)
+ + sizeof(TrackedIO) * depth;
+
+ ioq = palloc0(sz);
+
+ ioq->depth = 0;
+
+ for (int i = 0; i < depth; i++)
+ {
+ TrackedIO *tio = &ioq->tracked_ios[i];
+
+ pgaio_io_ref_clear(&tio->ior);
+ dclist_push_tail(&ioq->idle, &tio->node);
+ }
+
+ return ioq;
+}
+
+void
+io_queue_wait_one(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* FIXME: Should we really pop here already? */
+ dlist_node *node = dclist_pop_head_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ pgaio_io_ref_wait(&tio->ior);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+void
+io_queue_reserve(IOQueue *ioq)
+{
+ if (ioq->has_reserved)
+ return;
+
+ if (dclist_is_empty(&ioq->idle))
+ io_queue_wait_one(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ ioq->has_reserved = true;
+}
+
+PgAioHandle *
+io_queue_get_io(IOQueue *ioq)
+{
+ PgAioHandle *ioh;
+
+ io_queue_reserve(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ if (!io_queue_is_empty(ioq))
+ {
+ ioh = pgaio_io_get_nb(CurrentResourceOwner, NULL);
+ if (ioh == NULL)
+ {
+ /*
+ * Need to wait for all IOs, blocking might not be legal in the
+ * context.
+ *
+ * XXX: This doesn't make a whole lot of sense, we're also
+ * blocking here. What was I smoking when I wrote the above?
+ */
+ io_queue_wait_all(ioq);
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+ }
+ else
+ {
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+
+ return ioh;
+}
+
+void
+io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior)
+{
+ dlist_node *node;
+ TrackedIO *tio;
+
+ Assert(ioq->has_reserved);
+ ioq->has_reserved = false;
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ node = dclist_pop_head_node(&ioq->idle);
+ tio = dclist_container(TrackedIO, node, node);
+
+ tio->ior = *ior;
+
+ dclist_push_tail(&ioq->in_progress, &tio->node);
+
+ ioq->unsubmitted++;
+
+ /*
+ * XXX: Should have some smarter logic here. We don't want to wait too
+ * long to submit, that'll mean we're more likely to block. But we also
+ * don't want to have the overhead of submitting every IO individually.
+ */
+ if (ioq->unsubmitted >= 4)
+ {
+ pgaio_submit_staged();
+ ioq->unsubmitted = 0;
+ }
+}
+
+void
+io_queue_wait_all(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* wait for the last IO to minimize unnecessary wakeups */
+ dlist_node *node = dclist_tail_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ if (!pgaio_io_ref_check_done(&tio->ior))
+ {
+ ereport(DEBUG3,
+ errmsg("io_queue_wait_all for io:%d",
+ pgaio_io_ref_get_id(&tio->ior)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ pgaio_io_ref_wait(&tio->ior);
+ }
+
+ dclist_delete_from(&ioq->in_progress, &tio->node);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+bool
+io_queue_is_empty(IOQueue *ioq)
+{
+ return dclist_is_empty(&ioq->in_progress);
+}
+
+void
+io_queue_free(IOQueue *ioq)
+{
+ io_queue_wait_all(ioq);
+
+ pfree(ioq);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8960223194a..6d64c75a49c 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'aio_io.c',
'aio_init.c',
'aio_subject.c',
+ 'io_queue.c',
'method_io_uring.c',
'method_worker.c',
'read_stream.c',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index be8be9fbff0..6f39abcdf3c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1171,6 +1171,7 @@ IOContext
IOFuncSelector
IOObject
IOOp
+IOQueue
IO_STATUS_BLOCK
IPCompareMethod
ITEM
@@ -2959,6 +2960,7 @@ TocEntry
TokenAuxData
TokenizedAuthLine
TrackItem
+TrackedIO
TransApplyAction
TransInvalidationInfo
TransState
--
2.45.2.827.g557ae147e6
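One detail of the IO queue worth calling out: `io_queue_track()` doesn't submit each IO individually. Staged IOs accumulate and are flushed to the kernel in small batches - hardcoded to 4 in the patch, which the XXX comment itself flags as needing smarter logic. A minimal stand-in for that policy, with `submit_calls` counting what would be `pgaio_submit_staged()` invocations (names are mine):

```c
#include <assert.h>

/*
 * Sketch of the batching policy in io_queue_track(): count tracked-but-
 * unsubmitted IOs and flush them as a batch once batch_size accumulate,
 * trading per-IO submission overhead against the risk of blocking on an
 * IO that hasn't been handed to the kernel yet.
 */
typedef struct
{
	int			unsubmitted;
	int			submit_calls;	/* pgaio_submit_staged() in the patch */
	int			batch_size;		/* 4 in the patch */
} SketchQueue;

static void
sketch_track(SketchQueue *q)
{
	q->unsubmitted++;
	if (q->unsubmitted >= q->batch_size)
	{
		q->submit_calls++;
		q->unsubmitted = 0;
	}
}
```

Callers that drain the queue (`io_queue_wait_all()`) would still need a final flush for any partial batch; the patch relies on waiting on individual IO refs for that.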
Attachment: v2.0-0015-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (text/x-diff; charset=us-ascii)
From f7ad1fbd6a37434b67cb50916a5c28255d3a14eb Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 23 Jul 2024 10:01:23 -0700
Subject: [PATCH v2.0 15/17] bufmgr: use AIO in checkpointer, bgwriter
This is far from ready - just included to be able to exercise AIO writes and
get some preliminary numbers. In all likelihood this will be based on top of
work by Thomas Munro rather than the preceding commit.
---
src/include/postmaster/bgwriter.h | 3 +-
src/include/storage/buf_internals.h | 1 +
src/include/storage/bufmgr.h | 3 +-
src/include/storage/bufpage.h | 1 +
src/backend/postmaster/bgwriter.c | 25 +-
src/backend/postmaster/checkpointer.c | 12 +-
src/backend/storage/buffer/bufmgr.c | 588 +++++++++++++++++++++++---
src/backend/storage/page/bufpage.c | 10 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 580 insertions(+), 64 deletions(-)
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 407f26e5302..01a936fbc0a 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -31,7 +31,8 @@ extern void BackgroundWriterMain(char *startup_data, size_t startup_data_len) pg
extern void CheckpointerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+struct IOQueue;
+extern void CheckpointWriteDelay(struct IOQueue *ioq, int flags, double progress);
extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 5cfa7dbd1f1..9d3123663b3 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -21,6 +21,7 @@
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
+#include "storage/io_queue.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ac6496bb1eb..a65888c8915 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -325,7 +325,8 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+struct IOQueue;
+extern bool BgBufferSync(struct IOQueue *ioq, struct WritebackContext *wb_context);
extern void LimitAdditionalPins(uint32 *additional_pins);
extern void LimitAdditionalLocalPins(uint32 *additional_pins);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 5999e5ca5a5..f5f5adb066d 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -509,5 +509,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
Item newtup, Size newsize);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern bool PageNeedsChecksumCopy(Page page);
#endif /* BUFPAGE_H */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0f75548759a..71c08da45db 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -38,10 +38,12 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
@@ -89,6 +91,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
bool prev_hibernate;
+ IOQueue *ioq;
WritebackContext wb_context;
Assert(startup_data_len == 0);
@@ -130,6 +133,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
ALLOCSET_DEFAULT_SIZES);
MemoryContextSwitchTo(bgwriter_context);
+ ioq = io_queue_create(128, 0);
WritebackContextInit(&wb_context, &bgwriter_flush_after);
/*
@@ -167,6 +171,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
* about in bgwriter, but we do have LWLocks, buffers, and temp files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
UnlockBuffers();
ReleaseAuxProcessResources(false);
@@ -226,12 +231,27 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
+ /*
+ * XXX: Before exiting, wait for all IO to finish. That's only
+ * important to avoid spurious PrintBufferLeakWarning() /
+ * PrintAioIPLeakWarning() calls, triggered by
+ * ReleaseAuxProcessResources() being called with isCommit=true.
+ *
+ * FIXME: this is theoretically racy, but I didn't want to copy
+ * HandleMainLoopInterrupts()'s remaining body here.
+ */
+ if (ShutdownRequestPending)
+ {
+ io_queue_wait_all(ioq);
+ io_queue_free(ioq);
+ }
+
HandleMainLoopInterrupts();
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync(&wb_context);
+ can_hibernate = BgBufferSync(ioq, &wb_context);
/* Report pending statistics to the cumulative stats system */
pgstat_report_bgwriter();
@@ -248,6 +268,9 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
smgrdestroyall();
}
+ /* finish IO before sleeping, to avoid blocking other backends */
+ io_queue_wait_all(ioq);
+
/*
* Log a new xl_running_xacts every now and then so replication can
* get into a consistent state faster (think of suboverflowed
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 199f008bcda..0350a71cab4 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,9 +46,11 @@
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
@@ -266,6 +268,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
UnlockBuffers();
@@ -708,7 +711,7 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(IOQueue *ioq, int flags, double progress)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -741,6 +744,13 @@ CheckpointWriteDelay(int flags, double progress)
/* Report interim statistics to the cumulative stats system */
pgstat_report_checkpointer();
+ /*
+ * Ensure all pending IO is submitted to avoid unnecessary delays for
+ * other processes.
+ */
+ io_queue_wait_all(ioq);
+
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 59f4b22457d..e62f2de2034 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -52,6 +52,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -77,6 +78,7 @@
/* Bits in SyncOneBuffer's return value */
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+#define BUF_CANT_MERGE 0x04
#define RELS_BSEARCH_THRESHOLD 20
@@ -538,8 +540,6 @@ static void UnpinBuffer(BufferDesc *buf);
static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used,
- WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
@@ -557,6 +557,7 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
+
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -2981,6 +2982,56 @@ UnpinBufferNoOwner(BufferDesc *buf)
}
}
+typedef struct BuffersToWrite
+{
+ int nbuffers;
+ BufferTag start_at_tag;
+ uint32 max_combine;
+
+ XLogRecPtr max_lsn;
+
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+
+ uint64 total_writes;
+
+ Buffer buffers[IOV_MAX];
+ PgAioBounceBuffer *bounce_buffers[IOV_MAX];
+ const void *data_ptrs[IOV_MAX];
+} BuffersToWrite;
+
+static int PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void
+BuffersToWriteInit(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ to_write->total_writes = 0;
+ to_write->nbuffers = 0;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
+ to_write->max_lsn = InvalidXLogRecPtr;
+}
+
+static void
+BuffersToWriteEnd(BuffersToWrite *to_write)
+{
+ if (to_write->ioh != NULL)
+ {
+ pgaio_io_release(to_write->ioh);
+ to_write->ioh = NULL;
+ }
+
+ if (to_write->total_writes > 0)
+ pgaio_submit_staged();
+}
+
+
#define ST_SORT sort_checkpoint_bufferids
#define ST_ELEMENT_TYPE CkptSortItem
#define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b)
@@ -3012,7 +3063,10 @@ BufferSync(int flags)
binaryheap *ts_heap;
int i;
int mask = BM_DIRTY;
+ IOQueue *ioq;
WritebackContext wb_context;
+ BuffersToWrite to_write;
+ int max_combine;
/*
* Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3074,7 +3128,9 @@ BufferSync(int flags)
if (num_to_scan == 0)
return; /* nothing to do */
+ ioq = io_queue_create(512, 0);
WritebackContextInit(&wb_context, &checkpoint_flush_after);
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
@@ -3182,48 +3238,89 @@ BufferSync(int flags)
*/
num_processed = 0;
num_written = 0;
+
+ BuffersToWriteInit(&to_write, ioq, &wb_context);
+
while (!binaryheap_empty(ts_heap))
{
BufferDesc *bufHdr = NULL;
CkptTsStatus *ts_stat = (CkptTsStatus *)
DatumGetPointer(binaryheap_first(ts_heap));
- buf_id = CkptBufferIds[ts_stat->index].buf_id;
- Assert(buf_id != -1);
-
- bufHdr = GetBufferDescriptor(buf_id);
-
- num_processed++;
+ Assert(ts_stat->num_scanned <= ts_stat->num_to_scan);
/*
- * We don't need to acquire the lock here, because we're only looking
- * at a single bit. It's possible that someone else writes the buffer
- * and clears the flag right after we check, but that doesn't matter
- * since SyncOneBuffer will then do nothing. However, there is a
- * further race condition: it's conceivable that between the time we
- * examine the bit here and the time SyncOneBuffer acquires the lock,
- * someone else not only wrote the buffer but replaced it with another
- * page and dirtied it. In that improbable case, SyncOneBuffer will
- * write the buffer though we didn't need to. It doesn't seem worth
- * guarding against this, though.
+ * Collect a batch of buffers to write out from the current
+ * tablespace. That causes some imbalance between the tablespaces, but
+ * that's more than outweighed by the efficiency gain due to batching.
*/
- if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+ while (to_write.nbuffers < max_combine &&
+ ts_stat->num_scanned < ts_stat->num_to_scan)
{
- if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+ buf_id = CkptBufferIds[ts_stat->index].buf_id;
+ Assert(buf_id != -1);
+
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ num_processed++;
+
+ /*
+ * We don't need to acquire the lock here, because we're only
+ * looking at a single bit. It's possible that someone else writes
+ * the buffer and clears the flag right after we check, but that
+ * doesn't matter since PrepareToWriteBuffer will then do nothing.
+ * However, there is a further race condition: it's conceivable
+ * that between the time we examine the bit here and the time
+ * PrepareToWriteBuffer acquires the lock, someone else not only wrote
+ * the buffer but replaced it with another page and dirtied it. In
+ * that improbable case, PrepareToWriteBuffer will write the buffer
+ * though we didn't need to. It doesn't seem worth guarding
+ * against this, though.
+ */
+ if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
{
- TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.buffers_written++;
- num_written++;
+ int result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+
+ if (result & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+ WriteBuffers(&to_write, ioq, &wb_context);
+
+ result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+ Assert(result != BUF_CANT_MERGE);
+ }
+
+ if (result & BUF_WRITTEN)
+ {
+ TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
+ PendingCheckpointerStats.buffers_written++;
+ num_written++;
+ }
+ else
+ {
+ break;
+ }
}
+ else
+ {
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+ }
+
+ /*
+ * Measure progress independent of actually having to flush the
+ * buffer - otherwise writing becomes unbalanced.
+ */
+ ts_stat->progress += ts_stat->progress_slice;
+ ts_stat->num_scanned++;
+ ts_stat->index++;
}
- /*
- * Measure progress independent of actually having to flush the buffer
- * - otherwise writing become unbalanced.
- */
- ts_stat->progress += ts_stat->progress_slice;
- ts_stat->num_scanned++;
- ts_stat->index++;
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+
/* Have all the buffers from the tablespace been processed? */
if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -3241,15 +3338,23 @@ BufferSync(int flags)
*
* (This will check for barrier events even if it doesn't sleep.)
*/
- CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
+ CheckpointWriteDelay(ioq, flags, (double) num_processed / num_to_scan);
}
+ Assert(to_write.nbuffers == 0);
+ io_queue_wait_all(ioq);
+
/*
* Issue all pending flushes. Only checkpointer calls BufferSync(), so
* IOContext will always be IOCONTEXT_NORMAL.
*/
IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+ io_queue_wait_all(ioq); /* IssuePendingWritebacks might have added
+ * more */
+ io_queue_free(ioq);
+ BuffersToWriteEnd(&to_write);
+
pfree(per_ts_stat);
per_ts_stat = NULL;
binaryheap_free(ts_heap);
@@ -3275,7 +3380,7 @@ BufferSync(int flags)
* bgwriter_lru_maxpages to 0.)
*/
bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
{
/* info obtained from freelist.c */
int strategy_buf_id;
@@ -3318,6 +3423,8 @@ BgBufferSync(WritebackContext *wb_context)
long new_strategy_delta;
uint32 new_recent_alloc;
+ BuffersToWrite to_write;
+
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
@@ -3494,11 +3601,25 @@ BgBufferSync(WritebackContext *wb_context)
num_written = 0;
reusable_buffers = reusable_buffers_est;
+ BuffersToWriteInit(&to_write, ioq, wb_context);
+
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ int sync_state;
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ if (sync_state & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ Assert(sync_state != BUF_CANT_MERGE);
+ }
if (++next_to_clean >= NBuffers)
{
@@ -3509,6 +3630,13 @@ BgBufferSync(WritebackContext *wb_context)
if (sync_state & BUF_WRITTEN)
{
+ Assert(sync_state & BUF_REUSABLE);
+
+ if (to_write.nbuffers == io_combine_limit)
+ {
+ WriteBuffers(&to_write, ioq, wb_context);
+ }
+
reusable_buffers++;
if (++num_written >= bgwriter_lru_maxpages)
{
@@ -3520,6 +3648,11 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ BuffersToWriteEnd(&to_write);
+
PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
@@ -3558,8 +3691,66 @@ BgBufferSync(WritebackContext *wb_context)
return (bufs_to_lap == 0 && recent_alloc == 0);
}
+static inline bool
+BufferTagsSameRel(const BufferTag *tag1, const BufferTag *tag2)
+{
+ return (tag1->spcOid == tag2->spcOid) &&
+ (tag1->dbOid == tag2->dbOid) &&
+ (tag1->relNumber == tag2->relNumber) &&
+ (tag1->forkNum == tag2->forkNum)
+ ;
+}
+
+static bool
+CanMergeWrite(BuffersToWrite *to_write, BufferDesc *cur_buf_hdr)
+{
+ BlockNumber cur_block = cur_buf_hdr->tag.blockNum;
+
+ Assert(to_write->nbuffers > 0); /* can't merge with nothing */
+ Assert(to_write->start_at_tag.relNumber != InvalidOid);
+ Assert(to_write->start_at_tag.blockNum != InvalidBlockNumber);
+
+ Assert(to_write->ioh != NULL);
+
+ /*
+ * First check if the blocknumber is one that we could actually merge,
+ * that's cheaper than checking the tablespace/db/relnumber/fork match.
+ */
+ if (to_write->start_at_tag.blockNum + to_write->nbuffers != cur_block)
+ return false;
+
+ if (!BufferTagsSameRel(&to_write->start_at_tag, &cur_buf_hdr->tag))
+ return false;
+
+ /*
+ * Need to check with smgr how large a write we're allowed to make. To
+ * reduce the overhead of the smgr check, only inquire once, when
+ * processing the first to-be-merged buffer. That avoids the overhead in
+ * the common case of writing out buffers that are definitely not mergeable.
+ */
+ if (to_write->nbuffers == 1)
+ {
+ SMgrRelation smgr;
+
+ smgr = smgropen(BufTagGetRelFileLocator(&to_write->start_at_tag), INVALID_PROC_NUMBER);
+
+ to_write->max_combine = smgrmaxcombine(smgr,
+ to_write->start_at_tag.forkNum,
+ to_write->start_at_tag.blockNum);
+ }
+ else
+ {
+ Assert(to_write->max_combine > 0);
+ }
+
+ if (to_write->start_at_tag.blockNum + to_write->max_combine <= cur_block)
+ return false;
+
+ return true;
+}
+
/*
- * SyncOneBuffer -- process a single buffer during syncing.
+ * PrepareToWriteBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
* buffers marked recently used, as these are not replacement candidates.
@@ -3568,22 +3759,56 @@ BgBufferSync(WritebackContext *wb_context)
* BUF_WRITTEN: we wrote the buffer.
* BUF_REUSABLE: buffer is available for replacement, ie, it has
* pin count 0 and usage count 0.
+ * BUF_CANT_MERGE: can't combine this write with prior writes, caller needs
+ * to issue those first
*
* (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
* after locking it, but we don't care all that much.)
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context)
{
- BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
- int result = 0;
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(buf - 1);
uint32 buf_state;
- BufferTag tag;
+ int result = 0;
+ XLogRecPtr cur_buf_lsn;
+ LWLock *content_lock;
+ bool may_block;
+
+ /*
+ * Check if this buffer can be written out together with already prepared
+ * writes. We check before we have pinned the buffer, so the buffer can be
+ * written out and replaced between this check and us pinning the buffer -
+ * we'll recheck below. The reason for the pre-check is that we don't want
+ * to pin the buffer just to find out that we can't merge the IO.
+ */
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ result |= BUF_CANT_MERGE;
+ return result;
+ }
+ }
+ else
+ {
+ if (to_write->ioh == NULL)
+ {
+ to_write->ioh = io_queue_get_io(ioq);
+ pgaio_io_get_ref(to_write->ioh, &to_write->ior);
+ }
+
+ to_write->start_at_tag = cur_buf_hdr->tag;
+ }
/* Make sure we can handle the pin */
ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
+ /* XXX: Should also check if we are allowed to pin one more buffer */
+
/*
* Check whether buffer needs writing.
*
@@ -3593,7 +3818,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* don't worry because our checkpoint.redo points before log record for
* upcoming changes and so we are not required to write such dirty buffer.
*/
- buf_state = LockBufHdr(bufHdr);
+ buf_state = LockBufHdr(cur_buf_hdr);
if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
@@ -3602,40 +3827,282 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
}
else if (skip_recently_used)
{
+#if 0
+ elog(LOG, "at block %d: skip recent with nbuffers %d",
+ cur_buf_hdr->tag.blockNum, to_write->nbuffers);
+#endif
/* Caller told us not to write recently-used buffers */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
{
/* It's clean, so nothing to do */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
- /*
- * Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
- * buffer is clean by the time we've locked it.)
- */
- PinBuffer_Locked(bufHdr);
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-
- FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
-
- LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
-
- tag = bufHdr->tag;
-
- UnpinBuffer(bufHdr);
+ /* pin the buffer, from now on its identity can't change anymore */
+ PinBuffer_Locked(cur_buf_hdr);
/*
- * SyncOneBuffer() is only called by checkpointer and bgwriter, so
- * IOContext will always be IOCONTEXT_NORMAL.
+ * If we are merging, check if the buffer's identity possibly changed
+ * while we hadn't yet pinned it.
+ *
+ * XXX: It might be worth checking if we still want to write the buffer
+ * out, e.g. it could have been replaced with a buffer that doesn't have
+ * BM_CHECKPOINT_NEEDED set.
*/
- ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ elog(LOG, "changed identity");
+ UnpinBuffer(cur_buf_hdr);
- return result | BUF_WRITTEN;
+ result |= BUF_CANT_MERGE;
+
+ return result;
+ }
+ }
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ content_lock = BufferDescriptorGetContentLock(cur_buf_hdr);
+
+ if (!may_block)
+ {
+ if (LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /* done */
+ }
+ else if (to_write->nbuffers == 0)
+ {
+ /*
+ * Need to wait for all prior IO to finish before blocking for
+ * lock acquisition, to avoid the risk of a deadlock due to us
+ * waiting for another backend that is waiting for our unsubmitted
+ * IO to complete.
+ */
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ elog(DEBUG2, "at block %u: can't block, nbuffers = 0",
+ cur_buf_hdr->tag.blockNum
+ );
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ Assert(may_block);
+
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+ else
+ {
+ elog(DEBUG2, "at block %u: can't block nbuffers = %d",
+ cur_buf_hdr->tag.blockNum,
+ to_write->nbuffers);
+
+ UnpinBuffer(cur_buf_hdr);
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+
+ return result;
+ }
+ }
+ else
+ {
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+
+ if (!may_block)
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ may_block = io_queue_is_empty(ioq) && to_write->nbuffers == 0 && !pgaio_have_staged();
+
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ elog(DEBUG2, "at block %u: non-waitable StartBufferIO returns false, %d",
+ cur_buf_hdr->tag.blockNum,
+ may_block);
+
+ /*
+ * FIXME: can't tell whether this is because the buffer has
+ * been cleaned
+ */
+ if (!may_block)
+ {
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+ }
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ return result;
+ }
+ }
+ }
+ else
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, false))
+ {
+ elog(DEBUG2, "waitable StartBufferIO returns false");
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ /*
+ * FIXME: Historically we returned BUF_WRITTEN in this case, which
+ * seems wrong
+ */
+ return result;
+ }
+ }
+
+ /*
+ * Run PageGetLSN while holding header lock, since we don't have the
+ * buffer locked exclusively in all cases.
+ */
+ buf_state = LockBufHdr(cur_buf_hdr);
+
+ cur_buf_lsn = BufferGetLSN(cur_buf_hdr);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf_state &= ~BM_JUST_DIRTIED;
+
+ UnlockBufHdr(cur_buf_hdr, buf_state);
+
+ to_write->buffers[to_write->nbuffers] = buf;
+ to_write->nbuffers++;
+
+ if (buf_state & BM_PERMANENT &&
+ (to_write->max_lsn == InvalidXLogRecPtr || to_write->max_lsn < cur_buf_lsn))
+ {
+ to_write->max_lsn = cur_buf_lsn;
+ }
+
+ result |= BUF_WRITTEN;
+
+ return result;
+}
+
+static void
+WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ SMgrRelation smgr;
+ Buffer first_buf;
+ BufferDesc *first_buf_hdr;
+ bool needs_checksum;
+
+ Assert(to_write->nbuffers > 0 && to_write->nbuffers <= io_combine_limit);
+
+ first_buf = to_write->buffers[0];
+ first_buf_hdr = GetBufferDescriptor(first_buf - 1);
+
+ smgr = smgropen(BufTagGetRelFileLocator(&first_buf_hdr->tag), INVALID_PROC_NUMBER);
+
+ /*
+ * Force XLOG flush up to buffer's LSN. This implements the basic WAL
+ * rule that log updates must hit disk before any of the data-file changes
+ * they describe do.
+ *
+ * However, this rule does not apply to unlogged relations, which will be
+ * lost after a crash anyway. Most unlogged relation pages do not bear
+ * LSNs since we never emit WAL records for them, and therefore flushing
+ * up through the buffer LSN would be useless, but harmless. However,
+ * GiST indexes use LSNs internally to track page-splits, and therefore
+ * unlogged GiST pages bear "fake" LSNs generated by
+ * GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake
+ * LSN counter could advance past the WAL insertion point; and if it did
+ * happen, attempting to flush WAL through that location would fail, with
+ * disastrous system-wide consequences. To make sure that can't happen,
+ * skip the flush if the buffer isn't permanent.
+ */
+ if (to_write->max_lsn != InvalidXLogRecPtr)
+ XLogFlush(to_write->max_lsn);
+
+ /*
+ * Now it's safe to write buffer to disk. Note that no one else should
+ * have been able to write it while we were busy with log flushing because
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
+ */
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+ Block bufBlock;
+ char *bufToWrite;
+
+ bufBlock = BufHdrGetBlock(cur_buf_hdr);
+ needs_checksum = PageNeedsChecksumCopy((Page) bufBlock);
+
+ /*
+ * Update page checksum if desired. Since we have only shared lock on
+ * the buffer, other processes might be updating hint bits in it, so
+ * we must copy the page to a bounce buffer if we do checksumming.
+ */
+ if (needs_checksum)
+ {
+ PgAioBounceBuffer *bb = pgaio_bounce_buffer_get();
+
+ pgaio_io_assoc_bounce_buffer(to_write->ioh, bb);
+
+ bufToWrite = pgaio_bounce_buffer_buffer(bb);
+ memcpy(bufToWrite, bufBlock, BLCKSZ);
+ PageSetChecksumInplace((Page) bufToWrite, cur_buf_hdr->tag.blockNum);
+ }
+ else
+ {
+ bufToWrite = bufBlock;
+ }
+
+ to_write->data_ptrs[nbuf] = bufToWrite;
+ }
+
+ pgaio_io_set_io_data_32(to_write->ioh,
+ (uint32 *) to_write->buffers,
+ to_write->nbuffers);
+ pgaio_io_add_shared_cb(to_write->ioh, ASC_SHARED_BUFFER_WRITE);
+
+ smgrstartwritev(to_write->ioh, smgr,
+ BufTagGetForkNum(&first_buf_hdr->tag),
+ first_buf_hdr->tag.blockNum,
+ to_write->data_ptrs,
+ to_write->nbuffers,
+ false);
+ pgstat_count_io_op_n(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+ IOOP_WRITE, to_write->nbuffers);
+
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+
+ UnpinBuffer(cur_buf_hdr);
+ }
+
+ io_queue_track(ioq, &to_write->ior);
+ to_write->total_writes++;
+
+ /* clear state for next write */
+ to_write->nbuffers = 0;
+ to_write->start_at_tag.relNumber = InvalidOid;
+ to_write->start_at_tag.blockNum = InvalidBlockNumber;
+ to_write->max_combine = 0;
+ to_write->max_lsn = InvalidXLogRecPtr;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
}
/*
@@ -4001,6 +4468,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
error_context_stack = errcallback.previous;
}
+
/*
* RelationGetNumberOfBlocksInFork
* Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index be6f1f62d29..8295e3fb0a0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1491,6 +1491,16 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
return true;
}
+bool
+PageNeedsChecksumCopy(Page page)
+{
+ if (PageIsNew(page))
+ return false;
+
+ /* A bounce-buffer copy is only needed when checksums are enabled */
+ return DataChecksumsEnabled();
+}
+
/*
* Set checksum for a page in shared buffers.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6f39abcdf3c..ca6dd0bebf0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -344,6 +344,7 @@ BufferManagerRelation
BufferStrategyControl
BufferTag
BufferUsage
+BuffersToWrite
BuildAccumulator
BuiltinScript
BulkInsertState
--
2.45.2.827.g557ae147e6
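The core of the batching added to BufferSync()/BgBufferSync() in the patch above is the mergeability test in CanMergeWrite(): a pending write can only absorb the next buffer if it is the block immediately following the ones collected so far, belongs to the same relation fork, and stays within the combine limit obtained from smgrmaxcombine(). Here is a self-contained sketch of that check with simplified stand-in types; it is illustrative only and omits the lazy smgr inquiry and the pin/recheck dance of the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for PostgreSQL's BufferTag. */
typedef struct BufferTag
{
	uint32_t	spcOid;
	uint32_t	dbOid;
	uint32_t	relNumber;
	int			forkNum;
	uint32_t	blockNum;
} BufferTag;

/*
 * A pending combined write: it starts at start_tag and currently covers
 * nbuffers consecutive blocks. max_combine is the largest write smgr allows
 * starting at start_tag (via smgrmaxcombine() in the patch).
 */
typedef struct PendingWrite
{
	BufferTag	start_tag;
	int			nbuffers;
	int			max_combine;
} PendingWrite;

static bool
buffer_tags_same_rel(const BufferTag *tag1, const BufferTag *tag2)
{
	return tag1->spcOid == tag2->spcOid &&
		tag1->dbOid == tag2->dbOid &&
		tag1->relNumber == tag2->relNumber &&
		tag1->forkNum == tag2->forkNum;
}

/*
 * Sketch of CanMergeWrite(): the cheap block-number contiguity check runs
 * first, since most candidate buffers are not mergeable; the rel/fork match
 * and the combine limit are checked afterwards.
 */
static bool
can_merge_write(const PendingWrite *w, const BufferTag *next)
{
	if (w->start_tag.blockNum + w->nbuffers != next->blockNum)
		return false;

	if (!buffer_tags_same_rel(&w->start_tag, next))
		return false;

	if (w->start_tag.blockNum + w->max_combine <= next->blockNum)
		return false;

	return true;
}
```

When this check fails, the caller flushes the accumulated batch with WriteBuffers() and retries the same buffer as the start of a fresh batch - which is why BUF_CANT_MERGE can only be returned when at least one buffer is already queued.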
Attachment: v2.0-0016-very-wip-test_aio-module.patch (text/x-diff)
From c3a8731578a7fc1b03609e5bdb800e4fc18db80e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:13:48 -0400
Subject: [PATCH v2.0 16/17] very-wip: test_aio module
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio_internal.h | 10 +
src/include/storage/buf_internals.h | 4 +
src/backend/storage/aio/aio.c | 38 ++
src/backend/storage/buffer/bufmgr.c | 3 +-
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_aio/.gitignore | 6 +
src/test/modules/test_aio/Makefile | 34 ++
src/test/modules/test_aio/expected/inject.out | 180 +++++++
src/test/modules/test_aio/expected/io.out | 40 ++
.../modules/test_aio/expected/ownership.out | 148 ++++++
src/test/modules/test_aio/expected/prep.out | 17 +
src/test/modules/test_aio/io_uring.conf | 5 +
src/test/modules/test_aio/meson.build | 65 +++
src/test/modules/test_aio/sql/inject.sql | 51 ++
src/test/modules/test_aio/sql/io.sql | 16 +
src/test/modules/test_aio/sql/ownership.sql | 65 +++
src/test/modules/test_aio/sql/prep.sql | 9 +
src/test/modules/test_aio/test_aio--1.0.sql | 94 ++++
src/test/modules/test_aio/test_aio.c | 479 ++++++++++++++++++
src/test/modules/test_aio/test_aio.control | 3 +
src/test/modules/test_aio/worker.conf | 5 +
22 files changed, 1272 insertions(+), 2 deletions(-)
create mode 100644 src/test/modules/test_aio/.gitignore
create mode 100644 src/test/modules/test_aio/Makefile
create mode 100644 src/test/modules/test_aio/expected/inject.out
create mode 100644 src/test/modules/test_aio/expected/io.out
create mode 100644 src/test/modules/test_aio/expected/ownership.out
create mode 100644 src/test/modules/test_aio/expected/prep.out
create mode 100644 src/test/modules/test_aio/io_uring.conf
create mode 100644 src/test/modules/test_aio/meson.build
create mode 100644 src/test/modules/test_aio/sql/inject.sql
create mode 100644 src/test/modules/test_aio/sql/io.sql
create mode 100644 src/test/modules/test_aio/sql/ownership.sql
create mode 100644 src/test/modules/test_aio/sql/prep.sql
create mode 100644 src/test/modules/test_aio/test_aio--1.0.sql
create mode 100644 src/test/modules/test_aio/test_aio.c
create mode 100644 src/test/modules/test_aio/test_aio.control
create mode 100644 src/test/modules/test_aio/worker.conf
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 67d994cc0b1..cd3063f6c11 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -259,6 +259,16 @@ extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+/* These functions are just for use in tests, from within injection points */
+#ifdef USE_INJECTION_POINTS
+
+extern PgAioHandle *pgaio_inj_io_get(void);
+
+#endif
+
+
+
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_worker_ops;
#ifdef USE_LIBURING
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9d3123663b3..1b3329a25b4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -423,6 +423,10 @@ extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_co
extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
IOContext io_context, BufferTag *tag);
+/* solely to make it easier to write tests */
+extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
+
+
/* freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index d6f9f658b97..9db661b1cd0 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -22,6 +22,9 @@
#include "utils/resowner.h"
#include "utils/wait_event_types.h"
+#ifdef USE_INJECTION_POINTS
+#include "utils/injection_point.h"
+#endif
static void pgaio_io_reclaim(PgAioHandle *ioh);
@@ -65,6 +68,11 @@ static const IoMethodOps *pgaio_ops_table[] = {
const IoMethodOps *pgaio_impl;
+#ifdef USE_INJECTION_POINTS
+static PgAioHandle *inj_cur_handle;
+#endif
+
+
/* --------------------------------------------------------------------------------
* "Core" IO Api
@@ -529,6 +537,19 @@ pgaio_io_process_completion(PgAioHandle *ioh, int result)
/* FIXME: should be done in separate function */
ioh->state = AHS_REAPED;
+#ifdef USE_INJECTION_POINTS
+ inj_cur_handle = ioh;
+
+ /*
+ * FIXME: This could be in a critical section - but it looks like we can't
+ * just InjectionPointLoad() at process start, as the injection point
+ * might not yet be defined.
+ */
+ InjectionPointCached("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+
+ inj_cur_handle = NULL;
+#endif
+
pgaio_io_process_completion_subject(ioh);
/* ensure results of completion are visible before the new state */
@@ -994,3 +1015,20 @@ assign_io_method(int newval, void *extra)
{
pgaio_impl = pgaio_ops_table[newval];
}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Injection point support
+ * --------------------------------------------------------------------------------
+ */
+
+#ifdef USE_INJECTION_POINTS
+
+PgAioHandle *
+pgaio_inj_io_get(void)
+{
+ return inj_cur_handle;
+}
+
+#endif
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e62f2de2034..f774b42651a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -541,7 +541,6 @@ static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner,
bool syncio);
@@ -6122,7 +6121,7 @@ WaitIO(BufferDesc *buf)
* find out if they can perform the I/O as part of a larger operation, without
* waiting for the answer or distinguishing the reasons why not.
*/
-static bool
+bool
StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 256799f520a..7df90602e90 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
libpq_pipeline \
plsample \
spgist_name_ops \
+ test_aio \
test_bloomfilter \
test_copy_callbacks \
test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index d8fe059d236..bc7d19e694f 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('test_aio')
subdir('brin')
subdir('commit_ts')
subdir('delay_execution')
diff --git a/src/test/modules/test_aio/.gitignore b/src/test/modules/test_aio/.gitignore
new file mode 100644
index 00000000000..b4903eba657
--- /dev/null
+++ b/src/test/modules/test_aio/.gitignore
@@ -0,0 +1,6 @@
+# Generated subdirectories
+/log/
+/results/
+/output_iso/
+/tmp_check/
+/tmp_check_iso/
diff --git a/src/test/modules/test_aio/Makefile b/src/test/modules/test_aio/Makefile
new file mode 100644
index 00000000000..ae6d685835b
--- /dev/null
+++ b/src/test/modules/test_aio/Makefile
@@ -0,0 +1,34 @@
+# src/test/modules/test_aio/Makefile
+
+PGFILEDESC = "test_aio - test code for AIO"
+
+MODULE_big = test_aio
+OBJS = \
+ $(WIN32RES) \
+ test_aio.o
+
+EXTENSION = test_aio
+DATA = test_aio--1.0.sql
+
+REGRESS = prep ownership io
+
+ifeq ($(enable_injection_points),yes)
+REGRESS += inject
+endif
+
+# FIXME: Unlike the meson build, this does not run the tests once with the
+# worker method and once - if supported - with io_uring.
+
+# requires custom config
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_aio/expected/inject.out b/src/test/modules/test_aio/expected/inject.out
new file mode 100644
index 00000000000..e52b0f086dd
--- /dev/null
+++ b/src/test/modules/test_aio/expected/inject.out
@@ -0,0 +1,180 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to a single block, should retry, but that's not
+-- implemented yet
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b;
+$$);
+NOTICE: wrapped error: could not read blocks 1..2 in file base/<redacted>: read only 8192 of 16384 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- FIXME: Should error
+-- FIXME: errno encoding?
+SELECT inj_io_short_read_attach(-5);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..3 in file base/<redacted>: Input/output error
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/io.out b/src/test/modules/test_aio/expected/io.out
new file mode 100644
index 00000000000..e46b582f290
--- /dev/null
+++ b/src/test/modules/test_aio/expected/io.out
@@ -0,0 +1,40 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT corrupt_rel_block('tbl_a', 1);
+ corrupt_rel_block
+-------------------
+
+(1 row)
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+ redact
+--------
+ t
+(1 row)
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/ownership.out b/src/test/modules/test_aio/expected/ownership.out
new file mode 100644
index 00000000000..97fdad6c629
--- /dev/null
+++ b/src/test/modules/test_aio/expected/ownership.out
@@ -0,0 +1,148 @@
+-----
+-- IO handles
+----
+-- leak warning: implicit xact
+SELECT handle_get();
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+ERROR: release in unexpected state
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+ handle_get
+------------
+
+
+(2 rows)
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- normal handle use
+SELECT handle_get_release();
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT handle_get_twice();
+ERROR: API violation: Only one IO can be handed out
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-----
+-- Bounce Buffers handles
+----
+-- leak warning: implicit xact
+SELECT bb_get();
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+ERROR: can only release handed out BB
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- normal bb use
+SELECT bb_get_release();
+ bb_get_release
+----------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT bb_get_twice();
+ERROR: can only hand out one BB
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/prep.out b/src/test/modules/test_aio/expected/prep.out
new file mode 100644
index 00000000000..7fad6280db5
--- /dev/null
+++ b/src/test/modules/test_aio/expected/prep.out
@@ -0,0 +1,17 @@
+CREATE EXTENSION test_aio;
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+ grow_rel
+----------
+
+(1 row)
+
+SELECT grow_rel('tbl_b', 550);
+ grow_rel
+----------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/io_uring.conf b/src/test/modules/test_aio/io_uring.conf
new file mode 100644
index 00000000000..efd7ad143ff
--- /dev/null
+++ b/src/test/modules/test_aio/io_uring.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'io_uring'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
new file mode 100644
index 00000000000..102c2e01537
--- /dev/null
+++ b/src/test/modules/test_aio/meson.build
@@ -0,0 +1,65 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_aio_sources = files(
+ 'test_aio.c',
+)
+
+if host_system == 'windows'
+ test_aio_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_aio',
+ '--FILEDESC', 'test_aio - test code for AIO',])
+endif
+
+test_aio = shared_module('test_aio',
+ test_aio_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_aio
+
+test_install_data += files(
+ 'test_aio.control',
+ 'test_aio--1.0.sql',
+)
+
+
+testfiles = [
+ 'prep',
+ 'ownership',
+ 'io',
+]
+
+if get_option('injection_points')
+ testfiles += 'inject'
+endif
+
+
+
+tests += {
+ 'name': 'test_aio_worker',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('worker.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+if liburing.found()
+ tests += {
+ 'name': 'test_aio_uring',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('io_uring.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ }
+ }
+endif
diff --git a/src/test/modules/test_aio/sql/inject.sql b/src/test/modules/test_aio/sql/inject.sql
new file mode 100644
index 00000000000..b3d34de8977
--- /dev/null
+++ b/src/test/modules/test_aio/sql/inject.sql
@@ -0,0 +1,51 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+SELECT inj_io_short_read_detach();
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+-- shorten multi-block read to a single block, should retry, but that's not
+-- implemented yet
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b;
+$$);
+SELECT inj_io_short_read_detach();
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_a', 0);
+SELECT invalidate_rel_block('tbl_a', 1);
+SELECT invalidate_rel_block('tbl_a', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- FIXME: Should error
+-- FIXME: errno encoding?
+SELECT inj_io_short_read_attach(-5);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
diff --git a/src/test/modules/test_aio/sql/io.sql b/src/test/modules/test_aio/sql/io.sql
new file mode 100644
index 00000000000..a29bb4eb15d
--- /dev/null
+++ b/src/test/modules/test_aio/sql/io.sql
@@ -0,0 +1,16 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+
+SELECT corrupt_rel_block('tbl_a', 1);
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
diff --git a/src/test/modules/test_aio/sql/ownership.sql b/src/test/modules/test_aio/sql/ownership.sql
new file mode 100644
index 00000000000..63cf40c802a
--- /dev/null
+++ b/src/test/modules/test_aio/sql/ownership.sql
@@ -0,0 +1,65 @@
+-----
+-- IO handles
+----
+
+-- leak warning: implicit xact
+SELECT handle_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+
+-- normal handle use
+SELECT handle_get_release();
+
+-- should error out, API violation
+SELECT handle_get_twice();
+
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+
+
+-----
+-- Bounce Buffers handles
+----
+
+-- leak warning: implicit xact
+SELECT bb_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+
+-- normal bb use
+SELECT bb_get_release();
+
+-- should error out, API violation
+SELECT bb_get_twice();
+
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
diff --git a/src/test/modules/test_aio/sql/prep.sql b/src/test/modules/test_aio/sql/prep.sql
new file mode 100644
index 00000000000..b8f225cbc98
--- /dev/null
+++ b/src/test/modules/test_aio/sql/prep.sql
@@ -0,0 +1,9 @@
+CREATE EXTENSION test_aio;
+
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+SELECT grow_rel('tbl_b', 550);
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
new file mode 100644
index 00000000000..ea9ad43ed8f
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -0,0 +1,94 @@
+/* src/test/modules/test_aio/test_aio--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_aio" to load this file. \quit
+
+
+CREATE FUNCTION grow_rel(rel regclass, nblocks int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION read_corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION bb_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION redact(p_sql text)
+RETURNS bool
+LANGUAGE plpgsql
+AS $$
+ DECLARE
+ err_state text;
+ err_msg text;
+ BEGIN
+ EXECUTE p_sql;
+ RETURN true;
+ EXCEPTION WHEN OTHERS THEN
+ GET STACKED DIAGNOSTICS
+ err_state = RETURNED_SQLSTATE,
+ err_msg = MESSAGE_TEXT;
+ err_msg = regexp_replace(err_msg, '(file|relation) "?base/[0-9]+/[0-9]+"?', '\1 base/<redacted>');
+ RAISE NOTICE 'wrapped error: %', err_msg
+ USING ERRCODE = err_state;
+ RETURN false;
+ END;
+$$;
+
+
+CREATE FUNCTION inj_io_short_read_attach(result int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_detach()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
new file mode 100644
index 00000000000..9626d495241
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.c
@@ -0,0 +1,479 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_aio.c
+ *		Test module for the asynchronous I/O (AIO) subsystem.
+ *
+ * Provides SQL-callable helpers used by the regression tests to exercise
+ * AIO: acquiring and releasing AIO handles and bounce buffers (to verify
+ * ownership tracking and leak warnings), corrupting and invalidating
+ * relation blocks, and issuing reads whose results can be changed from
+ * within injection points, e.g. to simulate short reads and I/O errors
+ * without depending on actual storage failures.
+ *
+ * Copyright (c) 2020-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/test/modules/test_aio/test_aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/lwlock.h"
+#include "storage/ipc.h"
+#include "access/relation.h"
+#include "utils/rel.h"
+#include "utils/injection_point.h"
+
+
+PG_MODULE_MAGIC;
+
+
+typedef struct InjIoErrorState
+{
+ bool enabled;
+ bool result_set;
+ int result;
+} InjIoErrorState;
+
+static InjIoErrorState *inj_io_error_state;
+
+/* Shared memory init callbacks */
+static shmem_request_hook_type prev_shmem_request_hook = NULL;
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+
+
+static PgAioHandle *last_handle;
+static PgAioBounceBuffer *last_bb;
+
+
+
+static void
+test_aio_shmem_request(void)
+{
+ if (prev_shmem_request_hook)
+ prev_shmem_request_hook();
+
+ RequestAddinShmemSpace(sizeof(InjIoErrorState));
+}
+
+static void
+test_aio_shmem_startup(void)
+{
+ bool found;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* Create or attach to the shared memory state */
+ LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+	inj_io_error_state = ShmemInitStruct("test_aio",
+ sizeof(InjIoErrorState),
+ &found);
+
+ if (!found)
+ {
+		/*
+		 * First time through, so initialize and attach the injection point
+		 * used to modify the result of IOs.
+		 */
+ inj_io_error_state->enabled = false;
+
+#ifdef USE_INJECTION_POINTS
+ InjectionPointAttach("AIO_PROCESS_COMPLETION_BEFORE_SHARED",
+ "test_aio",
+ "inj_io_short_read",
+ NULL,
+ 0);
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+#endif
+ }
+ else
+ {
+#ifdef USE_INJECTION_POINTS
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+ elog(LOG, "injection point loaded");
+#endif
+ }
+
+ LWLockRelease(AddinShmemInitLock);
+}
+
+void
+_PG_init(void)
+{
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+
+ /* Shared memory initialization */
+ prev_shmem_request_hook = shmem_request_hook;
+ shmem_request_hook = test_aio_shmem_request;
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = test_aio_shmem_startup;
+}
+
+
+PG_FUNCTION_INFO_V1(grow_rel);
+Datum
+grow_rel(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 nblocks = PG_GETARG_UINT32(1);
+ Relation rel;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ while (nblocks > 0)
+ {
+ uint32 extend_by_pages;
+
+ extend_by_pages = Min(nblocks, MAX_BUFFERS_TO_EXTEND_BY);
+
+ ExtendBufferedRelBy(BMR_REL(rel),
+ MAIN_FORKNUM,
+ NULL,
+ 0,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+
+ nblocks -= extend_by_pages;
+
+ for (uint32 i = 0; i < extend_by_pages; i++)
+ {
+ ReleaseBuffer(victim_buffers[i]);
+ }
+ }
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(corrupt_rel_block);
+Datum
+corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ MarkBufferDirty(buf);
+
+ PageInit(page, BufferGetPageSize(buf), 0);
+
+ ph = (PageHeader) page;
+ ph->pd_special = BLCKSZ + 1;
+
+ FlushOneBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(read_corrupt_rel_block);
+Datum
+read_corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ BufferDesc *buf_hdr;
+ Page page;
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+ SMgrRelation smgr;
+ uint32 buf_state;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* read buffer without erroring out */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ page = BufferGetBlock(buf);
+
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get_ref(ioh, &ior);
+
+ buf_hdr = GetBufferDescriptor(buf - 1);
+ smgr = RelationGetSmgr(rel);
+
+	/* FIXME: even if this is just a test, we should verify nobody else uses this buffer */
+ buf_state = LockBufHdr(buf_hdr);
+ buf_state &= ~(BM_VALID | BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ StartBufferIO(buf_hdr, true, false);
+
+ pgaio_io_set_io_data_32(ioh, (uint32 *) &buf, 1);
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
+
+ smgrstartreadv(ioh, smgr, MAIN_FORKNUM, block,
+ (void *) &page, 1);
+
+ ReleaseBuffer(buf);
+
+ pgaio_io_ref_wait(&ior);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(invalidate_rel_block);
+Datum
+invalidate_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ PrefetchBufferResult pr;
+ Buffer buf;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* this is a gross hack, but there's no good API exposed */
+ pr = PrefetchBuffer(rel, MAIN_FORKNUM, block);
+ buf = pr.recent_buffer;
+ elog(LOG, "recent: %d", buf);
+ if (BufferIsValid(buf))
+ {
+ /* if the buffer contents aren't valid, this'll return false */
+ if (ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, block, buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ FlushOneBuffer(buf);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+
+ if (!EvictUnpinnedBuffer(buf))
+ elog(ERROR, "couldn't unpin");
+ }
+ }
+
+ relation_close(rel, AccessExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+#if 0
+PG_FUNCTION_INFO_V1(test_unsubmitted_vs_close);
+Datum
+test_unsubmitted_vs_close(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+
+ MarkBufferDirty(buf);
+ ph->pd_special = BLCKSZ + 1;
+
+ /* last_handle = pgaio_io_get(); */
+
+ PG_RETURN_VOID();
+}
+#endif
+
+PG_FUNCTION_INFO_V1(handle_get);
+Datum
+handle_get(PG_FUNCTION_ARGS)
+{
+ last_handle = pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_release_last);
+Datum
+handle_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_handle)
+ elog(ERROR, "no handle");
+
+ pgaio_io_release(last_handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_and_error);
+Datum
+handle_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_twice);
+Datum
+handle_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(handle_get_release);
+Datum
+handle_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioHandle *handle;
+
+ handle = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_release(handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get);
+Datum
+bb_get(PG_FUNCTION_ARGS)
+{
+ last_bb = pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_release_last);
+Datum
+bb_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_bb)
+ elog(ERROR, "no bb");
+
+ pgaio_bounce_buffer_release(last_bb);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_and_error);
+Datum
+bb_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_twice);
+Datum
+bb_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(bb_get_release);
+Datum
+bb_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioBounceBuffer *bb;
+
+ bb = pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_release(bb);
+
+ PG_RETURN_VOID();
+}
+
+#ifdef USE_INJECTION_POINTS
+extern PGDLLEXPORT void inj_io_short_read(const char *name, const void *private_data);
+
+void
+inj_io_short_read(const char *name, const void *private_data)
+{
+ PgAioHandle *ioh;
+
+ elog(LOG, "short read called: %d", inj_io_error_state->enabled);
+
+ if (inj_io_error_state->enabled)
+ {
+ ioh = pgaio_inj_io_get();
+
+ if (inj_io_error_state->result_set)
+ {
+ elog(LOG, "short read, changing result from %d to %d",
+ ioh->result, inj_io_error_state->result);
+
+ ioh->result = inj_io_error_state->result;
+ }
+ }
+}
+#endif
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
+Datum
+inj_io_short_read_attach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = true;
+ inj_io_error_state->result_set = !PG_ARGISNULL(0);
+ if (inj_io_error_state->result_set)
+ inj_io_error_state->result = PG_GETARG_INT32(0);
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_detach);
+Datum
+inj_io_short_read_detach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = false;
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_aio/test_aio.control b/src/test/modules/test_aio/test_aio.control
new file mode 100644
index 00000000000..cd91c3ed16b
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.control
@@ -0,0 +1,3 @@
+comment = 'Test code for AIO'
+default_version = '1.0'
+module_pathname = '$libdir/test_aio'
diff --git a/src/test/modules/test_aio/worker.conf b/src/test/modules/test_aio/worker.conf
new file mode 100644
index 00000000000..8104c201924
--- /dev/null
+++ b/src/test/modules/test_aio/worker.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'worker'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
--
2.45.2.827.g557ae147e6
Attachment: v2.0-0017-Temporary-Increase-BAS_BULKREAD-size.patch (text/x-diff)
From 6ad40d5b074c4af85289e48b574c3461dbab9a4c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 1 Sep 2024 00:42:27 -0400
Subject: [PATCH v2.0 17/17] Temporary: Increase BAS_BULKREAD size
Without this we can only execute very little AIO for sequential scans, as
there just aren't enough buffers in the ring. This isn't the right fix, as
just increasing the ring size can have negative performance implications in
workloads where the kernel has all the data cached.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/freelist.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index dffdd57e9b5..5be8125ad3a 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -555,7 +555,11 @@ GetAccessStrategy(BufferAccessStrategyType btype)
return NULL;
case BAS_BULKREAD:
- ring_size_kb = 256;
+ /*
+ * FIXME: Temporary increase to allow large enough streaming reads
+ * to actually benefit from AIO. This needs a better solution.
+ */
+ ring_size_kb = 2 * 1024;
break;
case BAS_BULKWRITE:
ring_size_kb = 16 * 1024;
--
2.45.2.827.g557ae147e6
On 01/09/2024 09:27, Andres Freund wrote:
The main reason I had previously implemented WAL AIO etc was to know the
design implications - but now that they're somewhat understood, I'm planning
to keep the patchset much smaller, with the goal of making it upstreamable.
+1 on that approach.
To solve the issue with an unbounded number of AIO references there are a few
changes compared to the prior approach:

1) Only one AIO handle can be "handed out" to a backend without being
   defined. Previously the process of getting an AIO handle wasn't super
   lightweight, which made it appealing to cache AIO handles - which was one
   part of the problem for running out of AIO handles.

2) Nothing in a backend can force a "defined" AIO handle (i.e. one that is a
   valid operation) to stay around; it's always possible to execute the AIO
   operation and then reuse the handle. This provides a forward progress
   guarantee, by ensuring that completing AIOs can free up handles (previously
   they couldn't be reused until the backend-local reference was released).

3) Callbacks on AIOs are not allowed to error out anymore, unless it's ok to
   take the server down.

4) Obviously some code needs to know the result of an AIO operation and be
   able to error out. To allow for that, the issuer of an AIO can provide a
   pointer to local memory that'll receive the result of the AIO, including
   details about what kind of errors occurred (possible errors are e.g. a
   read failing or a buffer's checksum validation failing).

In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).
Yeah, a high-level README would be nice. Without that, it's hard to
follow what "handed out" and "defined" above means for example.
A few quick comments on the patches:
v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch
+1, this seems ready to be committed right away.
v2.0-0002-Allow-lwlocks-to-be-unowned.patch
With LOCK_DEBUG, LWLock->owner will point to the backend that acquired
the lock, but it doesn't own it anymore. That's reasonable, but maybe
add a boolean to the LWLock to mark whether the lock is currently owned
or not.
The LWLockReleaseOwnership() name is a bit confusing together with
LWLockReleaseUnowned() and LWLockRelease(). From the names, you might
think that they all release the lock, but LWLockReleaseOwnership() just
disassociates it from the current process. Rename it to LWLockDisown()
perhaps.
v2.0-0003-Use-aux-process-resource-owner-in-walsender.patch
+1. The old comment "We don't currently need any ResourceOwner in a
walsender process" was a bit misleading, because the walsender did
create the short-lived "base backup" resource owner, so it's nice to get
that fixed.
v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch
My refactoring around postmaster.c child process handling will conflict
with this [1]. Not in any fundamental way, but can I ask you to review
those patches, please? After those patches, AIO workers should also have
PMChild slots (formerly known as Backend structs).
[1]: /messages/by-id/a102f15f-eac4-4ff2-af02-f9ff209ec66f@iki.fi
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,
On 2024-09-02 13:03:07 +0300, Heikki Linnakangas wrote:
On 01/09/2024 09:27, Andres Freund wrote:
In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).

Yeah, a high-level README would be nice. Without that, it's hard to follow
what "handed out" and "defined" above means, for example.
Yea - I had actually written a bunch of that before, but then redesigns just
obsoleted most of it :(
FWIW, "handed out" is an IO handle acquired by code, which doesn't yet have an
operation associated with it. Once "defined" it actually could be - but isn't
yet - executed.
A few quick comments on the patches:
v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch
+1, this seems ready to be committed right away.
Cool
v2.0-0002-Allow-lwlocks-to-be-unowned.patch
With LOCK_DEBUG, LWLock->owner will point to the backend that acquired the
lock, but it doesn't own it anymore. That's reasonable, but maybe add a
boolean to the LWLock to mark whether the lock is currently owned or not.
Hm, not sure it's worth doing that...
The LWLockReleaseOwnership() name is a bit confusing together with
LWLockReleaseUnowned() and LWLockrelease(). From the names, you might think
that they all release the lock, but LWLockReleaseOwnership() just
disassociates it from the current process. Rename it to LWLockDisown()
perhaps.
Yea, that makes sense.
v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch
My refactoring around postmaster.c child process handling will conflict with
this [1]. Not in any fundamental way, but can I ask you to review those
patches, please? After those patches, AIO workers should also have PMChild
slots (formerly known as Backend structs).
I'll try to do that soonish!
Greetings,
Andres Freund
I hope there can be a high-level design document that includes a
description, high-level architecture, and low-level design.
This way, others can also participate in reviewing the code.
For example, which paths were modified in the AIO module? Is it the
path for writing WAL logs, or the path for flushing pages, etc.?
Also, I recommend keeping this patch as small as possible.
For example, the first step could be to introduce libaio only, without
considering io_uring, as that would make it too complex.
On Sun, 1 Sept 2024 at 18:28, Andres Freund <andres@anarazel.de> wrote:
0 workers 1 worker 2 workers 4 workers
master: 65.753 33.246 21.095 12.918
aio v2.0, worker: 21.519 12.636 10.450 10.004
aio v2.0, uring*: 31.446 17.745 12.889 10.395
aio v2.0, uring** 23.497 13.824 10.881 10.589
aio v2.0, direct, worker: 22.377 11.989 09.915 09.772
aio v2.0, direct, uring*: 24.502 12.603 10.058 09.759
I took this for a test drive on an AMD 3990x machine with a 1TB
Samsung 980 Pro SSD on PCIe 4. I only tried io_method = io_uring, but
I did try with and without direct IO.
This machine has 64GB RAM and I was using ClickBench Q2 [1], which is
"SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;"
(for some reason they use 0-based query IDs). This table is 64GBs
without indexes.
I'm seeing direct IO slower than buffered IO with smaller worker
counts. That's counter to what I would have expected, as I'd have
thought the memcpys from kernel space would be quite an overhead in
the buffered IO case. With larger worker counts the bottleneck is
certainly disk. The part that surprised me was that the bottleneck is
reached more quickly with buffered IO. I was seeing iotop going up to
5.54GB/s at higher worker counts.
times in milliseconds
workers buffered direct cmp
0 58880 102852 57%
1 33622 53538 63%
2 24573 40436 61%
4 18557 27359 68%
8 14844 17330 86%
16 12491 12754 98%
32 11802 11956 99%
64 11895 11941 100%
Is there some other information I can provide to help this make sense?
(Or maybe it does already to you.)
David
[1]: https://github.com/ClickHouse/ClickBench/blob/main/postgresql-tuned/queries.sql
Hi,
Attached is the next version of the patchset. Changes:
- added "sync" io method, the main benefit of that is that the main AIO commit
doesn't need to include worker mode
- split worker and io_uring methods into their own commits
- added src/backend/storage/aio/README.md, explaining design constraints and
the resulting design on a high level
- renamed LWLockReleaseOwnership as suggested by Heikki
- a bunch of small cleanups and improvements
There's plenty more to do, but I thought this would be a useful checkpoint.
Greetings,
Andres Freund
Attachments:
Attachment: v2.1-0012-aio-Add-README.md-explaining-higher-level-desig.patch (text/x-diff)
From 55448fdaa5e54983fdfd147ff1f28cf3867d58e5 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 6 Sep 2024 15:27:57 -0400
Subject: [PATCH v2.1 12/20] aio: Add README.md explaining higher level design
---
src/backend/storage/aio/README.md | 311 ++++++++++++++++++++++++++++++
1 file changed, 311 insertions(+)
create mode 100644 src/backend/storage/aio/README.md
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
new file mode 100644
index 00000000000..9c3a11f2063
--- /dev/null
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,311 @@
+# Asynchronous & Direct IO
+
+## Design Criteria & Motivation
+
+### Why Asynchronous IO
+
+Until the introduction of asynchronous IO, Postgres relied on the operating
+system to hide the cost of synchronous IO. While this worked surprisingly
+well in a lot of workloads, it does not do as good a job of prefetching and
+controlled writeback as we would like.
+
+There are important expensive operations like `fdatasync()` where the operating
+system cannot hide the storage latency. This is particularly important for WAL
+writes, where the ability to asynchronously issue `fdatasync()` or O_DSYNC
+writes can yield significantly higher throughput.
+
+
+### Why Direct / unbuffered IO
+
+The main reasons to use Direct IO are:
+
+- Lower CPU usage / higher throughput. Particularly on modern storage,
+ buffered IO is bottlenecked by the operating system having to copy data
+ between the kernel's page cache and postgres' buffer pool using the CPU.
+ Direct IO, in contrast, can often move the data directly between the
+ storage device and postgres' buffer pool using DMA. While that transfer is
+ ongoing, the CPU is free to perform other work.
+- Reduced latency - Direct IO can have substantially lower latency than
+ buffered IO, which can be impactful for OLTP workloads bottlenecked by WAL
+ write latency.
+- Avoiding double buffering between operating system cache and postgres'
+ shared_buffers.
+- Better control over the timing and pace of dirty data writeback.
+
+
+The main reasons *not* to use Direct IO are:
+
+- Without AIO, Direct IO is unusably slow for most purposes.
+- Even with AIO, many parts of postgres need to be modified to perform
+ explicit prefetching.
+- In situations where shared_buffers cannot be set appropriately large,
+ e.g. because there are many different postgres instances hosted on shared
+ hardware, performance will often be worse than when using buffered IO.
+
+
+### Deadlock and Starvation Dangers due to AIO
+
+Using AIO in a naive way can easily lead to deadlocks in an environment where
+the source/target of AIO are shared resources, like pages in postgres'
+shared_buffers.
+
+Consider one backend performing readahead on a table, initiating IO for a
+number of buffers ahead of the current "scan position". If that backend then
+performs some operation that blocks, or even just is slow, the IO completion
+for the asynchronously initiated read may not be processed.
+
+This AIO implementation solves this problem by requiring that AIO methods
+either allow AIO completions to be processed by any backend in the system
+(e.g. io_uring), or guarantee that AIO processing will happen even when the
+issuing backend is blocked (e.g. worker mode, which offloads completion
+processing to the AIO workers).
+
+
+### IO can be started in critical sections
+
+Using AIO for WAL writes can reduce the overhead of WAL logging substantially:
+
+- AIO allows WAL writes to be started eagerly, so they complete before they
+ need to be waited on
+- AIO allows multiple WAL flushes to be in progress at the same time
+- AIO makes it more realistic to use O\_DIRECT + O\_DSYNC, which can reduce
+ the number of roundtrips to storage on some OSs and storage HW (buffered IO
+ and direct IO without O_DSYNC need to issue a write and, after the write's
+ completion, a cache flush, whereas O\_DIRECT + O\_DSYNC can use a
+ single FUA write).
+
+The need to be able to execute IO in critical sections has substantial
+design implications for the AIO subsystem, mainly because completing IOs
+(see the prior section) needs to be possible within a critical section, even
+if the to-be-completed IO itself was not issued in one. Consider e.g. the
+case of a backend first starting a number of writes from shared buffers and
+then starting to flush the WAL. Because only a limited amount of IO can be
+in progress at the same time, initiating the IO for flushing the WAL may
+require first finishing IOs issued earlier.
+
+
+### State for AIO needs to live in shared memory
+
+Because postgres uses a process model and because AIOs need to be
+complete-able by any backend, much of the AIO subsystem's state needs to
+live in shared memory.
+
+In an `EXEC_BACKEND` build, a backend's executable code and other
+process-local state are not necessarily mapped to the same addresses in each
+process due to ASLR. This means that shared memory cannot contain pointers
+to callbacks.
+
+
+## Design of the AIO Subsystem
+
+
+### AIO Methods
+
+To achieve portability and performance, multiple methods of performing AIO are
+implemented and others are likely worth adding in the future.
+
+
+#### Synchronous Mode
+
+`io_method=sync` does not actually perform AIO, but allows the AIO API to be
+used while performing synchronous IO. This can be useful for debugging. The
+code for the synchronous mode is also used as a fallback, e.g. the
+[worker mode](#Worker) uses it to execute IO that cannot be executed by
+workers.
+
+
+#### Worker
+
+`io_method=worker` is available on every platform postgres runs on, and
+implements asynchronous IO - from the view of the issuing process - by
+dispatching the IO to one of several worker processes performing the IO in a
+synchronous manner.
+
+
+#### io_uring
+
+`io_method=io_uring` is available on Linux 5.1+. In contrast to worker mode it
+dispatches all IO from within the process, lowering context switch rate /
+latency.
+
+
+### AIO Handles
+
+The central pieces of the postgres AIO API are AIO handles. To execute an IO
+one first has to acquire an IO handle (`pgaio_io_get()`) and then "define"
+it, i.e. associate an IO operation with the handle.
+
+Often AIO handles are acquired on a higher level and then passed to a lower
+level to be fully defined. E.g., for IO to/from shared buffers, bufmgr.c
+routines acquire the handle, which is then passed through smgr.c and md.c,
+to be finally fully defined in fd.c.
+
+The functions used at the lowest level to define the operation are
+`pgaio_io_prep_*()`.
+
+Because acquisition of an IO handle
+[must always succeed](#IO-can-be-started-in-critical-sections)
+and the number of AIO Handles
+[has to be limited](#State-for-AIO-needs-to-live-in-shared-memory)
+AIO handles can be reused as soon as they have completed. Obviously code
+needs to be able to react to IO completion. Shared state can be updated
+using
+[AIO Completion callbacks](#AIO-Callbacks)
+and the issuing backend can provide a backend-local variable to receive the
+result of the IO, as described in
+[AIO Results](#AIO-Results).
+An IO can be waited for, by both the issuing and any other backend, using
+[AIO References](#AIO-References).
+
+
+Because an AIO Handle is not executable just after calling `pgaio_io_get()`
+and because `pgaio_io_get()` needs to be able to succeed, only a single AIO
+Handle may be acquired (i.e. returned by `pgaio_io_get()`) without having
+been defined (by, potentially indirectly, causing `pgaio_io_prep_*()` to
+have been called). Otherwise a backend could trivially self-deadlock by
+using up all AIO Handles without the ability to wait for some of the IOs to
+complete.
+
+If it turns out that an AIO Handle is not needed, e.g., because the handle was
+acquired before holding a contended lock, it can be released without being
+defined using `pgaio_io_release()`.
+
+
+### AIO Callbacks
+
+Commonly several layers need to react to completion of an IO. E.g. for a
+read, md.c needs to check if the IO outright failed or was shorter than
+needed, and bufmgr.c needs to verify that the page looks valid and to update
+the buffer's state in the BufferDesc.
+
+The fact that several layers / subsystems need to react to IO completion poses
+a few challenges:
+
+- Upper layers should not need to know details of lower layers. E.g. bufmgr.c
+ should not assume the IO will pass through md.c. Therefore upper levels
+ cannot know what lower layers would consider an error.
+
+- Lower layers should not need to know about upper layers. E.g. smgr APIs are
+ used going through shared buffers but are also used bypassing shared
+ buffers. This means that e.g. md.c is not in a position to validate
+ checksums.
+
+- Having code in the AIO subsystem for every possible combination of layers
+ would lead to a lot of duplication.
+
+The "solution" to this is the ability to associate multiple completion
+callbacks with a handle. E.g. bufmgr.c can have a callback to update the
+BufferDesc state and to verify the page, and md.c another callback to check
+if the IO operation was successful.
+
+As [mentioned](#State-for-AIO-needs-to-live-in-shared-memory), shared memory
+currently cannot contain function pointers. Because of that, completion
+callbacks are not directly identified by function pointers but by IDs
+(`PgAioHandleSharedCallbackID`). A substantial added benefit is that this
+allows callbacks to be identified by a much smaller amount of memory (a
+single byte currently).
+
+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
+IO to/from shared buffers, which is required to handle the case where the
+issuing backend errors out and releases its own pins.
+
+As [explained earlier](#IO-can-be-started-in-critical-sections), IO
+completions need to be safe to execute in critical sections. To allow the
+backend that issued the IO to error out in case of failure,
+[AIO Results](#AIO-Results) can be used.
+
+
+### AIO Subjects
+
+In addition to the completion callbacks described above, each AIO Handle has
+exactly one "subject". Each subject has some space inside an AIO Handle with
+information specific to the subject, and can provide callbacks to reopen the
+underlying file (required for worker mode) and to describe the IO operation
+(used for debug logging and error messages).
+
+
+### AIO References
+
+As [described above](#AIO-Handles), AIO Handles can be reused immediately
+after completion and therefore cannot be used to wait for completion of the
+IO. Waiting is enabled using AIO references, which do not just identify an
+AIO Handle but also include the handle's "generation".
+
+A reference to an AIO Handle can be acquired using `pgaio_io_get_ref()` and
+then waited upon using `pgaio_io_ref_wait()`.
+
+
+### AIO Results
+
+As AIO completion callbacks
+[are executed in critical sections](#IO-can-be-started-in-critical-sections)
+and [may be executed by any backend](#Deadlock-and-Starvation-Dangers-due-to-AIO)
+completion callbacks cannot be used to, e.g., make the query that triggered an
+IO ERROR out.
+
+To allow reacting to failing IOs, the issuing backend can pass a pointer to
+a `PgAioReturn` in backend-local memory. Before an AIO Handle is reused, the
+`PgAioReturn` is filled with information about the IO. This includes
+information about whether the IO was successful (as a value of
+`PgAioResultStatus`) and enough information to raise an error in case of a
+failure (via `pgaio_result_log()`, with the error details encoded in
+`PgAioResult`).
+
+XXX: "return" vs "result" vs "result status" seems quite confusing. The naming
+should be improved.
+
+
+### AIO Errors
+
+It would be very convenient to have shared completion callbacks encode the
+details of errors as an `ErrorData` that could be raised at a later
+time. Unfortunately doing so would require allocating memory. While elog.c can
+guarantee (well, kinda) that logging a message will not run out of memory,
+that only works because a very limited number of messages are in the process
+of being logged. With AIO a large number of concurrently issued AIOs might
+fail.
+
+To avoid the need for preallocating a potentially large amount of memory (in
+shared memory no less!), completion callbacks instead have to encode errors in
+a more compact format that can be converted into an error message.
+
+
+### AIO Bounce Buffers
+
+For some uses of AIO there is no convenient memory location to use as the
+source / destination of an AIO. E.g. when data checksums are enabled, writes
+from shared buffers currently cannot be done directly from shared buffers,
+as a shared buffer lock still allows some modification, e.g., for hint bits
+(see `FlushBuffer()`). If the write were done in place, such modifications
+could cause the checksum to fail.
+
+For synchronous IO this is solved by copying the buffer to separate memory
+before computing the checksum and using that copy as the source buffer for the
+AIO.
+
+However, for AIO that is not a workable solution:
+- Instead of a single buffer many buffers are required, as many IOs might be
+ in flight
+- When using the [worker method](#worker), the source/target of IO needs to be
+ in shared memory, otherwise the workers won't be able to access the memory.
+
+The AIO subsystem addresses this by providing a limited number of bounce
+buffers that can be used as the source / target for IO. A bounce buffer can
+be acquired with `pgaio_bounce_buffer_get()` and multiple bounce buffers can
+be associated with an AIO Handle with `pgaio_io_assoc_bounce_buffer()`.
+
+Bounce buffers are automatically released when the IO completes.
+
+
+## Helpers
+
+Using the low-level AIO API all over the tree would introduce too much
+complexity. Most uses of AIO should instead be done via reusable,
+higher-level helpers.
+
+
+### Read Stream
+
+A common and very beneficial use of AIO is reads where a substantial number
+of to-be-read locations are known ahead of time. E.g., for a sequential scan
+the set of blocks that need to be read can be determined solely by knowing the
+current position and checking the buffer mapping table.
+[Read stream](../../include/storage/read_stream.h)
+makes it comparatively easy to use AIO for such use cases.
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0013-aio-Implement-smgr-md.c-aio-methods.patch (text/x-diff)
From f138cbab018b104e416d23175a38141d8827232d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 22:33:30 -0400
Subject: [PATCH v2.1 13/20] aio: Implement smgr/md.c aio methods
---
src/include/storage/aio.h | 17 +-
src/include/storage/fd.h | 6 +
src/include/storage/md.h | 12 ++
src/include/storage/smgr.h | 21 +++
src/backend/storage/aio/aio_subject.c | 4 +
src/backend/storage/file/fd.c | 68 ++++++++
src/backend/storage/smgr/md.c | 217 ++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 91 +++++++++++
8 files changed, 434 insertions(+), 2 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index b8c743548c9..07bf92a6b7a 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -57,9 +57,10 @@ typedef enum PgAioSubjectID
{
/* intentionally the zero value, to help catch zeroed memory etc */
ASI_INVALID = 0,
+ ASI_SMGR,
} PgAioSubjectID;
-#define ASI_COUNT (ASI_INVALID + 1)
+#define ASI_COUNT (ASI_SMGR + 1)
/*
* Flags for an IO that can be set with pgaio_io_set_flag().
@@ -90,7 +91,8 @@ typedef enum PgAioHandleFlags
*/
typedef enum PgAioHandleSharedCallbackID
{
- ASC_PLACEHOLDER /* empty enums are invalid */ ,
+ ASC_MD_READV,
+ ASC_MD_WRITEV,
} PgAioHandleSharedCallbackID;
@@ -139,6 +141,17 @@ typedef union
typedef union PgAioSubjectData
{
+ struct
+ {
+ RelFileLocator rlocator; /* physical relation identifier */
+ BlockNumber blockNum; /* blknum relative to begin of reln */
+ int nblocks;
+ ForkNumber forkNum:8; /* don't waste 4 byte for four values */
+ bool is_temp; /* proc can be inferred by owning AIO */
+ bool release_lock;
+ int8 mode;
+ } smgr;
+
/* just as an example placeholder for later */
struct
{
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 1456ab383a4..e993e1b671f 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -101,6 +101,8 @@ extern PGDLLIMPORT int max_safe_fds;
* prototypes for functions in fd.c
*/
+struct PgAioHandle;
+
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -109,6 +111,10 @@ extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
+extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartWriteV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index b72293c79a5..ede77695853 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
#include "storage/smgr.h"
#include "storage/sync.h"
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_md_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_md_writev_cb;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
@@ -36,9 +40,16 @@ extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void mdstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks, bool skipFsync);
+extern void mdstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync);
extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -46,6 +57,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 899d0d681c5..66730bc24fa 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -73,6 +73,11 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
+struct PgAioHandle;
+struct PgAioSubjectInfo;
+
+extern const struct PgAioSubjectInfo aio_smgr_subject_info;
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -97,10 +102,19 @@ extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+extern void smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -109,6 +123,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
@@ -126,4 +141,10 @@ smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
}
+extern void pgaio_io_set_subject_smgr(struct PgAioHandle *ioh,
+ SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks);
+
#endif /* SMGR_H */
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 51ee3b3969d..14be8432f5a 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -20,6 +20,7 @@
#include "storage/aio_internal.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -28,9 +29,12 @@ static const PgAioSubjectInfo *aio_subject_info[] = {
[ASI_INVALID] = &(PgAioSubjectInfo) {
.name = "invalid",
},
+ [ASI_SMGR] = &aio_smgr_subject_info,
};
static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
+ [ASC_MD_READV] = &aio_md_readv_cb,
+ [ASC_MD_WRITEV] = &aio_md_writev_cb,
};
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index ec1505802b9..f5ff554f946 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -95,6 +95,7 @@
#include "pgstat.h"
#include "portability/mem.h"
#include "postmaster/startup.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
@@ -1295,6 +1296,8 @@ LruDelete(File file)
vfdP = &VfdCache[file];
+ pgaio_closing_fd(vfdP->fd);
+
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1988,6 +1991,8 @@ FileClose(File file)
if (!FileIsNotOpen(file))
{
+ pgaio_closing_fd(vfdP->fd);
+
/* close the file */
if (close(vfdP->fd) != 0)
{
@@ -2211,6 +2216,32 @@ retry:
return returnCode;
}
+int
+FileStartReadV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
ssize_t
FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
uint32 wait_event_info)
@@ -2316,6 +2347,34 @@ retry:
return returnCode;
}
+int
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ /* FIXME: think about / reimplement temp_file_limit */
+
+ pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
int
FileSync(File file, uint32 wait_event_info)
{
@@ -2499,6 +2558,12 @@ FilePathName(File file)
int
FileGetRawDesc(File file)
{
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
Assert(FileIsValid(file));
return VfdCache[file].fd;
}
@@ -2779,6 +2844,7 @@ FreeDesc(AllocateDesc *desc)
result = closedir(desc->desc.dir);
break;
case AllocateDescRawFD:
+ pgaio_closing_fd(desc->desc.fd);
result = close(desc->desc.fd);
break;
default:
@@ -2847,6 +2913,8 @@ CloseTransientFile(int fd)
/* Only get here if someone passes us a file not in allocatedDescs */
elog(WARNING, "fd passed to CloseTransientFile was not obtained from OpenTransientFile");
+ pgaio_closing_fd(fd);
+
return close(fd);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6cd81a61faa..f96308490d9 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,6 +31,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/md.h"
@@ -931,6 +932,49 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartreadv(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
+
+ FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+}
+
/*
* mdwritev() -- Write the supplied blocks at the appropriate location.
*
@@ -1036,6 +1080,49 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartwritev(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, unconstify(void **, buffers), nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_WRITEV);
+
+ FileStartWriteV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+}
+
/*
* mdwriteback() -- Tell the kernel to write pages back to storage.
@@ -1357,6 +1444,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
+int
+mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ MdfdVec *v = mdopenfork(reln, forknum, EXTENSION_FAIL);
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL);
+
+ *off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ return FileGetRawDesc(v->mdfd_vfd);
+}
+
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1832,3 +1934,118 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
*/
return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
}
+
+
+
+static PgAioResult md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static PgAioResult md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+const struct PgAioHandleSharedCallbacks aio_md_readv_cb = {
+ .complete = md_readv_complete,
+ .error = md_readv_error,
+};
+
+const struct PgAioHandleSharedCallbacks aio_md_writev_cb = {
+ .complete = md_writev_complete,
+};
+
+static PgAioResult
+md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioResult result = prior_result;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_readv_error(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks read a failure */
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ result.error_data = 0;
+
+ md_readv_error(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial reads should be retried at upper level */
+ result.id = ASC_MD_READV;
+ result.status = ARS_PARTIAL;
+ }
+
+ /* AFIXME: post-read portion of mdreadv() */
+
+ return result;
+}
+
+static void
+md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not read blocks %u..%u in file \"%s\": %m",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ }
+ else
+ {
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read blocks %u..%u in file \"%s\": read only %zu of %zu bytes",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks - 1,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum),
+ result.result * (size_t) BLCKSZ,
+ subject_data->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ MemoryContextSwitchTo(oldContext);
+}
+
+
+static PgAioResult
+md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ if (prior_result.status == ARS_ERROR)
+ {
+ /* AFIXME: complain */
+ return prior_result;
+ }
+
+ prior_result.result /= BLCKSZ;
+
+ return prior_result;
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ee31db85eec..2dacb361a4f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,6 +53,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
@@ -93,10 +94,19 @@ typedef struct f_smgr
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+ void (*smgr_startreadv) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
void (*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+ void (*smgr_startwritev) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -104,6 +114,7 @@ typedef struct f_smgr
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+ int (*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -121,12 +132,15 @@ static const f_smgr smgrsw[] = {
.smgr_prefetch = mdprefetch,
.smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
+ .smgr_startreadv = mdstartreadv,
.smgr_writev = mdwritev,
+ .smgr_startwritev = mdstartwritev,
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
.smgr_registersync = mdregistersync,
+ .smgr_fd = mdfd,
}
};
@@ -145,6 +159,14 @@ static void smgrshutdown(int code, Datum arg);
static void smgrdestroy(SMgrRelation reln);
+static void smgr_aio_reopen(PgAioHandle *ioh);
+
+const struct PgAioSubjectInfo aio_smgr_subject_info = {
+ .name = "smgr",
+ .reopen = smgr_aio_reopen,
+};
+
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
* managers.
@@ -620,6 +642,19 @@ smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nblocks);
}
+/*
+ * FILL ME IN
+ */
+void
+smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ smgrsw[reln->smgr_which].smgr_startreadv(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks);
+}
+
/*
* smgrwritev() -- Write the supplied buffers out.
*
@@ -651,6 +686,16 @@ smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
buffers, nblocks, skipFsync);
}
+void
+smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_startwritev(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks, skipFsync);
+}
+
/*
* smgrwriteback() -- Trigger kernel writeback for the supplied range of
* blocks.
@@ -807,6 +852,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+int
+smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ return smgrsw[reln->smgr_which].smgr_fd(reln, forknum, blocknum, off);
+}
+
/*
* AtEOXact_SMgr
*
@@ -835,3 +886,43 @@ ProcessBarrierSmgrRelease(void)
smgrreleaseall();
return true;
}
+
+void
+pgaio_io_set_subject_smgr(PgAioHandle *ioh,
+ struct SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+
+ pgaio_io_set_subject(ioh, ASI_SMGR);
+
+ /* backend is implied via IO owner */
+ sd->smgr.rlocator = smgr->smgr_rlocator.locator;
+ sd->smgr.forkNum = forknum;
+ sd->smgr.blockNum = blocknum;
+ sd->smgr.nblocks = nblocks;
+ sd->smgr.is_temp = SmgrIsTemp(smgr);
+ sd->smgr.release_lock = false;
+ sd->smgr.mode = RBM_NORMAL;
+}
+
+static void
+smgr_aio_reopen(PgAioHandle *ioh)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioOpData *od = pgaio_io_get_op_data(ioh);
+ SMgrRelation reln;
+ ProcNumber procno;
+ uint32 off;
+
+ if (sd->smgr.is_temp)
+ procno = pgaio_io_get_owner(ioh);
+ else
+ procno = INVALID_PROC_NUMBER;
+
+ reln = smgropen(sd->smgr.rlocator, procno);
+ od->read.fd = smgrfd(reln, sd->smgr.forkNum, sd->smgr.blockNum, &off);
+ Assert(off == od->read.offset);
+}
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0014-bufmgr-Implement-AIO-support.patch (text/x-diff; charset=us-ascii)
From 34a11207d325b445d15a12e2c63aff4b90a935d8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:01 -0400
Subject: [PATCH v2.1 14/20] bufmgr: Implement AIO support
As of this commit there are no users of these AIO facilities, that'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 6 +
src/include/storage/buf_internals.h | 6 +
src/include/storage/bufmgr.h | 10 +
src/backend/storage/aio/aio_subject.c | 5 +
src/backend/storage/buffer/buf_init.c | 3 +
src/backend/storage/buffer/bufmgr.c | 432 +++++++++++++++++++++++++-
src/backend/storage/buffer/localbuf.c | 65 ++++
7 files changed, 520 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 07bf92a6b7a..260c3701247 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -93,6 +93,12 @@ typedef enum PgAioHandleSharedCallbackID
{
ASC_MD_READV,
ASC_MD_WRITEV,
+
+ ASC_SHARED_BUFFER_READ,
+ ASC_SHARED_BUFFER_WRITE,
+
+ ASC_LOCAL_BUFFER_READ,
+ ASC_LOCAL_BUFFER_WRITE,
} PgAioHandleSharedCallbackID;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f190e6e5e46..5cfa7dbd1f1 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -17,6 +17,7 @@
#include "pgstat.h"
#include "port/atomics.h"
+#include "storage/aio_ref.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -252,6 +253,8 @@ typedef struct BufferDesc
int wait_backend_pgprocno; /* backend of pin-count waiter */
int freeNext; /* link in freelist chain */
+
+ PgAioHandleRef io_in_progress;
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
@@ -465,4 +468,7 @@ extern void DropRelationLocalBuffers(RelFileLocator rlocator,
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
extern void AtEOXact_LocalBuffers(bool isCommit);
+
+extern bool ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..6cd64b8c2b3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -177,6 +177,14 @@ extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
extern PGDLLIMPORT int32 *LocalRefCount;
+
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_read_cb;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_write_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_read_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_write_cb;
+
+
/* upper limit for effective_io_concurrency */
#define MAX_IO_CONCURRENCY 1000
@@ -194,6 +202,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+struct PgAioHandle;
+
extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 14be8432f5a..07c7989b273 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -35,6 +35,11 @@ static const PgAioSubjectInfo *aio_subject_info[] = {
static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
[ASC_MD_READV] = &aio_md_readv_cb,
[ASC_MD_WRITEV] = &aio_md_writev_cb,
+
+ [ASC_SHARED_BUFFER_READ] = &aio_shared_buffer_read_cb,
+ [ASC_SHARED_BUFFER_WRITE] = &aio_shared_buffer_write_cb,
+ [ASC_LOCAL_BUFFER_READ] = &aio_local_buffer_read_cb,
+ [ASC_LOCAL_BUFFER_WRITE] = &aio_local_buffer_write_cb,
};
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 09bec6449b6..059a80dfb13 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
@@ -126,6 +127,8 @@ BufferManagerShmemInit(void)
buf->buf_id = i;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Initially link all the buffers together as unused. Subsequent
* management of this list is done by freelist.c.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7e987836335..976ced82b6a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -58,6 +59,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -514,7 +516,8 @@ static int SyncOneBuffer(int buf_id, bool skip_recently_used,
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
- uint32 set_flag_bits, bool forget_owner);
+ uint32 set_flag_bits, bool forget_owner,
+ bool syncio);
static void AbortBufferIO(Buffer buffer);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -1081,7 +1084,7 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
}
else if (!isLocalBuf)
@@ -1566,7 +1569,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
/* Report I/Os as completing individually. */
@@ -2450,7 +2453,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
if (lock)
LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
- TerminateBufferIO(buf_hdr, false, BM_VALID, true);
+ TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
}
pgBufferUsage.shared_blks_written += extend_by;
@@ -3899,7 +3902,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
* end the BM_IO_IN_PROGRESS state.
*/
- TerminateBufferIO(buf, true, 0, true);
+ TerminateBufferIO(buf, true, 0, true, true);
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
buf->tag.blockNum,
@@ -5514,6 +5517,7 @@ WaitIO(BufferDesc *buf)
for (;;)
{
uint32 buf_state;
+ PgAioHandleRef ior;
/*
* It may not be necessary to acquire the spinlock to check the flag
@@ -5521,10 +5525,19 @@ WaitIO(BufferDesc *buf)
* play it safe.
*/
buf_state = LockBufHdr(buf);
+ ior = buf->io_in_progress;
UnlockBufHdr(buf, buf_state);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
+
+ if (pgaio_io_ref_valid(&ior))
+ {
+ pgaio_io_ref_wait(&ior);
+ ConditionVariablePrepareToSleep(cv);
+ continue;
+ }
+
ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
}
ConditionVariableCancelSleep();
@@ -5613,7 +5626,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
*/
static void
TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
- bool forget_owner)
+ bool forget_owner, bool syncio)
{
uint32 buf_state;
@@ -5625,6 +5638,13 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
+ if (!syncio)
+ {
+ /* release ownership by the AIO subsystem */
+ buf_state -= BUF_REFCOUNT_ONE;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+ }
+
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
@@ -5633,6 +5653,40 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
+
+ /*
+ * If we just released a pin, need to do BM_PIN_COUNT_WAITER handling.
+ * Most of the time the current backend will hold another pin preventing
+ * that from happening, but that's e.g. not the case when completing an IO
+ * another backend started.
+ *
+ * AFIXME: Deduplicate with UnpinBufferNoOwner() or just replace
+ * BM_PIN_COUNT_WAITER with something saner.
+ */
+ /* Support LockBufferForCleanup() */
+ if (buf_state & BM_PIN_COUNT_WAITER)
+ {
+ /*
+ * Acquire the buffer header lock, re-check that there's a waiter.
+ * Another backend could have unpinned this buffer, and already woken
+ * up the waiter. There's no danger of the buffer being replaced
+ * after we unpinned it above, as it's pinned by the waiter.
+ */
+ buf_state = LockBufHdr(buf);
+
+ if ((buf_state & BM_PIN_COUNT_WAITER) &&
+ BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* we just released the last pin other than the waiter's */
+ int wait_backend_pgprocno = buf->wait_backend_pgprocno;
+
+ buf_state &= ~BM_PIN_COUNT_WAITER;
+ UnlockBufHdr(buf, buf_state);
+ ProcSendSignal(wait_backend_pgprocno);
+ }
+ else
+ UnlockBufHdr(buf, buf_state);
+ }
}
/*
@@ -5684,7 +5738,7 @@ AbortBufferIO(Buffer buffer)
}
}
- TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false);
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
}
/*
@@ -6143,3 +6197,367 @@ EvictUnpinnedBuffer(Buffer buf)
return result;
}
+
+static bool
+ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *bufHdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ blockno = bufHdr->tag.blockNum;
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_VALID));
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(!(buf_state & BM_DIRTY));
+ }
+#endif
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);
+ BlockNumber forkNum = bufHdr->tag.forkNum;
+
+ /* AFIXME: relpathperm allocates memory */
+ MemoryContextSwitchTo(ErrorContext);
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ TerminateBufferIO(bufHdr, false,
+ failed ? BM_IO_ERROR : BM_VALID,
+ false, false);
+
+ /* Report I/Os as completing individually. */
+
+ /* FIXME: Should we do TRACE_POSTGRESQL_BUFFER_READ_DONE here? */
+ return buf_failed;
+}
+
+static uint64
+ReadBufferCompleteWriteShared(Buffer buffer, bool release_lock, bool failed)
+{
+ BufferDesc *bufHdr;
+ bool result = false;
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(buf_state & BM_DIRTY);
+ }
+#endif
+
+ /* AFIXME: implement track_io_timing */
+
+ TerminateBufferIO(bufHdr, /* clear_dirty = */ true,
+ failed ? BM_IO_ERROR : 0,
+ /* forget_owner = */ false,
+ /* syncio = */ false);
+
+ /*
+ * The initiator of IO is not managing the lock (i.e. called
+ * LWLockDisown()), we are.
+ */
+ if (release_lock)
+ LWLockReleaseUnowned(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+
+ return result;
+}
+
+static void
+shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+ BufferTag first PG_USED_FOR_ASSERTS_ONLY = {0};
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(buf - 1);
+
+ if (i == 0)
+ first = bufHdr->tag;
+ else
+ {
+ Assert(bufHdr->tag.relNumber == first.relNumber);
+ Assert(bufHdr->tag.blockNum == first.blockNum + i);
+ }
+
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(buf_state & BM_TAG_VALID);
+ if (is_write)
+ {
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_DIRTY);
+ }
+ else
+ Assert(!(buf_state & BM_VALID));
+
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
+
+ buf_state += BUF_REFCOUNT_ONE;
+ bufHdr->io_in_progress = io_ref;
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ if (is_write)
+ {
+ LWLock *content_lock;
+
+ content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+ Assert(LWLockHeldByMe(content_lock));
+
+ /*
+ * Lock is now owned by IO.
+ */
+ LWLockDisown(content_lock);
+ RESUME_INTERRUPTS();
+ }
+
+ /*
+ * Stop tracking this buffer via the resowner - the AIO system now
+ * keeps track.
+ */
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner, buf);
+ }
+}
+
+static void
+shared_buffer_read_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, false);
+}
+
+static void
+shared_buffer_write_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, true);
+}
+
+
+static PgAioResult
+shared_buffer_read_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ elog(DEBUG3, "calling rbcrs for buf %d with failed %d, error: %d, result: %d, data_off: %d",
+ buf, failed, prior_result.status, prior_result.result, io_data_off);
+
+ /*
+ * AFIXME: It'd probably be better to not set BM_IO_ERROR (which is
+ * what failed = true leads to) when it's just a short read...
+ */
+ buf_failed = ReadBufferCompleteReadShared(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_SHARED_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+shared_buffer_read_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ subject_data->smgr.blockNum + result.error_data,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ MemoryContextSwitchTo(oldContext);
+}
+
+static PgAioResult
+shared_buffer_write_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ /* FIXME: handle outright errors */
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+
+ /* FIXME: handle short writes / failures */
+ /* FIXME: ioh->scb_data.shared_buffer.release_lock */
+ ReadBufferCompleteWriteShared(buf,
+ true,
+ false);
+
+ }
+
+ return result;
+}
+
+static void
+local_buffer_read_prepare(PgAioHandle *ioh)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetLocalBufferDescriptor(-buf - 1);
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ bufHdr->io_in_progress = io_ref;
+ LocalRefCount[-buf - 1] += 1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
+static PgAioResult
+local_buffer_read_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ /* FIXME: error handling */
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+
+ buf_failed = ReadBufferCompleteReadLocal(buf,
+ mode,
+ false);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_LOCAL_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+local_buffer_write_prepare(PgAioHandle *ioh)
+{
+ elog(ERROR, "not yet");
+}
+
+
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_read_cb = {
+ .prepare = shared_buffer_read_prepare,
+ .complete = shared_buffer_read_complete,
+ .error = shared_buffer_read_error,
+};
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_write_cb = {
+ .prepare = shared_buffer_write_prepare,
+ .complete = shared_buffer_write_complete,
+};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_read_cb = {
+ .prepare = local_buffer_read_prepare,
+ .complete = local_buffer_read_complete,
+};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_write_cb = {
+ .prepare = local_buffer_write_prepare,
+};
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8da7dd6c98a..a7eb723f1e9 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "executor/instrument.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -620,6 +621,8 @@ InitLocalBuffers(void)
*/
buf->buf_id = -i - 2;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Intentionally do not initialize the buffer's atomic variable
* (besides zeroing the underlying memory above). That way we get
@@ -836,3 +839,65 @@ AtProcExit_LocalBuffers(void)
*/
CheckForLocalBufferLeaks();
}
+
+bool
+ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *buf_hdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ buf_hdr = GetLocalBufferDescriptor(-buffer - 1);
+ blockno = buf_hdr->tag.blockNum;
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&buf_hdr->tag);
+ BlockNumber forkNum = buf_hdr->tag.forkNum;
+
+ MemoryContextSwitchTo(ErrorContext);
+
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ pgaio_io_ref_clear(&buf_hdr->io_in_progress);
+
+ {
+ uint32 buf_state;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ /* release pin held by IO subsystem */
+ LocalRefCount[-buffer - 1] -= 1;
+
+ return buf_failed;
+}
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0015-bufmgr-Use-aio-for-StartReadBuffers.patch (text/x-diff; charset=us-ascii)
From bfd939b88a8dcdbc424c1e7452d70195a46910ae Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:55:59 -0400
Subject: [PATCH v2.1 15/20] bufmgr: Use aio for StartReadBuffers()
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/bufmgr.h | 25 ++-
src/backend/storage/buffer/bufmgr.c | 259 +++++++++++++++++-----------
2 files changed, 182 insertions(+), 102 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6cd64b8c2b3..a075a40b2ed 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -15,6 +15,7 @@
#define BUFMGR_H
#include "port/pg_iovec.h"
+#include "storage/aio_ref.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -107,11 +108,22 @@ typedef struct BufferManagerRelation
#define BMR_REL(p_rel) ((BufferManagerRelation){.rel = p_rel})
#define BMR_SMGR(p_smgr, p_relpersistence) ((BufferManagerRelation){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+
+
/* Zero out page if reading fails. */
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/*
+ * FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
+ * need to include that here. Perhaps this could live in a separate header?
+ */
+#include "storage/aio.h"
+
struct ReadBuffersOperation
{
/* The following members should be set by the caller. */
@@ -131,6 +143,17 @@ struct ReadBuffersOperation
int flags;
int16 nblocks;
int16 io_buffers_len;
+
+ /*
+ * In some rare-ish cases one operation causes multiple reads (e.g. if a
+ * buffer was concurrently read by another backend). It'd be much better
+ * if we ensured that each ReadBuffersOperation covered only one IO - but
+ * that's not entirely trivial, due to having pinned victim buffers before
+ * starting IOs.
+ */
+ int16 nios;
+ PgAioHandleRef refs[MAX_IO_COMBINE_LIMIT];
+ PgAioReturn returns[MAX_IO_COMBINE_LIMIT];
};
typedef struct ReadBuffersOperation ReadBuffersOperation;
@@ -161,8 +184,6 @@ extern PGDLLIMPORT bool track_io_timing;
extern PGDLLIMPORT int effective_io_concurrency;
extern PGDLLIMPORT int maintenance_io_concurrency;
-#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
-#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
extern PGDLLIMPORT int io_combine_limit;
extern PGDLLIMPORT int checkpoint_flush_after;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 976ced82b6a..4914c71d41e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1253,6 +1253,12 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+static bool AsyncReadBuffers(ReadBuffersOperation *operation,
+ Buffer *buffers,
+ BlockNumber blockNum,
+ int *nblocks,
+ int flags);
+
static pg_attribute_always_inline bool
StartReadBuffersImpl(ReadBuffersOperation *operation,
Buffer *buffers,
@@ -1288,6 +1294,12 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
* so we stop here.
*/
actual_nblocks = i + 1;
+
+ ereport(DEBUG3,
+ errmsg("found buf %d, idx %i: %s, data %p",
+ buffers[i], i, DebugPrintBufferRefcount(buffers[i]),
+ BufferGetBlock(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
break;
}
else
@@ -1325,27 +1337,18 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
operation->nblocks = actual_nblocks;
operation->io_buffers_len = io_buffers_len;
- if (flags & READ_BUFFERS_ISSUE_ADVICE)
- {
- /*
- * In theory we should only do this if PinBufferForBlock() had to
- * allocate new buffers above. That way, if two calls to
- * StartReadBuffers() were made for the same blocks before
- * WaitReadBuffers(), only the first would issue the advice. That'd be
- * a better simulation of true asynchronous I/O, which would only
- * start the I/O once, but isn't done here for simplicity. Note also
- * that the following call might actually issue two advice calls if we
- * cross a segment boundary; in a true asynchronous version we might
- * choose to process only one real I/O at a time in that case.
- */
- smgrprefetch(operation->smgr,
- operation->forknum,
- blockNum,
- operation->io_buffers_len);
- }
+ operation->nios = 0;
- /* Indicate that WaitReadBuffers() should be called. */
- return true;
+ /*
+ * TODO: When called for synchronous IO execution, we probably should
+ * enter a dedicated fastpath here.
+ */
+
+ /* initiate the IO */
+ return AsyncReadBuffers(operation,
+ buffers,
+ blockNum,
+ nblocks, flags);
}
/*
@@ -1397,12 +1400,31 @@ StartReadBuffer(ReadBuffersOperation *operation,
}
static inline bool
-WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+ReadBuffersCanStartIO(Buffer buffer, bool nowait)
{
if (BufferIsLocal(buffer))
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ /*
+ * The buffer could have IO in progress by another scan. Right now
+ * localbuf.c doesn't use IO_IN_PROGRESS, which is why we need this
+ * hack.
+ *
+ * AFIXME: localbuf.c should use IO_IN_PROGRESS / have an equivalent
+ * of StartBufferIO().
+ */
+ if (pgaio_io_ref_valid(&bufHdr->io_in_progress))
+ {
+ PgAioHandleRef ior = bufHdr->io_in_progress;
+
+ ereport(DEBUG3,
+ errmsg("waiting for temp buffer IO in CSIO"),
+ errhidestmt(true), errhidecontext(true));
+ pgaio_io_ref_wait(&ior);
+ return false;
+ }
+
return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
else
@@ -1412,12 +1434,7 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
void
WaitReadBuffers(ReadBuffersOperation *operation)
{
- Buffer *buffers;
int nblocks;
- BlockNumber blocknum;
- ForkNumber forknum;
- IOContext io_context;
- IOObject io_object;
char persistence;
/*
@@ -1433,11 +1450,65 @@ WaitReadBuffers(ReadBuffersOperation *operation)
if (nblocks == 0)
return; /* nothing to do */
+ persistence = operation->persistence;
+
+ Assert(operation->nios > 0);
+
+ for (int i = 0; i < operation->nios; i++)
+ {
+ PgAioReturn *aio_ret;
+
+ pgaio_io_ref_wait(&operation->refs[i]);
+
+ aio_ret = &operation->returns[i];
+
+ if (aio_ret->result.status != ARS_OK)
+ pgaio_result_log(aio_ret->result, &aio_ret->subject_data, ERROR);
+ }
+
+ /*
+ * We count all these blocks as read by this backend. This is traditional
+ * behavior, but might turn out to be not true if we find that someone
+ * else has beaten us and completed the read of some of these blocks. In
+ * that case the system globally double-counts, but we traditionally don't
+ * count this as a "hit", and we don't have a separate counter for "miss,
+ * but another backend completed the read".
+ */
+ if (persistence == RELPERSISTENCE_TEMP)
+ pgBufferUsage.local_blks_read += nblocks;
+ else
+ pgBufferUsage.shared_blks_read += nblocks;
+
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * nblocks;
+
+ /* FIXME: io timing */
+ /* FIXME: READ_DONE tracepoint */
+}
+
+static bool
+AsyncReadBuffers(ReadBuffersOperation *operation,
+ Buffer *buffers,
+ BlockNumber blockNum,
+ int *nblocks,
+ int flags)
+{
+ int io_buffers_len = 0;
+ BlockNumber blocknum;
+ ForkNumber forknum;
+ IOContext io_context;
+ IOObject io_object;
+ char persistence;
+ bool did_start_io_overall = false;
+ PgAioHandle *ioh = NULL;
+
buffers = &operation->buffers[0];
blocknum = operation->blocknum;
forknum = operation->forknum;
- persistence = operation->persistence;
+ persistence = operation->rel
+ ? operation->rel->rd_rel->relpersistence
+ : RELPERSISTENCE_PERMANENT;
if (persistence == RELPERSISTENCE_TEMP)
{
io_context = IOCONTEXT_NORMAL;
@@ -1458,25 +1529,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* but another backend completed the read".
*/
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += nblocks;
+ pgBufferUsage.local_blks_read += *nblocks;
else
- pgBufferUsage.shared_blks_read += nblocks;
+ pgBufferUsage.shared_blks_read += *nblocks;
- for (int i = 0; i < nblocks; ++i)
+ for (int i = 0; i < *nblocks; ++i)
{
- int io_buffers_len;
- Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
void *io_pages[MAX_IO_COMBINE_LIMIT];
- instr_time io_start;
+ Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
BlockNumber io_first_block;
+ bool did_start_io_this = false;
+
+ /*
+ * Get IO before ReadBuffersCanStartIO, as pgaio_io_get() might block,
+ * which we don't want after setting IO_IN_PROGRESS.
+ */
+ if (likely(!ioh))
+ ioh = pgaio_io_get(CurrentResourceOwner, &operation->returns[operation->nios]);
/*
* Skip this block if someone else has already completed it. If an
* I/O is already in progress in another backend, this will wait for
* the outcome: either done, or something went wrong and we will
* retry.
+ *
+ * ATODO: Should we wait if we already submitted another IO?
*/
- if (!WaitReadBuffersCanStartIO(buffers[i], false))
+ if (!ReadBuffersCanStartIO(buffers[i], did_start_io_overall))
{
/*
* Report this as a 'hit' for this backend, even though it must
@@ -1488,6 +1567,10 @@ WaitReadBuffers(ReadBuffersOperation *operation)
operation->smgr->smgr_rlocator.locator.relNumber,
operation->smgr->smgr_rlocator.backend,
true);
+
+ ereport(DEBUG3,
+ errmsg("can't start io for first buffer %u", buffers[i]),
+ errhidestmt(true), errhidecontext(true));
continue;
}
@@ -1497,6 +1580,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_first_block = blocknum + i;
io_buffers_len = 1;
+ ereport(DEBUG3,
+ errmsg("first prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(io_buffers[0]), i),
+ errhidestmt(true), errhidecontext(true));
+
/*
* How many neighboring-on-disk blocks can we scatter-read into
* other buffers at the same time? In this case we don't wait if we
@@ -1504,86 +1592,57 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* for the head block, so we should get on with that I/O as soon as
* possible. We'll come back to this block again, above.
*/
- while ((i + 1) < nblocks &&
- WaitReadBuffersCanStartIO(buffers[i + 1], true))
+ while ((i + 1) < *nblocks &&
+ ReadBuffersCanStartIO(buffers[i + 1], true))
{
/* Must be consecutive block numbers. */
Assert(BufferGetBlockNumber(buffers[i + 1]) ==
BufferGetBlockNumber(buffers[i]) + 1);
+ ereport(DEBUG3,
+ errmsg("seq prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(buffers[i + 1]),
+ i + 1),
+ errhidestmt(true), errhidecontext(true));
+
io_buffers[io_buffers_len] = buffers[++i];
io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
}
- io_start = pgstat_prepare_io_time(track_io_timing);
- smgrreadv(operation->smgr, forknum, io_first_block, io_pages, io_buffers_len);
- pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
- io_buffers_len);
+ pgaio_io_get_ref(ioh, &operation->refs[operation->nios]);
- /* Verify each block we read, and terminate the I/O. */
- for (int j = 0; j < io_buffers_len; ++j)
+ pgaio_io_set_io_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
+ if (persistence == RELPERSISTENCE_TEMP)
{
- BufferDesc *bufHdr;
- Block bufBlock;
-
- if (persistence == RELPERSISTENCE_TEMP)
- {
- bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
- bufBlock = LocalBufHdrGetBlock(bufHdr);
- }
- else
- {
- bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
- bufBlock = BufHdrGetBlock(bufHdr);
- }
-
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- memset(bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- }
-
- /* Terminate I/O and set BM_VALID. */
- if (persistence == RELPERSISTENCE_TEMP)
- {
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
- }
-
- /* Report I/Os as completing individually. */
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
- operation->smgr->smgr_rlocator.locator.spcOid,
- operation->smgr->smgr_rlocator.locator.dbOid,
- operation->smgr->smgr_rlocator.locator.relNumber,
- operation->smgr->smgr_rlocator.backend,
- false);
+ pgaio_io_add_shared_cb(ioh, ASC_LOCAL_BUFFER_READ);
+ pgaio_io_set_flag(ioh, AHF_REFERENCES_LOCAL);
}
+ else
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ did_start_io_overall = did_start_io_this = true;
+ smgrstartreadv(ioh, operation->smgr, forknum, io_first_block,
+ io_pages, io_buffers_len);
+ ioh = NULL;
+ operation->nios++;
+
+ /* not obvious what we'd use for time */
+ pgstat_count_io_op_n(io_object, io_context, IOOP_READ, io_buffers_len);
}
+
+ if (ioh)
+ {
+ pgaio_io_release(ioh);
+ ioh = NULL;
+ }
+
+ if (did_start_io_overall)
+ {
+ pgaio_submit_staged();
+ return true;
+ }
+ else
+ return false;
}
/*
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch (text/x-diff; charset=us-ascii)
From ea3373e8793932e849d2904046f76b14ec971549 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 27 Jul 2023 18:59:25 -0700
Subject: [PATCH v2.1 01/20] bufmgr: Return early in
ScheduleBufferTagForWriteback() if fsync=off
As pg_flush_data() doesn't do anything with fsync disabled, there's no point
in tracking the buffer for writeback. Arguably the better fix would be to
change pg_flush_data() to flush data even with fsync off, but that's a
behavioral change, whereas this is just a small optimization.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/1f6b50a7-38ef-4d87-8246-786d39f46ab9@iki.fi
---
src/backend/storage/buffer/bufmgr.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 48520443001..b8680cc8fd4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5899,7 +5899,12 @@ ScheduleBufferTagForWriteback(WritebackContext *wb_context, IOContext io_context
{
PendingWriteback *pending;
- if (io_direct_flags & IO_DIRECT_DATA)
+ /*
+ * As pg_flush_data() doesn't do anything with fsync disabled, there's no
+ * point in tracking in that case.
+ */
+ if (io_direct_flags & IO_DIRECT_DATA ||
+ !enableFsync)
return;
/*
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0002-Allow-lwlocks-to-be-unowned.patch (text/x-diff; charset=us-ascii)
From 7daeafca64fd950bf63fb43cdb31fd578f27c85d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 5 Jan 2021 10:10:36 -0800
Subject: [PATCH v2.1 02/20] Allow lwlocks to be unowned
This is required for AIO so that a lock held during a write can be released
in another backend, which in turn is required to avoid the potential for
deadlocks.
---
src/include/storage/lwlock.h | 2 +
src/backend/storage/lmgr/lwlock.c | 110 ++++++++++++++++++++++--------
2 files changed, 82 insertions(+), 30 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e09..eabf813ce05 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
+extern LWLockMode LWLockDisown(LWLock *l);
+extern void LWLockReleaseUnowned(LWLock *l, LWLockMode mode);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockAnyHeldByMe(LWLock *lock, int nlocks, size_t stride);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db6ed784ab3..a5fa77412ed 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1773,52 +1773,36 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
}
}
-
-/*
- * LWLockRelease - release a previously acquired lock
- */
-void
-LWLockRelease(LWLock *lock)
+static void
+LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
{
- LWLockMode mode;
uint32 oldstate;
bool check_waiters;
- int i;
-
- /*
- * Remove lock from list of locks held. Usually, but not always, it will
- * be the latest-acquired lock; so search array backwards.
- */
- for (i = num_held_lwlocks; --i >= 0;)
- if (lock == held_lwlocks[i].lock)
- break;
-
- if (i < 0)
- elog(ERROR, "lock %s is not held", T_NAME(lock));
-
- mode = held_lwlocks[i].mode;
-
- num_held_lwlocks--;
- for (; i < num_held_lwlocks; i++)
- held_lwlocks[i] = held_lwlocks[i + 1];
-
- PRINT_LWDEBUG("LWLockRelease", lock, mode);
/*
* Release my hold on lock, after that it can immediately be acquired by
* others, even if we still have to wakeup other waiters.
*/
if (mode == LW_EXCLUSIVE)
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_EXCLUSIVE);
else
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
/* nobody else can have that kind of lock */
- Assert(!(oldstate & LW_VAL_EXCLUSIVE));
+ if (mode == LW_EXCLUSIVE)
+ Assert((oldstate & LW_LOCK_MASK) == LW_VAL_EXCLUSIVE);
+ else
+ Assert((oldstate & LW_LOCK_MASK) < LW_VAL_EXCLUSIVE &&
+ (oldstate & LW_LOCK_MASK) >= LW_VAL_SHARED);
if (TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED())
TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+ if (mode == LW_EXCLUSIVE)
+ oldstate -= LW_VAL_EXCLUSIVE;
+ else
+ oldstate -= LW_VAL_SHARED;
+
/*
* We're still waiting for backends to get scheduled, don't wake them up
* again.
@@ -1841,6 +1825,72 @@ LWLockRelease(LWLock *lock)
LWLockWakeup(lock);
}
+ TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+}
+
+void
+LWLockReleaseUnowned(LWLock *lock, LWLockMode mode)
+{
+ LWLockReleaseInternal(lock, mode);
+}
+
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * After calling this function it's the caller's responsibility to ensure
+ * that the lock gets released, even in case of an error. This is only
+ * desirable if the lock is going to be released in a different process than
+ * the process that acquired it.
+ *
+ * Returns the mode in which the lock was held by the current backend.
+ *
+ * NB: This will leave lock->owner pointing to the current backend (if
+ * LOCK_DEBUG is set). We could add a separate flag indicating that, but it
+ * doesn't really seem worth it.
+ *
+ * NB: This does not call RESUME_INTERRUPTS(), but leaves that responsibility
+ * to the caller.
+ */
+LWLockMode
+LWLockDisown(LWLock *lock)
+{
+ LWLockMode mode;
+ int i;
+
+ /*
+ * Remove lock from list of locks held. Usually, but not always, it will
+ * be the latest-acquired lock; so search array backwards.
+ */
+ for (i = num_held_lwlocks; --i >= 0;)
+ if (lock == held_lwlocks[i].lock)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "lock %s is not held", T_NAME(lock));
+
+ mode = held_lwlocks[i].mode;
+
+ num_held_lwlocks--;
+ for (; i < num_held_lwlocks; i++)
+ held_lwlocks[i] = held_lwlocks[i + 1];
+
+ return mode;
+}
+
+/*
+ * LWLockRelease - release a previously acquired lock
+ */
+void
+LWLockRelease(LWLock *lock)
+{
+ LWLockMode mode;
+
+ mode = LWLockDisown(lock);
+
+ PRINT_LWDEBUG("LWLockRelease", lock, mode);
+
+ LWLockReleaseInternal(lock, mode);
+
/*
* Now okay to allow cancel/die interrupts.
*/
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0003-Use-aux-process-resource-owner-in-walsender.patch (text/x-diff; charset=us-ascii)
From 6dacd88481c9c79042a6b5bdc5783ca8f8ce1cce Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 31 Aug 2021 12:16:28 -0700
Subject: [PATCH v2.1 03/20] Use aux process resource owner in walsender
AIO will need a resource owner to do IO. Right now we create a resowner
on-demand during basebackup, and we could do the same for AIO. But it seems
easier to just always create an aux process resowner.
---
src/include/replication/walsender.h | 1 -
src/backend/backup/basebackup.c | 8 ++++--
src/backend/replication/walsender.c | 44 ++++++-----------------------
3 files changed, 13 insertions(+), 40 deletions(-)
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index f2d8297f016..aff0f7a51ca 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -38,7 +38,6 @@ extern PGDLLIMPORT bool log_replication_commands;
extern void InitWalSender(void);
extern bool exec_replication_command(const char *cmd_string);
extern void WalSndErrorCleanup(void);
-extern void WalSndResourceCleanup(bool isCommit);
extern void PhysicalWakeupLogicalWalSnd(void);
extern XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
extern void WalSndSignals(void);
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 14e5ba72e97..0f8cddcbeeb 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -250,8 +250,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
state.bytes_total_is_valid = false;
/* we're going to use a BufFile, so we need a ResourceOwner */
- Assert(CurrentResourceOwner == NULL);
- CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+ Assert(AuxProcessResourceOwner != NULL);
+ Assert(CurrentResourceOwner == AuxProcessResourceOwner ||
+ CurrentResourceOwner == NULL);
+ CurrentResourceOwner = AuxProcessResourceOwner;
backup_started_in_recovery = RecoveryInProgress();
@@ -672,7 +674,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
FreeBackupManifest(&manifest);
/* clean up the resource owner we created */
- WalSndResourceCleanup(true);
+ ReleaseAuxProcessResources(true);
basebackup_progress_done();
}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c5f1009f370..0e847535a64 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -282,10 +282,8 @@ InitWalSender(void)
/* Create a per-walsender data structure in shared memory */
InitWalSenderSlot();
- /*
- * We don't currently need any ResourceOwner in a walsender process, but
- * if we did, we could call CreateAuxProcessResourceOwner here.
- */
+ /* need resource owner for e.g. basebackups */
+ CreateAuxProcessResourceOwner();
/*
* Let postmaster know that we're a WAL sender. Once we've declared us as
@@ -346,7 +344,7 @@ WalSndErrorCleanup(void)
* without a transaction, we've got to clean that up now.
*/
if (!IsTransactionOrTransactionBlock())
- WalSndResourceCleanup(false);
+ ReleaseAuxProcessResources(false);
if (got_STOPPING || got_SIGUSR2)
proc_exit(0);
@@ -355,34 +353,6 @@ WalSndErrorCleanup(void)
WalSndSetState(WALSNDSTATE_STARTUP);
}
-/*
- * Clean up any ResourceOwner we created.
- */
-void
-WalSndResourceCleanup(bool isCommit)
-{
- ResourceOwner resowner;
-
- if (CurrentResourceOwner == NULL)
- return;
-
- /*
- * Deleting CurrentResourceOwner is not allowed, so we must save a pointer
- * in a local variable and clear it first.
- */
- resowner = CurrentResourceOwner;
- CurrentResourceOwner = NULL;
-
- /* Now we can release resources and delete it. */
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_BEFORE_LOCKS, isCommit, true);
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_LOCKS, isCommit, true);
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_AFTER_LOCKS, isCommit, true);
- ResourceOwnerDelete(resowner);
-}
-
/*
* Handle a client's connection abort in an orderly manner.
*/
@@ -685,8 +655,10 @@ UploadManifest(void)
* parsing the manifest will use the cryptohash stuff, which requires a
* resource owner
*/
- Assert(CurrentResourceOwner == NULL);
- CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+ Assert(AuxProcessResourceOwner != NULL);
+ Assert(CurrentResourceOwner == AuxProcessResourceOwner ||
+ CurrentResourceOwner == NULL);
+ CurrentResourceOwner = AuxProcessResourceOwner;
/* Prepare to read manifest data into a temporary context. */
mcxt = AllocSetContextCreate(CurrentMemoryContext,
@@ -723,7 +695,7 @@ UploadManifest(void)
uploaded_manifest_mcxt = mcxt;
/* clean up the resource owner we created */
- WalSndResourceCleanup(true);
+ ReleaseAuxProcessResources(true);
}
/*
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0004-Ensure-a-resowner-exists-for-all-paths-that-may.patch (text/x-diff; charset=us-ascii)
From a70eeb4cc7dd87a693162f0632d5d60bfa17575e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 1 Aug 2024 09:56:36 -0700
Subject: [PATCH v2.1 04/20] Ensure a resowner exists for all paths that may
perform AIO
---
src/backend/bootstrap/bootstrap.c | 7 +++++++
src/backend/replication/logical/logical.c | 6 ++++++
src/backend/utils/init/postinit.c | 3 ++-
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7637581a184..234fdc57ca7 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -331,8 +331,15 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
BaseInit();
bootstrap_signals();
+
+ /* need a resowner for IO during BootStrapXLOG() */
+ CreateAuxProcessResourceOwner();
+
BootStrapXLOG(bootstrap_data_checksum_version);
+ ReleaseAuxProcessResources(true);
+ CurrentResourceOwner = NULL;
+
/*
* To ensure that src/common/link-canary.c is linked into the backend, we
* must call it from somewhere. Here is as good as anywhere.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3fe1774a1e9..be0c7846d00 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -386,6 +386,12 @@ CreateInitDecodingContext(const char *plugin,
slot->data.plugin = plugin_name;
SpinLockRelease(&slot->mutex);
+ if (CurrentResourceOwner == NULL)
+ {
+ Assert(am_walsender);
+ CurrentResourceOwner = AuxProcessResourceOwner;
+ }
+
if (XLogRecPtrIsInvalid(restart_lsn))
ReplicationSlotReserveWal();
else
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 3b50ce19a2c..11128ea461c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -719,7 +719,8 @@ InitPostgres(const char *in_dbname, Oid dboid,
* and ShutdownXLOG will need one. Hence, create said resource owner
* (and register a callback to clean it up after ShutdownXLOG runs).
*/
- CreateAuxProcessResourceOwner();
+ if (!bootstrap)
+ CreateAuxProcessResourceOwner();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0005-bufmgr-smgr-Don-t-cross-segment-boundaries-in-S.patch (text/x-diff)
From 5308c29e3fd09601ad2e63669837f1e7eef45921 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 22:10:35 -0400
Subject: [PATCH v2.1 05/20] bufmgr/smgr: Don't cross segment boundaries in
StartReadBuffers()
With real AIO it doesn't make sense to cross segment boundaries with one
IO. Add smgrmaxcombine() to allow upper layers to query how many blocks can
be merged into one IO.
---
src/include/storage/md.h | 2 ++
src/include/storage/smgr.h | 2 ++
src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
src/backend/storage/smgr/md.c | 17 +++++++++++++++++
src/backend/storage/smgr/smgr.c | 16 ++++++++++++++++
5 files changed, 55 insertions(+)
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 620f10abdeb..b72293c79a5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -32,6 +32,8 @@ extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index e15b20a566a..899d0d681c5 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,8 @@ extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b8680cc8fd4..7e987836335 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1259,6 +1259,7 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
{
int actual_nblocks = *nblocks;
int io_buffers_len = 0;
+ int maxcombine = 0;
Assert(*nblocks > 0);
Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
@@ -1290,6 +1291,23 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
{
/* Extend the readable range to cover this block. */
io_buffers_len++;
+
+ /*
+ * Check how many blocks we can cover with the same IO. The smgr
+ * implementation might e.g. be limited due to a segment boundary.
+ */
+ if (i == 0 && actual_nblocks > 1)
+ {
+ maxcombine = smgrmaxcombine(operation->smgr,
+ operation->forknum,
+ blockNum);
+ if (maxcombine < actual_nblocks)
+ {
+ elog(DEBUG2, "limiting nblocks at %u from %u to %u",
+ blockNum, actual_nblocks, maxcombine);
+ actual_nblocks = maxcombine;
+ }
+ }
}
}
*nblocks = actual_nblocks;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6796756358f..6cd81a61faa 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -803,6 +803,17 @@ buffers_to_iovec(struct iovec *iov, void **buffers, int nblocks)
return iovcnt;
}
+uint32
+mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum)
+{
+ BlockNumber segoff;
+
+ segoff = blocknum % ((BlockNumber) RELSEG_SIZE);
+
+ return RELSEG_SIZE - segoff;
+}
+
/*
* mdreadv() -- Read the specified blocks from a relation.
*/
@@ -833,6 +844,9 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
size_this_segment = nblocks_this_segment * BLCKSZ;
transferred_this_segment = 0;
@@ -956,6 +970,9 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
iovcnt = buffers_to_iovec(iov, (void **) buffers, nblocks_this_segment);
size_this_segment = nblocks_this_segment * BLCKSZ;
transferred_this_segment = 0;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7b9fa103eff..ee31db85eec 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -88,6 +88,8 @@ typedef struct f_smgr
BlockNumber blocknum, int nblocks, bool skipFsync);
bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+ uint32 (*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
@@ -117,6 +119,7 @@ static const f_smgr smgrsw[] = {
.smgr_extend = mdextend,
.smgr_zeroextend = mdzeroextend,
.smgr_prefetch = mdprefetch,
+ .smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
.smgr_writev = mdwritev,
.smgr_writeback = mdwriteback,
@@ -588,6 +591,19 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
}
+/*
+ * smgrmaxcombine() - Return the maximum total number of blocks that can be
+ * combined with an IO starting at blocknum.
+ *
+ * The returned value includes the IO for blocknum itself.
+ */
+uint32
+smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum)
+{
+ return smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum);
+}
+
/*
* smgrreadv() -- read a particular block range from a relation into the
* supplied buffers.
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0006-aio-Basic-subsystem-initialization.patch (text/x-diff)
From 177af4d07a51bac7b785dc02b2abea019d7395e4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 10 Jun 2024 13:42:58 -0700
Subject: [PATCH v2.1 06/20] aio: Basic subsystem initialization
This is split out as a separate commit to make it easier to review the
tendrils into various places.
---
src/include/storage/aio.h | 41 +++++++++++++++++
src/include/storage/aio_init.h | 26 +++++++++++
src/backend/postmaster/postmaster.c | 8 ++++
src/backend/storage/aio/Makefile | 2 +
src/backend/storage/aio/aio.c | 32 +++++++++++++
src/backend/storage/aio/aio_init.c | 46 +++++++++++++++++++
src/backend/storage/aio/meson.build | 2 +
src/backend/storage/ipc/ipci.c | 3 ++
src/backend/tcop/postgres.c | 7 +++
src/backend/utils/init/miscinit.c | 3 ++
src/backend/utils/init/postinit.c | 3 ++
src/backend/utils/misc/guc_tables.c | 11 +++++
src/backend/utils/misc/postgresql.conf.sample | 7 +++
src/tools/pgindent/typedefs.list | 1 +
14 files changed, 192 insertions(+)
create mode 100644 src/include/storage/aio.h
create mode 100644 src/include/storage/aio_init.h
create mode 100644 src/backend/storage/aio/aio.c
create mode 100644 src/backend/storage/aio/aio_init.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
new file mode 100644
index 00000000000..1e4dfd07e89
--- /dev/null
+++ b/src/include/storage/aio.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.h
+ * Main AIO interface
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_H
+#define AIO_H
+
+
+#include "utils/guc_tables.h"
+
+
+/* GUC related */
+extern void assign_io_method(int newval, void *extra);
+
+
+/* Enum for io_method GUC. */
+typedef enum IoMethod
+{
+ IOMETHOD_SYNC = 0,
+} IoMethod;
+
+
+/* We'll default to synchronous execution. */
+#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+
+
+/* GUCs */
+extern const struct config_enum_entry io_method_options[];
+extern int io_method;
+
+
+#endif /* AIO_H */
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
new file mode 100644
index 00000000000..5bcfb8a9d58
--- /dev/null
+++ b/src/include/storage/aio_init.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.h
+ * AIO initialization - kept separate as initialization sites don't need to
+ * know about AIO itself and AIO users don't need to know about initialization.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_init.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INIT_H
+#define AIO_INIT_H
+
+
+extern Size AioShmemSize(void);
+extern void AioShmemInit(void);
+
+extern void pgaio_postmaster_init(void);
+extern void pgaio_postmaster_child_init_local(void);
+extern void pgaio_postmaster_child_init(void);
+
+#endif /* AIO_INIT_H */
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 96bc1d1cfed..70c5ce19f6e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -111,6 +111,7 @@
#include "replication/logicallauncher.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
@@ -941,6 +942,13 @@ PostmasterMain(int argc, char *argv[])
ExitPostmaster(0);
}
+ /*
+ * As AIO might create internal FDs and will trigger shared memory
+ * allocations, we need to do this before reset_shared() and
+ * set_max_safe_fds().
+ */
+ pgaio_postmaster_init();
+
/*
* Set up shared memory and semaphores.
*
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 2f29a9ec4d1..eaeaeeee8e3 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -9,6 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = \
+ aio.o \
+ aio_init.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
new file mode 100644
index 00000000000..d831c772960
--- /dev/null
+++ b/src/backend/storage/aio/aio.c
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.c
+ * Asynchronous I/O subsystem.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+
+
+/* Options for io_method. */
+const struct config_enum_entry io_method_options[] = {
+ {"sync", IOMETHOD_SYNC, false},
+ {NULL, 0, false}
+};
+
+int io_method = DEFAULT_IO_METHOD;
+
+
+void
+assign_io_method(int newval, void *extra)
+{
+}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
new file mode 100644
index 00000000000..1c277a7eb3b
--- /dev/null
+++ b/src/backend/storage/aio/aio_init.c
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.c
+ * Asynchronous I/O subsystem - Initialization
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_init.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio_init.h"
+
+
+Size
+AioShmemSize(void)
+{
+ Size sz = 0;
+
+ return sz;
+}
+
+void
+AioShmemInit(void)
+{
+}
+
+void
+pgaio_postmaster_init(void)
+{
+}
+
+void
+pgaio_postmaster_child_init(void)
+{
+}
+
+void
+pgaio_postmaster_child_init_local(void)
+{
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 10e1aa3b20b..8d20759ebf8 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -1,5 +1,7 @@
# Copyright (c) 2024, PostgreSQL Global Development Group
backend_sources += files(
+ 'aio.c',
+ 'aio_init.c',
'read_stream.c',
)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 6caeca3a8e6..f0227a12a7d 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -39,6 +39,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/dsm_registry.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, WaitLSNShmemSize());
+ size = add_size(size, AioShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -339,6 +341,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
WaitLSNShmemInit();
+ AioShmemInit();
}
/*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8bc6bea1135..4dc46b17b41 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -61,6 +61,7 @@
#include "replication/slot.h"
#include "replication/walsender.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
@@ -4198,6 +4199,12 @@ PostgresSingleUserMain(int argc, char *argv[],
*/
InitProcess();
+ /* AIO is needed during InitPostgres() */
+ pgaio_postmaster_init();
+ pgaio_postmaster_child_init_local();
+
+ set_max_safe_fds();
+
/*
* Now that sufficient infrastructure has been initialized, PostgresMain()
* can do the rest.
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 537d92c0cfd..b8fa2e64ffe 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -40,6 +40,7 @@
#include "postmaster/interrupt.h"
#include "postmaster/postmaster.h"
#include "replication/slotsync.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/latch.h"
@@ -137,6 +138,8 @@ InitPostmasterChild(void)
InitProcessLocalLatch();
InitializeLatchWaitSet();
+ pgaio_postmaster_child_init_local();
+
/*
* If possible, make this process a group leader, so that the postmaster
* can signal any child processes too. Not all processes will have
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 11128ea461c..f1151645242 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -43,6 +43,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -589,6 +590,8 @@ BaseInit(void)
*/
pgstat_initialize();
+ pgaio_postmaster_child_init();
+
/* Do local initialization of storage and buffer managers */
InitSync();
smgrinit();
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 686309db58b..a4b3c7c62bd 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -71,6 +71,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/large_object.h"
@@ -5196,6 +5197,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"io_method", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Selects the method of asynchronous I/O to use."),
+ NULL
+ },
+ &io_method,
+ DEFAULT_IO_METHOD, io_method_options,
+ NULL, assign_io_method, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a2..3a5e307c9dc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -835,6 +835,13 @@
#include = '...' # include file
+#------------------------------------------------------------------------------
+# WIP AIO GUC docs
+#------------------------------------------------------------------------------
+
+#io_method = sync # (change requires restart)
+
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index df3f336bec0..2681dd51bb7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1258,6 +1258,7 @@ IntervalAggState
IntoClause
InvalMessageArray
InvalidationMsgsGroup
+IoMethod
IpcMemoryId
IpcMemoryKey
IpcMemoryState
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0007-aio-Core-AIO-implementation.patch (text/x-diff)
From d1c318432d40aee43b46db6187a033872af96b31 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 15:23:08 -0400
Subject: [PATCH v2.1 07/20] aio: Core AIO implementation
At this point nothing can use AIO - this commit does not include any
implementation of aio subjects / callbacks. That will come in later commits.
Todo:
- lots of cleanup
---
src/include/storage/aio.h | 308 ++++++
src/include/storage/aio_internal.h | 274 +++++
src/include/storage/aio_ref.h | 24 +
src/include/utils/resowner.h | 7 +
src/backend/access/transam/xact.c | 9 +
src/backend/storage/aio/Makefile | 3 +
src/backend/storage/aio/aio.c | 975 ++++++++++++++++++
src/backend/storage/aio/aio_init.c | 304 ++++++
src/backend/storage/aio/aio_io.c | 111 ++
src/backend/storage/aio/aio_subject.c | 167 +++
src/backend/storage/aio/meson.build | 3 +
src/backend/storage/aio/method_sync.c | 43 +
.../utils/activity/wait_event_names.txt | 3 +
src/backend/utils/misc/guc_tables.c | 25 +
src/backend/utils/misc/postgresql.conf.sample | 6 +
src/backend/utils/resowner/resowner.c | 51 +
src/tools/pgindent/typedefs.list | 19 +
17 files changed, 2332 insertions(+)
create mode 100644 src/include/storage/aio_internal.h
create mode 100644 src/include/storage/aio_ref.h
create mode 100644 src/backend/storage/aio/aio_io.c
create mode 100644 src/backend/storage/aio/aio_subject.c
create mode 100644 src/backend/storage/aio/method_sync.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 1e4dfd07e89..c0a59f47bc0 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -15,9 +15,315 @@
#define AIO_H
+#include "storage/aio_ref.h"
+#include "storage/procnumber.h"
#include "utils/guc_tables.h"
+typedef struct PgAioHandle PgAioHandle;
+
+typedef enum PgAioOp
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ PGAIO_OP_INVALID = 0,
+
+ PGAIO_OP_READ,
+ PGAIO_OP_WRITE,
+
+ PGAIO_OP_FSYNC,
+
+ PGAIO_OP_FLUSH_RANGE,
+
+ PGAIO_OP_NOP,
+
+ /**
+ * Eventually we'll additionally want at least:
+ * - send
+ * - recv
+ * - accept
+ **/
+} PgAioOp;
+
+#define PGAIO_OP_COUNT (PGAIO_OP_NOP + 1)
+
+
+/*
+ * What the IO is being performed on.
+ *
+ * Subject-specific behaviour should be implemented in
+ * aio_subject.c.
+ */
+typedef enum PgAioSubjectID
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ ASI_INVALID = 0,
+} PgAioSubjectID;
+
+#define ASI_COUNT (ASI_INVALID + 1)
+
+/*
+ * Flags for an IO that can be set with pgaio_io_set_flag().
+ */
+typedef enum PgAioHandleFlags
+{
+ AHF_REFERENCES_LOCAL = 1 << 0,
+} PgAioHandleFlags;
+
+
+/*
+ * IDs for callbacks that can be registered on an IO.
+ *
+ * Callbacks are identified by an ID rather than a function pointer. There are
+ * two main reasons:
+ *
+ * 1) Memory within PgAioHandle is precious, due to the number of PgAioHandle
+ * structs in pre-allocated shared memory.
+ *
+ * 2) Due to EXEC_BACKEND, function pointers are not necessarily stable between
+ * different backends, therefore function pointers cannot directly be in
+ * shared memory.
+ *
+ * Without 2), we could fairly easily allow adding new callbacks, by filling an
+ * ID->pointer mapping table on demand. In the presence of 2) that's still
+ * doable, but harder, because every process has to re-register the pointers
+ * so that a local ID->"backend local pointer" mapping can be maintained.
+ */
+typedef enum PgAioHandleSharedCallbackID
+{
+ ASC_PLACEHOLDER /* empty enums are invalid */ ,
+} PgAioHandleSharedCallbackID;
+
+
+/*
+ * Data necessary for basic IO types (PgAioOp).
+ *
+ * NB: Note that the FDs in here may *not* be relied upon for re-issuing
+ * requests (e.g. for partial reads/writes) - the FD might be from another
+ * process, or closed since. That's not a problem for IOs waiting to be issued
+ * only because the queue is flushed when closing an FD.
+ */
+typedef union
+{
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } read;
+
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } write;
+
+ struct
+ {
+ int fd;
+ bool datasync;
+ } fsync;
+
+ struct
+ {
+ int fd;
+ uint32 nbytes;
+ uint64 offset;
+ } flush_range;
+} PgAioOpData;
+
+
+/* XXX: Perhaps it's worth moving this to a dedicated file? */
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+typedef union PgAioSubjectData
+{
+ /* just as an example placeholder for later */
+ struct
+ {
+ uint32 queue_id;
+ } wal;
+} PgAioSubjectData;
+
+
+
+typedef enum PgAioResultStatus
+{
+ ARS_UNKNOWN,
+ ARS_OK,
+ ARS_PARTIAL,
+ ARS_ERROR,
+} PgAioResultStatus;
+
+typedef struct PgAioResult
+{
+ PgAioHandleSharedCallbackID id:8;
+ PgAioResultStatus status:2;
+ uint32 error_data:22;
+ int32 result;
+} PgAioResult;
+
+typedef struct PgAioReturn
+{
+ PgAioResult result;
+ PgAioSubjectData subject_data;
+} PgAioReturn;
+
+
+typedef struct PgAioSubjectInfo
+{
+ void (*reopen) (PgAioHandle *ioh);
+
+#ifdef NOT_YET
+ char *(*describe_identity) (PgAioHandle *ioh);
+#endif
+
+ const char *name;
+} PgAioSubjectInfo;
+
+
+typedef PgAioResult (*PgAioHandleSharedCallbackComplete) (PgAioHandle *ioh, PgAioResult prior_result);
+typedef void (*PgAioHandleSharedCallbackPrepare) (PgAioHandle *ioh);
+typedef void (*PgAioHandleSharedCallbackError) (PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+typedef struct PgAioHandleSharedCallbacks
+{
+ PgAioHandleSharedCallbackPrepare prepare;
+ PgAioHandleSharedCallbackComplete complete;
+ PgAioHandleSharedCallbackError error;
+} PgAioHandleSharedCallbacks;
+
+
+
+typedef struct PgAioBounceBuffer PgAioBounceBuffer;
+
+
+/*
+ * How many callbacks can be registered for one IO handle. Currently we only
+ * need two, but it's not hard to imagine needing a few more.
+ */
+#define AIO_MAX_SHARED_CALLBACKS 4
+
+
+
+/* AIO API */
+
+
+/* --------------------------------------------------------------------------------
+ * IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+struct ResourceOwnerData;
+extern PgAioHandle *pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+extern PgAioHandle *pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+
+extern void pgaio_io_release(PgAioHandle *ioh);
+extern void pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error);
+
+extern void pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior);
+
+extern void pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid);
+extern void pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag);
+
+extern void pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid);
+
+extern void pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len);
+extern void pgaio_io_set_io_data_64(PgAioHandle *ioh, uint64 *data, uint8 len);
+extern uint64 *pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len);
+
+extern void pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op);
+
+extern int pgaio_io_get_id(PgAioHandle *ioh);
+struct iovec;
+extern int pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov);
+extern bool pgaio_io_has_subject(PgAioHandle *ioh);
+
+extern PgAioSubjectData *pgaio_io_get_subject_data(PgAioHandle *ioh);
+extern PgAioOpData *pgaio_io_get_op_data(PgAioHandle *ioh);
+extern ProcNumber pgaio_io_get_owner(PgAioHandle *ioh);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO References
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_ref_clear(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_valid(PgAioHandleRef *ior);
+extern int pgaio_io_ref_get_id(PgAioHandleRef *ior);
+
+
+extern void pgaio_io_ref_wait(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_check_done(PgAioHandleRef *ior);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data,
+ int elevel);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+extern PgAioBounceBuffer *pgaio_bounce_buffer_get(void);
+extern void pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb);
+extern uint32 pgaio_bounce_buffer_id(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release(PgAioBounceBuffer *bb);
+extern char *pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_submit_staged(void);
+extern bool pgaio_have_staged(void);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Low level IO preparation routines
+ *
+ * These will often be called by the code at the lowest level of initiating
+ * an IO. E.g. bufmgr.c may initiate IO for a buffer, but pgaio_io_prep_readv()
+ * will be called from within fd.c.
+ *
+ * Implemented in aio_io.c
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+extern void pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_closing_fd(int fd);
+extern void pgaio_at_xact_end(bool is_subxact, bool is_commit);
+extern void pgaio_at_error(void);
+
+
/* GUC related */
extern void assign_io_method(int newval, void *extra);
@@ -36,6 +342,8 @@ typedef enum IoMethod
/* GUCs */
extern const struct config_enum_entry io_method_options[];
extern int io_method;
+extern int io_max_concurrency;
+extern int io_bounce_buffers;
#endif /* AIO_H */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
new file mode 100644
index 00000000000..82bce1cf27c
--- /dev/null
+++ b/src/include/storage/aio_internal.h
@@ -0,0 +1,274 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_internal.h
+ *    Internal declarations for the asynchronous I/O subsystem
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INTERNAL_H
+#define AIO_INTERNAL_H
+
+
+#include "lib/ilist.h"
+#include "port/pg_iovec.h"
+#include "storage/aio.h"
+#include "storage/condition_variable.h"
+
+
+#define PGAIO_VERBOSE
+
+
+/* AFIXME */
+#define PGAIO_SUBMIT_BATCH_SIZE 32
+
+
+
+typedef enum PgAioHandleState
+{
+ /* not in use */
+ AHS_IDLE = 0,
+
+ /* returned by pgaio_io_get() */
+ AHS_HANDED_OUT,
+
+ /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+ AHS_DEFINED,
+
+ /* subjects prepare() callback has been called */
+ AHS_PREPARED,
+
+ /* IO is being executed */
+ AHS_IN_FLIGHT,
+
+ /* IO finished, but result has not yet been processed */
+ AHS_REAPED,
+
+ /* IO completed, shared completion has been called */
+ AHS_COMPLETED_SHARED,
+
+ /* IO completed, local completion has been called */
+ AHS_COMPLETED_LOCAL,
+} PgAioHandleState;
+
+
+struct ResourceOwnerData;
+
+/* typedef is in public header */
+struct PgAioHandle
+{
+ PgAioHandleState state:8;
+
+ /* what are we operating on */
+ PgAioSubjectID subject:8;
+
+ /* which operation */
+ PgAioOp op:8;
+
+ /* bitfield of PgAioHandleFlags */
+ uint8 flags;
+
+ uint8 num_shared_callbacks;
+
+ /* using the proper type here would use more space */
+ uint8 shared_callbacks[AIO_MAX_SHARED_CALLBACKS];
+
+ uint8 iovec_data_len;
+
+ /* XXX: could be optimized out with some pointer math */
+ int32 owner_procno;
+
+ /* FIXME: remove in favor of distilled_result */
+ /* raw result of the IO operation */
+ int32 result;
+
+ /* index into PgAioCtl->iovecs */
+ uint32 iovec_off;
+
+ /*
+ * List of bounce_buffers owned by the IO. It would suffice to use an
+ * index-based linked list here.
+ */
+ slist_head bounce_buffers;
+
+ /**
+ * In which list the handle is registered, depends on the state:
+ * - IDLE, in per-backend list
+ * - HANDED_OUT - not in a list
+ * - DEFINED - in per-backend staged list
+ * - PREPARED - in per-backend staged list
+ * - IN_FLIGHT - not in any list
+ * - REAPED - in per-reap context list
+ * - COMPLETED_SHARED - not in any list
+ * - COMPLETED_LOCAL - not in any list
+ *
+ * XXX: It probably makes sense to optimize this out to save on per-IO
+ * memory at the cost of per-backend memory.
+ **/
+ dlist_node node;
+
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+
+ /* incremented every time the IO handle is reused */
+ uint64 generation;
+
+ ConditionVariable cv;
+
+ /* result of shared callback, passed to issuer callback */
+ PgAioResult distilled_result;
+
+ PgAioReturn *report_return;
+
+ PgAioOpData op_data;
+
+ /*
+ * Data necessary for shared completions. Needs to be sufficient to allow
+ * another backend to retry an IO.
+ */
+ PgAioSubjectData scb_data;
+};
+
+
+struct PgAioBounceBuffer
+{
+ slist_node node;
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+ char *buffer;
+};
+
+
+typedef struct PgAioPerBackend
+{
+ /* index into PgAioCtl->io_handles */
+ uint32 io_handle_off;
+
+ /* index into PgAioCtl->bounce_buffers */
+ uint32 bounce_buffers_off;
+
+ /* IO Handles that currently are not used */
+ dclist_head idle_ios;
+
+ /*
+ * Only one IO may be returned by pgaio_io_get()/pgaio_io_get_nb() without
+ * having been either defined (by actually associating it with an IO) or
+ * released (with pgaio_io_release()). This restriction is necessary to
+ * guarantee that we always can acquire an IO. ->handed_out_io is used to
+ * enforce that rule.
+ */
+ PgAioHandle *handed_out_io;
+
+ /*
+ * IOs that are defined, but not yet submitted.
+ */
+ uint16 num_staged_ios;
+ PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
+
+ /* Bounce Buffers that currently are not used */
+ slist_head idle_bbs;
+
+ /* see handed_out_io */
+ PgAioBounceBuffer *handed_out_bb;
+} PgAioPerBackend;
+
+
+typedef struct PgAioCtl
+{
+ int backend_state_count;
+ PgAioPerBackend *backend_state;
+
+ /*
+ * Array of iovec structs. Each iovec is owned by a specific backend. The
+ * allocation is in PgAioCtl to allow the maximum number of iovecs for
+ * individual IOs to be configurable with a PGC_POSTMASTER GUC.
+ */
+ uint64 iovec_count;
+ struct iovec *iovecs;
+
+ /*
+ * For, e.g., an IO covering multiple buffers in shared / temp buffers, we
+ * need to get Buffer IDs during completion to be able to change the
+ * BufferDesc state accordingly. This space can be used to store e.g.
+ * Buffer IDs. Note that the actual iovec might be shorter than this,
+ * because we combine neighboring pages into one larger iovec entry.
+ */
+ uint64 *iovecs_data;
+
+ /*
+ * Buffers used to perform AIO on data that cannot be operated on directly
+ * in shared memory (either because it is not located there, or because we
+ * need to operate on a copy, as is e.g. the case for writes when checksums
+ * are in use).
+ */
+ uint64 bounce_buffers_count;
+ PgAioBounceBuffer *bounce_buffers;
+ char *bounce_buffers_data;
+
+ uint64 io_handle_count;
+ PgAioHandle *io_handles;
+} PgAioCtl;
+
+
+
+/*
+ * The set of callbacks that each IO method must implement.
+ */
+typedef struct IoMethodOps
+{
+ /* initialization */
+ size_t (*shmem_size) (void);
+ void (*shmem_init) (bool first_time);
+
+ void (*postmaster_init) (void);
+ void (*postmaster_child_init_local) (void);
+ void (*postmaster_child_init) (void);
+
+ /* teardown */
+ void (*postmaster_before_child_exit) (void);
+
+ /* handling of IOs */
+ bool (*needs_synchronous_execution)(PgAioHandle *ioh);
+ int (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+ void (*wait_one) (PgAioHandle *ioh,
+ uint64 ref_generation);
+
+ /* properties */
+ bool can_scatter_gather_direct;
+ bool can_scatter_gather_buffered;
+} IoMethodOps;
+
+
+extern bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state);
+
+extern void pgaio_io_prepare_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion(PgAioHandle *ioh, int result);
+extern void pgaio_io_prepare_submit(PgAioHandle *ioh);
+
+extern bool pgaio_io_needs_synchronous_execution(PgAioHandle *ioh);
+extern void pgaio_io_perform_synchronously(PgAioHandle *ioh);
+
+extern bool pgaio_io_can_reopen(PgAioHandle *ioh);
+extern void pgaio_io_reopen(PgAioHandle *ioh);
+
+extern const char *pgaio_io_get_subject_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+
+/* Declarations for the tables of function pointers exposed by each IO method. */
+extern const IoMethodOps pgaio_sync_ops;
+
+extern const IoMethodOps *pgaio_impl;
+extern PgAioCtl *aio_ctl;
+extern PgAioPerBackend *my_aio;
+
+
+
+#endif /* AIO_INTERNAL_H */
diff --git a/src/include/storage/aio_ref.h b/src/include/storage/aio_ref.h
new file mode 100644
index 00000000000..ad7e9ad34f3
--- /dev/null
+++ b/src/include/storage/aio_ref.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_ref.h
+ *    Definition of PgAioHandleRef, which sometimes needs to be usable in
+ *    headers.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_ref.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_REF_H
+#define AIO_REF_H
+
+typedef struct PgAioHandleRef
+{
+ uint32 aio_index;
+ uint32 generation_upper;
+ uint32 generation_lower;
+} PgAioHandleRef;
+
+#endif /* AIO_REF_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index 4e534bc3e70..0cdd0c13ffb 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -164,4 +164,11 @@ struct LOCALLOCK;
extern void ResourceOwnerRememberLock(ResourceOwner owner, struct LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *locallock);
+/* special support for AIO */
+struct dlist_node;
+extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+extern void ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+
#endif /* RESOWNER_H */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 87700c7c5c7..1fccaa3eb79 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -52,6 +52,7 @@
#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -2462,6 +2463,8 @@ CommitTransaction(void)
AtEOXact_LogicalRepWorkers(true);
pgstat_report_xact_timestamp(0);
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ true);
+
ResourceOwnerDelete(TopTransactionResourceOwner);
s->curTransactionOwner = NULL;
CurTransactionResourceOwner = NULL;
@@ -2976,6 +2979,8 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ false);
+
/*
* State remains TRANS_ABORT until CleanupTransaction().
*/
@@ -5185,6 +5190,8 @@ CommitSubTransaction(void)
AtEOSubXact_PgStat(true, s->nestingLevel);
AtSubCommit_Snapshot(s->nestingLevel);
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ true);
+
/*
* We need to restore the upper transaction's read-only state, in case the
* upper is read-write while the child is read-only; GUC will incorrectly
@@ -5350,6 +5357,8 @@ AbortSubTransaction(void)
AtSubAbort_Snapshot(s->nestingLevel);
}
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ false);
+
/*
* Restore the upper transaction's read-only state, too. This should be
* redundant with GUC's cleanup but we may as well do it for consistency
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index eaeaeeee8e3..b253278f3c1 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -11,6 +11,9 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
aio_init.o \
+ aio_io.o \
+ aio_subject.o \
+ method_sync.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index d831c772960..b5370330620 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -14,7 +14,23 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "utils/resowner.h"
+#include "utils/wait_event_types.h"
+
+
+
+static void pgaio_io_reclaim(PgAioHandle *ioh);
+static void pgaio_io_resowner_register(PgAioHandle *ioh);
+static void pgaio_io_wait_for_free(void);
+static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation);
+
+static void pgaio_bounce_buffer_wait_for_free(void);
+
/* Options for io_method. */
@@ -24,9 +40,968 @@ const struct config_enum_entry io_method_options[] = {
};
int io_method = DEFAULT_IO_METHOD;
+int io_max_concurrency = -1;
+int io_bounce_buffers = -1;
+
+
+/* global control for AIO */
+PgAioCtl *aio_ctl;
+
+/* current backend's per-backend state */
+PgAioPerBackend *my_aio;
+
+
+static const IoMethodOps *pgaio_ops_table[] = {
+ [IOMETHOD_SYNC] = &pgaio_sync_ops,
+};
+
+
+const IoMethodOps *pgaio_impl;
+
+
+
+/* --------------------------------------------------------------------------------
+ * "Core" IO Api
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * AFIXME: rewrite
+ *
+ * Shared completion callbacks can be executed by any backend (otherwise there
+ * would be deadlocks). Therefore they cannot update state for the issuer of
+ * the IO. That can be done with issuer callbacks.
+ *
+ * Note that issuer callbacks are effectively executed in a critical
+ * section. This is necessary as we need to be able to execute IO in critical
+ * sections (consider e.g. WAL logging) and to be able to execute IOs we need
+ * to acquire an IO, which in turn requires executing issuer callbacks. An
+ * alternative scheme could be to defer local callback execution until a later
+ * point, but that gets complicated quickly.
+ *
+ * Therefore the typical pattern is to use an issuer callback to set some
+ * flags in backend local memory, which can then be used to error out at a
+ * later time.
+ *
+ * NB: The issuer callback is cleared when the resowner owning the IO goes out
+ * of scope.
+ */
+PgAioHandle *
+pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ PgAioHandle *h;
+
+ while (true)
+ {
+ h = pgaio_io_get_nb(resowner, ret);
+
+ if (h != NULL)
+ return h;
+
+ /*
+ * Evidently all handles by this backend are in use. Just wait for
+ * some to complete.
+ */
+ pgaio_io_wait_for_free();
+ }
+}
+
+PgAioHandle *
+pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ if (my_aio->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
+ {
+ Assert(my_aio->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
+ pgaio_submit_staged();
+ }
+
+ if (my_aio->handed_out_io)
+ {
+ ereport(ERROR,
+ errmsg("API violation: only one IO can be handed out"));
+ }
+
+ if (!dclist_is_empty(&my_aio->idle_ios))
+ {
+ dlist_node *ion = dclist_pop_head_node(&my_aio->idle_ios);
+ PgAioHandle *ioh = dclist_container(PgAioHandle, node, ion);
+
+ Assert(ioh->state == AHS_IDLE);
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ ioh->state = AHS_HANDED_OUT;
+ my_aio->handed_out_io = ioh;
+
+ if (resowner)
+ pgaio_io_resowner_register(ioh);
+
+ if (ret)
+ ioh->report_return = ret;
+
+ return ioh;
+ }
+
+ return NULL;
+}
+
+void
+pgaio_io_release(PgAioHandle *ioh)
+{
+ if (ioh == my_aio->handed_out_io)
+ {
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(ioh->resowner);
+
+ my_aio->handed_out_io = NULL;
+ pgaio_io_reclaim(ioh);
+ }
+ else
+ {
+ elog(ERROR, "release in unexpected state");
+ }
+}
+
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+ PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+ Assert(ioh->resowner);
+
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ elog(ERROR, "unexpected state: idle handle should not be owned by a resowner");
+ break;
+ case AHS_HANDED_OUT:
+ Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
+
+ if (ioh == my_aio->handed_out_io)
+ {
+ my_aio->handed_out_io = NULL;
+ if (!on_error)
+ elog(WARNING, "leaked AIO handle");
+ }
+
+ pgaio_io_reclaim(ioh);
+ break;
+ case AHS_DEFINED:
+ case AHS_PREPARED:
+ /* XXX: Should we warn about this when is_commit? */
+ pgaio_submit_staged();
+ break;
+ case AHS_IN_FLIGHT:
+ case AHS_REAPED:
+ case AHS_COMPLETED_SHARED:
+ /* this is expected to happen */
+ break;
+ case AHS_COMPLETED_LOCAL:
+ /* XXX: unclear if this ought to be possible? */
+ pgaio_io_reclaim(ioh);
+ break;
+ }
+
+ /*
+ * Need to unregister the reporting of the IO's result, as the memory
+ * it's referencing likely has gone away.
+ */
+ if (ioh->report_return)
+ ioh->report_return = NULL;
+}
+
+int
+pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* AFIXME: Needs to be the value at startup time */
+ return io_combine_limit;
+}
+
+PgAioSubjectData *
+pgaio_io_get_subject_data(PgAioHandle *ioh)
+{
+ return &ioh->scb_data;
+}
+
+PgAioOpData *
+pgaio_io_get_op_data(PgAioHandle *ioh)
+{
+ return &ioh->op_data;
+}
+
+ProcNumber
+pgaio_io_get_owner(PgAioHandle *ioh)
+{
+ return ioh->owner_procno;
+}
+
+bool
+pgaio_io_has_subject(PgAioHandle *ioh)
+{
+ return ioh->subject != ASI_INVALID;
+}
+
+void
+pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->flags |= flag;
+}
+
+void
+pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ for (int i = 0; i < len; i++)
+ aio_ctl->iovecs_data[ioh->iovec_off + i] = data[i];
+ ioh->iovec_data_len = len;
+}
+
+uint64 *
+pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len)
+{
+ Assert(ioh->iovec_data_len > 0);
+
+ *len = ioh->iovec_data_len;
+
+ return &aio_ctl->iovecs_data[ioh->iovec_off];
+}
+
+void
+pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->subject = subjid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, set subject",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh));
+}
+
+void
+pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior)
+{
+ Assert(ioh->state == AHS_HANDED_OUT ||
+ ioh->state == AHS_DEFINED ||
+ ioh->state == AHS_PREPARED);
+ Assert(ioh->generation != 0);
+
+ ior->aio_index = ioh - aio_ctl->io_handles;
+ ior->generation_upper = (uint32) (ioh->generation >> 32);
+ ior->generation_lower = (uint32) ioh->generation;
+}
+
+void
+pgaio_io_ref_clear(PgAioHandleRef *ior)
+{
+ ior->aio_index = PG_UINT32_MAX;
+}
+
+bool
+pgaio_io_ref_valid(PgAioHandleRef *ior)
+{
+ return ior->aio_index != PG_UINT32_MAX;
+}
+
+int
+pgaio_io_ref_get_id(PgAioHandleRef *ior)
+{
+ Assert(pgaio_io_ref_valid(ior));
+ return ior->aio_index;
+}
+
+bool
+pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
+{
+ *state = ioh->state;
+ pg_read_barrier();
+
+ return ioh->generation != ref_generation;
+}
+
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ if (am_owner)
+ {
+ if (state == AHS_DEFINED || state == AHS_PREPARED)
+ {
+ /* XXX: Arguably this should be prevented by callers? */
+ pgaio_submit_staged();
+ }
+ else if (state != AHS_IN_FLIGHT && state != AHS_REAPED && state != AHS_COMPLETED_SHARED && state != AHS_COMPLETED_LOCAL)
+ {
+ elog(PANIC, "waiting for own IO in wrong state: %d",
+ state);
+ }
+
+ /*
+ * Somebody else completed the IO, need to execute issuer callback, so
+ * reclaim eagerly.
+ */
+ if (state == AHS_COMPLETED_LOCAL)
+ {
+ pgaio_io_reclaim(ioh);
+
+ return;
+ }
+ }
+
+ while (true)
+ {
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ switch (state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ elog(ERROR, "IO in wrong state: %d", state);
+ break;
+
+ case AHS_IN_FLIGHT:
+ if (pgaio_impl->wait_one)
+ {
+ pgaio_impl->wait_one(ioh, ref_generation);
+ continue;
+ }
+ /* fallthrough */
+
+ /* waiting for owner to submit */
+ case AHS_PREPARED:
+ case AHS_DEFINED:
+ /* waiting for reaper to complete */
+ /* fallthrough */
+ case AHS_REAPED:
+ /* shouldn't be able to hit this otherwise */
+ Assert(IsUnderPostmaster);
+ /* ensure we're going to get woken up */
+ ConditionVariablePrepareToSleep(&ioh->cv);
+
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
+ {
+ if (state != AHS_REAPED && state != AHS_DEFINED &&
+ state != AHS_IN_FLIGHT)
+ break;
+ ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_COMPLETION);
+ }
+
+ ConditionVariableCancelSleep();
+ break;
+
+ case AHS_COMPLETED_SHARED:
+ /* see above */
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return;
+ case AHS_COMPLETED_LOCAL:
+ return;
+ }
+ }
+}
+
+bool
+pgaio_io_ref_check_done(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return true;
+
+ if (state == AHS_IDLE)
+ return true;
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (state == AHS_COMPLETED_SHARED || state == AHS_COMPLETED_LOCAL)
+ {
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return true;
+ }
+
+ return false;
+}
+
+int
+pgaio_io_get_id(PgAioHandle *ioh)
+{
+ Assert(ioh >= aio_ctl->io_handles &&
+ ioh <= (aio_ctl->io_handles + aio_ctl->io_handle_count));
+ return ioh - aio_ctl->io_handles;
+}
+
+const char *
+pgaio_io_get_state_name(PgAioHandle *ioh)
+{
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ return "IDLE";
+ case AHS_HANDED_OUT:
+ return "HANDED_OUT";
+ case AHS_DEFINED:
+ return "DEFINED";
+ case AHS_PREPARED:
+ return "PREPARED";
+ case AHS_IN_FLIGHT:
+ return "IN_FLIGHT";
+ case AHS_REAPED:
+ return "REAPED";
+ case AHS_COMPLETED_SHARED:
+ return "COMPLETED_SHARED";
+ case AHS_COMPLETED_LOCAL:
+ return "COMPLETED_LOCAL";
+ }
+ pg_unreachable();
+}
+
+/*
+ * Internal, should only be called from pgaio_io_prep_*().
+ */
+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+
+ ioh->op = op;
+ ioh->state = AHS_DEFINED;
+ ioh->result = 0;
+
+ /* allow a new IO to be staged */
+ my_aio->handed_out_io = NULL;
+
+ pgaio_io_prepare_subject(ioh);
+
+ ioh->state = AHS_PREPARED;
+
+ elog(DEBUG3, "io:%d: prepared %s",
+ pgaio_io_get_id(ioh), pgaio_io_get_op_name(ioh));
+
+ if (!pgaio_io_needs_synchronous_execution(ioh))
+ {
+ my_aio->staged_ios[my_aio->num_staged_ios++] = ioh;
+ Assert(my_aio->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+ }
+ else
+ {
+ pgaio_io_prepare_submit(ioh);
+ pgaio_io_perform_synchronously(ioh);
+ }
+}
+
+/*
+ * Handle IO getting completed by a method.
+ */
+void
+pgaio_io_process_completion(PgAioHandle *ioh, int result)
+{
+ Assert(ioh->state == AHS_IN_FLIGHT);
+
+ ioh->result = result;
+
+ pg_write_barrier();
+
+ /* FIXME: should be done in separate function */
+ ioh->state = AHS_REAPED;
+
+ pgaio_io_process_completion_subject(ioh);
+
+ /* ensure results of completion are visible before the new state */
+ pg_write_barrier();
+
+ ioh->state = AHS_COMPLETED_SHARED;
+
+ /* condition variable broadcast ensures state is visible before wakeup */
+ ConditionVariableBroadcast(&ioh->cv);
+
+ if (ioh->owner_procno == MyProcNumber)
+ pgaio_io_reclaim(ioh);
+}
+
+bool
+pgaio_io_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ if (pgaio_impl->needs_synchronous_execution)
+ return pgaio_impl->needs_synchronous_execution(ioh);
+ return false;
+}
+
+/*
+ * Mark the IO as being handed off to the IO method for processing.
+ */
+void
+pgaio_io_prepare_submit(PgAioHandle *ioh)
+{
+ ioh->state = AHS_IN_FLIGHT;
+ pg_write_barrier();
+}
+
+static PgAioHandle *
+pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation)
+{
+ PgAioHandle *ioh;
+
+ Assert(ior->aio_index < aio_ctl->io_handle_count);
+
+ ioh = &aio_ctl->io_handles[ior->aio_index];
+
+ *ref_generation = ((uint64) ior->generation_upper) << 32 |
+ ior->generation_lower;
+
+ Assert(*ref_generation != 0);
+
+ return ioh;
+}
+
+static void
+pgaio_io_resowner_register(PgAioHandle *ioh)
+{
+ Assert(!ioh->resowner);
+ Assert(CurrentResourceOwner);
+
+ ResourceOwnerRememberAioHandle(CurrentResourceOwner, &ioh->resowner_node);
+ ioh->resowner = CurrentResourceOwner;
+}
+
+static void
+pgaio_io_reclaim(PgAioHandle *ioh)
+{
+ /* This is only ok if it's our IO */
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ ereport(DEBUG3,
+ errmsg("reclaiming io:%d, state: %s, op %s, subject %s, result: %d, distilled_result: AFIXME, report to: %p",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_state_name(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->result,
+ ioh->report_return
+ ),
+ errhidestmt(true), errhidecontext(true));
+
+ if (ioh->report_return)
+ {
+ if (ioh->state != AHS_HANDED_OUT)
+ {
+ ioh->report_return->result = ioh->distilled_result;
+ ioh->report_return->subject_data = ioh->scb_data;
+ }
+ }
+
+ /* reclaim all associated bounce buffers */
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ slist_mutable_iter it;
+
+ slist_foreach_modify(it, &ioh->bounce_buffers)
+ {
+ PgAioBounceBuffer *bb = slist_container(PgAioBounceBuffer, node, it.cur);
+
+ slist_delete_current(&it);
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ }
+ }
+
+ if (ioh->resowner)
+ {
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+ }
+
+ Assert(!ioh->resowner);
+
+ ioh->num_shared_callbacks = 0;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->flags = 0;
+
+ pg_write_barrier();
+ ioh->generation++;
+ pg_write_barrier();
+ ioh->state = AHS_IDLE;
+ pg_write_barrier();
+
+ dclist_push_tail(&my_aio->idle_ios, &ioh->node);
+}
+
+static void
+pgaio_io_wait_for_free(void)
+{
+ bool found_handed_out = false;
+ int reclaimed = 0;
+ static uint32 lastpos = 0;
+
+ elog(DEBUG2,
+ "waiting for self: %d pending",
+ my_aio->num_staged_ios);
+
+ /*
+ * First check if any of our IOs have actually completed - when using
+ * worker mode, that'll often be the case. We could do so as part of the
+ * loop below, but that'd potentially lead us to wait for an IO that was
+ * submitted earlier.
+ */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[my_aio->io_handle_off + i];
+
+ if (ioh->state == AHS_COMPLETED_SHARED)
+ {
+ pgaio_io_reclaim(ioh);
+ reclaimed++;
+ }
+ }
+
+ if (reclaimed > 0)
+ return;
+
+ if (my_aio->num_staged_ios > 0)
+ {
+ elog(DEBUG2, "submitting while acquiring free io");
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = my_aio->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &aio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+
+ /*
+ * While one might think that pgaio_io_get_nb() should have
+ * succeeded, this is reachable because the IO could have
+ * completed during the submission above.
+ */
+ return;
+ case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARED:
+ case AHS_COMPLETED_LOCAL:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case AHS_HANDED_OUT:
+ if (found_handed_out)
+ elog(ERROR, "more than one handed out IO");
+ found_handed_out = true;
+ continue;
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d",
+ pgaio_io_get_id(ioh));
+ lastpos = i;
+ return;
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+ lastpos = i;
+ return;
+ }
+ }
+
+ elog(PANIC, "could not reclaim any handles");
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+PgAioBounceBuffer *
+pgaio_bounce_buffer_get(void)
+{
+ PgAioBounceBuffer *bb = NULL;
+ slist_node *node;
+
+ if (my_aio->handed_out_bb != NULL)
+ elog(ERROR, "can only hand out one BB");
+
+ /*
+ * FIXME: It probably is not correct to have bounce buffers be
+ * per-backend; they use too much memory.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ {
+ pgaio_bounce_buffer_wait_for_free();
+ }
+
+ node = slist_pop_head_node(&my_aio->idle_bbs);
+ bb = slist_container(PgAioBounceBuffer, node, node);
+
+ my_aio->handed_out_bb = bb;
+
+ bb->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberAioBounceBuffer(bb->resowner, &bb->resowner_node);
+
+ return bb;
+}
+
+void
+pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only assign handed out BB");
+ my_aio->handed_out_bb = NULL;
+
+ /*
+ * There can be many bounce buffers assigned in case of vectorized IOs.
+ */
+ slist_push_head(&ioh->bounce_buffers, &bb->node);
+
+ /* once associated with an IO, the IO has ownership */
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+uint32
+pgaio_bounce_buffer_id(PgAioBounceBuffer *bb)
+{
+ return bb - aio_ctl->bounce_buffers;
+}
+
+void
+pgaio_bounce_buffer_release(PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only release handed out BB");
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ my_aio->handed_out_bb = NULL;
+
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+void
+pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error)
+{
+ PgAioBounceBuffer *bb = dlist_container(PgAioBounceBuffer, resowner_node, bb_node);
+
+ Assert(bb->resowner);
+
+ if (!on_error)
+ elog(WARNING, "leaked AIO bounce buffer");
+
+ pgaio_bounce_buffer_release(bb);
+}
+
+char *
+pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb)
+{
+ return bb->buffer;
+}
+
+static void
+pgaio_bounce_buffer_wait_for_free(void)
+{
+ static uint32 lastpos = 0;
+
+ if (my_aio->num_staged_ios > 0)
+ {
+ elog(DEBUG2, "submitting while acquiring free bb");
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = my_aio->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &aio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ continue;
+ case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARED:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d to reclaim BB",
+ pgaio_io_get_id(ioh));
+
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(WARNING, "empty after wait");
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ case AHS_COMPLETED_LOCAL:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ break;
+ }
+ }
+
+ /*
+ * The submission above could have caused the IO to complete at any time.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(PANIC, "no more bbs");
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_submit_staged(void)
+{
+ int total_submitted = 0;
+ int did_submit;
+
+ if (my_aio->num_staged_ios == 0)
+ return;
+
+ START_CRIT_SECTION();
+
+ did_submit = pgaio_impl->submit(my_aio->num_staged_ios, my_aio->staged_ios);
+
+ END_CRIT_SECTION();
+
+ total_submitted += did_submit;
+
+ Assert(total_submitted == did_submit);
+
+ my_aio->num_staged_ios = 0;
+
+#ifdef PGAIO_VERBOSE
+ ereport(DEBUG2,
+ errmsg("submitted %d",
+ total_submitted),
+ errhidestmt(true),
+ errhidecontext(true));
+#endif
+}
+
+bool
+pgaio_have_staged(void)
+{
+ return my_aio->num_staged_ios > 0;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Need to submit staged but not yet submitted IOs using the fd, otherwise
+ * the IO would end up targeting something bogus.
+ */
+void
+pgaio_closing_fd(int fd)
+{
+ /*
+ * Might be called before AIO is initialized or in a subprocess that
+ * doesn't use AIO.
+ */
+ if (!my_aio)
+ return;
+
+ /*
+ * For now just submit all staged IOs - we could be more selective, but
+ * it's probably not worth it.
+ */
+ pgaio_submit_staged();
+}
+
+void
+pgaio_at_xact_end(bool is_subxact, bool is_commit)
+{
+ Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
+}
+
+/*
+ * Similar to pgaio_at_xact_end(..., is_commit = false), but for cases where
+ * errors happen outside of transactions.
+ */
+void
+pgaio_at_error(void)
+{
+ Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
+}
void
assign_io_method(int newval, void *extra)
{
+ pgaio_impl = pgaio_ops_table[newval];
}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 1c277a7eb3b..e25bdf1dba0 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -14,33 +14,337 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/aio_init.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+static Size
+AioCtlShmemSize(void)
+{
+ Size sz;
+
+ /* aio_ctl itself */
+ sz = offsetof(PgAioCtl, io_handles);
+
+ return sz;
+}
+
+static uint32
+AioProcs(void)
+{
+ return MaxBackends + NUM_AUXILIARY_PROCS;
+}
+
+static Size
+AioBackendShmemSize(void)
+{
+ return mul_size(AioProcs(), sizeof(PgAioPerBackend));
+}
+
+static Size
+AioHandleShmemSize(void)
+{
+ Size sz;
+
+ /* ios */
+ sz = mul_size(AioProcs(),
+ mul_size(io_max_concurrency, sizeof(PgAioHandle)));
+
+ return sz;
+}
+
+static Size
+AioIOVShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(struct iovec),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioIOVDataShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(uint64),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioBounceBufferDescShmemSize(void)
+{
+ Size sz;
+
+ /* PgAioBounceBuffer itself */
+ sz = mul_size(sizeof(PgAioBounceBuffer),
+ mul_size(AioProcs(), io_bounce_buffers));
+
+ return sz;
+}
+
+static Size
+AioBounceBufferDataShmemSize(void)
+{
+ Size sz;
+
+ /* and the associated buffer */
+ sz = mul_size(BLCKSZ,
+ mul_size(io_bounce_buffers, AioProcs()));
+ /* memory for alignment */
+ sz += BLCKSZ;
+
+ return sz;
+}
+
+/*
+ * Choose a suitable value for io_max_concurrency.
+ *
+ * It's unlikely that we could have more IOs in flight than buffers that we
+ * would be allowed to pin.
+ *
+ * On the upper end, apply a cap too - just because shared_buffers is large,
+ * it doesn't make sense to have millions of buffers undergo IO concurrently.
+ */
+static int
+AioChooseMaxConccurrency(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 64);
+}
+
+/*
+ * Choose a suitable value for io_bounce_buffers.
+ *
+ * It's very unlikely to be useful to allocate more bounce buffers for each
+ * backend than the backend is allowed to pin. Additionally, bounce buffers
+ * currently are used only for writes, and it seems very uncommon for more
+ * than 10% of shared_buffers to be written out concurrently.
+ *
+ * XXX: This can quickly take up significant amounts of memory; the logic
+ * should probably be fine-tuned.
+ */
+static int
+AioChooseBounceBuffers(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = (NBuffers / 10) / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 256);
+}
+
Size
AioShmemSize(void)
{
Size sz = 0;
+ /*
+ * If io_max_concurrency is -1, we automatically choose a suitable value.
+ *
+ * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
+ * However, if the DBA explicitly set io_max_concurrency = -1 in the config
+ * file, then PGC_S_DYNAMIC_DEFAULT will fail to override that and we must
+ * force the matter with PGC_S_OVERRIDE.
+ */
+ if (io_max_concurrency == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseMaxConccurrency());
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_max_concurrency == -1) /* failed to apply it? */
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+
+ /*
+ * If io_bounce_buffers is -1, we automatically choose a suitable value.
+ *
+ * See also comment above.
+ */
+ if (io_bounce_buffers == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseBounceBuffers());
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_bounce_buffers == -1) /* failed to apply it? */
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+ sz = add_size(sz, AioCtlShmemSize());
+ sz = add_size(sz, AioBackendShmemSize());
+ sz = add_size(sz, AioHandleShmemSize());
+ sz = add_size(sz, AioIOVShmemSize());
+ sz = add_size(sz, AioIOVDataShmemSize());
+ sz = add_size(sz, AioBounceBufferDescShmemSize());
+ sz = add_size(sz, AioBounceBufferDataShmemSize());
+
+ if (pgaio_impl->shmem_size)
+ sz = add_size(sz, pgaio_impl->shmem_size());
+
return sz;
}
void
AioShmemInit(void)
{
+ bool found;
+ uint32 io_handle_off = 0;
+ uint32 iovec_off = 0;
+ uint32 bounce_buffers_off = 0;
+ uint32 per_backend_iovecs = io_max_concurrency * io_combine_limit;
+ uint32 per_backend_bb = io_bounce_buffers;
+ char *bounce_buffers_data;
+
+ aio_ctl = (PgAioCtl *)
+ ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
+
+ if (found)
+ goto out;
+
+ memset(aio_ctl, 0, AioCtlShmemSize());
+
+ aio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
+ aio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+ aio_ctl->bounce_buffers_count = AioProcs() * per_backend_bb;
+
+ aio_ctl->backend_state = (PgAioPerBackend *)
+ ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
+
+ aio_ctl->io_handles = (PgAioHandle *)
+ ShmemInitStruct("AioHandle", AioHandleShmemSize(), &found);
+
+ aio_ctl->iovecs = ShmemInitStruct("AioIOV", AioIOVShmemSize(), &found);
+ aio_ctl->iovecs_data = ShmemInitStruct("AioIOVData", AioIOVDataShmemSize(), &found);
+
+ aio_ctl->bounce_buffers = ShmemInitStruct("AioBounceBufferDesc", AioBounceBufferDescShmemSize(), &found);
+
+ bounce_buffers_data = ShmemInitStruct("AioBounceBufferData", AioBounceBufferDataShmemSize(), &found);
+ bounce_buffers_data = (char *) TYPEALIGN(BLCKSZ, (uintptr_t) bounce_buffers_data);
+ aio_ctl->bounce_buffers_data = bounce_buffers_data;
+
+
+ /* Initialize IO handles. */
+ for (uint64 i = 0; i < aio_ctl->io_handle_count; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[i];
+
+ ioh->op = PGAIO_OP_INVALID;
+ ioh->subject = ASI_INVALID;
+ ioh->state = AHS_IDLE;
+
+ slist_init(&ioh->bounce_buffers);
+ }
+
+ /* Initialize Bounce Buffers. */
+ for (uint64 i = 0; i < aio_ctl->bounce_buffers_count; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[i];
+
+ bb->buffer = bounce_buffers_data;
+ bounce_buffers_data += BLCKSZ;
+ }
+
+
+ for (int procno = 0; procno < AioProcs(); procno++)
+ {
+ PgAioPerBackend *bs = &aio_ctl->backend_state[procno];
+
+ bs->io_handle_off = io_handle_off;
+ io_handle_off += io_max_concurrency;
+
+ bs->bounce_buffers_off = bounce_buffers_off;
+ bounce_buffers_off += per_backend_bb;
+
+ dclist_init(&bs->idle_ios);
+ memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+ slist_init(&bs->idle_bbs);
+
+ /* initialize per-backend IOs */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[bs->io_handle_off + i];
+
+ ioh->generation = 1;
+ ioh->owner_procno = procno;
+ ioh->iovec_off = iovec_off;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->resowner = NULL;
+ ioh->num_shared_callbacks = 0;
+ ioh->distilled_result.status = ARS_UNKNOWN;
+ ioh->flags = 0;
+
+ ConditionVariableInit(&ioh->cv);
+
+ dclist_push_tail(&bs->idle_ios, &ioh->node);
+ iovec_off += io_combine_limit;
+ }
+
+ /* initialize per-backend bounce buffers */
+ for (int i = 0; i < per_backend_bb; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[bs->bounce_buffers_off + i];
+
+ slist_push_head(&bs->idle_bbs, &bb->node);
+ }
+ }
+
+out:
+ /* Initialize IO method specific resources. */
+ if (pgaio_impl->shmem_init)
+ pgaio_impl->shmem_init(!found);
}
void
pgaio_postmaster_init(void)
{
+ if (pgaio_impl->postmaster_init)
+ pgaio_impl->postmaster_init();
}
void
pgaio_postmaster_child_init(void)
{
+ /* shouldn't be initialized twice */
+ Assert(!my_aio);
+
+ if (MyProc == NULL || MyProcNumber >= AioProcs())
+ elog(ERROR, "aio requires a normal PGPROC");
+
+ my_aio = &aio_ctl->backend_state[MyProcNumber];
+
+ if (pgaio_impl->postmaster_child_init)
+ pgaio_impl->postmaster_child_init();
}
void
pgaio_postmaster_child_init_local(void)
{
+ if (pgaio_impl->postmaster_child_init_local)
+ pgaio_impl->postmaster_child_init_local();
}
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
new file mode 100644
index 00000000000..5b2f9ee3ba6
--- /dev/null
+++ b/src/backend/storage/aio/aio_io.c
@@ -0,0 +1,111 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_io.c
+ * Asynchronous I/O subsystem.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_io.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "utils/wait_event.h"
+
+
+static void
+pgaio_io_before_prep(PgAioHandle *ioh)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+}
+
+const char *
+pgaio_io_get_op_name(PgAioHandle *ioh)
+{
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_INVALID:
+ return "invalid";
+ case PGAIO_OP_READ:
+ return "read";
+ case PGAIO_OP_WRITE:
+ return "write";
+ case PGAIO_OP_FSYNC:
+ return "fsync";
+ case PGAIO_OP_FLUSH_RANGE:
+ return "flush_range";
+ case PGAIO_OP_NOP:
+ return "nop";
+ }
+
+ pg_unreachable();
+}
+
+void
+pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.read.fd = fd;
+ ioh->op_data.read.offset = offset;
+ ioh->op_data.read.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_READ);
+}
+
+void
+pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.write.fd = fd;
+ ioh->op_data.write.offset = offset;
+ ioh->op_data.write.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_WRITE);
+}
+
+
+void
+pgaio_io_perform_synchronously(PgAioHandle *ioh)
+{
+ ssize_t result = 0;
+ struct iovec *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* Perform IO. */
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READ:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_READ);
+ result = pg_preadv(ioh->op_data.read.fd, iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_WRITE:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_WRITE);
+ result = pg_pwritev(ioh->op_data.write.fd, iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ pgstat_report_wait_end();
+ break;
+ default:
+ elog(ERROR, "IO operation %d not yet supported", ioh->op);
+ }
+
+ ioh->result = result < 0 ? -errno : result;
+
+ pgaio_io_process_completion(ioh, ioh->result);
+}
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
new file mode 100644
index 00000000000..51ee3b3969d
--- /dev/null
+++ b/src/backend/storage/aio/aio_subject.c
@@ -0,0 +1,167 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_subject.c
+ * IO completion handling for IOs on different subjects
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_subject.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+
+static const PgAioSubjectInfo *aio_subject_info[] = {
+ [ASI_INVALID] = &(PgAioSubjectInfo) {
+ .name = "invalid",
+ },
+};
+
+static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
+};
+
+
+void
+pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid)
+{
+ if (cbid >= lengthof(aio_shared_cbs))
+ elog(ERROR, "callback %d is out of range", cbid);
+ if (aio_shared_cbs[cbid]->complete == NULL)
+ elog(ERROR, "callback %d is undefined", cbid);
+ if (ioh->num_shared_callbacks >= AIO_MAX_SHARED_CALLBACKS)
+ elog(PANIC, "too many callbacks, the max is %d", AIO_MAX_SHARED_CALLBACKS);
+ ioh->shared_callbacks[ioh->num_shared_callbacks] = cbid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, adding cbid num %d, id %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->num_shared_callbacks + 1, cbid);
+
+ ioh->num_shared_callbacks++;
+}
+
+const char *
+pgaio_io_get_subject_name(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+
+ return aio_subject_info[ioh->subject]->name;
+}
+
+void
+pgaio_io_prepare_subject(PgAioHandle *ioh)
+{
+ Assert(ioh->subject > ASI_INVALID && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleSharedCallbacks *cbs = aio_shared_cbs[cbid];
+
+ if (!cbs->prepare)
+ continue;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cbid num %d, id %d: prepare",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i, cbid);
+ cbs->prepare(ioh);
+ }
+}
+
+void
+pgaio_io_process_completion_subject(PgAioHandle *ioh)
+{
+ PgAioResult result;
+
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ result.status = ARS_OK; /* low level IO is always considered OK */
+ result.result = ioh->result;
+ result.id = 0; /* FIXME */
+ result.error_data = 0;
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid;
+
+ cbid = ioh->shared_callbacks[i - 1];
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cbid num %d, id %d with distilled result status %d, id %u, error_data: %d, result: %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i, cbid,
+ result.status,
+ result.id,
+ result.error_data,
+ result.result);
+ result = aio_shared_cbs[cbid]->complete(ioh, result);
+ }
+
+ ioh->distilled_result = result;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, distilled result status %d, id %u, error_data: %d, result: %d, raw_result %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ result.status,
+ result.id,
+ result.error_data,
+ result.result,
+ ioh->result);
+}
+
+bool
+pgaio_io_can_reopen(PgAioHandle *ioh)
+{
+ return aio_subject_info[ioh->subject]->reopen != NULL;
+}
+
+void
+pgaio_io_reopen(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ aio_subject_info[ioh->subject]->reopen(ioh);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ const PgAioHandleSharedCallbacks *scb;
+
+ Assert(result.status != ARS_UNKNOWN);
+ Assert(result.status != ARS_OK);
+
+ scb = aio_shared_cbs[result.id];
+
+ if (scb->error == NULL)
+ elog(ERROR, "scb id %d does not have error callback", result.id);
+
+ scb->error(result, subject_data, elevel);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8d20759ebf8..8339d473aae 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -3,5 +3,8 @@
backend_sources += files(
'aio.c',
'aio_init.c',
+ 'aio_io.c',
+ 'aio_subject.c',
+ 'method_sync.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_sync.c b/src/backend/storage/aio/method_sync.c
new file mode 100644
index 00000000000..9a3e70bde33
--- /dev/null
+++ b/src/backend/storage/aio/method_sync.c
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_sync.c
+ * "AIO" implementation that just executes IO synchronously
+ *
+ * This method exists mainly to check whether AIO use causes regressions. Other IO
+ * methods might also fall back to the synchronous method for functionality
+ * they cannot provide.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+
+static bool pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_sync_ops = {
+ .needs_synchronous_execution = pgaio_sync_needs_synchronous_execution,
+ .submit = pgaio_sync_submit,
+};
+
+static bool
+pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return true;
+}
+
+static int
+pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ elog(ERROR, "should be unreachable");
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6f..99ec8321746 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -191,6 +191,9 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
+AIO_SUBMIT "Waiting for AIO submission."
+AIO_DRAIN "Waiting for IOs to finish."
+AIO_COMPLETION "Waiting for completion callback."
BASEBACKUP_READ "Waiting for base backup to read from a file."
BASEBACKUP_SYNC "Waiting for data written by a base backup to reach durable storage."
BASEBACKUP_WRITE "Waiting for base backup to write to a file."
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a4b3c7c62bd..e5886f3b0e9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3201,6 +3201,31 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_max_concurrency",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IOs that may be in flight in one backend."),
+ NULL,
+ },
+ &io_max_concurrency,
+ -1, -1, 1024,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"io_bounce_buffers",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO Bounce Buffers reserved for each backend."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &io_bounce_buffers,
+ -1, -1, 4096,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3a5e307c9dc..ed746b8a533 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -841,6 +841,12 @@
#io_method = sync # (change requires restart)
+#io_max_concurrency = 32 # Max number of IOs that may be in
+ # flight at the same time in one backend
+ # (change requires restart)
+#io_bounce_buffers = -1 # -1 sets based on shared_buffers
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 505534ee8d3..d1932b7393c 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -47,6 +47,8 @@
#include "common/hashfn.h"
#include "common/int.h"
+#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -155,6 +157,13 @@ struct ResourceOwnerData
/* The local locks cache. */
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+
+ /*
+ * AIO handles & bounce buffers need to be registered in critical sections
+ * and therefore cannot use the normal ResourceElem mechanism.
+ */
+ dlist_head aio_handles;
+ dlist_head aio_bounce_buffers;
};
@@ -425,6 +434,9 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
parent->firstchild = owner;
}
+ dlist_init(&owner->aio_handles);
+ dlist_init(&owner->aio_bounce_buffers);
+
return owner;
}
@@ -725,6 +737,21 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* so issue warnings. In the abort case, just clean up quietly.
*/
ResourceOwnerReleaseAll(owner, phase, isCommit);
+
+ /* XXX: Could probably be a later phase? */
+ while (!dlist_is_empty(&owner->aio_handles))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_handles);
+
+ pgaio_io_release_resowner(node, !isCommit);
+ }
+
+ while (!dlist_is_empty(&owner->aio_bounce_buffers))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_bounce_buffers);
+
+ pgaio_bounce_buffer_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1082,3 +1109,27 @@ ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
elog(ERROR, "lock reference %p is not owned by resource owner %s",
locallock, owner->name);
}
+
+void
+ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_bounce_buffers, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_bounce_buffers, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2681dd51bb7..2f463d29ca1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1259,6 +1259,7 @@ IntoClause
InvalMessageArray
InvalidationMsgsGroup
IoMethod
+IoMethodOps
IpcMemoryId
IpcMemoryKey
IpcMemoryState
@@ -2094,6 +2095,24 @@ Permutation
PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
+PgAioBounceBuffer
+PgAioCtl
+PgAioHandle
+PgAioHandleFlags
+PgAioHandleRef
+PgAioHandleState
+PgAioHandleSharedCallbackID
+PgAioHandleSharedCallbacks
+PgAioOp
+PgAioOpData
+PgAioPerBackend
+PgAioResult
+PgAioResultStatus
+PgAioReturn
+PgAioSubjectData
+PgAioSubjectID
+PgAioSubjectInfo
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.45.2.827.g557ae147e6
v2.1-0008-aio-Skeleton-IO-worker-infrastructure.patch
From 00976ef4bb067dda2454e0f4c4a74fc421715954 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 15:24:51 -0400
Subject: [PATCH v2.1 08/20] aio: Skeleton IO worker infrastructure
This doesn't do anything useful on its own, but the code that needs to be
touched is independent of other changes.
Remarks:
- should completely get rid of ID assignment logic in postmaster.c
- postmaster.c badly needs a refactoring.
- dynamic increase / decrease of workers based on IO load
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/miscadmin.h | 2 +
src/include/postmaster/postmaster.h | 1 +
src/include/storage/aio_init.h | 2 +
src/include/storage/io_worker.h | 22 +++
src/include/storage/proc.h | 4 +-
src/backend/postmaster/launch_backend.c | 2 +
src/backend/postmaster/postmaster.c | 186 ++++++++++++++++--
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_worker.c | 84 ++++++++
src/backend/tcop/postgres.c | 2 +
src/backend/utils/activity/pgstat_io.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
16 files changed, 311 insertions(+), 15 deletions(-)
create mode 100644 src/include/storage/io_worker.h
create mode 100644 src/backend/storage/aio/method_worker.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 25348e71eb9..d043445b544 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -352,6 +352,7 @@ typedef enum BackendType
B_ARCHIVER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_IO_WORKER,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SUMMARIZER,
@@ -380,6 +381,7 @@ extern PGDLLIMPORT BackendType MyBackendType;
#define AmWalReceiverProcess() (MyBackendType == B_WAL_RECEIVER)
#define AmWalSummarizerProcess() (MyBackendType == B_WAL_SUMMARIZER)
#define AmWalWriterProcess() (MyBackendType == B_WAL_WRITER)
+#define AmIoWorkerProcess() (MyBackendType == B_IO_WORKER)
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 63c12917cfe..4cc000df79e 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -62,6 +62,7 @@ extern void InitProcessGlobals(void);
extern int MaxLivePostmasterChildren(void);
extern bool PostmasterMarkPIDForWorkerNotify(int);
+extern void assign_io_workers(int newval, void *extra);
#ifdef WIN32
extern void pgwin32_register_deadchild_callback(HANDLE procHandle, DWORD procId);
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
index 5bcfb8a9d58..a38dd982fbe 100644
--- a/src/include/storage/aio_init.h
+++ b/src/include/storage/aio_init.h
@@ -23,4 +23,6 @@ extern void pgaio_postmaster_init(void);
extern void pgaio_postmaster_child_init_local(void);
extern void pgaio_postmaster_child_init(void);
+extern bool pgaio_workers_enabled(void);
+
#endif /* AIO_INIT_H */
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
new file mode 100644
index 00000000000..ba5dcb9e6e4
--- /dev/null
+++ b/src/include/storage/io_worker.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_worker.h
+ * IO worker for implementing AIO "ourselves"
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_worker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_WORKER_H
+#define IO_WORKER_H
+
+
+extern void IoWorkerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
+
+extern int io_workers;
+
+#endif /* IO_WORKER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index deeb06c9e01..b466ba843d6 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -442,7 +442,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 6
+#define MAX_IO_WORKERS 32
+#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
+
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index 0ae23fdf55e..78429b2af2f 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -55,6 +55,7 @@
#include "replication/walreceiver.h"
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -199,6 +200,7 @@ static child_process_kind child_process_kinds[] = {
[B_ARCHIVER] = {"archiver", PgArchiverMain, true},
[B_BG_WRITER] = {"bgwriter", BackgroundWriterMain, true},
[B_CHECKPOINTER] = {"checkpointer", CheckpointerMain, true},
+ [B_IO_WORKER] = {"io_worker", IoWorkerMain, true},
[B_STARTUP] = {"startup", StartupProcessMain, true},
[B_WAL_RECEIVER] = {"wal_receiver", WalReceiverMain, true},
[B_WAL_SUMMARIZER] = {"wal_summarizer", WalSummarizerMain, true},
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 70c5ce19f6e..3d970374733 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -113,6 +113,7 @@
#include "replication/walsender.h"
#include "storage/aio_init.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
@@ -321,6 +322,7 @@ typedef enum
* ckpt */
PM_SHUTDOWN_2, /* waiting for archiver and walsenders to
* finish */
+ PM_SHUTDOWN_IO, /* waiting for io workers to exit */
PM_WAIT_DEAD_END, /* waiting for dead_end children to exit */
PM_NO_CHILDREN, /* all important children have exited */
} PMState;
@@ -382,6 +384,10 @@ bool LoadedSSL = false;
static DNSServiceRef bonjour_sdref = NULL;
#endif
+/* State for IO worker management. */
+static int io_worker_count = 0;
+static pid_t io_worker_pids[MAX_IO_WORKERS];
+
/*
* postmaster.c - function prototypes
*/
@@ -420,6 +426,9 @@ static int CountChildren(int target);
static Backend *assign_backendlist_entry(void);
static void LaunchMissingBackgroundProcesses(void);
static void maybe_start_bgworkers(void);
+static bool maybe_reap_io_worker(int pid);
+static void maybe_adjust_io_workers(void);
+static void signal_io_workers(int signal);
static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(BackendType type);
static void StartAutovacuumWorker(void);
@@ -1334,6 +1343,11 @@ PostmasterMain(int argc, char *argv[])
*/
AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
/* Start bgwriter and checkpointer so they can help with recovery */
if (CheckpointerPID == 0)
CheckpointerPID = StartChildProcess(B_CHECKPOINTER);
@@ -1346,7 +1360,6 @@ PostmasterMain(int argc, char *argv[])
StartupPID = StartChildProcess(B_STARTUP);
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -1995,6 +2008,7 @@ process_pm_reload_request(void)
signal_child(SysLoggerPID, SIGHUP);
if (SlotSyncWorkerPID != 0)
signal_child(SlotSyncWorkerPID, SIGHUP);
+ signal_io_workers(SIGHUP);
/* Reload authentication config files too */
if (!load_hba())
@@ -2527,6 +2541,22 @@ process_pm_child_exit(void)
}
}
+ /* Was it an IO worker? */
+ if (maybe_reap_io_worker(pid))
+ {
+ if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ HandleChildCrash(pid, exitstatus, _("io worker"));
+
+ maybe_adjust_io_workers();
+
+ if (io_worker_count == 0 &&
+ pmState >= PM_SHUTDOWN_IO)
+ {
+ pmState = PM_WAIT_DEAD_END;
+ }
+ continue;
+ }
+
/*
* We don't know anything about this child process. That's highly
* unexpected, as we do track all the child processes that we fork.
@@ -2764,6 +2794,9 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
if (SlotSyncWorkerPID != 0)
sigquit_child(SlotSyncWorkerPID);
+ /* Take care of io workers too */
+ signal_io_workers(SIGQUIT);
+
/* We do NOT restart the syslogger */
}
@@ -2987,10 +3020,11 @@ PostmasterStateMachine(void)
FatalError = true;
pmState = PM_WAIT_DEAD_END;
- /* Kill the walsenders and archiver too */
+ /* Kill walsenders, archiver and aio workers too */
SignalChildren(SIGQUIT);
if (PgArchPID != 0)
signal_child(PgArchPID, SIGQUIT);
+ signal_io_workers(SIGQUIT);
}
}
}
@@ -3000,16 +3034,26 @@ PostmasterStateMachine(void)
{
/*
* PM_SHUTDOWN_2 state ends when there's no other children than
- * dead_end children left. There shouldn't be any regular backends
- * left by now anyway; what we're really waiting for is walsenders and
- * archiver.
+ * dead_end children and aio workers left. There shouldn't be any
+ * regular backends left by now anyway; what we're really waiting for
+ * is walsenders and archiver.
*/
if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0)
{
- pmState = PM_WAIT_DEAD_END;
+ pmState = PM_SHUTDOWN_IO;
+ signal_io_workers(SIGUSR2);
}
}
+ if (pmState == PM_SHUTDOWN_IO)
+ {
+ /*
+ * PM_SHUTDOWN_IO state ends when only dead_end children are left.
+ */
+ if (io_worker_count == 0)
+ pmState = PM_WAIT_DEAD_END;
+ }
+
if (pmState == PM_WAIT_DEAD_END)
{
/* Don't allow any new socket connection events. */
@@ -3017,17 +3061,22 @@ PostmasterStateMachine(void)
/*
* PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
- * (ie, no dead_end children remain), and the archiver is gone too.
+ * (ie, no dead_end children remain), and the archiver and aio workers
+ * are all gone too.
*
- * The reason we wait for those two is to protect them against a new
+ * We need to wait for those because we might have transitioned
+ * directly to PM_WAIT_DEAD_END due to immediate shutdown or fatal
+ * error. Note that they have already been sent appropriate shutdown
+ * signals, either during a normal state transition leading up to
+ * PM_WAIT_DEAD_END, or during FatalError processing.
+ *
+ * The reason we wait for those is to protect them against a new
* postmaster starting conflicting subprocesses; this isn't an
* ironclad protection, but it at least helps in the
- * shutdown-and-immediately-restart scenario. Note that they have
- * already been sent appropriate shutdown signals, either during a
- * normal state transition leading up to PM_WAIT_DEAD_END, or during
- * FatalError processing.
+ * shutdown-and-immediately-restart scenario.
*/
- if (dlist_is_empty(&BackendList) && PgArchPID == 0)
+ if (dlist_is_empty(&BackendList) && io_worker_count == 0
+ && PgArchPID == 0)
{
/* These other guys should be dead already */
Assert(StartupPID == 0);
@@ -3120,10 +3169,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
StartupPID = StartChildProcess(B_STARTUP);
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* crash recovery started, reset SIGKILL flag */
AbortStartTime = 0;
@@ -3375,6 +3428,7 @@ TerminateChildren(int signal)
signal_child(PgArchPID, signal);
if (SlotSyncWorkerPID != 0)
signal_child(SlotSyncWorkerPID, signal);
+ signal_io_workers(signal);
}
/*
@@ -3956,6 +4010,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
{
case PM_NO_CHILDREN:
case PM_WAIT_DEAD_END:
+ case PM_SHUTDOWN_IO:
case PM_SHUTDOWN_2:
case PM_SHUTDOWN:
case PM_WAIT_BACKENDS:
@@ -4149,6 +4204,109 @@ maybe_start_bgworkers(void)
}
}
+static bool
+maybe_reap_io_worker(int pid)
+{
+ for (int id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_pids[id] == pid)
+ {
+ --io_worker_count;
+ io_worker_pids[id] = 0;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void
+maybe_adjust_io_workers(void)
+{
+ /* ATODO: This will need to check if io_method == worker */
+
+ /*
+ * If we're in the final shutdown state, we're just waiting for all
+ * processes to exit.
+ */
+ if (pmState >= PM_SHUTDOWN_IO)
+ return;
+
+ /* Don't start new workers during an immediate shutdown either. */
+ if (Shutdown >= ImmediateShutdown)
+ return;
+
+ /*
+ * Don't start new workers if we're in the shutdown phase of a crash
+ * restart. But we *do* need to start if we're already starting up again.
+ */
+ if (FatalError && pmState >= PM_STOP_BACKENDS)
+ return;
+
+ /* Not enough running? */
+ while (io_worker_count < io_workers)
+ {
+ int pid;
+ int id;
+
+ /* Find the lowest unused IO worker ID. */
+
+ /*
+ * AFIXME: This logic doesn't work right now, the ids aren't
+ * transported to workers anymore.
+ */
+ for (id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_pids[id] == 0)
+ break;
+ }
+ if (id == MAX_IO_WORKERS)
+ elog(ERROR, "could not find a free IO worker ID");
+
+ Assert(pmState < PM_SHUTDOWN_IO);
+
+ /* Try to launch one. */
+ pid = StartChildProcess(B_IO_WORKER);
+ if (pid > 0)
+ {
+ io_worker_pids[id] = pid;
+ ++io_worker_count;
+ }
+ else
+ break; /* XXX try again soon? */
+ }
+
+ /* Too many running? */
+ if (io_worker_count > io_workers)
+ {
+ /* Ask the highest used IO worker ID to exit. */
+ for (int id = MAX_IO_WORKERS - 1; id >= 0; --id)
+ {
+ if (io_worker_pids[id] != 0)
+ {
+ kill(io_worker_pids[id], SIGUSR2);
+ break;
+ }
+ }
+ }
+}
+
+static void
+signal_io_workers(int signal)
+{
+ for (int i = 0; i < MAX_IO_WORKERS; ++i)
+ if (io_worker_pids[i] != 0)
+ signal_child(io_worker_pids[i], signal);
+}
+
+void
+assign_io_workers(int newval, void *extra)
+{
+ io_workers = newval;
+ if (!IsUnderPostmaster && pmState > PM_INIT)
+ maybe_adjust_io_workers();
+}
+
+
/*
* When a backend asks to be notified about worker state changes, we
* set a flag in its backend entry. The background worker machinery needs
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index b253278f3c1..fa2a7e9e5df 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -14,6 +14,7 @@ OBJS = \
aio_io.o \
aio_subject.o \
method_sync.o \
+ method_worker.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8339d473aae..62738ce1d14 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -6,5 +6,6 @@ backend_sources += files(
'aio_io.c',
'aio_subject.c',
'method_sync.c',
+ 'method_worker.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
new file mode 100644
index 00000000000..5df2eea4a03
--- /dev/null
+++ b/src/backend/storage/aio/method_worker.c
@@ -0,0 +1,84 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_worker.c
+ * AIO implementation using workers
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_worker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/interrupt.h"
+#include "storage/io_worker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "tcop/tcopprot.h"
+#include "utils/wait_event.h"
+
+
+int io_workers = 3;
+
+
+void
+IoWorkerMain(char *startup_data, size_t startup_data_len)
+{
+ sigjmp_buf local_sigjmp_buf;
+
+ MyBackendType = B_IO_WORKER;
+
+ /* TODO review all signals */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, die); /* to allow manually triggering worker restart */
+
+ /*
+ * Ignore SIGTERM, will get explicit shutdown via SIGUSR2 later in the
+ * shutdown sequence, similar to checkpointer.
+ */
+ pqsignal(SIGTERM, SIG_IGN);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* see PostgresMain() */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ error_context_stack = NULL;
+ HOLD_INTERRUPTS();
+
+ /*
+ * We normally shouldn't get errors here. Need to do just enough error
+ * recovery so that we can mark the IO as failed and then exit.
+ */
+ LWLockReleaseAll();
+
+ /* TODO: recover from IO errors */
+
+ EmitErrorReport();
+ proc_exit(1);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ while (!ShutdownRequestPending)
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ proc_exit(0);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 4dc46b17b41..d42546db195 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3294,6 +3294,8 @@ ProcessInterrupts(void)
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating background worker \"%s\" due to administrator command",
MyBgworkerEntry->bgw_type)));
+ else if (AmIoWorkerProcess())
+ proc_exit(0);
else
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 8af55989eed..a750caa9b2a 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -335,6 +335,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
{
case B_INVALID:
case B_ARCHIVER:
+ case B_IO_WORKER:
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 99ec8321746..ecc513aa7bd 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN "Waiting in main loop of autovacuum launcher process."
BGWRITER_HIBERNATE "Waiting in background writer process, hibernating."
BGWRITER_MAIN "Waiting in main loop of background writer process."
CHECKPOINTER_MAIN "Waiting in main loop of checkpointer process."
+IO_WORKER_MAIN "Waiting in main loop of IO worker process."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index b8fa2e64ffe..bedeed588d3 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_IO_WORKER:
+ backendDesc = "io worker";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e5886f3b0e9..40737882fb4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -74,6 +74,7 @@
#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
+#include "storage/io_worker.h"
#include "storage/large_object.h"
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
@@ -3226,6 +3227,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_workers",
+ PGC_SIGHUP,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO worker processes, for io_method=worker."),
+ NULL,
+ },
+ &io_workers,
+ 3, 1, MAX_IO_WORKERS,
+ NULL, assign_io_workers, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ed746b8a533..8c062240373 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -840,6 +840,7 @@
#------------------------------------------------------------------------------
#io_method = sync # (change requires restart)
+#io_workers = 3 # 1-32
#io_max_concurrency = 32 # Max number of IOs that may be in
# flight at the same time in one backend
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0009-aio-Add-worker-method.patch (text/x-diff)
From bc2016ad468094ccc09507d3ddd755f5c7692d4b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 15:27:00 -0400
Subject: [PATCH v2.1 09/20] aio: Add worker method
---
src/include/storage/aio.h | 5 +-
src/include/storage/aio_internal.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/backend/postmaster/postmaster.c | 3 +-
src/backend/storage/aio/aio.c | 2 +
src/backend/storage/aio/aio_init.c | 15 +
src/backend/storage/aio/method_worker.c | 404 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/tools/pgindent/typedefs.list | 3 +
10 files changed, 428 insertions(+), 9 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index c0a59f47bc0..1e4c8807c71 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -332,11 +332,12 @@ extern void assign_io_method(int newval, void *extra);
typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
+ IOMETHOD_WORKER,
} IoMethod;
-/* We'll default to synchronous execution. */
-#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+/* We'll default to the worker method. */
+#define DEFAULT_IO_METHOD IOMETHOD_WORKER
/* GUCs */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 82bce1cf27c..b6f44a875dd 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -264,6 +264,7 @@ extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
+extern const IoMethodOps pgaio_worker_ops;
extern const IoMethodOps *pgaio_impl;
extern PgAioCtl *aio_ctl;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 88dc79b2bd6..7aaccf69d1e 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, WaitLSN)
+PG_LWLOCK(54, AioWorkerSubmissionQueue)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3d970374733..76440321d18 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4222,7 +4222,8 @@ maybe_reap_io_worker(int pid)
static void
maybe_adjust_io_workers(void)
{
- /* ATODO: This will need to check if io_method == worker */
+ if (!pgaio_workers_enabled())
+ return;
/*
* If we're in the final shutdown state, we're just waiting for all
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index b5370330620..0ca641d9322 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -36,6 +36,7 @@ static void pgaio_bounce_buffer_wait_for_free(void);
/* Options for io_method. */
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
+ {"worker", IOMETHOD_WORKER, false},
{NULL, 0, false}
};
@@ -53,6 +54,7 @@ PgAioPerBackend *my_aio;
static const IoMethodOps *pgaio_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
+ [IOMETHOD_WORKER] = &pgaio_worker_ops,
};
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index e25bdf1dba0..ca3513019a6 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -19,6 +19,7 @@
#include "storage/aio_init.h"
#include "storage/aio_internal.h"
#include "storage/bufmgr.h"
+#include "storage/io_worker.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -37,6 +38,11 @@ AioCtlShmemSize(void)
static uint32
AioProcs(void)
{
+ /*
+ * While AIO workers don't need their own AIO context, we can't currently
+ * guarantee that nothing gets assigned to a ProcNumber for an IO worker if
+ * we just subtracted MAX_IO_WORKERS.
+ */
return MaxBackends + NUM_AUXILIARY_PROCS;
}
@@ -333,6 +339,9 @@ pgaio_postmaster_child_init(void)
/* shouldn't be initialized twice */
Assert(!my_aio);
+ if (MyBackendType == B_IO_WORKER)
+ return;
+
if (MyProc == NULL || MyProcNumber >= AioProcs())
elog(ERROR, "aio requires a normal PGPROC");
@@ -348,3 +357,9 @@ pgaio_postmaster_child_init_local(void)
if (pgaio_impl->postmaster_child_init_local)
pgaio_impl->postmaster_child_init_local();
}
+
+bool
+pgaio_workers_enabled(void)
+{
+ return io_method == IOMETHOD_WORKER;
+}
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 5df2eea4a03..a6c21df2ea5 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -3,6 +3,21 @@
* method_worker.c
* AIO implementation using workers
*
+ * Worker processes consume IOs from a shared memory submission queue, run
+ * traditional synchronous system calls, and perform the shared completion
+ * handling immediately. Client code submits most requests by pushing IOs
+ * into the submission queue, and waits (if necessary) using condition
+ * variables. Some IOs cannot be performed in another process due to lack of
+ * infrastructure for reopening the file, and must be processed synchronously by
+ * the client code when submitted.
+ *
+ * So that the submitter can make just one system call when submitting a batch
+ * of IOs, wakeups "fan out"; each woken worker can wake two more. XXX This
+ * could be improved by using futexes instead of latches to wake N waiters.
+ *
+ * This method of AIO is available in all builds on all operating systems, and
+ * is the default.
+ *
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -16,24 +31,290 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "postmaster/auxprocess.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "tcop/tcopprot.h"
#include "utils/wait_event.h"
+#include "utils/ps_status.h"
+
+
+/* How many workers should each worker wake up if needed? */
+#define IO_WORKER_WAKEUP_FANOUT 2
+
+
+typedef struct AioWorkerSubmissionQueue
+{
+ uint32 size;
+ uint32 mask;
+ uint32 head;
+ uint32 tail;
+ uint32 ios[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerSubmissionQueue;
+
+typedef struct AioWorkerSlot
+{
+ Latch *latch;
+ bool in_use;
+} AioWorkerSlot;
+
+typedef struct AioWorkerControl
+{
+ uint64 idle_worker_mask;
+ AioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerControl;
+
+
+static size_t pgaio_worker_shmem_size(void);
+static void pgaio_worker_shmem_init(bool first_time);
+static void pgaio_worker_postmaster_child_init_local(void);
+
+static bool pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_worker_ops = {
+ .shmem_size = pgaio_worker_shmem_size,
+ .shmem_init = pgaio_worker_shmem_init,
+ .postmaster_child_init_local = pgaio_worker_postmaster_child_init_local,
+
+ .needs_synchronous_execution = pgaio_worker_needs_synchronous_execution,
+ .submit = pgaio_worker_submit,
+#if 0
+ .wait_one = pgaio_worker_wait_one,
+ .retry = pgaio_worker_io_retry,
+ .drain = pgaio_worker_drain,
+#endif
+
+ .can_scatter_gather_direct = true,
+ .can_scatter_gather_buffered = true
+};
int io_workers = 3;
+static int io_worker_queue_size = 64;
+static int MyIoWorkerId;
+
+
+static AioWorkerSubmissionQueue *io_worker_submission_queue;
+static AioWorkerControl *io_worker_control;
+
+
+static size_t
+pgaio_worker_shmem_size(void)
+{
+ return
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * io_worker_queue_size +
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers;
+}
+
+static void
+pgaio_worker_shmem_init(bool first_time)
+{
+ bool found;
+ int size;
+
+ /* Round size up to next power of two so we can make a mask. */
+ size = pg_nextpower2_32(io_worker_queue_size);
+
+ io_worker_submission_queue =
+ ShmemInitStruct("AioWorkerSubmissionQueue",
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * size,
+ &found);
+ if (!found)
+ {
+ io_worker_submission_queue->size = size;
+ io_worker_submission_queue->head = 0;
+ io_worker_submission_queue->tail = 0;
+ }
+
+ io_worker_control =
+ ShmemInitStruct("AioWorkerControl",
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers,
+ &found);
+ if (!found)
+ {
+ io_worker_control->idle_worker_mask = 0;
+ for (int i = 0; i < io_workers; ++i)
+ {
+ io_worker_control->workers[i].latch = NULL;
+ io_worker_control->workers[i].in_use = false;
+ }
+ }
+}
+
+static void
+pgaio_worker_postmaster_child_init_local(void)
+{
+}
+
+
+static int
+pgaio_choose_idle_worker(void)
+{
+ int worker;
+
+ if (io_worker_control->idle_worker_mask == 0)
+ return -1;
+
+ /* Find the lowest bit position, and clear it. */
+ worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
+
+ return worker;
+}
+
+static bool
+pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 new_head;
+
+ queue = io_worker_submission_queue;
+ new_head = (queue->head + 1) & (queue->size - 1);
+ if (new_head == queue->tail)
+ {
+ elog(DEBUG1, "io worker submission queue full");
+ return false; /* full */
+ }
+
+ queue->ios[queue->head] = pgaio_io_get_id(ioh);
+ queue->head = new_head;
+
+ return true;
+}
+
+static uint32
+pgaio_worker_submission_queue_consume(void)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 result;
+
+ queue = io_worker_submission_queue;
+ if (queue->tail == queue->head)
+ return UINT32_MAX; /* empty */
+
+ result = queue->ios[queue->tail];
+ queue->tail = (queue->tail + 1) & (queue->size - 1);
+
+ return result;
+}
+
+static uint32
+pgaio_worker_submission_queue_depth(void)
+{
+ uint32 head;
+ uint32 tail;
+
+ head = io_worker_submission_queue->head;
+ tail = io_worker_submission_queue->tail;
+
+ if (tail > head)
+ head += io_worker_submission_queue->size;
+
+ Assert(head >= tail);
+
+ return head - tail;
+}
+
+static void
+pgaio_worker_submit_internal(int nios, PgAioHandle *ios[])
+{
+ PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ int nsync = 0;
+ Latch *wakeup = NULL;
+ int worker;
+
+ Assert(nios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ for (int i = 0; i < nios; ++i)
+ {
+ Assert(!pgaio_worker_needs_synchronous_execution(ios[i]));
+ if (!pgaio_worker_submission_queue_insert(ios[i]))
+ {
+ /*
+ * We'll do it synchronously, but only after we've sent as many as
+ * we can to workers, to maximize concurrency.
+ */
+ synchronous_ios[nsync++] = ios[i];
+ continue;
+ }
+
+ if (wakeup == NULL)
+ {
+ /* Choose an idle worker to wake up if we haven't already. */
+ worker = pgaio_choose_idle_worker();
+ if (worker >= 0)
+ wakeup = io_worker_control->workers[worker].latch;
+
+ ereport(DEBUG3,
+ errmsg("submission for io:%d choosing worker %d, latch %p",
+ pgaio_io_get_id(ios[i]), worker, wakeup),
+ errhidestmt(true), errhidecontext(true));
+ }
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ if (wakeup)
+ SetLatch(wakeup);
+
+ /* Run whatever is left synchronously. */
+ if (nsync > 0)
+ {
+ for (int i = 0; i < nsync; ++i)
+ {
+ pgaio_io_perform_synchronously(synchronous_ios[i]);
+ }
+ }
+}
+
+static bool
+pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return
+ !IsUnderPostmaster
+ || ioh->flags & AHF_REFERENCES_LOCAL
+ || !pgaio_io_can_reopen(ioh);
+}
+
+static int
+pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+
+ pgaio_io_prepare_submit(ioh);
+ }
+
+ pgaio_worker_submit_internal(num_staged_ios, staged_ios);
+
+ return num_staged_ios;
+}
void
IoWorkerMain(char *startup_data, size_t startup_data_len)
{
sigjmp_buf local_sigjmp_buf;
+ volatile PgAioHandle *ioh = NULL;
+ char cmd[128];
MyBackendType = B_IO_WORKER;
+ AuxiliaryProcessMainCommon();
/* TODO review all signals */
pqsignal(SIGHUP, SignalHandlerForConfigReload);
@@ -49,7 +330,34 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGUSR1, procsignal_sigusr1_handler);
pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
- sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* FIXME: locking */
+ MyIoWorkerId = -1;
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+
+ for (int i = 0; i < io_workers; ++i)
+ {
+ if (!io_worker_control->workers[i].in_use)
+ {
+ Assert(io_worker_control->workers[i].latch == NULL);
+ io_worker_control->workers[i].in_use = true;
+ MyIoWorkerId = i;
+ break;
+ }
+ else
+ Assert(io_worker_control->workers[i].latch != NULL);
+ }
+
+ if (MyIoWorkerId == -1)
+ elog(ERROR, "could not find a free IO worker slot");
+
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ snprintf(cmd, sizeof(cmd), "worker: %d", MyIoWorkerId);
+ set_ps_display(cmd);
/* see PostgresMain() */
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
@@ -64,21 +372,107 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
LWLockReleaseAll();
/* TODO: recover from IO errors */
+ if (ioh != NULL)
+ {
+#if 0
+ /* EINTR is treated as a retryable error */
+ pgaio_process_io_completion(unvolatize(PgAioHandle *, ioh),
+ EINTR);
+#endif
+ }
EmitErrorReport();
+
+ /* FIXME: should probably be a before-shmem-exit instead */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
proc_exit(1);
}
/* We can now handle ereport(ERROR) */
PG_exception_stack = &local_sigjmp_buf;
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
while (!ShutdownRequestPending)
{
- WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
- WAIT_EVENT_IO_WORKER_MAIN);
- ResetLatch(MyLatch);
- CHECK_FOR_INTERRUPTS();
+ uint32 io_index;
+ Latch *latches[IO_WORKER_WAKEUP_FANOUT];
+ int nlatches = 0;
+ int nwakeups = 0;
+ int worker;
+
+ /* Try to get a job to do. */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
+ {
+ /* Nothing to do. Mark self idle. */
+ /*
+ * XXX: Invent some kind of back pressure to reduce useless
+ * wakeups?
+ */
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ }
+ else
+ {
+ /* Got one. Clear idle flag. */
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+
+ /* See if we can wake up some peers. */
+ nwakeups = Min(pgaio_worker_submission_queue_depth(),
+ IO_WORKER_WAKEUP_FANOUT);
+ for (int i = 0; i < nwakeups; ++i)
+ {
+ if ((worker = pgaio_choose_idle_worker()) < 0)
+ break;
+ latches[nlatches++] = io_worker_control->workers[worker].latch;
+ }
+#if 0
+ if (nwakeups > 0)
+ elog(LOG, "wake %d", nwakeups);
+#endif
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ for (int i = 0; i < nlatches; ++i)
+ SetLatch(latches[i]);
+
+ if (io_index != UINT32_MAX)
+ {
+ ioh = &aio_ctl->io_handles[io_index];
+
+ ereport(DEBUG3,
+ errmsg("worker processing io:%d",
+ pgaio_io_get_id(unvolatize(PgAioHandle *, ioh))),
+ errhidestmt(true), errhidecontext(true));
+
+ pgaio_io_reopen(unvolatize(PgAioHandle *, ioh));
+ pgaio_io_perform_synchronously(unvolatize(PgAioHandle *, ioh));
+
+ ioh = NULL;
+ }
+ else
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
}
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
proc_exit(0);
}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index ecc513aa7bd..3678f2b3e43 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -351,6 +351,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
WaitLSN "Waiting to read or update shared Wait-for-LSN state."
+AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8c062240373..1fc8336496c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -839,7 +839,7 @@
# WIP AIO GUC docs
#------------------------------------------------------------------------------
-#io_method = sync # (change requires restart)
+#io_method = worker # (change requires restart)
#io_workers = 3 # 1-32
#io_max_concurrency = 32 # Max number of IOs that may be in
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2f463d29ca1..f1cac7aa5bf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -54,6 +54,9 @@ AggStrategy
AggTransInfo
Aggref
AggregateInstrumentation
+AioWorkerControl
+AioWorkerSlot
+AioWorkerSubmissionQueue
AlenState
Alias
AllocBlock
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0010-aio-Add-liburing-dependency.patch (text/x-diff)
From 8cacec347f18d4d6390648928769cd084f57b77f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 5 Jun 2024 19:37:25 -0700
Subject: [PATCH v2.1 10/20] aio: Add liburing dependency
Not yet used.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/pg_config.h.in | 3 +
src/makefiles/meson.build | 3 +
configure | 138 +++++++++++++++++++++++++++++++++++++
configure.ac | 11 +++
meson.build | 14 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 4 ++
7 files changed, 176 insertions(+)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 38006367a40..7d2fcb9d0f5 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -693,6 +693,9 @@
/* Define to 1 to build with LDAP support. (--with-ldap) */
#undef USE_LDAP
+/* Define to build with io-uring support. (--with-liburing) */
+#undef USE_LIBURING
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 850e9275845..cca689b2028 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'gssapi': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/configure b/configure
index 53c8a1f2bad..aa82bafe783 100755
--- a/configure
+++ b/configure
@@ -654,6 +654,8 @@ LIBOBJS
OPENSSL
ZSTD
LZ4
+LIBURING_LIBS
+LIBURING_CFLAGS
UUID_LIBS
LDAP_LIBS_BE
LDAP_LIBS_FE
@@ -712,6 +714,7 @@ XML2_CFLAGS
XML2_CONFIG
with_libxml
with_uuid
+with_liburing
with_readline
with_systemd
with_selinux
@@ -865,6 +868,7 @@ with_selinux
with_systemd
with_readline
with_libedit_preferred
+with_liburing
with_uuid
with_ossp_uuid
with_libxml
@@ -907,6 +911,8 @@ LDFLAGS_EX
LDFLAGS_SL
PERL
PYTHON
+LIBURING_CFLAGS
+LIBURING_LIBS
MSGFMT
TCLSH'
@@ -1574,6 +1580,7 @@ Optional Packages:
--without-readline do not use GNU Readline nor BSD Libedit for editing
--with-libedit-preferred
prefer BSD Libedit over GNU Readline
+ --with-liburing use liburing for async io
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libxml build with XML support
@@ -1617,6 +1624,10 @@ Some influential environment variables:
LDFLAGS_SL extra linker flags for linking shared libraries only
PERL Perl program
PYTHON Python program
+ LIBURING_CFLAGS
+ C compiler flags for LIBURING, overriding pkg-config
+ LIBURING_LIBS
+ linker flags for LIBURING, overriding pkg-config
MSGFMT msgfmt program for NLS
TCLSH Tcl interpreter program (tclsh)
@@ -8664,6 +8675,40 @@ fi
+#
+# liburing
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with liburing support" >&5
+$as_echo_n "checking whether to build with liburing support... " >&6; }
+
+
+
+# Check whether --with-liburing was given.
+if test "${with_liburing+set}" = set; then :
+ withval=$with_liburing;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBURING 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-liburing option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_liburing=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_liburing" >&5
+$as_echo "$with_liburing" >&6; }
+
#
# UUID library
@@ -13209,6 +13254,99 @@ fi
fi
+if test "$with_liburing" = yes; then
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for liburing" >&5
+$as_echo_n "checking for liburing... " >&6; }
+
+if test -n "$LIBURING_CFLAGS"; then
+ pkg_cv_LIBURING_CFLAGS="$LIBURING_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_CFLAGS=`$PKG_CONFIG --cflags "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBURING_LIBS"; then
+ pkg_cv_LIBURING_LIBS="$LIBURING_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_LIBS=`$PKG_CONFIG --libs "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "liburing" 2>&1`
+ else
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "liburing" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBURING_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (liburing) were not met:
+
+$LIBURING_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBURING_CFLAGS=$pkg_cv_LIBURING_CFLAGS
+ LIBURING_LIBS=$pkg_cv_LIBURING_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
##
## Header files
diff --git a/configure.ac b/configure.ac
index 6a35b2880bf..04480cdea0a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -970,6 +970,14 @@ AC_SUBST(with_readline)
PGAC_ARG_BOOL(with, libedit-preferred, no,
[prefer BSD Libedit over GNU Readline])
+#
+# liburing
+#
+AC_MSG_CHECKING([whether to build with liburing support])
+PGAC_ARG_BOOL(with, liburing, no, [use liburing for async io],
+ [AC_DEFINE([USE_LIBURING], 1, [Define to build with io-uring support. (--with-liburing)])])
+AC_MSG_RESULT([$with_liburing])
+AC_SUBST(with_liburing)
#
# UUID library
@@ -1426,6 +1434,9 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+if test "$with_liburing" = yes; then
+ PKG_CHECK_MODULES(LIBURING, liburing)
+fi
##
## Header files
diff --git a/meson.build b/meson.build
index 4764b09266e..53266e04005 100644
--- a/meson.build
+++ b/meson.build
@@ -848,6 +848,18 @@ endif
+###############################################################
+# Library: liburing
+###############################################################
+
+liburingopt = get_option('liburing')
+liburing = dependency('liburing', required: liburingopt)
+if liburing.found()
+ cdata.set('USE_LIBURING', 1)
+endif
+
+
+
###############################################################
# Library: libxml
###############################################################
@@ -3094,6 +3106,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ liburing,
libxml,
lz4,
pam,
@@ -3738,6 +3751,7 @@ if meson.version().version_compare('>=0.57')
'gss': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index b9421557606..084eebe72d7 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -103,6 +103,9 @@ option('ldap', type: 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('liburing', type : 'feature', value: 'auto',
+ description: 'Use liburing for async io')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 42f50b49761..a8ff18faed6 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -190,6 +190,7 @@ with_systemd = @with_systemd@
with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
+with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
@@ -216,6 +217,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBURING_CFLAGS = @LIBURING_CFLAGS@
+LIBURING_LIBS = @LIBURING_LIBS@
+
TCLSH = @TCLSH@
TCL_LIBS = @TCL_LIBS@
TCL_LIB_SPEC = @TCL_LIB_SPEC@
--
2.45.2.827.g557ae147e6
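For reference, the build-system patch above exposes the new dependency three ways: `--with-liburing` for autoconf, a `liburing` feature option for meson, and the `LIBURING_CFLAGS` / `LIBURING_LIBS` override variables. A small sketch of how one might enable it, and of the override precedence implemented by the generated configure script (the install prefix below is made up):

```shell
# Enabling the dependency (commands as added by the patch):
#   autoconf: ./configure --with-liburing
#   meson:    meson setup build -Dliburing=enabled
#
# As in the generated configure script above, an explicitly set
# LIBURING_CFLAGS (or LIBURING_LIBS) wins over asking pkg-config:
LIBURING_CFLAGS="-I/opt/liburing/include"   # hypothetical prefix
if [ -n "$LIBURING_CFLAGS" ]; then
    cflags=$LIBURING_CFLAGS
else
    cflags=$(pkg-config --cflags liburing) || exit 1
fi
echo "$cflags"
```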
Attachment: v2.1-0011-aio-Add-io_uring-method.patch (text/x-diff)
From b50761d00455ef1fd0a0c9625624866c60a7333f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 16:15:17 -0400
Subject: [PATCH v2.1 11/20] aio: Add io_uring method
---
src/include/storage/aio.h | 1 +
src/include/storage/aio_internal.h | 3 +
src/include/storage/lwlock.h | 1 +
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio.c | 6 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_io_uring.c | 383 ++++++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 1 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 398 insertions(+)
create mode 100644 src/backend/storage/aio/method_io_uring.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 1e4c8807c71..b8c743548c9 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -333,6 +333,7 @@ typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
IOMETHOD_WORKER,
+ IOMETHOD_IO_URING,
} IoMethod;
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index b6f44a875dd..5d18d112e2d 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -265,6 +265,9 @@ extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
extern const IoMethodOps pgaio_worker_ops;
+#ifdef USE_LIBURING
+extern const IoMethodOps pgaio_uring_ops;
+#endif
extern const IoMethodOps *pgaio_impl;
extern PgAioCtl *aio_ctl;
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index eabf813ce05..72f928b7602 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -217,6 +217,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SUBTRANS_SLRU,
LWTRANCHE_XACT_SLRU,
LWTRANCHE_PARALLEL_VACUUM_DSA,
+ LWTRANCHE_AIO_URING_COMPLETION,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index fa2a7e9e5df..3bcb8a0b2ed 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -13,6 +13,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_subject.o \
+ method_io_uring.o \
method_sync.o \
method_worker.o \
read_stream.o
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 0ca641d9322..8877a33b9f2 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -37,6 +37,9 @@ static void pgaio_bounce_buffer_wait_for_free(void);
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
{"worker", IOMETHOD_WORKER, false},
+#ifdef USE_LIBURING
+ {"io_uring", IOMETHOD_IO_URING, false},
+#endif
{NULL, 0, false}
};
@@ -55,6 +58,9 @@ PgAioPerBackend *my_aio;
static const IoMethodOps *pgaio_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
[IOMETHOD_WORKER] = &pgaio_worker_ops,
+#ifdef USE_LIBURING
+ [IOMETHOD_IO_URING] = &pgaio_uring_ops,
+#endif
};
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 62738ce1d14..537f23d446d 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_subject.c',
+ 'method_io_uring.c',
'method_sync.c',
'method_worker.c',
'read_stream.c',
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
new file mode 100644
index 00000000000..0f0eda0ce9b
--- /dev/null
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -0,0 +1,383 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_io_uring.c
+ * AIO implementation using io_uring on Linux
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_io_uring.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#ifdef USE_LIBURING
+
+#include <liburing.h>
+
+#include "pgstat.h"
+#include "port/pg_iovec.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+
+/* Entry points for IoMethodOps. */
+static size_t pgaio_uring_shmem_size(void);
+static void pgaio_uring_postmaster_init(void);
+static void pgaio_uring_shmem_init(bool first_time);
+static void pgaio_uring_postmaster_child_init(void);
+static void pgaio_uring_postmaster_child_init_local(void);
+
+static int pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+static void pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation);
+
+static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe);
+
+
+const IoMethodOps pgaio_uring_ops = {
+ .shmem_size = pgaio_uring_shmem_size,
+ .shmem_init = pgaio_uring_shmem_init,
+ .postmaster_init = pgaio_uring_postmaster_init,
+ .postmaster_child_init = pgaio_uring_postmaster_child_init,
+ .postmaster_child_init_local = pgaio_uring_postmaster_child_init_local,
+
+ .submit = pgaio_uring_submit,
+ .wait_one = pgaio_uring_wait_one,
+#if 0
+ .retry = pgaio_uring_io_retry,
+ .wait_one = pgaio_uring_wait_one,
+ .drain = pgaio_uring_drain,
+#endif
+ .can_scatter_gather_direct = true,
+ .can_scatter_gather_buffered = true
+};
+
+typedef struct PgAioUringContext
+{
+ LWLock completion_lock;
+
+ struct io_uring io_uring_ring;
+ /* XXX: probably worth padding to a cacheline boundary here */
+} PgAioUringContext;
+
+
+static PgAioUringContext *aio_uring_contexts;
+static PgAioUringContext *my_shared_uring_context;
+
+/* io_uring local state */
+static struct io_uring local_ring;
+
+
+
+static Size
+AioContextShmemSize(void)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ return mul_size(TotalProcs, sizeof(PgAioUringContext));
+}
+
+static size_t
+pgaio_uring_shmem_size(void)
+{
+ return AioContextShmemSize();
+}
+
+static void
+pgaio_uring_shmem_init(bool first_time)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+ bool found;
+
+ aio_uring_contexts = (PgAioUringContext *)
+ ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+ if (found)
+ return;
+
+ for (int contextno = 0; contextno < TotalProcs; contextno++)
+ {
+ PgAioUringContext *context = &aio_uring_contexts[contextno];
+ int ret;
+
+ /*
+ * XXX: Probably worth sharing the WQ between the different rings,
+ * when supported by the kernel. Could also cause additional
+ * contention, I guess?
+ */
+#if 0
+ if (!AcquireExternalFD())
+ elog(ERROR, "No external FD available");
+#endif
+ ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+
+ LWLockInitialize(&context->completion_lock, LWTRANCHE_AIO_URING_COMPLETION);
+ }
+}
+
+static void
+pgaio_uring_postmaster_init(void)
+{
+ uint32 TotalProcs =
+ MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ for (int i = 0; i < TotalProcs; i++)
+ ReserveExternalFD();
+}
+
+static void
+pgaio_uring_postmaster_child_init(void)
+{
+ my_shared_uring_context = &aio_uring_contexts[MyProcNumber];
+}
+
+static void
+pgaio_uring_postmaster_child_init_local(void)
+{
+ int ret;
+
+ ret = io_uring_queue_init(32, &local_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+}
+
+static int
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
+
+ Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(uring_instance);
+
+ pgaio_io_prepare_submit(ioh);
+ pgaio_uring_sq_from_io(ioh, sqe);
+ }
+
+ while (true)
+ {
+ int ret;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+ ret = io_uring_submit(uring_instance);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+ continue;
+ }
+ if (ret < 0)
+ elog(PANIC, "failed: %d/%s",
+ ret, strerror(-ret));
+ else if (ret != num_staged_ios)
+ {
+ /* likely unreachable, but if it is, we would need to re-submit */
+ elog(PANIC, "submitted only %d of %d",
+ ret, num_staged_ios);
+ }
+ else
+ {
+ elog(DEBUG3, "submit nios: %d", num_staged_ios);
+ }
+ break;
+ }
+
+ return num_staged_ios;
+}
+
+
+#define PGAIO_MAX_LOCAL_REAPED 16
+
+static void
+pgaio_uring_drain_locked(PgAioUringContext *context)
+{
+ int ready;
+ int orig_ready;
+
+ /*
+ * Don't drain more events than available right now. Otherwise it's
+ * plausible that one backend could get stuck, for a while, receiving CQEs
+ * without actually processing them.
+ */
+ orig_ready = ready = io_uring_cq_ready(&context->io_uring_ring);
+
+ while (ready > 0)
+ {
+ struct io_uring_cqe *reaped_cqes[PGAIO_MAX_LOCAL_REAPED];
+ uint32 reaped;
+
+ START_CRIT_SECTION();
+ reaped =
+ io_uring_peek_batch_cqe(&context->io_uring_ring,
+ reaped_cqes,
+ Min(PGAIO_MAX_LOCAL_REAPED, ready));
+ Assert(reaped <= ready);
+
+ ready -= reaped;
+
+ for (int i = 0; i < reaped; i++)
+ {
+ struct io_uring_cqe *cqe = reaped_cqes[i];
+ PgAioHandle *ioh;
+
+ ioh = io_uring_cqe_get_data(cqe);
+ io_uring_cqe_seen(&context->io_uring_ring, cqe);
+
+ pgaio_io_process_completion(ioh, cqe->res);
+ }
+
+ END_CRIT_SECTION();
+
+ ereport(DEBUG3,
+ errmsg("drained %d/%d, now expecting %d",
+ reaped, orig_ready, io_uring_cq_ready(&context->io_uring_ring)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ }
+}
+
+static void
+pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation)
+{
+ PgAioHandleState state;
+ ProcNumber owner_procno = ioh->owner_procno;
+ PgAioUringContext *owner_context = &aio_uring_contexts[owner_procno];
+ bool expect_cqe;
+ int waited = 0;
+
+ /*
+ * We ought to have a smarter locking scheme; nearly all the time the
+ * backend owning the ring will reap the completions, making the locking
+ * unnecessarily expensive.
+ */
+ LWLockAcquire(&owner_context->completion_lock, LW_EXCLUSIVE);
+
+ while (true)
+ {
+ ereport(DEBUG3,
+ errmsg("wait_one for io:%d in state %s, cycle %d",
+ pgaio_io_get_id(ioh), pgaio_io_get_state_name(ioh), waited),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state) ||
+ state != AHS_IN_FLIGHT)
+ {
+ break;
+ }
+ else if (io_uring_cq_ready(&owner_context->io_uring_ring))
+ {
+ expect_cqe = true;
+ }
+ else
+ {
+ int ret;
+ struct io_uring_cqe *cqes;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_DRAIN);
+ ret = io_uring_wait_cqes(&owner_context->io_uring_ring, &cqes, 1, NULL, NULL);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ continue;
+ }
+ else if (ret != 0)
+ {
+ elog(PANIC, "unexpected: %d/%s: %m", ret, strerror(-ret));
+ }
+ else
+ {
+ Assert(cqes != NULL);
+ expect_cqe = true;
+ waited++;
+ }
+ }
+
+ if (expect_cqe)
+ {
+ pgaio_uring_drain_locked(owner_context);
+ }
+ }
+
+ LWLockRelease(&owner_context->completion_lock);
+
+ ereport(DEBUG3,
+ errmsg("wait_one with %d sleeps",
+ waited),
+ errhidestmt(true),
+ errhidecontext(true));
+}
+
+static void
+pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
+{
+ struct iovec *iov;
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READ:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.read.iov_length == 1)
+ {
+ io_uring_prep_read(sqe,
+ ioh->op_data.read.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.read.offset);
+ }
+ else
+ {
+ io_uring_prep_readv(sqe,
+ ioh->op_data.read.fd,
+ iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+
+ }
+ break;
+
+ case PGAIO_OP_WRITE:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.write.iov_length == 1)
+ {
+ io_uring_prep_write(sqe,
+ ioh->op_data.write.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.write.offset);
+ }
+ else
+ {
+ io_uring_prep_writev(sqe,
+ ioh->op_data.write.fd,
+ iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ }
+ break;
+
+ default:
+ elog(ERROR, "not implemented");
+ }
+
+ io_uring_sqe_set_data(sqe, ioh);
+}
+
+#endif /* USE_LIBURING */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a5fa77412ed..b138a36c461 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -166,6 +166,7 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
[LWTRANCHE_XACT_SLRU] = "XactSLRU",
[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+ [LWTRANCHE_AIO_URING_COMPLETION] = "AioUringCompletion",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f1cac7aa5bf..46d31cf2b9f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2116,6 +2116,7 @@ PgAioReturn
PgAioSubjectData
PgAioSubjectID
PgAioSubjectInfo
+PgAioUringContext
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0016-aio-Very-WIP-read_stream.c-adjustments-for-real.patch (text/x-diff)
From 3b51bfa51eac42157c8177437fb6993ed349c0f3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:30 -0400
Subject: [PATCH v2.1 16/20] aio: Very-WIP: read_stream.c adjustments for real
AIO
---
src/include/storage/bufmgr.h | 2 ++
src/backend/storage/aio/read_stream.c | 29 +++++++++++++++++++++------
src/backend/storage/buffer/bufmgr.c | 3 ++-
3 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a075a40b2ed..ac6496bb1eb 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -117,6 +117,8 @@ typedef struct BufferManagerRelation
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 2)
/*
* FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 7f0e07d9586..7ff2d6a2071 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -91,6 +91,7 @@
#include "catalog/pg_tablespace.h"
#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/smgr.h"
#include "storage/read_stream.h"
@@ -241,14 +242,18 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
/*
* If advice hasn't been suppressed, this system supports it, and this
* isn't a strictly sequential pattern, then we'll issue advice.
+ *
+ * XXX: Used to also check stream->pending_read_blocknum !=
+ * stream->seq_blocknum
*/
if (!suppress_advice &&
- stream->advice_enabled &&
- stream->pending_read_blocknum != stream->seq_blocknum)
+ stream->advice_enabled)
flags = READ_BUFFERS_ISSUE_ADVICE;
else
flags = 0;
+ flags |= READ_BUFFERS_MORE_MORE_MORE;
+
/* We say how many blocks we want to read, but may be smaller on return. */
buffer_index = stream->next_buffer_index;
io_index = stream->next_io_index;
@@ -307,6 +312,14 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
static void
read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
+ if (stream->distance > (io_combine_limit * 8))
+ {
+ if (stream->pinned_buffers + stream->pending_read_nblocks > ((stream->distance * 3) / 4))
+ {
+ return;
+ }
+ }
+
while (stream->ios_in_progress < stream->max_ios &&
stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
{
@@ -356,6 +369,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
/* And we've hit the limit. Rewind, and stop here. */
read_stream_unget_block(stream, blocknum);
+ pgaio_submit_staged();
return;
}
}
@@ -380,6 +394,8 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
stream->distance == 0) &&
stream->ios_in_progress < stream->max_ios)
read_stream_start_pending_read(stream, suppress_advice);
+
+ pgaio_submit_staged();
}
/*
@@ -494,10 +510,11 @@ read_stream_begin_impl(int flags,
* direct I/O isn't enabled, the caller hasn't promised sequential access
* (overriding our detection heuristics), and max_ios hasn't been set to
* zero.
+ *
+ * FIXME: Used to also check (io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ * (flags & READ_STREAM_SEQUENTIAL) == 0
*/
- if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
- (flags & READ_STREAM_SEQUENTIAL) == 0 &&
- max_ios > 0)
+ if (max_ios > 0)
stream->advice_enabled = true;
#endif
@@ -728,7 +745,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
if (++stream->oldest_io_index == stream->max_ios)
stream->oldest_io_index = 0;
- if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+ if (stream->ios[io_index].op.flags & (READ_BUFFERS_ISSUE_ADVICE | READ_BUFFERS_MORE_MORE_MORE))
{
/* Distance ramps up fast (behavior C). */
distance = stream->distance * 2;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4914c71d41e..ed384fa1a44 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1638,7 +1638,8 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
if (did_start_io_overall)
{
- pgaio_submit_staged();
+ if (!(flags & READ_BUFFERS_MORE_MORE_MORE))
+ pgaio_submit_staged();
return true;
}
else
--
2.45.2.827.g557ae147e6
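The read_stream changes above defer pgaio_submit_staged() so that look-ahead can stage several IOs and flush them with a single submission instead of one syscall per IO. The effect can be modeled in a few lines of plain C; all names here are invented for the illustration:

```c
#include <assert.h>

#define MAX_STAGED 32

static int	staged[MAX_STAGED];
static int	nstaged;
static int	submit_calls;		/* count of (expensive) submit syscalls */
static int	submitted_total;	/* total IOs handed to the kernel */

/* Stand-in for read_stream_start_pending_read() with the
 * "caller will issue more IO, don't submit" flag set. */
static void
stage_io(int blocknum)
{
	assert(nstaged < MAX_STAGED);
	staged[nstaged++] = blocknum;
}

/* Stand-in for pgaio_submit_staged(): one batched submission. */
static void
submit_staged(void)
{
	if (nstaged == 0)
		return;
	submit_calls++;
	submitted_total += nstaged;
	nstaged = 0;
}

/* Stand-in for read_stream_look_ahead(): stage everything, then
 * flush once at the end, as the patch does. */
static void
look_ahead(int nblocks)
{
	for (int blk = 0; blk < nblocks; blk++)
		stage_io(blk);
	submit_staged();
}
```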
Attachment: v2.1-0017-aio-Add-IO-queue-helper.patch (text/x-diff)
From 2d390f78e46219c8bace6d37ff35d20f6ff0fd30 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 16:15:42 -0400
Subject: [PATCH v2.1 17/20] aio: Add IO queue helper
This is likely never going anywhere - Thomas Munro is working on something
more complete. But I needed a way to exercise AIO for checkpointer / bgwriter.
---
src/include/storage/io_queue.h | 33 +++++
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/io_queue.c | 195 ++++++++++++++++++++++++++++
src/backend/storage/aio/meson.build | 1 +
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 232 insertions(+)
create mode 100644 src/include/storage/io_queue.h
create mode 100644 src/backend/storage/aio/io_queue.c
diff --git a/src/include/storage/io_queue.h b/src/include/storage/io_queue.h
new file mode 100644
index 00000000000..28077158d6d
--- /dev/null
+++ b/src/include/storage/io_queue.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.h
+ * Mechanism for tracking many IOs
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_queue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_QUEUE_H
+#define IO_QUEUE_H
+
+#include "storage/bufmgr.h"
+
+struct IOQueue;
+typedef struct IOQueue IOQueue;
+
+struct PgAioHandleRef;
+
+extern IOQueue *io_queue_create(int depth, int flags);
+extern void io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior);
+extern void io_queue_wait_one(IOQueue *ioq);
+extern void io_queue_wait_all(IOQueue *ioq);
+extern bool io_queue_is_empty(IOQueue *ioq);
+extern void io_queue_reserve(IOQueue *ioq);
+extern struct PgAioHandle *io_queue_get_io(IOQueue *ioq);
+extern void io_queue_free(IOQueue *ioq);
+
+#endif /* IO_QUEUE_H */
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 3bcb8a0b2ed..f3a7f9e63d6 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -13,6 +13,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_subject.o \
+ io_queue.o \
method_io_uring.o \
method_sync.o \
method_worker.o \
diff --git a/src/backend/storage/aio/io_queue.c b/src/backend/storage/aio/io_queue.c
new file mode 100644
index 00000000000..4dda2f4e20e
--- /dev/null
+++ b/src/backend/storage/aio/io_queue.c
@@ -0,0 +1,195 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.c
+ * Mechanism for tracking many IOs
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/io_queue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/io_queue.h"
+
+#include "storage/aio.h"
+
+
+typedef struct TrackedIO
+{
+ PgAioHandleRef ior;
+ dlist_node node;
+} TrackedIO;
+
+struct IOQueue
+{
+ int depth;
+ int unsubmitted;
+
+ bool has_reserved;
+
+ dclist_head idle;
+ dclist_head in_progress;
+
+ TrackedIO tracked_ios[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
+IOQueue *
+io_queue_create(int depth, int flags)
+{
+ size_t sz;
+ IOQueue *ioq;
+
+ sz = offsetof(IOQueue, tracked_ios)
+ + sizeof(TrackedIO) * depth;
+
+ ioq = palloc0(sz);
+
+ ioq->depth = 0;
+
+ for (int i = 0; i < depth; i++)
+ {
+ TrackedIO *tio = &ioq->tracked_ios[i];
+
+ pgaio_io_ref_clear(&tio->ior);
+ dclist_push_tail(&ioq->idle, &tio->node);
+ }
+
+ return ioq;
+}
+
+void
+io_queue_wait_one(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* FIXME: Should we really pop here already? */
+ dlist_node *node = dclist_pop_head_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ pgaio_io_ref_wait(&tio->ior);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+void
+io_queue_reserve(IOQueue *ioq)
+{
+ if (ioq->has_reserved)
+ return;
+
+ if (dclist_is_empty(&ioq->idle))
+ io_queue_wait_one(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ ioq->has_reserved = true;
+}
+
+PgAioHandle *
+io_queue_get_io(IOQueue *ioq)
+{
+ PgAioHandle *ioh;
+
+ io_queue_reserve(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ if (!io_queue_is_empty(ioq))
+ {
+ ioh = pgaio_io_get_nb(CurrentResourceOwner, NULL);
+ if (ioh == NULL)
+ {
+ /*
+ * Need to wait for all IOs, blocking might not be legal in the
+ * context.
+ *
+ * XXX: This doesn't make a whole lot of sense, we're also
+ * blocking here. What was I smoking when I wrote the above?
+ */
+ io_queue_wait_all(ioq);
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+ }
+ else
+ {
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+
+ return ioh;
+}
+
+void
+io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior)
+{
+ dlist_node *node;
+ TrackedIO *tio;
+
+ Assert(ioq->has_reserved);
+ ioq->has_reserved = false;
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ node = dclist_pop_head_node(&ioq->idle);
+ tio = dclist_container(TrackedIO, node, node);
+
+ tio->ior = *ior;
+
+ dclist_push_tail(&ioq->in_progress, &tio->node);
+
+ ioq->unsubmitted++;
+
+ /*
+ * XXX: Should have some smarter logic here. We don't want to wait too
+ * long to submit, that'll mean we're more likely to block. But we also
+ * don't want to have the overhead of submitting every IO individually.
+ */
+ if (ioq->unsubmitted >= 4)
+ {
+ pgaio_submit_staged();
+ ioq->unsubmitted = 0;
+ }
+}
+
+void
+io_queue_wait_all(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* wait for the last IO to minimize unnecessary wakeups */
+ dlist_node *node = dclist_tail_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ if (!pgaio_io_ref_check_done(&tio->ior))
+ {
+ ereport(DEBUG3,
+ errmsg("io_queue_wait_all for io:%d",
+ pgaio_io_ref_get_id(&tio->ior)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ pgaio_io_ref_wait(&tio->ior);
+ }
+
+ dclist_delete_from(&ioq->in_progress, &tio->node);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+bool
+io_queue_is_empty(IOQueue *ioq)
+{
+ return dclist_is_empty(&ioq->in_progress);
+}
+
+void
+io_queue_free(IOQueue *ioq)
+{
+ io_queue_wait_all(ioq);
+
+ pfree(ioq);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 537f23d446d..e8a88e615c0 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_subject.c',
+ 'io_queue.c',
'method_io_uring.c',
'method_sync.c',
'method_worker.c',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 46d31cf2b9f..a38141b4e50 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1172,6 +1172,7 @@ IOContext
IOFuncSelector
IOObject
IOOp
+IOQueue
IO_STATUS_BLOCK
IPCompareMethod
ITEM
@@ -2960,6 +2961,7 @@ TocEntry
TokenAuxData
TokenizedAuthLine
TrackItem
+TrackedIO
TransApplyAction
TransInvalidationInfo
TransState
--
2.45.2.827.g557ae147e6
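The IOQueue above bounds how many IOs a caller keeps in flight: reserving a slot when all of them are busy first waits for the oldest IO to complete. A self-contained toy model of that depth-limiting behavior, where instant "completion" stands in for pgaio_io_ref_wait() and every name is invented for the sketch:

```c
#include <assert.h>

#define QUEUE_DEPTH 4			/* io_queue_create(depth, ...) analogue */

static int	in_progress[QUEUE_DEPTH];
static int	nin_progress;
static int	completed;

/* Analogue of io_queue_wait_one(): retire the oldest in-flight IO. */
static void
queue_wait_one(void)
{
	assert(nin_progress > 0);
	for (int i = 1; i < nin_progress; i++)	/* shift queue head off */
		in_progress[i - 1] = in_progress[i];
	nin_progress--;
	completed++;
}

/* Analogue of io_queue_reserve() + io_queue_track(): a full queue
 * forces a wait before another IO may be tracked. */
static void
queue_track(int io_id)
{
	if (nin_progress == QUEUE_DEPTH)
		queue_wait_one();
	in_progress[nin_progress++] = io_id;
}

/* Analogue of io_queue_wait_all(): drain everything in flight. */
static void
queue_wait_all(void)
{
	while (nin_progress > 0)
		queue_wait_one();
}
```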
Attachment: v2.1-0018-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (text/x-diff)
From 52aab8396a446a90e23178fd0c593fddfa433a7a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 23 Jul 2024 10:01:23 -0700
Subject: [PATCH v2.1 18/20] bufmgr: use AIO in checkpointer, bgwriter
This is far from ready - just included to be able to exercise AIO writes and
get some preliminary numbers. In all likelihood this will be based on top of
work by Thomas Munro rather than on the preceding commit.
---
src/include/postmaster/bgwriter.h | 3 +-
src/include/storage/buf_internals.h | 1 +
src/include/storage/bufmgr.h | 3 +-
src/include/storage/bufpage.h | 1 +
src/backend/postmaster/bgwriter.c | 25 +-
src/backend/postmaster/checkpointer.c | 12 +-
src/backend/storage/buffer/bufmgr.c | 588 +++++++++++++++++++++++---
src/backend/storage/page/bufpage.c | 10 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 580 insertions(+), 64 deletions(-)
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 407f26e5302..01a936fbc0a 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -31,7 +31,8 @@ extern void BackgroundWriterMain(char *startup_data, size_t startup_data_len) pg
extern void CheckpointerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+struct IOQueue;
+extern void CheckpointWriteDelay(struct IOQueue *ioq, int flags, double progress);
extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 5cfa7dbd1f1..9d3123663b3 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -21,6 +21,7 @@
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
+#include "storage/io_queue.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ac6496bb1eb..a65888c8915 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -325,7 +325,8 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+struct IOQueue;
+extern bool BgBufferSync(struct IOQueue *ioq, struct WritebackContext *wb_context);
extern void LimitAdditionalPins(uint32 *additional_pins);
extern void LimitAdditionalLocalPins(uint32 *additional_pins);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 6222d46e535..6f8fe796da3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -509,5 +509,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
Item newtup, Size newsize);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern bool PageNeedsChecksumCopy(Page page);
#endif /* BUFPAGE_H */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0f75548759a..71c08da45db 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -38,10 +38,12 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
@@ -89,6 +91,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
bool prev_hibernate;
+ IOQueue *ioq;
WritebackContext wb_context;
Assert(startup_data_len == 0);
@@ -130,6 +133,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
ALLOCSET_DEFAULT_SIZES);
MemoryContextSwitchTo(bgwriter_context);
+ ioq = io_queue_create(128, 0);
WritebackContextInit(&wb_context, &bgwriter_flush_after);
/*
@@ -167,6 +171,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
* about in bgwriter, but we do have LWLocks, buffers, and temp files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
UnlockBuffers();
ReleaseAuxProcessResources(false);
@@ -226,12 +231,27 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
+ /*
+ * XXX: Before exiting, wait for all IO to finish. That's only
+ * important to avoid spurious PrintBufferLeakWarning() /
+ * PrintAioIPLeakWarning() calls, triggered by
+ * ReleaseAuxProcessResources() being called with isCommit=true.
+ *
+ * FIXME: this is theoretically racy, but I didn't want to copy the
+ * remaining body of HandleMainLoopInterrupts() here.
+ */
+ if (ShutdownRequestPending)
+ {
+ io_queue_wait_all(ioq);
+ io_queue_free(ioq);
+ }
+
HandleMainLoopInterrupts();
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync(&wb_context);
+ can_hibernate = BgBufferSync(ioq, &wb_context);
/* Report pending statistics to the cumulative stats system */
pgstat_report_bgwriter();
@@ -248,6 +268,9 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
smgrdestroyall();
}
+ /* finish IO before sleeping, to avoid blocking other backends */
+ io_queue_wait_all(ioq);
+
/*
* Log a new xl_running_xacts every now and then so replication can
* get into a consistent state faster (think of suboverflowed
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index eeb73c85726..17aa980aa80 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,9 +46,11 @@
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
@@ -266,6 +268,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
UnlockBuffers();
@@ -708,7 +711,7 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(IOQueue *ioq, int flags, double progress)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -741,6 +744,13 @@ CheckpointWriteDelay(int flags, double progress)
/* Report interim statistics to the cumulative stats system */
pgstat_report_checkpointer();
+ /*
+ * Ensure all pending IO is submitted to avoid unnecessary delays for
+ * other processes.
+ */
+ io_queue_wait_all(ioq);
+
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ed384fa1a44..6ec700e5ef2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -52,6 +52,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -77,6 +78,7 @@
/* Bits in SyncOneBuffer's return value */
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+#define BUF_CANT_MERGE 0x04
#define RELS_BSEARCH_THRESHOLD 20
@@ -511,8 +513,6 @@ static void UnpinBuffer(BufferDesc *buf);
static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used,
- WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
@@ -530,6 +530,7 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
+
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -2954,6 +2955,56 @@ UnpinBufferNoOwner(BufferDesc *buf)
}
}
+typedef struct BuffersToWrite
+{
+ int nbuffers;
+ BufferTag start_at_tag;
+ uint32 max_combine;
+
+ XLogRecPtr max_lsn;
+
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+
+ uint64 total_writes;
+
+ Buffer buffers[IOV_MAX];
+ PgAioBounceBuffer *bounce_buffers[IOV_MAX];
+ const void *data_ptrs[IOV_MAX];
+} BuffersToWrite;
+
+static int PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void
+BuffersToWriteInit(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ to_write->total_writes = 0;
+ to_write->nbuffers = 0;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
+ to_write->max_lsn = InvalidXLogRecPtr;
+}
+
+static void
+BuffersToWriteEnd(BuffersToWrite *to_write)
+{
+ if (to_write->ioh != NULL)
+ {
+ pgaio_io_release(to_write->ioh);
+ to_write->ioh = NULL;
+ }
+
+ if (to_write->total_writes > 0)
+ pgaio_submit_staged();
+}
+
+
#define ST_SORT sort_checkpoint_bufferids
#define ST_ELEMENT_TYPE CkptSortItem
#define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b)
@@ -2985,7 +3036,10 @@ BufferSync(int flags)
binaryheap *ts_heap;
int i;
int mask = BM_DIRTY;
+ IOQueue *ioq;
WritebackContext wb_context;
+ BuffersToWrite to_write;
+ int max_combine;
/*
* Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3047,7 +3101,9 @@ BufferSync(int flags)
if (num_to_scan == 0)
return; /* nothing to do */
+ ioq = io_queue_create(512, 0);
WritebackContextInit(&wb_context, &checkpoint_flush_after);
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
@@ -3155,48 +3211,89 @@ BufferSync(int flags)
*/
num_processed = 0;
num_written = 0;
+
+ BuffersToWriteInit(&to_write, ioq, &wb_context);
+
while (!binaryheap_empty(ts_heap))
{
BufferDesc *bufHdr = NULL;
CkptTsStatus *ts_stat = (CkptTsStatus *)
DatumGetPointer(binaryheap_first(ts_heap));
- buf_id = CkptBufferIds[ts_stat->index].buf_id;
- Assert(buf_id != -1);
-
- bufHdr = GetBufferDescriptor(buf_id);
-
- num_processed++;
+ Assert(ts_stat->num_scanned <= ts_stat->num_to_scan);
/*
- * We don't need to acquire the lock here, because we're only looking
- * at a single bit. It's possible that someone else writes the buffer
- * and clears the flag right after we check, but that doesn't matter
- * since SyncOneBuffer will then do nothing. However, there is a
- * further race condition: it's conceivable that between the time we
- * examine the bit here and the time SyncOneBuffer acquires the lock,
- * someone else not only wrote the buffer but replaced it with another
- * page and dirtied it. In that improbable case, SyncOneBuffer will
- * write the buffer though we didn't need to. It doesn't seem worth
- * guarding against this, though.
+ * Collect a batch of buffers to write out from the current
+ * tablespace. That causes some imbalance between the tablespaces, but
+ * that's more than outweighed by the efficiency gain due to batching.
*/
- if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+ while (to_write.nbuffers < max_combine &&
+ ts_stat->num_scanned < ts_stat->num_to_scan)
{
- if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+ buf_id = CkptBufferIds[ts_stat->index].buf_id;
+ Assert(buf_id != -1);
+
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ num_processed++;
+
+ /*
+ * We don't need to acquire the lock here, because we're only
+ * looking at a single bit. It's possible that someone else writes
+ * the buffer and clears the flag right after we check, but that
+ * doesn't matter since SyncOneBuffer will then do nothing.
+ * However, there is a further race condition: it's conceivable
+ * that between the time we examine the bit here and the time
+ * SyncOneBuffer acquires the lock, someone else not only wrote
+ * the buffer but replaced it with another page and dirtied it. In
+ * that improbable case, SyncOneBuffer will write the buffer
+ * though we didn't need to. It doesn't seem worth guarding
+ * against this, though.
+ */
+ if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
{
- TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.buffers_written++;
- num_written++;
+ int result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+
+ if (result & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+ WriteBuffers(&to_write, ioq, &wb_context);
+
+ result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+ Assert(result != BUF_CANT_MERGE);
+ }
+
+ if (result & BUF_WRITTEN)
+ {
+ TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
+ PendingCheckpointerStats.buffers_written++;
+ num_written++;
+ }
+ else
+ {
+ break;
+ }
}
+ else
+ {
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+ }
+
+ /*
+ * Measure progress independent of actually having to flush the
+ * buffer - otherwise writing becomes unbalanced.
+ */
+ ts_stat->progress += ts_stat->progress_slice;
+ ts_stat->num_scanned++;
+ ts_stat->index++;
}
- /*
- * Measure progress independent of actually having to flush the buffer
- * - otherwise writing become unbalanced.
- */
- ts_stat->progress += ts_stat->progress_slice;
- ts_stat->num_scanned++;
- ts_stat->index++;
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+
/* Have all the buffers from the tablespace been processed? */
if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -3214,15 +3311,23 @@ BufferSync(int flags)
*
* (This will check for barrier events even if it doesn't sleep.)
*/
- CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
+ CheckpointWriteDelay(ioq, flags, (double) num_processed / num_to_scan);
}
+ Assert(to_write.nbuffers == 0);
+ io_queue_wait_all(ioq);
+
/*
* Issue all pending flushes. Only checkpointer calls BufferSync(), so
* IOContext will always be IOCONTEXT_NORMAL.
*/
IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+ io_queue_wait_all(ioq); /* IssuePendingWritebacks might have added
+ * more */
+ io_queue_free(ioq);
+ BuffersToWriteEnd(&to_write);
+
pfree(per_ts_stat);
per_ts_stat = NULL;
binaryheap_free(ts_heap);
@@ -3248,7 +3353,7 @@ BufferSync(int flags)
* bgwriter_lru_maxpages to 0.)
*/
bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
{
/* info obtained from freelist.c */
int strategy_buf_id;
@@ -3291,6 +3396,8 @@ BgBufferSync(WritebackContext *wb_context)
long new_strategy_delta;
uint32 new_recent_alloc;
+ BuffersToWrite to_write;
+
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
@@ -3467,11 +3574,25 @@ BgBufferSync(WritebackContext *wb_context)
num_written = 0;
reusable_buffers = reusable_buffers_est;
+ BuffersToWriteInit(&to_write, ioq, wb_context);
+
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ int sync_state;
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ if (sync_state & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ Assert(sync_state != BUF_CANT_MERGE);
+ }
if (++next_to_clean >= NBuffers)
{
@@ -3482,6 +3603,13 @@ BgBufferSync(WritebackContext *wb_context)
if (sync_state & BUF_WRITTEN)
{
+ Assert(sync_state & BUF_REUSABLE);
+
+ if (to_write.nbuffers == io_combine_limit)
+ {
+ WriteBuffers(&to_write, ioq, wb_context);
+ }
+
reusable_buffers++;
if (++num_written >= bgwriter_lru_maxpages)
{
@@ -3493,6 +3621,11 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ BuffersToWriteEnd(&to_write);
+
PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
@@ -3531,8 +3664,66 @@ BgBufferSync(WritebackContext *wb_context)
return (bufs_to_lap == 0 && recent_alloc == 0);
}
+static inline bool
+BufferTagsSameRel(const BufferTag *tag1, const BufferTag *tag2)
+{
+ return (tag1->spcOid == tag2->spcOid) &&
+ (tag1->dbOid == tag2->dbOid) &&
+ (tag1->relNumber == tag2->relNumber) &&
+ (tag1->forkNum == tag2->forkNum)
+ ;
+}
+
+static bool
+CanMergeWrite(BuffersToWrite *to_write, BufferDesc *cur_buf_hdr)
+{
+ BlockNumber cur_block = cur_buf_hdr->tag.blockNum;
+
+ Assert(to_write->nbuffers > 0); /* can't merge with nothing */
+ Assert(to_write->start_at_tag.relNumber != InvalidOid);
+ Assert(to_write->start_at_tag.blockNum != InvalidBlockNumber);
+
+ Assert(to_write->ioh != NULL);
+
+ /*
+ * First check if the blocknumber is one that we could actually merge,
+ * that's cheaper than checking the tablespace/db/relnumber/fork match.
+ */
+ if (to_write->start_at_tag.blockNum + to_write->nbuffers != cur_block)
+ return false;
+
+ if (!BufferTagsSameRel(&to_write->start_at_tag, &cur_buf_hdr->tag))
+ return false;
+
+ /*
+ * Need to check with smgr how large a write we're allowed to make. To
+ * reduce the overhead of the smgr check, only inquire once, when
+ * processing the first to-be-merged buffer. That avoids the overhead in
+ * the common case of writing out buffers that are definitely not mergeable.
+ */
+ if (to_write->nbuffers == 1)
+ {
+ SMgrRelation smgr;
+
+ smgr = smgropen(BufTagGetRelFileLocator(&to_write->start_at_tag), INVALID_PROC_NUMBER);
+
+ to_write->max_combine = smgrmaxcombine(smgr,
+ to_write->start_at_tag.forkNum,
+ to_write->start_at_tag.blockNum);
+ }
+ else
+ {
+ Assert(to_write->max_combine > 0);
+ }
+
+ if (to_write->start_at_tag.blockNum + to_write->max_combine <= cur_block)
+ return false;
+
+ return true;
+}
+
/*
- * SyncOneBuffer -- process a single buffer during syncing.
+ * PrepareToWriteBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
* buffers marked recently used, as these are not replacement candidates.
@@ -3541,22 +3732,56 @@ BgBufferSync(WritebackContext *wb_context)
* BUF_WRITTEN: we wrote the buffer.
* BUF_REUSABLE: buffer is available for replacement, ie, it has
* pin count 0 and usage count 0.
+ * BUF_CANT_MERGE: can't combine this write with prior writes, caller needs
+ * to issue those first
*
* (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
* after locking it, but we don't care all that much.)
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context)
{
- BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
- int result = 0;
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(buf - 1);
uint32 buf_state;
- BufferTag tag;
+ int result = 0;
+ XLogRecPtr cur_buf_lsn;
+ LWLock *content_lock;
+ bool may_block;
+
+ /*
+ * Check if this buffer can be written out together with already prepared
+ * writes. We check before we have pinned the buffer, so the buffer can be
+ * written out and replaced between this check and us pinning the buffer -
+ * we'll recheck below. The reason for the pre-check is that we don't want
+ * to pin the buffer just to find out that we can't merge the IO.
+ */
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ result |= BUF_CANT_MERGE;
+ return result;
+ }
+ }
+ else
+ {
+ if (to_write->ioh == NULL)
+ {
+ to_write->ioh = io_queue_get_io(ioq);
+ pgaio_io_get_ref(to_write->ioh, &to_write->ior);
+ }
+
+ to_write->start_at_tag = cur_buf_hdr->tag;
+ }
/* Make sure we can handle the pin */
ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
+ /* XXX: Should also check if we are allowed to pin one more buffer */
+
/*
* Check whether buffer needs writing.
*
@@ -3566,7 +3791,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* don't worry because our checkpoint.redo points before log record for
* upcoming changes and so we are not required to write such dirty buffer.
*/
- buf_state = LockBufHdr(bufHdr);
+ buf_state = LockBufHdr(cur_buf_hdr);
if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
@@ -3575,40 +3800,282 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
}
else if (skip_recently_used)
{
+#if 0
+ elog(LOG, "at block %d: skip recent with nbuffers %d",
+ cur_buf_hdr->tag.blockNum, to_write->nbuffers);
+#endif
/* Caller told us not to write recently-used buffers */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
{
/* It's clean, so nothing to do */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
- /*
- * Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
- * buffer is clean by the time we've locked it.)
- */
- PinBuffer_Locked(bufHdr);
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-
- FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
-
- LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
-
- tag = bufHdr->tag;
-
- UnpinBuffer(bufHdr);
+ /* pin the buffer, from now on its identity can't change anymore */
+ PinBuffer_Locked(cur_buf_hdr);
/*
- * SyncOneBuffer() is only called by checkpointer and bgwriter, so
- * IOContext will always be IOCONTEXT_NORMAL.
+ * If we are merging, check if the buffer's identity possibly changed
+ * while we hadn't yet pinned it.
+ *
+ * XXX: It might be worth checking if we still want to write the buffer
+ * out, e.g. it could have been replaced with a buffer that doesn't have
+ * BM_CHECKPOINT_NEEDED set.
*/
- ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ elog(LOG, "changed identity");
+ UnpinBuffer(cur_buf_hdr);
- return result | BUF_WRITTEN;
+ result |= BUF_CANT_MERGE;
+
+ return result;
+ }
+ }
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ content_lock = BufferDescriptorGetContentLock(cur_buf_hdr);
+
+ if (!may_block)
+ {
+ if (LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /* done */
+ }
+ else if (to_write->nbuffers == 0)
+ {
+ /*
+ * Need to wait for all prior IO to finish before blocking for
+ * lock acquisition, to avoid the risk of a deadlock due to us
+ * waiting for another backend that is waiting for our unsubmitted
+ * IO to complete.
+ */
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ elog(DEBUG2, "at block %u: can't block, nbuffers = 0",
+ cur_buf_hdr->tag.blockNum
+ );
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ Assert(may_block);
+
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+ else
+ {
+ elog(DEBUG2, "at block %u: can't block, nbuffers = %d",
+ cur_buf_hdr->tag.blockNum,
+ to_write->nbuffers);
+
+ UnpinBuffer(cur_buf_hdr);
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+
+ return result;
+ }
+ }
+ else
+ {
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+
+ if (!may_block)
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ may_block = io_queue_is_empty(ioq) && to_write->nbuffers == 0 && !pgaio_have_staged();
+
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ elog(DEBUG2, "at block %u: non-waitable StartBufferIO returns false, %d",
+ cur_buf_hdr->tag.blockNum,
+ may_block);
+
+ /*
+ * FIXME: can't tell whether this is because the buffer has
+ * been cleaned
+ */
+ if (!may_block)
+ {
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+ }
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ return result;
+ }
+ }
+ }
+ else
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, false))
+ {
+ elog(DEBUG2, "waitable StartBufferIO returns false");
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ /*
+ * FIXME: Historically we returned BUF_WRITTEN in this case, which
+ * seems wrong
+ */
+ return result;
+ }
+ }
+
+ /*
+ * Run PageGetLSN while holding header lock, since we don't have the
+ * buffer locked exclusively in all cases.
+ */
+ buf_state = LockBufHdr(cur_buf_hdr);
+
+ cur_buf_lsn = BufferGetLSN(cur_buf_hdr);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf_state &= ~BM_JUST_DIRTIED;
+
+ UnlockBufHdr(cur_buf_hdr, buf_state);
+
+ to_write->buffers[to_write->nbuffers] = buf;
+ to_write->nbuffers++;
+
+ if (buf_state & BM_PERMANENT &&
+ (to_write->max_lsn == InvalidXLogRecPtr || to_write->max_lsn < cur_buf_lsn))
+ {
+ to_write->max_lsn = cur_buf_lsn;
+ }
+
+ result |= BUF_WRITTEN;
+
+ return result;
+}
+
+static void
+WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ SMgrRelation smgr;
+ Buffer first_buf;
+ BufferDesc *first_buf_hdr;
+ bool needs_checksum;
+
+ Assert(to_write->nbuffers > 0 && to_write->nbuffers <= io_combine_limit);
+
+ first_buf = to_write->buffers[0];
+ first_buf_hdr = GetBufferDescriptor(first_buf - 1);
+
+ smgr = smgropen(BufTagGetRelFileLocator(&first_buf_hdr->tag), INVALID_PROC_NUMBER);
+
+ /*
+ * Force XLOG flush up to buffer's LSN. This implements the basic WAL
+ * rule that log updates must hit disk before any of the data-file changes
+ * they describe do.
+ *
+ * However, this rule does not apply to unlogged relations, which will be
+ * lost after a crash anyway. Most unlogged relation pages do not bear
+ * LSNs since we never emit WAL records for them, and therefore flushing
+ * up through the buffer LSN would be useless, but harmless. However,
+ * GiST indexes use LSNs internally to track page-splits, and therefore
+ * unlogged GiST pages bear "fake" LSNs generated by
+ * GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake
+ * LSN counter could advance past the WAL insertion point; and if it did
+ * happen, attempting to flush WAL through that location would fail, with
+ * disastrous system-wide consequences. To make sure that can't happen,
+ * skip the flush if the buffer isn't permanent.
+ */
+ if (to_write->max_lsn != InvalidXLogRecPtr)
+ XLogFlush(to_write->max_lsn);
+
+ /*
+ * Now it's safe to write buffer to disk. Note that no one else should
+ * have been able to write it while we were busy with log flushing because
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
+ */
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+ Block bufBlock;
+ char *bufToWrite;
+
+ bufBlock = BufHdrGetBlock(cur_buf_hdr);
+ needs_checksum = PageNeedsChecksumCopy((Page) bufBlock);
+
+ /*
+ * Update page checksum if desired. Since we have only shared lock on
+ * the buffer, other processes might be updating hint bits in it, so
+ * we must copy the page to a bounce buffer if we do checksumming.
+ */
+ if (needs_checksum)
+ {
+ PgAioBounceBuffer *bb = pgaio_bounce_buffer_get();
+
+ pgaio_io_assoc_bounce_buffer(to_write->ioh, bb);
+
+ bufToWrite = pgaio_bounce_buffer_buffer(bb);
+ memcpy(bufToWrite, bufBlock, BLCKSZ);
+ PageSetChecksumInplace((Page) bufToWrite, cur_buf_hdr->tag.blockNum);
+ }
+ else
+ {
+ bufToWrite = bufBlock;
+ }
+
+ to_write->data_ptrs[nbuf] = bufToWrite;
+ }
+
+ pgaio_io_set_io_data_32(to_write->ioh,
+ (uint32 *) to_write->buffers,
+ to_write->nbuffers);
+ pgaio_io_add_shared_cb(to_write->ioh, ASC_SHARED_BUFFER_WRITE);
+
+ smgrstartwritev(to_write->ioh, smgr,
+ BufTagGetForkNum(&first_buf_hdr->tag),
+ first_buf_hdr->tag.blockNum,
+ to_write->data_ptrs,
+ to_write->nbuffers,
+ false);
+ pgstat_count_io_op_n(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+ IOOP_WRITE, to_write->nbuffers);
+
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+
+ UnpinBuffer(cur_buf_hdr);
+ }
+
+ io_queue_track(ioq, &to_write->ior);
+ to_write->total_writes++;
+
+ /* clear state for next write */
+ to_write->nbuffers = 0;
+ to_write->start_at_tag.relNumber = InvalidOid;
+ to_write->start_at_tag.blockNum = InvalidBlockNumber;
+ to_write->max_combine = 0;
+ to_write->max_lsn = InvalidXLogRecPtr;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
}
/*
@@ -3974,6 +4441,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
error_context_stack = errcallback.previous;
}
+
/*
* RelationGetNumberOfBlocksInFork
* Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index be6f1f62d29..8295e3fb0a0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1491,6 +1491,16 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
return true;
}
+bool
+PageNeedsChecksumCopy(Page page)
+{
+ if (PageIsNew(page))
+ return false;
+
+ /* A checksum copy is needed only when data checksums are enabled */
+ return DataChecksumsEnabled();
+}
+
/*
* Set checksum for a page in shared buffers.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a38141b4e50..9973162dc86 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -345,6 +345,7 @@ BufferManagerRelation
BufferStrategyControl
BufferTag
BufferUsage
+BuffersToWrite
BuildAccumulator
BuiltinScript
BulkInsertState
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0019-very-wip-test_aio-module.patch (text/x-diff; charset=us-ascii)
From b3d46d7af01fb746ab8a366a771420b4608a337e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:13:48 -0400
Subject: [PATCH v2.1 19/20] very-wip: test_aio module
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio_internal.h | 10 +
src/include/storage/buf_internals.h | 4 +
src/backend/storage/aio/aio.c | 38 ++
src/backend/storage/buffer/bufmgr.c | 3 +-
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_aio/.gitignore | 6 +
src/test/modules/test_aio/Makefile | 34 ++
src/test/modules/test_aio/expected/inject.out | 180 +++++++
src/test/modules/test_aio/expected/io.out | 40 ++
.../modules/test_aio/expected/ownership.out | 148 ++++++
src/test/modules/test_aio/expected/prep.out | 17 +
src/test/modules/test_aio/io_uring.conf | 5 +
src/test/modules/test_aio/meson.build | 78 +++
src/test/modules/test_aio/sql/inject.sql | 51 ++
src/test/modules/test_aio/sql/io.sql | 16 +
src/test/modules/test_aio/sql/ownership.sql | 65 +++
src/test/modules/test_aio/sql/prep.sql | 9 +
src/test/modules/test_aio/sync.conf | 5 +
src/test/modules/test_aio/test_aio--1.0.sql | 94 ++++
src/test/modules/test_aio/test_aio.c | 479 ++++++++++++++++++
src/test/modules/test_aio/test_aio.control | 3 +
src/test/modules/test_aio/worker.conf | 5 +
23 files changed, 1290 insertions(+), 2 deletions(-)
create mode 100644 src/test/modules/test_aio/.gitignore
create mode 100644 src/test/modules/test_aio/Makefile
create mode 100644 src/test/modules/test_aio/expected/inject.out
create mode 100644 src/test/modules/test_aio/expected/io.out
create mode 100644 src/test/modules/test_aio/expected/ownership.out
create mode 100644 src/test/modules/test_aio/expected/prep.out
create mode 100644 src/test/modules/test_aio/io_uring.conf
create mode 100644 src/test/modules/test_aio/meson.build
create mode 100644 src/test/modules/test_aio/sql/inject.sql
create mode 100644 src/test/modules/test_aio/sql/io.sql
create mode 100644 src/test/modules/test_aio/sql/ownership.sql
create mode 100644 src/test/modules/test_aio/sql/prep.sql
create mode 100644 src/test/modules/test_aio/sync.conf
create mode 100644 src/test/modules/test_aio/test_aio--1.0.sql
create mode 100644 src/test/modules/test_aio/test_aio.c
create mode 100644 src/test/modules/test_aio/test_aio.control
create mode 100644 src/test/modules/test_aio/worker.conf
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 5d18d112e2d..a44cdb457ee 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -262,6 +262,16 @@ extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+/* These functions are just for use in tests, from within injection points */
+#ifdef USE_INJECTION_POINTS
+
+extern PgAioHandle *pgaio_inj_io_get(void);
+
+#endif
+
+
+
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
extern const IoMethodOps pgaio_worker_ops;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9d3123663b3..1b3329a25b4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -423,6 +423,10 @@ extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_co
extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
IOContext io_context, BufferTag *tag);
+/* solely to make it easier to write tests */
+extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
+
+
/* freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 8877a33b9f2..7efc9631f5f 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -22,6 +22,9 @@
#include "utils/resowner.h"
#include "utils/wait_event_types.h"
+#ifdef USE_INJECTION_POINTS
+#include "utils/injection_point.h"
+#endif
static void pgaio_io_reclaim(PgAioHandle *ioh);
@@ -67,6 +70,11 @@ static const IoMethodOps *pgaio_ops_table[] = {
const IoMethodOps *pgaio_impl;
+#ifdef USE_INJECTION_POINTS
+static PgAioHandle *inj_cur_handle;
+#endif
+
+
/* --------------------------------------------------------------------------------
* "Core" IO Api
@@ -543,6 +551,19 @@ pgaio_io_process_completion(PgAioHandle *ioh, int result)
/* FIXME: should be done in separate function */
ioh->state = AHS_REAPED;
+#ifdef USE_INJECTION_POINTS
+ inj_cur_handle = ioh;
+
+ /*
+ * FIXME: This could be in a critical section - but it looks like we can't
+ * just InjectionPointLoad() at process start, as the injection point
+ * might not yet be defined.
+ */
+ InjectionPointCached("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+
+ inj_cur_handle = NULL;
+#endif
+
pgaio_io_process_completion_subject(ioh);
/* ensure results of completion are visible before the new state */
@@ -1013,3 +1034,20 @@ assign_io_method(int newval, void *extra)
{
pgaio_impl = pgaio_ops_table[newval];
}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Injection point support
+ * --------------------------------------------------------------------------------
+ */
+
+#ifdef USE_INJECTION_POINTS
+
+PgAioHandle *
+pgaio_inj_io_get(void)
+{
+ return inj_cur_handle;
+}
+
+#endif
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6ec700e5ef2..44b1b6fb316 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -514,7 +514,6 @@ static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner,
bool syncio);
@@ -6095,7 +6094,7 @@ WaitIO(BufferDesc *buf)
* find out if they can perform the I/O as part of a larger operation, without
* waiting for the answer or distinguishing the reasons why not.
*/
-static bool
+bool
StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 256799f520a..7df90602e90 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
libpq_pipeline \
plsample \
spgist_name_ops \
+ test_aio \
test_bloomfilter \
test_copy_callbacks \
test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index d8fe059d236..bc7d19e694f 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('test_aio')
subdir('brin')
subdir('commit_ts')
subdir('delay_execution')
diff --git a/src/test/modules/test_aio/.gitignore b/src/test/modules/test_aio/.gitignore
new file mode 100644
index 00000000000..b4903eba657
--- /dev/null
+++ b/src/test/modules/test_aio/.gitignore
@@ -0,0 +1,6 @@
+# Generated subdirectories
+/log/
+/results/
+/output_iso/
+/tmp_check/
+/tmp_check_iso/
diff --git a/src/test/modules/test_aio/Makefile b/src/test/modules/test_aio/Makefile
new file mode 100644
index 00000000000..ae6d685835b
--- /dev/null
+++ b/src/test/modules/test_aio/Makefile
@@ -0,0 +1,34 @@
+# src/test/modules/test_aio/Makefile
+
+PGFILEDESC = "test_aio - test code for AIO"
+
+MODULE_big = test_aio
+OBJS = \
+ $(WIN32RES) \
+ test_aio.o
+
+EXTENSION = test_aio
+DATA = test_aio--1.0.sql
+
+REGRESS = prep ownership io
+
+ifeq ($(enable_injection_points),yes)
+REGRESS += inject
+endif
+
+# FIXME: under meson these tests run once per supported io_method (sync,
+# worker and, if available, io_uring); the make build only runs them once.
+
+# requires custom config
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_aio/expected/inject.out b/src/test/modules/test_aio/expected/inject.out
new file mode 100644
index 00000000000..e52b0f086dd
--- /dev/null
+++ b/src/test/modules/test_aio/expected/inject.out
@@ -0,0 +1,180 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to a single block, should retry, but that's not
+-- implemented yet
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b;
+$$);
+NOTICE: wrapped error: could not read blocks 1..2 in file base/<redacted>: read only 8192 of 16384 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- FIXME: Should error
+-- FIXME: errno encoding?
+SELECT inj_io_short_read_attach(-5);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..3 in file base/<redacted>: Input/output error
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/io.out b/src/test/modules/test_aio/expected/io.out
new file mode 100644
index 00000000000..e46b582f290
--- /dev/null
+++ b/src/test/modules/test_aio/expected/io.out
@@ -0,0 +1,40 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT corrupt_rel_block('tbl_a', 1);
+ corrupt_rel_block
+-------------------
+
+(1 row)
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+ redact
+--------
+ t
+(1 row)
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/ownership.out b/src/test/modules/test_aio/expected/ownership.out
new file mode 100644
index 00000000000..97fdad6c629
--- /dev/null
+++ b/src/test/modules/test_aio/expected/ownership.out
@@ -0,0 +1,148 @@
+-----
+-- IO handles
+----
+-- leak warning: implicit xact
+SELECT handle_get();
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+ERROR: release in unexpected state
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+ handle_get
+------------
+
+
+(2 rows)
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- normal handle use
+SELECT handle_get_release();
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT handle_get_twice();
+ERROR: API violation: Only one IO can be handed out
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-----
+-- Bounce Buffers handles
+----
+-- leak warning: implicit xact
+SELECT bb_get();
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+ERROR: can only release handed out BB
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- normal bb use
+SELECT bb_get_release();
+ bb_get_release
+----------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT bb_get_twice();
+ERROR: can only hand out one BB
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/prep.out b/src/test/modules/test_aio/expected/prep.out
new file mode 100644
index 00000000000..7fad6280db5
--- /dev/null
+++ b/src/test/modules/test_aio/expected/prep.out
@@ -0,0 +1,17 @@
+CREATE EXTENSION test_aio;
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+ grow_rel
+----------
+
+(1 row)
+
+SELECT grow_rel('tbl_b', 550);
+ grow_rel
+----------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/io_uring.conf b/src/test/modules/test_aio/io_uring.conf
new file mode 100644
index 00000000000..efd7ad143ff
--- /dev/null
+++ b/src/test/modules/test_aio/io_uring.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'io_uring'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
new file mode 100644
index 00000000000..a4bef0ceeb0
--- /dev/null
+++ b/src/test/modules/test_aio/meson.build
@@ -0,0 +1,78 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_aio_sources = files(
+ 'test_aio.c',
+)
+
+if host_system == 'windows'
+ test_aio_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_aio',
+ '--FILEDESC', 'test_aio - test code for AIO',])
+endif
+
+test_aio = shared_module('test_aio',
+ test_aio_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_aio
+
+test_install_data += files(
+ 'test_aio.control',
+ 'test_aio--1.0.sql',
+)
+
+
+testfiles = [
+ 'prep',
+ 'ownership',
+ 'io',
+]
+
+if get_option('injection_points')
+ testfiles += 'inject'
+endif
+
+
+tests += {
+ 'name': 'test_aio_sync',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('sync.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+tests += {
+ 'name': 'test_aio_worker',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('worker.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+if liburing.found()
+ tests += {
+ 'name': 'test_aio_uring',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('io_uring.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ }
+ }
+endif
diff --git a/src/test/modules/test_aio/sql/inject.sql b/src/test/modules/test_aio/sql/inject.sql
new file mode 100644
index 00000000000..b3d34de8977
--- /dev/null
+++ b/src/test/modules/test_aio/sql/inject.sql
@@ -0,0 +1,51 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+SELECT inj_io_short_read_detach();
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+-- shorten multi-block read to a single block, should retry, but that's not
+-- implemented yet
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b;
+$$);
+SELECT inj_io_short_read_detach();
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_a', 0);
+SELECT invalidate_rel_block('tbl_a', 1);
+SELECT invalidate_rel_block('tbl_a', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- FIXME: Should error
+-- FIXME: errno encoding?
+SELECT inj_io_short_read_attach(-5);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
diff --git a/src/test/modules/test_aio/sql/io.sql b/src/test/modules/test_aio/sql/io.sql
new file mode 100644
index 00000000000..a29bb4eb15d
--- /dev/null
+++ b/src/test/modules/test_aio/sql/io.sql
@@ -0,0 +1,16 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+
+SELECT corrupt_rel_block('tbl_a', 1);
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
diff --git a/src/test/modules/test_aio/sql/ownership.sql b/src/test/modules/test_aio/sql/ownership.sql
new file mode 100644
index 00000000000..63cf40c802a
--- /dev/null
+++ b/src/test/modules/test_aio/sql/ownership.sql
@@ -0,0 +1,65 @@
+-----
+-- IO handles
+----
+
+-- leak warning: implicit xact
+SELECT handle_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+
+-- normal handle use
+SELECT handle_get_release();
+
+-- should error out, API violation
+SELECT handle_get_twice();
+
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+
+
+-----
+-- Bounce Buffers handles
+----
+
+-- leak warning: implicit xact
+SELECT bb_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+
+-- normal bb use
+SELECT bb_get_release();
+
+-- should error out, API violation
+SELECT bb_get_twice();
+
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
diff --git a/src/test/modules/test_aio/sql/prep.sql b/src/test/modules/test_aio/sql/prep.sql
new file mode 100644
index 00000000000..b8f225cbc98
--- /dev/null
+++ b/src/test/modules/test_aio/sql/prep.sql
@@ -0,0 +1,9 @@
+CREATE EXTENSION test_aio;
+
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+SELECT grow_rel('tbl_b', 550);
diff --git a/src/test/modules/test_aio/sync.conf b/src/test/modules/test_aio/sync.conf
new file mode 100644
index 00000000000..c480922d6cf
--- /dev/null
+++ b/src/test/modules/test_aio/sync.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'sync'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
new file mode 100644
index 00000000000..ea9ad43ed8f
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -0,0 +1,94 @@
+/* src/test/modules/test_aio/test_aio--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_aio" to load this file. \quit
+
+
+CREATE FUNCTION grow_rel(rel regclass, nblocks int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION read_corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION bb_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE OR REPLACE FUNCTION redact(p_sql text)
+RETURNS bool
+LANGUAGE plpgsql
+AS $$
+ DECLARE
+ err_state text;
+ err_msg text;
+ BEGIN
+ EXECUTE p_sql;
+ RETURN true;
+ EXCEPTION WHEN OTHERS THEN
+ GET STACKED DIAGNOSTICS
+ err_state = RETURNED_SQLSTATE,
+ err_msg = MESSAGE_TEXT;
+ err_msg = regexp_replace(err_msg, '(file|relation) "?base/[0-9]+/[0-9]+"?', '\1 base/<redacted>');
+ RAISE NOTICE 'wrapped error: %', err_msg
+ USING ERRCODE = err_state;
+ RETURN false;
+ END;
+$$;
+
+
+CREATE FUNCTION inj_io_short_read_attach(result int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_detach()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
new file mode 100644
index 00000000000..9626d495241
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.c
@@ -0,0 +1,479 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_aio.c
+ *		Test module for the AIO subsystem.
+ *
+ * Provides SQL-callable wrappers around low-level AIO operations (handle
+ * and bounce buffer acquisition/release, reading corrupted or invalidated
+ * relation blocks) so that regression tests can exercise resource
+ * ownership, error recovery and IO completion paths. When injection
+ * points are available, it additionally allows short reads and IO errors
+ * to be injected into IO completion processing.
+ *
+ * Copyright (c) 2020-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/test/modules/test_aio/test_aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/lwlock.h"
+#include "storage/ipc.h"
+#include "access/relation.h"
+#include "utils/rel.h"
+#include "utils/injection_point.h"
+
+
+PG_MODULE_MAGIC;
+
+
+typedef struct InjIoErrorState
+{
+ bool enabled;
+ bool result_set;
+ int result;
+} InjIoErrorState;
+
+static InjIoErrorState * inj_io_error_state;
+
+/* Shared memory init callbacks */
+static shmem_request_hook_type prev_shmem_request_hook = NULL;
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+
+
+static PgAioHandle *last_handle;
+static PgAioBounceBuffer *last_bb;
+
+
+
+static void
+test_aio_shmem_request(void)
+{
+ if (prev_shmem_request_hook)
+ prev_shmem_request_hook();
+
+ RequestAddinShmemSpace(sizeof(InjIoErrorState));
+}
+
+static void
+test_aio_shmem_startup(void)
+{
+ bool found;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* Create or attach to the shared memory state */
+ LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+ inj_io_error_state = ShmemInitStruct("test_aio",
+ sizeof(InjIoErrorState),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize the shared state and attach the
+ * injection point callback.
+ */
+ inj_io_error_state->enabled = false;
+
+#ifdef USE_INJECTION_POINTS
+ InjectionPointAttach("AIO_PROCESS_COMPLETION_BEFORE_SHARED",
+ "test_aio",
+ "inj_io_short_read",
+ NULL,
+ 0);
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+#endif
+ }
+ else
+ {
+#ifdef USE_INJECTION_POINTS
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+ elog(LOG, "injection point loaded");
+#endif
+ }
+
+ LWLockRelease(AddinShmemInitLock);
+}
+
+void
+_PG_init(void)
+{
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+
+ /* Shared memory initialization */
+ prev_shmem_request_hook = shmem_request_hook;
+ shmem_request_hook = test_aio_shmem_request;
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = test_aio_shmem_startup;
+}
+
+
+PG_FUNCTION_INFO_V1(grow_rel);
+Datum
+grow_rel(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 nblocks = PG_GETARG_UINT32(1);
+ Relation rel;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ while (nblocks > 0)
+ {
+ uint32 extend_by_pages;
+
+ extend_by_pages = Min(nblocks, MAX_BUFFERS_TO_EXTEND_BY);
+
+ ExtendBufferedRelBy(BMR_REL(rel),
+ MAIN_FORKNUM,
+ NULL,
+ 0,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+
+ nblocks -= extend_by_pages;
+
+ for (uint32 i = 0; i < extend_by_pages; i++)
+ {
+ ReleaseBuffer(victim_buffers[i]);
+ }
+ }
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(corrupt_rel_block);
+Datum
+corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ MarkBufferDirty(buf);
+
+ PageInit(page, BufferGetPageSize(buf), 0);
+
+ ph = (PageHeader) page;
+ ph->pd_special = BLCKSZ + 1;
+
+ FlushOneBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(read_corrupt_rel_block);
+Datum
+read_corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ BufferDesc *buf_hdr;
+ Page page;
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+ SMgrRelation smgr;
+ uint32 buf_state;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* read buffer without erroring out */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ page = BufferGetBlock(buf);
+
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get_ref(ioh, &ior);
+
+ buf_hdr = GetBufferDescriptor(buf - 1);
+ smgr = RelationGetSmgr(rel);
+
+ /* FIXME: even if just a test, we should verify nobody else uses this */
+ buf_state = LockBufHdr(buf_hdr);
+ buf_state &= ~(BM_VALID | BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ StartBufferIO(buf_hdr, true, false);
+
+ pgaio_io_set_io_data_32(ioh, (uint32 *) &buf, 1);
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
+
+ smgrstartreadv(ioh, smgr, MAIN_FORKNUM, block,
+ (void *) &page, 1);
+
+ ReleaseBuffer(buf);
+
+ pgaio_io_ref_wait(&ior);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(invalidate_rel_block);
+Datum
+invalidate_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ PrefetchBufferResult pr;
+ Buffer buf;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* this is a gross hack, but there's no good API exposed */
+ pr = PrefetchBuffer(rel, MAIN_FORKNUM, block);
+ buf = pr.recent_buffer;
+ elog(LOG, "recent: %d", buf);
+ if (BufferIsValid(buf))
+ {
+ /* if the buffer contents aren't valid, this'll return false */
+ if (ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, block, buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ FlushOneBuffer(buf);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+
+ if (!EvictUnpinnedBuffer(buf))
+ elog(ERROR, "couldn't unpin");
+ }
+ }
+
+ relation_close(rel, AccessExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+#if 0
+PG_FUNCTION_INFO_V1(test_unsubmitted_vs_close);
+Datum
+test_unsubmitted_vs_close(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+
+ MarkBufferDirty(buf);
+ ph->pd_special = BLCKSZ + 1;
+
+ /* last_handle = pgaio_io_get(); */
+
+ PG_RETURN_VOID();
+}
+#endif
+
+PG_FUNCTION_INFO_V1(handle_get);
+Datum
+handle_get(PG_FUNCTION_ARGS)
+{
+ last_handle = pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_release_last);
+Datum
+handle_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_handle)
+ elog(ERROR, "no handle");
+
+ pgaio_io_release(last_handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_and_error);
+Datum
+handle_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_twice);
+Datum
+handle_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(handle_get_release);
+Datum
+handle_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioHandle *handle;
+
+ handle = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_release(handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get);
+Datum
+bb_get(PG_FUNCTION_ARGS)
+{
+ last_bb = pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_release_last);
+Datum
+bb_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_bb)
+ elog(ERROR, "no bb");
+
+ pgaio_bounce_buffer_release(last_bb);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_and_error);
+Datum
+bb_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_twice);
+Datum
+bb_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(bb_get_release);
+Datum
+bb_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioBounceBuffer *bb;
+
+ bb = pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_release(bb);
+
+ PG_RETURN_VOID();
+}
+
+#ifdef USE_INJECTION_POINTS
+extern PGDLLEXPORT void inj_io_short_read(const char *name, const void *private_data);
+
+void
+inj_io_short_read(const char *name, const void *private_data)
+{
+ PgAioHandle *ioh;
+
+ elog(LOG, "short read called: %d", inj_io_error_state->enabled);
+
+ if (inj_io_error_state->enabled)
+ {
+ ioh = pgaio_inj_io_get();
+
+ if (inj_io_error_state->result_set)
+ {
+ elog(LOG, "short read, changing result from %d to %d",
+ ioh->result, inj_io_error_state->result);
+
+ ioh->result = inj_io_error_state->result;
+ }
+ }
+}
+#endif
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
+Datum
+inj_io_short_read_attach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = true;
+ inj_io_error_state->result_set = !PG_ARGISNULL(0);
+ if (inj_io_error_state->result_set)
+ inj_io_error_state->result = PG_GETARG_INT32(0);
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_detach);
+Datum
+inj_io_short_read_detach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = false;
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_aio/test_aio.control b/src/test/modules/test_aio/test_aio.control
new file mode 100644
index 00000000000..cd91c3ed16b
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.control
@@ -0,0 +1,3 @@
+comment = 'Test code for AIO'
+default_version = '1.0'
+module_pathname = '$libdir/test_aio'
diff --git a/src/test/modules/test_aio/worker.conf b/src/test/modules/test_aio/worker.conf
new file mode 100644
index 00000000000..8104c201924
--- /dev/null
+++ b/src/test/modules/test_aio/worker.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'worker'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0020-Temporary-Increase-BAS_BULKREAD-size.patch (text/x-diff)
From 32ae60ce61cefe5c2e30341049e0c08b15e36de6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 1 Sep 2024 00:42:27 -0400
Subject: [PATCH v2.1 20/20] Temporary: Increase BAS_BULKREAD size
Without this we can only execute very little AIO for sequential scans, as
there's just not enough buffers in the ring. This isn't the right fix, as
just increasing the ring size can have negative performance implications in
workloads where the kernel has all the data cached.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/freelist.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index dffdd57e9b5..5be8125ad3a 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -555,7 +555,11 @@ GetAccessStrategy(BufferAccessStrategyType btype)
return NULL;
case BAS_BULKREAD:
- ring_size_kb = 256;
+ /*
+ * FIXME: Temporary increase to allow large enough streaming reads
+ * to actually benefit from AIO. This needs a better solution.
+ */
+ ring_size_kb = 2 * 1024;
break;
case BAS_BULKWRITE:
ring_size_kb = 16 * 1024;
--
2.45.2.827.g557ae147e6
Hi,
On 2024-09-05 01:37:34 +0800, 陈宗志 wrote:
I hope there can be a high-level design document that includes a
description, high-level architecture, and low-level design.
This way, others can also participate in reviewing the code.
Yep, that was already on my todo list. The version I just posted includes
that.
For example, which paths were modified in the AIO module?
Is it the path for writing WAL logs, or the path for flushing pages, etc.?
I don't think it's good to document this in a design document - that's just
bound to get out of date.
For now the patchset causes AIO to be used for
1) all users of read_stream.h, e.g. sequential scans
2) bgwriter / checkpointer, mainly to have a way to exercise the write path. As
mentioned in my email upthread, the code for that is in a somewhat rough
shape as Thomas Munro is working on a more general abstraction for some of
this.
The earlier patchset added a lot more AIO uses because I needed to know all
the design constraints. It e.g. added AIO use in WAL. While that allowed me to
learn a lot, it's not something that makes sense to continue working on for
now, as it requires a lot of work that's independent of AIO. Thus I am
focusing on the above users for now.
Also, I recommend keeping this patch as small as possible.
Yep. That's my goal (as mentioned upthread).
For example, the first step could be to introduce libaio only, without
considering io_uring, as that would make it too complex.
Currently the patchset doesn't contain libaio support and I am not planning to
work on using libaio. Nor do I think it makes sense for anybody else to do so
- libaio doesn't work for buffered IO, making it imo not particularly useful
for us.
The io_uring specific code isn't particularly complex / large compared to the
main AIO infrastructure.
Greetings,
Andres Freund
Hi Andres
Thanks for the AIO patch update. I gave it a try and ran into a FATAL
in bgwriter when executing a benchmark.
2024-09-12 01:38:00.851 PDT [2780939] PANIC: no more bbs
2024-09-12 01:38:00.854 PDT [2780473] LOG: background writer process
(PID 2780939) was terminated by signal 6: Aborted
2024-09-12 01:38:00.854 PDT [2780473] LOG: terminating any other
active server processes
I debugged a bit and found that BgBufferSync() is not capping the
batch size under io_bounce_buffers like BufferSync() for checkpoint.
Here is a small patch to fix it.
Best regards
Robert
Attachments:
Attachment: 0001-Fix-BgBufferSync-to-limit-batch-size-under-io_bounce.patch (text/x-patch)
From bd04bd18ce62cf3f88568d3578503d4efeeb6603 Mon Sep 17 00:00:00 2001
From: Robert Pang <robertpang@google.com>
Date: Thu, 12 Sep 2024 14:36:16 -0700
Subject: [PATCH] Fix BgBufferSync to limit batch size under io_bounce_buffers
for bgwriter.
---
src/backend/storage/buffer/bufmgr.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 44b1b6fb31..4cd959b295 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3396,6 +3396,7 @@ BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
uint32 new_recent_alloc;
BuffersToWrite to_write;
+ int max_combine;
/*
* Find out where the freelist clock sweep currently is, and how many
@@ -3417,6 +3418,8 @@ BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
return true;
}
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
+
/*
* Compute strategy_delta = how many buffers have been scanned by the
* clock sweep since last time. If first time through, assume none. Then
@@ -3604,7 +3607,7 @@ BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
{
Assert(sync_state & BUF_REUSABLE);
- if (to_write.nbuffers == io_combine_limit)
+ if (to_write.nbuffers == max_combine)
{
WriteBuffers(&to_write, ioq, wb_context);
}
--
2.46.0.662.g92d0881bb0-goog
On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
There's plenty more to do, but I thought this would be a useful checkpoint.
I find patches 1-5 are Ready for Committer.
+typedef enum PgAioHandleState
This enum clarified a lot for me, so I wish I had read it before anything
else. I recommend referring to it in README.md. Would you also cover the
valid state transitions and which of them any backend can do vs. which are
specific to the defining backend?
+{
+	/* not in use */
+	AHS_IDLE = 0,
+
+	/* returned by pgaio_io_get() */
+	AHS_HANDED_OUT,
+
+	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+	AHS_DEFINED,
+
+	/* subjects prepare() callback has been called */
+	AHS_PREPARED,
+
+	/* IO is being executed */
+	AHS_IN_FLIGHT,
Let's align terms between functions and states those functions reach. For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)
+
+	/* IO finished, but result has not yet been processed */
+	AHS_REAPED,
+
+	/* IO completed, shared completion has been called */
+	AHS_COMPLETED_SHARED,
+
+	/* IO completed, local completion has been called */
+	AHS_COMPLETED_LOCAL,
+} PgAioHandleState;
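To make the requested transition documentation concrete, here is one plausible reading of the state machine as a self-contained sketch. The enum values are copied from the quoted patch, but the validity table is this reader's guess at the intended transitions, not something the patch itself states:

```c
#include <stdbool.h>

/* states copied from the quoted enum; numbering illustrative */
typedef enum PgAioHandleState
{
	AHS_IDLE = 0,
	AHS_HANDED_OUT,
	AHS_DEFINED,
	AHS_PREPARED,
	AHS_IN_FLIGHT,
	AHS_REAPED,
	AHS_COMPLETED_SHARED,
	AHS_COMPLETED_LOCAL
} PgAioHandleState;

/*
 * Guessed validity table: the happy path advances one state at a time,
 * and any non-idle handle can be reclaimed back to AHS_IDLE.
 */
static bool
aio_state_transition_valid(PgAioHandleState from, PgAioHandleState to)
{
	/* pgaio_io_reclaim(): release from any live state */
	if (to == AHS_IDLE)
		return from != AHS_IDLE;

	switch (from)
	{
		case AHS_IDLE:
			return to == AHS_HANDED_OUT;	/* pgaio_io_get() */
		case AHS_HANDED_OUT:
			return to == AHS_DEFINED;	/* pgaio_io_prepare() */
		case AHS_DEFINED:
			return to == AHS_PREPARED;	/* subject prepare() callback */
		case AHS_PREPARED:
			return to == AHS_IN_FLIGHT;	/* submission */
		case AHS_IN_FLIGHT:
			return to == AHS_REAPED;	/* result obtained */
		case AHS_REAPED:
			return to == AHS_COMPLETED_SHARED;	/* shared completion ran */
		case AHS_COMPLETED_SHARED:
			return to == AHS_COMPLETED_LOCAL;	/* local completion ran */
		case AHS_COMPLETED_LOCAL:
			return false;
	}
	return false;
}
```

A table like this in README.md, annotated with which transitions only the defining backend may perform, would answer the question above.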
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+	PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+	Assert(ioh->resowner);
+
+	ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+	ioh->resowner = NULL;
+
+	switch (ioh->state)
+	{
+		case AHS_IDLE:
+			elog(ERROR, "unexpected");
+			break;
+		case AHS_HANDED_OUT:
+			Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
+
+			if (ioh == my_aio->handed_out_io)
+			{
+				my_aio->handed_out_io = NULL;
+				if (!on_error)
+					elog(WARNING, "leaked AIO handle");
+			}
+
+			pgaio_io_reclaim(ioh);
+			break;
+		case AHS_DEFINED:
+		case AHS_PREPARED:
+			/* XXX: Should we warn about this when is_commit? */
Yes.
+			pgaio_submit_staged();
+			break;
+		case AHS_IN_FLIGHT:
+		case AHS_REAPED:
+		case AHS_COMPLETED_SHARED:
+			/* this is expected to happen */
+			break;
+		case AHS_COMPLETED_LOCAL:
+			/* XXX: unclear if this ought to be possible? */
+			pgaio_io_reclaim(ioh);
+			break;
+	}
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+	uint64		ref_generation;
+	PgAioHandleState state;
+	bool		am_owner;
+	PgAioHandle *ioh;
+
+	ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+		return;
+
+	if (am_owner)
+	{
+		if (state == AHS_DEFINED || state == AHS_PREPARED)
+		{
+			/* XXX: Arguably this should be prevented by callers? */
+			pgaio_submit_staged();
Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup. Even so,
the next point might remove the need here:
+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+	Assert(ioh->state == AHS_HANDED_OUT);
+	Assert(pgaio_io_has_subject(ioh));
+
+	ioh->op = op;
+	ioh->state = AHS_DEFINED;
+	ioh->result = 0;
+
+	/* allow a new IO to be staged */
+	my_aio->handed_out_io = NULL;
+
+	pgaio_io_prepare_subject(ioh);
+
+	ioh->state = AHS_PREPARED;
As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here. This code already needs to be safe for that (per
README.md). When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.
+if test "$with_liburing" = yes; then
+  PKG_CHECK_MODULES(LIBURING, liburing)
+fi
I used the attached makefile patch to build w/ liburing.
+pgaio_uring_shmem_init(bool first_time)
+{
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+	bool		found;
+
+	aio_uring_contexts = (PgAioUringContext *)
+		ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+	if (found)
+		return;
+
+	for (int contextno = 0; contextno < TotalProcs; contextno++)
+	{
+		PgAioUringContext *context = &aio_uring_contexts[contextno];
+		int			ret;
+
+		/*
+		 * XXX: Probably worth sharing the WQ between the different rings,
+		 * when supported by the kernel. Could also cause additional
+		 * contention, I guess?
+		 */
+#if 0
+		if (!AcquireExternalFD())
+			elog(ERROR, "No external FD available");
+#endif
+		ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG: database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL: the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG: database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC: failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC: failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC: unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG: received fast shutdown request
I expect that's from io_uring_queue_init() stashing in shared memory a file
descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+	struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
+
+	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+	for (int i = 0; i < num_staged_ios; i++)
+	{
+		PgAioHandle *ioh = staged_ios[i];
+		struct io_uring_sqe *sqe;
+
+		sqe = io_uring_get_sqe(uring_instance);
+
+		pgaio_io_prepare_submit(ioh);
+		pgaio_uring_sq_from_io(ioh, sqe);
+	}
+
+	while (true)
+	{
+		int			ret;
+
+		pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+		ret = io_uring_submit(uring_instance);
+		pgstat_report_wait_end();
+
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}
Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:
EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
+	int			returnCode;
+	Vfd		   *vfdP;
+
+	Assert(FileIsValid(file));
+
+	DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+			   file, VfdCache[file].fileName,
+			   (int64) offset,
+			   iovcnt));
+
+	returnCode = FileAccess(file);
+	if (returnCode < 0)
+		return returnCode;
+
+	vfdP = &VfdCache[file];
+
+	/* FIXME: think about / reimplement temp_file_limit */
+
+	pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+	return 0;
+}
FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
For non-sync IO methods, I gather it's essential that a process other than the
IO definer be scanning for incomplete IOs and completing them. Otherwise,
deadlocks like this would happen:
backend1 locks blk1 for non-IO reasons
backend2 locks blk2, starts AIO write
backend1 waits for lock on blk2 for non-IO reasons
backend2 waits for lock on blk1 for non-IO reasons
If that's right, in worker mode, the IO worker resolves that deadlock. What
resolves it under io_uring? Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically. Could you add a mention of "deadlock" in the
comment at whichever code achieves that?
I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations. Is there any
specific decision you'd like to settle before patch 6 exits WIP?
Thanks,
nm
Attachments:
Attachment: uring-makefile-v1.patch (text/plain)
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 84302cc..b123fdc 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -43,9 +43,10 @@ OBJS = \
$(top_builddir)/src/common/libpgcommon_srv.a \
$(top_builddir)/src/port/libpgport_srv.a
-# We put libpgport and libpgcommon into OBJS, so remove it from LIBS; also add
-# libldap and ICU
-LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS)
+# We put libpgport and libpgcommon into OBJS, so remove it from LIBS.
+LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS))
+# The backend conditionally needs libraries that most executables don't need.
+LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS)
# The backend doesn't need everything that's in LIBS, however
LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
Hi,
Thanks for the review!
On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
There's plenty more to do, but I thought this would be a useful checkpoint.
I find patches 1-5 are Ready for Committer.
Cool!
+typedef enum PgAioHandleState
This enum clarified a lot for me, so I wish I had read it before anything
else. I recommend referring to it in README.md.
Makes sense.
Would you also cover the valid state transitions and which of them any
backend can do vs. which are specific to the defining backend?
Yea, we should. I earlier had something, but because details were still
changing, it was hard to keep up to date.
+{
+	/* not in use */
+	AHS_IDLE = 0,
+
+	/* returned by pgaio_io_get() */
+	AHS_HANDED_OUT,
+
+	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+	AHS_DEFINED,
+
+	/* subjects prepare() callback has been called */
+	AHS_PREPARED,
+
+	/* IO is being executed */
+	AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach. For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)
There used to be a separate SUBMITTED, but I removed it at some point as not
necessary anymore. Arguably it might be useful to re-introduce it so that
e.g. with worker mode one can tell the difference between the IO being queued
and the IO actually being processed.
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+	uint64		ref_generation;
+	PgAioHandleState state;
+	bool		am_owner;
+	PgAioHandle *ioh;
+
+	ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+		return;
+
+	if (am_owner)
+	{
+		if (state == AHS_DEFINED || state == AHS_PREPARED)
+		{
+			/* XXX: Arguably this should be prevented by callers? */
+			pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup.
That, or not having submitted the IO. One thing I've been thinking about as
being potentially helpful infrastructure is to have something similar to a
critical section, except that it asserts that one is not allowed to block or
forget submitting staged IOs.
+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+	Assert(ioh->state == AHS_HANDED_OUT);
+	Assert(pgaio_io_has_subject(ioh));
+
+	ioh->op = op;
+	ioh->state = AHS_DEFINED;
+	ioh->result = 0;
+
+	/* allow a new IO to be staged */
+	my_aio->handed_out_io = NULL;
+
+	pgaio_io_prepare_subject(ioh);
+
+	ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here. This code already needs to be safe for that (per
README.md). When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.
Makes sense.
+if test "$with_liburing" = yes; then
+  PKG_CHECK_MODULES(LIBURING, liburing)
+fi

I used the attached makefile patch to build w/ liburing.
Thanks, will incorporate.
With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:
Right - that's to be expected.
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG: database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL: the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG: database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC: failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC: failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC: unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG: received fast shutdown request

I expect that's from io_uring_queue_init() stashing in shared memory a file
descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.
I think the latter option is saner - I don't think there's anything to be
gained by supporting io_uring in this situation. It's not like anybody will
use it for real-world workloads where performance matters. Nor would it be
useful for portability testing.
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
...
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.
Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
in flight for each uring instance. That's different to a use of uring to
e.g. wait for incoming network data on thousands of sockets, where you could
have essentially unbounded amount of requests outstanding.
What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
...

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.
For non-sync IO methods, I gather it's essential that a process other than the
IO definer be scanning for incomplete IOs and completing them.
Yep - it's something I've been fighting with / redesigning a *lot*. Earlier
the AIO subsystem could transparently retry IOs, but that ends up being a
nightmare - or at least I couldn't find a way to not make it a
nightmare. There are two main complexities:
1) What if the IO is being completed in a critical section? We can't reopen
the file in that situation. My initial fix for this was to defer retries,
but that's problematic too:
2) Acquiring an IO needs to be able to guarantee forward progress. Because
there's a limited number of IOs that means we need to be able to complete
IOs while acquiring an IO. So we can't just keep the IO handle around -
which in turn means that we'd need to save the state for retrying
somewhere. Which would require some pre-allocated memory to save that
state.
Thus I think it's actually better if we delegate retries to the callsites. I
was thinking that for partial reads of shared buffers we ought to not set
BM_IO_ERROR though...
Otherwise, deadlocks like this would happen:
backend1 locks blk1 for non-IO reasons
backend2 locks blk2, starts AIO write
backend1 waits for lock on blk2 for non-IO reasons
backend2 waits for lock on blk1 for non-IO reasonsIf that's right, in worker mode, the IO worker resolves that deadlock. What
resolves it under io_uring? Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically.
Yea, it's code that I haven't forward ported yet. I think basically
LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
immediately acquire the lock and if the buffer has IO going on.
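For what it's worth, the shape of that idea (complete any pending IO on the buffer before sleeping on its lock, so the IO-vs-lock deadlock above can't arise) can be shown with stand-in types; everything here except the pgaio_io_ref_wait() concept itself is invented for the sketch:

```c
#include <stdbool.h>

/* stand-in for a buffer with a content lock and possibly in-flight AIO */
typedef struct StubBuffer
{
	bool		locked;			/* held by some other backend */
	bool		io_in_flight;	/* unfinished AIO on this buffer */
} StubBuffer;

/* stand-in for pgaio_io_ref_wait(): drives the IO to completion */
static void
stub_io_ref_wait(StubBuffer *buf)
{
	buf->io_in_flight = false;
}

/*
 * Try to lock the buffer; on contention, first complete any in-flight
 * IO so we never sleep on the lock while the holder waits on our IO.
 * Returns false where the real code would block on the lwlock.
 */
static bool
stub_lock_buffer(StubBuffer *buf)
{
	if (buf->locked)
	{
		if (buf->io_in_flight)
			stub_io_ref_wait(buf);
		return false;			/* would now sleep on the lock */
	}
	buf->locked = true;
	return true;
}
```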
I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations.
Agreed.
Is there any specific decision you'd like to settle before patch 6 exits
WIP?
Patch 6 specifically? That I really mainly kept separate for review - it
doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?
In case you mean 6+7 or 6 to ~11, I can think of the following:
- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connection. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.
- The header split doesn't quite seem right yet
- I'd like to implement retries in the later patches, to make sure that it
doesn't have design implications
- Worker mode needs to be able to automatically adjust the number of running
workers, I think - otherwise it's going to be too hard to tune.
- I think the PgAioHandles need to be slimmed down a bit - there's some design
evolution visible that should not end up in the tree.
- I'm not sure that I like name "subject" for the different things AIO is
performed for
- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
of buffer ids that are affected by one IO) could be replaced by repurposing
BufferDesc->freeNext or something along those lines. I don't like the amount
of memory required for storing those arrays, even if it's not that much
compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
handle.
- I'd like to extend the test module to actually test more cases, it's too
hard to reach some paths, particularly without [a lot] of users yet. That's
not strictly a dependency of the earlier patches - since the initial patches
can't actually do much in the way of IO.
- We shouldn't reserve AioHandles etc for io workers - but because different
types of aux processes don't use a predetermined ProcNumber, that's not
entirely trivial without adding more complexity. I've actually wondered
whether IO workers should be their own "top-level" kind of process, rather
than an aux process. But that seems quite costly.
- Right now the io_uring mode has each backend's io_uring instance visible to
each other process. That ends up using a fair number of FDs. That's OK from
an efficiency perspective, but I think we'd need to add code to adjust the
soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
various programs that iterate over all possible FDs, causing significant
slowdowns when the soft limit defaults to something high). I earlier had a
limited number of io_uring instances, but that added a fair amount of
overhead because then submitting IO would require a lock.
That again doesn't have to be solved as part of the earlier patches but
might have some minor design impact.
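As a data point for the RLIMIT_NOFILE concern: raising the soft limit is cheap and needs no privileges as long as it stays at or below the hard limit, so startup could do something along these lines (the function name and error handling are invented here, and a real patch would derive the needed count from MaxBackends etc.):

```c
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_NOFILE to "needed" if possible, capping at the
 * hard limit. Returns 0 on success (or if already sufficient), -1 on
 * failure. Purely illustrative; not from the patchset.
 */
static int
raise_nofile_soft_limit(rlim_t needed)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;

	if (rl.rlim_cur >= needed)
		return 0;				/* already sufficient */

	/* can't exceed the hard limit without privileges */
	rl.rlim_cur = (needed < rl.rlim_max) ? needed : rl.rlim_max;

	return setrlimit(RLIMIT_NOFILE, &rl);
}
```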
Thanks again,
Andres Freund
Hi,
On 2024-09-12 14:55:49 -0700, Robert Pang wrote:
Hi Andres
Thanks for the AIO patch update. I gave it a try and ran into a FATAL
in bgwriter when executing a benchmark.2024-09-12 01:38:00.851 PDT [2780939] PANIC: no more bbs
2024-09-12 01:38:00.854 PDT [2780473] LOG: background writer process
(PID 2780939) was terminated by signal 6: Aborted
2024-09-12 01:38:00.854 PDT [2780473] LOG: terminating any other
active server processesI debugged a bit and found that BgBufferSync() is not capping the
batch size under io_bounce_buffers like BufferSync() for checkpoint.
Here is a small patch to fix it.
Good catch, thanks!
I am hoping (as described in my email to Noah a few minutes ago) that we can
get away from needing bounce buffers. They are a quite expensive solution to a
problem we made for ourselves...
Greetings,
Andres Freund
On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.

I think the latter option is saner
Works for me.
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
...
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.

Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
in flight for each uring instance. That's different to a use of uring to
e.g. wait for incoming network data on thousands of sockets, where you could
have an essentially unbounded number of requests outstanding.
What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?
I'd try the following. First, scan for all IOs of all processes at
AHS_DEFINED and later, advancing them to AHS_COMPLETED_SHARED. This might be
unsafe today, but discovering why it's unsafe likely will inform design beyond
EAGAIN returns. I don't specifically know of a way it's unsafe. Do just one
pass of that; there may be newer IOs in progress afterward. If submit still
gets EAGAIN, sleep a bit and retry. Like we do in pgwin32_open_handle(), fail
after a fixed number of iterations. This isn't great if we hold a critical
lock, but it beats the alternative of PANIC on the first EAGAIN.
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
...
FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.
Yes. Even if it doesn't become synchronous IO, some other process may advance
the IO to AHS_COMPLETED_SHARED by the next wake-up of the process that defined
the IO. Still, I think this shouldn't use the term "Start" while no state
name uses that term. What else could remove that mismatch?
Is there any specific decision you'd like to settle before patch 6 exits
WIP?
Patch 6 specifically? That I really mainly kept separate for review - it
No. I'll rephrase as "Is there any specific decision you'd like to settle
before the next cohort of patches exits WIP?"
doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?
No, I agree a lone commit of 6 isn't a win. Roughly, the eight patches
6-9,12-15 could be a minimal attractive unit. I've not thought through that
grouping much.
In case you mean 6+7 or 6 to ~11, I can think of the following:
- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.
AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.
- The header split doesn't quite seem right yet
I won't have a strong opinion on that one. The aio.c/aio_io.c split did catch
my attention. I made a note to check it again once those files get header
comments.
- I'd like to implement retries in the later patches, to make sure that it
doesn't have design implications
Yes, that's a blocker to me.
- Worker mode needs to be able to automatically adjust the number of running
workers, I think - otherwise it's going to be too hard to tune.
Changing that later wouldn't affect much else, so I'd not consider it a
blocker. (The worst case is that we think the initial AIO release will be a
loss for most users, so we wrap it in debug_ terminology like we did for
debug_io_direct. I'm not saying worker scaling will push AIO from one side of
that line to another, but that's why I'm fine with commits that omit
self-contained optimizations.)
- I think the PgAioHandles need to be slimmed down a bit - there's some design
evolution visible that should not end up in the tree.
Okay.
- I'm not sure that I like the name "subject" for the different things AIO is
performed for
How about one of these six terms:
- listener, observer [if you view smgr as an observer of IOs in the sense of https://en.wikipedia.org/wiki/Observer_pattern]
- class, subclass, type, tag [if you view an SmgrIO as a subclass of an IO, in the object-oriented sense]
- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
of buffer ids that are affected by one IO) could be replaced by repurposing
BufferDesc->freeNext or something along those lines. I don't like the amount
of memory required for storing those arrays, even if it's not that much
compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
handle.
Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.
- I'd like to extend the test module to actually test more cases, it's too
hard to reach some paths, particularly without [a lot] of users yet. That's
not strictly a dependency of the earlier patches - since the initial patches
can't actually do much in the way of IO.
Agreed. Among the post-patch check-world coverage, which uncovered parts have
the most risk?
- We shouldn't reserve AioHandles etc for io workers - but because different
types of aux processes don't use a predetermined ProcNumber, that's not
entirely trivial without adding more complexity. I've actually wondered
whether IO workers should be their own "top-level" kind of process, rather
than an aux process. But that seems quite costly.
Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker. Of these ones I'm calling non-blockers, which would you most
regret deferring?
- Right now the io_uring mode has each backend's io_uring instance visible to
each other process. That ends up using a fair number of FDs. That's OK from
an efficiency perspective, but I think we'd need to add code to adjust the
soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
various programs that iterate over all possible FDs, causing significant
slowdowns when the soft limit defaults to something high). I earlier had a
limited number of io_uring instances, but that added a fair amount of
overhead because then submitting IO would require a lock.
Agreed on raising the soft limit. Docs and/or errhint() likely will need to
mention system configuration nonetheless, since some users will encounter
RLIMIT_MEMLOCK or /proc/sys/kernel/io_uring_disabled.
That again doesn't have to be solved as part of the earlier patches but
might have some minor design impact.
How far do you see the design impact spreading on that one?
Thanks,
nm
Hi,
On 2024-09-17 11:08:19 -0700, Noah Misch wrote:
- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.
AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.
On storage that has nontrivial latency, like just about all cloud storage,
even 256 will be too low. Particularly for checkpointer.
Assuming 1ms latency - which isn't the high end of cloud storage latency - 256
blocks in flight limits you to <= 256MByte/s, even on storage that can have a
lot more throughput. With 3ms, which isn't uncommon, it's 85MB/s.
Of course this could be addressed by tuning, but it seems like something that
shouldn't need to be tuned by the majority of folks running postgres.
We also discussed the topic at /messages/by-id/20240925020022.c5.nmisch@google.com
... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision. From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability. Bounce
buffers have an implementation. They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.
Given the issues that modifying pages while in flight causes, not just with PG
level checksums, but also filesystem level checksum, I don't feel like it's a
particularly promising approach.
However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff has
to be completed before we can move forward with AIO. If I split out the write
portion from the read portion a bit further, the main AIO changes and the
shared-buffer read user can be merged before there's a dependency on the hint
bit stuff being done.
Does that seem reasonable?
Greetings,
Andres Freund
On Mon, 30 Sept 2024 at 16:49, Andres Freund <andres@anarazel.de> wrote:
On 2024-09-17 11:08:19 -0700, Noah Misch wrote:
- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.
AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.
On storage that has nontrivial latency, like just about all cloud storage,
even 256 will be too low. Particularly for checkpointer.
Assuming 1ms latency - which isn't the high end of cloud storage latency - 256
blocks in flight limits you to <= 256MByte/s, even on storage that can have a
lot more throughput. With 3ms, which isn't uncommon, it's 85MB/s.
FYI, I think you're off by a factor 8, i.e. that would be 2GB/sec and
666MB/sec respectively, given a normal page size of 8kB and exactly
1ms/3ms full round trip latency:
1 page/1 ms * 8kB/page * 256 concurrency = 256 pages/ms * 8kB/page =
2MiB/ms ~= 2GiB/sec.
for 3ms divide by 3 -> ~666MiB/sec.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Mon, Sep 30, 2024 at 10:49:17AM -0400, Andres Freund wrote:
We also discussed the topic at /messages/by-id/20240925020022.c5.nmisch@google.com
... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision. From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability. Bounce
buffers have an implementation. They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.
Given the issues that modifying pages while in flight causes, not just with PG
level checksums, but also filesystem level checksum, I don't feel like it's a
particularly promising approach.
However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff has
to be completed before we can move forward with AIO. If I split out the write
portion from the read portion a bit further, the main AIO changes and the
shared-buffer read user can be merged before there's a dependency on the hint
bit stuff being done.
Does that seem reasonable?
Yes.
On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
Attached is the next version of the patchset. (..)
Hi Andres,
Thank you for your admirable persistence on this. Please do not take it as
criticism, just a set of questions regarding the patchset v2.1 that
I finally got a little time to play with:
0. Doesn't the v2.1-0011-aio-Add-io_uring-method.patch -> in
pgaio_uring_submit() -> io_uring_get_sqe() need a return value check ?
Otherwise we'll never know that SQ is full in theory, perhaps at least such
a check should be made with Assert() ? (I understand right now that we
allow just up to io_uring_queue_init(io_max_concurrency), but what happens
if:
a. previous io_uring_submit() failed for some reason and we do not have
free space for SQ?
b. (hypothetical) someday someone will try to make PG multithreaded and the
code starts using just one big queue, still without checking for
io_uring_get_sqe()?
1. In [0] you wrote that there's this high amount of FDs consumed for
io_uring (dangerously close to RLIMIT_NOFILE). I can attest that there are
many customers who are using extremely high max_connections (4k-5k, but
there are outliers with 10k in the wild too) - so they won't even start - and I
have one doubt on the user-friendliness impact of this. I'm quite certain
it's going to be the same as with pgbouncer where one is forced to tweak
OS(systemd/pam/limits.conf/etc), but in PG we are better because PG tries
to preallocate and then close() a lot of FDs, so that's safer in runtime.
IMVHO even if we just consume e.g. say > 30% of FDs just for io_uring, the
max_files_per_process loses its spirit a little bit and PG is going to
start losing efficiency too due to frequent open()/close() calls as the fd cache
is too small. Tomas also complained about it some time ago in [1])
So maybe it would be good to introduce couple of sanity checks too (even
after setting higher limit):
- issue FATAL in case of using io_method = io_ring && max_connections would
be close to getrusage(RLIMIT_NOFILE)
- issue warning in case of using io_method = io_ring && we wouldn't have
even real 1k FDs free for handling relation FDs (detect something bad like:
getrusage(RLIMIT_NOFILE) <= max_connections + max_files_per_process)
2. In pgaio_uring_postmaster_child_init_local() there
"io_uring_queue_init(32,...)" - why 32? :) And also there's separate
io_uring_queue_init(io_max_concurrency) which seems to be derived from
AioChooseMaxConccurrency() which can go up to 64?
3. I find having two such similarly named GUCs
(effective_io_concurrency, io_max_concurrency) confusing. It is clear from the IO_URING
perspective what is io_max_concurrency all about, but I bet having also
effective_io_concurrency in the mix is going to be a little confusing for
users (well, it is to me). Maybe that README.md could elaborate a little
bit on the relation between those two? Or maybe do you plan to remove
io_max_concurrency and bind it to effective_io_concurrency in future? To
add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too while the
earlier mentioned AioChooseMaxConccurrency() goes up to just 64
4. While we are at this, shouldn't the patch rename debug_io_direct to
simply io_direct so that GUCs are consistent in terms of naming?
5. It appears that pg_stat_io.reads is not refreshed until the
query finishes. While running a query for minutes with this
patchset, I've got:
now | reads | read_time
-------------------------------+----------+-----------
2024-11-15 12:09:09.151631+00 | 15004271 | 0
[..]
2024-11-15 12:10:25.241175+00 | 15004271 | 0
2024-11-15 12:10:26.241179+00 | 15004271 | 0
2024-11-15 12:10:27.241139+00 | 18250913 | 0
Or is that how it is supposed to work? Also pg_stat_io.read_time would be
something vague with io_uring/worker, so maybe zero is good here (?).
Otherwise we would have to measure time spent on waiting alone, but that
would force more instructions for calculating io times...
6. After playing with some basic measurements - which went fine, I wanted
to go test simple PostGIS even with sequential scans to see any
compatibility issues (AFAIR Thomas Munro on PGConfEU indicated as good
testing point), but before that I've tried to see what's the TOAST
performance alone with AIO+DIO (debug_io_direct=data). One issue I have
found is that DIO seems to be unusable until somebody teaches TOAST to
use readstreams, is that correct? Maybe I'm doing something wrong, but I
haven't seen any TOAST <-> readstreams topic:
-- 12MB table , 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
master 101897.823 ms (01:41.898)
AIO 99758.399 ms (01:39.758)
AIO+DIO 191479.079 ms (03:11.479)
hotpath was detoast_attr() -> toast_fetch_datum() ->
heap_fetch_toast_slice() -> systable_getnext_ordered() ->
index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
ReadBufferExtended -> AIO code.
The difference is that on cold caches DIO gets a 2x slowdown; with a clean
s_b and so on:
* getting normal heap data seqscan: up to 285MB/s
* but TOASTs maxes out at 132MB/s when using io_uring+DIO
Not about patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------
7. Is pg_stat_aios still on the table or not ? (AIO 2021 had it). Any hints
on how to inspect real I/O calls requested to review if the code is issuing
sensible calls: there's no strace for uring, or do you stick to DEBUG3 or
perhaps using some bpftrace / xfsslower is the best way to go ?
8. Not sure if that helps, but I've somehow managed to hit the
impossible situation you describe in pgaio_uring_submit() "(ret !=
num_staged_ios)", but I had to push urings really hard into using futexes
and probably I could have made some error in my coding too for that to occur
[3]. As it stands in that patch from my thread, it was not covered: /*
FIXME: fix ret != submitted ?! seems like bug?! */ (but I had hit that
code-path pretty often with a 6.10.x kernel)
9. Please let me know, what's the current up to date line of thinking about
this patchset: is it intended to be committed as v18 ? As a debug feature
or as non-debug feature? (that is which of the IO methods should be
scrutinized the most as it is going to be the new default - sync or worker?)
10. At this point, does it even make sense to give an experimental try to
pwritev2() with RWF_ATOMIC? (that thing is already in the open, but XFS is
going to cover it with 6.12.x apparently, but I could try with some -rcX)
-J.
p.s. I hope I did not ask stupid questions nor missed anything.
[0]: /messages/by-id/237y5rabqim2c2v37js53li6i34v2525y2baf32isyexecn4ic@bqmlx5mrnwuf
- "Right now the io_uring mode has each backend's io_uring instance visible to each other process. (..)"
[1]: /messages/by-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com
- sentence after: "FWIW there's another bottleneck people may not realize (..)"
[2]: /messages/by-id/x3f32prdpgalmiieyialqtn53j5uvb2e4c47nvnjetkipq3zyk@xk7jy7fnua6w
[3]: /messages/by-id/CAKZiRmwrBjCbCJ433wV5zjvwt_OuY7BsVX12MBKiBu+eNZDm6g@mail.gmail.com
Hi,
Sorry for losing track of your message for this long, I saw it just now
because I was working on posting a new version.
On 2024-11-18 13:19:58 +0100, Jakub Wartak wrote:
On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:
Thank you for your admirable persistence on this. Please do not take it as
criticism, just a set of questions regarding the patchset v2.1 that
I finally got a little time to play with:
0. Doesn't the v2.1-0011-aio-Add-io_uring-method.patch -> in
pgaio_uring_submit() -> io_uring_get_sqe() need a return value check ?
Yea, it shouldn't ever happen, but it's worth adding a check.
Otherwise we'll never know that SQ is full in theory, perhaps at least such
a check should be made with Assert() ? (I understand right now that we
allow just up to io_uring_queue_init(io_max_concurrency), but what happens
if:
a. previous io_uring_submit() failed for some reason and we do not have
free space for SQ?
We'd have PANICed at that failure :)
b. (hypothetical) someday someone will try to make PG multithreaded and the
code starts using just one big queue, still without checking for
io_uring_get_sqe()?
That'd not make sense - you'd still want to use separate rings, to avoid
contention.
1. In [0] you wrote that there's this high amount of FDs consumed for
io_uring (dangerously close to RLIMIT_NOFILE). I can attest that there are
many customers who are using extremely high max_connections (4k-5k, but
there outliers with 10k in the wild too) - so they won't even start - and I
have one doubt on the user-friendliness impact of this. I'm quite certain
it's going to be the same as with pgbouncer where one is forced to tweak
OS(systemd/pam/limits.conf/etc), but in PG we are better because PG tries
to preallocate and then close() a lot of FDs, so that's safer in runtime.
IMVHO even if we just consume e.g. say > 30% of FDs just for io_uring, the
max_files_per_process loses its spirit a little bit and PG is going to
start losing efficiency too due to frequent open()/close() calls as the fd cache
is too small. Tomas also complained about it some time ago in [1])
My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.
In most distros the soft ulimit is set to something like 1024, but the hard
limit is much higher. The reason for that is that some applications try to
close all fds between 0 and RLIMIT_NOFILE - which takes a long time if
RLIMIT_NOFILE is high. By setting only the soft limit to a low value any
application needing higher limits can just opt into using more FDs.
On several of my machines the hard limit is 1073741816.
So maybe it would be good to introduce couple of sanity checks too (even
after setting higher limit):
- issue FATAL in case of using io_method = io_ring && max_connections would
be close to getrusage(RLIMIT_NOFILE)
- issue warning in case of using io_method = io_ring && we wouldn't have
even real 1k FDs free for handling relation FDs (detect something bad like:
getrusage(RLIMIT_NOFILE) <= max_connections + max_files_per_process)
Probably still worth adding something like this, even if we were to do what I
am suggesting above.
2. In pgaio_uring_postmaster_child_init_local() there
"io_uring_queue_init(32,...)" - why 32? :) And also there's separate
io_uring_queue_init(io_max_concurrency) which seems to be derived from
AioChooseMaxConccurrency() which can go up to 64?
Yea, that's probably not right.
3. I find having two such similarly named GUCs
(effective_io_concurrency, io_max_concurrency) confusing. It is clear from the IO_URING
perspective what is io_max_concurrency all about, but I bet having also
effective_io_concurrency in the mix is going to be a little confusing for
users (well, it is to me). Maybe that README.md could elaborate a little
bit on the relation between those two? Or maybe do you plan to remove
io_max_concurrency and bind it to effective_io_concurrency in future?
io_max_concurrency is a hard maximum that needs to be set at server start,
because it requires allocating shared memory. Whereas effective_io_concurrency
can be changed on a per-session and per-tablespace
basis. I.e. io_max_concurrency is a hard upper limit for an entire backend,
whereas effective_io_concurrency controls how much one scan (or whatever does
prefetching) can issue.
To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too while
the earlier mentioned AioChooseMaxConccurrency() goes up to just 64
Yea, that should probably be disambiguated.
4. While we are at this, shouldn't the patch rename debug_io_direct to
simply io_direct so that GUCs are consistent in terms of naming?
I used to have a patch like that in the series and it was a pain to
rebase...
I also suspect this is not quite enough to make debug_io_direct
production ready, even if just considering io_direct=data. Without streaming
read use in heap + index VACUUM, RelationCopyStorage() and a few other places
the performance consequences of using direct IO can be, um, surprising.
5. It appears that pg_stat_io.reads is not refreshed until the
query finishes. While running a query for minutes with this
patchset, I've got:
now | reads | read_time
-------------------------------+----------+-----------
2024-11-15 12:09:09.151631+00 | 15004271 | 0
[..]
2024-11-15 12:10:25.241175+00 | 15004271 | 0
2024-11-15 12:10:26.241179+00 | 15004271 | 0
2024-11-15 12:10:27.241139+00 | 18250913 | 0
Or is that how it is supposed to work?
Currently the patch has a FIXME to add some IO statistics (I think I raised
that somewhere in this thread, too). It's not clear to me what IO time ought
to mean. I suspect the least bad answer is what you suggest:
Also pg_stat_io.read_time would be something vague with io_uring/worker, so
maybe zero is good here (?). Otherwise we would have to measure time spent
on waiting alone, but that would force more instructions for calculating io
times...
I.e. we should track the amount of time spent waiting for IOs.
I don't think tracking time in worker or such would make much sense, that'd
often end up with reporting more IO time than a query took.
6. After playing with some basic measurements - which went fine, I wanted
to go test simple PostGIS even with sequential scans to see any
compatibility issues (AFAIR Thomas Munro on PGConfEU indicated as good
testing point), but before that I've tried to see what's the TOAST
performance alone with AIO+DIO (debug_io_direct=data).
It's worth noting that with the last posted version you needed to increase
effective_io_concurrency to something very high to see sensible
performance.
That's due to the way read_stream_begin_impl() limited the number of buffers
pinned to effective_io_concurrency * 4 - which, due to io_combine_limit, ends
up allowing only a single IO in flight in case of sequential blocks until
effective_io_concurrency is set to 8 or such. I've adjusted that to some
degree now, but I think that might need a bit more sophistication.
One issue I have found is that DIO seems to be unusable until somebody
teaches TOAST to use readstreams, is that correct? Maybe I'm doing something
wrong, but I haven't seen any TOAST <-> readstreams topic:
Hm, I suspect that a read stream won't help a whole lot in many toast
cases. Unless you have particularly long toast datums, the time is going to be
dominated by the random accesses, as each toast datum is looked up in a
non-predictable way.
Generally, using DIO requires tuning shared buffers much more aggressively
than not using DIO, no amount of stream use will change that. Of course we
should try to reduce that "downside"...
I'm not sure if the best way to do prefetching toast chunks would be to rely
on more generalized index->table prefetching support, or to have dedicated code.
-- 12MB table , 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
master 101897.823 ms (01:41.898)
AIO 99758.399 ms (01:39.758)
AIO+DIO 191479.079 ms (03:11.479)
hotpath was detoast_attr() -> toast_fetch_datum() ->
heap_fetch_toast_slice() -> systable_getnext_ordered() ->
index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
ReadBufferExtended -> AIO code.
The difference is that on cold caches DIO gets a 2x slowdown; with a clean
s_b and so on:
* getting normal heap data seqscan: up to 285MB/s
* but TOASTs maxes out at 132MB/s when using io_uring+DIO
I started loading the data to try this out myself :).
Not about patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------
7. Is pg_stat_aios still on the table or not ? (AIO 2021 had it). Any hints
on how to inspect real I/O calls requested to review if the code is issuing
sensible calls: there's no strace for uring, or do you stick to DEBUG3 or
perhaps using some bpftrace / xfsslower is the best way to go ?
I think we still want something like it, but I don't think it needs to be in
the initial commits.
There are kernel events that you can track using e.g. perf. Particularly
useful are
io_uring:io_uring_submit_req
io_uring:io_uring_complete
8. Not sure if that helps, but I've somehow managed to hit the
impossible situation you describe in pgaio_uring_submit() "(ret !=
num_staged_ios)", but I had to push urings really hard into using futexes
and probably I could have made some error in my coding too for that to occur
[3]. As it stands in that patch from my thread, it was not covered: /*
FIXME: fix ret != submitted ?! seems like bug?! */ (but I had hit that
code-path pretty often with a 6.10.x kernel)
I think you can hit that if you don't take care to limit the number of IOs
being submitted at once or if you're not consuming completions. If the
completion queue is full enough the kernel at some point won't allow more IOs
to be submitted.
9. Please let me know, what's the current up to date line of thinking about
this patchset: is it intended to be committed as v18 ?
I'd love to get some of it into 18. I don't quite know whether we can make it
happen and to what extent.
As a debug feature or as non-debug feature? (that is which of the IO methods
should be scrutinized the most as it is going to be the new default - sync
or worker?)
I'd say initially worker, with a beta 1 or 2 checklist item to revise it.
10. At this point, does it even make sense to give an experimental try to
pwritev2() with RWF_ATOMIC? (that thing is already in the open, but XFS is
going to cover it with 6.12.x apparently, but I could try with some -rcX)
I don't think that's worth doing right now. There's too many dependencies and
it's going to be a while till the kernel support for that is widespread enough
to matter.
There's also the issue that, to my knowledge, outside of cloud environments
there's pretty much no hardware that actually reports power-fail atomicity
sizes bigger than a sector.
p.s. I hope I did not ask stupid questions nor missed anything.
You did not!
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.
I'm seriously down on that, because it amounts to an assumption that
we own the machine and can appropriate all its resources. If ENFILE
weren't a thing, it'd be all right, but that is a thing. We have no
business trying to consume resources the DBA didn't tell us we could.
regards, tom lane
Hi,
On 2024-12-19 17:34:29 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.
I'm seriously down on that, because it amounts to an assumption that
we own the machine and can appropriate all its resources. If ENFILE
weren't a thing, it'd be all right, but that is a thing. We have no
business trying to consume resources the DBA didn't tell us we could.
Arguably the configuration *did* tell us, by having a higher hard limit...
I'm not saying that we should increase the limit without a bound or without a
configuration option, btw.
As I had mentioned, the problem with relying on increasing the soft limit is
that it's not generally sensible to do so, because it causes a bunch of
binaries to be weirdly slow.
Another reason to not increase the soft rlimit is that doing so can break
programs relying on select().
But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?
Greetings,
Andres Freund
On Fri, 20 Dec 2024 at 01:54, Andres Freund <andres@anarazel.de> wrote:
Arguably the configuration *did* tell us, by having a higher hard limit...
<snip>
But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?
Yes, totally fine. That's exactly the reasoning why the hard limit is
so much larger than the soft limit by default on systems with systemd:
Hi,
On 2024-12-20 18:27:13 +0100, Jelte Fennema-Nio wrote:
On Fri, 20 Dec 2024 at 01:54, Andres Freund <andres@anarazel.de> wrote:
Arguably the configuration *did* tell us, by having a higher hard limit...
<snip>
But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?
Yes, totally fine. That's exactly the reasoning why the hard limit is
so much larger than the soft limit by default on systems with systemd:
Good link.
This isn't just relevant for using io_uring:
There obviously are several people working on threaded postgres. Even if we
didn't duplicate fd.c file descriptors between threads (we probably will, at
least initially), the client connection FDs alone will mean that we have a lot
more FDs open. Due to the select() issue the soft limit won't be increased
beyond 1024, and requiring everyone to add a 'ulimit -n $somehighnumber' before
starting postgres on Linux doesn't help anyone.
Greetings,
Andres Freund
Hi,
Attached is a new version of the AIO patchset.
The biggest changes are:
- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. It'd be very good to get
feedback from folks who aren't as familiar with AIO; I can't really see
what's easy/hard anymore.
- The read/write patches and the bounce buffer patches are split out, so that
there's no dependency between the first few AIO patches and the "don't dirty
while IO is going on" patchset [1].
- Retries for partial IOs (i.e. short reads) are now implemented. Turned out
to take all of three lines and adding one missing variable initialization.
- I added quite a lot of function-header and file-header comments. There's
more to be done here, but see also the TODO section below.
- IO stats are now tracked. Specifically, the "time" for an IO is now the time
spent waiting for an IO, as discussed around [2]. I haven't updated the
docs yet.
- There now is a fastpath for executing AIO "synchronously", i.e. preparing an
IO and immediately submitting it.
- Previously one needed very large effective_io_concurrency values to get
sufficient asynchronous IO for sequential scans, as read_stream.c limited
max_pinned_buffers to effective_io_concurrency * 4. Unless
effective_io_concurrency was very high, that'd only allow a single IO to be
in-flight, due to io_combine_limit buffers getting merged into one IO.
Instead the pin limit is now capped by effective_io_concurrency *
io_combine_limit.
Right now that's part of one larger "hack up read_stream.c" commit; Thomas
said he'd take a look at how to do this properly. This is probably
something we could and should commit separately.
- io_method = sync has been made more similar to the way IO happens today. In
particular, we now continue to issue prefetch requests and the actual IO is
done only within WaitReadBuffers().
- When using buffered IO with io_uring, there previously was a small
regression, due to more IO happening in the process context with io_uring
(instead of in a kernel thread). While one could argue that it's better to
not increase CPU usage beyond one process, I don't find that sufficiently
convincing. To work around that I added a heuristic that tells io_uring to
execute IOs using its worker infrastructure. That seems to have fixed this
problem entirely.
- IO worker infrastructure was cleaned up
- I pushed a few minor preliminary commits a while ago
- lots of other smaller stuff
The biggest TODOs are:
- Right now the API between bufmgr.c and read_stream.c kind of necessitates
that one StartReadBuffers() call actually can trigger multiple IOs, if
one of the buffers was read in by another backend, before "this" backend
called StartBufferIO().
I think Thomas and I figured out a way to evolve the interface so that this
isn't necessary anymore:
We allow StartReadBuffers() to memorize buffers it pinned but didn't
initiate IO on in the buffers[] argument. The next call to StartReadBuffers
then doesn't have to repin these buffers. That doesn't just solve the
multiple-IOs-for-one-"read operation" issue, it also makes the very common
case of a bunch of "buffer misses" followed by a "buffer hit" cleaner: the
hit wouldn't be tracked in the same ReadBuffersOperation anymore.
- Right now bufmgr.h includes aio.h, because it needs to include a reference
to the AIO's result in ReadBuffersOperation. Requiring a dynamic allocation
would be noticeable overhead, so that's not an option. I think the best
option here would be to introduce something like aio_types.h, so fewer
things are included.
- There's no obvious way to tell "internal" functions operating on an IO handle
apart from functions that are expected to be called by the issuer of an IO.
One way to deal with this would be to introduce a distinct "issuer IO
reference" type. I think that might be a good idea, it would also make it
clearer that a good number of the functions can only be called by the
issuer, before the IO is submitted.
This would also make it easier to order functions more sensibly in aio.c, as
all the issuer functions would be together.
The functions on AIO handles that everyone can call already have a distinct
type (PgAioHandleRef vs PgAioHandle*).
- While I've added a lot of comments, I only got so far adding them. More are
needed.
- The naming around PgAioReturn, PgAioResult, PgAioResultStatus needs to be
improved
- The debug logging functions are a bit of a mess, lots of very similar code
in lots of places. I think AIO needs a few ereport() wrappers to make this
easier.
- More tests are needed. None of our current test frameworks really makes this
easy :(.
- Several folks asked for pg_stat_aio to come back, in "v1" that showed the
set of currently in-flight AIOs. That's not particularly hard - except
that it doesn't really fit in the pg_stat_* namespace.
- I'm not sure that effective_io_concurrency as we have it right now really
makes sense, particularly not with the current default values. But that's a
mostly independent change.
Greetings,
Andres Freund
[1]: /messages/by-id/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
[2]: /messages/by-id/tp63m6tcbi7mmsjlqgxd55sghhwvjxp3mkgeljffkbaujezvdl@fvmdr3c6uhat
Attachments:
v2-0010-aio-Implement-smgr-md.c-aio-methods.patch (text/x-diff)
From 45154f1e08ee325875673c14470479f019ef0461 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 15 Dec 2024 12:36:32 -0500
Subject: [PATCH v2 10/20] aio: Implement smgr/md.c aio methods
---
src/include/storage/aio.h | 17 +-
src/include/storage/fd.h | 6 +
src/include/storage/md.h | 12 +
src/include/storage/smgr.h | 21 ++
src/backend/storage/aio/aio_subject.c | 4 +
src/backend/storage/file/fd.c | 68 ++++++
src/backend/storage/smgr/md.c | 314 ++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 91 ++++++++
8 files changed, 532 insertions(+), 1 deletion(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index a1633a0ed3d..d693b0b0bd8 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -55,9 +55,10 @@ typedef enum PgAioSubjectID
{
/* intentionally the zero value, to help catch zeroed memory etc */
ASI_INVALID = 0,
+ ASI_SMGR,
} PgAioSubjectID;
-#define ASI_COUNT (ASI_INVALID + 1)
+#define ASI_COUNT (ASI_SMGR + 1)
/*
* Flags for an IO that can be set with pgaio_io_set_flag().
@@ -100,6 +101,9 @@ typedef enum PgAioHandleFlags
typedef enum PgAioHandleSharedCallbackID
{
ASC_INVALID,
+
+ ASC_MD_READV,
+ ASC_MD_WRITEV,
} PgAioHandleSharedCallbackID;
@@ -135,6 +139,17 @@ typedef union
typedef union PgAioSubjectData
{
+ struct
+ {
+ RelFileLocator rlocator; /* physical relation identifier */
+ BlockNumber blockNum; /* blknum relative to begin of reln */
+ int nblocks;
+ ForkNumber forkNum:8; /* don't waste 4 byte for four values */
+ bool is_temp; /* proc can be inferred by owning AIO */
+ bool release_lock;
+ int8 mode;
+ } smgr;
+
/* just as an example placeholder for later */
struct
{
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 1456ab383a4..e993e1b671f 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -101,6 +101,8 @@ extern PGDLLIMPORT int max_safe_fds;
* prototypes for functions in fd.c
*/
+struct PgAioHandle;
+
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -109,6 +111,10 @@ extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
+extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartWriteV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index e7671dd6c18..c3a18465c6b 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
#include "storage/smgr.h"
#include "storage/sync.h"
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_md_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_md_writev_cb;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
@@ -36,9 +40,16 @@ extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void mdstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks, bool skipFsync);
+extern void mdstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync);
extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -46,6 +57,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber old_blocks, BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 63a186bd346..fe23a7f744f 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -73,6 +73,11 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
+struct PgAioHandle;
+struct PgAioSubjectInfo;
+
+extern const struct PgAioSubjectInfo aio_smgr_subject_info;
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -97,10 +102,19 @@ extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+extern void smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -110,6 +124,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
@@ -127,4 +142,10 @@ smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
}
+extern void pgaio_io_set_subject_smgr(struct PgAioHandle *ioh,
+ SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks);
+
#endif /* SMGR_H */
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 8694cfafcd1..effb09c11c7 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -20,6 +20,7 @@
#include "storage/aio_internal.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -35,6 +36,7 @@ static const PgAioSubjectInfo *aio_subject_info[] = {
[ASI_INVALID] = &(PgAioSubjectInfo) {
.name = "invalid",
},
+ [ASI_SMGR] = &aio_smgr_subject_info,
};
@@ -46,6 +48,8 @@ typedef struct PgAioHandleSharedCallbacksEntry
static const PgAioHandleSharedCallbacksEntry aio_shared_cbs[] = {
#define CALLBACK_ENTRY(id, callback) [id] = {.cb = &callback, .name = #callback}
+ CALLBACK_ENTRY(ASC_MD_READV, aio_md_readv_cb),
+ CALLBACK_ENTRY(ASC_MD_WRITEV, aio_md_writev_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7c403fb360e..eeb6288a9b5 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -94,6 +94,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/startup.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
@@ -1294,6 +1295,8 @@ LruDelete(File file)
vfdP = &VfdCache[file];
+ pgaio_closing_fd(vfdP->fd);
+
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1987,6 +1990,8 @@ FileClose(File file)
if (!FileIsNotOpen(file))
{
+ pgaio_closing_fd(vfdP->fd);
+
/* close the file */
if (close(vfdP->fd) != 0)
{
@@ -2210,6 +2215,32 @@ retry:
return returnCode;
}
+int
+FileStartReadV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
ssize_t
FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
uint32 wait_event_info)
@@ -2315,6 +2346,34 @@ retry:
return returnCode;
}
+int
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ /* FIXME: think about / reimplement temp_file_limit */
+
+ pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
int
FileSync(File file, uint32 wait_event_info)
{
@@ -2498,6 +2557,12 @@ FilePathName(File file)
int
FileGetRawDesc(File file)
{
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
Assert(FileIsValid(file));
return VfdCache[file].fd;
}
@@ -2778,6 +2843,7 @@ FreeDesc(AllocateDesc *desc)
result = closedir(desc->desc.dir);
break;
case AllocateDescRawFD:
+ pgaio_closing_fd(desc->desc.fd);
result = close(desc->desc.fd);
break;
default:
@@ -2846,6 +2912,8 @@ CloseTransientFile(int fd)
/* Only get here if someone passes us a file not in allocatedDescs */
elog(WARNING, "fd passed to CloseTransientFile was not obtained from OpenTransientFile");
+ pgaio_closing_fd(fd);
+
return close(fd);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 11fccda475f..b1277ed97ae 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,6 +31,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/md.h"
@@ -132,6 +133,22 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+static PgAioResult md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+static PgAioResult md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_writev_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+const struct PgAioHandleSharedCallbacks aio_md_readv_cb = {
+ .complete = md_readv_complete,
+ .error = md_readv_error,
+};
+
+const struct PgAioHandleSharedCallbacks aio_md_writev_cb = {
+ .complete = md_writev_complete,
+ .error = md_writev_error,
+};
+
+
static inline int
_mdfd_open_flags(void)
{
@@ -927,6 +944,52 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartreadv(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ if (!(io_direct_flags & IO_DIRECT_DATA))
+ pgaio_io_set_flag(ioh, AHF_BUFFERED);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
+
+ FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+}
+
/*
* mdwritev() -- Write the supplied blocks at the appropriate location.
*
@@ -1032,6 +1095,52 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartwritev(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, unconstify(void **, buffers), nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ if (!(io_direct_flags & IO_DIRECT_DATA))
+ pgaio_io_set_flag(ioh, AHF_BUFFERED);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_WRITEV);
+
+ FileStartWriteV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+}
+
/*
* mdwriteback() -- Tell the kernel to write pages back to storage.
@@ -1355,6 +1464,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
+int
+mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ MdfdVec *v = mdopenfork(reln, forknum, EXTENSION_FAIL);
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL);
+
+ *off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ return FileGetRawDesc(v->mdfd_vfd);
+}
+
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1838,3 +1962,193 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
*/
return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
}
+
+/*
+ * AIO completion callback for mdstartreadv().
+ */
+static PgAioResult
+md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioResult result = prior_result;
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ /* For "hard" errors, track the error number in error_data */
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_readv_error(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks read a failure */
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ result.error_data = 0;
+
+ md_readv_error(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial reads should be retried at upper level */
+ result.status = ARS_PARTIAL;
+ result.id = ASC_MD_READV;
+ }
+
+ /* AFIXME: post-read portion of mdreadv() */
+
+ return result;
+}
+
+/*
+ * AIO error reporting callback for mdstartreadv().
+ */
+static void
+md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not read blocks %u..%u in file \"%s\": %m",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ }
+ else
+ {
+ /*
+ * NB: This will typically only be output in debug messages, while
+ * retrying a partial IO.
+ */
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read blocks %u..%u in file \"%s\": read only %zu of %zu bytes",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks - 1,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum),
+ result.result * (size_t) BLCKSZ,
+ subject_data->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ MemoryContextSwitchTo(oldContext);
+}
+
+/*
+ * AIO completion callback for mdstartwritev().
+ */
+static PgAioResult
+md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioResult result = prior_result;
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_WRITEV;
+ /* For "hard" errors, track the error number in error_data */
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_writev_error(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks written a failure */
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_WRITEV;
+ result.error_data = 0;
+
+ md_writev_error(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial writes should be retried at upper level */
+ result.status = ARS_PARTIAL;
+ result.id = ASC_MD_WRITEV;
+ }
+
+ if (prior_result.status == ARS_ERROR)
+ {
+ /* AFIXME: complain */
+ return prior_result;
+ }
+
+ prior_result.result /= BLCKSZ;
+
+ return prior_result;
+}
+
+/*
+ * AIO error reporting callback for mdstartwritev().
+ */
+static void
+md_writev_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not write blocks %u..%u in file \"%s\": %m",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ }
+ else
+ {
+ /*
+ * NB: This will typically only be output in debug messages, while
+ * retrying a partial IO.
+ */
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not write blocks %u..%u in file \"%s\": wrote only %zu of %zu bytes",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks - 1,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum),
+ result.result * (size_t) BLCKSZ,
+ subject_data->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ MemoryContextSwitchTo(oldContext);
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 36ad34aa6ac..454ebe9c243 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,6 +53,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
@@ -93,10 +94,19 @@ typedef struct f_smgr
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+ void (*smgr_startreadv) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
void (*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+ void (*smgr_startwritev) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -104,6 +114,7 @@ typedef struct f_smgr
BlockNumber old_blocks, BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+ int (*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -121,12 +132,15 @@ static const f_smgr smgrsw[] = {
.smgr_prefetch = mdprefetch,
.smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
+ .smgr_startreadv = mdstartreadv,
.smgr_writev = mdwritev,
+ .smgr_startwritev = mdstartwritev,
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
.smgr_registersync = mdregistersync,
+ .smgr_fd = mdfd,
}
};
@@ -145,6 +159,14 @@ static void smgrshutdown(int code, Datum arg);
static void smgrdestroy(SMgrRelation reln);
+static void smgr_aio_reopen(PgAioHandle *ioh);
+
+const struct PgAioSubjectInfo aio_smgr_subject_info = {
+ .name = "smgr",
+ .reopen = smgr_aio_reopen,
+};
+
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
* managers.
@@ -623,6 +645,19 @@ smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nblocks);
}
+/*
+ * FILL ME IN
+ */
+void
+smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ smgrsw[reln->smgr_which].smgr_startreadv(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks);
+}
+
/*
* smgrwritev() -- Write the supplied buffers out.
*
@@ -657,6 +692,16 @@ smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
buffers, nblocks, skipFsync);
}
+void
+smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_startwritev(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks, skipFsync);
+}
+
/*
* smgrwriteback() -- Trigger kernel writeback for the supplied range of
* blocks.
@@ -819,6 +864,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+int
+smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ return smgrsw[reln->smgr_which].smgr_fd(reln, forknum, blocknum, off);
+}
+
/*
* AtEOXact_SMgr
*
@@ -847,3 +898,43 @@ ProcessBarrierSmgrRelease(void)
smgrreleaseall();
return true;
}
+
+void
+pgaio_io_set_subject_smgr(PgAioHandle *ioh,
+ struct SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+
+ pgaio_io_set_subject(ioh, ASI_SMGR);
+
+ /* backend is implied via IO owner */
+ sd->smgr.rlocator = smgr->smgr_rlocator.locator;
+ sd->smgr.forkNum = forknum;
+ sd->smgr.blockNum = blocknum;
+ sd->smgr.nblocks = nblocks;
+ sd->smgr.is_temp = SmgrIsTemp(smgr);
+ sd->smgr.release_lock = false;
+ sd->smgr.mode = RBM_NORMAL;
+}
+
+static void
+smgr_aio_reopen(PgAioHandle *ioh)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioOpData *od = pgaio_io_get_op_data(ioh);
+ SMgrRelation reln;
+ ProcNumber procno;
+ uint32 off;
+
+ if (sd->smgr.is_temp)
+ procno = pgaio_io_get_owner(ioh);
+ else
+ procno = INVALID_PROC_NUMBER;
+
+ reln = smgropen(sd->smgr.rlocator, procno);
+ od->read.fd = smgrfd(reln, sd->smgr.forkNum, sd->smgr.blockNum, &off);
+ Assert(off == od->read.offset);
+}
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0011-bufmgr-Implement-AIO-read-support.patch (text/x-diff; charset=us-ascii)
From 7a42b48f7421f071dab6cff273e4cc5b1c3c755f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:01 -0400
Subject: [PATCH v2 11/20] bufmgr: Implement AIO read support
As of this commit there are no users of these AIO facilities; they'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 4 +
src/include/storage/buf_internals.h | 6 +
src/include/storage/bufmgr.h | 8 +
src/backend/storage/aio/aio_subject.c | 4 +
src/backend/storage/buffer/buf_init.c | 3 +
src/backend/storage/buffer/bufmgr.c | 364 +++++++++++++++++++++++++-
src/backend/storage/buffer/localbuf.c | 65 +++++
7 files changed, 447 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index d693b0b0bd8..ff44dac5bb2 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -104,6 +104,10 @@ typedef enum PgAioHandleSharedCallbackID
ASC_MD_READV,
ASC_MD_WRITEV,
+
+ ASC_SHARED_BUFFER_READ,
+
+ ASC_LOCAL_BUFFER_READ,
} PgAioHandleSharedCallbackID;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index eda6c699212..37520890073 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -17,6 +17,7 @@
#include "pgstat.h"
#include "port/atomics.h"
+#include "storage/aio_ref.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -251,6 +252,8 @@ typedef struct BufferDesc
int wait_backend_pgprocno; /* backend of pin-count waiter */
int freeNext; /* link in freelist chain */
+
+ PgAioHandleRef io_in_progress;
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
@@ -464,4 +467,7 @@ extern void DropRelationLocalBuffers(RelFileLocator rlocator,
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
extern void AtEOXact_LocalBuffers(bool isCommit);
+
+extern bool ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..ca8e8b51e68 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -177,6 +177,12 @@ extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
extern PGDLLIMPORT int32 *LocalRefCount;
+
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_readv_cb;
+
+
/* upper limit for effective_io_concurrency */
#define MAX_IO_CONCURRENCY 1000
@@ -194,6 +200,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+struct PgAioHandle;
+
extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index effb09c11c7..21341aae425 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -50,6 +50,10 @@ static const PgAioHandleSharedCallbacksEntry aio_shared_cbs[] = {
#define CALLBACK_ENTRY(id, callback) [id] = {.cb = &callback, .name = #callback}
CALLBACK_ENTRY(ASC_MD_READV, aio_md_readv_cb),
CALLBACK_ENTRY(ASC_MD_WRITEV, aio_md_writev_cb),
+
+ CALLBACK_ENTRY(ASC_SHARED_BUFFER_READ, aio_shared_buffer_readv_cb),
+
+ CALLBACK_ENTRY(ASC_LOCAL_BUFFER_READ, aio_local_buffer_readv_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 56761a8eedc..7853b1877e0 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -125,6 +126,8 @@ BufferManagerShmemInit(void)
buf->buf_id = i;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Initially link all the buffers together as unused. Subsequent
* management of this list is done by freelist.c.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2622221809c..c0fb3028c95 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -58,6 +59,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -514,7 +516,8 @@ static int SyncOneBuffer(int buf_id, bool skip_recently_used,
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
- uint32 set_flag_bits, bool forget_owner);
+ uint32 set_flag_bits, bool forget_owner,
+ bool syncio);
static void AbortBufferIO(Buffer buffer);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -1081,7 +1084,7 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
}
else if (!isLocalBuf)
@@ -1566,7 +1569,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
/* Report I/Os as completing individually. */
@@ -2450,7 +2453,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
if (lock)
LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
- TerminateBufferIO(buf_hdr, false, BM_VALID, true);
+ TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
}
pgBufferUsage.shared_blks_written += extend_by;
@@ -3899,7 +3902,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
* end the BM_IO_IN_PROGRESS state.
*/
- TerminateBufferIO(buf, true, 0, true);
+ TerminateBufferIO(buf, true, 0, true, true);
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
buf->tag.blockNum,
@@ -5514,6 +5517,7 @@ WaitIO(BufferDesc *buf)
for (;;)
{
uint32 buf_state;
+ PgAioHandleRef ior;
/*
* It may not be necessary to acquire the spinlock to check the flag
@@ -5521,10 +5525,19 @@ WaitIO(BufferDesc *buf)
* play it safe.
*/
buf_state = LockBufHdr(buf);
+ ior = buf->io_in_progress;
UnlockBufHdr(buf, buf_state);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
+
+ if (pgaio_io_ref_valid(&ior))
+ {
+ pgaio_io_ref_wait(&ior);
+ ConditionVariablePrepareToSleep(cv);
+ continue;
+ }
+
ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
}
ConditionVariableCancelSleep();
@@ -5613,7 +5626,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
*/
static void
TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
- bool forget_owner)
+ bool forget_owner, bool syncio)
{
uint32 buf_state;
@@ -5625,6 +5638,13 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
+ if (!syncio)
+ {
+ /* release ownership by the AIO subsystem */
+ buf_state -= BUF_REFCOUNT_ONE;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+ }
+
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
@@ -5633,6 +5653,40 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
+
+ /*
+ * If we just released a pin, we need to do BM_PIN_COUNT_WAITER handling.
+ * Most of the time the current backend will hold another pin preventing
+ * that from happening, but that's e.g. not the case when completing an IO
+ * another backend started.
+ *
+ * AFIXME: Deduplicate with UnpinBufferNoOwner() or just replace
+ * BM_PIN_COUNT_WAITER with something saner.
+ */
+ /* Support LockBufferForCleanup() */
+ if (buf_state & BM_PIN_COUNT_WAITER)
+ {
+ /*
+ * Acquire the buffer header lock, re-check that there's a waiter.
+ * Another backend could have unpinned this buffer, and already woken
+ * up the waiter. There's no danger of the buffer being replaced
+ * after we unpinned it above, as it's pinned by the waiter.
+ */
+ buf_state = LockBufHdr(buf);
+
+ if ((buf_state & BM_PIN_COUNT_WAITER) &&
+ BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* we just released the last pin other than the waiter's */
+ int wait_backend_pgprocno = buf->wait_backend_pgprocno;
+
+ buf_state &= ~BM_PIN_COUNT_WAITER;
+ UnlockBufHdr(buf, buf_state);
+ ProcSendSignal(wait_backend_pgprocno);
+ }
+ else
+ UnlockBufHdr(buf, buf_state);
+ }
}
/*
@@ -5684,7 +5738,7 @@ AbortBufferIO(Buffer buffer)
}
}
- TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false);
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
}
/*
@@ -6143,3 +6197,299 @@ EvictUnpinnedBuffer(Buffer buf)
return result;
}
+
+static bool
+ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *bufHdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ blockno = bufHdr->tag.blockNum;
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_VALID));
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(!(buf_state & BM_DIRTY));
+ }
+#endif
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);
+ BlockNumber forkNum = bufHdr->tag.forkNum;
+
+ /* AFIXME: relpathperm allocates memory */
+ MemoryContextSwitchTo(ErrorContext);
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ TerminateBufferIO(bufHdr, false,
+ failed ? BM_IO_ERROR : BM_VALID,
+ false, false);
+
+ /* Report I/Os as completing individually. */
+
+ /* FIXME: Should we do TRACE_POSTGRESQL_BUFFER_READ_DONE here? */
+ return buf_failed;
+}
+
+/*
+ * Helper to prepare IO on shared buffers for execution, shared between reads
+ * and writes.
+ */
+static void
+shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+ BufferTag first PG_USED_FOR_ASSERTS_ONLY = {0};
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(buf - 1);
+
+ if (i == 0)
+ first = bufHdr->tag;
+ else
+ {
+ Assert(bufHdr->tag.relNumber == first.relNumber);
+ Assert(bufHdr->tag.blockNum == first.blockNum + i);
+ }
+
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(buf_state & BM_TAG_VALID);
+ if (is_write)
+ {
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_DIRTY);
+ }
+ else
+ Assert(!(buf_state & BM_VALID));
+
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
+
+ buf_state += BUF_REFCOUNT_ONE;
+ bufHdr->io_in_progress = io_ref;
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ if (is_write)
+ {
+ LWLock *content_lock;
+
+ content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+ Assert(LWLockHeldByMe(content_lock));
+
+ /*
+ * Lock is now owned by IO.
+ */
+ LWLockDisown(content_lock);
+ RESUME_INTERRUPTS();
+ }
+
+ /*
+ * Stop tracking this buffer via the resowner - the AIO system now
+ * keeps track.
+ */
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner, buf);
+ }
+}
+
+static void
+shared_buffer_readv_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, false);
+}
+
+static PgAioResult
+shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ elog(DEBUG3, "calling rbcrs for buf %d with failed %d, error: %d, result: %d, data_off: %d",
+ buf, failed, prior_result.status, prior_result.result, io_data_off);
+
+ /*
+ * XXX: It might be better to not set BM_IO_ERROR (which is what
+ * failed = true leads to) when it's just a short read...
+ */
+ buf_failed = ReadBufferCompleteReadShared(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_SHARED_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+buffer_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+ ProcNumber errProc;
+
+ if (subject_data->smgr.is_temp)
+ errProc = MyProcNumber;
+ else
+ errProc = INVALID_PROC_NUMBER;
+
+ /* AFIXME: need infrastructure to allow memory allocation for error reporting */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ subject_data->smgr.blockNum + result.error_data,
+ relpathbackend(subject_data->smgr.rlocator, errProc, subject_data->smgr.forkNum)
+ )
+ );
+ MemoryContextSwitchTo(oldContext);
+}
+
+/*
+ * Helper to prepare IO on local buffers for execution, shared between reads
+ * and writes.
+ */
+static void
+local_buffer_readv_prepare(PgAioHandle *ioh)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetLocalBufferDescriptor(-buf - 1);
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ bufHdr->io_in_progress = io_ref;
+ LocalRefCount[-buf - 1] += 1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
+static PgAioResult
+local_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ buf_failed = ReadBufferCompleteReadLocal(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_LOCAL_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_readv_cb = {
+ .prepare = shared_buffer_readv_prepare,
+ .complete = shared_buffer_readv_complete,
+ .error = buffer_readv_error,
+};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_readv_cb = {
+ .prepare = local_buffer_readv_prepare,
+ .complete = local_buffer_readv_complete,
+ .error = buffer_readv_error,
+};
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6fd1a6418d2..75c4d2570e0 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "executor/instrument.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -620,6 +621,8 @@ InitLocalBuffers(void)
*/
buf->buf_id = -i - 2;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Intentionally do not initialize the buffer's atomic variable
* (besides zeroing the underlying memory above). That way we get
@@ -836,3 +839,65 @@ AtProcExit_LocalBuffers(void)
*/
CheckForLocalBufferLeaks();
}
+
+bool
+ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *buf_hdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ buf_hdr = GetLocalBufferDescriptor(-buffer - 1);
+ blockno = buf_hdr->tag.blockNum;
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&buf_hdr->tag);
+ BlockNumber forkNum = buf_hdr->tag.forkNum;
+
+ MemoryContextSwitchTo(ErrorContext);
+
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ pgaio_io_ref_clear(&buf_hdr->io_in_progress);
+
+ {
+ uint32 buf_state;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ /* release pin held by IO subsystem */
+ LocalRefCount[-buffer - 1] -= 1;
+
+ return buf_failed;
+}
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0012-bufmgr-Use-aio-for-StartReadBuffers.patch (text/x-diff; charset=us-ascii)
From e8a5a6318b0e386afb2c1ed2d7f4cc0372358ade Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:55:59 -0400
Subject: [PATCH v2 12/20] bufmgr: Use aio for StartReadBuffers()
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/bufmgr.h | 27 +-
src/backend/storage/buffer/bufmgr.c | 378 ++++++++++++++++++++--------
2 files changed, 300 insertions(+), 105 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ca8e8b51e68..7a12ef6e9be 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -15,6 +15,7 @@
#define BUFMGR_H
#include "port/pg_iovec.h"
+#include "storage/aio_ref.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -107,10 +108,23 @@ typedef struct BufferManagerRelation
#define BMR_REL(p_rel) ((BufferManagerRelation){.rel = p_rel})
#define BMR_SMGR(p_smgr, p_relpersistence) ((BufferManagerRelation){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+
+
/* Zero out page if reading fails. */
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/* IO will immediately be waited for */
+#define READ_BUFFERS_SYNCHRONOUSLY (1 << 2)
+
+/*
+ * FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
+ * need to include that here. Perhaps this could live in a separate header?
+ */
+#include "storage/aio.h"
struct ReadBuffersOperation
{
@@ -131,6 +145,17 @@ struct ReadBuffersOperation
int flags;
int16 nblocks;
int16 io_buffers_len;
+
+ /*
+ * In some rare-ish cases one operation causes multiple reads (e.g. if a
+ * buffer was concurrently read by another backend). It'd be much better
+ * if we ensured that each ReadBuffersOperation covered only one IO - but
+ * that's not entirely trivial, due to having pinned victim buffers before
+ * starting IOs.
+ */
+ int16 nios;
+ PgAioHandleRef refs[MAX_IO_COMBINE_LIMIT];
+ PgAioReturn returns[MAX_IO_COMBINE_LIMIT];
};
typedef struct ReadBuffersOperation ReadBuffersOperation;
@@ -161,8 +186,6 @@ extern PGDLLIMPORT bool track_io_timing;
extern PGDLLIMPORT int effective_io_concurrency;
extern PGDLLIMPORT int maintenance_io_concurrency;
-#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
-#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
extern PGDLLIMPORT int io_combine_limit;
extern PGDLLIMPORT int checkpoint_flush_after;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c0fb3028c95..89cb7b41b03 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1235,10 +1235,9 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+ flags = READ_BUFFERS_SYNCHRONOUSLY;
if (mode == RBM_ZERO_ON_ERROR)
- flags = READ_BUFFERS_ZERO_ON_ERROR;
- else
- flags = 0;
+ flags |= READ_BUFFERS_ZERO_ON_ERROR;
operation.smgr = smgr;
operation.rel = rel;
operation.persistence = persistence;
@@ -1253,6 +1252,9 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+static bool AsyncReadBuffers(ReadBuffersOperation *operation,
+ int nblocks);
+
static pg_attribute_always_inline bool
StartReadBuffersImpl(ReadBuffersOperation *operation,
Buffer *buffers,
@@ -1288,6 +1290,12 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
* so we stop here.
*/
actual_nblocks = i + 1;
+
+ ereport(DEBUG3,
+ errmsg("found buf %d, idx %i: %s, data %p",
+ buffers[i], i, DebugPrintBufferRefcount(buffers[i]),
+ BufferGetBlock(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
break;
}
else
@@ -1324,28 +1332,51 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
operation->flags = flags;
operation->nblocks = actual_nblocks;
operation->io_buffers_len = io_buffers_len;
+ operation->nios = 0;
- if (flags & READ_BUFFERS_ISSUE_ADVICE)
+ /*
+ * When using AIO, start the IO in the background. If not, issue prefetch
+ * requests if desired by the caller.
+ *
+ * The reason we have a dedicated path for IOMETHOD_SYNC here is to derisk
+ * the introduction of AIO somewhat. It's a large architectural change,
+ * with lots of chances for unanticipated performance effects. Use of
+ * IOMETHOD_SYNC already leads to not actually performing IO
+ * asynchronously, but without the check here we'd execute IO earlier than
+ * we used to.
+ */
+ if (io_method != IOMETHOD_SYNC)
{
- /*
- * In theory we should only do this if PinBufferForBlock() had to
- * allocate new buffers above. That way, if two calls to
- * StartReadBuffers() were made for the same blocks before
- * WaitReadBuffers(), only the first would issue the advice. That'd be
- * a better simulation of true asynchronous I/O, which would only
- * start the I/O once, but isn't done here for simplicity. Note also
- * that the following call might actually issue two advice calls if we
- * cross a segment boundary; in a true asynchronous version we might
- * choose to process only one real I/O at a time in that case.
- */
- smgrprefetch(operation->smgr,
- operation->forknum,
- blockNum,
- operation->io_buffers_len);
+ /* initiate the IO asynchronously */
+ return AsyncReadBuffers(operation, io_buffers_len);
}
+ else
+ {
+ operation->flags |= READ_BUFFERS_SYNCHRONOUSLY;
+
+ if (flags & READ_BUFFERS_ISSUE_ADVICE)
+ {
+ /*
+ * In theory we should only do this if PinBufferForBlock() had to
+ * allocate new buffers above. That way, if two calls to
+ * StartReadBuffers() were made for the same blocks before
+ * WaitReadBuffers(), only the first would issue the
+ * advice. That'd be a better simulation of true asynchronous I/O,
+ * which would only start the I/O once, but isn't done here for
+ * simplicity. Note also that the following call might actually
+ * issue two advice calls if we cross a segment boundary; in a
+ * true asynchronous version we might choose to process only one
+ * real I/O at a time in that case.
+ */
+ smgrprefetch(operation->smgr,
+ operation->forknum,
+ blockNum,
+ operation->io_buffers_len);
+ }
- /* Indicate that WaitReadBuffers() should be called. */
- return true;
+ /* Indicate that WaitReadBuffers() should be called. */
+ return true;
+ }
}
/*
@@ -1397,12 +1428,31 @@ StartReadBuffer(ReadBuffersOperation *operation,
}
static inline bool
-WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+ReadBuffersCanStartIO(Buffer buffer, bool nowait)
{
if (BufferIsLocal(buffer))
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ /*
+ * The buffer could have IO in progress by another scan. Right now
+ * localbuf.c doesn't use IO_IN_PROGRESS, which is why we need this
+ * hack.
+ *
+ * TODO: localbuf.c should use IO_IN_PROGRESS / have an equivalent of
+ * StartBufferIO().
+ */
+ if (pgaio_io_ref_valid(&bufHdr->io_in_progress))
+ {
+ PgAioHandleRef ior = bufHdr->io_in_progress;
+
+ ereport(DEBUG3,
+ errmsg("waiting for temp buffer IO in CSIO"),
+ errhidestmt(true), errhidecontext(true));
+ pgaio_io_ref_wait(&ior);
+ return false;
+ }
+
return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
else
@@ -1412,13 +1462,38 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
void
WaitReadBuffers(ReadBuffersOperation *operation)
{
- Buffer *buffers;
+ IOContext io_context;
+ IOObject io_object;
int nblocks;
- BlockNumber blocknum;
- ForkNumber forknum;
- IOContext io_context;
- IOObject io_object;
- char persistence;
+ bool have_retryable_failure;
+
+ /*
+ * If we get here without any IO operations having been issued, the
+ * io_method == IOMETHOD_SYNC path must have been used. In that case, we
+ * start the IO now - as we used to - just before waiting.
+ */
+ if (operation->nios == 0)
+ {
+ Assert(io_method == IOMETHOD_SYNC);
+ if (!AsyncReadBuffers(operation, operation->io_buffers_len))
+ {
+ /* all blocks were already read in concurrently */
+ return;
+ }
+ }
+
+ if (operation->persistence == RELPERSISTENCE_TEMP)
+ {
+ io_context = IOCONTEXT_NORMAL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ io_context = IOContextForStrategy(operation->strategy);
+ io_object = IOOBJECT_RELATION;
+ }
+
+restart:
/*
* Currently operations are only allowed to include a read of some range,
@@ -1433,15 +1508,101 @@ WaitReadBuffers(ReadBuffersOperation *operation)
if (nblocks == 0)
return; /* nothing to do */
- buffers = &operation->buffers[0];
- blocknum = operation->blocknum;
- forknum = operation->forknum;
- persistence = operation->persistence;
+ Assert(operation->nios > 0);
+ /*
+ * For IO timing we just count the time spent waiting for the IO.
+ *
+ * XXX: We probably should track the IO operation, rather than its time,
+ * separately, when initiating the IO. But right now that's not quite
+ * allowed by the interface.
+ */
+ have_retryable_failure = false;
+ for (int i = 0; i < operation->nios; i++)
+ {
+ PgAioReturn *aio_ret = &operation->returns[i];
+
+ /*
+ * Tracking a wait even if we don't actually need to wait a) is not
+ * cheap and b) reports time as waiting even though we never waited.
+ */
+ if (aio_ret->result.status == ARS_UNKNOWN &&
+ !pgaio_io_ref_check_done(&operation->refs[i]))
+ {
+ instr_time io_start = pgstat_prepare_io_time(track_io_timing);
+
+ pgaio_io_ref_wait(&operation->refs[i]);
+
+ /*
+ * The IO operation itself was already counted earlier, in
+ * AsyncReadBuffers().
+ */
+ pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+ 0);
+ }
+ else
+ {
+ Assert(pgaio_io_ref_check_done(&operation->refs[i]));
+ }
+
+ if (aio_ret->result.status == ARS_PARTIAL)
+ {
+ /*
+ * We'll retry below, so we just emit a debug message to the server
+ * log (or not even that in prod scenarios).
+ */
+ pgaio_result_log(aio_ret->result, &aio_ret->subject_data, DEBUG1);
+ have_retryable_failure = true;
+ }
+ else if (aio_ret->result.status != ARS_OK)
+ pgaio_result_log(aio_ret->result, &aio_ret->subject_data, ERROR);
+ }
+
+ /*
+ * If any of the associated IOs failed, try again to issue IOs. Buffers
+ * for which IO has completed successfully will be discovered as such and
+ * not retried.
+ */
+ if (have_retryable_failure)
+ {
+ nblocks = operation->io_buffers_len;
+
+ elog(DEBUG3, "retrying IO after partial failure");
+ CHECK_FOR_INTERRUPTS();
+ AsyncReadBuffers(operation, nblocks);
+ goto restart;
+ }
+
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * nblocks;
+
+ /* FIXME: READ_DONE tracepoint */
+}
+
+static bool
+AsyncReadBuffers(ReadBuffersOperation *operation,
+ int nblocks)
+{
+ int io_buffers_len = 0;
+ Buffer *buffers = &operation->buffers[0];
+ int flags = operation->flags;
+ BlockNumber blocknum = operation->blocknum;
+ ForkNumber forknum = operation->forknum;
+ IOContext io_context;
+ IOObject io_object;
+ char persistence;
+ bool did_start_io_overall = false;
+ PgAioHandle *ioh = NULL;
+ uint32 ioh_flags = 0;
+
+ persistence = operation->rel
+ ? operation->rel->rd_rel->relpersistence
+ : RELPERSISTENCE_PERMANENT;
if (persistence == RELPERSISTENCE_TEMP)
{
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
+ ioh_flags |= AHF_REFERENCES_LOCAL;
}
else
{
@@ -1449,6 +1610,16 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_object = IOOBJECT_RELATION;
}
+ /*
+ * When this IO is executed synchronously, either because the caller will
+ * immediately block waiting for the IO or because IOMETHOD_SYNC is used,
+ * the AIO subsystem needs to know.
+ */
+ if (flags & READ_BUFFERS_SYNCHRONOUSLY)
+ ioh_flags |= AHF_SYNCHRONOUS;
+
+ operation->nios = 0;
+
/*
* We count all these blocks as read by this backend. This is traditional
* behavior, but might turn out to be not true if we find that someone
@@ -1464,19 +1635,38 @@ WaitReadBuffers(ReadBuffersOperation *operation)
for (int i = 0; i < nblocks; ++i)
{
- int io_buffers_len;
- Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
void *io_pages[MAX_IO_COMBINE_LIMIT];
- instr_time io_start;
+ Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
BlockNumber io_first_block;
+ bool did_start_io_this = false;
/*
- * Skip this block if someone else has already completed it. If an
- * I/O is already in progress in another backend, this will wait for
- * the outcome: either done, or something went wrong and we will
- * retry.
+ * Get IO before ReadBuffersCanStartIO, as pgaio_io_get() might block,
+ * which we don't want after setting IO_IN_PROGRESS.
+ *
+ * XXX: Should we attribute the time spent in here to the IO? If there
+ * already are a lot of IO operations in progress, getting an IO
+ * handle will block waiting for some other IO operation to finish.
+ *
+ * In most cases it'll be free to get the IO, so a timer would be
+ * overhead. Perhaps we should use pgaio_io_get_nb() and only account
+ * IO time when pgaio_io_get_nb() returned false?
*/
- if (!WaitReadBuffersCanStartIO(buffers[i], false))
+ if (likely(!ioh))
+ ioh = pgaio_io_get(CurrentResourceOwner, &operation->returns[operation->nios]);
+
+ /*
+ * Skip this block if someone else has already completed it.
+ *
+ * If an I/O is already in progress in another backend, this will wait
+ * for the outcome: either done, or something went wrong and we will
+ * retry. But don't wait if we have staged, but haven't issued,
+ * another IO.
+ *
+	 * XXX: If we can't start IO due to unsubmitted IO, it might be worth
+	 * submitting and then trying to start IO again.
+ */
+ if (!ReadBuffersCanStartIO(buffers[i], did_start_io_overall))
{
/*
* Report this as a 'hit' for this backend, even though it must
@@ -1488,6 +1678,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
operation->smgr->smgr_rlocator.locator.relNumber,
operation->smgr->smgr_rlocator.backend,
true);
+
+ ereport(DEBUG3,
+ errmsg("can't start io for first buffer %u: %s",
+ buffers[i], DebugPrintBufferRefcount(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
continue;
}
@@ -1497,6 +1692,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_first_block = blocknum + i;
io_buffers_len = 1;
+ ereport(DEBUG5,
+ errmsg("first prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(io_buffers[0]), i),
+ errhidestmt(true), errhidecontext(true));
+
/*
* How many neighboring-on-disk blocks can we scatter-read into other
* buffers at the same time? In this case we don't wait if we see an
@@ -1505,85 +1705,57 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* We'll come back to this block again, above.
*/
while ((i + 1) < nblocks &&
- WaitReadBuffersCanStartIO(buffers[i + 1], true))
+ ReadBuffersCanStartIO(buffers[i + 1], true))
{
/* Must be consecutive block numbers. */
Assert(BufferGetBlockNumber(buffers[i + 1]) ==
BufferGetBlockNumber(buffers[i]) + 1);
+ ereport(DEBUG5,
+ errmsg("seq prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(buffers[i + 1]),
+ i + 1),
+ errhidestmt(true), errhidecontext(true));
+
io_buffers[io_buffers_len] = buffers[++i];
io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
}
- io_start = pgstat_prepare_io_time(track_io_timing);
- smgrreadv(operation->smgr, forknum, io_first_block, io_pages, io_buffers_len);
- pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
- io_buffers_len);
+ pgaio_io_get_ref(ioh, &operation->refs[operation->nios]);
- /* Verify each block we read, and terminate the I/O. */
- for (int j = 0; j < io_buffers_len; ++j)
- {
- BufferDesc *bufHdr;
- Block bufBlock;
+ pgaio_io_set_io_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
- if (persistence == RELPERSISTENCE_TEMP)
- {
- bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
- bufBlock = LocalBufHdrGetBlock(bufHdr);
- }
- else
- {
- bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
- bufBlock = BufHdrGetBlock(bufHdr);
- }
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- memset(bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- }
+ if (persistence == RELPERSISTENCE_TEMP)
+ pgaio_io_add_shared_cb(ioh, ASC_LOCAL_BUFFER_READ);
+ else
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
- /* Terminate I/O and set BM_VALID. */
- if (persistence == RELPERSISTENCE_TEMP)
- {
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+ pgaio_io_set_flag(ioh, ioh_flags);
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
- }
+ did_start_io_overall = did_start_io_this = true;
+ smgrstartreadv(ioh, operation->smgr, forknum, io_first_block,
+ io_pages, io_buffers_len);
+ ioh = NULL;
+ operation->nios++;
- /* Report I/Os as completing individually. */
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
- operation->smgr->smgr_rlocator.locator.spcOid,
- operation->smgr->smgr_rlocator.locator.dbOid,
- operation->smgr->smgr_rlocator.locator.relNumber,
- operation->smgr->smgr_rlocator.backend,
- false);
- }
+ /* not obvious what we'd use for time */
+ pgstat_count_io_op_n(io_object, io_context, IOOP_READ, io_buffers_len);
+ }
+
+ if (ioh)
+ {
+ pgaio_io_release(ioh);
+ ioh = NULL;
+ }
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ if (did_start_io_overall)
+ {
+ pgaio_submit_staged();
+ return true;
}
+ else
+ return false;
}
/*
@@ -6367,7 +6539,7 @@ shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
prior_result.status == ARS_ERROR
|| prior_result.result <= io_data_off;
- elog(DEBUG3, "calling rbcrs for buf %d with failed %d, error: %d, result: %d, data_off: %d",
+ elog(DEBUG5, "calling rbcrs for buf %d with failed %d, error: %d, result: %d, data_off: %d",
buf, failed, prior_result.status, prior_result.result, io_data_off);
/*
--
2.45.2.746.g06e570c0df.dirty
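The control flow of the rewritten read path in the patch above — lazily acquiring an IO handle before ReadBuffersCanStartIO(), staging each (possibly combined) read, and submitting the whole batch once at the end — can be modeled with a small self-contained sketch. The `Sim*` state and function names are invented for illustration; they are not the real PostgreSQL AIO API:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
	int			staged;			/* staged but not yet submitted IOs */
	int			submitted;		/* IOs handed off for execution */
} SimAioState;

static void
sim_submit_staged(SimAioState *s)
{
	s->submitted += s->staged;
	s->staged = 0;
}

/*
 * Mirrors the shape of the patched read path: each loop iteration stages
 * one read (standing in for smgrstartreadv()); nothing is submitted until
 * the end, and even then only if the caller doesn't promise to issue more
 * IO (the READ_BUFFERS_MORE_MORE_MORE case from the following patch).
 */
static bool
sim_read_buffers(SimAioState *s, int nios, bool caller_will_issue_more)
{
	bool		did_start_io = false;

	for (int i = 0; i < nios; i++)
	{
		s->staged++;
		did_start_io = true;
	}

	if (did_start_io && !caller_will_issue_more)
		sim_submit_staged(s);

	return did_start_io;
}
```

The point of the batching is that several combined reads issued by one call end up in a single submission syscall (or worker wakeup), rather than one per read.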
Attachment: v2-0013-aio-Very-WIP-read_stream.c-adjustments-for-real-A.patch (text/x-diff)
From b7123290712da81631ecfbfb2437b95eb42a8e9c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:30 -0400
Subject: [PATCH v2 13/20] aio: Very-WIP: read_stream.c adjustments for real
AIO
---
src/include/storage/bufmgr.h | 2 ++
src/backend/storage/aio/read_stream.c | 31 +++++++++++++++++++++------
src/backend/storage/buffer/bufmgr.c | 3 ++-
3 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7a12ef6e9be..2a836cf98c6 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -119,6 +119,8 @@ typedef struct BufferManagerRelation
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
/* IO will immediately be waited for */
#define READ_BUFFERS_SYNCHRONOUSLY (1 << 2)
+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 3)
/*
* FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 3d30e6224f7..5b5bae16c44 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -90,6 +90,7 @@
#include "postgres.h"
#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/smgr.h"
#include "storage/read_stream.h"
@@ -240,14 +241,18 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
/*
* If advice hasn't been suppressed, this system supports it, and this
* isn't a strictly sequential pattern, then we'll issue advice.
+ *
+ * XXX: Used to also check stream->pending_read_blocknum !=
+ * stream->seq_blocknum
*/
if (!suppress_advice &&
- stream->advice_enabled &&
- stream->pending_read_blocknum != stream->seq_blocknum)
+ stream->advice_enabled)
flags = READ_BUFFERS_ISSUE_ADVICE;
else
flags = 0;
+ flags |= READ_BUFFERS_MORE_MORE_MORE;
+
/* We say how many blocks we want to read, but may be smaller on return. */
buffer_index = stream->next_buffer_index;
io_index = stream->next_io_index;
@@ -306,6 +311,14 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
static void
read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
+ if (stream->distance > (io_combine_limit * 8))
+ {
+ if (stream->pinned_buffers + stream->pending_read_nblocks > ((stream->distance * 3) / 4))
+ {
+ return;
+ }
+ }
+
while (stream->ios_in_progress < stream->max_ios &&
stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
{
@@ -355,6 +368,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
/* And we've hit the limit. Rewind, and stop here. */
read_stream_unget_block(stream, blocknum);
+ pgaio_submit_staged();
return;
}
}
@@ -379,6 +393,8 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
stream->distance == 0) &&
stream->ios_in_progress < stream->max_ios)
read_stream_start_pending_read(stream, suppress_advice);
+
+ pgaio_submit_staged();
}
/*
@@ -442,7 +458,7 @@ read_stream_begin_impl(int flags,
* overflow (even though that's not possible with the current GUC range
* limits), allowing also for the spare entry and the overflow space.
*/
- max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+ max_pinned_buffers = Max(max_ios * io_combine_limit, io_combine_limit);
max_pinned_buffers = Min(max_pinned_buffers,
PG_INT16_MAX - io_combine_limit - 1);
@@ -493,10 +509,11 @@ read_stream_begin_impl(int flags,
* direct I/O isn't enabled, the caller hasn't promised sequential access
* (overriding our detection heuristics), and max_ios hasn't been set to
* zero.
+ *
+ * FIXME: Used to also check (io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ * (flags & READ_STREAM_SEQUENTIAL) == 0
*/
- if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
- (flags & READ_STREAM_SEQUENTIAL) == 0 &&
- max_ios > 0)
+ if (max_ios > 0)
stream->advice_enabled = true;
#endif
@@ -727,7 +744,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
if (++stream->oldest_io_index == stream->max_ios)
stream->oldest_io_index = 0;
- if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+ if (stream->ios[io_index].op.flags & (READ_BUFFERS_ISSUE_ADVICE | READ_BUFFERS_MORE_MORE_MORE))
{
/* Distance ramps up fast (behavior C). */
distance = stream->distance * 2;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 89cb7b41b03..722e73eb7d0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1751,7 +1751,8 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
if (did_start_io_overall)
{
- pgaio_submit_staged();
+ if (!(flags & READ_BUFFERS_MORE_MORE_MORE))
+ pgaio_submit_staged();
return true;
}
else
--
2.45.2.746.g06e570c0df.dirty
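The throttle added at the top of read_stream_look_ahead() in the patch above can be isolated as a pure predicate, which makes the heuristic easier to see. This is a sketch; the function name and parameter list are invented here, only the arithmetic comes from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Once the stream's distance exceeds 8x the combine limit, stop queueing
 * more IO while pinned + pending buffers already cover more than 3/4 of
 * the distance.  This keeps a deep stream from eagerly re-filling on
 * every consumed buffer.
 */
static bool
lookahead_should_pause(int distance, int pinned_buffers,
					   int pending_read_nblocks, int io_combine_limit)
{
	if (distance > io_combine_limit * 8 &&
		pinned_buffers + pending_read_nblocks > (distance * 3) / 4)
		return true;
	return false;
}
```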
Attachment: v2-0014-aio-Add-bounce-buffers.patch (text/x-diff)
From c1a5b7c868eb962a3e1e5348aa6309aa1005f4eb Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 25 Nov 2024 16:35:15 -0500
Subject: [PATCH v2 14/20] aio: Add bounce buffers
---
src/include/storage/aio.h | 18 ++
src/include/storage/aio_internal.h | 33 ++++
src/include/utils/resowner.h | 2 +
src/backend/storage/aio/README.md | 27 +++
src/backend/storage/aio/aio.c | 182 ++++++++++++++++++
src/backend/storage/aio/aio_init.c | 118 ++++++++++++
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/backend/utils/resowner/resowner.c | 25 ++-
src/tools/pgindent/typedefs.list | 1 +
10 files changed, 419 insertions(+), 2 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index ff44dac5bb2..1bef475b0a9 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -222,6 +222,9 @@ typedef struct PgAioHandleSharedCallbacks
+typedef struct PgAioBounceBuffer PgAioBounceBuffer;
+
+
/*
* How many callbacks can be registered for one IO handle. Currently we only
* need two, but it's not hard to imagine needing a few more.
@@ -294,6 +297,20 @@ extern void pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+extern PgAioBounceBuffer *pgaio_bounce_buffer_get(void);
+extern void pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb);
+extern uint32 pgaio_bounce_buffer_id(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release(PgAioBounceBuffer *bb);
+extern char *pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error);
+
+
+
/* --------------------------------------------------------------------------------
* Actions on multiple IOs.
* --------------------------------------------------------------------------------
@@ -354,6 +371,7 @@ typedef enum IoMethod
extern const struct config_enum_entry io_method_options[];
extern int io_method;
extern int io_max_concurrency;
+extern int io_bounce_buffers;
#endif /* AIO_H */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index d2dc1516bdf..2065bde79c3 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -91,6 +91,12 @@ struct PgAioHandle
/* index into PgAioCtl->iovecs */
uint32 iovec_off;
+ /*
+ * List of bounce_buffers owned by IO. It would suffice to use an index
+ * based linked list here.
+ */
+ slist_head bounce_buffers;
+
/**
* In which list the handle is registered, depends on the state:
* - IDLE, in per-backend list
@@ -130,11 +136,23 @@ struct PgAioHandle
};
+struct PgAioBounceBuffer
+{
+ slist_node node;
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+ char *buffer;
+};
+
+
typedef struct PgAioPerBackend
{
/* index into PgAioCtl->io_handles */
uint32 io_handle_off;
+ /* index into PgAioCtl->bounce_buffers */
+ uint32 bounce_buffers_off;
+
/* IO Handles that currently are not used */
dclist_head idle_ios;
@@ -162,6 +180,12 @@ typedef struct PgAioPerBackend
* IOs being appended at the end.
*/
dclist_head in_flight_ios;
+
+ /* Bounce Buffers that currently are not used */
+ slist_head idle_bbs;
+
+ /* see handed_out_io */
+ PgAioBounceBuffer *handed_out_bb;
} PgAioPerBackend;
@@ -187,6 +211,15 @@ typedef struct PgAioCtl
*/
uint64 *iovecs_data;
+ /*
+	 * To perform AIO on buffers that cannot be accessed directly (either
+	 * because they are not in shared memory or because we need to operate on
+	 * a copy, as is e.g. the case for writes when checksums are in use)
+ */
+ uint64 bounce_buffers_count;
+ PgAioBounceBuffer *bounce_buffers;
+ char *bounce_buffers_data;
+
uint64 io_handle_count;
PgAioHandle *io_handles;
} PgAioCtl;
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index 2d55720a54c..0cdd0c13ffb 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -168,5 +168,7 @@ extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *local
struct dlist_node;
extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+extern void ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
#endif /* RESOWNER_H */
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
index 893f4ffe428..0076ea4aa10 100644
--- a/src/backend/storage/aio/README.md
+++ b/src/backend/storage/aio/README.md
@@ -395,6 +395,33 @@ shared memory no less!), completion callbacks instead have to encode errors in
a more compact format that can be converted into an error message.
+### AIO Bounce Buffers
+
+For some uses of AIO there is no convenient memory location to use as the
+source / destination of an AIO. E.g. when data checksums are enabled, writes
+currently cannot be done directly from shared buffers, as a shared buffer
+lock still allows some modifications, e.g., for hint bits (see
+`FlushBuffer()`). If the write were done in place, such modifications could
+cause the checksum to fail.
+
+For synchronous IO this is solved by copying the buffer to separate memory
+before computing the checksum and using that copy as the source buffer for the
+AIO.
+
+However, for AIO that is not a workable solution:
+- Instead of a single buffer, many buffers are required, as many IOs might
+  be in flight.
+- When using the [worker method](#worker), the source/target of IO needs to be
+ in shared memory, otherwise the workers won't be able to access the memory.
+
+The AIO subsystem addresses this by providing a limited number of bounce
+buffers that can be used as the source / target for IO. A bounce buffer can
+be acquired with `pgaio_bounce_buffer_get()` and multiple bounce buffers can
+be associated with an AIO Handle with `pgaio_io_assoc_bounce_buffer()`.
+
+Bounce buffers are automatically released when the IO completes.
+
+
## Helpers
Using the low-level AIO API introduces too much complexity to do so all over
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 2439ce3740d..e829e1752ca 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -54,6 +54,8 @@ static void pgaio_io_resowner_register(PgAioHandle *ioh);
static void pgaio_io_wait_for_free(void);
static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation);
+static void pgaio_bounce_buffer_wait_for_free(void);
+
/* Options for io_method. */
@@ -68,6 +70,7 @@ const struct config_enum_entry io_method_options[] = {
int io_method = DEFAULT_IO_METHOD;
int io_max_concurrency = -1;
+int io_bounce_buffers = -1;
/* global control for AIO */
@@ -732,6 +735,21 @@ pgaio_io_reclaim(PgAioHandle *ioh)
}
}
+ /* reclaim all associated bounce buffers */
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ slist_mutable_iter it;
+
+ slist_foreach_modify(it, &ioh->bounce_buffers)
+ {
+ PgAioBounceBuffer *bb = slist_container(PgAioBounceBuffer, node, it.cur);
+
+ slist_delete_current(&it);
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ }
+ }
+
if (ioh->resowner)
{
ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
@@ -855,6 +873,168 @@ pgaio_io_wait_for_free(void)
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+PgAioBounceBuffer *
+pgaio_bounce_buffer_get(void)
+{
+ PgAioBounceBuffer *bb = NULL;
+ slist_node *node;
+
+ if (my_aio->handed_out_bb != NULL)
+ elog(ERROR, "can only hand out one BB");
+
+ /*
+ * FIXME It probably is not correct to have bounce buffers be per backend,
+ * they use too much memory.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ {
+ pgaio_bounce_buffer_wait_for_free();
+ }
+
+ node = slist_pop_head_node(&my_aio->idle_bbs);
+ bb = slist_container(PgAioBounceBuffer, node, node);
+
+ my_aio->handed_out_bb = bb;
+
+ bb->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberAioBounceBuffer(bb->resowner, &bb->resowner_node);
+
+ return bb;
+}
+
+void
+pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only assign handed out BB");
+ my_aio->handed_out_bb = NULL;
+
+ /*
+ * There can be many bounce buffers assigned in case of vectorized IOs.
+ */
+ slist_push_head(&ioh->bounce_buffers, &bb->node);
+
+ /* once associated with an IO, the IO has ownership */
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+uint32
+pgaio_bounce_buffer_id(PgAioBounceBuffer *bb)
+{
+ return bb - aio_ctl->bounce_buffers;
+}
+
+void
+pgaio_bounce_buffer_release(PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only release handed out BB");
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ my_aio->handed_out_bb = NULL;
+
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+void
+pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error)
+{
+ PgAioBounceBuffer *bb = dlist_container(PgAioBounceBuffer, resowner_node, bb_node);
+
+ Assert(bb->resowner);
+
+ if (!on_error)
+ elog(WARNING, "leaked AIO bounce buffer");
+
+ pgaio_bounce_buffer_release(bb);
+}
+
+char *
+pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb)
+{
+ return bb->buffer;
+}
+
+static void
+pgaio_bounce_buffer_wait_for_free(void)
+{
+ static uint32 lastpos = 0;
+
+ if (my_aio->num_staged_ios > 0)
+ {
+ elog(DEBUG2, "submitting while acquiring free bb");
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = my_aio->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &aio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ continue;
+ case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARED:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d to reclaim BB",
+ pgaio_io_get_id(ioh));
+
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(WARNING, "empty after wait");
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ case AHS_COMPLETED_LOCAL:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ break;
+ }
+ }
+
+ /*
+ * The submission above could have caused the IO to complete at any time.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(PANIC, "no more bbs");
+}
+
+
+
/* --------------------------------------------------------------------------------
* Actions on multiple IOs.
* --------------------------------------------------------------------------------
@@ -929,6 +1109,7 @@ void
pgaio_at_xact_end(bool is_subxact, bool is_commit)
{
Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
}
/*
@@ -939,6 +1120,7 @@ void
pgaio_at_error(void)
{
Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 23adc5308e5..417526f3823 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -82,6 +82,32 @@ AioIOVDataShmemSize(void)
io_max_concurrency));
}
+static Size
+AioBounceBufferDescShmemSize(void)
+{
+ Size sz;
+
+ /* PgAioBounceBuffer itself */
+ sz = mul_size(sizeof(PgAioBounceBuffer),
+ mul_size(AioProcs(), io_bounce_buffers));
+
+ return sz;
+}
+
+static Size
+AioBounceBufferDataShmemSize(void)
+{
+ Size sz;
+
+ /* and the associated buffer */
+ sz = mul_size(BLCKSZ,
+ mul_size(io_bounce_buffers, AioProcs()));
+ /* memory for alignment */
+ sz += BLCKSZ;
+
+ return sz;
+}
+
/*
* Choose a suitable value for io_max_concurrency.
*
@@ -107,6 +133,33 @@ AioChooseMaxConccurrency(void)
return Min(max_proportional_pins, 64);
}
+/*
+ * Choose a suitable value for io_bounce_buffers.
+ *
+ * It's very unlikely to be useful to allocate more bounce buffers for each
+ * backend than the backend is allowed to pin. Additionally, bounce buffers
+ * currently are used for writes, and it seems very uncommon for more than
+ * 10% of shared_buffers to be written out concurrently.
+ *
+ * XXX: This can quickly take up significant amounts of memory; the logic
+ * should probably be fine-tuned.
+ */
+static int
+AioChooseBounceBuffers(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = (NBuffers / 10) / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 256);
+}
+
Size
AioShmemSize(void)
{
@@ -130,11 +183,31 @@ AioShmemSize(void)
PGC_S_OVERRIDE);
}
+
+ /*
+ * If io_bounce_buffers is -1, we automatically choose a suitable value.
+ *
+ * See also comment above.
+ */
+ if (io_bounce_buffers == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseBounceBuffers());
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_bounce_buffers == -1) /* failed to apply it? */
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
sz = add_size(sz, AioCtlShmemSize());
sz = add_size(sz, AioBackendShmemSize());
sz = add_size(sz, AioHandleShmemSize());
sz = add_size(sz, AioIOVShmemSize());
sz = add_size(sz, AioIOVDataShmemSize());
+ sz = add_size(sz, AioBounceBufferDescShmemSize());
+ sz = add_size(sz, AioBounceBufferDataShmemSize());
if (pgaio_impl->shmem_size)
sz = add_size(sz, pgaio_impl->shmem_size());
@@ -148,7 +221,10 @@ AioShmemInit(void)
bool found;
uint32 io_handle_off = 0;
uint32 iovec_off = 0;
+ uint32 bounce_buffers_off = 0;
uint32 per_backend_iovecs = io_max_concurrency * io_combine_limit;
+ uint32 per_backend_bb = io_bounce_buffers;
+ char *bounce_buffers_data;
aio_ctl = (PgAioCtl *)
ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
@@ -160,6 +236,7 @@ AioShmemInit(void)
aio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
aio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+ aio_ctl->bounce_buffers_count = AioProcs() * per_backend_bb;
aio_ctl->backend_state = (PgAioPerBackend *)
ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
@@ -170,6 +247,35 @@ AioShmemInit(void)
aio_ctl->iovecs = ShmemInitStruct("AioIOV", AioIOVShmemSize(), &found);
aio_ctl->iovecs_data = ShmemInitStruct("AioIOVData", AioIOVDataShmemSize(), &found);
+ aio_ctl->bounce_buffers = ShmemInitStruct("AioBounceBufferDesc", AioBounceBufferDescShmemSize(), &found);
+
+ bounce_buffers_data = ShmemInitStruct("AioBounceBufferData", AioBounceBufferDataShmemSize(), &found);
+ bounce_buffers_data = (char *) TYPEALIGN(BLCKSZ, (uintptr_t) bounce_buffers_data);
+ aio_ctl->bounce_buffers_data = bounce_buffers_data;
+
+
+ /* Initialize IO handles. */
+ for (uint64 i = 0; i < aio_ctl->io_handle_count; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[i];
+
+ ioh->op = PGAIO_OP_INVALID;
+ ioh->subject = ASI_INVALID;
+ ioh->state = AHS_IDLE;
+
+ slist_init(&ioh->bounce_buffers);
+ }
+
+ /* Initialize Bounce Buffers. */
+ for (uint64 i = 0; i < aio_ctl->bounce_buffers_count; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[i];
+
+ bb->buffer = bounce_buffers_data;
+ bounce_buffers_data += BLCKSZ;
+ }
+
+
for (int procno = 0; procno < AioProcs(); procno++)
{
PgAioPerBackend *bs = &aio_ctl->backend_state[procno];
@@ -177,9 +283,13 @@ AioShmemInit(void)
bs->io_handle_off = io_handle_off;
io_handle_off += io_max_concurrency;
+ bs->bounce_buffers_off = bounce_buffers_off;
+ bounce_buffers_off += per_backend_bb;
+
dclist_init(&bs->idle_ios);
memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
dclist_init(&bs->in_flight_ios);
+ slist_init(&bs->idle_bbs);
/* initialize per-backend IOs */
for (int i = 0; i < io_max_concurrency; i++)
@@ -201,6 +311,14 @@ AioShmemInit(void)
dclist_push_tail(&bs->idle_ios, &ioh->node);
iovec_off += io_combine_limit;
}
+
+ /* initialize per-backend bounce buffers */
+ for (int i = 0; i < per_backend_bb; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[bs->bounce_buffers_off + i];
+
+ slist_push_head(&bs->idle_bbs, &bb->node);
+ }
}
out:
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b2999b86c24..39e91ebd2a5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3233,6 +3233,19 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_bounce_buffers",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO Bounce Buffers reserved for each backend."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &io_bounce_buffers,
+ -1, -1, 4096,
+ NULL, NULL, NULL
+ },
+
{
{"io_workers",
PGC_SIGHUP,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5893eb29228..da6e248a29e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -848,6 +848,8 @@
#io_max_concurrency = 32 # Max number of IOs that may be in
# flight at the same time in one backend
# (change requires restart)
+#io_bounce_buffers = -1 # -1 sets based on shared_buffers
+ # (change requires restart)
#------------------------------------------------------------------------------
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 5cf14472ebd..d1932b7393c 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -159,10 +159,11 @@ struct ResourceOwnerData
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
/*
- * AIO handles need be registered in critical sections and therefore
- * cannot use the normal ResoureElem mechanism.
+	 * AIO handles & bounce buffers need to be registered in critical
+	 * sections and therefore cannot use the normal ResourceElem mechanism.
*/
dlist_head aio_handles;
+ dlist_head aio_bounce_buffers;
};
@@ -434,6 +435,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
}
dlist_init(&owner->aio_handles);
+ dlist_init(&owner->aio_bounce_buffers);
return owner;
}
@@ -743,6 +745,13 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
pgaio_io_release_resowner(node, !isCommit);
}
+
+ while (!dlist_is_empty(&owner->aio_bounce_buffers))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_bounce_buffers);
+
+ pgaio_bounce_buffer_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1112,3 +1121,15 @@ ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
{
dlist_delete_from(&owner->aio_handles, ioh_node);
}
+
+void
+ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_bounce_buffers, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_bounce_buffers, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a5b12b48f99..dc52d6165d4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2104,6 +2104,7 @@ Permutation
PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
+PgAioBounceBuffer
PgAioCtl
PgAioHandle
PgAioHandleFlags
--
2.45.2.746.g06e570c0df.dirty
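The bounce-buffer ownership rules from the patch above — only one buffer handed out per backend at a time, ownership moving to the IO on association, and buffers returning to the idle list when the IO is reclaimed — can be sketched with a toy free list. The `SimBB`/`sim_bb_*` names are made up; the real code additionally waits (and may submit staged IO) instead of returning NULL when no buffer is idle:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_BB 4

typedef struct SimBB
{
	struct SimBB *next;
	bool		in_io;			/* currently owned by an in-flight IO */
} SimBB;

static SimBB bbs[NUM_BB];
static SimBB *idle_head;
static SimBB *handed_out;

static void
sim_bb_init(void)
{
	idle_head = NULL;
	handed_out = NULL;
	for (int i = 0; i < NUM_BB; i++)
	{
		bbs[i].in_io = false;
		bbs[i].next = idle_head;
		idle_head = &bbs[i];
	}
}

/*
 * Like pgaio_bounce_buffer_get(): at most one buffer may be handed out at
 * a time.  Returns NULL when none is available; the real code waits for an
 * in-flight IO to complete instead.
 */
static SimBB *
sim_bb_get(void)
{
	SimBB	   *bb;

	if (handed_out != NULL || idle_head == NULL)
		return NULL;
	bb = idle_head;
	idle_head = bb->next;
	handed_out = bb;
	return bb;
}

/* Like pgaio_io_assoc_bounce_buffer(): ownership moves to the IO. */
static void
sim_bb_assoc_with_io(SimBB *bb)
{
	assert(handed_out == bb);
	handed_out = NULL;
	bb->in_io = true;
}

/* Like pgaio_io_reclaim() pushing the IO's buffers back onto idle_bbs. */
static void
sim_bb_reclaim(SimBB *bb)
{
	bb->in_io = false;
	bb->next = idle_head;
	idle_head = bb;
}
```

Because association transfers ownership to the IO, the resource owner forgets the buffer at that point, matching the `ResourceOwnerForgetAioBounceBuffer()` call in `pgaio_io_assoc_bounce_buffer()`.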
Attachment: v2-0015-bufmgr-Implement-AIO-write-support.patch (text/x-diff)
From 40e15609a95f6733a7fe0e202c5ec4add3044bad Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:01 -0400
Subject: [PATCH v2 15/20] bufmgr: Implement AIO write support
As of this commit there are no users of these AIO facilities, that'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 2 +
src/include/storage/bufmgr.h | 2 +
src/backend/storage/aio/aio_subject.c | 2 +
src/backend/storage/buffer/bufmgr.c | 85 +++++++++++++++++++++++++++
4 files changed, 91 insertions(+)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 1bef475b0a9..caa52d2aaba 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -106,8 +106,10 @@ typedef enum PgAioHandleSharedCallbackID
ASC_MD_WRITEV,
ASC_SHARED_BUFFER_READ,
+ ASC_SHARED_BUFFER_WRITE,
ASC_LOCAL_BUFFER_READ,
+ ASC_LOCAL_BUFFER_WRITE,
} PgAioHandleSharedCallbackID;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 2a836cf98c6..2e88b19619c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,7 +205,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
struct PgAioHandleSharedCallbacks;
extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_writev_cb;
extern const struct PgAioHandleSharedCallbacks aio_local_buffer_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_writev_cb;
/* upper limit for effective_io_concurrency */
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 21341aae425..b2bd0c235e7 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -52,8 +52,10 @@ static const PgAioHandleSharedCallbacksEntry aio_shared_cbs[] = {
CALLBACK_ENTRY(ASC_MD_WRITEV, aio_md_writev_cb),
CALLBACK_ENTRY(ASC_SHARED_BUFFER_READ, aio_shared_buffer_readv_cb),
+ CALLBACK_ENTRY(ASC_SHARED_BUFFER_WRITE, aio_shared_buffer_writev_cb),
CALLBACK_ENTRY(ASC_LOCAL_BUFFER_READ, aio_local_buffer_readv_cb),
+ CALLBACK_ENTRY(ASC_LOCAL_BUFFER_WRITE, aio_local_buffer_writev_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 722e73eb7d0..0f94db19f9d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6437,6 +6437,44 @@ ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
return buf_failed;
}
+static uint64
+ReadBufferCompleteWriteShared(Buffer buffer, bool release_lock, bool failed)
+{
+ BufferDesc *bufHdr;
+ bool result = false;
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(buf_state & BM_DIRTY);
+ }
+#endif
+
+ /* AFIXME: implement track_io_timing */
+
+ TerminateBufferIO(bufHdr, /* clear_dirty = */ true,
+ failed ? BM_IO_ERROR : 0,
+ /* forget_owner = */ false,
+ /* syncio = */ false);
+
+ /*
+ * The initiator of the IO is no longer managing the lock (it called
+ * LWLockDisown()); we are.
+ */
+ if (release_lock)
+ LWLockReleaseUnowned(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+
+ return result;
+}
+
/*
* Helper to prepare IO on shared buffers for execution, shared between reads
* and writes.
@@ -6518,6 +6556,12 @@ shared_buffer_readv_prepare(PgAioHandle *ioh)
shared_buffer_prepare_common(ioh, false);
}
+static void
+shared_buffer_writev_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, true);
+}
+
static PgAioResult
shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
{
@@ -6586,6 +6630,34 @@ buffer_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int
MemoryContextSwitchTo(oldContext);
}
+static PgAioResult
+shared_buffer_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ /* FIXME: handle outright errors */
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+
+ /* FIXME: handle short writes / failures */
+ /* FIXME: ioh->scb_data.shared_buffer.release_lock */
+ ReadBufferCompleteWriteShared(buf,
+ true,
+ false);
+
+ }
+
+ return result;
+}
+
/*
* Helper to prepare IO on local buffers for execution, shared between reads
* and writes.
@@ -6655,14 +6727,27 @@ local_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
return result;
}
+static void
+local_buffer_writev_prepare(PgAioHandle *ioh)
+{
+ elog(ERROR, "not yet");
+}
+
const struct PgAioHandleSharedCallbacks aio_shared_buffer_readv_cb = {
.prepare = shared_buffer_readv_prepare,
.complete = shared_buffer_readv_complete,
.error = buffer_readv_error,
};
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_writev_cb = {
+ .prepare = shared_buffer_writev_prepare,
+ .complete = shared_buffer_writev_complete,
+};
const struct PgAioHandleSharedCallbacks aio_local_buffer_readv_cb = {
.prepare = local_buffer_readv_prepare,
.complete = local_buffer_readv_complete,
.error = buffer_readv_error,
};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_writev_cb = {
+ .prepare = local_buffer_writev_prepare,
+};
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0016-aio-Add-IO-queue-helper.patch (text/x-diff; charset=us-ascii)
From 0d7dbde438633fbb7af0dd2f3efd3a2c6b587438 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 16:15:42 -0400
Subject: [PATCH v2 16/20] aio: Add IO queue helper
This is likely never going anywhere - Thomas Munro is working on something
more complete. But I needed a way to exercise AIO for checkpointer / bgwriter.
---
src/include/storage/io_queue.h | 33 +++++
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/io_queue.c | 195 ++++++++++++++++++++++++++++
src/backend/storage/aio/meson.build | 1 +
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 232 insertions(+)
create mode 100644 src/include/storage/io_queue.h
create mode 100644 src/backend/storage/aio/io_queue.c
diff --git a/src/include/storage/io_queue.h b/src/include/storage/io_queue.h
new file mode 100644
index 00000000000..28077158d6d
--- /dev/null
+++ b/src/include/storage/io_queue.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.h
+ * Mechanism for tracking many IOs
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_queue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_QUEUE_H
+#define IO_QUEUE_H
+
+#include "storage/bufmgr.h"
+
+struct IOQueue;
+typedef struct IOQueue IOQueue;
+
+struct PgAioHandleRef;
+
+extern IOQueue *io_queue_create(int depth, int flags);
+extern void io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior);
+extern void io_queue_wait_one(IOQueue *ioq);
+extern void io_queue_wait_all(IOQueue *ioq);
+extern bool io_queue_is_empty(IOQueue *ioq);
+extern void io_queue_reserve(IOQueue *ioq);
+extern struct PgAioHandle *io_queue_get_io(IOQueue *ioq);
+extern void io_queue_free(IOQueue *ioq);
+
+#endif /* IO_QUEUE_H */
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 3bcb8a0b2ed..f3a7f9e63d6 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -13,6 +13,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_subject.o \
+ io_queue.o \
method_io_uring.o \
method_sync.o \
method_worker.o \
diff --git a/src/backend/storage/aio/io_queue.c b/src/backend/storage/aio/io_queue.c
new file mode 100644
index 00000000000..89ccfc2b9a7
--- /dev/null
+++ b/src/backend/storage/aio/io_queue.c
@@ -0,0 +1,195 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.c
+ * AIO - Mechanism for tracking many IOs
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/io_queue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/io_queue.h"
+
+#include "storage/aio.h"
+
+
+typedef struct TrackedIO
+{
+ PgAioHandleRef ior;
+ dlist_node node;
+} TrackedIO;
+
+struct IOQueue
+{
+ int depth;
+ int unsubmitted;
+
+ bool has_reserved;
+
+ dclist_head idle;
+ dclist_head in_progress;
+
+ TrackedIO tracked_ios[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
+IOQueue *
+io_queue_create(int depth, int flags)
+{
+ size_t sz;
+ IOQueue *ioq;
+
+ sz = offsetof(IOQueue, tracked_ios)
+ + sizeof(TrackedIO) * depth;
+
+ ioq = palloc0(sz);
+
+ ioq->depth = 0;
+
+ for (int i = 0; i < depth; i++)
+ {
+ TrackedIO *tio = &ioq->tracked_ios[i];
+
+ pgaio_io_ref_clear(&tio->ior);
+ dclist_push_tail(&ioq->idle, &tio->node);
+ }
+
+ return ioq;
+}
+
+void
+io_queue_wait_one(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* FIXME: Should we really pop here already? */
+ dlist_node *node = dclist_pop_head_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ pgaio_io_ref_wait(&tio->ior);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+void
+io_queue_reserve(IOQueue *ioq)
+{
+ if (ioq->has_reserved)
+ return;
+
+ if (dclist_is_empty(&ioq->idle))
+ io_queue_wait_one(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ ioq->has_reserved = true;
+}
+
+PgAioHandle *
+io_queue_get_io(IOQueue *ioq)
+{
+ PgAioHandle *ioh;
+
+ io_queue_reserve(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ if (!io_queue_is_empty(ioq))
+ {
+ ioh = pgaio_io_get_nb(CurrentResourceOwner, NULL);
+ if (ioh == NULL)
+ {
+ /*
+ * Need to wait for all IOs, blocking might not be legal in the
+ * context.
+ *
+ * XXX: This doesn't make a whole lot of sense, we're also
+ * blocking here. What was I smoking when I wrote the above?
+ */
+ io_queue_wait_all(ioq);
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+ }
+ else
+ {
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+
+ return ioh;
+}
+
+void
+io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior)
+{
+ dlist_node *node;
+ TrackedIO *tio;
+
+ Assert(ioq->has_reserved);
+ ioq->has_reserved = false;
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ node = dclist_pop_head_node(&ioq->idle);
+ tio = dclist_container(TrackedIO, node, node);
+
+ tio->ior = *ior;
+
+ dclist_push_tail(&ioq->in_progress, &tio->node);
+
+ ioq->unsubmitted++;
+
+ /*
+ * XXX: Should have some smarter logic here. We don't want to wait too
+ * long to submit, that'll mean we're more likely to block. But we also
+ * don't want to have the overhead of submitting every IO individually.
+ */
+ if (ioq->unsubmitted >= 4)
+ {
+ pgaio_submit_staged();
+ ioq->unsubmitted = 0;
+ }
+}
+
+void
+io_queue_wait_all(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* wait for the last IO to minimize unnecessary wakeups */
+ dlist_node *node = dclist_tail_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ if (!pgaio_io_ref_check_done(&tio->ior))
+ {
+ ereport(DEBUG3,
+ errmsg("io_queue_wait_all for io:%d",
+ pgaio_io_ref_get_id(&tio->ior)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ pgaio_io_ref_wait(&tio->ior);
+ }
+
+ dclist_delete_from(&ioq->in_progress, &tio->node);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+bool
+io_queue_is_empty(IOQueue *ioq)
+{
+ return dclist_is_empty(&ioq->in_progress);
+}
+
+void
+io_queue_free(IOQueue *ioq)
+{
+ io_queue_wait_all(ioq);
+
+ pfree(ioq);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 537f23d446d..e8a88e615c0 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_subject.c',
+ 'io_queue.c',
'method_io_uring.c',
'method_sync.c',
'method_worker.c',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dc52d6165d4..ca1e3427bc1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1175,6 +1175,7 @@ IOContext
IOFuncSelector
IOObject
IOOp
+IOQueue
IO_STATUS_BLOCK
IPCompareMethod
ITEM
@@ -2974,6 +2975,7 @@ TocEntry
TokenAuxData
TokenizedAuthLine
TrackItem
+TrackedIO
TransApplyAction
TransInvalidationInfo
TransState
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0017-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (text/x-diff; charset=us-ascii)
From ffe8489a8b44bc0a0b11ad765d578aa12801925a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 23 Jul 2024 10:01:23 -0700
Subject: [PATCH v2 17/20] bufmgr: use AIO in checkpointer, bgwriter
This is far from ready - just included to be able to exercise AIO writes and
get some preliminary numbers. In all likelihood this will be based on top of
work by Thomas Munro instead of the preceding commit.
---
src/include/postmaster/bgwriter.h | 3 +-
src/include/storage/buf_internals.h | 2 +
src/include/storage/bufmgr.h | 3 +-
src/include/storage/bufpage.h | 1 +
src/backend/postmaster/bgwriter.c | 25 +-
src/backend/postmaster/checkpointer.c | 12 +-
src/backend/storage/buffer/bufmgr.c | 581 +++++++++++++++++++++++---
src/backend/storage/page/bufpage.c | 10 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 580 insertions(+), 58 deletions(-)
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 407f26e5302..01a936fbc0a 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -31,7 +31,8 @@ extern void BackgroundWriterMain(char *startup_data, size_t startup_data_len) pg
extern void CheckpointerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+struct IOQueue;
+extern void CheckpointWriteDelay(struct IOQueue *ioq, int flags, double progress);
extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 37520890073..9d3123663b3 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -21,6 +21,8 @@
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
+#include "storage/io_queue.h"
+#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 2e88b19619c..455bbbcbfc4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -327,7 +327,8 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+struct IOQueue;
+extern bool BgBufferSync(struct IOQueue *ioq, struct WritebackContext *wb_context);
extern void LimitAdditionalPins(uint32 *additional_pins);
extern void LimitAdditionalLocalPins(uint32 *additional_pins);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 6222d46e535..6f8fe796da3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -509,5 +509,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
Item newtup, Size newsize);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern bool PageNeedsChecksumCopy(Page page);
#endif /* BUFPAGE_H */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0f75548759a..71c08da45db 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -38,10 +38,12 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
@@ -89,6 +91,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
bool prev_hibernate;
+ IOQueue *ioq;
WritebackContext wb_context;
Assert(startup_data_len == 0);
@@ -130,6 +133,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
ALLOCSET_DEFAULT_SIZES);
MemoryContextSwitchTo(bgwriter_context);
+ ioq = io_queue_create(128, 0);
WritebackContextInit(&wb_context, &bgwriter_flush_after);
/*
@@ -167,6 +171,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
* about in bgwriter, but we do have LWLocks, buffers, and temp files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
UnlockBuffers();
ReleaseAuxProcessResources(false);
@@ -226,12 +231,27 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
+ /*
+ * XXX: Before exiting, wait for all IO to finish. That's only
+ * important to avoid spurious PrintBufferLeakWarning() /
+ * PrintAioIPLeakWarning() calls, triggered by
+ * ReleaseAuxProcessResources() being called with isCommit=true.
+ *
+ * FIXME: this is theoretically racy, but I didn't want to copy
+ * HandleMainLoopInterrupts() remaining body here.
+ */
+ if (ShutdownRequestPending)
+ {
+ io_queue_wait_all(ioq);
+ io_queue_free(ioq);
+ }
+
HandleMainLoopInterrupts();
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync(&wb_context);
+ can_hibernate = BgBufferSync(ioq, &wb_context);
/* Report pending statistics to the cumulative stats system */
pgstat_report_bgwriter();
@@ -248,6 +268,9 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
smgrdestroyall();
}
+ /* finish IO before sleeping, to avoid blocking other backends */
+ io_queue_wait_all(ioq);
+
/*
* Log a new xl_running_xacts every now and then so replication can
* get into a consistent state faster (think of suboverflowed
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 982572a75db..0c08acd611f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,9 +46,11 @@
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
@@ -266,6 +268,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
UnlockBuffers();
@@ -719,7 +722,7 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(IOQueue *ioq, int flags, double progress)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -752,6 +755,13 @@ CheckpointWriteDelay(int flags, double progress)
/* Report interim statistics to the cumulative stats system */
pgstat_report_checkpointer();
+ /*
+ * Ensure all pending IO is submitted to avoid unnecessary delays for
+ * other processes.
+ */
+ io_queue_wait_all(ioq);
+
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0f94db19f9d..863464f12da 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -52,6 +52,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -77,6 +78,7 @@
/* Bits in SyncOneBuffer's return value */
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+#define BUF_CANT_MERGE 0x04
#define RELS_BSEARCH_THRESHOLD 20
@@ -511,8 +513,6 @@ static void UnpinBuffer(BufferDesc *buf);
static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used,
- WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
@@ -530,6 +530,7 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
+
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -3067,6 +3068,56 @@ UnpinBufferNoOwner(BufferDesc *buf)
}
}
+typedef struct BuffersToWrite
+{
+ int nbuffers;
+ BufferTag start_at_tag;
+ uint32 max_combine;
+
+ XLogRecPtr max_lsn;
+
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+
+ uint64 total_writes;
+
+ Buffer buffers[IOV_MAX];
+ PgAioBounceBuffer *bounce_buffers[IOV_MAX];
+ const void *data_ptrs[IOV_MAX];
+} BuffersToWrite;
+
+static int PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void
+BuffersToWriteInit(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ to_write->total_writes = 0;
+ to_write->nbuffers = 0;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
+ to_write->max_lsn = InvalidXLogRecPtr;
+}
+
+static void
+BuffersToWriteEnd(BuffersToWrite *to_write)
+{
+ if (to_write->ioh != NULL)
+ {
+ pgaio_io_release(to_write->ioh);
+ to_write->ioh = NULL;
+ }
+
+ if (to_write->total_writes > 0)
+ pgaio_submit_staged();
+}
+
+
#define ST_SORT sort_checkpoint_bufferids
#define ST_ELEMENT_TYPE CkptSortItem
#define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b)
@@ -3098,7 +3149,10 @@ BufferSync(int flags)
binaryheap *ts_heap;
int i;
int mask = BM_DIRTY;
+ IOQueue *ioq;
WritebackContext wb_context;
+ BuffersToWrite to_write;
+ int max_combine;
/*
* Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3160,7 +3214,9 @@ BufferSync(int flags)
if (num_to_scan == 0)
return; /* nothing to do */
+ ioq = io_queue_create(512, 0);
WritebackContextInit(&wb_context, &checkpoint_flush_after);
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
@@ -3268,48 +3324,91 @@ BufferSync(int flags)
*/
num_processed = 0;
num_written = 0;
+
+ BuffersToWriteInit(&to_write, ioq, &wb_context);
+
while (!binaryheap_empty(ts_heap))
{
BufferDesc *bufHdr = NULL;
CkptTsStatus *ts_stat = (CkptTsStatus *)
DatumGetPointer(binaryheap_first(ts_heap));
+ bool batch_continue = true;
- buf_id = CkptBufferIds[ts_stat->index].buf_id;
- Assert(buf_id != -1);
-
- bufHdr = GetBufferDescriptor(buf_id);
-
- num_processed++;
+ Assert(ts_stat->num_scanned <= ts_stat->num_to_scan);
/*
- * We don't need to acquire the lock here, because we're only looking
- * at a single bit. It's possible that someone else writes the buffer
- * and clears the flag right after we check, but that doesn't matter
- * since SyncOneBuffer will then do nothing. However, there is a
- * further race condition: it's conceivable that between the time we
- * examine the bit here and the time SyncOneBuffer acquires the lock,
- * someone else not only wrote the buffer but replaced it with another
- * page and dirtied it. In that improbable case, SyncOneBuffer will
- * write the buffer though we didn't need to. It doesn't seem worth
- * guarding against this, though.
+ * Collect a batch of buffers to write out from the current
+ * tablespace. That causes some imbalance between the tablespaces, but
+ * that's more than outweighed by the efficiency gain due to batching.
*/
- if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+ while (batch_continue &&
+ to_write.nbuffers < max_combine &&
+ ts_stat->num_scanned < ts_stat->num_to_scan)
{
- if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+ buf_id = CkptBufferIds[ts_stat->index].buf_id;
+ Assert(buf_id != -1);
+
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ num_processed++;
+
+ /*
+ * We don't need to acquire the lock here, because we're only
+ * looking at a single bit. It's possible that someone else writes
+ * the buffer and clears the flag right after we check, but that
+ * doesn't matter since PrepareToWriteBuffer will then do nothing.
+ * However, there is a further race condition: it's conceivable
+ * that between the time we examine the bit here and the time
+ * PrepareToWriteBuffer acquires the lock, someone else not only wrote
+ * the buffer but replaced it with another page and dirtied it. In
+ * that improbable case, PrepareToWriteBuffer will write the buffer
+ * though we didn't need to. It doesn't seem worth guarding
+ * against this, though.
+ */
+ if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
{
- TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.buffers_written++;
- num_written++;
+ int result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+
+ if (result & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+ WriteBuffers(&to_write, ioq, &wb_context);
+
+ result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+ Assert(result != BUF_CANT_MERGE);
+ }
+
+ if (result & BUF_WRITTEN)
+ {
+ TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
+ PendingCheckpointerStats.buffers_written++;
+ num_written++;
+ }
+ else
+ {
+ batch_continue = false;
+ }
}
+ else
+ {
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+ }
+
+ /*
+ * Measure progress independent of actually having to flush the
+ * buffer - otherwise writing becomes unbalanced.
+ */
+ ts_stat->progress += ts_stat->progress_slice;
+ ts_stat->num_scanned++;
+ ts_stat->index++;
}
- /*
- * Measure progress independent of actually having to flush the buffer
- * - otherwise writing become unbalanced.
- */
- ts_stat->progress += ts_stat->progress_slice;
- ts_stat->num_scanned++;
- ts_stat->index++;
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+
/* Have all the buffers from the tablespace been processed? */
if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -3327,15 +3426,23 @@ BufferSync(int flags)
*
* (This will check for barrier events even if it doesn't sleep.)
*/
- CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
+ CheckpointWriteDelay(ioq, flags, (double) num_processed / num_to_scan);
}
+ Assert(to_write.nbuffers == 0);
+ io_queue_wait_all(ioq);
+
/*
* Issue all pending flushes. Only checkpointer calls BufferSync(), so
* IOContext will always be IOCONTEXT_NORMAL.
*/
IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+ io_queue_wait_all(ioq); /* IssuePendingWritebacks might have added
+ * more */
+ io_queue_free(ioq);
+ BuffersToWriteEnd(&to_write);
+
pfree(per_ts_stat);
per_ts_stat = NULL;
binaryheap_free(ts_heap);
@@ -3361,7 +3468,7 @@ BufferSync(int flags)
* bgwriter_lru_maxpages to 0.)
*/
bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
{
/* info obtained from freelist.c */
int strategy_buf_id;
@@ -3404,6 +3511,9 @@ BgBufferSync(WritebackContext *wb_context)
long new_strategy_delta;
uint32 new_recent_alloc;
+ BuffersToWrite to_write;
+ int max_combine;
+
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
@@ -3424,6 +3534,8 @@ BgBufferSync(WritebackContext *wb_context)
return true;
}
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
+
/*
* Compute strategy_delta = how many buffers have been scanned by the
* clock sweep since last time. If first time through, assume none. Then
@@ -3580,11 +3692,25 @@ BgBufferSync(WritebackContext *wb_context)
num_written = 0;
reusable_buffers = reusable_buffers_est;
+ BuffersToWriteInit(&to_write, ioq, wb_context);
+
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ int sync_state;
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ if (sync_state & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ Assert(sync_state != BUF_CANT_MERGE);
+ }
if (++next_to_clean >= NBuffers)
{
@@ -3595,6 +3721,13 @@ BgBufferSync(WritebackContext *wb_context)
if (sync_state & BUF_WRITTEN)
{
+ Assert(sync_state & BUF_REUSABLE);
+
+ if (to_write.nbuffers == max_combine)
+ {
+ WriteBuffers(&to_write, ioq, wb_context);
+ }
+
reusable_buffers++;
if (++num_written >= bgwriter_lru_maxpages)
{
@@ -3606,6 +3739,11 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ BuffersToWriteEnd(&to_write);
+
PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
@@ -3644,8 +3782,66 @@ BgBufferSync(WritebackContext *wb_context)
return (bufs_to_lap == 0 && recent_alloc == 0);
}
+static inline bool
+BufferTagsSameRel(const BufferTag *tag1, const BufferTag *tag2)
+{
+ return (tag1->spcOid == tag2->spcOid) &&
+ (tag1->dbOid == tag2->dbOid) &&
+ (tag1->relNumber == tag2->relNumber) &&
+ (tag1->forkNum == tag2->forkNum)
+ ;
+}
+
+static bool
+CanMergeWrite(BuffersToWrite *to_write, BufferDesc *cur_buf_hdr)
+{
+ BlockNumber cur_block = cur_buf_hdr->tag.blockNum;
+
+ Assert(to_write->nbuffers > 0); /* can't merge with nothing */
+ Assert(to_write->start_at_tag.relNumber != InvalidOid);
+ Assert(to_write->start_at_tag.blockNum != InvalidBlockNumber);
+
+ Assert(to_write->ioh != NULL);
+
+ /*
+ * First check if the blocknumber is one that we could actually merge,
+ * that's cheaper than checking the tablespace/db/relnumber/fork match.
+ */
+ if (to_write->start_at_tag.blockNum + to_write->nbuffers != cur_block)
+ return false;
+
+ if (!BufferTagsSameRel(&to_write->start_at_tag, &cur_buf_hdr->tag))
+ return false;
+
+ /*
+ * Need to check with smgr how large a write we're allowed to make. To
+ * reduce the overhead of the smgr check, only inquire once, when
+ * processing the first to-be-merged buffer. That avoids the overhead in
+ * the common case of writing out buffers that are definitely not mergeable.
+ */
+ if (to_write->nbuffers == 1)
+ {
+ SMgrRelation smgr;
+
+ smgr = smgropen(BufTagGetRelFileLocator(&to_write->start_at_tag), INVALID_PROC_NUMBER);
+
+ to_write->max_combine = smgrmaxcombine(smgr,
+ to_write->start_at_tag.forkNum,
+ to_write->start_at_tag.blockNum);
+ }
+ else
+ {
+ Assert(to_write->max_combine > 0);
+ }
+
+ if (to_write->start_at_tag.blockNum + to_write->max_combine <= cur_block)
+ return false;
+
+ return true;
+}
+
/*
- * SyncOneBuffer -- process a single buffer during syncing.
+ * PrepareToWriteBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
* buffers marked recently used, as these are not replacement candidates.
@@ -3654,22 +3850,56 @@ BgBufferSync(WritebackContext *wb_context)
* BUF_WRITTEN: we wrote the buffer.
* BUF_REUSABLE: buffer is available for replacement, ie, it has
* pin count 0 and usage count 0.
+ * BUF_CANT_MERGE: can't combine this write with prior writes, caller needs
+ * to issue those first
*
* (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
* after locking it, but we don't care all that much.)
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context)
{
- BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(buf - 1);
+ uint32 buf_state;
int result = 0;
- uint32 buf_state;
- BufferTag tag;
+ XLogRecPtr cur_buf_lsn;
+ LWLock *content_lock;
+ bool may_block;
+
+ /*
+ * Check if this buffer can be written out together with already prepared
+ * writes. We check before we have pinned the buffer, so the buffer can be
+ * written out and replaced between this check and us pinning the buffer -
+ * we'll recheck below. The reason for the pre-check is that we don't want
+ * to pin the buffer just to find out that we can't merge the IO.
+ */
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ result |= BUF_CANT_MERGE;
+ return result;
+ }
+ }
+ else
+ {
+ if (to_write->ioh == NULL)
+ {
+ to_write->ioh = io_queue_get_io(ioq);
+ pgaio_io_get_ref(to_write->ioh, &to_write->ior);
+ }
+
+ to_write->start_at_tag = cur_buf_hdr->tag;
+ }
/* Make sure we can handle the pin */
ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
+ /* XXX: Should also check if we are allowed to pin one more buffer */
+
/*
* Check whether buffer needs writing.
*
@@ -3679,7 +3909,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* don't worry because our checkpoint.redo points before log record for
* upcoming changes and so we are not required to write such dirty buffer.
*/
- buf_state = LockBufHdr(bufHdr);
+ buf_state = LockBufHdr(cur_buf_hdr);
if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
@@ -3688,40 +3918,282 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
}
else if (skip_recently_used)
{
+#if 0
+ elog(LOG, "at block %d: skip recent with nbuffers %d",
+ cur_buf_hdr->tag.blockNum, to_write->nbuffers);
+#endif
/* Caller told us not to write recently-used buffers */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
{
/* It's clean, so nothing to do */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
+ /* pin the buffer, from now on its identity can't change anymore */
+ PinBuffer_Locked(cur_buf_hdr);
+
/*
- * Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
- * buffer is clean by the time we've locked it.)
+ * If we are merging, check whether the buffer's identity changed while
+ * we had not yet pinned it.
+ *
+ * XXX: It might be worth checking if we still want to write the buffer
+ * out, e.g. it could have been replaced with a buffer that doesn't have
+ * BM_CHECKPOINT_NEEDED set.
*/
- PinBuffer_Locked(bufHdr);
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ elog(LOG, "changed identity");
+ UnpinBuffer(cur_buf_hdr);
+
+ result |= BUF_CANT_MERGE;
+
+ return result;
+ }
+ }
+
+ may_block = to_write->nbuffers == 0 &&
+ !pgaio_have_staged() &&
+ io_queue_is_empty(ioq);
+ content_lock = BufferDescriptorGetContentLock(cur_buf_hdr);
+
+ if (!may_block)
+ {
+ if (LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /* done */
+ }
+ else if (to_write->nbuffers == 0)
+ {
+ /*
+ * Need to wait for all prior IO to finish before blocking for
+ * lock acquisition, to avoid the risk of a deadlock due to us
+ * waiting for another backend that is waiting for our unsubmitted
+ * IO to complete.
+ */
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ elog(DEBUG2, "at block %u: can't block, nbuffers = 0",
+ cur_buf_hdr->tag.blockNum
+ );
+
+ may_block = to_write->nbuffers == 0 &&
+ !pgaio_have_staged() &&
+ io_queue_is_empty(ioq);
+ Assert(may_block);
+
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+ else
+ {
+ elog(DEBUG2, "at block %u: can't block, nbuffers = %d",
+ cur_buf_hdr->tag.blockNum,
+ to_write->nbuffers);
+
+ UnpinBuffer(cur_buf_hdr);
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+
+ return result;
+ }
+ }
+ else
+ {
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+
+ if (!may_block)
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
- FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ may_block = io_queue_is_empty(ioq) && to_write->nbuffers == 0 && !pgaio_have_staged();
- LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ elog(DEBUG2, "at block %u: non-waitable StartBufferIO returns false, may_block = %d",
+ cur_buf_hdr->tag.blockNum,
+ may_block);
- tag = bufHdr->tag;
+ /*
+ * FIXME: can't tell whether this is because the buffer has
+ * been cleaned
+ */
+ if (!may_block)
+ {
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+ }
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
- UnpinBuffer(bufHdr);
+ return result;
+ }
+ }
+ }
+ else
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, false))
+ {
+ elog(DEBUG2, "waitable StartBufferIO returns false");
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ /*
+ * FIXME: Historically we returned BUF_WRITTEN in this case, which
+ * seems wrong
+ */
+ return result;
+ }
+ }
/*
- * SyncOneBuffer() is only called by checkpointer and bgwriter, so
- * IOContext will always be IOCONTEXT_NORMAL.
+ * Run PageGetLSN while holding header lock, since we don't have the
+ * buffer locked exclusively in all cases.
*/
- ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
+ buf_state = LockBufHdr(cur_buf_hdr);
+
+ cur_buf_lsn = BufferGetLSN(cur_buf_hdr);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf_state &= ~BM_JUST_DIRTIED;
+
+ UnlockBufHdr(cur_buf_hdr, buf_state);
+
+ to_write->buffers[to_write->nbuffers] = buf;
+ to_write->nbuffers++;
+
+ if (buf_state & BM_PERMANENT &&
+ (to_write->max_lsn == InvalidXLogRecPtr || to_write->max_lsn < cur_buf_lsn))
+ {
+ to_write->max_lsn = cur_buf_lsn;
+ }
+
+ result |= BUF_WRITTEN;
+
+ return result;
+}
+
+static void
+WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ SMgrRelation smgr;
+ Buffer first_buf;
+ BufferDesc *first_buf_hdr;
+ bool needs_checksum;
+
+ Assert(to_write->nbuffers > 0 && to_write->nbuffers <= io_combine_limit);
+
+ first_buf = to_write->buffers[0];
+ first_buf_hdr = GetBufferDescriptor(first_buf - 1);
+
+ smgr = smgropen(BufTagGetRelFileLocator(&first_buf_hdr->tag), INVALID_PROC_NUMBER);
+
+ /*
+ * Force XLOG flush up to buffer's LSN. This implements the basic WAL
+ * rule that log updates must hit disk before any of the data-file changes
+ * they describe do.
+ *
+ * However, this rule does not apply to unlogged relations, which will be
+ * lost after a crash anyway. Most unlogged relation pages do not bear
+ * LSNs since we never emit WAL records for them, and therefore flushing
+ * up through the buffer LSN would be useless, but harmless. However,
+ * GiST indexes use LSNs internally to track page-splits, and therefore
+ * unlogged GiST pages bear "fake" LSNs generated by
+ * GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake
+ * LSN counter could advance past the WAL insertion point; and if it did
+ * happen, attempting to flush WAL through that location would fail, with
+ * disastrous system-wide consequences. To make sure that can't happen,
+ * skip the flush if the buffer isn't permanent.
+ */
+ if (to_write->max_lsn != InvalidXLogRecPtr)
+ XLogFlush(to_write->max_lsn);
+
+ /*
+ * Now it's safe to write buffer to disk. Note that no one else should
+ * have been able to write it while we were busy with log flushing because
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
+ */
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+ Block bufBlock;
+ char *bufToWrite;
+
+ bufBlock = BufHdrGetBlock(cur_buf_hdr);
+ needs_checksum = PageNeedsChecksumCopy((Page) bufBlock);
+
+ /*
+ * Update page checksum if desired. Since we have only shared lock on
+ * the buffer, other processes might be updating hint bits in it, so
+ * we must copy the page to a bounce buffer if we do checksumming.
+ */
+ if (needs_checksum)
+ {
+ PgAioBounceBuffer *bb = pgaio_bounce_buffer_get();
+
+ pgaio_io_assoc_bounce_buffer(to_write->ioh, bb);
+
+ bufToWrite = pgaio_bounce_buffer_buffer(bb);
+ memcpy(bufToWrite, bufBlock, BLCKSZ);
+ PageSetChecksumInplace((Page) bufToWrite, cur_buf_hdr->tag.blockNum);
+ }
+ else
+ {
+ bufToWrite = bufBlock;
+ }
+
+ to_write->data_ptrs[nbuf] = bufToWrite;
+ }
+
+ pgaio_io_set_io_data_32(to_write->ioh,
+ (uint32 *) to_write->buffers,
+ to_write->nbuffers);
+ pgaio_io_add_shared_cb(to_write->ioh, ASC_SHARED_BUFFER_WRITE);
+
+ smgrstartwritev(to_write->ioh, smgr,
+ BufTagGetForkNum(&first_buf_hdr->tag),
+ first_buf_hdr->tag.blockNum,
+ to_write->data_ptrs,
+ to_write->nbuffers,
+ false);
+ pgstat_count_io_op_n(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+ IOOP_WRITE, to_write->nbuffers);
+
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+
+ UnpinBuffer(cur_buf_hdr);
+ }
+
+ io_queue_track(ioq, &to_write->ior);
+ to_write->total_writes++;
- return result | BUF_WRITTEN;
+ /* clear state for next write */
+ to_write->nbuffers = 0;
+ to_write->start_at_tag.relNumber = InvalidOid;
+ to_write->start_at_tag.blockNum = InvalidBlockNumber;
+ to_write->max_combine = 0;
+ to_write->max_lsn = InvalidXLogRecPtr;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
}
/*
@@ -4087,6 +4559,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
error_context_stack = errcallback.previous;
}
+
/*
* RelationGetNumberOfBlocksInFork
* Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index aa264f61b9c..1f6b982c7e9 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1480,6 +1480,16 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
return true;
}
+bool
+PageNeedsChecksumCopy(Page page)
+{
+ if (PageIsNew(page))
+ return false;
+
+ /* A copy for checksumming is only needed if data checksums are enabled */
+ return DataChecksumsEnabled();
+}
+
/*
* Set checksum for a page in shared buffers.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ca1e3427bc1..cdfef5698e7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -345,6 +345,7 @@ BufferManagerRelation
BufferStrategyControl
BufferTag
BufferUsage
+BuffersToWrite
BuildAccumulator
BuiltinScript
BulkInsertState
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0018-very-wip-test_aio-module.patch (text/x-diff; charset=us-ascii)
From bdc7ed519ced00b6cc7fd7eb8137d5d79d846353 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:13:48 -0400
Subject: [PATCH v2 18/20] very-wip: test_aio module
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio_internal.h | 10 +
src/include/storage/buf_internals.h | 4 +
src/backend/storage/aio/aio.c | 38 ++
src/backend/storage/buffer/bufmgr.c | 3 +-
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_aio/.gitignore | 6 +
src/test/modules/test_aio/Makefile | 34 ++
src/test/modules/test_aio/expected/inject.out | 295 ++++++++++
src/test/modules/test_aio/expected/io.out | 40 ++
.../modules/test_aio/expected/ownership.out | 148 +++++
src/test/modules/test_aio/expected/prep.out | 17 +
src/test/modules/test_aio/io_uring.conf | 5 +
src/test/modules/test_aio/meson.build | 78 +++
src/test/modules/test_aio/sql/inject.sql | 84 +++
src/test/modules/test_aio/sql/io.sql | 16 +
src/test/modules/test_aio/sql/ownership.sql | 65 +++
src/test/modules/test_aio/sql/prep.sql | 9 +
src/test/modules/test_aio/sync.conf | 5 +
src/test/modules/test_aio/test_aio--1.0.sql | 99 ++++
src/test/modules/test_aio/test_aio.c | 504 ++++++++++++++++++
src/test/modules/test_aio/test_aio.control | 3 +
src/test/modules/test_aio/worker.conf | 5 +
23 files changed, 1468 insertions(+), 2 deletions(-)
create mode 100644 src/test/modules/test_aio/.gitignore
create mode 100644 src/test/modules/test_aio/Makefile
create mode 100644 src/test/modules/test_aio/expected/inject.out
create mode 100644 src/test/modules/test_aio/expected/io.out
create mode 100644 src/test/modules/test_aio/expected/ownership.out
create mode 100644 src/test/modules/test_aio/expected/prep.out
create mode 100644 src/test/modules/test_aio/io_uring.conf
create mode 100644 src/test/modules/test_aio/meson.build
create mode 100644 src/test/modules/test_aio/sql/inject.sql
create mode 100644 src/test/modules/test_aio/sql/io.sql
create mode 100644 src/test/modules/test_aio/sql/ownership.sql
create mode 100644 src/test/modules/test_aio/sql/prep.sql
create mode 100644 src/test/modules/test_aio/sync.conf
create mode 100644 src/test/modules/test_aio/test_aio--1.0.sql
create mode 100644 src/test/modules/test_aio/test_aio.c
create mode 100644 src/test/modules/test_aio/test_aio.control
create mode 100644 src/test/modules/test_aio/worker.conf
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 2065bde79c3..f4c57438dd4 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -265,6 +265,16 @@ extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+/* These functions are just for use in tests, from within injection points */
+#ifdef USE_INJECTION_POINTS
+
+extern PgAioHandle *pgaio_inj_io_get(void);
+
+#endif
+
+
+
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
extern const IoMethodOps pgaio_worker_ops;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9d3123663b3..1b3329a25b4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -423,6 +423,10 @@ extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_co
extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
IOContext io_context, BufferTag *tag);
+/* solely to make it easier to write tests */
+extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
+
+
/* freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index e829e1752ca..261a752fb80 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -46,6 +46,9 @@
#include "utils/resowner.h"
#include "utils/wait_event_types.h"
+#ifdef USE_INJECTION_POINTS
+#include "utils/injection_point.h"
+#endif
static inline void pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state);
@@ -92,6 +95,11 @@ static const IoMethodOps *pgaio_ops_table[] = {
const IoMethodOps *pgaio_impl;
+#ifdef USE_INJECTION_POINTS
+static PgAioHandle *inj_cur_handle;
+#endif
+
+
/* --------------------------------------------------------------------------------
* "Core" IO Api
@@ -631,6 +639,19 @@ pgaio_io_process_completion(PgAioHandle *ioh, int result)
pgaio_io_update_state(ioh, AHS_REAPED);
+#ifdef USE_INJECTION_POINTS
+ inj_cur_handle = ioh;
+
+ /*
+ * FIXME: This could be in a critical section - but it looks like we can't
+ * just InjectionPointLoad() at process start, as the injection point
+ * might not yet be defined.
+ */
+ InjectionPointCached("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+
+ inj_cur_handle = NULL;
+#endif
+
pgaio_io_process_completion_subject(ioh);
pgaio_io_update_state(ioh, AHS_COMPLETED_SHARED);
@@ -1129,3 +1150,20 @@ assign_io_method(int newval, void *extra)
{
pgaio_impl = pgaio_ops_table[newval];
}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Injection point support
+ * --------------------------------------------------------------------------------
+ */
+
+#ifdef USE_INJECTION_POINTS
+
+PgAioHandle *
+pgaio_inj_io_get(void)
+{
+ return inj_cur_handle;
+}
+
+#endif
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 863464f12da..4a022440ada 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -514,7 +514,6 @@ static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner,
bool syncio);
@@ -6213,7 +6212,7 @@ WaitIO(BufferDesc *buf)
* find out if they can perform the I/O as part of a larger operation, without
* waiting for the answer or distinguishing the reasons why not.
*/
-static bool
+bool
StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c0d3cf0e14b..73ff9c55687 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
libpq_pipeline \
plsample \
spgist_name_ops \
+ test_aio \
test_bloomfilter \
test_copy_callbacks \
test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c829b619530..61c854a9b63 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('test_aio')
subdir('brin')
subdir('commit_ts')
subdir('delay_execution')
diff --git a/src/test/modules/test_aio/.gitignore b/src/test/modules/test_aio/.gitignore
new file mode 100644
index 00000000000..b4903eba657
--- /dev/null
+++ b/src/test/modules/test_aio/.gitignore
@@ -0,0 +1,6 @@
+# Generated subdirectories
+/log/
+/results/
+/output_iso/
+/tmp_check/
+/tmp_check_iso/
diff --git a/src/test/modules/test_aio/Makefile b/src/test/modules/test_aio/Makefile
new file mode 100644
index 00000000000..ae6d685835b
--- /dev/null
+++ b/src/test/modules/test_aio/Makefile
@@ -0,0 +1,34 @@
+# src/test/modules/test_aio/Makefile
+
+PGFILEDESC = "test_aio - test code for AIO"
+
+MODULE_big = test_aio
+OBJS = \
+ $(WIN32RES) \
+ test_aio.o
+
+EXTENSION = test_aio
+DATA = test_aio--1.0.sql
+
+REGRESS = prep ownership io
+
+ifeq ($(enable_injection_points),yes)
+REGRESS += inject
+endif
+
+# FIXME: with meson this runs the tests once with worker and once - if
+# supported - with io_uring.
+
+# requires custom config
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_aio/expected/inject.out b/src/test/modules/test_aio/expected/inject.out
new file mode 100644
index 00000000000..e62e3718845
--- /dev/null
+++ b/src/test/modules/test_aio/expected/inject.out
@@ -0,0 +1,295 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to a single block, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+ count
+-------
+ 10000
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+ count
+-------
+ 10000
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to 1 1/2 blocks, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+ count
+-------
+ 10000
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(8192 + 4096);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+ count
+-------
+ 10000
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten single-block read to read that block partially, we'll error out,
+-- because we assume we can read at least one block at a time.
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+ count
+-------
+ 1
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(4096);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten single-block read to read 0 bytes, expect that to error out
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+ count
+-------
+ 1
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(0);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- trigger a hard error, should error out
+SELECT inj_io_short_read_attach(-errno_from_string('EIO'));
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..3 in file base/<redacted>: Input/output error
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/io.out b/src/test/modules/test_aio/expected/io.out
new file mode 100644
index 00000000000..e46b582f290
--- /dev/null
+++ b/src/test/modules/test_aio/expected/io.out
@@ -0,0 +1,40 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT corrupt_rel_block('tbl_a', 1);
+ corrupt_rel_block
+-------------------
+
+(1 row)
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+ redact
+--------
+ t
+(1 row)
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/ownership.out b/src/test/modules/test_aio/expected/ownership.out
new file mode 100644
index 00000000000..97fdad6c629
--- /dev/null
+++ b/src/test/modules/test_aio/expected/ownership.out
@@ -0,0 +1,148 @@
+-----
+-- IO handles
+----
+-- leak warning: implicit xact
+SELECT handle_get();
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+ERROR: release in unexpected state
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+ handle_get
+------------
+
+
+(2 rows)
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- normal handle use
+SELECT handle_get_release();
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT handle_get_twice();
+ERROR: API violation: Only one IO can be handed out
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-----
+-- Bounce Buffers handles
+----
+-- leak warning: implicit xact
+SELECT bb_get();
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+ERROR: can only release handed out BB
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- normal bb use
+SELECT bb_get_release();
+ bb_get_release
+----------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT bb_get_twice();
+ERROR: can only hand out one BB
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/prep.out b/src/test/modules/test_aio/expected/prep.out
new file mode 100644
index 00000000000..7fad6280db5
--- /dev/null
+++ b/src/test/modules/test_aio/expected/prep.out
@@ -0,0 +1,17 @@
+CREATE EXTENSION test_aio;
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+ grow_rel
+----------
+
+(1 row)
+
+SELECT grow_rel('tbl_b', 550);
+ grow_rel
+----------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/io_uring.conf b/src/test/modules/test_aio/io_uring.conf
new file mode 100644
index 00000000000..efd7ad143ff
--- /dev/null
+++ b/src/test/modules/test_aio/io_uring.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'io_uring'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
new file mode 100644
index 00000000000..a4bef0ceeb0
--- /dev/null
+++ b/src/test/modules/test_aio/meson.build
@@ -0,0 +1,78 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_aio_sources = files(
+ 'test_aio.c',
+)
+
+if host_system == 'windows'
+ test_aio_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_aio',
+ '--FILEDESC', 'test_aio - test code for AIO',])
+endif
+
+test_aio = shared_module('test_aio',
+ test_aio_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_aio
+
+test_install_data += files(
+ 'test_aio.control',
+ 'test_aio--1.0.sql',
+)
+
+
+testfiles = [
+ 'prep',
+ 'ownership',
+ 'io',
+]
+
+if get_option('injection_points')
+ testfiles += 'inject'
+endif
+
+
+tests += {
+ 'name': 'test_aio_sync',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('sync.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+tests += {
+ 'name': 'test_aio_worker',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('worker.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+if liburing.found()
+ tests += {
+ 'name': 'test_aio_uring',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('io_uring.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ }
+ }
+endif
diff --git a/src/test/modules/test_aio/sql/inject.sql b/src/test/modules/test_aio/sql/inject.sql
new file mode 100644
index 00000000000..1190531f5ad
--- /dev/null
+++ b/src/test/modules/test_aio/sql/inject.sql
@@ -0,0 +1,84 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+SELECT inj_io_short_read_detach();
+
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- shorten multi-block read to a single block, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(8192);
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+SELECT inj_io_short_read_detach();
+
+
+-- shorten multi-block read to 1 1/2 blocks, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(8192 + 4096);
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+SELECT inj_io_short_read_detach();
+
+
+-- shorten single-block read to read that block partially, we'll error out,
+-- because we assume we can read at least one block at a time.
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(4096);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- shorten single-block read to read 0 bytes, expect that to error out
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(0);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_a', 0);
+SELECT invalidate_rel_block('tbl_a', 1);
+SELECT invalidate_rel_block('tbl_a', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- trigger a hard error, should error out
+SELECT inj_io_short_read_attach(-errno_from_string('EIO'));
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
diff --git a/src/test/modules/test_aio/sql/io.sql b/src/test/modules/test_aio/sql/io.sql
new file mode 100644
index 00000000000..a29bb4eb15d
--- /dev/null
+++ b/src/test/modules/test_aio/sql/io.sql
@@ -0,0 +1,16 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+
+SELECT corrupt_rel_block('tbl_a', 1);
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
diff --git a/src/test/modules/test_aio/sql/ownership.sql b/src/test/modules/test_aio/sql/ownership.sql
new file mode 100644
index 00000000000..63cf40c802a
--- /dev/null
+++ b/src/test/modules/test_aio/sql/ownership.sql
@@ -0,0 +1,65 @@
+-----
+-- IO handles
+-----
+
+-- leak warning: implicit xact
+SELECT handle_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+
+-- normal handle use
+SELECT handle_get_release();
+
+-- should error out, API violation
+SELECT handle_get_twice();
+
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+
+
+-----
+-- Bounce buffer handles
+-----
+
+-- leak warning: implicit xact
+SELECT bb_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+
+-- normal bb use
+SELECT bb_get_release();
+
+-- should error out, API violation
+SELECT bb_get_twice();
+
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
diff --git a/src/test/modules/test_aio/sql/prep.sql b/src/test/modules/test_aio/sql/prep.sql
new file mode 100644
index 00000000000..b8f225cbc98
--- /dev/null
+++ b/src/test/modules/test_aio/sql/prep.sql
@@ -0,0 +1,9 @@
+CREATE EXTENSION test_aio;
+
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+SELECT grow_rel('tbl_b', 550);
diff --git a/src/test/modules/test_aio/sync.conf b/src/test/modules/test_aio/sync.conf
new file mode 100644
index 00000000000..c480922d6cf
--- /dev/null
+++ b/src/test/modules/test_aio/sync.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'sync'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
new file mode 100644
index 00000000000..e3d5ce29c60
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -0,0 +1,99 @@
+/* src/test/modules/test_aio/test_aio--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_aio" to load this file. \quit
+
+
+CREATE FUNCTION errno_from_string(sym text)
+RETURNS pg_catalog.int4 STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION grow_rel(rel regclass, nblocks int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION read_corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION bb_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE OR REPLACE FUNCTION redact(p_sql text)
+RETURNS bool
+LANGUAGE plpgsql
+AS $$
+ DECLARE
+ err_state text;
+ err_msg text;
+ BEGIN
+ EXECUTE p_sql;
+ RETURN true;
+ EXCEPTION WHEN OTHERS THEN
+ GET STACKED DIAGNOSTICS
+ err_state = RETURNED_SQLSTATE,
+ err_msg = MESSAGE_TEXT;
+ err_msg = regexp_replace(err_msg, '(file|relation) "?base/[0-9]+/[0-9]+"?', '\1 base/<redacted>');
+ RAISE NOTICE 'wrapped error: %', err_msg
+ USING ERRCODE = err_state;
+ RETURN false;
+ END;
+$$;
+
+
+CREATE FUNCTION inj_io_short_read_attach(result int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_detach()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
new file mode 100644
index 00000000000..e495c5309b3
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.c
@@ -0,0 +1,504 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_aio.c
+ * Test module for AIO, exercising AIO handles, bounce buffers and
+ * IO error paths from SQL.
+ *
+ * Copyright (c) 2020-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_aio/test_aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "fmgr.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/injection_point.h"
+#include "utils/rel.h"
+
+
+PG_MODULE_MAGIC;
+
+
+typedef struct InjIoErrorState
+{
+ bool enabled;
+ bool result_set;
+ int result;
+} InjIoErrorState;
+
+static InjIoErrorState * inj_io_error_state;
+
+/* Shared memory init callbacks */
+static shmem_request_hook_type prev_shmem_request_hook = NULL;
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+
+
+static PgAioHandle *last_handle;
+static PgAioBounceBuffer *last_bb;
+
+
+
+static void
+test_aio_shmem_request(void)
+{
+ if (prev_shmem_request_hook)
+ prev_shmem_request_hook();
+
+ RequestAddinShmemSpace(sizeof(InjIoErrorState));
+}
+
+static void
+test_aio_shmem_startup(void)
+{
+ bool found;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* Create or attach to the shared memory state */
+ LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+ inj_io_error_state = ShmemInitStruct("test_aio",
+ sizeof(InjIoErrorState),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize the shared state.
+ */
+ inj_io_error_state->enabled = false;
+
+#ifdef USE_INJECTION_POINTS
+ InjectionPointAttach("AIO_PROCESS_COMPLETION_BEFORE_SHARED",
+ "test_aio",
+ "inj_io_short_read",
+ NULL,
+ 0);
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+#endif
+ }
+ else
+ {
+#ifdef USE_INJECTION_POINTS
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+ elog(LOG, "injection point loaded");
+#endif
+ }
+
+ LWLockRelease(AddinShmemInitLock);
+}
+
+void
+_PG_init(void)
+{
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+
+ /* Shared memory initialization */
+ prev_shmem_request_hook = shmem_request_hook;
+ shmem_request_hook = test_aio_shmem_request;
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = test_aio_shmem_startup;
+}
+
+
+PG_FUNCTION_INFO_V1(errno_from_string);
+Datum
+errno_from_string(PG_FUNCTION_ARGS)
+{
+ const char *sym = text_to_cstring(PG_GETARG_TEXT_PP(0));
+
+ if (strcmp(sym, "EIO") == 0)
+ PG_RETURN_INT32(EIO);
+ else if (strcmp(sym, "EAGAIN") == 0)
+ PG_RETURN_INT32(EAGAIN);
+ else if (strcmp(sym, "EINTR") == 0)
+ PG_RETURN_INT32(EINTR);
+ else if (strcmp(sym, "ENOSPC") == 0)
+ PG_RETURN_INT32(ENOSPC);
+ else if (strcmp(sym, "EROFS") == 0)
+ PG_RETURN_INT32(EROFS);
+
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg_internal("%s is not a supported errno value", sym));
+ PG_RETURN_INT32(0);
+}
+
+
+PG_FUNCTION_INFO_V1(grow_rel);
+Datum
+grow_rel(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 nblocks = PG_GETARG_UINT32(1);
+ Relation rel;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ while (nblocks > 0)
+ {
+ uint32 extend_by_pages;
+
+ extend_by_pages = Min(nblocks, MAX_BUFFERS_TO_EXTEND_BY);
+
+ ExtendBufferedRelBy(BMR_REL(rel),
+ MAIN_FORKNUM,
+ NULL,
+ 0,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+
+ nblocks -= extend_by_pages;
+
+ for (uint32 i = 0; i < extend_by_pages; i++)
+ {
+ ReleaseBuffer(victim_buffers[i]);
+ }
+ }
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(corrupt_rel_block);
+Datum
+corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ MarkBufferDirty(buf);
+
+ PageInit(page, BufferGetPageSize(buf), 0);
+
+ ph = (PageHeader) page;
+ ph->pd_special = BLCKSZ + 1;
+
+ FlushOneBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(read_corrupt_rel_block);
+Datum
+read_corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ BufferDesc *buf_hdr;
+ Page page;
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+ SMgrRelation smgr;
+ uint32 buf_state;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* read buffer without erroring out */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ page = BufferGetBlock(buf);
+
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get_ref(ioh, &ior);
+
+ buf_hdr = GetBufferDescriptor(buf - 1);
+ smgr = RelationGetSmgr(rel);
+
+ /* FIXME: even if this is just a test, we should verify nobody else uses this buffer */
+ buf_state = LockBufHdr(buf_hdr);
+ buf_state &= ~(BM_VALID | BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ StartBufferIO(buf_hdr, true, false);
+
+ pgaio_io_set_io_data_32(ioh, (uint32 *) &buf, 1);
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
+
+ smgrstartreadv(ioh, smgr, MAIN_FORKNUM, block,
+ (void *) &page, 1);
+
+ ReleaseBuffer(buf);
+
+ pgaio_io_ref_wait(&ior);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(invalidate_rel_block);
+Datum
+invalidate_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ PrefetchBufferResult pr;
+ Buffer buf;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* this is a gross hack, but there's no good API exposed */
+ pr = PrefetchBuffer(rel, MAIN_FORKNUM, block);
+ buf = pr.recent_buffer;
+ elog(LOG, "recent: %d", buf);
+ if (BufferIsValid(buf))
+ {
+ /* if the buffer contents aren't valid, this'll return false */
+ if (ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, block, buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ FlushOneBuffer(buf);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+
+ if (!EvictUnpinnedBuffer(buf))
+ elog(ERROR, "couldn't unpin");
+ }
+ }
+
+ relation_close(rel, AccessExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+#if 0
+PG_FUNCTION_INFO_V1(test_unsubmitted_vs_close);
+Datum
+test_unsubmitted_vs_close(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+
+ MarkBufferDirty(buf);
+ ph->pd_special = BLCKSZ + 1;
+
+ /* last_handle = pgaio_io_get(); */
+
+ PG_RETURN_VOID();
+}
+#endif
+
+PG_FUNCTION_INFO_V1(handle_get);
+Datum
+handle_get(PG_FUNCTION_ARGS)
+{
+ last_handle = pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_release_last);
+Datum
+handle_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_handle)
+ elog(ERROR, "no handle");
+
+ pgaio_io_release(last_handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_and_error);
+Datum
+handle_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_twice);
+Datum
+handle_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(handle_get_release);
+Datum
+handle_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioHandle *handle;
+
+ handle = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_release(handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get);
+Datum
+bb_get(PG_FUNCTION_ARGS)
+{
+ last_bb = pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_release_last);
+Datum
+bb_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_bb)
+ elog(ERROR, "no bb");
+
+ pgaio_bounce_buffer_release(last_bb);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_and_error);
+Datum
+bb_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_twice);
+Datum
+bb_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(bb_get_release);
+Datum
+bb_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioBounceBuffer *bb;
+
+ bb = pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_release(bb);
+
+ PG_RETURN_VOID();
+}
+
+#ifdef USE_INJECTION_POINTS
+extern PGDLLEXPORT void inj_io_short_read(const char *name, const void *private_data);
+
+void
+inj_io_short_read(const char *name, const void *private_data)
+{
+ PgAioHandle *ioh;
+
+ elog(LOG, "short read called: %d", inj_io_error_state->enabled);
+
+ if (inj_io_error_state->enabled)
+ {
+ ioh = pgaio_inj_io_get();
+
+ if (inj_io_error_state->result_set)
+ {
+ elog(LOG, "short read, changing result from %d to %d",
+ ioh->result, inj_io_error_state->result);
+
+ ioh->result = inj_io_error_state->result;
+ }
+ }
+}
+#endif
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
+Datum
+inj_io_short_read_attach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = true;
+ inj_io_error_state->result_set = !PG_ARGISNULL(0);
+ if (inj_io_error_state->result_set)
+ inj_io_error_state->result = PG_GETARG_INT32(0);
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_detach);
+Datum
+inj_io_short_read_detach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = false;
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_aio/test_aio.control b/src/test/modules/test_aio/test_aio.control
new file mode 100644
index 00000000000..cd91c3ed16b
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.control
@@ -0,0 +1,3 @@
+comment = 'Test code for AIO'
+default_version = '1.0'
+module_pathname = '$libdir/test_aio'
diff --git a/src/test/modules/test_aio/worker.conf b/src/test/modules/test_aio/worker.conf
new file mode 100644
index 00000000000..8104c201924
--- /dev/null
+++ b/src/test/modules/test_aio/worker.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'worker'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0001-Ensure-a-resowner-exists-for-all-paths-that-may-p.patch (text/x-diff)
From 42af1a44eadbfc3ac4e65ab23d280d6933b23284 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 8 Oct 2024 14:34:38 -0400
Subject: [PATCH v2 01/20] Ensure a resowner exists for all paths that may
perform AIO
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/1f6b50a7-38ef-4d87-8246-786d39f46ab9@iki.fi
---
src/backend/bootstrap/bootstrap.c | 7 +++++++
src/backend/replication/logical/logical.c | 6 ++++++
src/backend/utils/init/postinit.c | 6 +++++-
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index e0cb70ee9da..8ddcab0182a 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -361,8 +361,15 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
BaseInit();
bootstrap_signals();
+
+ /* need a resowner for IO during BootStrapXLOG() */
+ CreateAuxProcessResourceOwner();
+
BootStrapXLOG(bootstrap_data_checksum_version);
+ ReleaseAuxProcessResources(true);
+ CurrentResourceOwner = NULL;
+
/*
* To ensure that src/common/link-canary.c is linked into the backend, we
* must call it from somewhere. Here is as good as anywhere.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4dc14fdb495..76fce6749a9 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -386,6 +386,12 @@ CreateInitDecodingContext(const char *plugin,
slot->data.plugin = plugin_name;
SpinLockRelease(&slot->mutex);
+ if (CurrentResourceOwner == NULL)
+ {
+ Assert(am_walsender);
+ CurrentResourceOwner = AuxProcessResourceOwner;
+ }
+
if (XLogRecPtrIsInvalid(restart_lsn))
ReplicationSlotReserveWal();
else
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 01c4016ced6..8a09c939eff 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -755,8 +755,12 @@ InitPostgres(const char *in_dbname, Oid dboid,
* We don't yet have an aux-process resource owner, but StartupXLOG
* and ShutdownXLOG will need one. Hence, create said resource owner
* (and register a callback to clean it up after ShutdownXLOG runs).
+ *
+ * In bootstrap mode CreateAuxProcessResourceOwner() was already
+ * called in BootstrapModeMain().
*/
- CreateAuxProcessResourceOwner();
+ if (!bootstrap)
+ CreateAuxProcessResourceOwner();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0002-Allow-lwlocks-to-be-unowned.patch (text/x-diff)
From 5eff74f7f0bd0cf7102a04263a0dc9c0439123ed Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 5 Jan 2021 10:10:36 -0800
Subject: [PATCH v2 02/20] Allow lwlocks to be unowned
This is required for AIO, so that a lock held during a write can be released
by another backend, which in turn is required to avoid the potential for
deadlocks.
---
src/include/storage/lwlock.h | 2 +
src/backend/storage/lmgr/lwlock.c | 110 ++++++++++++++++++++++--------
2 files changed, 82 insertions(+), 30 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e09..eabf813ce05 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
+extern LWLockMode LWLockDisown(LWLock *l);
+extern void LWLockReleaseUnowned(LWLock *l, LWLockMode mode);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockAnyHeldByMe(LWLock *lock, int nlocks, size_t stride);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 9cf3e4f4f3a..bc459dc5d2b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1773,52 +1773,36 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
}
}
-
-/*
- * LWLockRelease - release a previously acquired lock
- */
-void
-LWLockRelease(LWLock *lock)
+static void
+LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
{
- LWLockMode mode;
uint32 oldstate;
bool check_waiters;
- int i;
-
- /*
- * Remove lock from list of locks held. Usually, but not always, it will
- * be the latest-acquired lock; so search array backwards.
- */
- for (i = num_held_lwlocks; --i >= 0;)
- if (lock == held_lwlocks[i].lock)
- break;
-
- if (i < 0)
- elog(ERROR, "lock %s is not held", T_NAME(lock));
-
- mode = held_lwlocks[i].mode;
-
- num_held_lwlocks--;
- for (; i < num_held_lwlocks; i++)
- held_lwlocks[i] = held_lwlocks[i + 1];
-
- PRINT_LWDEBUG("LWLockRelease", lock, mode);
/*
* Release my hold on lock, after that it can immediately be acquired by
* others, even if we still have to wakeup other waiters.
*/
if (mode == LW_EXCLUSIVE)
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_EXCLUSIVE);
else
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
/* nobody else can have that kind of lock */
- Assert(!(oldstate & LW_VAL_EXCLUSIVE));
+ if (mode == LW_EXCLUSIVE)
+ Assert((oldstate & LW_LOCK_MASK) == LW_VAL_EXCLUSIVE);
+ else
+ Assert((oldstate & LW_LOCK_MASK) < LW_VAL_EXCLUSIVE &&
+ (oldstate & LW_LOCK_MASK) >= LW_VAL_SHARED);
if (TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED())
TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+ if (mode == LW_EXCLUSIVE)
+ oldstate -= LW_VAL_EXCLUSIVE;
+ else
+ oldstate -= LW_VAL_SHARED;
+
/*
* We're still waiting for backends to get scheduled, don't wake them up
* again.
@@ -1841,6 +1825,72 @@ LWLockRelease(LWLock *lock)
LWLockWakeup(lock);
}
+ TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+}
+
+void
+LWLockReleaseUnowned(LWLock *lock, LWLockMode mode)
+{
+ LWLockReleaseInternal(lock, mode);
+}
+
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * After calling this function it is the caller's responsibility to ensure
+ * that the lock gets released, even in case of an error. This is only
+ * desirable if the lock is going to be released by a different process than
+ * the one that acquired it.
+ *
+ * Returns the mode in which the lock was held by the current backend.
+ *
+ * NB: This will leave lock->owner pointing to the current backend (if
+ * LOCK_DEBUG is set). We could add a separate flag indicating that, but it
+ * doesn't really seem worth it.
+ *
+ * NB: This does not call RESUME_INTERRUPTS(), but leaves that responsibility
+ * to the caller.
+ */
+LWLockMode
+LWLockDisown(LWLock *lock)
+{
+ LWLockMode mode;
+ int i;
+
+ /*
+ * Remove lock from list of locks held. Usually, but not always, it will
+ * be the latest-acquired lock; so search array backwards.
+ */
+ for (i = num_held_lwlocks; --i >= 0;)
+ if (lock == held_lwlocks[i].lock)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "lock %s is not held", T_NAME(lock));
+
+ mode = held_lwlocks[i].mode;
+
+ num_held_lwlocks--;
+ for (; i < num_held_lwlocks; i++)
+ held_lwlocks[i] = held_lwlocks[i + 1];
+
+ return mode;
+}
+
+/*
+ * LWLockRelease - release a previously acquired lock
+ */
+void
+LWLockRelease(LWLock *lock)
+{
+ LWLockMode mode;
+
+ mode = LWLockDisown(lock);
+
+ PRINT_LWDEBUG("LWLockRelease", lock, mode);
+
+ LWLockReleaseInternal(lock, mode);
+
/*
* Now okay to allow cancel/die interrupts.
*/
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0003-aio-Basic-subsystem-initialization.patch (text/x-diff)
From 93547a5a5b72fa0689b812ee6336b74c74eb95d7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 10 Jun 2024 13:42:58 -0700
Subject: [PATCH v2 03/20] aio: Basic subsystem initialization
This is just separate to make it easier to review the tendrils into various
places.
---
src/include/storage/aio.h | 42 +++++++++++++++++++
src/include/storage/aio_init.h | 24 +++++++++++
src/backend/storage/aio/Makefile | 2 +
src/backend/storage/aio/aio.c | 33 +++++++++++++++
src/backend/storage/aio/aio_init.c | 41 ++++++++++++++++++
src/backend/storage/aio/meson.build | 2 +
src/backend/storage/ipc/ipci.c | 3 ++
src/backend/utils/init/postinit.c | 7 ++++
src/backend/utils/misc/guc_tables.c | 23 ++++++++++
src/backend/utils/misc/postgresql.conf.sample | 11 +++++
src/tools/pgindent/typedefs.list | 1 +
11 files changed, 189 insertions(+)
create mode 100644 src/include/storage/aio.h
create mode 100644 src/include/storage/aio_init.h
create mode 100644 src/backend/storage/aio/aio.c
create mode 100644 src/backend/storage/aio/aio_init.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
new file mode 100644
index 00000000000..0ee9d0043de
--- /dev/null
+++ b/src/include/storage/aio.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.h
+ * Main AIO interface
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_H
+#define AIO_H
+
+
+#include "utils/guc_tables.h"
+
+
+/* GUC related */
+extern void assign_io_method(int newval, void *extra);
+
+
+/* Enum for io_method GUC. */
+typedef enum IoMethod
+{
+ IOMETHOD_SYNC = 0,
+} IoMethod;
+
+
+/* We'll default to synchronous execution. */
+#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+
+
+/* GUCs */
+extern const struct config_enum_entry io_method_options[];
+extern int io_method;
+extern int io_max_concurrency;
+
+
+#endif /* AIO_H */
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
new file mode 100644
index 00000000000..1c1d62baa79
--- /dev/null
+++ b/src/include/storage/aio_init.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.h
+ * AIO initialization - kept separate as initialization sites don't need to
+ * know about AIO itself and AIO users don't need to know about initialization.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_init.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INIT_H
+#define AIO_INIT_H
+
+
+extern Size AioShmemSize(void);
+extern void AioShmemInit(void);
+
+extern void pgaio_init_backend(void);
+
+#endif /* AIO_INIT_H */
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 2f29a9ec4d1..eaeaeeee8e3 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -9,6 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = \
+ aio.o \
+ aio_init.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
new file mode 100644
index 00000000000..72110c0df3e
--- /dev/null
+++ b/src/backend/storage/aio/aio.c
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.c
+ * AIO - Core Logic
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+
+
+/* Options for io_method. */
+const struct config_enum_entry io_method_options[] = {
+ {"sync", IOMETHOD_SYNC, false},
+ {NULL, 0, false}
+};
+
+int io_method = DEFAULT_IO_METHOD;
+int io_max_concurrency = -1;
+
+
+void
+assign_io_method(int newval, void *extra)
+{
+}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
new file mode 100644
index 00000000000..84e0e37baae
--- /dev/null
+++ b/src/backend/storage/aio/aio_init.c
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.c
+ * AIO - Subsystem Initialization
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_init.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio_init.h"
+
+
+Size
+AioShmemSize(void)
+{
+ Size sz = 0;
+
+ return sz;
+}
+
+void
+AioShmemInit(void)
+{
+}
+
+void
+pgaio_init_backend(void)
+{
+}
+
+void
+pgaio_postmaster_child_init_local(void)
+{
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 10e1aa3b20b..8d20759ebf8 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -1,5 +1,7 @@
# Copyright (c) 2024, PostgreSQL Global Development Group
backend_sources += files(
+ 'aio.c',
+ 'aio_init.c',
'read_stream.c',
)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 7783ba854fc..c7703e5178e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -37,6 +37,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/dsm_registry.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, AioShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ AioShmemInit();
}
/*
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 8a09c939eff..9d1025e815b 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -43,6 +43,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -626,6 +627,12 @@ BaseInit(void)
*/
pgstat_initialize();
+ /*
+ * Initialize AIO before infrastructure that might need to actually
+ * execute AIO.
+ */
+ pgaio_init_backend();
+
/* Do local initialization of storage and buffer managers */
InitSync();
smgrinit();
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8cf1afbad20..6d4056c68b9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -71,6 +71,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/large_object.h"
@@ -3219,6 +3220,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_max_concurrency",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IOs that may be in flight in one backend."),
+ NULL,
+ },
+ &io_max_concurrency,
+ -1, -1, 1024,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
@@ -5226,6 +5239,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"io_method", PGC_POSTMASTER, RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Selects the method of asynchronous I/O to use."),
+ NULL
+ },
+ &io_method,
+ DEFAULT_IO_METHOD, io_method_options,
+ NULL, assign_io_method, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a2ac7575ca7..c4c60da9845 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -838,6 +838,17 @@
#include = '...' # include file
+#------------------------------------------------------------------------------
+# WIP AIO GUC docs
+#------------------------------------------------------------------------------
+
+#io_method = sync # (change requires restart)
+
+#io_max_concurrency = 32 # Max number of IOs that may be in
+ # flight at the same time in one backend
+ # (change requires restart)
+
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e1c4f913f84..2586d1cf53f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1262,6 +1262,7 @@ IntoClause
InvalMessageArray
InvalidationInfo
InvalidationMsgsGroup
+IoMethod
IpcMemoryId
IpcMemoryKey
IpcMemoryState
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0004-aio-Core-AIO-implementation.patch (text/x-diff; charset=us-ascii)
From b64c247210c5a5067b5c76f6ab68c978606b0902 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 9 Dec 2024 14:14:13 -0500
Subject: [PATCH v2 04/20] aio: Core AIO implementation
At this point nothing can use AIO - this commit does not include any
implementation of aio subjects / callbacks. That will come in later commits.
Todo:
- lots of cleanup
---
src/include/storage/aio.h | 296 ++++++
src/include/storage/aio_internal.h | 244 +++++
src/include/storage/aio_ref.h | 24 +
src/include/utils/resowner.h | 5 +
src/backend/access/transam/xact.c | 9 +
src/backend/storage/aio/Makefile | 3 +
src/backend/storage/aio/aio.c | 906 ++++++++++++++++++
src/backend/storage/aio/aio_init.c | 186 +++-
src/backend/storage/aio/aio_io.c | 140 +++
src/backend/storage/aio/aio_subject.c | 231 +++++
src/backend/storage/aio/meson.build | 3 +
src/backend/storage/aio/method_sync.c | 45 +
.../utils/activity/wait_event_names.txt | 3 +
src/backend/utils/resowner/resowner.c | 30 +
src/tools/pgindent/typedefs.list | 18 +
15 files changed, 2139 insertions(+), 4 deletions(-)
create mode 100644 src/include/storage/aio_internal.h
create mode 100644 src/include/storage/aio_ref.h
create mode 100644 src/backend/storage/aio/aio_io.c
create mode 100644 src/backend/storage/aio/aio_subject.c
create mode 100644 src/backend/storage/aio/method_sync.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 0ee9d0043de..b386dabc921 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -15,9 +15,305 @@
#define AIO_H
+#include "storage/aio_ref.h"
+#include "storage/procnumber.h"
#include "utils/guc_tables.h"
+typedef struct PgAioHandle PgAioHandle;
+
+typedef enum PgAioOp
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ PGAIO_OP_INVALID = 0,
+
+ PGAIO_OP_READV,
+ PGAIO_OP_WRITEV,
+
+ /**
+ * In the near term we'll need at least:
+ * - fsync / fdatasync
+ * - flush_range
+ *
+ * Eventually we'll additionally want at least:
+ * - send
+ * - recv
+ * - accept
+ **/
+} PgAioOp;
+
+#define PGAIO_OP_COUNT (PGAIO_OP_WRITEV + 1)
+
+
+/*
+ * On what is IO being performed.
+ *
+ * PgAioSharedCallback specific behaviour should be implemented in
+ * aio_subject.c.
+ */
+typedef enum PgAioSubjectID
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ ASI_INVALID = 0,
+} PgAioSubjectID;
+
+#define ASI_COUNT (ASI_INVALID + 1)
+
+/*
+ * Flags for an IO that can be set with pgaio_io_set_flag().
+ */
+typedef enum PgAioHandleFlags
+{
+ /* hint that IO will be executed synchronously */
+ AHF_SYNCHRONOUS = 1 << 0,
+
+ /* the IO references backend local memory */
+ AHF_REFERENCES_LOCAL = 1 << 1,
+
+ /*
+ * The IO is using buffered IO; used to control heuristics in some IO
+ * methods. Advantageous to set, if applicable, but not required for
+ * correctness.
+ */
+ AHF_BUFFERED = 1 << 2,
+} PgAioHandleFlags;
+
+
+/*
+ * IDs for callbacks that can be registered on an IO.
+ *
+ * Callbacks are identified by an ID rather than a function pointer. There are
+ * two main reasons:
+ *
+ * 1) Memory within PgAioHandle is precious, due to the number of PgAioHandle
+ *    structs in pre-allocated shared memory.
+ *
+ * 2) Due to EXEC_BACKEND, function pointers are not necessarily stable between
+ *    different backends; therefore function pointers cannot directly be stored
+ *    in shared memory.
+ *
+ * Without 2), we could fairly easily allow adding new callbacks, by filling an
+ * ID->pointer mapping table on demand. In the presence of 2), that's still
+ * doable, but harder, because every process has to re-register the pointers
+ * so that a local ID->"backend local pointer" mapping can be maintained.
+ */
+typedef enum PgAioHandleSharedCallbackID
+{
+ ASC_INVALID,
+} PgAioHandleSharedCallbackID;
+
+
+/*
+ * Data necessary for basic IO types (PgAioOp).
+ *
+ * NB: Note that the FDs in here may *not* be relied upon for re-issuing
+ * requests (e.g. for partial reads/writes) - the FD might be from another
+ * process, or closed since. That's not a problem for IOs waiting to be issued,
+ * but only because the queue is flushed when closing an FD.
+ */
+typedef union
+{
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } read;
+
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } write;
+} PgAioOpData;
+
+
+/* XXX: Perhaps it's worth moving this to a dedicated file? */
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+typedef union PgAioSubjectData
+{
+ /* just as an example placeholder for later */
+ struct
+ {
+ uint32 queue_id;
+ } wal;
+} PgAioSubjectData;
+
+
+typedef enum PgAioResultStatus
+{
+ ARS_UNKNOWN, /* not yet completed / uninitialized */
+ ARS_OK,
+ ARS_PARTIAL, /* did not fully succeed, but no error */
+ ARS_ERROR,
+} PgAioResultStatus;
+
+typedef struct PgAioResult
+{
+ /*
+ * This is of type PgAioHandleSharedCallbackID, but can't use a bitfield
+ * of an enum, because some compilers treat enums as signed.
+ */
+ uint32 id:8;
+
+ /* of type PgAioResultStatus, see above */
+ uint32 status:2;
+
+ /* meaning defined by callback->error */
+ uint32 error_data:22;
+
+ int32 result;
+} PgAioResult;
+
+/*
+ * Result of IO operation, visible only to the initiator of IO.
+ */
+typedef struct PgAioReturn
+{
+ PgAioResult result;
+ PgAioSubjectData subject_data;
+} PgAioReturn;
+
+
+typedef struct PgAioSubjectInfo
+{
+ void (*reopen) (PgAioHandle *ioh);
+
+#ifdef NOT_YET
+ char *(*describe_identity) (PgAioHandle *ioh);
+#endif
+
+ const char *name;
+} PgAioSubjectInfo;
+
+
+typedef PgAioResult (*PgAioHandleSharedCallbackComplete) (PgAioHandle *ioh, PgAioResult prior_result);
+typedef void (*PgAioHandleSharedCallbackPrepare) (PgAioHandle *ioh);
+typedef void (*PgAioHandleSharedCallbackError) (PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+typedef struct PgAioHandleSharedCallbacks
+{
+ PgAioHandleSharedCallbackPrepare prepare;
+ PgAioHandleSharedCallbackComplete complete;
+ PgAioHandleSharedCallbackError error;
+} PgAioHandleSharedCallbacks;
+
+
+
+/*
+ * How many callbacks can be registered for one IO handle. Currently we only
+ * need two, but it's not hard to imagine needing a few more.
+ */
+#define AIO_MAX_SHARED_CALLBACKS 4
+
+
+
+/* AIO API */
+
+
+/* --------------------------------------------------------------------------------
+ * IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+struct ResourceOwnerData;
+extern PgAioHandle *pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+extern PgAioHandle *pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+
+extern void pgaio_io_release(PgAioHandle *ioh);
+extern void pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error);
+
+extern void pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior);
+
+extern void pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid);
+extern void pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag);
+
+extern void pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid);
+
+extern void pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len);
+extern void pgaio_io_set_io_data_64(PgAioHandle *ioh, uint64 *data, uint8 len);
+extern uint64 *pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len);
+
+extern void pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op);
+
+extern int pgaio_io_get_id(PgAioHandle *ioh);
+struct iovec;
+extern int pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov);
+extern bool pgaio_io_has_subject(PgAioHandle *ioh);
+
+extern PgAioSubjectData *pgaio_io_get_subject_data(PgAioHandle *ioh);
+extern PgAioOpData *pgaio_io_get_op_data(PgAioHandle *ioh);
+extern ProcNumber pgaio_io_get_owner(PgAioHandle *ioh);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO References
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_ref_clear(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_valid(PgAioHandleRef *ior);
+extern int pgaio_io_ref_get_id(PgAioHandleRef *ior);
+
+
+extern void pgaio_io_ref_wait(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_check_done(PgAioHandleRef *ior);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data,
+ int elevel);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_submit_staged(void);
+extern bool pgaio_have_staged(void);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Low level IO preparation routines
+ *
+ * These will often be called by the lowest-level code involved in initiating
+ * an IO. E.g. bufmgr.c may initiate IO for a buffer, but pgaio_io_prep_readv()
+ * will be called from within fd.c.
+ *
+ * Implemented in aio_io.c
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+extern void pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_closing_fd(int fd);
+extern void pgaio_at_xact_end(bool is_subxact, bool is_commit);
+extern void pgaio_at_error(void);
+
+
/* GUC related */
extern void assign_io_method(int newval, void *extra);
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
new file mode 100644
index 00000000000..d600d45b4fd
--- /dev/null
+++ b/src/include/storage/aio_internal.h
@@ -0,0 +1,244 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_internal.h
+ *	  Declarations for AIO internals that should only be used by the AIO subsystem itself.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INTERNAL_H
+#define AIO_INTERNAL_H
+
+
+#include "lib/ilist.h"
+#include "port/pg_iovec.h"
+#include "storage/aio.h"
+#include "storage/condition_variable.h"
+
+
+#define PGAIO_VERBOSE
+
+
+/* AFIXME */
+#define PGAIO_SUBMIT_BATCH_SIZE 32
+
+
+
+typedef enum PgAioHandleState
+{
+ /* not in use */
+ AHS_IDLE = 0,
+
+ /* returned by pgaio_io_get() */
+ AHS_HANDED_OUT,
+
+ /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+ AHS_DEFINED,
+
+ /* subjects prepare() callback has been called */
+ AHS_PREPARED,
+
+ /* IO is being executed */
+ AHS_IN_FLIGHT,
+
+ /* IO finished, but result has not yet been processed */
+ AHS_REAPED,
+
+ /* IO completed, shared completion has been called */
+ AHS_COMPLETED_SHARED,
+
+ /* IO completed, local completion has been called */
+ AHS_COMPLETED_LOCAL,
+} PgAioHandleState;
+
+
+struct ResourceOwnerData;
+
+/* typedef is in public header */
+struct PgAioHandle
+{
+ /* all state updates should go through pgaio_io_update_state() */
+ PgAioHandleState state:8;
+
+ /* what are we operating on */
+ PgAioSubjectID subject:8;
+
+ /* which operation */
+ PgAioOp op:8;
+
+ /* bitfield of PgAioHandleFlags */
+ uint8 flags;
+
+ uint8 num_shared_callbacks;
+
+ /* using the proper type here would use more space */
+ uint8 shared_callbacks[AIO_MAX_SHARED_CALLBACKS];
+
+ uint8 iovec_data_len;
+
+ /* XXX: could be optimized out with some pointer math */
+ int32 owner_procno;
+
+ /* FIXME: remove in favor of distilled_result */
+ /* raw result of the IO operation */
+ int32 result;
+
+ /* index into PgAioCtl->iovecs */
+ uint32 iovec_off;
+
+ /**
+ * The list in which the handle is registered depends on the state:
+ * - IDLE, in per-backend list
+ * - HANDED_OUT - not in a list
+ * - DEFINED - in per-backend staged list
+ * - PREPARED - in per-backend staged list
+ * - IN_FLIGHT - in issuer's in_flight list
+ * - REAPED - in issuer's in_flight list
+ * - COMPLETED_SHARED - in issuer's in_flight list
+ * - COMPLETED_LOCAL - in issuer's in_flight list
+ *
+ * XXX: It probably makes sense to optimize this out to save on per-IO
+ * memory at the cost of per-backend memory.
+ **/
+ dlist_node node;
+
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+
+ /* incremented every time the IO handle is reused */
+ uint64 generation;
+
+ ConditionVariable cv;
+
+ /* result of shared callback, passed to issuer callback */
+ PgAioResult distilled_result;
+
+ PgAioReturn *report_return;
+
+ PgAioOpData op_data;
+
+ /*
+ * Data necessary for shared completions. Needs to be sufficient to allow
+ * another backend to retry an IO.
+ */
+ PgAioSubjectData scb_data;
+};
+
+
+typedef struct PgAioPerBackend
+{
+ /* index into PgAioCtl->io_handles */
+ uint32 io_handle_off;
+
+ /* IO Handles that currently are not used */
+ dclist_head idle_ios;
+
+ /*
+ * Only one IO may be returned by pgaio_io_get()/pgaio_io_get_nb() without
+ * having been either defined (by actually associating it with an IO) or
+ * released (with pgaio_io_release()). This restriction is necessary to
+ * guarantee that we can always acquire an IO. ->handed_out_io is used to
+ * enforce that rule.
+ */
+ PgAioHandle *handed_out_io;
+
+ /*
+ * IOs that are defined, but not yet submitted.
+ */
+ uint16 num_staged_ios;
+ PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
+
+ /*
+ * List of in-flight IOs. Also contains IOs that aren't, strictly speaking,
+ * in flight anymore, but have been waited for and completed by another
+ * backend. Once this backend sees such an IO, it'll be reclaimed.
+ *
+ * The list is ordered by submission time, with more recently submitted
+ * IOs being appended at the end.
+ */
+ dclist_head in_flight_ios;
+} PgAioPerBackend;
+
+
+typedef struct PgAioCtl
+{
+ int backend_state_count;
+ PgAioPerBackend *backend_state;
+
+ /*
+ * Array of iovec structs. Each iovec is owned by a specific backend. The
+ * allocation is in PgAioCtl to allow the maximum number of iovecs for
+ * individual IOs to be configurable with a PGC_POSTMASTER GUC.
+ */
+ uint64 iovec_count;
+ struct iovec *iovecs;
+
+ /*
+ * For, e.g., an IO covering multiple buffers in shared / temp buffers, we
+ * need to get Buffer IDs during completion to be able to change the
+ * BufferDesc state accordingly. This space can be used to store e.g.
+ * Buffer IDs. Note that the actual iovec might be shorter than this,
+ * because we combine neighboring pages into one larger iovec entry.
+ */
+ uint64 *iovecs_data;
+
+ uint64 io_handle_count;
+ PgAioHandle *io_handles;
+} PgAioCtl;
+
+
+
+/*
+ * The set of callbacks that each IO method must implement.
+ */
+typedef struct IoMethodOps
+{
+ /* global initialization */
+ size_t (*shmem_size) (void);
+ void (*shmem_init) (bool first_time);
+
+ /* per-backend initialization */
+ void (*init_backend) (void);
+
+ /* handling of IOs */
+ bool (*needs_synchronous_execution) (PgAioHandle *ioh);
+ int (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+ void (*wait_one) (PgAioHandle *ioh,
+ uint64 ref_generation);
+} IoMethodOps;
+
+
+extern bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state);
+
+extern void pgaio_io_prepare_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion(PgAioHandle *ioh, int result);
+extern void pgaio_io_prepare_submit(PgAioHandle *ioh);
+
+extern bool pgaio_io_needs_synchronous_execution(PgAioHandle *ioh);
+extern void pgaio_io_perform_synchronously(PgAioHandle *ioh);
+
+extern bool pgaio_io_can_reopen(PgAioHandle *ioh);
+extern void pgaio_io_reopen(PgAioHandle *ioh);
+
+extern const char *pgaio_io_get_subject_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+
+/* Declarations for the tables of function pointers exposed by each IO method. */
+extern const IoMethodOps pgaio_sync_ops;
+
+extern const IoMethodOps *pgaio_impl;
+extern PgAioCtl *aio_ctl;
+extern PgAioPerBackend *my_aio;
+
+
+
+#endif /* AIO_INTERNAL_H */
diff --git a/src/include/storage/aio_ref.h b/src/include/storage/aio_ref.h
new file mode 100644
index 00000000000..ad7e9ad34f3
--- /dev/null
+++ b/src/include/storage/aio_ref.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_ref.h
+ *	  Definition of PgAioHandleRef, which sometimes needs to be used in headers.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_ref.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_REF_H
+#define AIO_REF_H
+
+typedef struct PgAioHandleRef
+{
+ uint32 aio_index;
+ uint32 generation_upper;
+ uint32 generation_lower;
+} PgAioHandleRef;
+
+#endif /* AIO_REF_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index 4e534bc3e70..2d55720a54c 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -164,4 +164,9 @@ struct LOCALLOCK;
extern void ResourceOwnerRememberLock(ResourceOwner owner, struct LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *locallock);
+/* special support for AIO */
+struct dlist_node;
+extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+
#endif /* RESOWNER_H */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3ebd7c40418..0356552c499 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -51,6 +51,7 @@
#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -2475,6 +2476,8 @@ CommitTransaction(void)
AtEOXact_LogicalRepWorkers(true);
pgstat_report_xact_timestamp(0);
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ true);
+
ResourceOwnerDelete(TopTransactionResourceOwner);
s->curTransactionOwner = NULL;
CurTransactionResourceOwner = NULL;
@@ -2988,6 +2991,8 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ false);
+
/*
* State remains TRANS_ABORT until CleanupTransaction().
*/
@@ -5185,6 +5190,8 @@ CommitSubTransaction(void)
AtEOSubXact_PgStat(true, s->nestingLevel);
AtSubCommit_Snapshot(s->nestingLevel);
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ true);
+
/*
* We need to restore the upper transaction's read-only state, in case the
* upper is read-write while the child is read-only; GUC will incorrectly
@@ -5351,6 +5358,8 @@ AbortSubTransaction(void)
AtSubAbort_Snapshot(s->nestingLevel);
}
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ false);
+
/*
* Restore the upper transaction's read-only state, too. This should be
* redundant with GUC's cleanup but we may as well do it for consistency
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index eaeaeeee8e3..b253278f3c1 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -11,6 +11,9 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
aio_init.o \
+ aio_io.o \
+ aio_subject.o \
+ method_sync.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 72110c0df3e..3e2ff9718ca 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -3,6 +3,28 @@
* aio.c
* AIO - Core Logic
*
+ * For documentation about how AIO works on a higher level, including a
+ * schematic example, see README.md.
+ *
+ *
+ * AIO is a complicated subsystem. To keep things navigable it is split across
+ * a number of files:
+ *
+ * - aio.c - core AIO state handling
+ *
+ * - aio_init.c - initialization
+ *
+ * - aio_io.c - dealing with actual IO, including executing IOs synchronously
+ *
+ * - aio_subject.c - functionality related to executing IO for different
+ * subjects
+ *
+ * - method_*.c - different ways of executing AIO
+ *
+ * - read_stream.c - helper for accessing buffered relation data with
+ * look-ahead
+ *
+ *
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -14,7 +36,22 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "utils/resowner.h"
+#include "utils/wait_event_types.h"
+
+
+
+static inline void pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state);
+static void pgaio_io_reclaim(PgAioHandle *ioh);
+static void pgaio_io_resowner_register(PgAioHandle *ioh);
+static void pgaio_io_wait_for_free(void);
+static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation);
+
/* Options for io_method. */
@@ -27,7 +64,876 @@ int io_method = DEFAULT_IO_METHOD;
int io_max_concurrency = -1;
+/* global control for AIO */
+PgAioCtl *aio_ctl;
+
+/* current backend's per-backend state */
+PgAioPerBackend *my_aio;
+
+
+static const IoMethodOps *pgaio_ops_table[] = {
+ [IOMETHOD_SYNC] = &pgaio_sync_ops,
+};
+
+
+const IoMethodOps *pgaio_impl;
+
+
+
+/* --------------------------------------------------------------------------------
+ * "Core" IO Api
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Acquire an AioHandle, waiting for IO completion if necessary.
+ *
+ * Each backend can only have one AIO handle that has been "handed out"
+ * to code, but not yet submitted or released. This restriction is necessary
+ * to ensure that it is possible for code to wait for an unused handle by
+ * waiting for in-flight IO to complete. There is a limited number of handles
+ * in each backend; if multiple handles could be handed out without being
+ * submitted, waiting for all in-flight IO to complete would not guarantee
+ * that handles free up.
+ *
+ * It is cheap to acquire an IO handle, unless all handles are in use. In that
+ * case this function waits for the oldest IO to complete. In case that is not
+ * desirable, see pgaio_io_get_nb().
+ *
+ * If a handle was acquired but then does not turn out to be needed,
+ * e.g. because pgaio_io_get() is called before starting an IO in a critical
+ * section, the handle needs to be released with pgaio_io_release().
+ *
+ *
+ * To react to the IO's completion as soon as it is known to have
+ * completed, callbacks can be registered with pgaio_io_add_shared_cb().
+ *
+ * To actually execute IO using the returned handle, the pgaio_io_prep_*()
+ * family of functions is used. In many cases the pgaio_io_prep_*() call will
+ * not be done directly by code that acquired the handle, but by lower level
+ * code that gets passed the handle. E.g. if code in bufmgr.c wants to perform
+ * AIO, it typically will pass the handle to smgr., which will pass it on to
+ * md.c, on to fd.c, which then finally calls pgaio_io_prep_*(). This
+ * forwarding allows the various layers to react to the IO's completion by
+ * registering callbacks. These callbacks in turn can translate a lower
+ * layer's result into a result understandable by a higher layer.
+ *
+ * Once pgaio_io_prep_*() is called, the IO may be in the process of being
+ * executed and might even complete before the functions return. That is,
+ * however, not guaranteed, to allow IO submission to be batched. To guarantee
+ * IO submission pgaio_submit_staged() needs to be called.
+ *
+ * After pgaio_io_prep_*() the AioHandle is "consumed" and may not be
+ * referenced by the IO issuing code. To e.g. wait for IO, references to the
+ * IO can be established with pgaio_io_get_ref() *before* pgaio_io_prep_*() is
+ * called. pgaio_io_ref_wait() can be used to wait for the IO to complete.
+ *
+ *
+ * To know if the IO [partially] succeeded or failed, a PgAioReturn * can be
+ * passed to pgaio_io_get(). Once the issuing backend has called
+ * pgaio_io_ref_wait(), the PgAioReturn contains information about whether the
+ * operation succeeded and details about the first failure, if any. The error
+ * can be raised / logged with pgaio_result_log().
+ *
+ * The lifetime of the memory pointed to by *ret needs to be at least as long
+ * as that of the passed-in resowner. If the resowner releases resources before the IO
+ * completes, the reference to *ret will be cleared.
+ */
+PgAioHandle *
+pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ PgAioHandle *h;
+
+ while (true)
+ {
+ h = pgaio_io_get_nb(resowner, ret);
+
+ if (h != NULL)
+ return h;
+
+ /*
+ * Evidently all handles by this backend are in use. Just wait for
+ * some to complete.
+ */
+ pgaio_io_wait_for_free();
+ }
+}
+
+/*
+ * Acquire an AioHandle, returning NULL if no handles are free.
+ *
+ * See pgaio_io_get(). The only difference is that this function will return
+ * NULL if there are no idle handles, instead of blocking.
+ */
+PgAioHandle *
+pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ if (my_aio->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
+ {
+ Assert(my_aio->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
+ pgaio_submit_staged();
+ }
+
+ if (my_aio->handed_out_io)
+ {
+ ereport(ERROR,
+ errmsg("API violation: Only one IO can be handed out"));
+ }
+
+ if (!dclist_is_empty(&my_aio->idle_ios))
+ {
+ dlist_node *ion = dclist_pop_head_node(&my_aio->idle_ios);
+ PgAioHandle *ioh = dclist_container(PgAioHandle, node, ion);
+
+ Assert(ioh->state == AHS_IDLE);
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ pgaio_io_update_state(ioh, AHS_HANDED_OUT);
+ my_aio->handed_out_io = ioh;
+
+ if (resowner)
+ pgaio_io_resowner_register(ioh);
+
+ if (ret)
+ {
+ ioh->report_return = ret;
+ ret->result.status = ARS_UNKNOWN;
+ }
+
+ return ioh;
+ }
+
+ return NULL;
+}
+
+/*
+ * Release IO handle that turned out to not be required.
+ *
+ * See pgaio_io_get() for more details.
+ */
+void
+pgaio_io_release(PgAioHandle *ioh)
+{
+ if (ioh == my_aio->handed_out_io)
+ {
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(ioh->resowner);
+
+ my_aio->handed_out_io = NULL;
+ pgaio_io_reclaim(ioh);
+ }
+ else
+ {
+ elog(ERROR, "release in unexpected state");
+ }
+}
+
+/*
+ * Release IO handle during resource owner cleanup.
+ */
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+ PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+ Assert(ioh->resowner);
+
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ elog(ERROR, "unexpected idle IO handle during resowner cleanup");
+ break;
+ case AHS_HANDED_OUT:
+ Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
+
+ if (ioh == my_aio->handed_out_io)
+ {
+ my_aio->handed_out_io = NULL;
+ if (!on_error)
+ elog(WARNING, "leaked AIO handle");
+ }
+
+ pgaio_io_reclaim(ioh);
+ break;
+ case AHS_DEFINED:
+ case AHS_PREPARED:
+ /* XXX: Should we warn about this when on_error is false? */
+ pgaio_submit_staged();
+ break;
+ case AHS_IN_FLIGHT:
+ case AHS_REAPED:
+ case AHS_COMPLETED_SHARED:
+ /* this is expected to happen */
+ break;
+ case AHS_COMPLETED_LOCAL:
+ /* XXX: unclear if this ought to be possible? */
+ pgaio_io_reclaim(ioh);
+ break;
+ }
+
+ /*
+ * Need to unregister the reporting of the IO's result, the memory it's
+ * referencing likely has gone away.
+ */
+ if (ioh->report_return)
+ ioh->report_return = NULL;
+}
+
+/*
+ * Return the iovec allocated for the IO, and the maximum number of entries
+ * the caller may fill in.
+ */
+int
+pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* AFIXME: Needs to be the value at startup time */
+ return io_combine_limit;
+}
+
+PgAioSubjectData *
+pgaio_io_get_subject_data(PgAioHandle *ioh)
+{
+ return &ioh->scb_data;
+}
+
+PgAioOpData *
+pgaio_io_get_op_data(PgAioHandle *ioh)
+{
+ return &ioh->op_data;
+}
+
+ProcNumber
+pgaio_io_get_owner(PgAioHandle *ioh)
+{
+ return ioh->owner_procno;
+}
+
+bool
+pgaio_io_has_subject(PgAioHandle *ioh)
+{
+ return ioh->subject != ASI_INVALID;
+}
+
+void
+pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->flags |= flag;
+}
+
+/*
+ * Associate an array of data with the IO, stored in the handle's iovec data
+ * slots. The values are widened to uint64 and can later be retrieved with
+ * pgaio_io_get_io_data().
+ */
+void
+pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ for (int i = 0; i < len; i++)
+ aio_ctl->iovecs_data[ioh->iovec_off + i] = data[i];
+ ioh->iovec_data_len = len;
+}
+
+uint64 *
+pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len)
+{
+ Assert(ioh->iovec_data_len > 0);
+
+ *len = ioh->iovec_data_len;
+
+ return &aio_ctl->iovecs_data[ioh->iovec_off];
+}
+
+void
+pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->subject = subjid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, set subject",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh));
+}
+
+void
+pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior)
+{
+ Assert(ioh->state == AHS_HANDED_OUT ||
+ ioh->state == AHS_DEFINED ||
+ ioh->state == AHS_PREPARED);
+ Assert(ioh->generation != 0);
+
+ ior->aio_index = ioh - aio_ctl->io_handles;
+ ior->generation_upper = (uint32) (ioh->generation >> 32);
+ ior->generation_lower = (uint32) ioh->generation;
+}
+
+void
+pgaio_io_ref_clear(PgAioHandleRef *ior)
+{
+ ior->aio_index = PG_UINT32_MAX;
+}
+
+bool
+pgaio_io_ref_valid(PgAioHandleRef *ior)
+{
+ return ior->aio_index != PG_UINT32_MAX;
+}
+
+int
+pgaio_io_ref_get_id(PgAioHandleRef *ior)
+{
+ Assert(pgaio_io_ref_valid(ior));
+ return ior->aio_index;
+}
+
+bool
+pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
+{
+ *state = ioh->state;
+ pg_read_barrier();
+
+ return ioh->generation != ref_generation;
+}
+
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ if (am_owner)
+ {
+ if (state == AHS_DEFINED || state == AHS_PREPARED)
+ {
+ /* XXX: Arguably this should be prevented by callers? */
+ pgaio_submit_staged();
+ }
+ else if (state != AHS_IN_FLIGHT
+ && state != AHS_REAPED
+ && state != AHS_COMPLETED_SHARED
+ && state != AHS_COMPLETED_LOCAL)
+ {
+ elog(PANIC, "waiting for own IO in wrong state: %d",
+ state);
+ }
+
+ /*
+ * Somebody else completed the IO, need to execute issuer callback, so
+ * reclaim eagerly.
+ */
+ if (state == AHS_COMPLETED_LOCAL)
+ {
+ pgaio_io_reclaim(ioh);
+
+ return;
+ }
+ }
+
+ while (true)
+ {
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ switch (state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ elog(ERROR, "IO in wrong state: %d", state);
+ break;
+
+ case AHS_IN_FLIGHT:
+ /*
+ * If we need to wait via the IO method, do so now. Don't
+ * check via the IO method if the issuing backend is executing
+ * the IO synchronously.
+ */
+ if (pgaio_impl->wait_one && !(ioh->flags & AHF_SYNCHRONOUS))
+ {
+ pgaio_impl->wait_one(ioh, ref_generation);
+ continue;
+ }
+ /* fallthrough */
+
+ /* waiting for owner to submit */
+ case AHS_PREPARED:
+ case AHS_DEFINED:
+ /* waiting for reaper to complete */
+ /* fallthrough */
+ case AHS_REAPED:
+ /* shouldn't be able to hit this otherwise */
+ Assert(IsUnderPostmaster);
+ /* ensure we're going to get woken up */
+ ConditionVariablePrepareToSleep(&ioh->cv);
+
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
+ {
+ if (state != AHS_DEFINED && state != AHS_PREPARED &&
+ state != AHS_IN_FLIGHT && state != AHS_REAPED)
+ break;
+ ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_COMPLETION);
+ }
+
+ ConditionVariableCancelSleep();
+ break;
+
+ case AHS_COMPLETED_SHARED:
+ /* see above */
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return;
+ case AHS_COMPLETED_LOCAL:
+ return;
+ }
+ }
+}
+
+/*
+ * Check if the referenced IO completed, without blocking.
+ */
+bool
+pgaio_io_ref_check_done(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return true;
+
+ if (state == AHS_IDLE)
+ return true;
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (state == AHS_COMPLETED_SHARED || state == AHS_COMPLETED_LOCAL)
+ {
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return true;
+ }
+
+ return false;
+}
+
+int
+pgaio_io_get_id(PgAioHandle *ioh)
+{
+ Assert(ioh >= aio_ctl->io_handles &&
+ ioh < (aio_ctl->io_handles + aio_ctl->io_handle_count));
+ return ioh - aio_ctl->io_handles;
+}
+
+const char *
+pgaio_io_get_state_name(PgAioHandle *ioh)
+{
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ return "idle";
+ case AHS_HANDED_OUT:
+ return "handed_out";
+ case AHS_DEFINED:
+ return "defined";
+ case AHS_PREPARED:
+ return "prepared";
+ case AHS_IN_FLIGHT:
+ return "in_flight";
+ case AHS_REAPED:
+ return "reaped";
+ case AHS_COMPLETED_SHARED:
+ return "completed_shared";
+ case AHS_COMPLETED_LOCAL:
+ return "completed_local";
+ }
+ pg_unreachable();
+}
+
+/*
+ * Internal, should only be called from pgaio_io_prep_*().
+ */
+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+ bool needs_synchronous;
+
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+
+ ioh->op = op;
+ ioh->result = 0;
+
+ pgaio_io_update_state(ioh, AHS_DEFINED);
+
+ /* allow a new IO to be staged */
+ my_aio->handed_out_io = NULL;
+
+ pgaio_io_prepare_subject(ioh);
+
+ pgaio_io_update_state(ioh, AHS_PREPARED);
+
+ needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
+
+ elog(DEBUG3, "io:%d: prepared %s, executed synchronously: %d",
+ pgaio_io_get_id(ioh), pgaio_io_get_op_name(ioh),
+ needs_synchronous);
+
+ if (!needs_synchronous)
+ {
+ my_aio->staged_ios[my_aio->num_staged_ios++] = ioh;
+ Assert(my_aio->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+ }
+ else
+ {
+ pgaio_io_prepare_submit(ioh);
+ pgaio_io_perform_synchronously(ioh);
+ }
+}
+
+/*
+ * Handle IO getting completed by a method.
+ */
+void
+pgaio_io_process_completion(PgAioHandle *ioh, int result)
+{
+ Assert(ioh->state == AHS_IN_FLIGHT);
+
+ ioh->result = result;
+
+ pgaio_io_update_state(ioh, AHS_REAPED);
+
+ pgaio_io_process_completion_subject(ioh);
+
+ pgaio_io_update_state(ioh, AHS_COMPLETED_SHARED);
+
+ /* condition variable broadcast ensures state is visible before wakeup */
+ ConditionVariableBroadcast(&ioh->cv);
+
+ if (ioh->owner_procno == MyProcNumber)
+ pgaio_io_reclaim(ioh);
+}
+
+bool
+pgaio_io_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ if (ioh->flags & AHF_SYNCHRONOUS)
+ {
+ /* XXX: should we also check if there are other IOs staged? */
+ return true;
+ }
+
+ if (pgaio_impl->needs_synchronous_execution)
+ return pgaio_impl->needs_synchronous_execution(ioh);
+ return false;
+}
+
+/*
+ * Handle IO being processed by IO method.
+ */
+void
+pgaio_io_prepare_submit(PgAioHandle *ioh)
+{
+ pgaio_io_update_state(ioh, AHS_IN_FLIGHT);
+
+ dclist_push_tail(&my_aio->in_flight_ios, &ioh->node);
+}
+
+static inline void
+pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state)
+{
+ /*
+ * Ensure the changes signified by the new state are visible before the
+ * new state becomes visible. Pairs with the read barrier in
+ * pgaio_io_was_recycled().
+ */
+ pg_write_barrier();
+
+ ioh->state = new_state;
+}
+
+static PgAioHandle *
+pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation)
+{
+ PgAioHandle *ioh;
+
+ Assert(ior->aio_index < aio_ctl->io_handle_count);
+
+ ioh = &aio_ctl->io_handles[ior->aio_index];
+
+ *ref_generation = ((uint64) ior->generation_upper) << 32 |
+ ior->generation_lower;
+
+ Assert(*ref_generation != 0);
+
+ return ioh;
+}
+
+static void
+pgaio_io_resowner_register(PgAioHandle *ioh)
+{
+ Assert(!ioh->resowner);
+ Assert(CurrentResourceOwner);
+
+ ResourceOwnerRememberAioHandle(CurrentResourceOwner, &ioh->resowner_node);
+ ioh->resowner = CurrentResourceOwner;
+}
+
+static void
+pgaio_io_reclaim(PgAioHandle *ioh)
+{
+ /* This is only ok if it's our IO */
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ ereport(DEBUG3,
+ errmsg("reclaiming io:%d, state: %s, op %s, subject %s, result: %d, distilled_result: AFIXME, report to: %p",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_state_name(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->result,
+ ioh->report_return
+ ),
+ errhidestmt(true), errhidecontext(true));
+
+ /* if the IO has been defined, we might need to do more work */
+ if (ioh->state != AHS_HANDED_OUT)
+ {
+ dclist_delete_from(&my_aio->in_flight_ios, &ioh->node);
+
+ if (ioh->report_return)
+ {
+ ioh->report_return->result = ioh->distilled_result;
+ ioh->report_return->subject_data = ioh->scb_data;
+ }
+ }
+
+ if (ioh->resowner)
+ {
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+ }
+
+ Assert(!ioh->resowner);
+
+ ioh->num_shared_callbacks = 0;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->flags = 0;
+
+ /* XXX: the barrier is probably superfluous */
+ pg_write_barrier();
+ ioh->generation++;
+
+ pgaio_io_update_state(ioh, AHS_IDLE);
+
+ /*
+ * We push the IO to the head of the idle IO list, which seems more
+ * cache-efficient in cases where only a few IOs are used.
+ */
+ dclist_push_head(&my_aio->idle_ios, &ioh->node);
+}
+
+static void
+pgaio_io_wait_for_free(void)
+{
+ int reclaimed = 0;
+
+ elog(DEBUG2,
+ "waiting for self: %d pending",
+ my_aio->num_staged_ios);
+
+ /*
+ * First check if any of our IOs have actually completed - with
+ * io_method=worker that'll often be the case. We could do this as part
+ * of the loop below, but then we might end up blocking on one specific
+ * IO even though others have already completed.
+ */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[my_aio->io_handle_off + i];
+
+ if (ioh->state == AHS_COMPLETED_SHARED)
+ {
+ pgaio_io_reclaim(ioh);
+ reclaimed++;
+ }
+ }
+
+ if (reclaimed > 0)
+ return;
+
+ /*
+ * If we have any unsubmitted IOs, submit them now. We are about to
+ * wait, so it's better they're in flight. This also addresses the
+ * edge case of all IOs being unsubmitted.
+ */
+ if (my_aio->num_staged_ios > 0)
+ {
+ elog(DEBUG2, "submitting while acquiring free io");
+ pgaio_submit_staged();
+ }
+
+ /*
+ * It's possible that we recognized there were free IOs while
+ * submitting, e.g. because IOs completed and were reclaimed during
+ * submission.
+ */
+ if (dclist_count(&my_aio->idle_ios) > 0)
+ return;
+
+ /* if nothing is in flight, waiting cannot succeed */
+ if (dclist_count(&my_aio->in_flight_ios) == 0)
+ {
+ elog(ERROR, "no free IOs despite no in-flight IOs");
+ }
+
+ /*
+ * Wait for the oldest in-flight IO to complete.
+ *
+ * XXX: Reusing the general IO wait is suboptimal, we don't need to wait
+ * for that specific IO to complete, we just need *any* IO to complete.
+ */
+ {
+ PgAioHandle *ioh = dclist_head_element(PgAioHandle, node, &my_aio->in_flight_ios);
+
+ switch (ioh->state)
+ {
+ /* should not be in in-flight list */
+ case AHS_IDLE:
+ case AHS_DEFINED:
+ case AHS_HANDED_OUT:
+ case AHS_PREPARED:
+ case AHS_COMPLETED_LOCAL:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d",
+ pgaio_io_get_id(ioh));
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ /* it's possible that another backend just finished this IO */
+ pgaio_io_reclaim(ioh);
+ break;
+ }
+
+ if (dclist_count(&my_aio->idle_ios) == 0)
+ elog(PANIC, "no idle IOs after waiting");
+ return;
+ }
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_submit_staged(void)
+{
+ int total_submitted = 0;
+ int did_submit;
+
+ if (my_aio->num_staged_ios == 0)
+ return;
+
+ START_CRIT_SECTION();
+
+ did_submit = pgaio_impl->submit(my_aio->num_staged_ios, my_aio->staged_ios);
+
+ END_CRIT_SECTION();
+
+ total_submitted += did_submit;
+
+ Assert(total_submitted == did_submit);
+
+ my_aio->num_staged_ios = 0;
+
+#ifdef PGAIO_VERBOSE
+ ereport(DEBUG2,
+ errmsg("submitted %d IOs",
+ total_submitted),
+ errhidestmt(true),
+ errhidecontext(true));
+#endif
+}
+
+bool
+pgaio_have_staged(void)
+{
+ return my_aio->num_staged_ios > 0;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Before closing an fd, submit any staged-but-not-yet-submitted IOs that
+ * reference it - otherwise the IO would end up targeting something bogus.
+ */
+void
+pgaio_closing_fd(int fd)
+{
+ /*
+ * Might be called before AIO is initialized or in a subprocess that
+ * doesn't use AIO.
+ */
+ if (!my_aio)
+ return;
+
+ /*
+ * For now just submit all staged IOs - we could be more selective, but
+ * it's probably not worth it.
+ */
+ pgaio_submit_staged();
+}
+
+void
+pgaio_at_xact_end(bool is_subxact, bool is_commit)
+{
+ Assert(!my_aio->handed_out_io);
+}
+
+/*
+ * Similar to pgaio_at_xact_end(..., is_commit = false), but for cases where
+ * errors happen outside of transactions.
+ */
+void
+pgaio_at_error(void)
+{
+ Assert(!my_aio->handed_out_io);
+}
+
+
void
assign_io_method(int newval, void *extra)
{
+ pgaio_impl = pgaio_ops_table[newval];
}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 84e0e37baae..b9bdf51680a 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -14,28 +14,206 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/aio_init.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+static Size
+AioCtlShmemSize(void)
+{
+ Size sz;
+
+ /* aio_ctl itself */
+ sz = offsetof(PgAioCtl, io_handles);
+
+ return sz;
+}
+
+static uint32
+AioProcs(void)
+{
+ return MaxBackends + NUM_AUXILIARY_PROCS;
+}
+
+static Size
+AioBackendShmemSize(void)
+{
+ return mul_size(AioProcs(), sizeof(PgAioPerBackend));
+}
+
+static Size
+AioHandleShmemSize(void)
+{
+ Size sz;
+
+ /* ios */
+ sz = mul_size(AioProcs(),
+ mul_size(io_max_concurrency, sizeof(PgAioHandle)));
+
+ return sz;
+}
+
+static Size
+AioIOVShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(struct iovec),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioIOVDataShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(uint64),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+/*
+ * Choose a suitable value for io_max_concurrency.
+ *
+ * It's unlikely that we could have more IOs in flight than buffers that we
+ * would be allowed to pin.
+ *
+ * On the upper end, apply a cap too - just because shared_buffers is large,
+ * it doesn't make sense to have millions of buffers undergo IO concurrently.
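+ *
+ * For example (illustrative numbers): with NBuffers = 16384 and ~110
+ * backends/aux processes, 16384 / 110 = 148 exceeds the cap and is limited
+ * to 64; with a very small shared_buffers the result can drop to 1.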
+ */
+static int
+AioChooseMaxConccurrency(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 64);
+}
+
Size
AioShmemSize(void)
{
Size sz = 0;
+ /*
+ * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
+ * However, if the DBA explicitly set io_max_concurrency = -1 in the config
+ * file, then PGC_S_DYNAMIC_DEFAULT will fail to override that and we must
+ * force the matter with PGC_S_OVERRIDE.
+ */
+ if (io_max_concurrency == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseMaxConccurrency());
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_max_concurrency == -1) /* failed to apply it? */
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+ sz = add_size(sz, AioCtlShmemSize());
+ sz = add_size(sz, AioBackendShmemSize());
+ sz = add_size(sz, AioHandleShmemSize());
+ sz = add_size(sz, AioIOVShmemSize());
+ sz = add_size(sz, AioIOVDataShmemSize());
+
+ if (pgaio_impl->shmem_size)
+ sz = add_size(sz, pgaio_impl->shmem_size());
+
return sz;
}
void
AioShmemInit(void)
{
+ bool found;
+ uint32 io_handle_off = 0;
+ uint32 iovec_off = 0;
+ uint32 per_backend_iovecs = io_max_concurrency * io_combine_limit;
+
+ aio_ctl = (PgAioCtl *)
+ ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
+
+ if (found)
+ goto out;
+
+ memset(aio_ctl, 0, AioCtlShmemSize());
+
+ aio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
+ aio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+
+ aio_ctl->backend_state = (PgAioPerBackend *)
+ ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
+
+ aio_ctl->io_handles = (PgAioHandle *)
+ ShmemInitStruct("AioHandle", AioHandleShmemSize(), &found);
+
+ aio_ctl->iovecs = ShmemInitStruct("AioIOV", AioIOVShmemSize(), &found);
+ aio_ctl->iovecs_data = ShmemInitStruct("AioIOVData", AioIOVDataShmemSize(), &found);
+
+ for (int procno = 0; procno < AioProcs(); procno++)
+ {
+ PgAioPerBackend *bs = &aio_ctl->backend_state[procno];
+
+ bs->io_handle_off = io_handle_off;
+ io_handle_off += io_max_concurrency;
+
+ dclist_init(&bs->idle_ios);
+ memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+ dclist_init(&bs->in_flight_ios);
+
+ /* initialize per-backend IOs */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[bs->io_handle_off + i];
+
+ ioh->generation = 1;
+ ioh->owner_procno = procno;
+ ioh->iovec_off = iovec_off;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->resowner = NULL;
+ ioh->num_shared_callbacks = 0;
+ ioh->distilled_result.status = ARS_UNKNOWN;
+ ioh->flags = 0;
+
+ ConditionVariableInit(&ioh->cv);
+
+ dclist_push_tail(&bs->idle_ios, &ioh->node);
+ iovec_off += io_combine_limit;
+ }
+ }
+
+out:
+ /* Initialize IO method specific resources. */
+ if (pgaio_impl->shmem_init)
+ pgaio_impl->shmem_init(!found);
}
void
pgaio_init_backend(void)
{
-}
+ /* shouldn't be initialized twice */
+ Assert(!my_aio);
+
+ if (MyProc == NULL || MyProcNumber >= AioProcs())
+ elog(ERROR, "aio requires a normal PGPROC");
+
+ my_aio = &aio_ctl->backend_state[MyProcNumber];
-void
-pgaio_postmaster_child_init_local(void)
-{
+ if (pgaio_impl->init_backend)
+ pgaio_impl->init_backend();
}
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
new file mode 100644
index 00000000000..3c255775833
--- /dev/null
+++ b/src/backend/storage/aio/aio_io.c
@@ -0,0 +1,140 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_io.c
+ * AIO - Low Level IO Handling
+ *
+ * Functions related to associating IO operations with IO handles, as well as
+ * IO-method independent support functions for actually performing IO.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_io.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "utils/wait_event.h"
+
+
+static void pgaio_io_before_prep(PgAioHandle *ioh);
+
+
+
+/* --------------------------------------------------------------------------------
+ * "Preparation" routines for individual IO types
+ *
+ * These are called by place the place actually initiating an IO, to associate
+ * the IO specific data with an AIO handle.
+ *
+ * Each of the preparation routines first needs to call
+ * pgaio_io_before_prep(), then fill IO specific fields in the handle and then
+ * finally call pgaio_io_prepare().
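+ *
+ * A hypothetical caller-side sketch (buf, fd and offset are illustrative,
+ * not identifiers from this patch):
+ *
+ *   struct iovec *iov;
+ *
+ *   (void) pgaio_io_get_iovec(ioh, &iov);
+ *   iov[0].iov_base = buf;
+ *   iov[0].iov_len = BLCKSZ;
+ *   pgaio_io_prep_readv(ioh, fd, 1, offset);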
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.read.fd = fd;
+ ioh->op_data.read.offset = offset;
+ ioh->op_data.read.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_READV);
+}
+
+void
+pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.write.fd = fd;
+ ioh->op_data.write.offset = offset;
+ ioh->op_data.write.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_WRITEV);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Functions implementing IO handle operations that are directly related to IO
+ * operations.
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Execute IO operation synchronously. This is implemented here, not in
+ * method_sync.c, because other IO methods might also use it / fall back to
+ * it.
+ */
+void
+pgaio_io_perform_synchronously(PgAioHandle *ioh)
+{
+ ssize_t result = 0;
+ struct iovec *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* Perform IO. */
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READV:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_READ);
+ result = pg_preadv(ioh->op_data.read.fd, iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_WRITEV:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_WRITE);
+ result = pg_pwritev(ioh->op_data.write.fd, iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_INVALID:
+ elog(ERROR, "trying to execute invalid IO operation");
+ }
+
+ ioh->result = result < 0 ? -errno : result;
+
+ pgaio_io_process_completion(ioh, ioh->result);
+}
+
+const char *
+pgaio_io_get_op_name(PgAioHandle *ioh)
+{
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_INVALID:
+ return "invalid";
+ case PGAIO_OP_READV:
+ return "read";
+ case PGAIO_OP_WRITEV:
+ return "write";
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * Helper function to be called by IO operation preparation functions, before
+ * any data in the handle is set. Mostly to centralize assertions.
+ */
+static void
+pgaio_io_before_prep(PgAioHandle *ioh)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+}
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
new file mode 100644
index 00000000000..8694cfafcd1
--- /dev/null
+++ b/src/backend/storage/aio/aio_subject.c
@@ -0,0 +1,231 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_subject.c
+ * AIO - Functionality related to executing IO for different subjects
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_subject.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Registry for entities that can be the target of AIO.
+ *
+ * To support execution by worker processes, the file descriptor for an IO
+ * may need to be reopened in a different process. This is done via the
+ * PgAioSubjectInfo.reopen callback.
+ */
+static const PgAioSubjectInfo *aio_subject_info[] = {
+ [ASI_INVALID] = &(PgAioSubjectInfo) {
+ .name = "invalid",
+ },
+};
+
+
+typedef struct PgAioHandleSharedCallbacksEntry
+{
+ const PgAioHandleSharedCallbacks *const cb;
+ const char *const name;
+} PgAioHandleSharedCallbacksEntry;
+
+static const PgAioHandleSharedCallbacksEntry aio_shared_cbs[] = {
+#define CALLBACK_ENTRY(id, callback) [id] = {.cb = &callback, .name = #callback}
+#undef CALLBACK_ENTRY
+};
+
+
+/*
+ * Register callback for the IO handle.
+ *
+ * Only a limited number (AIO_MAX_SHARED_CALLBACKS) of callbacks can be
+ * registered for each IO.
+ *
+ * Callbacks need to be registered before [indirectly] calling
+ * pgaio_io_prep_*(), as the IO may be executed immediately.
+ *
+ *
+ * Note that callbacks are executed in critical sections. This is necessary
+ * to be able to execute IO in critical sections (consider e.g. WAL
+ * logging). To perform AIO we first need to acquire a handle, which, if
+ * there are no free handles, requires waiting for IOs to complete and
+ * executing their completion callbacks.
+ *
+ * Callbacks may be executed in the issuing backend but also in another
+ * backend (because that backend is waiting for the IO) or in IO workers (if
+ * io_method=worker is used).
+ *
+ *
+ * See PgAioHandleSharedCallbackID's definition for an explanation for why
+ * callbacks are not identified by a pointer.
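+ *
+ * A hypothetical registration sketch (ASI_SMGR and ASC_MD_READV are
+ * illustrative names, this patch so far only defines ASI_INVALID):
+ *
+ *   pgaio_io_set_subject(ioh, ASI_SMGR);
+ *   pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
+ *   pgaio_io_prep_readv(ioh, fd, iovcnt, offset);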
+ */
+void
+pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid)
+{
+ const PgAioHandleSharedCallbacksEntry *ce;
+
+ if (cbid >= lengthof(aio_shared_cbs))
+ elog(ERROR, "callback %d is out of range", cbid);
+ ce = &aio_shared_cbs[cbid];
+ if (ce->cb->complete == NULL)
+ elog(ERROR, "callback %d is undefined", cbid);
+ if (ioh->num_shared_callbacks >= AIO_MAX_SHARED_CALLBACKS)
+ elog(PANIC, "too many callbacks, the max is %d", AIO_MAX_SHARED_CALLBACKS);
+ ioh->shared_callbacks[ioh->num_shared_callbacks] = cbid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, adding cb #%d, id %d/%s",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->num_shared_callbacks + 1,
+ cbid, ce->name);
+
+ ioh->num_shared_callbacks++;
+}
+
+/*
+ * Return the name for the subject associated with the IO. Mostly useful for
+ * debugging/logging.
+ */
+const char *
+pgaio_io_get_subject_name(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+
+ return aio_subject_info[ioh->subject]->name;
+}
+
+/*
+ * Internal function which invokes ->prepare for all the registered callbacks.
+ */
+void
+pgaio_io_prepare_subject(PgAioHandle *ioh)
+{
+ Assert(ioh->subject > ASI_INVALID && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleSharedCallbacksEntry *ce = &aio_shared_cbs[cbid];
+
+ if (!ce->cb->prepare)
+ continue;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cb #%d %d/%s->prepare",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i,
+ cbid, ce->name);
+ ce->cb->prepare(ioh);
+ }
+}
+
+/*
+ * Internal function which invokes ->complete for all the registered
+ * callbacks.
+ */
+void
+pgaio_io_process_completion_subject(PgAioHandle *ioh)
+{
+ PgAioResult result;
+
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ result.status = ARS_OK; /* low level IO is always considered OK */
+ result.result = ioh->result;
+ result.id = ASC_INVALID;
+ result.error_data = 0;
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleSharedCallbacksEntry *ce = &aio_shared_cbs[cbid];
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cb #%d, id %d/%s->complete with distilled result status %d, id %u, error_data: %d, result: %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i,
+ cbid, ce->name,
+ result.status,
+ result.id,
+ result.error_data,
+ result.result);
+ result = ce->cb->complete(ioh, result);
+ }
+
+ ioh->distilled_result = result;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, distilled result status %d, id %u, error_data: %d, result: %d, raw_result %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ result.status,
+ result.id,
+ result.error_data,
+ result.result,
+ ioh->result);
+}
+
+/*
+ * Check if pgaio_io_reopen() is available for the IO.
+ */
+bool
+pgaio_io_can_reopen(PgAioHandle *ioh)
+{
+ return aio_subject_info[ioh->subject]->reopen != NULL;
+}
+
+/*
+ * Before executing an IO outside of the context of the process the IO has
+ * been prepared in, the file descriptor has to be reopened - any FD
+ * referenced in the IO itself, won't be valid in the separate process.
+ */
+void
+pgaio_io_reopen(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ aio_subject_info[ioh->subject]->reopen(ioh);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ PgAioHandleSharedCallbackID cbid = result.id;
+ const PgAioHandleSharedCallbacksEntry *ce = &aio_shared_cbs[cbid];
+
+ Assert(result.status != ARS_UNKNOWN);
+ Assert(result.status != ARS_OK);
+
+ if (ce->cb->error == NULL)
+ elog(ERROR, "scb id %d/%s does not have an error callback",
+ result.id, ce->name);
+
+ ce->cb->error(result, subject_data, elevel);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8d20759ebf8..8339d473aae 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -3,5 +3,8 @@
backend_sources += files(
'aio.c',
'aio_init.c',
+ 'aio_io.c',
+ 'aio_subject.c',
+ 'method_sync.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_sync.c b/src/backend/storage/aio/method_sync.c
new file mode 100644
index 00000000000..61fd06a277b
--- /dev/null
+++ b/src/backend/storage/aio/method_sync.c
@@ -0,0 +1,45 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_sync.c
+ * AIO - perform "AIO" by executing it synchronously
+ *
+ * This method mainly exists to check whether AIO use causes regressions. Other IO
+ * methods might also fall back to the synchronous method for functionality
+ * they cannot provide.
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+
+static bool pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_sync_ops = {
+ .needs_synchronous_execution = pgaio_sync_needs_synchronous_execution,
+ .submit = pgaio_sync_submit,
+};
+
+static bool
+pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return true;
+}
+
+static int
+pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ elog(ERROR, "should be unreachable");
+
+ return 0;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 16144c2b72d..7a2e2b4432e 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -190,6 +190,9 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
+AIO_SUBMIT "Waiting for AIO submission."
+AIO_DRAIN "Waiting for IOs to finish."
+AIO_COMPLETION "Waiting for completion callback."
BASEBACKUP_READ "Waiting for base backup to read from a file."
BASEBACKUP_SYNC "Waiting for data written by a base backup to reach durable storage."
BASEBACKUP_WRITE "Waiting for base backup to write to a file."
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 505534ee8d3..5cf14472ebd 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -47,6 +47,8 @@
#include "common/hashfn.h"
#include "common/int.h"
+#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -155,6 +157,12 @@ struct ResourceOwnerData
/* The local locks cache. */
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+
+ /*
+ * AIO handles need to be registered in critical sections and therefore
+ * cannot use the normal ResourceElem mechanism.
+ */
+ dlist_head aio_handles;
};
@@ -425,6 +433,8 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
parent->firstchild = owner;
}
+ dlist_init(&owner->aio_handles);
+
return owner;
}
@@ -725,6 +735,14 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* so issue warnings. In the abort case, just clean up quietly.
*/
ResourceOwnerReleaseAll(owner, phase, isCommit);
+
+ /* XXX: Could probably be a later phase? */
+ while (!dlist_is_empty(&owner->aio_handles))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_handles);
+
+ pgaio_io_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1082,3 +1100,15 @@ ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
elog(ERROR, "lock reference %p is not owned by resource owner %s",
locallock, owner->name);
}
+
+void
+ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_handles, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2586d1cf53f..bc1acbb98ee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1263,6 +1263,7 @@ InvalMessageArray
InvalidationInfo
InvalidationMsgsGroup
IoMethod
+IoMethodOps
IpcMemoryId
IpcMemoryKey
IpcMemoryState
@@ -2100,6 +2101,23 @@ Permutation
PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
+PgAioCtl
+PgAioHandle
+PgAioHandleFlags
+PgAioHandleRef
+PgAioHandleSharedCallbackID
+PgAioHandleSharedCallbacks
+PgAioHandleSharedCallbacksEntry
+PgAioHandleState
+PgAioOp
+PgAioOpData
+PgAioPerBackend
+PgAioResultStatus
+PgAioResult
+PgAioReturn
+PgAioSubjectData
+PgAioSubjectID
+PgAioSubjectInfo
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.45.2.746.g06e570c0df.dirty
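As an aside for reviewers: the completion path in the patch above (pgaio_io_process_completion_subject() iterating the shared callbacks in reverse registration order, each one refining the "distilled" result handed to the next) follows a common chaining pattern. Below is a minimal standalone sketch of just that pattern; every name in it is invented for illustration and is not the patch's actual API.

```c
#include <assert.h>

/* Simplified stand-ins for the patch's result types (illustrative only). */
typedef enum { ARS_OK, ARS_ERROR } DemoStatus;

typedef struct DemoResult
{
	DemoStatus	status;
	int			result;			/* raw return value of the IO */
	int			error_data;		/* callback-specific detail, e.g. an errno */
} DemoResult;

typedef DemoResult (*demo_complete_cb) (DemoResult prior);

/* A callback that flags negative raw results as errors. */
static DemoResult
demo_check_errno(DemoResult prior)
{
	if (prior.result < 0)
	{
		prior.status = ARS_ERROR;
		prior.error_data = -prior.result;	/* pretend this is a saved errno */
	}
	return prior;
}

/* A callback that would update subject state, leaving the status alone. */
static DemoResult
demo_note_done(DemoResult prior)
{
	return prior;
}

/*
 * Invoke callbacks in reverse registration order, as the patch does: the
 * callback registered last runs first, and each one receives the result
 * "distilled" by the callbacks that ran before it.
 */
static DemoResult
demo_process_completion(demo_complete_cb *cbs, int ncbs, int raw_result)
{
	DemoResult	r = {ARS_OK, raw_result, 0};

	for (int i = ncbs; i > 0; i--)
		r = cbs[i - 1](r);
	return r;
}
```

The reverse iteration mirrors the patch: since preparation callbacks run first-to-last, completion runs last-to-first, so the callback closest to the raw IO gets the first look at the result.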
Attachment: v2-0005-aio-Skeleton-IO-worker-infrastructure.patch (text/x-diff; charset=us-ascii)
From e6c7783183c0b36f94b9debfd9edde71e4d75bbc Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 25 Nov 2024 14:03:40 -0500
Subject: [PATCH v2 05/20] aio: Skeleton IO worker infrastructure
This doesn't do anything useful on its own, but the code that needs to be
touched is independent of other changes.
Remarks:
- should completely get rid of ID assignment logic in postmaster.c
- postmaster.c badly needs a refactoring.
- dynamic increase / decrease of workers based on IO load
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/miscadmin.h | 2 +
src/include/postmaster/postmaster.h | 1 +
src/include/storage/aio_init.h | 2 +
src/include/storage/io_worker.h | 22 +++
src/include/storage/proc.h | 4 +-
src/backend/postmaster/launch_backend.c | 2 +
src/backend/postmaster/pmchild.c | 1 +
src/backend/postmaster/postmaster.c | 171 ++++++++++++++++--
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio_init.c | 7 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_worker.c | 86 +++++++++
src/backend/tcop/postgres.c | 2 +
src/backend/utils/activity/pgstat_backend.c | 1 +
src/backend/utils/activity/pgstat_io.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
19 files changed, 310 insertions(+), 12 deletions(-)
create mode 100644 src/include/storage/io_worker.h
create mode 100644 src/backend/storage/aio/method_worker.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e4c0d1481e9..0afc57ebf27 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -360,6 +360,7 @@ typedef enum BackendType
B_ARCHIVER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_IO_WORKER,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SUMMARIZER,
@@ -389,6 +390,7 @@ extern PGDLLIMPORT BackendType MyBackendType;
#define AmWalReceiverProcess() (MyBackendType == B_WAL_RECEIVER)
#define AmWalSummarizerProcess() (MyBackendType == B_WAL_SUMMARIZER)
#define AmWalWriterProcess() (MyBackendType == B_WAL_WRITER)
+#define AmIoWorkerProcess() (MyBackendType == B_IO_WORKER)
#define AmSpecialWorkerProcess() \
(AmAutoVacuumLauncherProcess() || \
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 24d49a5439e..4d003b7f86d 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -98,6 +98,7 @@ extern void InitProcessGlobals(void);
extern int MaxLivePostmasterChildren(void);
extern bool PostmasterMarkPIDForWorkerNotify(int);
+extern void assign_io_workers(int newval, void *extra);
#ifdef WIN32
extern void pgwin32_register_deadchild_callback(HANDLE procHandle, DWORD procId);
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
index 1c1d62baa79..70976791c93 100644
--- a/src/include/storage/aio_init.h
+++ b/src/include/storage/aio_init.h
@@ -21,4 +21,6 @@ extern void AioShmemInit(void);
extern void pgaio_init_backend(void);
+extern bool pgaio_workers_enabled(void);
+
#endif /* AIO_INIT_H */
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
new file mode 100644
index 00000000000..ba5dcb9e6e4
--- /dev/null
+++ b/src/include/storage/io_worker.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_worker.h
+ * IO worker for implementing AIO "ourselves"
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_worker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_WORKER_H
+#define IO_WORKER_H
+
+
+extern void IoWorkerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
+
+extern int io_workers;
+
+#endif /* IO_WORKER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 0b1fa61310f..cafd0b334b9 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -461,7 +461,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 6
+#define MAX_IO_WORKERS 32
+#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
+
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index 1f2d829ec5a..7399adfeae9 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -48,6 +48,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "storage/dsm.h"
+#include "storage/io_worker.h"
#include "storage/pg_shmem.h"
#include "tcop/backend_startup.h"
#include "utils/memutils.h"
@@ -197,6 +198,7 @@ static child_process_kind child_process_kinds[] = {
[B_ARCHIVER] = {"archiver", PgArchiverMain, true},
[B_BG_WRITER] = {"bgwriter", BackgroundWriterMain, true},
[B_CHECKPOINTER] = {"checkpointer", CheckpointerMain, true},
+ [B_IO_WORKER] = {"io_worker", IoWorkerMain, true},
[B_STARTUP] = {"startup", StartupProcessMain, true},
[B_WAL_RECEIVER] = {"wal_receiver", WalReceiverMain, true},
[B_WAL_SUMMARIZER] = {"wal_summarizer", WalSummarizerMain, true},
diff --git a/src/backend/postmaster/pmchild.c b/src/backend/postmaster/pmchild.c
index 381cf005a9b..89ee626829d 100644
--- a/src/backend/postmaster/pmchild.c
+++ b/src/backend/postmaster/pmchild.c
@@ -101,6 +101,7 @@ InitPostmasterChildSlots(void)
pmchild_pools[B_AUTOVAC_WORKER].size = autovacuum_max_workers;
pmchild_pools[B_BG_WORKER].size = max_worker_processes;
+ pmchild_pools[B_IO_WORKER].size = MAX_IO_WORKERS;
/*
* There can be only one of each of these running at a time. They each
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6f849ffbcb5..8dab7072114 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -108,9 +108,12 @@
#include "replication/logicallauncher.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
+#include "storage/proc.h"
#include "tcop/backend_startup.h"
#include "tcop/tcopprot.h"
#include "utils/datetime.h"
@@ -172,6 +175,7 @@ btmask_all_except(BackendType t)
return mask;
}
+#ifdef NOT_USED
static inline BackendTypeMask
btmask_all_except2(BackendType t1, BackendType t2)
{
@@ -181,6 +185,18 @@ btmask_all_except2(BackendType t1, BackendType t2)
mask = btmask_del(mask, t2);
return mask;
}
+#endif
+
+static inline BackendTypeMask
+btmask_all_except3(BackendType t1, BackendType t2, BackendType t3)
+{
+ BackendTypeMask mask = BTYPE_MASK_ALL;
+
+ mask = btmask_del(mask, t1);
+ mask = btmask_del(mask, t2);
+ mask = btmask_del(mask, t3);
+ return mask;
+}
static inline bool
btmask_contains(BackendTypeMask mask, BackendType t)
@@ -329,6 +345,7 @@ typedef enum
* ckpt */
PM_SHUTDOWN_2, /* waiting for archiver and walsenders to
* finish */
+ PM_SHUTDOWN_IO, /* waiting for io workers to exit */
PM_WAIT_DEAD_END, /* waiting for dead-end children to exit */
PM_NO_CHILDREN, /* all important children have exited */
} PMState;
@@ -390,6 +407,10 @@ bool LoadedSSL = false;
static DNSServiceRef bonjour_sdref = NULL;
#endif
+/* State for IO worker management. */
+static int io_worker_count = 0;
+static PMChild *io_worker_children[MAX_IO_WORKERS];
+
/*
* postmaster.c - function prototypes
*/
@@ -424,6 +445,8 @@ static void TerminateChildren(int signal);
static int CountChildren(BackendTypeMask targetMask);
static void LaunchMissingBackgroundProcesses(void);
static void maybe_start_bgworkers(void);
+static bool maybe_reap_io_worker(int pid);
+static void maybe_adjust_io_workers(void);
static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static PMChild *StartChildProcess(BackendType type);
static void StartSysLogger(void);
@@ -1351,6 +1374,11 @@ PostmasterMain(int argc, char *argv[])
*/
AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
/* Start bgwriter and checkpointer so they can help with recovery */
if (CheckpointerPMChild == NULL)
CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
@@ -1363,7 +1391,6 @@ PostmasterMain(int argc, char *argv[])
StartupPMChild = StartChildProcess(B_STARTUP);
Assert(StartupPMChild != NULL);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -2503,6 +2530,16 @@ process_pm_child_exit(void)
continue;
}
+ /* Was it an IO worker? */
+ if (maybe_reap_io_worker(pid))
+ {
+ if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ HandleChildCrash(pid, exitstatus, _("io worker"));
+
+ maybe_adjust_io_workers();
+ continue;
+ }
+
/*
* Was it a backend or a background worker?
*/
@@ -2867,10 +2904,10 @@ PostmasterStateMachine(void)
targetMask = btmask_add(targetMask, B_CHECKPOINTER);
/*
- * Walsenders and archiver will continue running; they will be
- * terminated later after writing the checkpoint record. We also let
- * dead-end children to keep running for now. The syslogger process
- * exits last.
+ * Walsenders, archiver and IO workers will continue running; they
+ * will be terminated later after writing the checkpoint record. We
+ * also let dead-end children keep running for now. The syslogger
+ * process exits last.
*
* This assertion checks that we have covered all backend types,
* either by including them in targetMask, or by noting here that they
@@ -2882,6 +2919,7 @@ PostmasterStateMachine(void)
remainMask = btmask_add(remainMask, B_WAL_SENDER);
remainMask = btmask_add(remainMask, B_ARCHIVER);
+ remainMask = btmask_add(remainMask, B_IO_WORKER);
remainMask = btmask_add(remainMask, B_DEAD_END_BACKEND);
remainMask = btmask_add(remainMask, B_LOGGER);
@@ -2963,7 +3001,7 @@ PostmasterStateMachine(void)
pmState = PM_WAIT_DEAD_END;
ConfigurePostmasterWaitSet(false);
- /* Kill the walsenders and archiver too */
+ /* Kill walsenders, archiver and IO workers too */
SignalChildren(SIGQUIT, btmask_all_except(B_LOGGER));
}
}
@@ -2974,11 +3012,23 @@ PostmasterStateMachine(void)
{
/*
* PM_SHUTDOWN_2 state ends when there's no other children than
- * dead-end children left. There shouldn't be any regular backends
- * left by now anyway; what we're really waiting for is walsenders and
- * archiver.
+ * dead-end children and io workers left. There shouldn't be any
+ * regular backends left by now anyway; what we're really waiting for
+ * is walsenders and archiver.
*/
- if (CountChildren(btmask_all_except2(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+ if (CountChildren(btmask_all_except3(B_LOGGER, B_DEAD_END_BACKEND, B_IO_WORKER)) == 0)
+ {
+ pmState = PM_SHUTDOWN_IO;
+ SignalChildren(SIGUSR2, btmask(B_IO_WORKER));
+ }
+ }
+
+ if (pmState == PM_SHUTDOWN_IO)
+ {
+ /*
+ * PM_SHUTDOWN_IO state ends when there are only dead-end children left.
+ */
+ if (io_worker_count == 0)
{
pmState = PM_WAIT_DEAD_END;
ConfigurePostmasterWaitSet(false);
@@ -3094,10 +3144,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
StartupPMChild = StartChildProcess(B_STARTUP);
Assert(StartupPMChild != NULL);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* crash recovery started, reset SIGKILL flag */
AbortStartTime = 0;
@@ -3918,6 +3972,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
{
case PM_NO_CHILDREN:
case PM_WAIT_DEAD_END:
+ case PM_SHUTDOWN_IO:
case PM_SHUTDOWN_2:
case PM_SHUTDOWN:
case PM_WAIT_BACKENDS:
@@ -4070,6 +4125,100 @@ maybe_start_bgworkers(void)
}
}
+static bool
+maybe_reap_io_worker(int pid)
+{
+ for (int id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_children[id] &&
+ io_worker_children[id]->pid == pid)
+ {
+ ReleasePostmasterChildSlot(io_worker_children[id]);
+
+ --io_worker_count;
+ io_worker_children[id] = NULL;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void
+maybe_adjust_io_workers(void)
+{
+ if (!pgaio_workers_enabled())
+ return;
+
+ /*
+ * If we're in the final shutdown state, we're just waiting for all
+ * remaining processes to exit.
+ */
+ if (pmState >= PM_SHUTDOWN_IO)
+ return;
+
+ /* Don't start new workers during an immediate shutdown either. */
+ if (Shutdown >= ImmediateShutdown)
+ return;
+
+ /*
+ * Don't start new workers if we're in the shutdown phase of a crash
+ * restart. But we *do* need to start if we're already starting up again.
+ */
+ if (FatalError && pmState >= PM_STOP_BACKENDS)
+ return;
+
+ Assert(pmState < PM_SHUTDOWN_IO);
+
+ /* Not enough running? */
+ while (io_worker_count < io_workers)
+ {
+ PMChild *child;
+ int id;
+
+ /* find unused entry in io_worker_children array */
+ for (id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_children[id] == NULL)
+ break;
+ }
+ if (id == MAX_IO_WORKERS)
+ elog(ERROR, "could not find a free IO worker ID");
+
+ /* Try to launch one. */
+ child = StartChildProcess(B_IO_WORKER);
+ if (child != NULL)
+ {
+ io_worker_children[id] = child;
+ ++io_worker_count;
+ }
+ else
+ break; /* XXX try again soon? */
+ }
+
+ /* Too many running? */
+ if (io_worker_count > io_workers)
+ {
+ /* ask the IO worker in the highest slot to exit */
+ for (int id = MAX_IO_WORKERS - 1; id >= 0; --id)
+ {
+ if (io_worker_children[id] != NULL)
+ {
+ kill(io_worker_children[id]->pid, SIGUSR2);
+ break;
+ }
+ }
+ }
+}
+
+void
+assign_io_workers(int newval, void *extra)
+{
+ io_workers = newval;
+ if (!IsUnderPostmaster && pmState > PM_INIT)
+ maybe_adjust_io_workers();
+}
+
+
/*
* When a backend asks to be notified about worker state changes, we
* set a flag in its backend entry. The background worker machinery needs
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index b253278f3c1..fa2a7e9e5df 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -14,6 +14,7 @@ OBJS = \
aio_io.o \
aio_subject.o \
method_sync.o \
+ method_worker.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index b9bdf51680a..0c2d77ec8ab 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -217,3 +217,10 @@ pgaio_init_backend(void)
if (pgaio_impl->init_backend)
pgaio_impl->init_backend();
}
+
+bool
+pgaio_workers_enabled(void)
+{
+ /* placeholder for future commit */
+ return false;
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8339d473aae..62738ce1d14 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -6,5 +6,6 @@ backend_sources += files(
'aio_io.c',
'aio_subject.c',
'method_sync.c',
+ 'method_worker.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
new file mode 100644
index 00000000000..0ea749a8ba8
--- /dev/null
+++ b/src/backend/storage/aio/method_worker.c
@@ -0,0 +1,86 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_worker.c
+ * AIO implementation using workers
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_worker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/auxprocess.h"
+#include "postmaster/interrupt.h"
+#include "storage/io_worker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "tcop/tcopprot.h"
+#include "utils/wait_event.h"
+
+
+int io_workers = 3;
+
+
+void
+IoWorkerMain(char *startup_data, size_t startup_data_len)
+{
+ sigjmp_buf local_sigjmp_buf;
+
+ MyBackendType = B_IO_WORKER;
+ AuxiliaryProcessMainCommon();
+
+ /* TODO review all signals */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, die); /* to allow manually triggering worker restart */
+
+ /*
+ * Ignore SIGTERM, will get explicit shutdown via SIGUSR2 later in the
+ * shutdown sequence, similar to checkpointer.
+ */
+ pqsignal(SIGTERM, SIG_IGN);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* see PostgresMain() */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ error_context_stack = NULL;
+ HOLD_INTERRUPTS();
+
+ /*
+ * We normally shouldn't get errors here. Need to do just enough error
+ * recovery so that we can mark the IO as failed and then exit.
+ */
+ LWLockReleaseAll();
+
+ /* TODO: recover from IO errors */
+
+ EmitErrorReport();
+ proc_exit(1);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ while (!ShutdownRequestPending)
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ proc_exit(0);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 85902788181..fcd3e1eb482 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3313,6 +3313,8 @@ ProcessInterrupts(void)
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating background worker \"%s\" due to administrator command",
MyBgworkerEntry->bgw_type)));
+ else if (AmIoWorkerProcess())
+ proc_exit(0);
else
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index 6b2c9baa8c0..c48befef6a7 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -166,6 +166,7 @@ pgstat_tracks_backend_bktype(BackendType bktype)
case B_WAL_SUMMARIZER:
case B_BG_WRITER:
case B_CHECKPOINTER:
+ case B_IO_WORKER:
case B_STARTUP:
return false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 011a3326dad..7869197dd1f 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -365,6 +365,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_INVALID:
case B_DEAD_END_BACKEND:
case B_ARCHIVER:
+ case B_IO_WORKER:
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7a2e2b4432e..330a32a90ce 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN "Waiting in main loop of autovacuum launcher process."
BGWRITER_HIBERNATE "Waiting in background writer process, hibernating."
BGWRITER_MAIN "Waiting in main loop of background writer process."
CHECKPOINTER_MAIN "Waiting in main loop of checkpointer process."
+IO_WORKER_MAIN "Waiting in main loop of IO worker process."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 6349abb8fb6..56133cfdd08 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = gettext_noop("checkpointer");
break;
+ case B_IO_WORKER:
+ backendDesc = gettext_noop("io worker");
+ break;
case B_LOGGER:
backendDesc = gettext_noop("logger");
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 6d4056c68b9..b2999b86c24 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -74,6 +74,7 @@
#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
+#include "storage/io_worker.h"
#include "storage/large_object.h"
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
@@ -3232,6 +3233,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_workers",
+ PGC_SIGHUP,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO worker processes, for io_method=worker."),
+ NULL,
+ },
+ &io_workers,
+ 3, 1, MAX_IO_WORKERS,
+ NULL, assign_io_workers, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c4c60da9845..0f80a0680ec 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -843,6 +843,7 @@
#------------------------------------------------------------------------------
#io_method = sync # (change requires restart)
+#io_workers = 3 # 1-32;
#io_max_concurrency = 32 # Max number of IOs that may be in
# flight at the same time in one backend
--
2.45.2.746.g06e570c0df.dirty
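The pool-adjustment logic in maybe_adjust_io_workers() above reduces to a simple reconciliation invariant: launch workers into free slots while below the target, and ask the worker in the highest occupied slot to exit while above it (at most one retirement per call). Here is a standalone sketch of that invariant with stand-in types; the real code launches child processes and signals SIGUSR2, and all names below are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_WORKERS 32

typedef struct DemoPool
{
	bool		slots[DEMO_MAX_WORKERS];	/* true if a worker occupies it */
	int			count;
} DemoPool;

/* Stand-in for StartChildProcess(): claim the lowest free slot. */
static int
demo_launch(DemoPool *pool)
{
	for (int id = 0; id < DEMO_MAX_WORKERS; id++)
	{
		if (!pool->slots[id])
		{
			pool->slots[id] = true;
			pool->count++;
			return id;
		}
	}
	return -1;					/* no free slot */
}

/* Stand-in for sending SIGUSR2: retire the highest occupied slot. */
static void
demo_retire_one(DemoPool *pool)
{
	for (int id = DEMO_MAX_WORKERS - 1; id >= 0; id--)
	{
		if (pool->slots[id])
		{
			pool->slots[id] = false;
			pool->count--;
			return;
		}
	}
}

/* Reconcile the pool with the target, like maybe_adjust_io_workers(). */
static void
demo_adjust(DemoPool *pool, int target)
{
	while (pool->count < target)
	{
		if (demo_launch(pool) < 0)
			break;				/* pool exhausted; the real code errors out */
	}
	/* The real code asks only one worker to exit per call; mirror that. */
	if (pool->count > target)
		demo_retire_one(pool);
}
```

Retiring from the highest slot keeps the occupied slots dense at the low end, which is why the launch path can simply scan upward for the first free entry.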
Attachment: v2-0006-aio-Add-worker-method.patch (text/x-diff; charset=us-ascii)
From 9c9bbb42fb561fb2cf7d6d5183db5359d37e004e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 8 Nov 2024 12:38:41 -0500
Subject: [PATCH v2 06/20] aio: Add worker method
---
src/include/storage/aio.h | 5 +-
src/include/storage/aio_internal.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/backend/storage/aio/aio.c | 2 +
src/backend/storage/aio/aio_init.c | 12 +-
src/backend/storage/aio/method_worker.c | 406 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/tools/pgindent/typedefs.list | 3 +
9 files changed, 423 insertions(+), 10 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index b386dabc921..2e84abfea2d 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -322,11 +322,12 @@ extern void assign_io_method(int newval, void *extra);
typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
+ IOMETHOD_WORKER,
} IoMethod;
-/* We'll default to synchronous execution. */
-#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+/* We'll default to the worker method. */
+#define DEFAULT_IO_METHOD IOMETHOD_WORKER
/* GUCs */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index d600d45b4fd..f974c4accf5 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -234,6 +234,7 @@ extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
+extern const IoMethodOps pgaio_worker_ops;
extern const IoMethodOps *pgaio_impl;
extern PgAioCtl *aio_ctl;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54fb..8d00d62e208 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, AioWorkerSubmissionQueue)
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 3e2ff9718ca..e4c9d439ddd 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -57,6 +57,7 @@ static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generatio
/* Options for io_method. */
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
+ {"worker", IOMETHOD_WORKER, false},
{NULL, 0, false}
};
@@ -73,6 +74,7 @@ PgAioPerBackend *my_aio;
static const IoMethodOps *pgaio_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
+ [IOMETHOD_WORKER] = &pgaio_worker_ops,
};
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 0c2d77ec8ab..23adc5308e5 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -19,6 +19,7 @@
#include "storage/aio_init.h"
#include "storage/aio_internal.h"
#include "storage/bufmgr.h"
+#include "storage/io_worker.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -37,6 +38,11 @@ AioCtlShmemSize(void)
static uint32
AioProcs(void)
{
+ /*
+ * While AIO workers don't need their own AIO context, we can't currently
+ * guarantee nothing gets assigned to a ProcNumber for an IO worker if
+ * we just subtracted MAX_IO_WORKERS.
+ */
return MaxBackends + NUM_AUXILIARY_PROCS;
}
@@ -209,6 +215,9 @@ pgaio_init_backend(void)
/* shouldn't be initialized twice */
Assert(!my_aio);
+ if (MyBackendType == B_IO_WORKER)
+ return;
+
if (MyProc == NULL || MyProcNumber >= AioProcs())
elog(ERROR, "aio requires a normal PGPROC");
@@ -221,6 +230,5 @@ pgaio_init_backend(void)
bool
pgaio_workers_enabled(void)
{
- /* placeholder for future commit */
- return false;
+ return io_method == IOMETHOD_WORKER;
}
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 0ea749a8ba8..a508f53ebd4 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -1,7 +1,22 @@
/*-------------------------------------------------------------------------
*
* method_worker.c
- * AIO implementation using workers
+ * AIO - perform AIO using worker processes
+ *
+ * Worker processes consume IOs from a shared memory submission queue, run
+ * traditional synchronous system calls, and perform the shared completion
+ * handling immediately. Client code submits most requests by pushing IOs
+ * into the submission queue, and waits (if necessary) using condition
+ * variables. Some IOs cannot be performed in another process due to lack of
+ * infrastructure for reopening the file, and must be processed synchronously by
+ * the client code when submitted.
+ *
+ * So that the submitter can make just one system call when submitting a batch
+ * of IOs, wakeups "fan out"; each woken backend can wake two more. XXX This
+ * could be improved by using futexes instead of latches to wake N waiters.
+ *
+ * This method of AIO is available in all builds on all operating systems, and
+ * is the default.
*
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -16,23 +31,323 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
+#include "port/pg_bitutils.h"
#include "postmaster/auxprocess.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "tcop/tcopprot.h"
+#include "utils/ps_status.h"
#include "utils/wait_event.h"
+/* How many workers should each worker wake up if needed? */
+#define IO_WORKER_WAKEUP_FANOUT 2
+
+
+typedef struct AioWorkerSubmissionQueue
+{
+ uint32 size;
+ uint32 mask;
+ uint32 head;
+ uint32 tail;
+ uint32 ios[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerSubmissionQueue;
+
+typedef struct AioWorkerSlot
+{
+ Latch *latch;
+ bool in_use;
+} AioWorkerSlot;
+
+typedef struct AioWorkerControl
+{
+ uint64 idle_worker_mask;
+ AioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerControl;
+
+
+static size_t pgaio_worker_shmem_size(void);
+static void pgaio_worker_shmem_init(bool first_time);
+
+static bool pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_worker_ops = {
+ .shmem_size = pgaio_worker_shmem_size,
+ .shmem_init = pgaio_worker_shmem_init,
+
+ .needs_synchronous_execution = pgaio_worker_needs_synchronous_execution,
+ .submit = pgaio_worker_submit,
+};
+
+
int io_workers = 3;
+static int io_worker_queue_size = 64;
+static int MyIoWorkerId;
+
+
+static AioWorkerSubmissionQueue *io_worker_submission_queue;
+static AioWorkerControl *io_worker_control;
+
+
+static size_t
+pgaio_worker_shmem_size(void)
+{
+ return
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * io_worker_queue_size +
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers;
+}
+
+static void
+pgaio_worker_shmem_init(bool first_time)
+{
+ bool found;
+ int size;
+
+ /* Round size up to next power of two so we can make a mask. */
+ size = pg_nextpower2_32(io_worker_queue_size);
+
+ io_worker_submission_queue =
+ ShmemInitStruct("AioWorkerSubmissionQueue",
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * size,
+ &found);
+ if (!found)
+ {
+ io_worker_submission_queue->size = size;
+ io_worker_submission_queue->head = 0;
+ io_worker_submission_queue->tail = 0;
+ }
+
+ io_worker_control =
+ ShmemInitStruct("AioWorkerControl",
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers,
+ &found);
+ if (!found)
+ {
+ io_worker_control->idle_worker_mask = 0;
+ for (int i = 0; i < io_workers; ++i)
+ {
+ io_worker_control->workers[i].latch = NULL;
+ io_worker_control->workers[i].in_use = false;
+ }
+ }
+}
+
+
+static int
+pgaio_choose_idle_worker(void)
+{
+ int worker;
+
+ if (io_worker_control->idle_worker_mask == 0)
+ return -1;
+
+ /* Find the lowest bit position, and clear it. */
+ worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
+
+ return worker;
+}
+
+static bool
+pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 new_head;
+
+ queue = io_worker_submission_queue;
+ new_head = (queue->head + 1) & (queue->size - 1);
+ if (new_head == queue->tail)
+ {
+ elog(DEBUG1, "io worker submission queue is full");
+ return false; /* full */
+ }
+
+ queue->ios[queue->head] = pgaio_io_get_id(ioh);
+ queue->head = new_head;
+
+ return true;
+}
+
+static uint32
+pgaio_worker_submission_queue_consume(void)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 result;
+
+ queue = io_worker_submission_queue;
+ if (queue->tail == queue->head)
+ return UINT32_MAX; /* empty */
+
+ result = queue->ios[queue->tail];
+ queue->tail = (queue->tail + 1) & (queue->size - 1);
+
+ return result;
+}
+
+static uint32
+pgaio_worker_submission_queue_depth(void)
+{
+ uint32 head;
+ uint32 tail;
+
+ head = io_worker_submission_queue->head;
+ tail = io_worker_submission_queue->tail;
+
+ if (tail > head)
+ head += io_worker_submission_queue->size;
+
+ Assert(head >= tail);
+
+ return head - tail;
+}
+
+static void
+pgaio_worker_submit_internal(int nios, PgAioHandle *ios[])
+{
+ PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ int nsync = 0;
+ Latch *wakeup = NULL;
+ int worker;
+
+ Assert(nios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ for (int i = 0; i < nios; ++i)
+ {
+ Assert(!pgaio_worker_needs_synchronous_execution(ios[i]));
+ if (!pgaio_worker_submission_queue_insert(ios[i]))
+ {
+ /*
+ * We'll do it synchronously, but only after we've sent as many as
+ * we can to workers, to maximize concurrency.
+ */
+ synchronous_ios[nsync++] = ios[i];
+ continue;
+ }
+
+ if (wakeup == NULL)
+ {
+ /* Choose an idle worker to wake up if we haven't already. */
+ worker = pgaio_choose_idle_worker();
+ if (worker >= 0)
+ wakeup = io_worker_control->workers[worker].latch;
+
+ ereport(DEBUG3,
+ errmsg("submission for io:%d choosing worker %d, latch %p",
+ pgaio_io_get_id(ios[i]), worker, wakeup),
+ errhidestmt(true), errhidecontext(true));
+ }
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ if (wakeup)
+ SetLatch(wakeup);
+
+ /* Run whatever is left synchronously. */
+ if (nsync > 0)
+ {
+ for (int i = 0; i < nsync; ++i)
+ {
+ pgaio_io_perform_synchronously(synchronous_ios[i]);
+ }
+ }
+}
+
+static bool
+pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return
+ !IsUnderPostmaster
+ || ioh->flags & AHF_REFERENCES_LOCAL
+ || !pgaio_io_can_reopen(ioh);
+}
+
+static int
+pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+
+ pgaio_io_prepare_submit(ioh);
+ }
+
+ pgaio_worker_submit_internal(num_staged_ios, staged_ios);
+
+ return num_staged_ios;
+}
+
+/*
+ * shmem_exit() callback that releases the worker's slot in io_worker_control.
+ */
+static void
+pgaio_worker_die(int code, Datum arg)
+{
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+}
+
+/*
+ * Register the worker in shared memory, assign MyIoWorkerId and register a
+ * shutdown callback to release the registration.
+ */
+static void
+pgaio_worker_register(void)
+{
+ MyIoWorkerId = -1;
+
+ /*
+ * XXX: This could do with more fine-grained locking. But it's also not
+ * very common for the number of workers to change at the moment...
+ */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+
+ for (int i = 0; i < io_workers; ++i)
+ {
+ if (!io_worker_control->workers[i].in_use)
+ {
+ Assert(io_worker_control->workers[i].latch == NULL);
+ io_worker_control->workers[i].in_use = true;
+ MyIoWorkerId = i;
+ break;
+ }
+ else
+ Assert(io_worker_control->workers[i].latch != NULL);
+ }
+
+ if (MyIoWorkerId == -1)
+ elog(ERROR, "couldn't find a free worker slot");
+
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ on_shmem_exit(pgaio_worker_die, 0);
+}
void
IoWorkerMain(char *startup_data, size_t startup_data_len)
{
sigjmp_buf local_sigjmp_buf;
+ volatile PgAioHandle *ioh = NULL;
+ char cmd[128];
MyBackendType = B_IO_WORKER;
AuxiliaryProcessMainCommon();
@@ -53,6 +368,11 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ pgaio_worker_register();
+
+ sprintf(cmd, "io worker: %d", MyIoWorkerId);
+ set_ps_display(cmd);
+
/* see PostgresMain() */
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
@@ -66,8 +386,26 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
LWLockReleaseAll();
/* TODO: recover from IO errors */
+ if (ioh != NULL)
+ {
+#if 0
+ /* EINTR is treated as a retryable error */
+ pgaio_process_io_completion(unvolatize(PgAioInProgress *, io),
+ EINTR);
+#endif
+ }
EmitErrorReport();
+
+ /* FIXME: should probably be a before-shmem-exit instead */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
proc_exit(1);
}
@@ -76,10 +414,68 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
while (!ShutdownRequestPending)
{
- WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
- WAIT_EVENT_IO_WORKER_MAIN);
- ResetLatch(MyLatch);
- CHECK_FOR_INTERRUPTS();
+ uint32 io_index;
+ Latch *latches[IO_WORKER_WAKEUP_FANOUT];
+ int nlatches = 0;
+ int nwakeups = 0;
+ int worker;
+
+ /* Try to get a job to do. */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
+ {
+ /* Nothing to do. Mark self idle. */
+ /*
+ * XXX: Invent some kind of back pressure to reduce useless
+ * wakeups?
+ */
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ }
+ else
+ {
+ /* Got one. Clear idle flag. */
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+
+ /* See if we can wake up some peers. */
+ nwakeups = Min(pgaio_worker_submission_queue_depth(),
+ IO_WORKER_WAKEUP_FANOUT);
+ for (int i = 0; i < nwakeups; ++i)
+ {
+ if ((worker = pgaio_choose_idle_worker()) < 0)
+ break;
+ latches[nlatches++] = io_worker_control->workers[worker].latch;
+ }
+#if 0
+ if (nwakeups > 0)
+ elog(LOG, "wake %d", nwakeups);
+#endif
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ for (int i = 0; i < nlatches; ++i)
+ SetLatch(latches[i]);
+
+ if (io_index != UINT32_MAX)
+ {
+ ioh = &aio_ctl->io_handles[io_index];
+
+ ereport(DEBUG3,
+ errmsg("worker processing io:%d",
+ pgaio_io_get_id(unvolatize(PgAioHandle *, ioh))),
+ errhidestmt(true), errhidecontext(true));
+
+ pgaio_io_reopen(unvolatize(PgAioHandle *, ioh));
+ pgaio_io_perform_synchronously(unvolatize(PgAioHandle *, ioh));
+
+ ioh = NULL;
+ }
+ else
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
}
proc_exit(0);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 330a32a90ce..8c3aafd8a18 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -349,6 +349,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0f80a0680ec..5893eb29228 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -842,7 +842,7 @@
# WIP AIO GUC docs
#------------------------------------------------------------------------------
-#io_method = sync # (change requires restart)
+#io_method = worker # (change requires restart)
#io_workers = 3 # 1-32;
#io_max_concurrency = 32 # Max number of IOs that may be in
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bc1acbb98ee..9b9c8f0d1fc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -54,6 +54,9 @@ AggStrategy
AggTransInfo
Aggref
AggregateInstrumentation
+AioWorkerControl
+AioWorkerSlot
+AioWorkerSubmissionQueue
AlenState
Alias
AllocBlock
--
2.45.2.746.g06e570c0df.dirty
v2-0007-aio-Add-liburing-dependency.patch
From 309863778a6051b0e18d949551961608dbf9d399 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 5 Jun 2024 19:37:25 -0700
Subject: [PATCH v2 07/20] aio: Add liburing dependency
Not yet used.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
meson.build | 14 ++++
meson_options.txt | 3 +
configure.ac | 11 +++
src/makefiles/meson.build | 3 +
src/include/pg_config.h.in | 3 +
configure | 138 +++++++++++++++++++++++++++++++++++++
src/Makefile.global.in | 4 ++
7 files changed, 176 insertions(+)
diff --git a/meson.build b/meson.build
index e5ce437a5c7..76c276437d7 100644
--- a/meson.build
+++ b/meson.build
@@ -854,6 +854,18 @@ endif
+###############################################################
+# Library: liburing
+###############################################################
+
+liburingopt = get_option('liburing')
+liburing = dependency('liburing', required: liburingopt)
+if liburing.found()
+ cdata.set('USE_LIBURING', 1)
+endif
+
+
+
###############################################################
# Library: libxml
###############################################################
@@ -3054,6 +3066,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ liburing,
libxml,
lz4,
pam,
@@ -3698,6 +3711,7 @@ if meson.version().version_compare('>=0.57')
'gss': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 38935196394..6e8d376b3b2 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -103,6 +103,9 @@ option('ldap', type: 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('liburing', type : 'feature', value: 'auto',
+ description: 'Use liburing for async io')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/configure.ac b/configure.ac
index 247ae97fa4c..dda296ee029 100644
--- a/configure.ac
+++ b/configure.ac
@@ -975,6 +975,14 @@ AC_SUBST(with_readline)
PGAC_ARG_BOOL(with, libedit-preferred, no,
[prefer BSD Libedit over GNU Readline])
+#
+# liburing
+#
+AC_MSG_CHECKING([whether to build with liburing support])
+PGAC_ARG_BOOL(with, liburing, no, [use liburing for async io],
+ [AC_DEFINE([USE_LIBURING], 1, [Define to build with io-uring support. (--with-liburing)])])
+AC_MSG_RESULT([$with_liburing])
+AC_SUBST(with_liburing)
#
# UUID library
@@ -1427,6 +1435,9 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+if test "$with_liburing" = yes; then
+ PKG_CHECK_MODULES(LIBURING, liburing)
+fi
##
## Header files
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index aba7411a1be..00613aebc79 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -229,6 +231,7 @@ pgxs_deps = {
'gssapi': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798abd..6ab71a3dffe 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -663,6 +663,9 @@
/* Define to 1 to build with LDAP support. (--with-ldap) */
#undef USE_LDAP
+/* Define to build with io-uring support. (--with-liburing) */
+#undef USE_LIBURING
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/configure b/configure
index 518c33b73a9..1c3fada9fe0 100755
--- a/configure
+++ b/configure
@@ -651,6 +651,8 @@ LIBOBJS
OPENSSL
ZSTD
LZ4
+LIBURING_LIBS
+LIBURING_CFLAGS
UUID_LIBS
LDAP_LIBS_BE
LDAP_LIBS_FE
@@ -709,6 +711,7 @@ XML2_CFLAGS
XML2_CONFIG
with_libxml
with_uuid
+with_liburing
with_readline
with_systemd
with_selinux
@@ -862,6 +865,7 @@ with_selinux
with_systemd
with_readline
with_libedit_preferred
+with_liburing
with_uuid
with_ossp_uuid
with_libxml
@@ -905,6 +909,8 @@ LDFLAGS_EX
LDFLAGS_SL
PERL
PYTHON
+LIBURING_CFLAGS
+LIBURING_LIBS
MSGFMT
TCLSH'
@@ -1572,6 +1578,7 @@ Optional Packages:
--without-readline do not use GNU Readline nor BSD Libedit for editing
--with-libedit-preferred
prefer BSD Libedit over GNU Readline
+ --with-liburing use liburing for async io
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libxml build with XML support
@@ -1618,6 +1625,10 @@ Some influential environment variables:
LDFLAGS_SL extra linker flags for linking shared libraries only
PERL Perl program
PYTHON Python program
+ LIBURING_CFLAGS
+ C compiler flags for LIBURING, overriding pkg-config
+ LIBURING_LIBS
+ linker flags for LIBURING, overriding pkg-config
MSGFMT msgfmt program for NLS
TCLSH Tcl interpreter program (tclsh)
@@ -8681,6 +8692,40 @@ fi
+#
+# liburing
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with liburing support" >&5
+$as_echo_n "checking whether to build with liburing support... " >&6; }
+
+
+
+# Check whether --with-liburing was given.
+if test "${with_liburing+set}" = set; then :
+ withval=$with_liburing;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBURING 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-liburing option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_liburing=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_liburing" >&5
+$as_echo "$with_liburing" >&6; }
+
#
# UUID library
@@ -13222,6 +13267,99 @@ fi
fi
+if test "$with_liburing" = yes; then
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for liburing" >&5
+$as_echo_n "checking for liburing... " >&6; }
+
+if test -n "$LIBURING_CFLAGS"; then
+ pkg_cv_LIBURING_CFLAGS="$LIBURING_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_CFLAGS=`$PKG_CONFIG --cflags "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBURING_LIBS"; then
+ pkg_cv_LIBURING_LIBS="$LIBURING_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_LIBS=`$PKG_CONFIG --libs "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "liburing" 2>&1`
+ else
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "liburing" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBURING_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (liburing) were not met:
+
+$LIBURING_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBURING_CFLAGS=$pkg_cv_LIBURING_CFLAGS
+ LIBURING_LIBS=$pkg_cv_LIBURING_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
##
## Header files
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index eac3d001211..60393ed8fa4 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -190,6 +190,7 @@ with_systemd = @with_systemd@
with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
+with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
@@ -216,6 +217,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBURING_CFLAGS = @LIBURING_CFLAGS@
+LIBURING_LIBS = @LIBURING_LIBS@
+
TCLSH = @TCLSH@
TCL_LIBS = @TCL_LIBS@
TCL_LIB_SPEC = @TCL_LIB_SPEC@
--
2.45.2.746.g06e570c0df.dirty
v2-0008-aio-Add-io_uring-method.patch
From de57cec96e81a1867a9f1db4c44243cdc0072b20 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 16:15:17 -0400
Subject: [PATCH v2 08/20] aio: Add io_uring method
---
src/include/storage/aio.h | 1 +
src/include/storage/aio_internal.h | 3 +
src/include/storage/lwlock.h | 1 +
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio.c | 6 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_io_uring.c | 386 ++++++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 1 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 401 insertions(+)
create mode 100644 src/backend/storage/aio/method_io_uring.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 2e84abfea2d..a1633a0ed3d 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -323,6 +323,7 @@ typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
IOMETHOD_WORKER,
+ IOMETHOD_IO_URING,
} IoMethod;
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index f974c4accf5..d2dc1516bdf 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -235,6 +235,9 @@ extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
extern const IoMethodOps pgaio_worker_ops;
+#ifdef USE_LIBURING
+extern const IoMethodOps pgaio_uring_ops;
+#endif
extern const IoMethodOps *pgaio_impl;
extern PgAioCtl *aio_ctl;
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index eabf813ce05..72f928b7602 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -217,6 +217,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SUBTRANS_SLRU,
LWTRANCHE_XACT_SLRU,
LWTRANCHE_PARALLEL_VACUUM_DSA,
+ LWTRANCHE_AIO_URING_COMPLETION,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index fa2a7e9e5df..3bcb8a0b2ed 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -13,6 +13,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_subject.o \
+ method_io_uring.o \
method_sync.o \
method_worker.o \
read_stream.o
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index e4c9d439ddd..701f06287d9 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -58,6 +58,9 @@ static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generatio
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
{"worker", IOMETHOD_WORKER, false},
+#ifdef USE_LIBURING
+ {"io_uring", IOMETHOD_IO_URING, false},
+#endif
{NULL, 0, false}
};
@@ -75,6 +78,9 @@ PgAioPerBackend *my_aio;
static const IoMethodOps *pgaio_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
[IOMETHOD_WORKER] = &pgaio_worker_ops,
+#ifdef USE_LIBURING
+ [IOMETHOD_IO_URING] = &pgaio_uring_ops,
+#endif
};
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 62738ce1d14..537f23d446d 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_subject.c',
+ 'method_io_uring.c',
'method_sync.c',
'method_worker.c',
'read_stream.c',
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
new file mode 100644
index 00000000000..3f214e42767
--- /dev/null
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -0,0 +1,386 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_io_uring.c
+ * AIO - perform AIO using Linux' io_uring
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_io_uring.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#ifdef USE_LIBURING
+
+#include <liburing.h>
+
+#include "pgstat.h"
+#include "port/pg_iovec.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+
+/* Entry points for IoMethodOps. */
+static size_t pgaio_uring_shmem_size(void);
+static void pgaio_uring_shmem_init(bool first_time);
+static void pgaio_uring_init_backend(void);
+
+static int pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+static void pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation);
+
+static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe);
+
+
+const IoMethodOps pgaio_uring_ops = {
+ .shmem_size = pgaio_uring_shmem_size,
+ .shmem_init = pgaio_uring_shmem_init,
+ .init_backend = pgaio_uring_init_backend,
+
+ .submit = pgaio_uring_submit,
+ .wait_one = pgaio_uring_wait_one,
+};
+
+typedef struct PgAioUringContext
+{
+ LWLock completion_lock;
+
+ struct io_uring io_uring_ring;
+ /* XXX: probably worth padding to a cacheline boundary here */
+} PgAioUringContext;
+
+
+static PgAioUringContext *aio_uring_contexts;
+static PgAioUringContext *my_shared_uring_context;
+
+/* io_uring local state */
+static struct io_uring local_ring;
+
+
+
+static Size
+AioContextShmemSize(void)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ return mul_size(TotalProcs, sizeof(PgAioUringContext));
+}
+
+static size_t
+pgaio_uring_shmem_size(void)
+{
+ return AioContextShmemSize();
+}
+
+static void
+pgaio_uring_shmem_init(bool first_time)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+ bool found;
+
+ aio_uring_contexts = (PgAioUringContext *)
+ ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+ if (found)
+ return;
+
+ for (int contextno = 0; contextno < TotalProcs; contextno++)
+ {
+ PgAioUringContext *context = &aio_uring_contexts[contextno];
+ int ret;
+
+ /*
+ * XXX: Probably worth sharing the WQ between the different rings,
+ * when supported by the kernel. Could also cause additional
+ * contention, I guess?
+ */
+#if 0
+ if (!AcquireExternalFD())
+ elog(ERROR, "No external FD available");
+#endif
+ ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+
+ LWLockInitialize(&context->completion_lock, LWTRANCHE_AIO_URING_COMPLETION);
+ }
+}
+
+static void
+pgaio_uring_init_backend(void)
+{
+ int ret;
+
+ my_shared_uring_context = &aio_uring_contexts[MyProcNumber];
+
+ ret = io_uring_queue_init(32, &local_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+}
+
+static int
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
+ int in_flight_before = dclist_count(&my_aio->in_flight_ios);
+
+ Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(uring_instance);
+
+ if (!sqe)
+ elog(ERROR, "io_uring submission queue is unexpectedly full");
+
+ pgaio_io_prepare_submit(ioh);
+ pgaio_uring_sq_from_io(ioh, sqe);
+
+ /*
+ * io_uring executes IO in process context if possible. That's
+ * generally good, as it reduces context switching. When performing a
+ * lot of buffered IO that means that copying between page cache and
+ * userspace memory happens in the foreground, as it can't be
+ * offloaded to DMA hardware as is possible when using direct IO. When
+ * executing a lot of buffered IO this causes io_uring to be slower
+ * than worker mode, as worker mode parallelizes the copying.
+ * io_uring can be told to offload work to worker threads instead.
+ *
+ * If an IO is buffered IO and we already have IOs in flight or
+ * multiple IOs are being submitted, we thus tell io_uring to execute
+ * the IO in the background. We don't do so for the first few IOs
+ * being submitted as executing in this process' context has lower
+ * latency.
+ */
+ if (in_flight_before > 4 && (ioh->flags & AHF_BUFFERED))
+ io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
+
+ in_flight_before++;
+ }
+
+ while (true)
+ {
+ int ret;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+ ret = io_uring_submit(uring_instance);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+ continue;
+ }
+ if (ret < 0)
+ elog(PANIC, "io_uring_submit failed: %d/%s",
+ ret, strerror(-ret));
+ else if (ret != num_staged_ios)
+ {
+ /* likely unreachable, but if it is, we would need to re-submit */
+ elog(PANIC, "submitted only %d of %d",
+ ret, num_staged_ios);
+ }
+ else
+ {
+ elog(DEBUG3, "submit nios: %d", num_staged_ios);
+ }
+ break;
+ }
+
+ return num_staged_ios;
+}
+
+
+#define PGAIO_MAX_LOCAL_REAPED 16
+
+static void
+pgaio_uring_drain_locked(PgAioUringContext *context)
+{
+ int ready;
+ int orig_ready;
+
+ /*
+ * Don't drain more events than available right now. Otherwise it's
+ * plausible that one backend could get stuck, for a while, receiving CQEs
+ * without actually processing them.
+ */
+ orig_ready = ready = io_uring_cq_ready(&context->io_uring_ring);
+
+ while (ready > 0)
+ {
+ struct io_uring_cqe *reaped_cqes[PGAIO_MAX_LOCAL_REAPED];
+ uint32 reaped;
+
+ START_CRIT_SECTION();
+ reaped =
+ io_uring_peek_batch_cqe(&context->io_uring_ring,
+ reaped_cqes,
+ Min(PGAIO_MAX_LOCAL_REAPED, ready));
+ Assert(reaped <= ready);
+
+ ready -= reaped;
+
+ for (int i = 0; i < reaped; i++)
+ {
+ struct io_uring_cqe *cqe = reaped_cqes[i];
+ PgAioHandle *ioh;
+
+ ioh = io_uring_cqe_get_data(cqe);
+ io_uring_cqe_seen(&context->io_uring_ring, cqe);
+
+ pgaio_io_process_completion(ioh, cqe->res);
+ }
+
+ END_CRIT_SECTION();
+
+ ereport(DEBUG3,
+ errmsg("drained %d/%d, now expecting %d",
+ reaped, orig_ready, io_uring_cq_ready(&context->io_uring_ring)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ }
+}
+
+static void
+pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation)
+{
+ PgAioHandleState state;
+ ProcNumber owner_procno = ioh->owner_procno;
+ PgAioUringContext *owner_context = &aio_uring_contexts[owner_procno];
+ bool expect_cqe;
+ int waited = 0;
+
+ /*
+ * We ought to have a smarter locking scheme; nearly all the time the
+ * backend owning the ring will reap the completions, making the locking
+ * unnecessarily expensive.
+ */
+ LWLockAcquire(&owner_context->completion_lock, LW_EXCLUSIVE);
+
+ while (true)
+ {
+ ereport(DEBUG3,
+ errmsg("wait_one for io:%d io_gen: %llu, ref_gen: %llu, in state %s, cycle %d",
+ pgaio_io_get_id(ioh),
+ (long long unsigned) ref_generation,
+ (long long unsigned) ioh->generation,
+ pgaio_io_get_state_name(ioh), waited),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state) ||
+ state != AHS_IN_FLIGHT)
+ {
+ break;
+ }
+ else if (io_uring_cq_ready(&owner_context->io_uring_ring))
+ {
+ expect_cqe = true;
+ }
+ else
+ {
+ int ret;
+ struct io_uring_cqe *cqes;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_DRAIN);
+ ret = io_uring_wait_cqes(&owner_context->io_uring_ring, &cqes, 1, NULL, NULL);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ continue;
+ }
+ else if (ret != 0)
+ {
+ elog(PANIC, "io_uring wait failed: %d/%s", ret, strerror(-ret));
+ }
+ else
+ {
+ Assert(cqes != NULL);
+ expect_cqe = true;
+ waited++;
+ }
+ }
+
+ if (expect_cqe)
+ {
+ pgaio_uring_drain_locked(owner_context);
+ }
+ }
+
+ LWLockRelease(&owner_context->completion_lock);
+
+ ereport(DEBUG3,
+ errmsg("wait_one with %d sleeps",
+ waited),
+ errhidestmt(true),
+ errhidecontext(true));
+}
+
+static void
+pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
+{
+ struct iovec *iov;
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READV:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.read.iov_length == 1)
+ {
+ io_uring_prep_read(sqe,
+ ioh->op_data.read.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.read.offset);
+ }
+ else
+ {
+ io_uring_prep_readv(sqe,
+ ioh->op_data.read.fd,
+ iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+
+ }
+ break;
+
+ case PGAIO_OP_WRITEV:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.write.iov_length == 1)
+ {
+ io_uring_prep_write(sqe,
+ ioh->op_data.write.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.write.offset);
+ }
+ else
+ {
+ io_uring_prep_writev(sqe,
+ ioh->op_data.write.fd,
+ iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ }
+ break;
+
+ case PGAIO_OP_INVALID:
+ elog(ERROR, "trying to prepare invalid IO operation for execution");
+ }
+
+ io_uring_sqe_set_data(sqe, ioh);
+}
+
+#endif /* USE_LIBURING */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index bc459dc5d2b..4fdcfb1df1b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -166,6 +166,7 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
[LWTRANCHE_XACT_SLRU] = "XactSLRU",
[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+ [LWTRANCHE_AIO_URING_COMPLETION] = "AioUringCompletion",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9b9c8f0d1fc..a5b12b48f99 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2121,6 +2121,7 @@ PgAioReturn
PgAioSubjectData
PgAioSubjectID
PgAioSubjectInfo
+PgAioUringContext
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0009-aio-Add-README.md-explaining-higher-level-design.patch (text/x-diff)
From c95ba2c47ddc454f19703c4361f47690ff8ff05e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 6 Sep 2024 15:27:57 -0400
Subject: [PATCH v2 09/20] aio: Add README.md explaining higher level design
---
src/backend/storage/aio/README.md | 413 ++++++++++++++++++++++++++++++
src/backend/storage/aio/aio.c | 2 +
2 files changed, 415 insertions(+)
create mode 100644 src/backend/storage/aio/README.md
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
new file mode 100644
index 00000000000..893f4ffe428
--- /dev/null
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,413 @@
+# Asynchronous & Direct IO
+
+## AIO Usage Example
+
+In many cases code that can benefit from AIO does not have to interact
+directly with the AIO interface, but can use AIO via higher-level
+abstractions. See [Helpers](#helpers).
+
+In this example, a buffer will be read into shared buffers.
+
+```C
+/*
+ * Result of the operation, only to be accessed in this backend.
+ */
+PgAioReturn ioret;
+
+/*
+ * Acquire AIO Handle, ioret will get result upon completion.
+ */
+PgAioHandle *ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
+
+/*
+ * Reference that can be used to wait for the IO we initiate below. This
+ * reference can reside in local or shared memory and be waited upon by any
+ * process. An arbitrary number of references can be made for each IO.
+ */
+PgAioRef ior;
+
+pgaio_io_get_ref(ioh, &ior);
+
+/*
+ * Arrange for shared buffer completion callbacks to be called upon completion
+ * of the IO. This callback will update the buffer descriptors associated with
+ * the AioHandle, which e.g. allows other backends to access the buffer.
+ *
+ * Multiple completion callbacks can be registered for each handle.
+ */
+pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
+
+/*
+ * The completion callback needs to know which buffers to update when the IO
+ * completes. As the AIO subsystem does not know about buffers, we have to
+ * associate this information with the AioHandle, for use by the completion
+ * callback registered above.
+ */
+pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
+
+/*
+ * Hand AIO handle to lower-level function. When operating on the level of
+ * buffers, we don't know how exactly the IO is performed, that is the
+ * responsibility of the storage manager implementation.
+ *
+ * E.g. md.c needs to translate block numbers into offsets in segments.
+ *
+ * Once the IO handle has been handed off, it may not be used any further,
+ * as the IO may immediately get executed below smgrstartreadv() and the
+ * handle reused for another IO.
+ */
+smgrstartreadv(ioh, operation->smgr, forknum, blkno,
+ BufferGetBlock(buffer), 1);
+
+/*
+ * As mentioned above, the IO might be initiated within smgrstartreadv(). That
+ * is however not guaranteed, to allow IO submission to be batched.
+ *
+ * Note that one needs to be careful while there may be unsubmitted IOs, as
+ * another backend may need to wait for one of the unsubmitted IOs. If this
+ * backend were to wait for the other backend, we'd have a deadlock. To avoid
+ * that, pending IOs need to be explicitly submitted before this backend
+ * might be blocked by a backend waiting for IO.
+ *
+ * Note that the IO might have immediately been submitted (e.g. due to reaching
+ * a limit on the number of unsubmitted IOs) and even completed during the
+ * smgrstartreadv() above.
+ *
+ * Once submitted, the IO is in-flight and can complete at any time.
+ */
+pgaio_submit_staged();
+
+/*
+ * To benefit from AIO, one should perform other work, including submitting
+ * further IOs, before waiting for this IO to complete. Otherwise we could
+ * just have used synchronous, blocking IO.
+ */
+perform_other_work();
+
+/*
+ * We did some other work and now need the IO operation to have completed to
+ * continue.
+ */
+pgaio_io_ref_wait(&ior);
+
+/*
+ * At this point the IO has completed. We do not yet know whether it succeeded
+ * or failed, however. The buffer's state has been updated, which allows other
+ * backends to use the buffer (if the IO succeeded), or retry the IO (if it
+ * failed).
+ *
+ * Note that in case the IO has failed, a LOG message may have been emitted,
+ * but no ERROR has been raised. This is crucial, as another backend waiting
+ * for this IO should not see an ERROR.
+ *
+ * To check whether the operation succeeded, and to raise an ERROR (or, if
+ * more appropriate, LOG), the PgAioReturn we passed to pgaio_io_get() is
+ * used.
+ */
+if (ioret.result.status == ARS_ERROR)
+ pgaio_result_log(ioret.result, &ioret.subject_data, ERROR);
+
+/*
+ * Besides having succeeded completely, the IO could also have partially
+ * completed. If we e.g. tried to read many blocks at once, the read might have
+ * only succeeded for the first few blocks.
+ *
+ * If the IO partially succeeded and this backend needs all blocks to have
+ * completed, this backend needs to reissue the IO for the remaining buffers.
+ * The AIO subsystem cannot handle this retry transparently.
+ *
+ * As this example is already long, and we only read a single block, we'll just
+ * error out if there's a partial read.
+ */
+if (ioret.result.status == ARS_PARTIAL)
+ pgaio_result_log(ioret.result, &ioret.subject_data, ERROR);
+
+/*
+ * The IO succeeded, so we can use the buffer now.
+ */
+```
+
+
+## Design Criteria & Motivation
+
+### Why Asynchronous IO
+
+Until the introduction of asynchronous IO, postgres relied on the operating
+system to hide the cost of synchronous IO. While this worked surprisingly
+well in a lot of workloads, it does not do as good a job of prefetching and
+controlled writeback as we would like.
+
+There are important expensive operations like `fdatasync()` where the operating
+system cannot hide the storage latency. This is particularly important for WAL
+writes, where the ability to asynchronously issue `fdatasync()` or O_DSYNC
+writes can yield significantly higher throughput.
+
+
+### Why Direct / unbuffered IO
+
+The main reasons to use Direct IO are:
+
+- Lower CPU usage / higher throughput. Particularly on modern storage,
+  buffered IO is bottlenecked by the operating system having to copy data
+  between the kernel's page cache and postgres' buffer pool using the CPU.
+  Direct IO, in contrast, can often move the data directly between the
+  storage devices and postgres' buffer cache, using DMA. While that transfer
+  is ongoing, the CPU is free to perform other work.
+- Reduced latency - Direct IO can have substantially lower latency than
+ buffered IO, which can be impactful for OLTP workloads bottlenecked by WAL
+ write latency.
+- Avoiding double buffering between operating system cache and postgres'
+ shared_buffers.
+- Better control over the timing and pace of dirty data writeback.
+
+
+The main reasons *not* to use Direct IO are:
+
+- Without AIO, Direct IO is unusably slow for most purposes.
+- Even with AIO, many parts of postgres need to be modified to perform
+ explicit prefetching.
+- In situations where shared_buffers cannot be set appropriately large,
+ e.g. because there are many different postgres instances hosted on shared
+ hardware, performance will often be worse than when using buffered IO.
+
+
+### Deadlock and Starvation Dangers due to AIO
+
+Using AIO in a naive way can easily lead to deadlocks in an environment where
+the source/target of AIO are shared resources, like pages in postgres'
+shared_buffers.
+
+Consider one backend performing readahead on a table, initiating IO for a
+number of buffers ahead of the current "scan position". If that backend then
+performs some operation that blocks, or even just is slow, the IO completion
+for the asynchronously initiated read may not be processed.
+
+This AIO implementation solves this problem by requiring that AIO methods
+either allow AIO completions to be processed by any backend in the system
+(e.g. io_uring), or guarantee that AIO processing will happen even when the
+issuing backend is blocked (e.g. worker mode, which offloads completion
+processing to the AIO workers).
+
+
+### IO can be started in critical sections
+
+Using AIO for WAL writes can reduce the overhead of WAL logging substantially:
+
+- AIO allows WAL writes to be started eagerly, so they complete before we
+  need to wait for them
+- AIO allows multiple WAL flushes to be in progress at the same time
+- AIO makes it more realistic to use O\_DIRECT + O\_DSYNC, which can reduce
+  the number of roundtrips to storage on some OSs and storage HW (buffered IO
+  and direct IO without O_DSYNC need to issue a write and, after the write's
+  completion, a cache flush, whereas O\_DIRECT + O\_DSYNC can use a single
+  FUA write).
+
+The need to be able to execute IO in critical sections has substantial
+design implications for the AIO subsystem. Mainly because completing IOs
+(see the prior section) needs to be possible within a critical section, even
+if the to-be-completed IO itself was not issued in a critical section.
+Consider e.g. the case of a backend first starting a number of writes from
+shared buffers and then starting to flush the WAL. Because only a limited
+amount of IO can be in progress at the same time, initiating the IO for
+flushing the WAL may require first finishing IO that was issued earlier.
+
+
+### State for AIO needs to live in shared memory
+
+Because postgres uses a process model and because AIOs need to be
+complete-able by any backend, much of the state of the AIO subsystem needs
+to live in shared memory.
+
+In an `EXEC_BACKEND` build, a backend's executable code and other process
+local state are not necessarily mapped to the same addresses in each process
+due to ASLR. This means that shared memory cannot contain pointers to
+callbacks.
+
+
+## Design of the AIO Subsystem
+
+
+### AIO Methods
+
+To achieve portability and performance, multiple methods of performing AIO are
+implemented and others are likely worth adding in the future.
+
+
+#### Synchronous Mode
+
+`io_method=sync` does not actually perform AIO, but allows using the AIO API
+while performing synchronous IO. This can be useful for debugging. The code
+for the synchronous mode is also used as a fallback, e.g. by the
+[worker mode](#worker), to execute IO that cannot be executed by workers.
+
+
+#### Worker
+
+`io_method=worker` is available on every platform postgres runs on, and
+implements asynchronous IO - from the view of the issuing process - by
+dispatching the IO to one of several worker processes performing the IO in a
+synchronous manner.
+
+
+#### io_uring
+
+`io_method=io_uring` is available on Linux 5.1+. In contrast to worker mode it
+dispatches all IO from within the process, lowering context switch rate /
+latency.
+
+
+### AIO Handles
+
+The central API piece of postgres' AIO abstraction are AIO handles. To
+execute an IO one first has to acquire an IO handle (`pgaio_io_get()`) and
+then "define" it, i.e. associate an IO operation with the handle.
+
+Often AIO handles are acquired on a higher level and then passed to a lower
+level to be fully defined. E.g., for IO to/from shared buffers, bufmgr.c
+routines acquire the handle, which is then passed through smgr.c, md.c to be
+finally fully defined in fd.c.
+
+The functions used at the lowest level to define the operation are
+`pgaio_io_prep_*()`.
+
+Because acquisition of an IO handle
+[must always succeed](#io-can-be-started-in-critical-sections)
+and the number of AIO Handles
+[has to be limited](#state-for-aio-needs-to-live-in-shared-memory)
+AIO handles can be reused as soon as they have completed. Obviously code needs
+to be able to react to IO completion. Shared state can be updated using
+[AIO Completion callbacks](#aio-callbacks)
+and the issuing backend can provide a backend local variable to receive the
+result of the IO, as described in [AIO Results](#aio-results). An IO can be
+waited for, by both the issuing and any other backend, using
+[AIO References](#aio-references).
+
+
+Because an AIO Handle is not executable just after calling `pgaio_io_get()`,
+and because `pgaio_io_get()` needs to be able to succeed, only a single AIO
+Handle may be acquired (i.e. returned by `pgaio_io_get()`) without having
+been defined (by, potentially indirectly, calling `pgaio_io_prep_*()`).
+Otherwise a backend could trivially self-deadlock by using up all AIO
+Handles without the ability to wait for some of the IOs to complete.
+
+If it turns out that an AIO Handle is not needed, e.g., because the handle was
+acquired before holding a contended lock, it can be released without being
+defined using `pgaio_io_release()`.
+
+
+### AIO Callbacks
+
+Commonly several layers need to react to the completion of an IO. E.g. for
+a read, md.c needs to check if the IO outright failed or was shorter than
+needed, and bufmgr.c needs to verify that the page looks valid and then
+update the BufferDesc to reflect the buffer's new state.
+
+The fact that several layers / subsystems need to react to IO completion poses
+a few challenges:
+
+- Upper layers should not need to know details of lower layers. E.g. bufmgr.c
+ should not assume the IO will pass through md.c. Therefore upper levels
+ cannot know what lower layers would consider an error.
+
+- Lower layers should not need to know about upper layers. E.g. smgr APIs are
+ used going through shared buffers but are also used bypassing shared
+ buffers. This means that e.g. md.c is not in a position to validate
+ checksums.
+
+- Having code in the AIO subsystem for every possible combination of layers
+ would lead to a lot of duplication.
+
+The "solution" to this is the ability to associate multiple completion
+callbacks with a handle. E.g. bufmgr.c can have a callback to update the
+BufferDesc state and to verify the page, and md.c another callback to check
+if the IO operation was successful.
+
+As [mentioned](#state-for-aio-needs-to-live-in-shared-memory), shared memory
+currently cannot contain function pointers. Because of that, completion
+callbacks are not directly identified by function pointers but by IDs
+(`PgAioHandleSharedCallbackID`). A substantial added benefit is that this
+allows callbacks to be identified by a much smaller amount of memory (a
+single byte currently).
+
+In addition to completion, AIO callbacks are also called to "prepare" an
+IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
+IO to/from shared buffers, which is required to handle the case where the
+issuing backend errors out and releases its own pins.
+
+As [explained earlier](#io-can-be-started-in-critical-sections) IO completions
+need to be safe to execute in critical sections. To allow the backend that
+issued the IO to error out in case of failure, [AIO Results](#aio-results)
+can be used.
+
+
+### AIO Subjects
+
+In addition to the completion callbacks described above, each AIO Handle has
+exactly one "subject". Each subject has some space inside an AIO Handle for
+information specific to the subject, and can provide callbacks to reopen the
+underlying file (required for worker mode) and to describe the IO operation
+(used for debug logging and error messages).
+
+
+### AIO References
+
+As [described above](#aio-handles), AIO Handles can be reused immediately
+after completion and therefore cannot themselves be used to wait for the
+completion of an IO. Waiting is instead enabled by AIO references, which
+identify not just an AIO Handle but also the handle's "generation".
+
+A reference to an AIO Handle can be acquired using `pgaio_io_get_ref()` and
+then waited upon using `pgaio_io_ref_wait()`.
+
+
+### AIO Results
+
+As AIO completion callbacks
+[are executed in critical sections](#io-can-be-started-in-critical-sections)
+and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
+completion callbacks cannot be used to, e.g., make the query that triggered an
+IO ERROR out.
+
+To allow reacting to failing IOs, the issuing backend can pass a pointer to a
+`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
+`PgAioReturn` is filled with information about the IO. This includes
+information about whether the IO was successful (as a value of
+`PgAioResultStatus`) and enough information to raise an error in case of a
+failure (via `pgaio_result_log()`, with the error details encoded in
+`PgAioResult`).
+
+XXX: "return" vs "result" vs "result status" seems quite confusing. The naming
+should be improved.
+
+
+### AIO Errors
+
+It would be very convenient to have shared completion callbacks encode the
+details of errors as an `ErrorData` that could be raised at a later
+time. Unfortunately doing so would require allocating memory. While elog.c can
+guarantee (well, kinda) that logging a message will not run out of memory,
+that only works because a very limited number of messages are in the process
+of being logged. With AIO a large number of concurrently issued AIOs might
+fail.
+
+To avoid the need for preallocating a potentially large amount of memory (in
+shared memory no less!), completion callbacks instead have to encode errors in
+a more compact format that can be converted into an error message.
+
+
+## Helpers
+
+Using the low-level AIO API directly introduces too much complexity to do
+so all over the tree. Most uses of AIO should instead go through reusable,
+higher-level helpers.
+
+
+### Read Stream
+
+A common and very beneficial use of AIO are reads where a substantial number
+of to-be-read locations are known ahead of time. E.g., for a sequential scan
+the set of blocks that need to be read can be determined solely by knowing the
+current position and checking the buffer mapping table.
+
+The [Read Stream](../../../include/storage/read_stream.h) interface makes it
+comparatively easy to use AIO for such use cases.
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 701f06287d9..2439ce3740d 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -24,6 +24,8 @@
* - read_stream.c - helper for accessing buffered relation data with
* look-ahead
*
+ * - README.md - higher-level overview over AIO
+ *
*
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0019-Temporary-Increase-BAS_BULKREAD-size.patch (text/x-diff)
From 75c690243866d3f6b476ecfb9c249da8098122f0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 1 Sep 2024 00:42:27 -0400
Subject: [PATCH v2 19/20] Temporary: Increase BAS_BULKREAD size
Without this we can only execute very little AIO for sequential scans, as
there's just not enough buffers in the ring. This isn't the right fix, as
just increasing the ring size can have negative performance implications in
workloads where the kernel has all the data cached.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/freelist.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index dffdd57e9b5..f5795b509c7 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -555,7 +555,12 @@ GetAccessStrategy(BufferAccessStrategyType btype)
return NULL;
case BAS_BULKREAD:
- ring_size_kb = 256;
+
+ /*
+ * FIXME: Temporary increase to allow large enough streaming reads
+ * to actually benefit from AIO. This needs a better solution.
+ */
+ ring_size_kb = 2 * 1024;
break;
case BAS_BULKWRITE:
ring_size_kb = 16 * 1024;
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0020-WIP-Use-MAP_POPULATE.patch (text/x-diff)
From e9c132e191cacc9fc946b611afc5f489762c4387 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 31 Dec 2024 13:25:56 -0500
Subject: [PATCH v2 20/20] WIP: Use MAP_POPULATE
For benchmarking it's quite annoying that the first time memory is touched
has completely different perf characteristics than subsequent accesses. Using
MAP_POPULATE reduces that substantially.
---
src/backend/port/sysv_shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index a5a4511f66d..2a45dffd5e0 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -620,7 +620,7 @@ CreateAnonymousSegment(Size *size)
allocsize += hugepagesize - (allocsize % hugepagesize);
ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
- PG_MMAP_FLAGS | mmap_flags, -1, 0);
+ PG_MMAP_FLAGS | MAP_POPULATE | mmap_flags, -1, 0);
mmap_errno = errno;
if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
--
2.45.2.746.g06e570c0df.dirty
Hi,
On 2024-12-19 17:29:12 -0500, Andres Freund wrote:
Not about patch itself, but questions about related stack functionality:

7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints
on how to inspect the real I/O calls requested, to review whether the code
is issuing sensible calls: there's no strace for uring; do you stick to
DEBUG3, or is using some bpftrace / xfsslower the best way to go?

I think we still want something like it, but I don't think it needs to be in
the initial commits.
After I got this question from Thomas as well, I started hacking one up.
What information would you like to see?
Here's what I currently have:
┌─[ RECORD 1 ]───┬────────────────────────────────────────────────┐
│ pid │ 358212 │
│ io_id │ 2050 │
│ io_generation │ 4209 │
│ state │ COMPLETED_SHARED │
│ operation │ read │
│ offset │ 509083648 │
│ length │ 262144 │
│ subject │ smgr │
│ iovec_data_len │ 32 │
│ raw_result │ 262144 │
│ result │ OK │
│ error_desc │ (null) │
│ subject_desc │ blocks 1372864..1372895 in file "base/5/16388" │
│ flag_sync │ f │
│ flag_localmem │ f │
│ flag_buffered │ t │
├─[ RECORD 2 ]───┼────────────────────────────────────────────────┤
│ pid │ 358212 │
│ io_id │ 2051 │
│ io_generation │ 4199 │
│ state │ IN_FLIGHT │
│ operation │ read │
│ offset │ 511967232 │
│ length │ 262144 │
│ subject │ smgr │
│ iovec_data_len │ 32 │
│ raw_result │ (null) │
│ result │ UNKNOWN │
│ error_desc │ (null) │
│ subject_desc │ blocks 1373216..1373247 in file "base/5/16388" │
│ flag_sync │ f │
│ flag_localmem │ f │
│ flag_buffered │ t │
I didn't think that pg_stat_* was quite the right namespace, given that it
shows not stats, but the currently ongoing IOs. I am going with pg_aios for
now, but I don't particularly like that.
I think we'll want a pg_stat_aio as well, tracking things like:
- how often the queue to IO workers was full
- how many times we submitted IO to the kernel (<= #ios with io_uring)
- how many times we asked the kernel for events (<= #ios with io_uring)
- how many times we had to wait for in-flight IOs before issuing more IOs
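As a sketch of the shape such per-backend counters could take (all names here are invented for illustration, not taken from the patchset):

```c
#include <stdint.h>

/* Hypothetical counters backing a pg_stat_aio view; all names invented. */
typedef struct PgAioStats
{
    uint64_t    worker_queue_full;  /* submission queue to IO workers was full */
    uint64_t    kernel_submits;     /* syscalls submitting IO to the kernel */
    uint64_t    kernel_waits;       /* syscalls asking the kernel for events */
    uint64_t    inflight_waits;     /* waited for in-flight IOs before issuing more */
} PgAioStats;

/*
 * With io_uring one submission syscall can cover many IOs, which is why the
 * counter is incremented once per syscall, not once per IO (hence the
 * "<= #ios with io_uring" notes above).
 */
static void
pgaio_stats_count_submit(PgAioStats *stats, int nios)
{
    (void) nios;                /* per-IO accounting could be added here */
    stats->kernel_submits++;
}
```

Fixed-width counters like these could live in shared memory per backend and be summed by the view, similar to other cumulative statistics.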
Greetings,
Andres Freund
Patches 1 and 2 are still Ready for Committer.
On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:
- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. It'd be very good to get
feedback from folks that aren't as familiar with AIO, I can't really see
what's easy/hard anymore.
That's a helpful addition. I've left inline comments on it, below.
The biggest TODOs are:
- Right now the API between bufmgr.c and read_stream.c kind of necessitates
  that one StartReadBuffers() call actually can trigger multiple IOs, if
  one of the buffers was read in by another backend, before "this" backend
  called StartBufferIO().

I think Thomas and I figured out a way to evolve the interface so that this
isn't necessary anymore: We allow StartReadBuffers() to memorize buffers it
pinned but didn't initiate IO on in the buffers[] argument. The next call to
StartReadBuffers() then doesn't have to repin these buffers. That doesn't
just solve the multiple-IOs-for-one-"read operation" issue, it also makes the
- very common - case of a bunch of "buffer misses" followed by a "buffer hit"
cleaner: the hit wouldn't be tracked in the same ReadBuffersOperation
anymore.
That sounds reasonable.
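As a toy model of the memorization scheme described above (all names invented; the real interface is bufmgr.c's StartReadBuffers()/ReadBuffersOperation, which this only mimics):

```c
#include <stdbool.h>

#define TOY_MAX_READ_BUFFERS 16

/* Invented stand-in for ReadBuffersOperation, purely for illustration. */
typedef struct ToyReadOp
{
    bool        pinned[TOY_MAX_READ_BUFFERS];   /* pins memorized across calls */
    int         nblocks;
} ToyReadOp;

/*
 * Pin the requested buffers, skipping the ones a previous call already
 * pinned but did not start IO on. Returns the number of new pins taken,
 * i.e. the work a follow-up call for the same range can avoid.
 */
static int
toy_start_read_buffers(ToyReadOp *op)
{
    int         newly_pinned = 0;

    for (int i = 0; i < op->nblocks; i++)
    {
        if (op->pinned[i])
            continue;           /* memorized pin, no need to repin */
        op->pinned[i] = true;
        newly_pinned++;
    }
    return newly_pinned;
}
```

The point of the pattern is only that the operation carries enough state for a second call to be idempotent with respect to pinning; the actual IO-initiation logic is omitted.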
- Right now bufmgr.h includes aio.h, because it needs to include a reference
to the AIO's result in ReadBuffersOperation. Requiring a dynamic allocation
would be noticeable overhead, so that's not an option. I think the best
option here would be to introduce something like aio_types.h, so fewer
things are included.
That sounds fine. Header splits aren't going to be perfect, so I'd pick
something (e.g. your proposal here) and move on.
- There's no obvious way to tell "internal" functions operating on an IO
  handle apart from functions that are expected to be called by the issuer
  of an IO.

One way to deal with this would be to introduce a distinct "issuer IO
reference" type. I think that might be a good idea; it would also make it
clearer that a good number of the functions can only be called by the
issuer, before the IO is submitted.
That's reasonable, albeit non-critical.
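A cheap way to get that distinction is to wrap the general reference type in a single-member struct, so that issuer-only functions take a type that other code cannot supply by accident. A sketch with invented names:

```c
#include <stdint.h>

/* A reference any backend may use to wait on an IO (invented stand-in). */
typedef struct ToyAioRef
{
    int         idx;            /* index of the AIO handle */
    uint64_t    generation;     /* guards against handle reuse */
} ToyAioRef;

/*
 * A reference that only the issuing backend may hold, valid until the IO is
 * submitted. Wrapping the general reference in a distinct struct (instead of
 * using a typedef) keeps the two types incompatible, so passing a plain
 * ToyAioRef to an issuer-only function fails to compile.
 */
typedef struct ToyAioIssuerRef
{
    ToyAioRef   ref;
} ToyAioIssuerRef;

/* Example of a function restricted to the issuer before submission. */
static uint64_t
toy_issuer_only_generation(ToyAioIssuerRef iref)
{
    return iref.ref.generation;
}
```

The wrapper costs nothing at runtime; it only moves the "who may call this" rule from comments into the type system.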
- The naming around PgAioReturn, PgAioResult, PgAioResultStatus needs to be
improved
POSIX uses the word "result" for the consequences of a function (e.g. the
result of unlink() is readdir() no longer finding the link). It uses the word
"return" for a memory value that describes a result. In that usage, the
struct currently called PgAioResult would be a Return. The struct currently
called PgAioReturn is PgAioResult plus the data to identify the IO. Possible
name changes:
PgAioResult -> PgAioReturn
PgAioReturn -> PgAioReturnIdentified | PgAioReturnID | PgAioReturnTagged [I don't love these]
PgAioResultStatus -> PgAioStatus | PgAioFill
That said, I don't dislike the existing names and would not have raised the
topic myself.
- The debug logging functions are a bit of a mess, lots of very similar code
in lots of places. I think AIO needs a few ereport() wrappers to make this
easier.
May as well.
- More tests are needed. None of our current test frameworks really makes this
easy :(.
Which testing gap do you find most concerning? I'd be most interested in the
cases that would be undetected deadlocks under a naive design. An example
appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com
- Several folks asked for pg_stat_aio to come back, in "v1" that showed the
set of currently in-flight AIOs. That's not particularly hard - except
that it doesn't really fit in the pg_stat_* namespace.
Later message
postgr.es/m/6vjl6jeaqvyhfbpgwziypwmhem2rwla4o5pgpuxwtg3o3o3jb5@evyzorb5meth is
considering the name pg_aios. Works for me.
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -3,6 +3,28 @@
  * aio.c
  *   AIO - Core Logic
  *
+ * For documentation about how AIO works on a higher level, including a
+ * schematic example, see README.md.
+ *
+ *
+ * AIO is a complicated subsystem. To keep things navigable it is split across
+ * a number of files:
+ *
+ * - aio.c - core AIO state handling
+ *
+ * - aio_init.c - initialization
+ *
+ * - aio_io.c - dealing with actual IO, including executing IOs synchronously
+ *
+ * - aio_subject.c - functionality related to executing IO for different
+ *   subjects
+ *
+ * - method_*.c - different ways of executing AIO
+ *
+ * - read_stream.c - helper for accessing buffered relation data with
+ *   look-ahead
+ *
I felt like some list entries in this new header comment largely restated the
file name. Here's how I'd write them to avoid that:
* - method_*.c - different ways of executing AIO (e.g. worker process)
* - aio_io.c - method-independent code for specific IO ops (e.g. readv)
* - aio_subject.c - callbacks at IO operation lifecycle events
* - aio_init.c - per-fork and per-startup-process initialization
* - aio.c - all other topics
* - read_stream.c - helper for reading buffered relation data
--- /dev/null
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,413 @@
+# Asynchronous & Direct IO
I would move "### Why Asynchronous IO" to here; that's good background before
getting into the example. I might also move "### Why Direct / unbuffered IO"
to here. For me as a reader, I'd benefit from seeing things in this order:
- "why"
- condensed usage example like manpage SYNOPSIS, comments and decls removed
- PgAioHandleState and discussion of valid transitions
- usage example as it is, with full comments
- the rest
In other words, like this:
# Asynchronous & Direct IO
## Motivation
### Why Asynchronous IO
[existing content moved from lower in the file]
## Synopsis
ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
pgaio_io_get_ref(ioh, &ior);
pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
BufferGetBlock(buffer), 1);
pgaio_submit_staged();
pgaio_io_ref_wait(&ior);
if (ioret.result.status == ARS_ERROR)
pgaio_result_log(ioret.result, &ioret.subject_data, ERROR);
## I/O Operation States & Transitions
[PgAioHandleState and its transitions]
## AIO Usage Example
[your content:]
+
+## AIO Usage Example
+
+In many cases code that can benefit from AIO does not directly have to
+interact with the AIO interface, but can use AIO via higher-level
+abstractions. See [Helpers](#helpers).
+
+In this example, a buffer will be read into shared buffers.
+
+```C
+/*
+ * Result of the operation, only to be accessed in this backend.
+ */
+PgAioReturn ioret;
+
+/*
+ * Acquire AIO Handle, ioret will get result upon completion.
Consider adding: from here to pgaio_submit_staged(), don't do [description of
the kind of unacceptable blocking operations].
+ * Once the IO handle has been handed of, it may not further be used, as the
s/of/off/
+### IO can be started in critical sections
...
+The need to be able to execute IO in critical sections has substantial design
+implication on the AIO subsystem. Mainly because completing IOs (see prior
+section) needs to be possible within a critical section, even if the
+to-be-completed IO itself was not issued in a critical section. Consider
+e.g. the case of a backend first starting a number of writes from shared
+buffers and then starting to flush the WAL. Because only a limited amount of
+IO can be in-progress at the same time, initiating the IO for flushing the WAL
+may require to first finish executing IO executed earlier.
The last line's two appearances of the word "execute" read awkwardly to me,
and it's an opportunity to use PgAioHandleState terms. Consider writing the
last line like "may first advance an existing IO from AHS_PREPARED to
AHS_COMPLETED_SHARED".
+ASLR. This means that the shared memory cannot contain pointer to callbacks.
s/pointer/pointers/
+### AIO Callbacks
...
+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
+IO to/from shared buffers, which is required to handle the case where the
+issuing backend errors out and releases its own pins.
Reading this, it's not obvious to me how to reconcile "finishing an IO could
require pin acquisition" with "finishing an IO could happen in a critical
section". Pinning a buffer in a critical section sounds bad. I vaguely
recall understanding how it was okay as of my September review, but I've
already forgotten. Can this text have a sentence making that explicit?
+### AIO Subjects
+
+In addition to the completion callbacks describe above, each AIO Handle has
+exactly one "subject". Each subject has some space inside an AIO Handle with
+information specific to the subject and can provide callbacks to allow to
+reopen the underlying file (required for worker mode) and to describe the IO
+operation (used for debug logging and error messages).
Can this say roughly how to decide when to add a new subject? Failing that,
can it give examples of what additional subjects might exist if certain
existing subsystems were to start using AIO?
+### AIO Results
+
+As AIO completion callbacks
+[are executed in critical sections](#io-can-be-started-in-critical-sections)
+and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
+completion callbacks cannot be used to, e.g., make the query that triggered an
+IO ERROR out.
+
+To allow to react to failing IOs the issuing backend can pass a pointer to a
+`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
+`PgAioReturn` is filled with information about the IO. This includes
+information about whether the IO was successful (as a value of
+`PgAioResultStatus`) and enough information to raise an error in case of a
+failure (via `pgaio_result_log()`, with the error details encoded in
+`PgAioResult`).
Can this have a sentence on how this fits in bounded shmem, given the lack of
guarantees about a backend's responsiveness? In other words, what makes it
okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
to AHS_COMPLETED_LOCAL?
Thanks,
nm
Hi,
On 2025-01-06 10:52:20 -0800, Noah Misch wrote:
Patches 1 and 2 are still Ready for Committer.
I feel somewhat weird about pushing 0002 without a user, but I guess it's
still exercised, so it's probably fine...
On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:
- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. It'd be very good to get
feedback from folks that aren't as familiar with AIO, I can't really see
what's easy/hard anymore.
That's a helpful addition. I've left inline comments on it, below.
Cool!
- More tests are needed. None of our current test frameworks really makes this
easy :(.
Which testing gap do you find most concerning?
Most of it isn't even AIO specific...
- temporary tables are rather poorly tested in general:
- e.g. trivial to exceed the number of buffers, but our tests don't reach that
- We have pretty much no testing for IO errors. We have a bit of coverage due to
src/bin/pg_amcheck/t/003_check.pl, but that's for errors originating in
bufmgr.c itself.
- no real testing of StartBufferIO's etc wait paths
- no testing for BM_PIN_COUNT_WAITER
I e.g. just noticed that the error handling for AIO on temp tables was broken
- but our tests never reach that:
The bug exists due to temp tables not differentiating between "backend" pins
and a "global pincount" - which means that there's no real way for the AIO
subsystem to have a reference separate from the backend local pin -
CheckForLocalBufferLeaks() complains about any leftover pins. It seems to
work in non-assert mode, but with assertions transaction abort asserts out.
I'd be most interested in the
cases that would be undetected deadlocks under a naive design. An example
appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com
That's a good one, yea.
I think I'll try to translate the regression tests I wrote into an isolation
test, I hope that'll make it a bit easier to cover more cases.
And then we'll need more injection points, I'm afraid :(.
- Several folks asked for pg_stat_aio to come back, in "v1" that showed the
set of currently in-flight AIOs. That's not particularly hard - except
that it doesn't really fit in the pg_stat_* namespace.
Later message
postgr.es/m/6vjl6jeaqvyhfbpgwziypwmhem2rwla4o5pgpuxwtg3o3o3jb5@evyzorb5meth is
considering the name pg_aios. Works for me.
Cool.
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -3,6 +3,28 @@
  * aio.c
  * AIO - Core Logic
  *
+ * For documentation about how AIO works on a higher level, including a
+ * schematic example, see README.md.
+ *
+ *
+ * AIO is a complicated subsystem. To keep things navigable it is split across
+ * a number of files:
+ *
+ * - aio.c - core AIO state handling
+ *
+ * - aio_init.c - initialization
+ *
+ * - aio_io.c - dealing with actual IO, including executing IOs synchronously
+ *
+ * - aio_subject.c - functionality related to executing IO for different
+ *   subjects
+ *
+ * - method_*.c - different ways of executing AIO
+ *
+ * - read_stream.c - helper for accessing buffered relation data with
+ *   look-ahead
+ *
I felt like some list entries in this new header comment largely restated the
file name. Here's how I'd write them to avoid that:
Thanks, adopting.
* - method_*.c - different ways of executing AIO (e.g. worker process)
* - aio_io.c - method-independent code for specific IO ops (e.g. readv)
* - aio_subject.c - callbacks at IO operation lifecycle events
* - aio_init.c - per-fork and per-startup-process initialization
I don't particularly like "per-startup-process", because "global
initialization" really is separate from (and precedes) the startup process's
startup. Maybe "per-server and per-backend initialization"?
* - aio.c - all other topics
* - read_stream.c - helper for reading buffered relation data
Did the order you listed the files have a system to it? If so, what is it?
--- /dev/null
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,413 @@
+# Asynchronous & Direct IO
I would move "### Why Asynchronous IO" to here; that's good background before
getting into the example.
I moved the example back and forth when writing because different readers
would benefit from a different order and I couldn't quite decide.
So I'm happy to adjust based on your feedback...
I might also move "### Why Direct / unbuffered IO" to here. For me as a
reader, I'd benefit from seeing things in this order:
- "why"
- condensed usage example like manpage SYNOPSIS, comments and decls removed
- PgAioHandleState and discussion of valid transitions
Hm - why have PgAioHandleState and its states before the usage example? Seems
like it'd be harder to understand that way.
- usage example as it is, with full comments
- the rest
## Synopsis
ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
pgaio_io_get_ref(ioh, &ior);
pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
BufferGetBlock(buffer), 1);
pgaio_submit_staged();
pgaio_io_ref_wait(&ior);
if (ioret.result.status == ARS_ERROR)
pgaio_result_log(ioret.result, &ioret.subject_data, ERROR);
Happy to add this, but I'm not entirely sure if that's really that useful to
have without commentary? The synopsis in manpages is helpful because it
provides the signature of various functions, but this wouldn't...
+
+## AIO Usage Example
+
+In many cases code that can benefit from AIO does not directly have to
+interact with the AIO interface, but can use AIO via higher-level
+abstractions. See [Helpers](#helpers).
+
+In this example, a buffer will be read into shared buffers.
+
+```C
+/*
+ * Result of the operation, only to be accessed in this backend.
+ */
+PgAioReturn ioret;
+
+/*
+ * Acquire AIO Handle, ioret will get result upon completion.
Consider adding: from here to pgaio_submit_staged(), don't do [description of
the kind of unacceptable blocking operations].
Hm. Strictly speaking it's fine to block here, depending on whether
StartBufferIO() was already called. I'll clarify.
+### IO can be started in critical sections
...
+The need to be able to execute IO in critical sections has substantial design
+implication on the AIO subsystem. Mainly because completing IOs (see prior
+section) needs to be possible within a critical section, even if the
+to-be-completed IO itself was not issued in a critical section. Consider
+e.g. the case of a backend first starting a number of writes from shared
+buffers and then starting to flush the WAL. Because only a limited amount of
+IO can be in-progress at the same time, initiating the IO for flushing the WAL
+may require to first finish executing IO executed earlier.
The last line's two appearances of the word "execute" read awkwardly to me,
and it's an opportunity to use PgAioHandleState terms. Consider writing the
last line like "may first advance an existing IO from AHS_PREPARED to
AHS_COMPLETED_SHARED".
It is indeed awkward. I don't love referencing the state-constants here
though, somehow that feels like a reference-cycle ;). What about this:
... Consider
e.g. the case of a backend first starting a number of writes from shared
buffers and then starting to flush the WAL. Because only a limited amount of
IO can be in-progress at the same time, initiating IO for flushing the WAL may
require to first complete IO that was started earlier.
+### AIO Callbacks
...
+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
+IO to/from shared buffers, which is required to handle the case where the
+issuing backend errors out and releases its own pins.
Reading this, it's not obvious to me how to reconcile "finishing an IO could"
require pin acquisition" with "finishing an IO could happen in a critical
section". Pinning a buffer in a critical section sounds bad. I vaguely
recall understanding how it was okay as of my September review, but I've
already forgotten. Can this text have a sentence making that explicit?
Ah, yes, that's easy to misunderstand. The answer basically is that we don't
newly pin a buffer, we just increment the reference count by 1. That should
never fail.
How about:
In addition to completion, AIO callbacks also are called to "prepare" an
IO. This is, e.g., used to increase buffer reference counts to account for the
AIO subsystem referencing the buffer, which is required to handle the case
where the issuing backend errors out and releases its own pins while the IO is
still ongoing.
+### AIO Subjects
+
+In addition to the completion callbacks describe above, each AIO Handle has
+exactly one "subject". Each subject has some space inside an AIO Handle with
+information specific to the subject and can provide callbacks to allow to
+reopen the underlying file (required for worker mode) and to describe the IO
+operation (used for debug logging and error messages).
Can this say roughly how to decide when to add a new subject?
Hm, there obviously is some fuzziness. I was trying to get to some of that by
mentioning that the subject needs to know how to [re-]open a file and describe
the target of the IO in terms that make sense to the user.
E.g. smgr seemed to make sense as a subject as the smgr layer knows how to
open a file by delegating that to the layer below and the layer above just
knows about smgr, not md.c (or other potential smgr implementations).
The reason to keep this separate from the callbacks was that smgr IO going
through shared buffers, bypassing shared buffers and different smgr
implementations all could share the same subject implementation, even if
callbacks would differ between these use cases.
How about:
I.e., if two different uses of AIO can describe the identity of the file being
operated on the same way, it likely makes sense to use the same
subject. E.g. different smgr implementations can describe IO with
RelFileLocator, ForkNumber and BlockNumber and can thus share a subject. In
contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
and it would not make sense to use the same subject for smgr and WAL.
Failing that, can it give examples of what additional subjects might exist
if certain existing subsystems were to start using AIO?
I think the main ones I can think of are:
1) WAL logging
This was implemented in v1. I'd guess that "real" WAL logging and
initializing new WAL segments might use a different subject, but that's
probably a question of taste.
2) "raw" file IO, for things that don't use the smgr abstraction. I could
e.g. imagine using AIO in COPY to read / write the FROM/TO file or to
implement CREATE DATABASE ... STRATEGY file_copy with AIO.
This was used in v1, e.g. to implement the initial data directory sync
after a crash. We do that on a filesystem level, not going through smgr
etc.
3) FE/BE network IO
+### AIO Results
+
+As AIO completion callbacks
+[are executed in critical sections](#io-can-be-started-in-critical-sections)
+and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
+completion callbacks cannot be used to, e.g., make the query that triggered an
+IO ERROR out.
+
+To allow to react to failing IOs the issuing backend can pass a pointer to a
+`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
+`PgAioReturn` is filled with information about the IO. This includes
+information about whether the IO was successful (as a value of
+`PgAioResultStatus`) and enough information to raise an error in case of a
+failure (via `pgaio_result_log()`, with the error details encoded in
+`PgAioResult`).
Can this have a sentence on how this fits in bounded shmem, given the lack of
guarantees about a backend's responsiveness? In other words, what makes it
okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
to AHS_COMPLETED_LOCAL?
I agree this should be explained somewhere - but not sure this is the best
place.
The reason it's ok is that each backend has a limited number of AIO handles;
if it runs out of IO handles we'll a) check if any IOs can be reclaimed and
b) wait for the oldest IO to finish.
Thanks for the review!
Andres Freund
On 01/01/2025 06:03, Andres Freund wrote:
Hi,
Attached is a new version of the AIO patchset.
I haven't gone through it all yet, but some comments below.
The biggest changes are:
- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. It'd be very good to get
feedback from folks that aren't as familiar with AIO, I can't really see
what's easy/hard anymore.
Thanks, the README is super helpful! I was overwhelmed by all the new
concepts before, now it all makes much more sense.
Now that it's all laid out more clearly, I see how many different
concepts and states there really are:
- For a single IO, there is an "IO handle", "IO references", and an "IO
return". You first allocate an IO handle (PgAioHandle), and then you get
a reference (PgAioHandleRef) and an "IO return" (PgAioReturn) struct for it.
- An IO handle has eight different states (PgAioHandleState).
I'm sure all those concepts exist for a reason. But still I wonder: can
we simplify?
pgaio_io_get() and pgaio_io_release() are a bit asymmetric, I'd suggest
pgaio_io_acquire() or similar. "get" also feels very innocent, even
though it may wait for previous IO to finish. Especially when
pgaio_io_get_ref() actually is innocent.
typedef enum PgAioHandleState
{
/* not in use */
AHS_IDLE = 0,

/* returned by pgaio_io_get() */
AHS_HANDED_OUT,

/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
AHS_DEFINED,

/* subject's prepare() callback has been called */
AHS_PREPARED,

/* IO has been submitted and is being executed */
AHS_IN_FLIGHT,

/* IO finished, but result has not yet been processed */
AHS_REAPED,

/* IO completed, shared completion has been called */
AHS_COMPLETED_SHARED,

/* IO completed, local completion has been called */
AHS_COMPLETED_LOCAL,
} PgAioHandleState;
Do we need to distinguish between DEFINED and PREPARED? At quick glance,
those states are treated the same. (The comment refers to
pgaio_io_start_*() functions, but there's no such thing)
I didn't quite understand the point of the prepare callbacks. For
example, when AsyncReadBuffers() calls smgrstartreadv(), the
shared_buffer_readv_prepare() callback will be called. Why doesn't
AsyncReadBuffers() do the "prepare" work itself directly; why does it
need to be in a callback? I assume it's somehow related to error
handling, but I didn't quite get it. Perhaps an "abort" callback that'd
be called on error, instead of a "prepare" callback, would be better?
There are some synonyms used in the code: I think "in-flight" and
"submitted" mean the same thing. And "prepared" and "staged". I'd
suggest picking just one term for each concept.
I didn't understand the COMPLETED_SHARED and COMPLETED_LOCAL states.
Does a single IO go through both states, or are they mutually exclusive?
At quick glance, I don't actually see any code that would set the
COMPLETED_LOCAL state; is it dead code?
REAPED feels like a bad name. It sounds like a later stage than
COMPLETED, but it's actually vice versa.
I'm a little surprised that the term "IO request" isn't used anywhere. I
have no concrete suggestion, but perhaps that would be a useful term.
- Retries for partial IOs (i.e. short reads) are now implemented. Turned out
to take all of three lines and adding one missing variable initialization.
:-)
- There's no obvious way to tell "internal" function operating on an IO handle
apart from functions that are expected to be called by the issuer of an IO.
One way to deal with this would be to introduce a distinct "issuer IO
reference" type. I think that might be a good idea, it would also make it
clearer that a good number of the functions can only be called by the
issuer, before the IO is submitted.
This would also make it easier to order functions more sensibly in aio.c, as
all the issuer functions would be together.
The functions on AIO handles that everyone can call already have a distinct
type (PgAioHandleRef vs PgAioHandle*).
Hmm, yeah I think you might be onto something here.
Could pgaio_io_get() return an PgAioHandleRef directly, so that the
issuer would never see a raw PgAioHandle ?
Finally, attached are a couple of typos and other trivial suggestions.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
aio-typos.patch (text/x-patch)
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
index 0076ea4aa10..db3257c2705 100644
--- a/src/backend/storage/aio/README.md
+++ b/src/backend/storage/aio/README.md
@@ -15,7 +15,7 @@ In this example, a buffer will be read into shared buffers.
PgAioReturn ioret;
/*
- * Acquire AIO Handle, ioret will get result upon completion.
+ * Acquire an AIO Handle, ioret will get the result upon completion.
*/
PgAioHandle *ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
@@ -46,15 +46,15 @@ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
/*
- * Hand AIO handle to lower-level function. When operating on the level of
+ * Pass the AIO handle to lower-level function. When operating on the level of
* buffers, we don't know how exactly the IO is performed, that is the
* responsibility of the storage manager implementation.
*
* E.g. md.c needs to translate block numbers into offsets in segments.
*
- * Once the IO handle has been handed of, it may not further be used, as the
- * IO may immediately get executed below smgrstartreadv() and the handle reused
- * for another IO.
+ * Once the IO handle has been handed off to smgrstartreadv(), it may not
+ * further be used, as the IO may immediately get executed in smgrstartreadv()
+ * and the handle reused for another IO.
*/
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
BufferGetBlock(buffer), 1);
@@ -167,7 +167,7 @@ The main reason *not* to use Direct IO are:
explicit prefetching.
- In situations where shared_buffers cannot be set appropriately large,
e.g. because there are many different postgres instances hosted on shared
- hardware, performance will often be worse then when using buffered IO.
+ hardware, performance will often be worse than when using buffered IO.
### Deadlock and Starvation Dangers due to AIO
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 261a752fb80..1cef6ef556b 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -123,10 +123,10 @@ static PgAioHandle *inj_cur_handle;
*
* If a handle was acquired but then does not turn out to be needed,
* e.g. because pgaio_io_get() is called before starting an IO in a critical
- * section, the handle needs to be be released with pgaio_io_release().
+ * section, the handle needs to be released with pgaio_io_release().
*
*
- * To react to the completion of the IO as soon as it is know to have
+ * To react to the completion of the IO as soon as it is known to have
* completed, callbacks can be registered with pgaio_io_add_shared_cb().
*
* To actually execute IO using the returned handle, the pgaio_io_prep_*()
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
index 3c255775833..9e111c04b7e 100644
--- a/src/backend/storage/aio/aio_io.c
+++ b/src/backend/storage/aio/aio_io.c
@@ -31,7 +31,7 @@ static void pgaio_io_before_prep(PgAioHandle *ioh);
/* --------------------------------------------------------------------------------
* "Preparation" routines for individual IO types
*
- * These are called by place the place actually initiating an IO, to associate
+ * These are called by XXX place the place actually initiating an IO, to associate
* the IO specific data with an AIO handle.
*
* Each of the preparation routines first needs to call
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index f4c57438dd4..7a81e211d48 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -38,12 +38,13 @@ typedef enum PgAioHandleState
AHS_HANDED_OUT,
/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+ /* XXX: there are no pgaio_io_start_*() functions */
AHS_DEFINED,
- /* subjects prepare() callback has been called */
+ /* subject's prepare() callback has been called */
AHS_PREPARED,
- /* IO is being executed */
+ /* IO has been submitted and is being executed */
AHS_IN_FLIGHT,
/* IO finished, but result has not yet been processed */
On LWLockDisown():
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * After calling this function it's the callers responsibility to ensure that
+ * the lock gets released, even in case of an error. This only is desirable if
+ * the lock is going to be released in a different process than the process
+ * that acquired it.
+ *
+ * Returns the mode in which the lock was held by the current backend.
Returning the lock mode feels a bit ad hoc..
+ * NB: This will leave lock->owner pointing to the current backend (if
+ * LOCK_DEBUG is set). We could add a separate flag indicating that, but it
+ * doesn't really seem worth it.
Hmm. I won't insist, but I feel it probably would be worth it. This is
only in LOCK_DEBUG mode so there's no performance penalty in non-debug
builds, and when you do compile with LOCK_DEBUG you probably appreciate
any extra information.
+ * NB: This does not call RESUME_INTERRUPTS(), but leaves that responsibility
+ * of the caller.
+ */
That feels weird. The only caller outside lwlock.c does call
RESUME_INTERRUPTS() immediately.
Perhaps it'd make for a better external interface if LWLockDisown() did
call RESUME_INTERRUPTS(), and there was a separate internal version that
didn't. And it might make more sense for the external version to return
'void' while we're at it. Returning a value that the caller ignores is
harmless, of course, but it feels a bit weird. It makes you wonder what
you're supposed to do with it.
+	{
+		{"io_method", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Selects the method of asynchronous I/O to use."),
+			NULL
+		},
+		&io_method,
+		DEFAULT_IO_METHOD, io_method_options,
+		NULL, assign_io_method, NULL
+	},
+
The description is a bit funny because synchronous I/O is one of the
possible methods.
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,
On 2025-01-07 17:09:58 +0200, Heikki Linnakangas wrote:
On 01/01/2025 06:03, Andres Freund wrote:
Hi,
Attached is a new version of the AIO patchset.
I haven't gone through it all yet, but some comments below.
Thanks!
The biggest changes are:
- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. It'd be very good to get
feedback from folks that aren't as familiar with AIO, I can't really see
what's easy/hard anymore.
Thanks, the README is super helpful! I was overwhelmed by all the new
concepts before, now it all makes much more sense.
Now that it's all laid out more clearly, I see how many different concepts
and states there really are:
- For a single IO, there is an "IO handle", "IO references", and an "IO
return". You first allocate an IO handle (PgAioHandle), and then you get a
reference (PgAioHandleRef) and an "IO return" (PgAioReturn) struct for it.
- An IO handle has eight different states (PgAioHandleState).
I'm sure all those concepts exist for a reason. But still I wonder: can we
simplify?
Probably, but it's not exactly obvious to me where.
The difference between a handle and a reference is useful right now, to have
some separation between the functions that can be called by anyone (taking a
PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
solved by having a PgAioHandleIssuerRef ref or something.
Having PgAioReturn be separate from the AIO handle turns out to be rather
crucial, otherwise it's very hard to guarantee "forward progress",
i.e. guarantee that pgaio_io_get() will return something without blocking
forever.
pgaio_io_get() and pgaio_io_release() are a bit asymmetric, I'd suggest
pgaio_io_acquire() or similar. "get" also feels very innocent, even though
it may wait for previous IO to finish. Especially when pgaio_io_get_ref()
actually is innocent.
WFM.
typedef enum PgAioHandleState
{
/* not in use */
AHS_IDLE = 0,

/* returned by pgaio_io_get() */
AHS_HANDED_OUT,

/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
AHS_DEFINED,

/* subject's prepare() callback has been called */
AHS_PREPARED,

/* IO has been submitted and is being executed */
AHS_IN_FLIGHT,

/* IO finished, but result has not yet been processed */
AHS_REAPED,

/* IO completed, shared completion has been called */
AHS_COMPLETED_SHARED,

/* IO completed, local completion has been called */
AHS_COMPLETED_LOCAL,
} PgAioHandleState;
Do we need to distinguish between DEFINED and PREPARED?
I found it to be rather confusing if it's not possible to tell if some action
(like the prepare callback) has already happened, or not. It's useful to be
able look at an IO in a backtrace or such and see exactly in what state it is
in.
In v1 I had several of the above states managed as separate boolean variables
- that turned out to be a huge mess, it's a lot easier to understand if
there's a single strictly monotonically increasing state.
At quick glance, those states are treated the same. (The comment refers to
pgaio_io_start_*() functions, but there's no such thing)
They're called pgaio_io_prep_{readv,writev} now, updated the comment.
I didn't quite understand the point of the prepare callbacks. For example,
when AsyncReadBuffers() calls smgrstartreadv(), the
shared_buffer_readv_prepare() callback will be called. Why doesn't
AsyncReadBuffers() do the "prepare" work itself directly; why does it need
to be in a callback?
One big part of it is "ownership" - while the IO isn't completely "assembled",
we can release all buffer pins etc in case of an error. But if the error
happens just after the IO was staged, we can't - the buffer is still
referenced by the IO. For that the AIO subsystem needs to take its own pins
etc. Initially the prepare callback didn't exist, the code in
AsyncReadBuffers() was a lot more complicated before it.
I assume it's somehow related to error handling, but I didn't quite get
it. Perhaps an "abort" callback that'd be called on error, instead of a
"prepare" callback, would be better?
I don't think an error callback would be helpful - the whole thing is that we
basically need to claim ownership of all IO related resources IFF the IO is
staged. Not before (because then the IO not getting staged would mean we have
a resource leak), not after (because we might error out and thus not keep
e.g. buffers pinned).
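The rule - the AIO subsystem claims the resources if and only if the IO got staged - can be modelled with a tiny refcount sketch (purely illustrative names; this is not bufmgr.c/aio.c code):

```c
#include <assert.h>

/* Toy model of a pinned buffer and an IO that may reference it. */
typedef struct SketchBuffer
{
    int refcount;               /* issuer's pin plus any IO references */
} SketchBuffer;

typedef struct SketchIo
{
    SketchBuffer *buf;
    int staged;
} SketchIo;

/* Staging is the point where the AIO subsystem takes its own reference. */
void
sketch_stage_io(SketchIo *io, SketchBuffer *buf)
{
    io->buf = buf;
    buf->refcount++;
    io->staged = 1;
}

/*
 * Issuer errors out and releases its own pin. Returns the references that
 * remain: non-zero iff the IO was staged, so there is neither a leak (error
 * before staging) nor a dangling buffer (error after staging).
 */
int
sketch_issuer_error(SketchBuffer *buf)
{
    buf->refcount--;
    return buf->refcount;
}
```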
There are some synonyms used in the code: I think "in-flight" and
"submitted" mean the same thing.
Fair. I guess in my mind the process of moving an IO into flight is
"submitting" and the state of having been submitted but not yet having
completed is being in flight. But that's probably not useful.
And "prepared" and "staged". I'd suggest picking just one term for each
concept.
Agreed.
I didn't understand the COMPLETED_SHARED and COMPLETED_LOCAL states. Does a
single IO go through both states, or are they mutually exclusive? At quick
glance, I don't actually see any code that would set the COMPLETED_LOCAL
state; is it dead code?
It's dead code right now. I've made it dead and undead a couple times
:/. Unfortunately I think I need to revive it to make some corner cases with
temporary tables work (AIO for temp table is executed via IO uring, another
backend waits for *another* IO executed via that IO uring instance and reaps
the completion -> we can't update the local buffer state in the shared
completion callback).
REAPED feels like a bad name. It sounds like a later stage than COMPLETED,
but it's actually vice versa.
What would you call having gotten "completion notifications" from the kernel,
but not having processed them?
- There's no obvious way to tell "internal" functions operating on an IO handle
apart from functions that are expected to be called by the issuer of an IO.

One way to deal with this would be to introduce a distinct "issuer IO
reference" type. I think that might be a good idea, it would also make it
clearer that a good number of the functions can only be called by the
issuer, before the IO is submitted.

This would also make it easier to order functions more sensibly in aio.c, as
all the issuer functions would be together.

The functions on AIO handles that everyone can call already have a distinct
type (PgAioHandleRef vs PgAioHandle*).

Hmm, yeah I think you might be onto something here.
I'll give it a try.
Could pgaio_io_get() return a PgAioHandleRef directly, so that the issuer
would never see a raw PgAioHandle ?
Don't think that would be helpful - that way there'd be no difference at all
anymore between what functions any backend can call and what the issuer can
do.
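A hypothetical sketch of such a type-level split (all names invented for this example; not what the patchset defines):

```c
#include <assert.h>

/* The shared-memory handle itself. */
typedef struct SketchAioHandle
{
    int generation;     /* bumped whenever the handle is recycled */
    int data;
} SketchAioHandle;

/* What any backend may hold: validated against the generation. */
typedef struct SketchAioHandleRef
{
    SketchAioHandle *ioh;
    int generation;
} SketchAioHandleRef;

/* What only the issuer gets: grants access to issuer-only operations. */
typedef struct SketchAioIssuerRef
{
    SketchAioHandle *ioh;
} SketchAioIssuerRef;

/* Issuer-only: the parameter type alone documents who may call this. */
void
sketch_set_data(SketchAioIssuerRef iref, int data)
{
    iref.ioh->data = data;
}

/* Callable by anyone holding a ref, even after the handle was recycled. */
int
sketch_ref_still_valid(SketchAioHandleRef ref)
{
    return ref.ioh->generation == ref.generation;
}
```

The compiler then rejects passing an ordinary ref to an issuer-only function, which is the separation being discussed.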
Finally, attached are a couple of typos and other trivial suggestions.
Integrating...
Thanks!
Andres
Hi,
On 2025-01-07 18:08:51 +0200, Heikki Linnakangas wrote:
On LWLockDisown():
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * After calling this function it's the caller's responsibility to ensure that
+ * the lock gets released, even in case of an error. This only is desirable if
+ * the lock is going to be released in a different process than the process
+ * that acquired it.
+ *
+ * Returns the mode in which the lock was held by the current backend.

Returning the lock mode feels a bit ad hoc.
It seemed useful to me, that way callers could verify that the released lock
level is actually what it expected. What do we gain by hiding this information
anyway?
Orthogonal: I think it was a mistake that LWLockRelease() didn't require the
to-be-released lock mode to be passed in...
+ * NB: This will leave lock->owner pointing to the current backend (if
+ * LOCK_DEBUG is set). We could add a separate flag indicating that, but it
+ * doesn't really seem worth it.

Hmm. I won't insist, but I feel it probably would be worth it. This is only
in LOCK_DEBUG mode so there's no performance penalty in non-debug builds,
and when you do compile with LOCK_DEBUG you probably appreciate any extra
information.
I actually thought it'd be more useful if it stays pointing to the 'original
owner'.
When you say "it" would be worth it, you mean resetting owner, or adding a
flag indicating that it's a disowned lock?
+ * NB: This does not call RESUME_INTERRUPTS(), but leaves that responsibility
+ * to the caller.
+ */

That feels weird. The only caller outside lwlock.c does call
RESUME_INTERRUPTS() immediately.
Yea, I didn't feel happy with it either. It just seemed that the cure (a
separate function, or a parameter indicating whether interrupts should be
resumed) was as bad as the disease.
Perhaps it'd make for a better external interface if LWLockDisown() did call
RESUME_INTERRUPTS(), and there was a separate internal version that didn't.
Hm, that seems more complicated than it's worth. I'd either leave it as-is,
or add a parameter to LWLockDisown to indicate if interrupts should be
resumed.
And it might make more sense for the external version to return 'void' while
we're at it. Returning a value that the caller ignores is harmless, of
course, but it feels a bit weird. It makes you wonder what you're supposed
to do with it.
This one I disagree with, I think it makes a lot of sense to return the lock
mode of the lock you just disowned.
Doubtful it matters, but the compiler can trivially optimize that out for the
lwlock.c callers.
+ {
+     {"io_method", PGC_POSTMASTER, RESOURCES_MEM,
+         gettext_noop("Selects the method of asynchronous I/O to use."),
+         NULL
+     },
+     &io_method,
+     DEFAULT_IO_METHOD, io_method_options,
+     NULL, assign_io_method, NULL
+ },
+

The description is a bit funny because synchronous I/O is one of the
possible methods.
Hah. How about:
"Selects the method of, potentially asynchronous, IO execution."?
Greetings,
Andres Freund
On Mon, Jan 06, 2025 at 04:40:26PM -0500, Andres Freund wrote:
On 2025-01-06 10:52:20 -0800, Noah Misch wrote:
On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:
- We have pretty much no testing for IO errors.
Yes, that's remained a gap. I've wondered how much to address this via
targeted tests of specific sites vs. fuzzing, iterative fault injection, or
some other approach closer to brute force.
I'd be most interested in the
cases that would be undetected deadlocks under a naive design. An example
appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com

That's a good one, yea.

I think I'll try to translate the regression tests I wrote into an isolation
test, I hope that'll make it a bit easier to cover more cases.

And then we'll need more injection points, I'm afraid :(.
Sounds good.
* - method_*.c - different ways of executing AIO (e.g. worker process)
* - aio_io.c - method-independent code for specific IO ops (e.g. readv)
* - aio_subject.c - callbacks at IO operation lifecycle events
* - aio_init.c - per-fork and per-startup-process initialization

I don't particularly like "per-startup-process", because "global
initialization" really is separate from (and precedes) startup process
startup. Maybe "per-server and per-backend initialization"?
That works for me. I wrote "per-startup-process" because it can happen more
than once in a postmaster that reaches "all server processes terminated;
reinitializing". That said, there's little risk of "per-server" giving folks
a materially wrong idea.
* - aio.c - all other topics
* - read_stream.c - helper for reading buffered relation data

Did the order you listed the files have a system to it? If so, what is it?
The rough idea was to avoid forward references:
* - method_*.c - different ways of executing AIO (e.g. worker process)
makes sense without other background
* - aio_io.c - method-independent code for specific IO ops (e.g. readv)
refers to methods, so listed after methods
* - aio_subject.c - callbacks at IO operation lifecycle events
refers to IO ops, so listed after aio_io.c
* - aio_init.c - per-fork and per-startup-process initialization
no surprise that this code will exist somewhere, so list it lower to deemphasize it
* - aio.c - all other topics
default route, hence last
* - read_stream.c - helper for reading buffered relation data
could just as easily come first, not last
could be under a distinct heading like "Recommended abstractions:"
I'd benefit from seeing things in this order:
- "why"
- condensed usage example like manpage SYNOPSIS, comments and decls removed
- PgAioHandleState and discussion of valid transitions

Hm - why have PgAioHandleState and its states before the usage example? Seems
like it'd be harder to understand that way.
I usually look at the data structures before the code that manipulates them.
(Similarly, I look at the map before the directions.) I wouldn't mind it
appearing after the usage example, since order preferences do vary.
- usage example as it is, with full comments
- the rest

## Synopsis
ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
pgaio_io_get_ref(ioh, &ior);
pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
BufferGetBlock(buffer), 1);
pgaio_submit_staged();
pgaio_io_ref_wait(&ior);
if (ioret.result.status == ARS_ERROR)
    pgaio_result_log(ioret.result, &ioret.subject_data, ERROR);

Happy to add this, but I'm not entirely sure if that's really that useful to
have without commentary? The synopsis in manpages is helpful because it
provides the signature of various functions, but this wouldn't...
I'm not sure either. Let's drop that idea.
+### IO can be started in critical sections
...
+The need to be able to execute IO in critical sections has substantial design
+implications on the AIO subsystem. Mainly because completing IOs (see prior
+section) needs to be possible within a critical section, even if the
+to-be-completed IO itself was not issued in a critical section. Consider
+e.g. the case of a backend first starting a number of writes from shared
+buffers and then starting to flush the WAL. Because only a limited amount of
+IO can be in-progress at the same time, initiating the IO for flushing the WAL
+may require to first finish executing IO executed earlier.

The last line's two appearances of the word "execute" read awkwardly to me,
and it's an opportunity to use PgAioHandleState terms. Consider writing the
last line like "may first advance an existing IO from AHS_PREPARED to
AHS_COMPLETED_SHARED".

It is indeed awkward. I don't love referencing the state-constants here
though, somehow that feels like a reference-cycle ;). What about this:

... Consider
e.g. the case of a backend first starting a number of writes from shared
buffers and then starting to flush the WAL. Because only a limited amount of
IO can be in-progress at the same time, initiating IO for flushing the WAL may
require to first complete IO that was started earlier.
That's non-awkward. I like specific state names here since "complete" could
mean AHS_COMPLETED_SHARED or AHS_COMPLETED_LOCAL, and it matters here. If the
state names changed so AHS_COMPLETED_LOCAL dropped the word "complete", that
too would solve it.
+### AIO Callbacks
...
+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
+IO to/from shared buffers, which is required to handle the case where the
+issuing backend errors out and releases its own pins.

Reading this, it's not obvious to me how to reconcile "finishing an IO could
require pin acquisition" with "finishing an IO could happen in a critical
section". Pinning a buffer in a critical section sounds bad. I vaguely
recall understanding how it was okay as of my September review, but I've
already forgotten. Can this text have a sentence making that explicit?

Ah, yes, that's easy to misunderstand. The answer basically is that we don't
newly pin a buffer, we just increment the reference count by 1. That should
never fail.

How about:
In addition to completion, AIO callbacks also are called to "prepare" an
IO. This is, e.g., used to increase buffer reference counts to account for the
AIO subsystem referencing the buffer, which is required to handle the case
where the issuing backend errors out and releases its own pins while the IO is
still ongoing.
Perfect.
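A toy model of why that is safe in a critical section (illustrative, not bufmgr.c): the buffer is already pinned by the issuer, so accounting for the AIO subsystem's reference is a plain increment that cannot fail, unlike acquiring a fresh pin.

```c
#include <assert.h>

typedef struct SketchBufDesc
{
    unsigned refcount;
} SketchBufDesc;

/*
 * Called from the prepare step: the issuer already holds a pin, so this
 * cannot allocate, look anything up, or error out - it just bumps the
 * existing reference count on behalf of the AIO subsystem.
 */
void
sketch_aio_account_pin(SketchBufDesc *buf)
{
    assert(buf->refcount > 0);  /* a pin must already exist */
    buf->refcount++;
}
```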
+### AIO Subjects
+
+In addition to the completion callbacks described above, each AIO Handle has
+exactly one "subject". Each subject has some space inside an AIO Handle with
+information specific to the subject and can provide callbacks to allow to
+reopen the underlying file (required for worker mode) and to describe the IO
+operation (used for debug logging and error messages).

Can this say roughly how to decide when to add a new subject?

Hm, there obviously is some fuzziness. I was trying to get to some of that by
mentioning that the subject needs to know how to [re-]open a file and describe
the target of the IO in terms that make sense to the user.

E.g. smgr seemed to make sense as a subject as the smgr layer knows how to
open a file by delegating that to the layer below and the layer above just
knows about smgr, not md.c (or other potential smgr implementations).

The reason to keep this separate from the callbacks was that smgr IO going
through shared buffers, bypassing shared buffers and different smgr
implementations all could share the same subject implementation, even if
callbacks would differ between these use cases.

How about:
I.e., if two different uses of AIO can describe the identity of the file being
operated on the same way, it likely makes sense to use the same
subject. E.g. different smgr implementations can describe IO with
RelFileLocator, ForkNumber and BlockNumber and can thus share a subject. In
contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
and it would not make sense to use the same subject for smgr and WAL.
Sounds good to include.
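A hypothetical sketch of that rule as data (type and field names invented for this example; the WAL subject in particular is only a thought experiment here):

```c
#include <assert.h>

/* One subject per way of identifying the file being operated on. */
typedef enum SketchAioSubject
{
    SUBJ_SMGR,      /* shared by all smgr implementations and smgr bypass IO */
    SUBJ_WAL,       /* hypothetical: WAL identifies IO differently */
} SketchAioSubject;

typedef struct SketchSubjectData
{
    SketchAioSubject subject;
    union
    {
        struct
        {
            unsigned relnumber;     /* stand-in for RelFileLocator */
            int forknum;
            unsigned blocknum;
        } smgr;
        struct
        {
            unsigned tli;           /* TimeLineID */
            unsigned long long lsn; /* XLogRecPtr */
        } wal;
    } u;
} SketchSubjectData;

/* e.g. a describe/reopen callback would dispatch on the subject tag */
unsigned
sketch_subject_block(const SketchSubjectData *sd)
{
    return sd->subject == SUBJ_SMGR ? sd->u.smgr.blocknum : 0;
}
```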
Can this have a sentence on how this fits in bounded shmem, given the lack of
guarantees about a backend's responsiveness? In other words, what makes it
okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
to AHS_COMPLETED_LOCAL?

I agree this should be explained somewhere - but not sure this is the best
place.

The reason it's ok is that each backend has a limited number of AIO handles
and if it runs out of IO handles we'll a) check if any IOs can be reclaimed b)
wait for the oldest IO to finish.
Reading it again today, that topic may already have adequate coverage.
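The forward-progress guarantee could be sketched as follows (toy model; in the real system the final step waits for the oldest in-flight IO rather than failing):

```c
#include <assert.h>

#define SKETCH_NHANDLES 4

typedef enum { H_IDLE, H_IN_FLIGHT, H_COMPLETED } SketchHandleState;

SketchHandleState sketch_handles[SKETCH_NHANDLES];

/*
 * Acquire a handle: prefer an idle one, else reclaim a completed one.
 * Returns the handle index, or -1 when every handle is in flight (where
 * the real code would block on the oldest IO instead of giving up).
 */
int
sketch_acquire_handle(void)
{
    for (int i = 0; i < SKETCH_NHANDLES; i++)
        if (sketch_handles[i] == H_IDLE)
            return i;

    for (int i = 0; i < SKETCH_NHANDLES; i++)
        if (sketch_handles[i] == H_COMPLETED)
        {
            sketch_handles[i] = H_IDLE;     /* reclaim */
            return i;
        }

    return -1;
}
```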
On Tue, Jan 7, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
The difference between a handle and a reference is useful right now, to have
some separation between the functions that can be called by anyone (taking a
PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
solved by having a PgAioHandleIssuerRef ref or something.
To me, those names don't convey that. I would perhaps call the thing
that supports issuer-only operations a "PgAio" and the thing other
people can use a "PgAioHandle". Or "PgAioRequest" and "PgAioHandle" or
something like that. With PgAioHandleRef, IMHO you've got two words
that both imply a layer of indirection -- "handle" and "ref" -- which
doesn't seem quite as nice, because then the other thing --
"PgAioHandle" still sort of implies
one layer of indirection and the whole thing seems a bit less clear.
(I say all of this having looked at nothing, so feel free to ignore me
if that doesn't sound coherent.)
REAPED feels like a bad name. It sounds like a later stage than COMPLETED,
but it's actually vice versa.

What would you call having gotten "completion notifications" from the kernel,
but not having processed them?
The Linux kernel calls those zombie processes, so we could call it a
ZOMBIE state, but that seems like it might be a bit of inside
baseball. I do agree with Heikki that REAPED sounds later than
COMPLETED, because you reap zombie processes by collecting their exit
status. Maybe you could have AHS_COMPLETE or AHS_IO_COMPLETE for the
state where the I/O is done but there's still completion-related work
to be done, and then the other state could be AHS_DONE or AHS_FINISHED
or AHS_FINAL or AHS_REAPED or something.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 07/01/2025 18:11, Andres Freund wrote:
The difference between a handle and a reference is useful right now, to have
some separation between the functions that can be called by anyone (taking a
PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
solved by having a PgAioHandleIssuerRef ref or something.

Having PgAioReturn be separate from the AIO handle turns out to be rather
crucial, otherwise it's very hard to guarantee "forward progress",
i.e. guarantee that pgaio_io_get() will return something without blocking
forever.
Right, yeah, I can see that.
typedef enum PgAioHandleState
{
/* not in use */
AHS_IDLE = 0,

/* returned by pgaio_io_get() */
AHS_HANDED_OUT,

/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
AHS_DEFINED,

/* subject's prepare() callback has been called */
AHS_PREPARED,

/* IO has been submitted and is being executed */
AHS_IN_FLIGHT,

/* IO finished, but result has not yet been processed */
AHS_REAPED,

/* IO completed, shared completion has been called */
AHS_COMPLETED_SHARED,

/* IO completed, local completion has been called */
AHS_COMPLETED_LOCAL,
} PgAioHandleState;

Do we need to distinguish between DEFINED and PREPARED?
I found it to be rather confusing if it's not possible to tell if some action
(like the prepare callback) has already happened, or not. It's useful to be
able to look at an IO in a backtrace or such and see exactly what state it is
in.
I see.
In v1 I had several of the above states managed as separate boolean variables
- that turned out to be a huge mess, it's a lot easier to understand if
there's a single strictly monotonically increasing state.
Agreed on that
I didn't quite understand the point of the prepare callbacks. For example,
when AsyncReadBuffers() calls smgrstartreadv(), the
shared_buffer_readv_prepare() callback will be called. Why doesn't
AsyncReadBuffers() do the "prepare" work itself directly; why does it need
to be in a callback?

One big part of it is "ownership" - while the IO isn't completely "assembled",
we can release all buffer pins etc in case of an error. But if the error
happens just after the IO was staged, we can't - the buffer is still
referenced by the IO. For that the AIO subsystem needs to take its own pins
etc. Initially the prepare callback didn't exist, the code in
AsyncReadBuffers() was a lot more complicated before it.

I assume it's somehow related to error handling, but I didn't quite get
it. Perhaps an "abort" callback that'd be called on error, instead of a
"prepare" callback, would be better?

I don't think an error callback would be helpful - the whole thing is that we
basically need to claim ownership of all IO related resources IFF the IO is
staged. Not before (because then the IO not getting staged would mean we have
a resource leak), not after (because we might error out and thus not keep
e.g. buffers pinned).
Hmm. The comments say that when you call smgrstartreadv(), the IO handle
may no longer be modified, as the IO may be executed immediately. What
if we changed that so that it never submits the IO, only adds the
necessary callbacks to it?
In that world, when smgrstartreadv() returns, the necessary details and
completion callbacks have been set in the IO handle, but the caller can
still do more preparation before the IO is submitted. The caller must
ensure that it gets submitted, however, so no erroring out in that state.
Currently the call stack looks like this:
AsyncReadBuffers()
-> smgrstartreadv()
-> mdstartreadv()
-> FileStartReadV()
-> pgaio_io_prep_readv()
-> shared_buffer_readv_prepare() (callback)
<- (return)
<- (return)
<- (return)
<- (return)
<- (return)
I'm thinking that the prepare work is done "on the way up" instead:
AsyncReadBuffers()
-> smgrstartreadv()
-> mdstartreadv()
-> FileStartReadV()
-> pgaio_io_prep_readv()
<- (return)
<- (return)
<- (return)
-> shared_buffer_readv_prepare()
<- (return)
Attached is a patch to demonstrate concretely what I mean.
This adds a new pgaio_io_stage() step to the issuer, and the issuer
needs to call the prepare functions explicitly, instead of having them
as callbacks. Nominally that's more steps, but IMHO it's better to be
explicit. The same actions were happening previously too, it was just
hidden in the callback. I updated the README to show that too.
I'm not wedded to this, but it feels a little better to me.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
aio-remove-prepare-callback.patch (text/x-patch)
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
index 0076ea4aa10..25b5f5d9529 100644
--- a/src/backend/storage/aio/README.md
+++ b/src/backend/storage/aio/README.md
@@ -60,7 +60,18 @@ smgrstartreadv(ioh, operation->smgr, forknum, blkno,
BufferGetBlock(buffer), 1);
/*
- * As mentioned above, the IO might be initiated within smgrstartreadv(). That
+ * After smgrstartreadv() has returned, we are committed to performing the IO.
+ * We may do more preparation or add more callbacks to the IO, but must
+ * *not* error out before calling pgaio_io_stage(). We don't have any such
+ * preparation to do here, so just call pgaio_io_stage() to indicate that we
+ * have completed building the IO request. It usually queues up the request
+ * for batching, but may submit it immediately if the batch is full or if
+ * the request needed to be processed synchronously.
+ */
+pgaio_io_stage(ioh);
+
+/*
+ * The IO might already have been initiated by pgaio_io_stage(). That
* is however not guaranteed, to allow IO submission to be batched.
*
* Note that one needs to be careful while there may be unsubmitted IOs, as
@@ -69,10 +80,6 @@ smgrstartreadv(ioh, operation->smgr, forknum, blkno,
* that, pending IOs need to be explicitly submitted before this backend
* might be blocked by a backend waiting for IO.
*
- * Note that the IO might have immediately been submitted (e.g. due to reaching
- * a limit on the number of unsubmitted IOs) and even completed during the
- * smgrstartreadv() above.
- *
* Once submitted, the IO is in-flight and can complete at any time.
*/
pgaio_submit_staged();
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 261a752fb80..ed03fe03609 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -110,7 +110,7 @@ static PgAioHandle *inj_cur_handle;
* Acquire an AioHandle, waiting for IO completion if necessary.
*
* Each backend can only have one AIO handle that that has been "handed out"
- * to code, but not yet submitted or released. This restriction is necessary
+ * to code, but not yet staged or released. This restriction is necessary
* to ensure that it is possible for code to wait for an unused handle by
* waiting for in-flight IO to complete. There is a limited number of handles
* in each backend, if multiple handles could be handed out without being
@@ -249,6 +249,43 @@ pgaio_io_release(PgAioHandle *ioh)
}
}
+/*
+ * Finish building an IO request. Once a request has been staged, there's no
+ * going back; the IO subsystem will attempt to perform the IO. If the IO
+ * succeeds the completion callbacks will be called; on error, the error
+ * callbacks.
+ *
+ * This may add the IO to the current batch, or execute the request
+ * synchronously.
+ */
+void
+pgaio_io_stage(PgAioHandle *ioh)
+{
+ bool needs_synchronous;
+
+ /* allow a new IO to be staged */
+ my_aio->handed_out_io = NULL;
+
+ pgaio_io_update_state(ioh, AHS_PREPARED);
+
+ needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
+
+ elog(DEBUG3, "io:%d: staged %s, executed synchronously: %d",
+ pgaio_io_get_id(ioh), pgaio_io_get_op_name(ioh),
+ needs_synchronous);
+
+ if (!needs_synchronous)
+ {
+ my_aio->staged_ios[my_aio->num_staged_ios++] = ioh;
+ Assert(my_aio->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+ }
+ else
+ {
+ pgaio_io_prepare_submit(ioh);
+ pgaio_io_perform_synchronously(ioh);
+ }
+}
+
/*
* Release IO handle during resource owner cleanup.
*/
@@ -279,7 +316,7 @@ pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
pgaio_io_reclaim(ioh);
break;
- case AHS_DEFINED:
+ case AHS_PREPARING:
case AHS_PREPARED:
/* XXX: Should we warn about this when is_commit? */
pgaio_submit_staged();
@@ -383,7 +420,7 @@ void
pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior)
{
Assert(ioh->state == AHS_HANDED_OUT ||
- ioh->state == AHS_DEFINED ||
+ ioh->state == AHS_PREPARING ||
ioh->state == AHS_PREPARED);
Assert(ioh->generation != 0);
@@ -437,7 +474,7 @@ pgaio_io_ref_wait(PgAioHandleRef *ior)
if (am_owner)
{
- if (state == AHS_DEFINED || state == AHS_PREPARED)
+ if (state == AHS_PREPARING || state == AHS_PREPARED)
{
/* XXX: Arguably this should be prevented by callers? */
pgaio_submit_staged();
@@ -489,8 +526,8 @@ pgaio_io_ref_wait(PgAioHandleRef *ior)
/* fallthrough */
/* waiting for owner to submit */
+ case AHS_PREPARING:
case AHS_PREPARED:
- case AHS_DEFINED:
/* waiting for reaper to complete */
/* fallthrough */
case AHS_REAPED:
@@ -501,8 +538,7 @@ pgaio_io_ref_wait(PgAioHandleRef *ior)
while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
{
- if (state != AHS_REAPED && state != AHS_DEFINED &&
- state != AHS_IN_FLIGHT)
+ if (state != AHS_REAPED && state != AHS_IN_FLIGHT)
break;
ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_COMPLETION);
}
@@ -570,8 +606,8 @@ pgaio_io_get_state_name(PgAioHandle *ioh)
return "idle";
case AHS_HANDED_OUT:
return "handed_out";
- case AHS_DEFINED:
- return "DEFINED";
+ case AHS_PREPARING:
+ return "PREPARING";
case AHS_PREPARED:
return "PREPARED";
case AHS_IN_FLIGHT:
@@ -588,43 +624,18 @@ pgaio_io_get_state_name(PgAioHandle *ioh)
/*
* Internal, should only be called from pgaio_io_prep_*().
+ *
+ * Switches the IO to PREPARING state.
*/
void
-pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+pgaio_io_start_staging(PgAioHandle *ioh)
{
- bool needs_synchronous;
-
Assert(ioh->state == AHS_HANDED_OUT);
Assert(pgaio_io_has_subject(ioh));
- ioh->op = op;
ioh->result = 0;
- pgaio_io_update_state(ioh, AHS_DEFINED);
-
- /* allow a new IO to be staged */
- my_aio->handed_out_io = NULL;
-
- pgaio_io_prepare_subject(ioh);
-
- pgaio_io_update_state(ioh, AHS_PREPARED);
-
- needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
-
- elog(DEBUG3, "io:%d: prepared %s, executed synchronously: %d",
- pgaio_io_get_id(ioh), pgaio_io_get_op_name(ioh),
- needs_synchronous);
-
- if (!needs_synchronous)
- {
- my_aio->staged_ios[my_aio->num_staged_ios++] = ioh;
- Assert(my_aio->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
- }
- else
- {
- pgaio_io_prepare_submit(ioh);
- pgaio_io_perform_synchronously(ioh);
- }
+ pgaio_io_update_state(ioh, AHS_PREPARING);
}
/*
@@ -858,8 +869,8 @@ pgaio_io_wait_for_free(void)
{
/* should not be in in-flight list */
case AHS_IDLE:
- case AHS_DEFINED:
case AHS_HANDED_OUT:
+ case AHS_PREPARING:
case AHS_PREPARED:
case AHS_COMPLETED_LOCAL:
elog(ERROR, "shouldn't get here with io:%d in state %d",
@@ -1004,7 +1015,7 @@ pgaio_bounce_buffer_wait_for_free(void)
case AHS_IDLE:
case AHS_HANDED_OUT:
continue;
- case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARING: /* should have been submitted above */
case AHS_PREPARED:
elog(ERROR, "shouldn't get here with io:%d in state %d",
pgaio_io_get_id(ioh), ioh->state);
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
index 3c255775833..e84b79d3f2e 100644
--- a/src/backend/storage/aio/aio_io.c
+++ b/src/backend/storage/aio/aio_io.c
@@ -46,11 +46,12 @@ pgaio_io_prep_readv(PgAioHandle *ioh,
{
pgaio_io_before_prep(ioh);
+ ioh->op = PGAIO_OP_READV;
ioh->op_data.read.fd = fd;
ioh->op_data.read.offset = offset;
ioh->op_data.read.iov_length = iovcnt;
- pgaio_io_prepare(ioh, PGAIO_OP_READV);
+ pgaio_io_start_staging(ioh);
}
void
@@ -59,11 +60,12 @@ pgaio_io_prep_writev(PgAioHandle *ioh,
{
pgaio_io_before_prep(ioh);
+ ioh->op = PGAIO_OP_WRITEV;
ioh->op_data.write.fd = fd;
ioh->op_data.write.offset = offset;
ioh->op_data.write.iov_length = iovcnt;
- pgaio_io_prepare(ioh, PGAIO_OP_WRITEV);
+ pgaio_io_start_staging(ioh);
}
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index b2bd0c235e7..321e1d8e975 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -119,33 +119,6 @@ pgaio_io_get_subject_name(PgAioHandle *ioh)
return aio_subject_info[ioh->subject]->name;
}
-/*
- * Internal function which invokes ->prepare for all the registered callbacks.
- */
-void
-pgaio_io_prepare_subject(PgAioHandle *ioh)
-{
- Assert(ioh->subject > ASI_INVALID && ioh->subject < ASI_COUNT);
- Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
-
- for (int i = ioh->num_shared_callbacks; i > 0; i--)
- {
- PgAioHandleSharedCallbackID cbid = ioh->shared_callbacks[i - 1];
- const PgAioHandleSharedCallbacksEntry *ce = &aio_shared_cbs[cbid];
-
- if (!ce->cb->prepare)
- continue;
-
- elog(DEBUG3, "io:%d, op %s, subject %s, calling cb #%d %d/%s->prepare",
- pgaio_io_get_id(ioh),
- pgaio_io_get_op_name(ioh),
- pgaio_io_get_subject_name(ioh),
- i,
- cbid, ce->name);
- ce->cb->prepare(ioh);
- }
-}
-
/*
* Internal function which invokes ->complete for all the registered
* callbacks.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9bc0176a2ca..dd30856aca0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -179,6 +179,9 @@ int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
/* local state for LockBufferForCleanup */
static BufferDesc *PinCountWaitBuf = NULL;
+static void local_buffer_readv_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers);
+static void shared_buffer_writev_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers);
+
/*
* Backend-Private refcount management:
*
@@ -1725,7 +1728,6 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
pgaio_io_set_io_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
-
if (persistence == RELPERSISTENCE_TEMP)
pgaio_io_add_shared_cb(ioh, ASC_LOCAL_BUFFER_READ);
else
@@ -1736,6 +1738,11 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
did_start_io_overall = did_start_io_this = true;
smgrstartreadv(ioh, operation->smgr, forknum, io_first_block,
io_pages, io_buffers_len);
+ if (persistence == RELPERSISTENCE_TEMP)
+ local_buffer_readv_prepare(ioh, io_buffers, io_buffers_len);
+ else
+ shared_buffer_readv_prepare(ioh, io_buffers, io_buffers_len);
+ pgaio_io_stage(ioh);
ioh = NULL;
operation->nios++;
@@ -4170,10 +4177,11 @@ WriteBuffers(BuffersToWrite *to_write,
to_write->data_ptrs,
to_write->nbuffers,
false);
+ shared_buffer_writev_prepare(to_write->ioh, to_write->buffers, to_write->nbuffers);
+ pgaio_io_stage(to_write->ioh);
pgstat_count_io_op_n(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
IOOP_WRITE, to_write->nbuffers);
-
for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
{
Buffer cur_buf = to_write->buffers[nbuf];
@@ -6952,20 +6960,16 @@ ReadBufferCompleteWriteShared(Buffer buffer, bool release_lock, bool failed)
* and writes.
*/
static void
-shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write)
+shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write, Buffer *buffers, int nbuffers)
{
- uint64 *io_data;
- uint8 io_data_len;
PgAioHandleRef io_ref;
BufferTag first PG_USED_FOR_ASSERTS_ONLY = {0};
- io_data = pgaio_io_get_io_data(ioh, &io_data_len);
-
pgaio_io_get_ref(ioh, &io_ref);
- for (int i = 0; i < io_data_len; i++)
+ for (int i = 0; i < nbuffers; i++)
{
- Buffer buf = (Buffer) io_data[i];
+ Buffer buf = buffers[i];
BufferDesc *bufHdr;
uint32 buf_state;
@@ -7022,16 +7026,16 @@ shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write)
}
}
-static void
-shared_buffer_readv_prepare(PgAioHandle *ioh)
+void
+shared_buffer_readv_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers)
{
- shared_buffer_prepare_common(ioh, false);
+ shared_buffer_prepare_common(ioh, false, buffers, nbuffers);
}
static void
-shared_buffer_writev_prepare(PgAioHandle *ioh)
+shared_buffer_writev_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers)
{
- shared_buffer_prepare_common(ioh, true);
+ shared_buffer_prepare_common(ioh, true, buffers, nbuffers);
}
static PgAioResult
@@ -7135,19 +7139,15 @@ shared_buffer_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
* and writes.
*/
static void
-local_buffer_readv_prepare(PgAioHandle *ioh)
+local_buffer_readv_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers)
{
- uint64 *io_data;
- uint8 io_data_len;
PgAioHandleRef io_ref;
- io_data = pgaio_io_get_io_data(ioh, &io_data_len);
-
pgaio_io_get_ref(ioh, &io_ref);
- for (int i = 0; i < io_data_len; i++)
+ for (int i = 0; i < nbuffers; i++)
{
- Buffer buf = (Buffer) io_data[i];
+ Buffer buf = buffers[i];
BufferDesc *bufHdr;
uint32 buf_state;
@@ -7199,27 +7199,17 @@ local_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
return result;
}
-static void
-local_buffer_writev_prepare(PgAioHandle *ioh)
-{
- elog(ERROR, "not yet");
-}
-
-
const struct PgAioHandleSharedCallbacks aio_shared_buffer_readv_cb = {
- .prepare = shared_buffer_readv_prepare,
.complete = shared_buffer_readv_complete,
.error = buffer_readv_error,
};
const struct PgAioHandleSharedCallbacks aio_shared_buffer_writev_cb = {
- .prepare = shared_buffer_writev_prepare,
.complete = shared_buffer_writev_complete,
};
const struct PgAioHandleSharedCallbacks aio_local_buffer_readv_cb = {
- .prepare = local_buffer_readv_prepare,
.complete = local_buffer_readv_complete,
.error = buffer_readv_error,
};
const struct PgAioHandleSharedCallbacks aio_local_buffer_writev_cb = {
- .prepare = local_buffer_writev_prepare,
+
};
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d12225a9949..bf4522eeac6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -985,9 +985,9 @@ mdstartreadv(PgAioHandle *ioh,
forknum,
blocknum,
nblocks);
- pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
-
FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+
+ pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
}
/*
@@ -1136,9 +1136,8 @@ mdstartwritev(PgAioHandle *ioh,
forknum,
blocknum,
nblocks);
- pgaio_io_add_shared_cb(ioh, ASC_MD_WRITEV);
-
FileStartWriteV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_WRITEV);
}
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index caa52d2aaba..d126a10f9d4 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -212,12 +212,10 @@ typedef struct PgAioSubjectInfo
typedef PgAioResult (*PgAioHandleSharedCallbackComplete) (PgAioHandle *ioh, PgAioResult prior_result);
-typedef void (*PgAioHandleSharedCallbackPrepare) (PgAioHandle *ioh);
typedef void (*PgAioHandleSharedCallbackError) (PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
typedef struct PgAioHandleSharedCallbacks
{
- PgAioHandleSharedCallbackPrepare prepare;
PgAioHandleSharedCallbackComplete complete;
PgAioHandleSharedCallbackError error;
} PgAioHandleSharedCallbacks;
@@ -247,6 +245,8 @@ struct ResourceOwnerData;
extern PgAioHandle *pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret);
extern PgAioHandle *pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+extern void pgaio_io_stage(PgAioHandle *ioh);
+
extern void pgaio_io_release(PgAioHandle *ioh);
extern void pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error);
@@ -261,7 +261,7 @@ extern void pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len);
extern void pgaio_io_set_io_data_64(PgAioHandle *ioh, uint64 *data, uint8 len);
extern uint64 *pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len);
-extern void pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op);
+extern void pgaio_io_start_staging(PgAioHandle *ioh);
extern int pgaio_io_get_id(PgAioHandle *ioh);
struct iovec;
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index f4c57438dd4..55677d7dc8c 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -37,10 +37,10 @@ typedef enum PgAioHandleState
/* returned by pgaio_io_get() */
AHS_HANDED_OUT,
- /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
- AHS_DEFINED,
+ /* pgaio_io_start_staging() has been called, but IO hasn't been fully staged yet */
+ AHS_PREPARING,
- /* subjects prepare() callback has been called */
+ /* pgaio_io_stage() has been called, but the IO hasn't been submitted yet */
AHS_PREPARED,
/* IO is being executed */
@@ -249,7 +249,6 @@ typedef struct IoMethodOps
extern bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state);
-extern void pgaio_io_prepare_subject(PgAioHandle *ioh);
extern void pgaio_io_process_completion_subject(PgAioHandle *ioh);
extern void pgaio_io_process_completion(PgAioHandle *ioh, int result);
extern void pgaio_io_prepare_submit(PgAioHandle *ioh);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 3523d8a3860..5c7d602d91b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -425,6 +425,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
/* solely to make it easier to write tests */
extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
+extern void shared_buffer_readv_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers);
/* freelist.c */
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
index e495c5309b3..446da4f0231 100644
--- a/src/test/modules/test_aio/test_aio.c
+++ b/src/test/modules/test_aio/test_aio.c
@@ -264,6 +264,8 @@ read_corrupt_rel_block(PG_FUNCTION_ARGS)
smgrstartreadv(ioh, smgr, MAIN_FORKNUM, block,
(void *) &page, 1);
+ shared_buffer_readv_prepare(ioh, &buf, 1);
+ pgaio_io_stage(ioh);
ReleaseBuffer(buf);
pgaio_io_ref_wait(&ior);
On Mon, Jan 6, 2025 at 5:28 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2024-12-19 17:29:12 -0500, Andres Freund wrote:
Not about patch itself, but questions about related stack functionality:
7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints
on how to inspect the real I/O calls being issued, to review whether the code
is making sensible calls? There's no strace for io_uring - do you stick to
DEBUG3, or is using some bpftrace / xfsslower the best way to go?

I think we still want something like it, but I don't think it needs to be in
the initial commits.

After I got this question from Thomas as well, I started hacking one up.
What information would you like to see?
Here's what I currently have:
..
├─[ RECORD 2 ]───┼────────────────────────────────────────────────┤
│ pid │ 358212 │
│ io_id │ 2051 │
│ io_generation │ 4199 │
│ state │ IN_FLIGHT │
│ operation │ read │
│ offset │ 511967232 │
│ length │ 262144 │
│ subject │ smgr │
│ iovec_data_len │ 32 │
│ raw_result │ (null) │
│ result │ UNKNOWN │
│ error_desc │ (null) │
│ subject_desc │ blocks 1373216..1373247 in file "base/5/16388" │
│ flag_sync │ f │
│ flag_localmem │ f │
│ flag_buffered │ t │
Cool! It's more than enough for me in future, thanks!
I didn't think that pg_stat_* was quite the right namespace, given that it
shows not stats, but the currently ongoing IOs. I am going with pg_aios for
now, but I don't particularly like that.
If you are looking for other proposals:
* pg_aios_progress ? (to follow pattern of pg_stat_copy|vacuum_progress?)
* pg_debug_aios ?
* pg_debug_io ?
I think we'll want a pg_stat_aio as well, tracking things like:
- how often the queue to IO workers was full
- how many times we submitted IO to the kernel (<= #ios with io_uring)
- how many times we asked the kernel for events (<= #ios with io_uring)
- how many times we had to wait for in-flight IOs before issuing more IOs
If I could dream of one thing, it would be the 99.9th percentile of IO
response times in milliseconds for different classes of I/O traffic
(read/write/flush). But it sounds like it would be very similar to
pg_stat_io and potentially would have to be
per-tablespace/IO-traffic(subject)-type too. AFAIU pg_stat_io has an
improper structure to have that there.
BTW: before trying to even start to compile that AIO v2.2* and
responding to the previous review, what are you most interested in
hearing about, so that a review adds some value? Any workload-specific
measurements? Just general feedback, or functionality gaps?
Integrity/data testing with stuff like dm-dust, dm-flakey, dm-delay to
try the error handling routines? Some kind of AIO <-> standby/recovery
interactions?
* - btw, Date: 2025-01-01 04:03:33 - I saw what you did there! So
let's officially recognize 2025 as the year of AIO in PG, as it
was the 1st message :D
-J.
Hi,
On 2025-01-08 15:04:39 +0100, Jakub Wartak wrote:
On Mon, Jan 6, 2025 at 5:28 PM Andres Freund <andres@anarazel.de> wrote:
I didn't think that pg_stat_* was quite the right namespace, given that it
shows not stats, but the currently ongoing IOs. I am going with pg_aios for
now, but I don't particularly like that.

If you are looking for other proposals:
* pg_aios_progress ? (to follow pattern of pg_stat_copy|vacuum_progress?)
* pg_debug_aios ?
* pg_debug_io ?
I think pg_aios is better than those, if not by much. Seems others are ok
with that name too. And we easily can evolve it later.
I think we'll want a pg_stat_aio as well, tracking things like:
- how often the queue to IO workers was full
- how many times we submitted IO to the kernel (<= #ios with io_uring)
- how many times we asked the kernel for events (<= #ios with io_uring)
- how many times we had to wait for in-flight IOs before issuing more IOs

If I could dream of one thing, it would be the 99.9th percentile of IO
response times in milliseconds for different classes of I/O traffic
(read/write/flush). But it sounds like it would be very similar to
pg_stat_io and potentially would have to be
per-tablespace/IO-traffic(subject)-type too.
Yea, that's a significant project on its own. It's not that cheap to compute
reasonably accurate percentiles and we have no infrastructure for doing so
right now.
AFAIU pg_stat_io has an improper structure to have that there.
Hm, not obvious to me why? It might make the view a bit wide to add it as an
additional column, but otherwise I don't see a problem?
BTW: before trying to even start to compile that AIO v2.2* and
responding to the previous review, what are you most interested in
hearing about, so that a review adds some value?
Due to the rather limited "users" of AIO in the patchset, I think most
benchmarks aren't expected to show any meaningful gains. However, they
shouldn't show any significant regressions either (when not using direct
IO). I think trying to find regressions would be a rather valuable thing.
I'm tempted to collect a few of the reasonably-ready read stream conversions
into the patchset, to make the potential gains more visible. But I am not sure
it's a good investment of time right now.
One small regression I do know about: scans of large relations that are
bigger than shared buffers but do fit in the kernel page cache. The increase
of BAS_BULKREAD does cause a small slowdown - but without it we can never do
sufficient asynchronous IO. I think the slowdown is small enough to just
accept that, but it's worth quantifying it on a few machines.
Any workload-specific measurements? Just general feedback, functionality
gaps?
To see the benefits it'd be interesting to compare:
1) sequential scan performance with data not in shared buffers, using buffered IO
2) same, but using direct IO when testing the patch
3) checkpoint performance
In my experiments 1) gains a decent amount of performance in many cases, but
nothing overwhelming - sequential scans are easy for the kernel to read ahead.
I do see very significant gains for 2) - On a system with 10 striped NVMe SSDs
that each can do ~3.5 GB/s I measured very parallel sequential scans (I had
to use ALTER TABLE to get sufficient numbers of workers):
master: ~18 GB/s
patch, buffered: ~20 GB/s
patch, direct, worker: ~28 GB/s
patch, direct, uring: ~35 GB/s
This was with io_workers=32, io_max_concurrency=128,
effective_io_concurrency=1000 (doesn't need to be that high, but it's what I
still have the numbers for).
This was without data checksums enabled as otherwise the checksum code becomes
a *huge* bottleneck.
I also see significant gains with 3). Bigger when using direct IO. One
complicating factor measuring 3) is that the first write to a block will often
be slower than subsequent writes because the filesystem will need to update
some journaled metadata, presenting a bottleneck.
Checkpoint performance is also severely limited by data checksum computation
if enabled - independent of this patchset.
One annoying thing when testing DIO is that right now VACUUM will be rather
slow if the data isn't already in s_b, as it isn't yet read-stream-ified.
Integrity/data testing with stuff like dm-dust, dm-flakey, dm-delay
to try the error handling routines?
Hm. I don't think that's going to work very well even on master. If the
filesystem fails there's not much that PG can do...
Some kind of AIO <-> standby/recovery interactions?
I wouldn't expect anything there. I think Thomas somewhere has a patch that
read-stream-ifies recovery prefetching; once that's done it would be more
interesting.
* - btw, Date: 2025-01-01 04:03:33 - I saw what you did there! so
let's officially recognize the 2025 as the year of AIO in PG, as it
was 1st message :D
Hah, that was actually the opposite of what I intended :). I'd hoped to post
earlier, but jetlag had caught up with me...
Greetings,
Andres Freund
Hi,
On 2025-01-07 22:09:56 +0200, Heikki Linnakangas wrote:
On 07/01/2025 18:11, Andres Freund wrote:
I didn't quite understand the point of the prepare callbacks. For example,
when AsyncReadBuffers() calls smgrstartreadv(), the
shared_buffer_readv_prepare() callback will be called. Why doesn't
AsyncReadBuffers() do the "prepare" work itself directly; why does it need
to be in a callback?

One big part of it is "ownership" - while the IO isn't completely "assembled",
we can release all buffer pins etc in case of an error. But if the error
happens just after the IO was staged, we can't - the buffer is still
referenced by the IO. For that the AIO subystem needs to take its own pins
etc. Initially the prepare callback didn't exist, the code in
AsyncReadBuffers() was a lot more complicated before it.

I assume it's somehow related to error handling, but I didn't quite get
it. Perhaps an "abort" callback that'd be called on error, instead of a
"prepare" callback, would be better?

I don't think an error callback would be helpful - the whole thing is that we
basically need to claim ownership of all IO-related resources IFF the IO is
staged. Not before (because then the IO not getting staged would mean we have
a resource leak), not after (because we might error out and thus not keep
e.g. buffers pinned).

Hmm. The comments say that when you call smgrstartreadv(), the IO handle may
no longer be modified, as the IO may be executed immediately. What if we
changed that so that it never submits the IO, only adds the necessary
callbacks to it?
In that world, when smgrstartreadv() returns, the necessary details and
completion callbacks have been set in the IO handle, but the caller can
still do more preparation before the IO is submitted. The caller must ensure
that it gets submitted, however, so no erroring out in that state.

Currently the call stack looks like this:
AsyncReadBuffers()
-> smgrstartreadv()
-> mdstartreadv()
-> FileStartReadV()
-> pgaio_io_prep_readv()
-> shared_buffer_readv_prepare() (callback)
<- (return)
<- (return)
<- (return)
<- (return)
<- (return)

I'm thinking that the prepare work is done "on the way up" instead:
AsyncReadBuffers()
-> smgrstartreadv()
-> mdstartreadv()
-> FileStartReadV()
-> pgaio_io_prep_readv()
<- (return)
<- (return)
<- (return)
-> shared_buffer_readv_prepare()
<- (return)

Attached is a patch to demonstrate concretely what I mean.
I think this would be somewhat limiting. Right now it's indeed just bufmgr.c
that needs to do a preparation (or "moving of ownership") step - but I don't
think it's necessarily going to stay that way.
Consider e.g. a hypothetical threaded future in which we have refcounted file
descriptors. While AIO is ongoing, the AIO subsystem would need to ensure that
the FD refcount is increased, otherwise you'd obviously run into trouble if
the issuing backend errored out and released its own reference as part of
resowner release.
I don't think the approach you suggest above would scale well for such a
situation - shared_buffer_readv_prepare() would again need to call to
smgr->md->fd. Whereas with the current approach md.c (or fd.c?) could just
define its own prepare callback that increased the refcount at the right
moment.
There's a few other scenarios I can think of:
- If somebody were - no idea what made me think of that - to write an smgr
implementation where storage is accessed over the network, one might need to
keep network buffers and sockets alive for the duration of the IO.
- It'd be rather useful to have support for asynchronously extending a
relation, as that often requires filesystem journal IO and thus is slow. If
you're bulk loading, or the extension lock is contended, it'd be great if we
could start the next relation extension *before* it's needed, so the
extension doesn't have to happen synchronously. To avoid deadlocks, such an
asynchronous extension would need to be able to release the lock in any
other backend, just like it's needed for the content locks when
asynchronously writing. Which in turn would require transferring ownership
of the relevant buffers *and* the extension lock. You could mash this
together, but it seems like a separate callback would make it more
composable.
Does that make any sense to you?
This adds a new pgaio_io_stage() step to the issuer, and the issuer needs to
call the prepare functions explicitly, instead of having them as callbacks.
Nominally that's more steps, but IMHO it's better to be explicit. The same
actions were happening previously too, it was just hidden in the callback. I
updated the README to show that too.

I'm not wedded to this, but it feels a little better to me.
Right now the current approach seems to make more sense to me, but I'll think
about it more. I might also have missed something with my theorizing above.
Greetings,
Andres Freund
Hi,
On 2025-01-07 14:59:58 -0500, Robert Haas wrote:
On Tue, Jan 7, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
The difference between a handle and a reference is useful right now, to have
some separation between the functions that can be called by anyone (taking a
PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
solved by having a PgAioHandleIssuerRef ref or something.

To me, those names don't convey that.
I'm certainly not wedded to these names - I went back and forth between
different names a fair bit, because I wasn't quite happy. I am however certain
that the current names are better than what it used to be (PgAioInProgress and
because that's long, a bunch of PgAioIP* names) :)
To make sure we're talking about the same things, I am thinking of the
following "entities" needing names:
1) Shared memory representation of an IO, for the AIO subsystem internally
Currently: PgAioHandle
Because shared memory is limited, we need to reuse this entity. This reuse
needs to be possible "immediately" after completion, to avoid a bunch of
nasty scenarios.
To distinguish a reused PgAioHandle from its "prior" incarnation, each
PgAioHandle has a 64bit "generation" counter.
In addition to being referenceable via pointer, it's also possible to
assign a 32bit integer to each PgAioHandle, as there is a fixed number of
them.
2) A way for the issuer of an IO to reference 1), to attach information to the
IO
Currently: PgAioHandle*
As long as the issuer hasn't yet staged the IO, it can't be
reused. Therefore it's OK to just point to the PgAioHandle.
One disadvantage of just using a pointer to PgAioHandle* is that it's
harder to distinguish subsystem-internal functions that accept PgAioHandle*
from "public" functions that accept the "issuer reference".
3) A way for any backend to wait for a specific IO to complete
Currently: PgAioHandleRef
This references 1) using a 32 bit ID and the 64bit generation.
This is used to allow any backend to wait for a specific IO to
complete. E.g. by including it in the BufferDesc so that WaitIO can wait
for it.
Because it includes the generation it's trivial to detect whether the
PgAioHandle was reused.
I would perhaps call the thing that supports issuer-only operations a
"PgAio" and the thing other people can use a "PgAioHandle". Or
"PgAioRequest" and "PgAioHandle" or something like that. With
PgAioHandleRef, IMHO you've got two words that both imply a layer of
indirection -- "handle" and "ref" -- which doesn't seem quite as nice,
because then the other thing -- "PgAioHandle" still sort of implies one
layer of indirection and the whole thing seems a bit less clear.
It's indirections all the way down. The PG representation of "one IO" in the
end is just an indirection for a kernel operation :)
I would like to split 1) and 2) above.
1) PgAio{Handle,Request,} (a large struct) - used internally by AIO subsystem,
"pointed to" by the following
2) PgAioIssuerRef (an ID or pointer) - used by the issuer to incrementally
define the IO
3) PgAioWaitRef - (an ID and generation) - used to wait for a specific IO to
complete, not affected by reuse of PgAioHandle
REAPED feels like a bad name. It sounds like a later stage than COMPLETED,
but it's actually vice versa.

What would you call having gotten "completion notifications" from the kernel,
but not having processed them?

The Linux kernel calls those zombie processes, so we could call it a ZOMBIE
state, but that seems like it might be a bit of inside baseball.
ZOMBIE feels even later than REAPED to me :)
I do agree with Heikki that REAPED sounds later than COMPLETED, because you
reap zombie processes by collecting their exit status. Maybe you could have
AHS_COMPLETE or AHS_IO_COMPLETE for the state where the I/O is done but
there's still completion-related work to be done, and then the other state
could be AHS_DONE or AHS_FINISHED or AHS_FINAL or AHS_REAPED or something.
How about
AHS_COMPLETE_KERNEL or AHS_COMPLETE_RAW - raw syscall completed
AHS_COMPLETE_SHARED_CB - shared callback completed
AHS_COMPLETE_LOCAL_CB - local callback completed
?
Greetings,
Andres Freund
On Wed, 8 Jan 2025 at 22:58, Andres Freund <andres@anarazel.de> wrote:
master: ~18 GB/s
patch, buffered: ~20 GB/s
patch, direct, worker: ~28 GB/s
patch, direct, uring: ~35 GB/s

This was with io_workers=32, io_max_concurrency=128,
effective_io_concurrency=1000 (doesn't need to be that high, but it's what I
still have the numbers for).

This was without data checksums enabled as otherwise the checksum code becomes
a *huge* bottleneck.
I'm curious about this because the checksum code should be fast enough
to easily handle that throughput. I remember checksum overhead being
negligible even when pulling in pages from page cache. Is it just that
the calculation is slow, or is it the fact that checksumming needs to
bring the page into the CPU cache? Did you notice any hints as to which
might be the case? I don't really have a machine at hand that can do
anywhere close to this amount of I/O.
I'm asking because if it's the calculation that is slow then it seems
like it's time to compile different ISA extension variants of the
checksum code and select the best one at runtime.
--
Ants Aasma
Hi,
On 2025-01-09 10:59:22 +0200, Ants Aasma wrote:
On Wed, 8 Jan 2025 at 22:58, Andres Freund <andres@anarazel.de> wrote:
master: ~18 GB/s
patch, buffered: ~20 GB/s
patch, direct, worker: ~28 GB/s
patch, direct, uring: ~35 GB/s

This was with io_workers=32, io_max_concurrency=128,
effective_io_concurrency=1000 (doesn't need to be that high, but it's what I
still have the numbers for).

This was without data checksums enabled as otherwise the checksum code becomes
a *huge* bottleneck.

I'm curious about this because the checksum code should be fast enough
to easily handle that throughput.
It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
workstation. But we don't have a good ready-made way of testing that without
also doing IO, so it's kinda hard to say.
I remember checksum overhead being negligible even when pulling in pages
from page cache.
It's indeed much less of an issue when pulling pages from the page cache, as
the copy from the page cache is fairly slow. With direct-IO, where the copy
from the page cache isn't the main driver of CPU use anymore, it becomes much
clearer.
Even with buffered IO it became a bigger issue with 17, due to
io_combine_limit. It turns out that lots of tiny syscalls are slow, so the
peak throughput that could reach the checksumming code was lower.
I created a 21554MB relation and measured the time to do a pg_prewarm() of
that relation after evicting all of shared buffers (the first time buffers are
touched has a bit different perf characteristics). I am using direct IO and
io_uring here, as buffered IO would include the page cache copy cost and
worker mode could parallelize the checksum computation on reads. The checksum
cost is a bigger issue for writes than reads, but it's harder to quickly
generate enough dirty data for a repeatable benchmark.
This system can do about 12.5GB/s of read IO.
Just to show the effect of the read size on page cache copy performance:
config checksums time in ms
buffered io_engine=sync io_combine_limit=1 0 6712.153
buffered io_engine=sync io_combine_limit=2 0 5919.215
buffered io_engine=sync io_combine_limit=4 0 5738.496
buffered io_engine=sync io_combine_limit=8 0 5396.415
buffered io_engine=sync io_combine_limit=16 0 5312.803
buffered io_engine=sync io_combine_limit=32 0 5275.389
To see the effect of page cache copy overhead:
config checksums time in ms
buffered io_engine=io_uring 0 3901.625
direct io_engine=io_uring 0 2075.330
Now to show the effect of checksums (enabled/disabled with pg_checksums):
config checksums time in ms
buffered io_engine=io_uring 0 3883.127
buffered io_engine=io_uring 1 5880.892
direct io_engine=io_uring 0 2067.142
direct io_engine=io_uring 1 3835.968
So with direct + uring w/o checksums, we can reach 10427 MB/s (close-ish to
disk speed), but with checksums we only reach 5620 MB/s.
Is it just that the calculation is slow, or is it the fact that checksumming
needs to bring the page into the CPU cache? Did you notice any hints as to which
might be the case?
I don't think the issue is that checksumming pulls the data into CPU caches:
1) This is visible with SELECT that actually uses the data
2) I added prefetching to avoid any meaningful amount of cache misses and it
doesn't change the overall timing much
3) It's visible with buffered IO, which has pulled the data into CPU caches
already
I don't really have a machine at hand that can do anywhere close to this
amount of I/O.
It's visible even when pulling from the page cache, if to a somewhat lesser
degree.
I wonder if it's worth adding a test function that computes checksums of all
shared buffers in memory already. That'd allow exercising the checksum code in
a realistic context (i.e. buffer locking etc preventing some out-of-order
effects, using 8kB chunks etc) without also needing to involve the IO path.
I'm asking because if it's the calculation that is slow then it seems
like it's time to compile different ISA extension variants of the
checksum code and select the best one at runtime.
You think it's ISA specific? I don't see a significant effect of compiling
with -march=native or not - and that should suffice to make the checksum code
built with sufficiently high ISA support, right?
FWIW CPU profiles show all the time being spent in the "main checksum
calculation" loop:
Percent | Source code & Disassembly of postgres for cycles:P (5866 samples, percent: local period)
--------------------------------------------------------------------------------------------------------
:
:
:
: 3 Disassembly of section .text:
:
: 5 00000000009e3670 <pg_checksum_page>:
: 6 * calculation isn't affected by the old checksum stored on the page.
: 7 * Restore it after, because actually updating the checksum is NOT part of
: 8 * the API of this function.
: 9 */
: 10 save_checksum = cpage->phdr.pd_checksum;
: 11 cpage->phdr.pd_checksum = 0;
0.00 : 9e3670: xor %eax,%eax
: 13 CHECKSUM_COMP(sums[j], page->data[i][j]);
0.00 : 9e3672: mov $0x1000193,%r8d
: 15 cpage->phdr.pd_checksum = 0;
0.00 : 9e3678: vmovdqa -0x693fa0(%rip),%ymm3 # 34f6e0 <.LC0>
0.05 : 9e3680: vmovdqa -0x6935c8(%rip),%ymm4 # 3500c0 <.LC1>
0.00 : 9e3688: vmovdqa -0x693c10(%rip),%ymm0 # 34fa80 <.LC2>
0.00 : 9e3690: vmovdqa -0x693598(%rip),%ymm1 # 350100 <.LC3>
: 20 {
0.00 : 9e3698: mov %esi,%ecx
0.02 : 9e369a: lea 0x2000(%rdi),%rdx
: 23 save_checksum = cpage->phdr.pd_checksum;
0.00 : 9e36a1: movzwl 0x8(%rdi),%esi
: 25 CHECKSUM_COMP(sums[j], page->data[i][j]);
0.00 : 9e36a5: vpbroadcastd %r8d,%ymm5
: 27 cpage->phdr.pd_checksum = 0;
0.00 : 9e36ab: mov %ax,0x8(%rdi)
: 29 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
0.14 : 9e36af: mov %rdi,%rax
0.00 : 9e36b2: nopw 0x0(%rax,%rax,1)
: 32 CHECKSUM_COMP(sums[j], page->data[i][j]);
15.36 : 9e36b8: vpxord (%rax),%ymm1,%ymm1
4.79 : 9e36be: vmovdqu 0x80(%rax),%ymm2
: 35 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
0.07 : 9e36c6: add $0x100,%rax
: 37 CHECKSUM_COMP(sums[j], page->data[i][j]);
2.45 : 9e36cc: vpxord -0xe0(%rax),%ymm0,%ymm0
2.85 : 9e36d3: vpmulld %ymm5,%ymm1,%ymm6
0.02 : 9e36d8: vpsrld $0x11,%ymm1,%ymm1
3.17 : 9e36dd: vpternlogd $0x96,%ymm6,%ymm1,%ymm2
2.01 : 9e36e4: vpmulld %ymm5,%ymm0,%ymm6
13.16 : 9e36e9: vpmulld %ymm5,%ymm2,%ymm1
0.03 : 9e36ee: vpsrld $0x11,%ymm2,%ymm2
0.02 : 9e36f3: vpsrld $0x11,%ymm0,%ymm0
2.57 : 9e36f8: vpxord %ymm2,%ymm1,%ymm1
0.89 : 9e36fe: vmovdqu -0x60(%rax),%ymm2
0.12 : 9e3703: vpternlogd $0x96,%ymm6,%ymm0,%ymm2
4.48 : 9e370a: vpmulld %ymm5,%ymm2,%ymm0
0.49 : 9e370f: vpsrld $0x11,%ymm2,%ymm2
0.99 : 9e3714: vpxord %ymm2,%ymm0,%ymm0
11.88 : 9e371a: vpxord -0xc0(%rax),%ymm4,%ymm2
2.80 : 9e3721: vpmulld %ymm5,%ymm2,%ymm6
0.68 : 9e3726: vpsrld $0x11,%ymm2,%ymm4
4.94 : 9e372b: vmovdqu -0x40(%rax),%ymm2
1.45 : 9e3730: vpternlogd $0x96,%ymm6,%ymm4,%ymm2
8.63 : 9e3737: vpmulld %ymm5,%ymm2,%ymm4
0.17 : 9e373c: vpsrld $0x11,%ymm2,%ymm2
1.81 : 9e3741: vpxord %ymm2,%ymm4,%ymm4
0.10 : 9e3747: vpxord -0xa0(%rax),%ymm3,%ymm2
0.70 : 9e374e: vpmulld %ymm5,%ymm2,%ymm6
1.65 : 9e3753: vpsrld $0x11,%ymm2,%ymm3
0.03 : 9e3758: vmovdqu -0x20(%rax),%ymm2
0.85 : 9e375d: vpternlogd $0x96,%ymm6,%ymm3,%ymm2
3.73 : 9e3764: vpmulld %ymm5,%ymm2,%ymm3
0.07 : 9e3769: vpsrld $0x11,%ymm2,%ymm2
1.48 : 9e376e: vpxord %ymm2,%ymm3,%ymm3
: 68 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
0.02 : 9e3774: cmp %rax,%rdx
2.32 : 9e3777: jne 9e36b8 <pg_checksum_page+0x48>
: 71 CHECKSUM_COMP(sums[j], 0);
0.17 : 9e377d: vpmulld %ymm5,%ymm0,%ymm7
0.07 : 9e3782: vpmulld %ymm5,%ymm1,%ymm6
: 74 checksum = pg_checksum_block(cpage);
: 75 cpage->phdr.pd_checksum = save_checksum;
0.00 : 9e3787: mov %si,0x8(%rdi)
: 77 CHECKSUM_COMP(sums[j], 0);
0.02 : 9e378b: vpsrld $0x11,%ymm0,%ymm0
0.02 : 9e3790: vpsrld $0x11,%ymm1,%ymm1
0.02 : 9e3795: vpsrld $0x11,%ymm4,%ymm2
0.00 : 9e379a: vpxord %ymm0,%ymm7,%ymm7
0.10 : 9e37a0: vpmulld %ymm5,%ymm4,%ymm0
0.00 : 9e37a5: vpxord %ymm1,%ymm6,%ymm6
0.17 : 9e37ab: vpmulld %ymm5,%ymm3,%ymm1
0.19 : 9e37b0: vpmulld %ymm5,%ymm6,%ymm4
0.00 : 9e37b5: vpsrld $0x11,%ymm6,%ymm6
0.02 : 9e37ba: vpxord %ymm2,%ymm0,%ymm0
0.00 : 9e37c0: vpsrld $0x11,%ymm3,%ymm2
0.22 : 9e37c5: vpmulld %ymm5,%ymm7,%ymm3
0.02 : 9e37ca: vpsrld $0x11,%ymm7,%ymm7
0.00 : 9e37cf: vpxord %ymm2,%ymm1,%ymm1
0.03 : 9e37d5: vpsrld $0x11,%ymm0,%ymm2
0.15 : 9e37da: vpmulld %ymm5,%ymm0,%ymm0
: 94 result ^= sums[i];
0.00 : 9e37df: vpternlogd $0x96,%ymm3,%ymm7,%ymm2
: 96 CHECKSUM_COMP(sums[j], 0);
0.05 : 9e37e6: vpsrld $0x11,%ymm1,%ymm3
0.19 : 9e37eb: vpmulld %ymm5,%ymm1,%ymm1
: 99 result ^= sums[i];
0.02 : 9e37f0: vpternlogd $0x96,%ymm4,%ymm6,%ymm0
0.10 : 9e37f7: vpxord %ymm1,%ymm0,%ymm0
0.07 : 9e37fd: vpternlogd $0x96,%ymm2,%ymm3,%ymm0
0.15 : 9e3804: vextracti32x4 $0x1,%ymm0,%xmm1
0.03 : 9e380b: vpxord %xmm0,%xmm1,%xmm0
0.14 : 9e3811: vpsrldq $0x8,%xmm0,%xmm1
0.12 : 9e3816: vpxord %xmm1,%xmm0,%xmm0
0.09 : 9e381c: vpsrldq $0x4,%xmm0,%xmm1
0.12 : 9e3821: vpxord %xmm1,%xmm0,%xmm0
0.05 : 9e3827: vmovd %xmm0,%eax
:
: 111 /* Mix in the block number to detect transposed pages */
: 112 checksum ^= blkno;
0.07 : 9e382b: xor %ecx,%eax
:
: 115 /*
: 116 * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
: 117 * one. That avoids checksums of zero, which seems like a good idea.
: 118 */
: 119 return (uint16) ((checksum % 65535) + 1);
0.00 : 9e382d: mov $0x80008001,%ecx
0.03 : 9e3832: mov %eax,%edx
0.27 : 9e3834: imul %rcx,%rdx
0.09 : 9e3838: shr $0x2f,%rdx
0.07 : 9e383c: lea 0x1(%rax,%rdx,1),%eax
0.00 : 9e3840: vzeroupper
: 126 }
0.15 : 9e3843: ret
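As an aside on the tail end above: the `% 65535` from source line 119 is compiled into a reciprocal multiplication (the `imul` by 0x80008001 plus `shr $0x2f`), so no actual division runs per page. A standalone restatement of that reduction, for reference:

```c
#include <stdint.h>

/*
 * The final reduction from pg_checksum_page() (source line 119 above): fold
 * the 32-bit checksum into the range 1..65535 so a stored checksum of zero
 * cannot occur. gcc strength-reduces the % 65535 into the multiply by
 * 0x80008001 and shift seen in the disassembly.
 */
static uint16_t
reduce_checksum(uint32_t checksum)
{
	return (uint16_t) ((checksum % 65535) + 1);
}
```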
I did briefly experiment with changing N_SUMS. 16 is substantially worse, 64
seems to be about the same as 32.
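For context, the annotated loop is PostgreSQL's FNV-1a-derived block checksum. A minimal scalar sketch of the algorithm from checksum_impl.h - with the per-lane seed table omitted, so the zero seeds here are illustrative rather than the real constants:

```c
#include <stdint.h>
#include <stddef.h>

#define BLCKSZ    8192
#define N_SUMS    32
#define FNV_PRIME 16777619u

/*
 * One mixing round: xor in the input word, multiply by the FNV prime, and
 * fold in a 17-bit right shift - the vpmulld/vpsrld/vpxord triple above.
 */
#define CHECKSUM_COMP(checksum, value) do { \
	uint32_t __tmp = (checksum) ^ (value); \
	(checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \
} while (0)

static uint32_t
checksum_block_sketch(const void *data)
{
	const uint32_t *page = data;
	uint32_t	sums[N_SUMS] = {0};	/* real code seeds these from a table */
	uint32_t	result = 0;
	uint32_t	i,
				j;

	/* main loop: N_SUMS independent lanes, which is what lets gcc vectorize */
	for (i = 0; i < (uint32_t) (BLCKSZ / (sizeof(uint32_t) * N_SUMS)); i++)
		for (j = 0; j < N_SUMS; j++)
			CHECKSUM_COMP(sums[j], page[i * N_SUMS + j]);

	/* two extra rounds of zeroes so the last input words get mixed in */
	for (i = 0; i < 2; i++)
		for (j = 0; j < N_SUMS; j++)
			CHECKSUM_COMP(sums[j], 0);

	for (i = 0; i < N_SUMS; i++)
		result ^= sums[i];

	return result;
}
```

The dependency chain within each lane is multiply then shift/xor, so per-lane latency is dominated by vpmulld - which is why the Intel vs. AMD vpmulld latency difference discussed in this thread matters.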
Greetings,
Andres Freund
On Thu, 9 Jan 2025 at 18:25, Andres Freund <andres@anarazel.de> wrote:
I'm curious about this because the checksum code should be fast enough
to easily handle that throughput.

It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
workstation. But we don't have a good ready-made way of testing that without
also doing IO, so it's kinda hard to say.
Interesting, I wonder if it's related to Intel increasing vpmulld
latency to 10 already back in Haswell. The Zen 3 I'm testing on has
latency 3 and has twice the throughput.
Attached is a naive and crude benchmark that I used for testing here.
Compiled with:
gcc -O2 -funroll-loops -ftree-vectorize -march=native \
-I$(pg_config --includedir-server) \
bench-checksums.c -o bench-checksums-native
Just fills up an array of pages and checksums them, first argument is
number of checksums, second is array size. I used 1M checksums and 100
pages for in cache behavior and 100000 pages for in memory
performance.
869.85927ms @ 9.418 GB/s - generic from memory
772.12252ms @ 10.610 GB/s - generic in cache
442.61869ms @ 18.508 GB/s - native from memory
137.07573ms @ 59.763 GB/s - native in cache
Is it just that the calculation is slow, or is it the fact that checksumming
needs to bring the page into the CPU cache? Did you notice any hints which
might be the case?

I don't think the issue is that checksumming pulls the data into CPU caches:
1) This is visible with SELECT that actually uses the data
2) I added prefetching to avoid any meaningful amount of cache misses and it
   doesn't change the overall timing much
3) It's visible with buffered IO, which has pulled the data into CPU caches
   already

I didn't yet check the code, but when doing aio completions, will checksumming
be running on the same core as is going to be using the page?
It could also be that for some reason the checksumming is creating
extra bandwidth on memory bus or CPU internal rings, which due to the
already high amount of data already flying around causes contention.
I don't really have a machine at hand that can do anywhere close to this
amount of I/O.

It's visible even when pulling from the page cache, if to a somewhat lesser
degree.
Good point, I'll see if I can reproduce.
I wonder if it's worth adding a test function that computes checksums of all
shared buffers in memory already. That'd allow exercising the checksum code in
a realistic context (i.e. buffer locking etc preventing some out-of-order
effects, using 8kB chunks etc) without also needing to involve the IO path.
OoO shouldn't matter that much, over here even in the best case it's
still taking 500+ cycles per iteration.
I'm asking because if it's the calculation that is slow then it seems
like it's time to compile different ISA extension variants of the
checksum code and select the best one at runtime.

You think it's ISA specific? I don't see a significant effect of compiling
with -march=native or not - and that should suffice to make the checksum code
built with sufficiently high ISA support, right?
Right, the disassembly below looked very good.
FWIW CPU profiles show all the time being spent in the "main checksum
calculation" loop:
.. disassembly omitted for brevity
Not sure if it's applicable here or not due to microarch differences.
But in my case when bounded by memory bandwidth the main loop events
were clustered around a few instructions like it was in here, whereas
when running from cache all instructions were about equally
represented.
I did briefly experiment with changing N_SUMS. 16 is substantially worse, 64
seems to be about the same as 32.
This suggests that mulld latency is not the culprit.
Regards,
Ants
Hi,
On 2025-01-09 20:10:24 +0200, Ants Aasma wrote:
On Thu, 9 Jan 2025 at 18:25, Andres Freund <andres@anarazel.de> wrote:
I'm curious about this because the checksum code should be fast enough
to easily handle that throughput.

It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
workstation. But we don't have a good ready-made way of testing that without
also doing IO, so it's kinda hard to say.

Interesting, I wonder if it's related to Intel increasing vpmulld
latency to 10 already back in Haswell. The Zen 3 I'm testing on has
latency 3 and has twice the throughput.
Attached is a naive and crude benchmark that I used for testing here.
Compiled with:

gcc -O2 -funroll-loops -ftree-vectorize -march=native \
    -I$(pg_config --includedir-server) \
    bench-checksums.c -o bench-checksums-native

Just fills up an array of pages and checksums them, first argument is
number of checksums, second is array size. I used 1M checksums and 100
pages for in cache behavior and 100000 pages for in memory
performance.

869.85927ms @ 9.418 GB/s - generic from memory
772.12252ms @ 10.610 GB/s - generic in cache
442.61869ms @ 18.508 GB/s - native from memory
137.07573ms @ 59.763 GB/s - native in cache
printf '%16s\t%16s\t%s\n' march mem result
for mem in 100 100000 1000000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    printf "%16s\t%16s\t" $march $mem
    gcc -g -g3 -O2 -funroll-loops -ftree-vectorize -march=$march \
      -I ~/src/postgresql/src/include/ -I src/include/ \
      /tmp/bench-checksums.c -o bench-checksums-native &&
      numactl --physcpubind 1 --membind 0 ./bench-checksums-native 1000000 $mem
  done
done
Workstation w/ 2x Xeon Gold 6442Y:
march mem result
x86-64 100 731.87779ms @ 11.193 GB/s
x86-64-v2 100 327.18580ms @ 25.038 GB/s
x86-64-v3 100 264.03547ms @ 31.026 GB/s
x86-64-v4 100 282.08065ms @ 29.041 GB/s
native 100 246.13766ms @ 33.282 GB/s
x86-64 100000 842.66827ms @ 9.722 GB/s
x86-64-v2 100000 604.52959ms @ 13.551 GB/s
x86-64-v3 100000 477.16239ms @ 17.168 GB/s
x86-64-v4 100000 476.07039ms @ 17.208 GB/s
native 100000 456.08080ms @ 17.962 GB/s
x86-64 1000000 845.51132ms @ 9.689 GB/s
x86-64-v2 1000000 612.07973ms @ 13.384 GB/s
x86-64-v3 1000000 485.23738ms @ 16.882 GB/s
x86-64-v4 1000000 483.86411ms @ 16.930 GB/s
native 1000000 462.88461ms @ 17.698 GB/s
Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
march mem result
x86-64 100 417.19762ms @ 19.636 GB/s
x86-64-v2 100 130.67596ms @ 62.689 GB/s
x86-64-v3 100 97.07758ms @ 84.386 GB/s
x86-64-v4 100 95.67704ms @ 85.621 GB/s
native 100 95.15734ms @ 86.089 GB/s
x86-64 100000 431.38370ms @ 18.990 GB/s
x86-64-v2 100000 215.74856ms @ 37.970 GB/s
x86-64-v3 100000 199.74492ms @ 41.012 GB/s
x86-64-v4 100000 186.98300ms @ 43.811 GB/s
native 100000 187.68125ms @ 43.648 GB/s
x86-64 1000000 433.87893ms @ 18.881 GB/s
x86-64-v2 1000000 217.46561ms @ 37.670 GB/s
x86-64-v3 1000000 200.40667ms @ 40.877 GB/s
x86-64-v4 1000000 187.51978ms @ 43.686 GB/s
native 1000000 190.29273ms @ 43.049 GB/s
Workstation w/ 2x Xeon Gold 5215:
march mem result
x86-64 100 780.38881ms @ 10.497 GB/s
x86-64-v2 100 389.62005ms @ 21.026 GB/s
x86-64-v3 100 323.97294ms @ 25.286 GB/s
x86-64-v4 100 274.19493ms @ 29.877 GB/s
native 100 283.48674ms @ 28.897 GB/s
x86-64 100000 1112.63898ms @ 7.363 GB/s
x86-64-v2 100000 831.45641ms @ 9.853 GB/s
x86-64-v3 100000 696.20789ms @ 11.767 GB/s
x86-64-v4 100000 685.61636ms @ 11.948 GB/s
native 100000 689.78023ms @ 11.876 GB/s
x86-64 1000000 1128.65580ms @ 7.258 GB/s
x86-64-v2 1000000 843.92594ms @ 9.707 GB/s
x86-64-v3 1000000 718.78848ms @ 11.397 GB/s
x86-64-v4 1000000 687.68258ms @ 11.912 GB/s
native 1000000 705.34731ms @ 11.614 GB/s
That's quite the drastic difference between AMD and Intel. Of course it's also
comparing a multi-core server uarch (lower per-core bandwidth, much higher
aggregate bandwidth) with a client uarch.
The difference between the baseline CPU target and a more modern profile is
also rather impressive. Looks like some cpu-capability based dispatch would
likely be worth it, even if it didn't matter in my case due to -march=native.
I just realized that
a) The meson build doesn't use the relevant flags for bufpage.c - it didn't
matter in my numbers though because I was building with -O3 and
march=native.
This clearly ought to be fixed.
b) Neither build uses the optimized flags for pg_checksum and pg_upgrade, both
of which include checksum_imp.h directly.
This probably should be fixed too - perhaps by building the relevant code
once as part of fe_utils or such?
It probably matters less than it used to - these days -O2 turns on
-ftree-loop-vectorize -ftree-slp-vectorize. But loop unrolling isn't
enabled.
I do see a perf difference at -O2 between using/not using
-funroll-loops. Interestingly not at -O3, despite -funroll-loops not actually
being enabled by -O3. I think the relevant option that *is* turned on by O3 is
-fpeel-loops.
Here's a comparison of different flags run the 6442Y
printf '%16s\t%32s\t%16s\t%s\n' march flags mem result
for mem in 100 100000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    for flags in "-O2" "-O2 -funroll-loops" "-O3" "-O3 -funroll-loops"; do
      printf "%16s\t%32s\t%16s\t" "$march" "$flags" "$mem"
      gcc $flags -march=$march -I ~/src/postgresql/src/include/ -I src/include/ \
        /tmp/bench-checksums.c -o bench-checksums-native &&
        numactl --physcpubind 3 --membind 0 ./bench-checksums-native 3000000 $mem
    done
  done
done
march flags mem result
x86-64 -O2 100 2280.86253ms @ 10.775 GB/s
x86-64 -O2 -funroll-loops 100 2195.66942ms @ 11.193 GB/s
x86-64 -O3 100 2422.57588ms @ 10.145 GB/s
x86-64 -O3 -funroll-loops 100 2243.75826ms @ 10.953 GB/s
x86-64-v2 -O2 100 1243.68063ms @ 19.761 GB/s
x86-64-v2 -O2 -funroll-loops 100 979.67783ms @ 25.086 GB/s
x86-64-v2 -O3 100 988.80296ms @ 24.854 GB/s
x86-64-v2 -O3 -funroll-loops 100 991.31632ms @ 24.791 GB/s
x86-64-v3 -O2 100 1146.90165ms @ 21.428 GB/s
x86-64-v3 -O2 -funroll-loops 100 785.81395ms @ 31.275 GB/s
x86-64-v3 -O3 100 800.53627ms @ 30.699 GB/s
x86-64-v3 -O3 -funroll-loops 100 790.21230ms @ 31.101 GB/s
x86-64-v4 -O2 100 883.82916ms @ 27.806 GB/s
x86-64-v4 -O2 -funroll-loops 100 831.55372ms @ 29.554 GB/s
x86-64-v4 -O3 100 843.23141ms @ 29.145 GB/s
x86-64-v4 -O3 -funroll-loops 100 821.19969ms @ 29.927 GB/s
native -O2 100 1197.41357ms @ 20.524 GB/s
native -O2 -funroll-loops 100 718.05253ms @ 34.226 GB/s
native -O3 100 747.94090ms @ 32.858 GB/s
native -O3 -funroll-loops 100 751.52379ms @ 32.702 GB/s
x86-64 -O2 100000 2911.47087ms @ 8.441 GB/s
x86-64 -O2 -funroll-loops 100000 2525.45504ms @ 9.731 GB/s
x86-64 -O3 100000 2497.42016ms @ 9.841 GB/s
x86-64 -O3 -funroll-loops 100000 2346.33551ms @ 10.474 GB/s
x86-64-v2 -O2 100000 2124.10102ms @ 11.570 GB/s
x86-64-v2 -O2 -funroll-loops 100000 1819.09659ms @ 13.510 GB/s
x86-64-v2 -O3 100000 1613.45823ms @ 15.232 GB/s
x86-64-v2 -O3 -funroll-loops 100000 1607.09245ms @ 15.292 GB/s
x86-64-v3 -O2 100000 1972.89390ms @ 12.457 GB/s
x86-64-v3 -O2 -funroll-loops 100000 1432.58229ms @ 17.155 GB/s
x86-64-v3 -O3 100000 1533.18003ms @ 16.029 GB/s
x86-64-v3 -O3 -funroll-loops 100000 1539.39779ms @ 15.965 GB/s
x86-64-v4 -O2 100000 1591.96881ms @ 15.437 GB/s
x86-64-v4 -O2 -funroll-loops 100000 1434.91828ms @ 17.127 GB/s
x86-64-v4 -O3 100000 1454.30133ms @ 16.899 GB/s
x86-64-v4 -O3 -funroll-loops 100000 1429.13733ms @ 17.196 GB/s
native -O2 100000 1980.53734ms @ 12.409 GB/s
native -O2 -funroll-loops 100000 1373.95337ms @ 17.887 GB/s
native -O3 100000 1517.90164ms @ 16.191 GB/s
native -O3 -funroll-loops 100000 1508.37021ms @ 16.293 GB/s
Is it just that the calculation is slow, or is it the fact that checksumming
needs to bring the page into the CPU cache? Did you notice any hints which
might be the case?

I don't think the issue is that checksumming pulls the data into CPU caches:
1) This is visible with SELECT that actually uses the data
2) I added prefetching to avoid any meaningful amount of cache misses and it
   doesn't change the overall timing much
3) It's visible with buffered IO, which has pulled the data into CPU caches
   already

I didn't yet check the code, but when doing aio completions, will checksumming
be running on the same core as is going to be using the page?
With io_uring normally yes, the exception being that another backend that
needs the same page could end up running the completion.
With worker mode normally no.
Greetings,
Andres Freund
On Thu, 9 Jan 2025 at 22:53, Andres Freund <andres@anarazel.de> wrote:
<Edited to highlight interesting numbers>
Workstation w/ 2x Xeon Gold 6442Y:
march mem result
native 100 246.13766ms @ 33.282 GB/s
native 100000 456.08080ms @ 17.962 GB/s

Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
march mem result
native 100 95.15734ms @ 86.089 GB/s
native 100000 187.68125ms @ 43.648 GB/s

Workstation w/ 2x Xeon Gold 5215:
march mem result
native 100 283.48674ms @ 28.897 GB/s
native 100000 689.78023ms @ 11.876 GB/s

That's quite the drastic difference between AMD and Intel. Of course it's also
comparing a multi-core server uarch (lower per-core bandwidth, much higher
aggregate bandwidth) with a client uarch.
In hindsight, building the hash around the mulld primitive was a bad decision,
because Intel for whatever reason decided to kill the performance of it:
              vpmulld latency   throughput (values/cycle)
Sandy Bridge        5                  4
Alder Lake         10                  8
Zen 4               3                 16
Zen 5               3                 32
Most top performing hashes these days seem to be built around AES
instructions.
But I was curious why there is such a difference in streaming results.
Turns out that for whatever reason one core gets access to much less
bandwidth on Intel than on AMD. [1]
This made me take another look at your previous prewarm numbers. It looks
like performance is suspiciously proportional to the number of copies of
data the CPU has to make:
config checksums time in ms number of copies
buffered io_engine=io_uring 0 3883.127 2
buffered io_engine=io_uring 1 5880.892 3
direct io_engine=io_uring 0 2067.142 1
direct io_engine=io_uring 1 3835.968 2
To me that feels like there is a bandwidth bottleneck in this workload, and
checksumming the page when the contents are not looked at just adds to the
consumed bandwidth, bringing down the performance correspondingly (dividing
each time by its copy count gives roughly 1.9-2.1 seconds per copy in every
row).
This doesn't explain why you observed slowdown in the select case, but I
think that might be due to the per-core bandwidth limitation. Both cases
might pull in the same amount of data into the cache, but without checksums
it is spread out over a longer time allowing other work to happen
concurrently.
[1]: https://chipsandcheese.com/p/a-peek-at-sapphire-rapids#%C2%A7bandwidth
The difference between the baseline CPU target and a more modern profile is
also rather impressive. Looks like some cpu-capability based dispatch would
likely be worth it, even if it didn't matter in my case due to -march=native.
Yes, along with using function attributes for the optimization flags to avoid
the build system hacks.
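Such dispatch could look roughly like this (an assumed shape, not code from the patchset): build the same loop twice using the GCC/Clang `target` function attribute, and resolve once at startup via `__builtin_cpu_supports()`:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*checksum_fn) (const uint32_t *words, size_t n);

static uint32_t
checksum_words_generic(const uint32_t *words, size_t n)
{
	uint32_t	sum = 0;

	for (size_t i = 0; i < n; i++)
		sum = sum * 16777619u + words[i];
	return sum;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
/*
 * Same body, but this copy may be compiled with AVX2, regardless of the
 * baseline -march the rest of the tree is built with.
 */
__attribute__((target("avx2")))
static uint32_t
checksum_words_avx2(const uint32_t *words, size_t n)
{
	uint32_t	sum = 0;

	for (size_t i = 0; i < n; i++)
		sum = sum * 16777619u + words[i];
	return sum;
}
#endif

/* Resolve once, e.g. at startup, and cache the resulting function pointer. */
static checksum_fn
resolve_checksum(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		return checksum_words_avx2;
#endif
	return checksum_words_generic;
}
```

On x86 the resolver could also check for AVX-512 variants; GCC's `target_clones` attribute automates the same pattern via an ifunc resolver.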
--
Ants
On Wed, Jan 8, 2025 at 7:26 PM Andres Freund <andres@anarazel.de> wrote:
1) Shared memory representation of an IO, for the AIO subsystem internally
Currently: PgAioHandle
2) A way for the issuer of an IO to reference 1), to attach information to the
IO
Currently: PgAioHandle*
3) A way for any backend to wait for a specific IO to complete
Currently: PgAioHandleRef
With that additional information, I don't mind this naming too much,
but I still think PgAioHandle -> PgAio and PgAioHandleRef ->
PgAioHandle is worth considering. Compare BackgroundWorkerSlot and
BackgroundWorkerHandle, which suggests PgAioHandle -> PgAioSlot and
PgAioHandleRef -> PgAioHandle.
ZOMBIE feels even later than REAPED to me :)
Makes logical sense, because you would assume that you die first and
then later become an undead creature, but the UNIX precedent is that
dying turns you into a zombie and someone then has to reap the exit
status for you to be just plain dead. :-)
I do agree with Heikki that REAPED sounds later than COMPLETED, because you
reap zombie processes by collecting their exit status. Maybe you could have
AHS_COMPLETE or AHS_IO_COMPLETE for the state where the I/O is done but
there's still completion-related work to be done, and then the other state
could be AHS_DONE or AHS_FINISHED or AHS_FINAL or AHS_REAPED or something.

How about
AHS_COMPLETE_KERNEL or AHS_COMPLETE_RAW - raw syscall completed
AHS_COMPLETE_SHARED_CB - shared callback completed
AHS_COMPLETE_LOCAL_CB - local callback completed?
That's not bad. I like RAW better than KERNEL. I was hoping to use
different words like COMPLETE and DONE rather than, as you did it
here, COMPLETE and COMPLETE, but it's probably fine.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2025-01-13 15:43:46 -0500, Robert Haas wrote:
On Wed, Jan 8, 2025 at 7:26 PM Andres Freund <andres@anarazel.de> wrote:
1) Shared memory representation of an IO, for the AIO subsystem internally
Currently: PgAioHandle
2) A way for the issuer of an IO to reference 1), to attach information to the
IO
Currently: PgAioHandle*
3) A way for any backend to wait for a specific IO to complete
Currently: PgAioHandleRef
With that additional information, I don't mind this naming too much,
but I still think PgAioHandle -> PgAio and PgAioHandleRef ->
PgAioHandle is worth considering. Compare BackgroundWorkerSlot and
BackgroundWorkerHandle, which suggests PgAioHandle -> PgAioSlot and
PgAioHandleRef -> PgAioHandle.
I don't love PgAioHandle -> PgAio as there are other things (e.g. per-backend
state) in the PgAio namespace...
I do agree with Heikki that REAPED sounds later than COMPLETED, because you
reap zombie processes by collecting their exit status. Maybe you could have
AHS_COMPLETE or AHS_IO_COMPLETE for the state where the I/O is done but
there's still completion-related work to be done, and then the other state
could be AHS_DONE or AHS_FINISHED or AHS_FINAL or AHS_REAPED or something.

How about
AHS_COMPLETE_KERNEL or AHS_COMPLETE_RAW - raw syscall completed
AHS_COMPLETE_SHARED_CB - shared callback completed
AHS_COMPLETE_LOCAL_CB - local callback completed?
That's not bad. I like RAW better than KERNEL.
Cool.
I was hoping to use different words like COMPLETE and DONE rather than, as
you did it here, COMPLETE and COMPLETE, but it's probably fine.
Once the IO is really done, the handle is immediately recycled (and moved into
IDLE state, ready to be used again).
Greetings,
Andres Freund
On Mon, Jan 13, 2025 at 4:46 PM Andres Freund <andres@anarazel.de> wrote:
Once the IO is really done, the handle is immediately recycled (and moved into
IDLE state, ready to be used again).
OK, fair enough.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
Attached is v2.3.
There are a lot of changes - primarily renaming things based on on-list and
off-list feedback, but also some other things:
Functional:
- Added pg_aios view
- md.c registering sync requests, that was previously omitted
- This triggered stats issues during shutdown, as it can lead to IO workers
emitting stats in some corner cases. I've written a patch series to
address this [1]. For now I've included them in this patchset, but I would
like to push the reordering patches soon.
- Testing error handling for temp table IO made me realize that the previous
pattern of just tracking the refcount held by the IO subsystem in the
LocalRefCount array leads to spurious buffer leak warnings [2]. I attached
a prototype patch to deal with this by bringing localbuf.c more in line with
bufmgr.c, but it needs some cleanup.
That's in v2.3-0020-WIP-localbuf-Track-pincount-in-BufferDesc-as-we.patch
- Wait for all IOs to finish during shutdown. This is primarily required to
ensure there aren't IOs initiated by a prior "owner" of a ProcNumber when a
new backend starts. But there are also some kernels that don't like when
exiting while IO is in flight.
- Re-armed local completion callbacks, they're required for correctness of
temporary table IO
- Added a bunch of central debug helpers that only lead to output if
PGAIO_VERBOSE is defined. That did make code a good bit more readable.
Polishing:
- Lots of copy editing, a lot of it thanks to feedback by Noah and Heikki
- Renamed the previous concept of a "subject" of an IO (i.e. what the IO is
executed on, an smgr relation, a WAL file, ...) to "target". I'm not in
love with that name, but I went through dozens of variations, and it does
seem better than subject.
Not sure anymore how I ended up with subject, it's grammatically off and not
very descriptive to boot.
- Renamed "PgAioHandleRef" and related functions to
PgAioWaitRef/pgaio_wref_*(), that seems a lot more descriptive.
- Renamed pgaio_io_get() to pgaio_io_acquire()
- Renamed the IO handle states (PREPARED to STAGED, IN_FLIGHT to SUBMITTED,
REAPED to COMPLETED_IO).
Particularly the various COMPLETED state names aren't necessarily final,
I've been debating a bunch of variations with Thomas and Robert
- Renamed aio_ref.h to aio_types.h, moved a few more types into it.
- Renamed completion callbacks to not use "shared" anymore - ->prepare was not
really shared and now local callbacks are back (in a restricted form).
s/PgAioHandleSharedCallback/PgAioHandleCallback/
s/pgaio_io_add_shared_cb/pgaio_io_register_callbacks/
Not entirely sure *register_callbacks is the best, happy to adjust.
- Renamed the ->error IO handle callback to ->report
Also renamed s/pgaio_result_log/pgaio_result_report/g
- Renamed the ->prepare IO handle callback to ->stage
- Partially addressed request to reorder aio/README.md
- Determine shared memory allocation size with PG_IOV_MAX not io_combine_limit
io_combine_limit is USERSET, so it's not correct to use it for shmem
allocations. I chose PG_IOV_MAX instead of MAX_IO_COMBINE_LIMIT because this
is a more generic limit than bufmgr.c IO.
- Prefix PgAio* enums with PGAIO_, global variables with pgaio_*
- Split out callback-related code from aio_subject.c (now aio_target.c) into
aio_callback.c. The target specific code is rather small, so this makes a
lot more sense.
- Distributed functions into more appropriate .c files, documented the choice
in aio.h, reordered them
- Disowned lwlock: More consistent naming, reduce diff size, resume interrupts
Heikki asked to clear ->owner when disowning the lock - but as we currently
*never* clear it, it doesn't seem right to do so only when disowning the lock.
- IO data that can be set on a handle (to e.g. transport an array of Buffers
to the completion callbacks) is now done with
pgaio_io_(get|set)_handle_data(). Mainly to distinguish it from data that's
actually the target/source of a read/write.
Heikki suggested to make this per-callback data, but I don't think there's
currently a use case for that, and it'd add a fair bit of memory overhead. I
added a comment documenting this.
- Lots of other cleanups, added comments and the like
Todo:
- Reorder README further
- Make per backend state not indexed by ProcNumber, as that requires reserving
per-backend state for IO workers, which will never need them
- Clean up localbuf.c "preparation" patches
- Add more tests - I had hoped to get to this, but got sidetracked with a
bunch of things I found while testing
- I started looking into having a distinct type for the public pgaio_io_*
related functions that can be used just by the issuer of the IO. It does
make things a bit easier to understand, but also complicates naming. Not
sure if it's worth it yet.
- Need to define (and test) the behavior when an IO worker fails to reopen the
file for an IO
- Heikki doesn't love pgaio_submit_staged(), suggested pgaio_kick_staged() or
such. I don't love that name though.
- There's some duplicated code in aio_callback.c, it'd be nice to deduplicate
the callback invocation of the different callbacks
- Local callbacks are triggered from within pgaio_io_reclaim(), that's not
exactly pretty. But it's currently the most central place to deal with the
case of IOs for which the shared completion callback was called in another
backend.
- As Jakub suggested (below [3]), when io_method=io_uring is used, we can run
out of file descriptors much more easily. At the very least we need a good
error message, perhaps also some rlimit adjusting (probably as a second
step, if so).
- Thomas is working on the read_stream.c <-> bufmgr.c integration piece
- Start to write docs adjustments
[1]: /messages/by-id/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
[2]: /messages/by-id/j6hny5ivrfqw356ugoy3ti5ccadamluekxod4k6amao5snew6c@t5h3bwhrgfqx
[3]: /messages/by-id/tp63m6tcbi7mmsjlqgxd55sghhwvjxp3mkgeljffkbaujezvdl@fvmdr3c6uhat
Attachments:
v2.3-0024-bufmgr-Implement-AIO-write-support.patch
From 98ba93250f1fd40e4a97387bf08f90b28686705c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 16:09:51 -0500
Subject: [PATCH v2.3 24/30] bufmgr: Implement AIO write support
As of this commit there are no users of these AIO facilities, that'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 2 +
src/include/storage/bufmgr.h | 2 +
src/backend/storage/aio/aio_callback.c | 2 +
src/backend/storage/buffer/bufmgr.c | 90 +++++++++++++++++++++++++-
4 files changed, 95 insertions(+), 1 deletion(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 30b08495f3d..7bdce41121e 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -180,8 +180,10 @@ typedef enum PgAioHandleCallbackID
PGAIO_HCB_MD_WRITEV,
PGAIO_HCB_SHARED_BUFFER_READV,
+ PGAIO_HCB_SHARED_BUFFER_WRITEV,
PGAIO_HCB_LOCAL_BUFFER_READV,
+ PGAIO_HCB_LOCAL_BUFFER_WRITEV,
} PgAioHandleCallbackID;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f205643c4ef..cf9d0a63aed 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -203,7 +203,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
struct PgAioHandleCallbacks;
extern const struct PgAioHandleCallbacks aio_shared_buffer_readv_cb;
+extern const struct PgAioHandleCallbacks aio_shared_buffer_writev_cb;
extern const struct PgAioHandleCallbacks aio_local_buffer_readv_cb;
+extern const struct PgAioHandleCallbacks aio_local_buffer_writev_cb;
/* upper limit for effective_io_concurrency */
diff --git a/src/backend/storage/aio/aio_callback.c b/src/backend/storage/aio/aio_callback.c
index 6054f57eb23..acfed50bfeb 100644
--- a/src/backend/storage/aio/aio_callback.c
+++ b/src/backend/storage/aio/aio_callback.c
@@ -45,8 +45,10 @@ static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
CALLBACK_ENTRY(PGAIO_HCB_MD_WRITEV, aio_md_writev_cb),
CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_READV, aio_shared_buffer_readv_cb),
+ CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_WRITEV, aio_shared_buffer_writev_cb),
CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_READV, aio_local_buffer_readv_cb),
+ CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_WRITEV, aio_local_buffer_writev_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 118a6e1ca31..d5212da4912 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6402,6 +6402,42 @@ ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
return buf_failed;
}
+static uint64
+ReadBufferCompleteWriteShared(Buffer buffer, bool release_lock, bool failed)
+{
+ BufferDesc *bufHdr;
+ bool result = false;
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(buf_state & BM_DIRTY);
+ }
+#endif
+
+ TerminateBufferIO(bufHdr, /* clear_dirty = */ true,
+ failed ? BM_IO_ERROR : 0,
+ /* forget_owner = */ false,
+ /* syncio = */ false);
+
+ /*
+ * The initiator of IO is not managing the lock (i.e. called
+ * LWLockDisown()), we are.
+ */
+ if (release_lock)
+ LWLockReleaseDisowned(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+
+ return result;
+}
+
/*
* Helper to prepare IO on shared buffers for execution, shared between reads
* and writes.
@@ -6466,7 +6502,6 @@ shared_buffer_stage_common(PgAioHandle *ioh, bool is_write)
* Lock is now owned by AIO subsystem.
*/
LWLockDisown(content_lock);
- RESUME_INTERRUPTS();
}
/*
@@ -6483,6 +6518,12 @@ shared_buffer_readv_stage(PgAioHandle *ioh)
shared_buffer_stage_common(ioh, false);
}
+static void
+shared_buffer_writev_stage(PgAioHandle *ioh)
+{
+ shared_buffer_stage_common(ioh, true);
+}
+
static PgAioResult
shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
{
@@ -6558,6 +6599,36 @@ buffer_readv_report(PgAioResult result, const PgAioTargetData *target_data, int
MemoryContextSwitchTo(oldContext);
}
+static PgAioResult
+shared_buffer_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ uint64 *io_data;
+ uint8 handle_data_len;
+
+ ereport(DEBUG5,
+ errmsg("%s: %d %d", __func__, prior_result.status, prior_result.result),
+ errhidestmt(true), errhidecontext(true));
+
+ io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
+
+ /* FIXME: handle outright errors */
+
+ for (int io_data_off = 0; io_data_off < handle_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+
+ /* FIXME: handle short writes / failures */
+ /* FIXME: ioh->target_data.shared_buffer.release_lock */
+ ReadBufferCompleteWriteShared(buf,
+ true,
+ false);
+
+ }
+
+ return result;
+}
+
/*
* Helper to stage IO on local buffers for execution, shared between reads
* and writes.
@@ -6644,12 +6715,26 @@ local_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
return result;
}
+static void
+local_buffer_writev_stage(PgAioHandle *ioh)
+{
+ /*
+ * Currently this is unreachable as the only write support is for
+ * checkpointer / bgwriter, which don't deal with local buffers.
+ */
+ elog(ERROR, "not yet");
+}
+
const struct PgAioHandleCallbacks aio_shared_buffer_readv_cb = {
.stage = shared_buffer_readv_stage,
.complete_shared = shared_buffer_readv_complete,
.report = buffer_readv_report,
};
+const struct PgAioHandleCallbacks aio_shared_buffer_writev_cb = {
+ .stage = shared_buffer_writev_stage,
+ .complete_shared = shared_buffer_writev_complete,
+};
const struct PgAioHandleCallbacks aio_local_buffer_readv_cb = {
.stage = local_buffer_readv_stage,
@@ -6662,3 +6747,6 @@ const struct PgAioHandleCallbacks aio_local_buffer_readv_cb = {
.complete_local = local_buffer_readv_complete,
.report = buffer_readv_report,
};
+const struct PgAioHandleCallbacks aio_local_buffer_writev_cb = {
+ .stage = local_buffer_writev_stage,
+};
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0025-aio-Add-IO-queue-helper.patch (text/x-diff)
From e7e8e954a1432f531d830242fd170564f268521c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:50 -0500
Subject: [PATCH v2.3 25/30] aio: Add IO queue helper
This is likely never going anywhere - Thomas Munro is working on something
more complete. But I needed a way to exercise AIO for checkpointer / bgwriter.
---
src/include/storage/io_queue.h | 31 +++++
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/io_queue.c | 198 ++++++++++++++++++++++++++++
src/backend/storage/aio/meson.build | 1 +
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 233 insertions(+)
create mode 100644 src/include/storage/io_queue.h
create mode 100644 src/backend/storage/aio/io_queue.c
diff --git a/src/include/storage/io_queue.h b/src/include/storage/io_queue.h
new file mode 100644
index 00000000000..f5e1bc07ff3
--- /dev/null
+++ b/src/include/storage/io_queue.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.h
+ * Mechanism for tracking many IOs
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_queue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_QUEUE_H
+#define IO_QUEUE_H
+
+struct IOQueue;
+typedef struct IOQueue IOQueue;
+
+struct PgAioWaitRef;
+
+extern IOQueue *io_queue_create(int depth, int flags);
+extern void io_queue_track(IOQueue *ioq, const struct PgAioWaitRef *iow);
+extern void io_queue_wait_one(IOQueue *ioq);
+extern void io_queue_wait_all(IOQueue *ioq);
+extern bool io_queue_is_empty(IOQueue *ioq);
+extern void io_queue_reserve(IOQueue *ioq);
+extern struct PgAioHandle *io_queue_acquire_io(IOQueue *ioq);
+extern void io_queue_free(IOQueue *ioq);
+
+#endif /* IO_QUEUE_H */
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 3f2469cc399..86fa4276fda 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -15,6 +15,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_target.o \
+ io_queue.o \
method_io_uring.o \
method_sync.o \
method_worker.o \
diff --git a/src/backend/storage/aio/io_queue.c b/src/backend/storage/aio/io_queue.c
new file mode 100644
index 00000000000..62ad06c8bfe
--- /dev/null
+++ b/src/backend/storage/aio/io_queue.c
@@ -0,0 +1,198 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.c
+ * AIO - Mechanism for tracking many IOs
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/io_queue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/ilist.h"
+#include "storage/aio.h"
+#include "storage/io_queue.h"
+#include "utils/resowner.h"
+
+
+
+typedef struct TrackedIO
+{
+ PgAioWaitRef iow;
+ dlist_node node;
+} TrackedIO;
+
+struct IOQueue
+{
+ int depth;
+ int unsubmitted;
+
+ bool has_reserved;
+
+ dclist_head idle;
+ dclist_head in_progress;
+
+ TrackedIO tracked_ios[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
+IOQueue *
+io_queue_create(int depth, int flags)
+{
+ size_t sz;
+ IOQueue *ioq;
+
+ sz = offsetof(IOQueue, tracked_ios)
+ + sizeof(TrackedIO) * depth;
+
+ ioq = palloc0(sz);
+
+ ioq->depth = 0;
+
+ for (int i = 0; i < depth; i++)
+ {
+ TrackedIO *tio = &ioq->tracked_ios[i];
+
+ pgaio_wref_clear(&tio->iow);
+ dclist_push_tail(&ioq->idle, &tio->node);
+ }
+
+ return ioq;
+}
+
+void
+io_queue_wait_one(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* FIXME: Should we really pop here already? */
+ dlist_node *node = dclist_pop_head_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ pgaio_wref_wait(&tio->iow);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+void
+io_queue_reserve(IOQueue *ioq)
+{
+ if (ioq->has_reserved)
+ return;
+
+ if (dclist_is_empty(&ioq->idle))
+ io_queue_wait_one(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ ioq->has_reserved = true;
+}
+
+PgAioHandle *
+io_queue_acquire_io(IOQueue *ioq)
+{
+ PgAioHandle *ioh;
+
+ io_queue_reserve(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ if (!io_queue_is_empty(ioq))
+ {
+ ioh = pgaio_io_acquire_nb(CurrentResourceOwner, NULL);
+ if (ioh == NULL)
+ {
+ /*
+ * Need to wait for all IOs, blocking might not be legal in the
+ * context.
+ *
+ * XXX: This doesn't make a whole lot of sense, we're also
+ * blocking here. What was I smoking when I wrote the above?
+ */
+ io_queue_wait_all(ioq);
+ ioh = pgaio_io_acquire(CurrentResourceOwner, NULL);
+ }
+ }
+ else
+ {
+ ioh = pgaio_io_acquire(CurrentResourceOwner, NULL);
+ }
+
+ return ioh;
+}
+
+void
+io_queue_track(IOQueue *ioq, const struct PgAioWaitRef *iow)
+{
+ dlist_node *node;
+ TrackedIO *tio;
+
+ Assert(ioq->has_reserved);
+ ioq->has_reserved = false;
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ node = dclist_pop_head_node(&ioq->idle);
+ tio = dclist_container(TrackedIO, node, node);
+
+ tio->iow = *iow;
+
+ dclist_push_tail(&ioq->in_progress, &tio->node);
+
+ ioq->unsubmitted++;
+
+ /*
+ * XXX: Should have some smarter logic here. We don't want to wait too
+ * long to submit, that'll mean we're more likely to block. But we also
+ * don't want to have the overhead of submitting every IO individually.
+ */
+ if (ioq->unsubmitted >= 4)
+ {
+ pgaio_submit_staged();
+ ioq->unsubmitted = 0;
+ }
+}
+
+void
+io_queue_wait_all(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* wait for the last IO to minimize unnecessary wakeups */
+ dlist_node *node = dclist_tail_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ if (!pgaio_wref_check_done(&tio->iow))
+ {
+ ereport(DEBUG3,
+ errmsg("io_queue_wait_all for io:%d",
+ pgaio_wref_get_id(&tio->iow)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ pgaio_wref_wait(&tio->iow);
+ }
+
+ dclist_delete_from(&ioq->in_progress, &tio->node);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+bool
+io_queue_is_empty(IOQueue *ioq)
+{
+ return dclist_is_empty(&ioq->in_progress);
+}
+
+void
+io_queue_free(IOQueue *ioq)
+{
+ io_queue_wait_all(ioq);
+
+ pfree(ioq);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index da6df2d3654..270c4a64428 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_target.c',
+ 'io_queue.c',
'method_io_uring.c',
'method_sync.c',
'method_worker.c',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b3f06711e6a..91d8198af9f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1179,6 +1179,7 @@ IOContext
IOFuncSelector
IOObject
IOOp
+IOQueue
IO_STATUS_BLOCK
IPCompareMethod
ITEM
@@ -2986,6 +2987,7 @@ TocEntry
TokenAuxData
TokenizedAuthLine
TrackItem
+TrackedIO
TransApplyAction
TransInvalidationInfo
TransState
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0026-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (text/x-diff)
From 0e410933546b25b259ee8a02fa27bb7f34b3f736 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:52 -0500
Subject: [PATCH v2.3 26/30] bufmgr: use AIO in checkpointer, bgwriter
This is far from ready - just included to be able to exercise AIO writes and
get some preliminary numbers. In all likelihood this will be built on top of
work by Thomas Munro rather than on the preceding commit.
---
src/include/postmaster/bgwriter.h | 3 +-
src/include/storage/buf_internals.h | 2 +
src/include/storage/bufmgr.h | 3 +-
src/include/storage/bufpage.h | 1 +
src/backend/postmaster/bgwriter.c | 20 +-
src/backend/postmaster/checkpointer.c | 12 +-
src/backend/storage/buffer/bufmgr.c | 587 +++++++++++++++++++++++---
src/backend/storage/page/bufpage.c | 10 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 581 insertions(+), 58 deletions(-)
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 2d5854e6879..517c40cd804 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -31,7 +31,8 @@ extern void BackgroundWriterMain(char *startup_data, size_t startup_data_len) pg
extern void CheckpointerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+struct IOQueue;
+extern void CheckpointWriteDelay(struct IOQueue *ioq, int flags, double progress);
extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9f936cd6b84..aeefb1746ec 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -21,6 +21,8 @@
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
+#include "storage/io_queue.h"
+#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cf9d0a63aed..bc7ee73246e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -321,7 +321,8 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+struct IOQueue;
+extern bool BgBufferSync(struct IOQueue *ioq, struct WritebackContext *wb_context);
extern void LimitAdditionalPins(uint32 *additional_pins);
extern void LimitAdditionalLocalPins(uint32 *additional_pins);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index d06208b7ce6..a2bd1db92d0 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -509,5 +509,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
Item newtup, Size newsize);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern bool PageNeedsChecksumCopy(Page page);
#endif /* BUFPAGE_H */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 3eff5dc6f0e..cf16f8bed5d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -38,10 +38,12 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
@@ -89,6 +91,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
bool prev_hibernate;
+ IOQueue *ioq;
WritebackContext wb_context;
Assert(startup_data_len == 0);
@@ -130,6 +133,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
ALLOCSET_DEFAULT_SIZES);
MemoryContextSwitchTo(bgwriter_context);
+ ioq = io_queue_create(128, 0);
WritebackContextInit(&wb_context, &bgwriter_flush_after);
/*
@@ -170,6 +174,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
ConditionVariableCancelSleep();
UnlockBuffers();
ReleaseAuxProcessResources(false);
+ pgaio_at_error();
AtEOXact_Buffers(false);
AtEOXact_SMgr();
AtEOXact_Files(false);
@@ -226,12 +231,22 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
+ /*
+ * FIXME: this is theoretically racy, but I didn't want to copy
+ * HandleMainLoopInterrupts() remaining body here.
+ */
+ if (ShutdownRequestPending)
+ {
+ io_queue_wait_all(ioq);
+ io_queue_free(ioq);
+ }
+
HandleMainLoopInterrupts();
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync(&wb_context);
+ can_hibernate = BgBufferSync(ioq, &wb_context);
/* Report pending statistics to the cumulative stats system */
pgstat_report_bgwriter();
@@ -248,6 +263,9 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
smgrdestroyall();
}
+ /* finish IO before sleeping, to avoid blocking other backends */
+ io_queue_wait_all(ioq);
+
/*
* Log a new xl_running_xacts every now and then so replication can
* get into a consistent state faster (think of suboverflowed
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 767bf9f5cf8..0fb7f3b7275 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -49,9 +49,11 @@
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/pmsignal.h"
@@ -278,6 +280,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
pgstat_report_wait_end();
UnlockBuffers();
ReleaseAuxProcessResources(false);
+ pgaio_at_error();
AtEOXact_Buffers(false);
AtEOXact_SMgr();
AtEOXact_Files(false);
@@ -762,7 +765,7 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(IOQueue *ioq, int flags, double progress)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -796,6 +799,13 @@ CheckpointWriteDelay(int flags, double progress)
/* Report interim statistics to the cumulative stats system */
pgstat_report_checkpointer();
+ /*
+ * Ensure all pending IO is submitted to avoid unnecessary delays for
+ * other processes.
+ */
+ io_queue_wait_all(ioq);
+
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d5212da4912..1e8793d1630 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -52,6 +52,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -77,6 +78,7 @@
/* Bits in SyncOneBuffer's return value */
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+#define BUF_CANT_MERGE 0x04
#define RELS_BSEARCH_THRESHOLD 20
@@ -511,8 +513,6 @@ static void UnpinBuffer(BufferDesc *buf);
static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used,
- WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
@@ -530,6 +530,7 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
+
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -3068,6 +3069,56 @@ UnpinBufferNoOwner(BufferDesc *buf)
}
}
+typedef struct BuffersToWrite
+{
+ int nbuffers;
+ BufferTag start_at_tag;
+ uint32 max_combine;
+
+ XLogRecPtr max_lsn;
+
+ PgAioHandle *ioh;
+ PgAioWaitRef iow;
+
+ uint64 total_writes;
+
+ Buffer buffers[IOV_MAX];
+ PgAioBounceBuffer *bounce_buffers[IOV_MAX];
+ const void *data_ptrs[IOV_MAX];
+} BuffersToWrite;
+
+static int PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void
+BuffersToWriteInit(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ to_write->total_writes = 0;
+ to_write->nbuffers = 0;
+ to_write->ioh = NULL;
+ pgaio_wref_clear(&to_write->iow);
+ to_write->max_lsn = InvalidXLogRecPtr;
+}
+
+static void
+BuffersToWriteEnd(BuffersToWrite *to_write)
+{
+ if (to_write->ioh != NULL)
+ {
+ pgaio_io_release(to_write->ioh);
+ to_write->ioh = NULL;
+ }
+
+ if (to_write->total_writes > 0)
+ pgaio_submit_staged();
+}
+
+
#define ST_SORT sort_checkpoint_bufferids
#define ST_ELEMENT_TYPE CkptSortItem
#define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b)
@@ -3099,7 +3150,10 @@ BufferSync(int flags)
binaryheap *ts_heap;
int i;
int mask = BM_DIRTY;
+ IOQueue *ioq;
WritebackContext wb_context;
+ BuffersToWrite to_write;
+ int max_combine;
/*
* Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3161,7 +3215,9 @@ BufferSync(int flags)
if (num_to_scan == 0)
return; /* nothing to do */
+ ioq = io_queue_create(512, 0);
WritebackContextInit(&wb_context, &checkpoint_flush_after);
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
@@ -3269,48 +3325,91 @@ BufferSync(int flags)
*/
num_processed = 0;
num_written = 0;
+
+ BuffersToWriteInit(&to_write, ioq, &wb_context);
+
while (!binaryheap_empty(ts_heap))
{
BufferDesc *bufHdr = NULL;
CkptTsStatus *ts_stat = (CkptTsStatus *)
DatumGetPointer(binaryheap_first(ts_heap));
+ bool batch_continue = true;
- buf_id = CkptBufferIds[ts_stat->index].buf_id;
- Assert(buf_id != -1);
-
- bufHdr = GetBufferDescriptor(buf_id);
-
- num_processed++;
+ Assert(ts_stat->num_scanned <= ts_stat->num_to_scan);
/*
- * We don't need to acquire the lock here, because we're only looking
- * at a single bit. It's possible that someone else writes the buffer
- * and clears the flag right after we check, but that doesn't matter
- * since SyncOneBuffer will then do nothing. However, there is a
- * further race condition: it's conceivable that between the time we
- * examine the bit here and the time SyncOneBuffer acquires the lock,
- * someone else not only wrote the buffer but replaced it with another
- * page and dirtied it. In that improbable case, SyncOneBuffer will
- * write the buffer though we didn't need to. It doesn't seem worth
- * guarding against this, though.
+ * Collect a batch of buffers to write out from the current
+ * tablespace. That causes some imbalance between the tablespaces, but
+ * that's more than outweighed by the efficiency gain due to batching.
*/
- if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+ while (batch_continue &&
+ to_write.nbuffers < max_combine &&
+ ts_stat->num_scanned < ts_stat->num_to_scan)
{
- if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+ buf_id = CkptBufferIds[ts_stat->index].buf_id;
+ Assert(buf_id != -1);
+
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ num_processed++;
+
+ /*
+ * We don't need to acquire the lock here, because we're only
+ * looking at a single bit. It's possible that someone else writes
+ * the buffer and clears the flag right after we check, but that
+ * doesn't matter since SyncOneBuffer will then do nothing.
+ * However, there is a further race condition: it's conceivable
+ * that between the time we examine the bit here and the time
+ * SyncOneBuffer acquires the lock, someone else not only wrote
+ * the buffer but replaced it with another page and dirtied it. In
+ * that improbable case, SyncOneBuffer will write the buffer
+ * though we didn't need to. It doesn't seem worth guarding
+ * against this, though.
+ */
+ if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
{
- TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.buffers_written++;
- num_written++;
+ int result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+
+ if (result & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+ WriteBuffers(&to_write, ioq, &wb_context);
+
+ result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+ Assert(result != BUF_CANT_MERGE);
+ }
+
+ if (result & BUF_WRITTEN)
+ {
+ TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
+ PendingCheckpointerStats.buffers_written++;
+ num_written++;
+ }
+ else
+ {
+ batch_continue = false;
+ }
}
+ else
+ {
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+ }
+
+ /*
+ * Measure progress independent of actually having to flush the
+			 * buffer - otherwise writing becomes unbalanced.
+ */
+ ts_stat->progress += ts_stat->progress_slice;
+ ts_stat->num_scanned++;
+ ts_stat->index++;
}
- /*
- * Measure progress independent of actually having to flush the buffer
- * - otherwise writing become unbalanced.
- */
- ts_stat->progress += ts_stat->progress_slice;
- ts_stat->num_scanned++;
- ts_stat->index++;
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+
/* Have all the buffers from the tablespace been processed? */
if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -3328,15 +3427,23 @@ BufferSync(int flags)
*
* (This will check for barrier events even if it doesn't sleep.)
*/
- CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
+ CheckpointWriteDelay(ioq, flags, (double) num_processed / num_to_scan);
}
+ Assert(to_write.nbuffers == 0);
+ io_queue_wait_all(ioq);
+
/*
* Issue all pending flushes. Only checkpointer calls BufferSync(), so
* IOContext will always be IOCONTEXT_NORMAL.
*/
IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+ io_queue_wait_all(ioq); /* IssuePendingWritebacks might have added
+ * more */
+ io_queue_free(ioq);
+ BuffersToWriteEnd(&to_write);
+
pfree(per_ts_stat);
per_ts_stat = NULL;
binaryheap_free(ts_heap);
@@ -3362,7 +3469,7 @@ BufferSync(int flags)
* bgwriter_lru_maxpages to 0.)
*/
bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
{
/* info obtained from freelist.c */
int strategy_buf_id;
@@ -3405,6 +3512,9 @@ BgBufferSync(WritebackContext *wb_context)
long new_strategy_delta;
uint32 new_recent_alloc;
+ BuffersToWrite to_write;
+ int max_combine;
+
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
@@ -3425,6 +3535,8 @@ BgBufferSync(WritebackContext *wb_context)
return true;
}
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
+
/*
* Compute strategy_delta = how many buffers have been scanned by the
* clock sweep since last time. If first time through, assume none. Then
@@ -3581,11 +3693,25 @@ BgBufferSync(WritebackContext *wb_context)
num_written = 0;
reusable_buffers = reusable_buffers_est;
+ BuffersToWriteInit(&to_write, ioq, wb_context);
+
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ int sync_state;
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ if (sync_state & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ Assert(sync_state != BUF_CANT_MERGE);
+ }
if (++next_to_clean >= NBuffers)
{
@@ -3596,6 +3722,13 @@ BgBufferSync(WritebackContext *wb_context)
if (sync_state & BUF_WRITTEN)
{
+ Assert(sync_state & BUF_REUSABLE);
+
+ if (to_write.nbuffers == max_combine)
+ {
+ WriteBuffers(&to_write, ioq, wb_context);
+ }
+
reusable_buffers++;
if (++num_written >= bgwriter_lru_maxpages)
{
@@ -3607,6 +3740,11 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ BuffersToWriteEnd(&to_write);
+
PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
@@ -3645,8 +3783,66 @@ BgBufferSync(WritebackContext *wb_context)
return (bufs_to_lap == 0 && recent_alloc == 0);
}
+static inline bool
+BufferTagsSameRel(const BufferTag *tag1, const BufferTag *tag2)
+{
+ return (tag1->spcOid == tag2->spcOid) &&
+ (tag1->dbOid == tag2->dbOid) &&
+ (tag1->relNumber == tag2->relNumber) &&
+ (tag1->forkNum == tag2->forkNum)
+ ;
+}
+
+static bool
+CanMergeWrite(BuffersToWrite *to_write, BufferDesc *cur_buf_hdr)
+{
+ BlockNumber cur_block = cur_buf_hdr->tag.blockNum;
+
+ Assert(to_write->nbuffers > 0); /* can't merge with nothing */
+ Assert(to_write->start_at_tag.relNumber != InvalidOid);
+ Assert(to_write->start_at_tag.blockNum != InvalidBlockNumber);
+
+ Assert(to_write->ioh != NULL);
+
+ /*
+ * First check if the blocknumber is one that we could actually merge,
+ * that's cheaper than checking the tablespace/db/relnumber/fork match.
+ */
+ if (to_write->start_at_tag.blockNum + to_write->nbuffers != cur_block)
+ return false;
+
+ if (!BufferTagsSameRel(&to_write->start_at_tag, &cur_buf_hdr->tag))
+ return false;
+
+ /*
+ * Need to check with smgr how large a write we're allowed to make. To
+ * reduce the overhead of the smgr check, only inquire once, when
+ * processing the first to-be-merged buffer. That avoids the overhead in
+	 * the common case of writing out buffers that are definitely not mergeable.
+ */
+ if (to_write->nbuffers == 1)
+ {
+ SMgrRelation smgr;
+
+ smgr = smgropen(BufTagGetRelFileLocator(&to_write->start_at_tag), INVALID_PROC_NUMBER);
+
+ to_write->max_combine = smgrmaxcombine(smgr,
+ to_write->start_at_tag.forkNum,
+ to_write->start_at_tag.blockNum);
+ }
+ else
+ {
+ Assert(to_write->max_combine > 0);
+ }
+
+ if (to_write->start_at_tag.blockNum + to_write->max_combine <= cur_block)
+ return false;
+
+ return true;
+}
+
/*
- * SyncOneBuffer -- process a single buffer during syncing.
+ * PrepareToWriteBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
* buffers marked recently used, as these are not replacement candidates.
@@ -3655,22 +3851,50 @@ BgBufferSync(WritebackContext *wb_context)
* BUF_WRITTEN: we wrote the buffer.
* BUF_REUSABLE: buffer is available for replacement, ie, it has
* pin count 0 and usage count 0.
+ * BUF_CANT_MERGE: can't combine this write with prior writes, caller needs
+ * to issue those first
*
* (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
* after locking it, but we don't care all that much.)
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context)
{
- BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(buf - 1);
+ uint32 buf_state;
int result = 0;
- uint32 buf_state;
- BufferTag tag;
+ XLogRecPtr cur_buf_lsn;
+ LWLock *content_lock;
+ bool may_block;
+
+ /*
+ * Check if this buffer can be written out together with already prepared
+ * writes. We check before we have pinned the buffer, so the buffer can be
+ * written out and replaced between this check and us pinning the buffer -
+ * we'll recheck below. The reason for the pre-check is that we don't want
+ * to pin the buffer just to find out that we can't merge the IO.
+ */
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ result |= BUF_CANT_MERGE;
+ return result;
+ }
+ }
+ else
+ {
+ to_write->start_at_tag = cur_buf_hdr->tag;
+ }
/* Make sure we can handle the pin */
ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
+ /* XXX: Should also check if we are allowed to pin one more buffer */
+
/*
* Check whether buffer needs writing.
*
@@ -3680,7 +3904,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* don't worry because our checkpoint.redo points before log record for
* upcoming changes and so we are not required to write such dirty buffer.
*/
- buf_state = LockBufHdr(bufHdr);
+ buf_state = LockBufHdr(cur_buf_hdr);
if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
@@ -3689,40 +3913,294 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
}
else if (skip_recently_used)
{
+#if 0
+ elog(LOG, "at block %d: skip recent with nbuffers %d",
+ cur_buf_hdr->tag.blockNum, to_write->nbuffers);
+#endif
/* Caller told us not to write recently-used buffers */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
{
/* It's clean, so nothing to do */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
+ /* pin the buffer, from now on its identity can't change anymore */
+ PinBuffer_Locked(cur_buf_hdr);
+
+ /*
+ * Acquire IO, if needed, now that it's likely that we'll need to write.
+ */
+ if (to_write->ioh == NULL)
+ {
+ /* otherwise we should already have acquired a handle */
+ Assert(to_write->nbuffers == 0);
+
+ to_write->ioh = io_queue_acquire_io(ioq);
+ pgaio_io_get_wref(to_write->ioh, &to_write->iow);
+ }
+
/*
- * Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
- * buffer is clean by the time we've locked it.)
+ * If we are merging, check if the buffer's identity possibly changed
+ * while we hadn't yet pinned it.
+ *
+ * XXX: It might be worth checking if we still want to write the buffer
+ * out, e.g. it could have been replaced with a buffer that doesn't have
+ * BM_CHECKPOINT_NEEDED set.
*/
- PinBuffer_Locked(bufHdr);
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ elog(LOG, "changed identity");
+ UnpinBuffer(cur_buf_hdr);
+
+ result |= BUF_CANT_MERGE;
+
+ return result;
+ }
+ }
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ content_lock = BufferDescriptorGetContentLock(cur_buf_hdr);
+
+ if (!may_block)
+ {
+ if (LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /* done */
+ }
+ else if (to_write->nbuffers == 0)
+ {
+ /*
+ * Need to wait for all prior IO to finish before blocking for
+			 * lock acquisition, to avoid the risk of a deadlock due to us
+ * waiting for another backend that is waiting for our unsubmitted
+ * IO to complete.
+ */
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ elog(DEBUG2, "at block %u: can't block, nbuffers = 0",
+ cur_buf_hdr->tag.blockNum
+ );
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ Assert(may_block);
+
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+ else
+ {
+ elog(DEBUG2, "at block %d: can't block nbuffers = %d",
+ cur_buf_hdr->tag.blockNum,
+ to_write->nbuffers);
- FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ UnpinBuffer(cur_buf_hdr);
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
- LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ return result;
+ }
+ }
+ else
+ {
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
- tag = bufHdr->tag;
+ if (!may_block)
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
- UnpinBuffer(bufHdr);
+ may_block = io_queue_is_empty(ioq) && to_write->nbuffers == 0 && !pgaio_have_staged();
+
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ elog(DEBUG2, "at block %d: non-waitable StartBufferIO returns false, %d",
+ cur_buf_hdr->tag.blockNum,
+ may_block);
+
+ /*
+ * FIXME: can't tell whether this is because the buffer has
+ * been cleaned
+ */
+ if (!may_block)
+ {
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+ }
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ return result;
+ }
+ }
+ }
+ else
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, false))
+ {
+ elog(DEBUG2, "waitable StartBufferIO returns false");
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ /*
+ * FIXME: Historically we returned BUF_WRITTEN in this case, which
+ * seems wrong
+ */
+ return result;
+ }
+ }
/*
- * SyncOneBuffer() is only called by checkpointer and bgwriter, so
- * IOContext will always be IOCONTEXT_NORMAL.
+ * Run PageGetLSN while holding header lock, since we don't have the
+ * buffer locked exclusively in all cases.
*/
- ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
+ buf_state = LockBufHdr(cur_buf_hdr);
+
+ cur_buf_lsn = BufferGetLSN(cur_buf_hdr);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf_state &= ~BM_JUST_DIRTIED;
+
+ UnlockBufHdr(cur_buf_hdr, buf_state);
+
+ to_write->buffers[to_write->nbuffers] = buf;
+ to_write->nbuffers++;
+
+ if (buf_state & BM_PERMANENT &&
+ (to_write->max_lsn == InvalidXLogRecPtr || to_write->max_lsn < cur_buf_lsn))
+ {
+ to_write->max_lsn = cur_buf_lsn;
+ }
+
+ result |= BUF_WRITTEN;
+
+ return result;
+}
+
+static void
+WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ SMgrRelation smgr;
+ Buffer first_buf;
+ BufferDesc *first_buf_hdr;
+ bool needs_checksum;
+
+ Assert(to_write->nbuffers > 0 && to_write->nbuffers <= io_combine_limit);
+
+ first_buf = to_write->buffers[0];
+ first_buf_hdr = GetBufferDescriptor(first_buf - 1);
+
+ smgr = smgropen(BufTagGetRelFileLocator(&first_buf_hdr->tag), INVALID_PROC_NUMBER);
+
+ /*
+ * Force XLOG flush up to buffer's LSN. This implements the basic WAL
+ * rule that log updates must hit disk before any of the data-file changes
+ * they describe do.
+ *
+ * However, this rule does not apply to unlogged relations, which will be
+ * lost after a crash anyway. Most unlogged relation pages do not bear
+ * LSNs since we never emit WAL records for them, and therefore flushing
+ * up through the buffer LSN would be useless, but harmless. However,
+ * GiST indexes use LSNs internally to track page-splits, and therefore
+ * unlogged GiST pages bear "fake" LSNs generated by
+ * GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake
+ * LSN counter could advance past the WAL insertion point; and if it did
+ * happen, attempting to flush WAL through that location would fail, with
+ * disastrous system-wide consequences. To make sure that can't happen,
+ * skip the flush if the buffer isn't permanent.
+ */
+ if (to_write->max_lsn != InvalidXLogRecPtr)
+ XLogFlush(to_write->max_lsn);
+
+ /*
+ * Now it's safe to write buffer to disk. Note that no one else should
+ * have been able to write it while we were busy with log flushing because
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
+ */
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+ Block bufBlock;
+ char *bufToWrite;
+
+ bufBlock = BufHdrGetBlock(cur_buf_hdr);
+ needs_checksum = PageNeedsChecksumCopy((Page) bufBlock);
+
+ /*
+ * Update page checksum if desired. Since we have only shared lock on
+ * the buffer, other processes might be updating hint bits in it, so
+ * we must copy the page to a bounce buffer if we do checksumming.
+ */
+ if (needs_checksum)
+ {
+ PgAioBounceBuffer *bb = pgaio_bounce_buffer_get();
+
+ pgaio_io_assoc_bounce_buffer(to_write->ioh, bb);
+
+ bufToWrite = pgaio_bounce_buffer_buffer(bb);
+ memcpy(bufToWrite, bufBlock, BLCKSZ);
+ PageSetChecksumInplace((Page) bufToWrite, cur_buf_hdr->tag.blockNum);
+ }
+ else
+ {
+ bufToWrite = bufBlock;
+ }
+
+ to_write->data_ptrs[nbuf] = bufToWrite;
+ }
+
+ pgaio_io_set_handle_data_32(to_write->ioh,
+ (uint32 *) to_write->buffers,
+ to_write->nbuffers);
+ pgaio_io_register_callbacks(to_write->ioh, PGAIO_HCB_SHARED_BUFFER_WRITEV);
+
+ smgrstartwritev(to_write->ioh, smgr,
+ BufTagGetForkNum(&first_buf_hdr->tag),
+ first_buf_hdr->tag.blockNum,
+ to_write->data_ptrs,
+ to_write->nbuffers,
+ false);
+ pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+ IOOP_WRITE, 1, BLCKSZ * to_write->nbuffers);
+
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+
+ UnpinBuffer(cur_buf_hdr);
+ }
+
+ io_queue_track(ioq, &to_write->iow);
+ to_write->total_writes++;
- return result | BUF_WRITTEN;
+ /* clear state for next write */
+ to_write->nbuffers = 0;
+ to_write->start_at_tag.relNumber = InvalidOid;
+ to_write->start_at_tag.blockNum = InvalidBlockNumber;
+ to_write->max_combine = 0;
+ to_write->max_lsn = InvalidXLogRecPtr;
+ to_write->ioh = NULL;
+ pgaio_wref_clear(&to_write->iow);
}
/*
@@ -4088,6 +4566,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
error_context_stack = errcallback.previous;
}
+
/*
* RelationGetNumberOfBlocksInFork
* Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index a931cdba151..7fd8e7681ae 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1480,6 +1480,16 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
return true;
}
+bool
+PageNeedsChecksumCopy(Page page)
+{
+ if (PageIsNew(page))
+ return false;
+
+ /* a copy for checksumming is only needed when checksums are enabled */
+ return DataChecksumsEnabled();
+}
+
/*
* Set checksum for a page in shared buffers.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91d8198af9f..bbd08cd6b4d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -347,6 +347,7 @@ BufferManagerRelation
BufferStrategyControl
BufferTag
BufferUsage
+BuffersToWrite
BuildAccumulator
BuiltinScript
BulkInsertState
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0027-very-wip-test_aio-module.patch (text/x-diff; charset=us-ascii)
From 4507a8ca905a9272bad59198f93e01e40da87451 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:54 -0500
Subject: [PATCH v2.3 27/30] very-wip: test_aio module
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio_internal.h | 8 +
src/include/storage/buf_internals.h | 4 +
src/backend/storage/aio/aio.c | 39 ++
src/backend/storage/buffer/bufmgr.c | 3 +-
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_aio/.gitignore | 6 +
src/test/modules/test_aio/Makefile | 34 ++
src/test/modules/test_aio/expected/inject.out | 295 ++++++++++
src/test/modules/test_aio/expected/io.out | 40 ++
.../modules/test_aio/expected/ownership.out | 148 +++++
src/test/modules/test_aio/expected/prep.out | 17 +
src/test/modules/test_aio/io_uring.conf | 5 +
src/test/modules/test_aio/meson.build | 78 +++
src/test/modules/test_aio/sql/inject.sql | 84 +++
src/test/modules/test_aio/sql/io.sql | 16 +
src/test/modules/test_aio/sql/ownership.sql | 65 +++
src/test/modules/test_aio/sql/prep.sql | 9 +
src/test/modules/test_aio/sync.conf | 5 +
src/test/modules/test_aio/test_aio--1.0.sql | 99 ++++
src/test/modules/test_aio/test_aio.c | 504 ++++++++++++++++++
src/test/modules/test_aio/test_aio.control | 3 +
src/test/modules/test_aio/worker.conf | 5 +
23 files changed, 1467 insertions(+), 2 deletions(-)
create mode 100644 src/test/modules/test_aio/.gitignore
create mode 100644 src/test/modules/test_aio/Makefile
create mode 100644 src/test/modules/test_aio/expected/inject.out
create mode 100644 src/test/modules/test_aio/expected/io.out
create mode 100644 src/test/modules/test_aio/expected/ownership.out
create mode 100644 src/test/modules/test_aio/expected/prep.out
create mode 100644 src/test/modules/test_aio/io_uring.conf
create mode 100644 src/test/modules/test_aio/meson.build
create mode 100644 src/test/modules/test_aio/sql/inject.sql
create mode 100644 src/test/modules/test_aio/sql/io.sql
create mode 100644 src/test/modules/test_aio/sql/ownership.sql
create mode 100644 src/test/modules/test_aio/sql/prep.sql
create mode 100644 src/test/modules/test_aio/sync.conf
create mode 100644 src/test/modules/test_aio/test_aio--1.0.sql
create mode 100644 src/test/modules/test_aio/test_aio.c
create mode 100644 src/test/modules/test_aio/test_aio.control
create mode 100644 src/test/modules/test_aio/worker.conf
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 531532e306a..1855b57f355 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -316,6 +316,14 @@ extern const char *pgaio_io_get_target_name(PgAioHandle *ioh);
__VA_ARGS__)
+/* These functions are just for use in tests, from within injection points */
+#ifdef USE_INJECTION_POINTS
+
+extern PgAioHandle *pgaio_inj_io_get(void);
+
+#endif
+
+
/* Declarations for the tables of function pointers exposed by each IO method. */
extern PGDLLIMPORT const IoMethodOps pgaio_sync_ops;
extern PGDLLIMPORT const IoMethodOps pgaio_worker_ops;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index aeefb1746ec..9939032d5f0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -423,6 +423,10 @@ extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_co
extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
IOContext io_context, BufferTag *tag);
+/* solely to make it easier to write tests */
+extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
+
+
/* freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 431f2c2e5af..7a873f6ffbb 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -46,6 +46,10 @@
#include "utils/resowner.h"
#include "utils/wait_event_types.h"
+#ifdef USE_INJECTION_POINTS
+#include "utils/injection_point.h"
+#endif
+
static inline void pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state);
static void pgaio_io_reclaim(PgAioHandle *ioh);
@@ -92,6 +96,11 @@ static const IoMethodOps *const pgaio_method_ops_table[] = {
const IoMethodOps *pgaio_method_ops;
+#ifdef USE_INJECTION_POINTS
+static PgAioHandle *pgaio_inj_cur_handle;
+#endif
+
+
/* --------------------------------------------------------------------------------
* Public Functions related to PgAioHandle
@@ -452,6 +461,19 @@ pgaio_io_process_completion(PgAioHandle *ioh, int result)
pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_IO);
+#ifdef USE_INJECTION_POINTS
+ pgaio_inj_cur_handle = ioh;
+
+ /*
+ * FIXME: This could be in a critical section - but it looks like we can't
+ * just InjectionPointLoad() at process start, as the injection point
+ * might not yet be defined.
+ */
+ InjectionPointCached("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+
+ pgaio_inj_cur_handle = NULL;
+#endif
+
pgaio_io_call_complete_shared(ioh);
pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED);
@@ -1128,3 +1150,20 @@ assign_io_method(int newval, void *extra)
pgaio_method_ops = pgaio_method_ops_table[newval];
}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Injection point support
+ * --------------------------------------------------------------------------------
+ */
+
+#ifdef USE_INJECTION_POINTS
+
+PgAioHandle *
+pgaio_inj_io_get(void)
+{
+ return pgaio_inj_cur_handle;
+}
+
+#endif
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1e8793d1630..7f6eabcb92e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -514,7 +514,6 @@ static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner,
bool syncio);
@@ -6184,7 +6183,7 @@ WaitIO(BufferDesc *buf)
* find out if they can perform the I/O as part of a larger operation, without
* waiting for the answer or distinguishing the reasons why not.
*/
-static bool
+bool
StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c0d3cf0e14b..73ff9c55687 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
libpq_pipeline \
plsample \
spgist_name_ops \
+ test_aio \
test_bloomfilter \
test_copy_callbacks \
test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 4f544a042d4..b11dd72334c 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+subdir('test_aio')
subdir('brin')
subdir('commit_ts')
subdir('delay_execution')
diff --git a/src/test/modules/test_aio/.gitignore b/src/test/modules/test_aio/.gitignore
new file mode 100644
index 00000000000..b4903eba657
--- /dev/null
+++ b/src/test/modules/test_aio/.gitignore
@@ -0,0 +1,6 @@
+# Generated subdirectories
+/log/
+/results/
+/output_iso/
+/tmp_check/
+/tmp_check_iso/
diff --git a/src/test/modules/test_aio/Makefile b/src/test/modules/test_aio/Makefile
new file mode 100644
index 00000000000..ae6d685835b
--- /dev/null
+++ b/src/test/modules/test_aio/Makefile
@@ -0,0 +1,34 @@
+# src/test/modules/test_aio/Makefile
+
+PGFILEDESC = "test_aio - test code for AIO"
+
+MODULE_big = test_aio
+OBJS = \
+ $(WIN32RES) \
+ test_aio.o
+
+EXTENSION = test_aio
+DATA = test_aio--1.0.sql
+
+REGRESS = prep ownership io
+
+ifeq ($(enable_injection_points),yes)
+REGRESS += inject
+endif
+
+# FIXME: the meson build runs these tests with sync, worker and - if
+# supported - io_uring; the make build runs them only once.
+
+# requires custom config
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_aio/expected/inject.out b/src/test/modules/test_aio/expected/inject.out
new file mode 100644
index 00000000000..e62e3718845
--- /dev/null
+++ b/src/test/modules/test_aio/expected/inject.out
@@ -0,0 +1,295 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to a single block, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+ count
+-------
+ 10000
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+ count
+-------
+ 10000
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to 1 1/2 blocks, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+ count
+-------
+ 10000
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(8192 + 4096);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+ count
+-------
+ 10000
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten single-block read to read that block partially, we'll error out,
+-- because we assume we can read at least one block at a time.
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+ count
+-------
+ 1
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(4096);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten single-block read to read 0 bytes, expect that to error out
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+ count
+-------
+ 1
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(0);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- trigger a hard error, should error out
+SELECT inj_io_short_read_attach(-errno_from_string('EIO'));
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..3 in file base/<redacted>: Input/output error
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/io.out b/src/test/modules/test_aio/expected/io.out
new file mode 100644
index 00000000000..e46b582f290
--- /dev/null
+++ b/src/test/modules/test_aio/expected/io.out
@@ -0,0 +1,40 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT corrupt_rel_block('tbl_a', 1);
+ corrupt_rel_block
+-------------------
+
+(1 row)
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+ redact
+--------
+ t
+(1 row)
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/ownership.out b/src/test/modules/test_aio/expected/ownership.out
new file mode 100644
index 00000000000..97fdad6c629
--- /dev/null
+++ b/src/test/modules/test_aio/expected/ownership.out
@@ -0,0 +1,148 @@
+-----
+-- IO handles
+----
+-- leak warning: implicit xact
+SELECT handle_get();
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+ERROR: release in unexpected state
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+ handle_get
+------------
+
+
+(2 rows)
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- normal handle use
+SELECT handle_get_release();
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT handle_get_twice();
+ERROR: API violation: Only one IO can be handed out
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-----
+-- Bounce Buffers handles
+----
+-- leak warning: implicit xact
+SELECT bb_get();
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+ERROR: can only release handed out BB
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- normal bb use
+SELECT bb_get_release();
+ bb_get_release
+----------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT bb_get_twice();
+ERROR: can only hand out one BB
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/prep.out b/src/test/modules/test_aio/expected/prep.out
new file mode 100644
index 00000000000..7fad6280db5
--- /dev/null
+++ b/src/test/modules/test_aio/expected/prep.out
@@ -0,0 +1,17 @@
+CREATE EXTENSION test_aio;
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+ grow_rel
+----------
+
+(1 row)
+
+SELECT grow_rel('tbl_b', 550);
+ grow_rel
+----------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/io_uring.conf b/src/test/modules/test_aio/io_uring.conf
new file mode 100644
index 00000000000..efd7ad143ff
--- /dev/null
+++ b/src/test/modules/test_aio/io_uring.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'io_uring'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
new file mode 100644
index 00000000000..a4bef0ceeb0
--- /dev/null
+++ b/src/test/modules/test_aio/meson.build
@@ -0,0 +1,78 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+test_aio_sources = files(
+ 'test_aio.c',
+)
+
+if host_system == 'windows'
+ test_aio_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_aio',
+ '--FILEDESC', 'test_aio - test code for AIO',])
+endif
+
+test_aio = shared_module('test_aio',
+ test_aio_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_aio
+
+test_install_data += files(
+ 'test_aio.control',
+ 'test_aio--1.0.sql',
+)
+
+
+testfiles = [
+ 'prep',
+ 'ownership',
+ 'io',
+]
+
+if get_option('injection_points')
+ testfiles += 'inject'
+endif
+
+
+tests += {
+ 'name': 'test_aio_sync',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('sync.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+tests += {
+ 'name': 'test_aio_worker',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('worker.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+if liburing.found()
+ tests += {
+ 'name': 'test_aio_uring',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('io_uring.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ }
+ }
+endif
diff --git a/src/test/modules/test_aio/sql/inject.sql b/src/test/modules/test_aio/sql/inject.sql
new file mode 100644
index 00000000000..1190531f5ad
--- /dev/null
+++ b/src/test/modules/test_aio/sql/inject.sql
@@ -0,0 +1,84 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+SELECT inj_io_short_read_detach();
+
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- shorten multi-block read to a single block, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(8192);
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+SELECT inj_io_short_read_detach();
+
+
+-- shorten multi-block read to 1 1/2 blocks, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(8192 + 4096);
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+SELECT inj_io_short_read_detach();
+
+
+-- shorten single-block read to read that block partially, we'll error out,
+-- because we assume we can read at least one block at a time.
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(4096);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- shorten single-block read to read 0 bytes, expect that to error out
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(0);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_a', 0);
+SELECT invalidate_rel_block('tbl_a', 1);
+SELECT invalidate_rel_block('tbl_a', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- trigger a hard error, should error out
+SELECT inj_io_short_read_attach(-errno_from_string('EIO'));
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
diff --git a/src/test/modules/test_aio/sql/io.sql b/src/test/modules/test_aio/sql/io.sql
new file mode 100644
index 00000000000..a29bb4eb15d
--- /dev/null
+++ b/src/test/modules/test_aio/sql/io.sql
@@ -0,0 +1,16 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+
+SELECT corrupt_rel_block('tbl_a', 1);
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
diff --git a/src/test/modules/test_aio/sql/ownership.sql b/src/test/modules/test_aio/sql/ownership.sql
new file mode 100644
index 00000000000..63cf40c802a
--- /dev/null
+++ b/src/test/modules/test_aio/sql/ownership.sql
@@ -0,0 +1,65 @@
+-----
+-- IO handles
+----
+
+-- leak warning: implicit xact
+SELECT handle_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+
+-- normal handle use
+SELECT handle_get_release();
+
+-- should error out, API violation
+SELECT handle_get_twice();
+
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+
+
+-----
+-- Bounce buffer handles
+-----
+
+-- leak warning: implicit xact
+SELECT bb_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+
+-- normal bb use
+SELECT bb_get_release();
+
+-- should error out, API violation
+SELECT bb_get_twice();
+
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
diff --git a/src/test/modules/test_aio/sql/prep.sql b/src/test/modules/test_aio/sql/prep.sql
new file mode 100644
index 00000000000..b8f225cbc98
--- /dev/null
+++ b/src/test/modules/test_aio/sql/prep.sql
@@ -0,0 +1,9 @@
+CREATE EXTENSION test_aio;
+
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+SELECT grow_rel('tbl_b', 550);
diff --git a/src/test/modules/test_aio/sync.conf b/src/test/modules/test_aio/sync.conf
new file mode 100644
index 00000000000..c480922d6cf
--- /dev/null
+++ b/src/test/modules/test_aio/sync.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'sync'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
new file mode 100644
index 00000000000..e3d5ce29c60
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -0,0 +1,99 @@
+/* src/test/modules/test_aio/test_aio--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_aio" to load this file. \quit
+
+
+CREATE FUNCTION errno_from_string(sym text)
+RETURNS pg_catalog.int4 STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION grow_rel(rel regclass, nblocks int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION read_corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION bb_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE OR REPLACE FUNCTION redact(p_sql text)
+RETURNS bool
+LANGUAGE plpgsql
+AS $$
+ DECLARE
+ err_state text;
+ err_msg text;
+ BEGIN
+ EXECUTE p_sql;
+ RETURN true;
+ EXCEPTION WHEN OTHERS THEN
+ GET STACKED DIAGNOSTICS
+ err_state = RETURNED_SQLSTATE,
+ err_msg = MESSAGE_TEXT;
+ err_msg = regexp_replace(err_msg, '(file|relation) "?base/[0-9]+/[0-9]+"?', '\1 base/<redacted>');
+ RAISE NOTICE 'wrapped error: %', err_msg
+ USING ERRCODE = err_state;
+ RETURN false;
+ END;
+$$;
+
+
+CREATE FUNCTION inj_io_short_read_attach(result int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_detach()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
new file mode 100644
index 00000000000..20d7e6dc82f
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.c
@@ -0,0 +1,504 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_aio.c
+ *		Test module for asynchronous IO (AIO).
+ *
+ * Provides SQL-callable functions to acquire and release AIO handles and
+ * bounce buffers, to corrupt and invalidate relation blocks, and to inject
+ * errors into IO completion via injection points, so that regression tests
+ * can exercise AIO error handling and resource-ownership paths.
+ *
+ * Copyright (c) 2020-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/test/modules/test_aio/test_aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "fmgr.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/injection_point.h"
+#include "utils/rel.h"
+
+
+PG_MODULE_MAGIC;
+
+
+typedef struct InjIoErrorState
+{
+ bool enabled;
+ bool result_set;
+ int result;
+} InjIoErrorState;
+
+static InjIoErrorState * inj_io_error_state;
+
+/* Shared memory init callbacks */
+static shmem_request_hook_type prev_shmem_request_hook = NULL;
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+
+
+static PgAioHandle *last_handle;
+static PgAioBounceBuffer *last_bb;
+
+
+
+static void
+test_aio_shmem_request(void)
+{
+ if (prev_shmem_request_hook)
+ prev_shmem_request_hook();
+
+ RequestAddinShmemSpace(sizeof(InjIoErrorState));
+}
+
+static void
+test_aio_shmem_startup(void)
+{
+ bool found;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* Create or attach to the shared memory state */
+ LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+ inj_io_error_state = ShmemInitStruct("test_aio",
+ sizeof(InjIoErrorState),
+ &found);
+
+ if (!found)
+ {
+ /* First time through, so initialize. */
+ inj_io_error_state->enabled = false;
+
+#ifdef USE_INJECTION_POINTS
+ InjectionPointAttach("AIO_PROCESS_COMPLETION_BEFORE_SHARED",
+ "test_aio",
+ "inj_io_short_read",
+ NULL,
+ 0);
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+#endif
+ }
+ else
+ {
+#ifdef USE_INJECTION_POINTS
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+ elog(LOG, "injection point loaded");
+#endif
+ }
+
+ LWLockRelease(AddinShmemInitLock);
+}
+
+void
+_PG_init(void)
+{
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+
+ /* Shared memory initialization */
+ prev_shmem_request_hook = shmem_request_hook;
+ shmem_request_hook = test_aio_shmem_request;
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = test_aio_shmem_startup;
+}
+
+
+PG_FUNCTION_INFO_V1(errno_from_string);
+Datum
+errno_from_string(PG_FUNCTION_ARGS)
+{
+ const char *sym = text_to_cstring(PG_GETARG_TEXT_PP(0));
+
+ if (strcmp(sym, "EIO") == 0)
+ PG_RETURN_INT32(EIO);
+ else if (strcmp(sym, "EAGAIN") == 0)
+ PG_RETURN_INT32(EAGAIN);
+ else if (strcmp(sym, "EINTR") == 0)
+ PG_RETURN_INT32(EINTR);
+ else if (strcmp(sym, "ENOSPC") == 0)
+ PG_RETURN_INT32(ENOSPC);
+ else if (strcmp(sym, "EROFS") == 0)
+ PG_RETURN_INT32(EROFS);
+
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg_internal("%s is not a supported errno value", sym));
+ PG_RETURN_INT32(0);
+}
+
+
+PG_FUNCTION_INFO_V1(grow_rel);
+Datum
+grow_rel(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 nblocks = PG_GETARG_UINT32(1);
+ Relation rel;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ while (nblocks > 0)
+ {
+ uint32 extend_by_pages;
+
+ extend_by_pages = Min(nblocks, MAX_BUFFERS_TO_EXTEND_BY);
+
+ ExtendBufferedRelBy(BMR_REL(rel),
+ MAIN_FORKNUM,
+ NULL,
+ 0,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+
+ nblocks -= extend_by_pages;
+
+ for (uint32 i = 0; i < extend_by_pages; i++)
+ {
+ ReleaseBuffer(victim_buffers[i]);
+ }
+ }
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(corrupt_rel_block);
+Datum
+corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ MarkBufferDirty(buf);
+
+ PageInit(page, BufferGetPageSize(buf), 0);
+
+ ph = (PageHeader) page;
+ ph->pd_special = BLCKSZ + 1;
+
+ FlushOneBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(read_corrupt_rel_block);
+Datum
+read_corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ BufferDesc *buf_hdr;
+ Page page;
+ PgAioHandle *ioh;
+ PgAioWaitRef iow;
+ SMgrRelation smgr;
+ uint32 buf_state;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* read buffer without erroring out */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ page = BufferGetBlock(buf);
+
+ ioh = pgaio_io_acquire(CurrentResourceOwner, NULL);
+ pgaio_io_get_wref(ioh, &iow);
+
+ buf_hdr = GetBufferDescriptor(buf - 1);
+ smgr = RelationGetSmgr(rel);
+
+ /* FIXME: even if just a test, we should verify nobody else uses this */
+ buf_state = LockBufHdr(buf_hdr);
+ buf_state &= ~(BM_VALID | BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ StartBufferIO(buf_hdr, true, false);
+
+ pgaio_io_set_handle_data_32(ioh, (uint32 *) &buf, 1);
+ pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV);
+
+ smgrstartreadv(ioh, smgr, MAIN_FORKNUM, block,
+ (void *) &page, 1);
+
+ ReleaseBuffer(buf);
+
+ pgaio_wref_wait(&iow);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(invalidate_rel_block);
+Datum
+invalidate_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ PrefetchBufferResult pr;
+ Buffer buf;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* this is a gross hack, but there's no good API exposed */
+ pr = PrefetchBuffer(rel, MAIN_FORKNUM, block);
+ buf = pr.recent_buffer;
+ elog(LOG, "recent: %d", buf);
+ if (BufferIsValid(buf))
+ {
+ /* if the buffer contents aren't valid, this'll return false */
+ if (ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, block, buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ FlushOneBuffer(buf);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+
+ if (!EvictUnpinnedBuffer(buf))
+ elog(ERROR, "couldn't unpin");
+ }
+ }
+
+ relation_close(rel, AccessExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+#if 0
+PG_FUNCTION_INFO_V1(test_unsubmitted_vs_close);
+Datum
+test_unsubmitted_vs_close(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+
+ MarkBufferDirty(buf);
+ ph->pd_special = BLCKSZ + 1;
+
+ /* last_handle = pgaio_io_acquire(); */
+
+ PG_RETURN_VOID();
+}
+#endif
+
+PG_FUNCTION_INFO_V1(handle_get);
+Datum
+handle_get(PG_FUNCTION_ARGS)
+{
+ last_handle = pgaio_io_acquire(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_release_last);
+Datum
+handle_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_handle)
+ elog(ERROR, "no handle");
+
+ pgaio_io_release(last_handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_and_error);
+Datum
+handle_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_io_acquire(CurrentResourceOwner, NULL);
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_twice);
+Datum
+handle_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_io_acquire(CurrentResourceOwner, NULL);
+ pgaio_io_acquire(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(handle_get_release);
+Datum
+handle_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioHandle *handle;
+
+ handle = pgaio_io_acquire(CurrentResourceOwner, NULL);
+ pgaio_io_release(handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get);
+Datum
+bb_get(PG_FUNCTION_ARGS)
+{
+ last_bb = pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_release_last);
+Datum
+bb_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_bb)
+ elog(ERROR, "no bb");
+
+ pgaio_bounce_buffer_release(last_bb);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_and_error);
+Datum
+bb_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_twice);
+Datum
+bb_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(bb_get_release);
+Datum
+bb_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioBounceBuffer *bb;
+
+ bb = pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_release(bb);
+
+ PG_RETURN_VOID();
+}
+
+#ifdef USE_INJECTION_POINTS
+extern PGDLLEXPORT void inj_io_short_read(const char *name, const void *private_data);
+
+void
+inj_io_short_read(const char *name, const void *private_data)
+{
+ PgAioHandle *ioh;
+
+ elog(LOG, "short read called: %d", inj_io_error_state->enabled);
+
+ if (inj_io_error_state->enabled)
+ {
+ ioh = pgaio_inj_io_get();
+
+ if (inj_io_error_state->result_set)
+ {
+ elog(LOG, "short read, changing result from %d to %d",
+ ioh->result, inj_io_error_state->result);
+
+ ioh->result = inj_io_error_state->result;
+ }
+ }
+}
+#endif
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
+Datum
+inj_io_short_read_attach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = true;
+ inj_io_error_state->result_set = !PG_ARGISNULL(0);
+ if (inj_io_error_state->result_set)
+ inj_io_error_state->result = PG_GETARG_INT32(0);
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_detach);
+Datum
+inj_io_short_read_detach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = false;
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_aio/test_aio.control b/src/test/modules/test_aio/test_aio.control
new file mode 100644
index 00000000000..cd91c3ed16b
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.control
@@ -0,0 +1,3 @@
+comment = 'Test code for AIO'
+default_version = '1.0'
+module_pathname = '$libdir/test_aio'
diff --git a/src/test/modules/test_aio/worker.conf b/src/test/modules/test_aio/worker.conf
new file mode 100644
index 00000000000..8104c201924
--- /dev/null
+++ b/src/test/modules/test_aio/worker.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'worker'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0023-aio-Add-bounce-buffers.patch (text/x-diff)
From 398eabe5a83ff4ed74141b7803b7f6b23d0a3bdd Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 17:13:46 -0500
Subject: [PATCH v2.3 23/30] aio: Add bounce buffers
---
src/include/storage/aio.h | 19 ++
src/include/storage/aio_internal.h | 33 ++++
src/include/utils/resowner.h | 2 +
src/backend/storage/aio/README.md | 27 +++
src/backend/storage/aio/aio.c | 180 ++++++++++++++++++
src/backend/storage/aio/aio_init.c | 123 ++++++++++++
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/backend/utils/resowner/resowner.c | 25 ++-
src/tools/pgindent/typedefs.list | 1 +
10 files changed, 423 insertions(+), 2 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 6f36a0b9e4d..30b08495f3d 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -247,6 +247,10 @@ typedef struct PgAioHandleCallbacks
+typedef struct PgAioBounceBuffer PgAioBounceBuffer;
+
+
+
/* AIO API */
@@ -330,6 +334,20 @@ extern bool pgaio_have_staged(void);
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+extern PgAioBounceBuffer *pgaio_bounce_buffer_get(void);
+extern void pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb);
+extern uint32 pgaio_bounce_buffer_id(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release(PgAioBounceBuffer *bb);
+extern char *pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release_resowner(struct dlist_node *bb_node, bool on_error);
+
+
+
/* --------------------------------------------------------------------------------
* Other
* --------------------------------------------------------------------------------
@@ -345,6 +363,7 @@ extern void assign_io_method(int newval, void *extra);
/* GUCs */
extern PGDLLIMPORT int io_method;
extern PGDLLIMPORT int io_max_concurrency;
+extern PGDLLIMPORT int io_bounce_buffers;
#endif /* AIO_H */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index eff544ce621..531532e306a 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -97,6 +97,12 @@ struct PgAioHandle
*/
uint32 iovec_off;
+ /*
+ * List of bounce buffers owned by this IO handle. An index-based linked
+ * list would suffice here.
+ */
+ slist_head bounce_buffers;
+
/**
* In which list the handle is registered, depends on the state:
* - IDLE, in per-backend list
@@ -133,11 +139,23 @@ struct PgAioHandle
};
+struct PgAioBounceBuffer
+{
+ slist_node node;
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+ char *buffer;
+};
+
+
typedef struct PgAioBackend
{
/* index into PgAioCtl->io_handles */
uint32 io_handle_off;
+ /* index into PgAioCtl->bounce_buffers */
+ uint32 bounce_buffers_off;
+
/* IO Handles that currently are not used */
dclist_head idle_ios;
@@ -165,6 +183,12 @@ typedef struct PgAioBackend
* IOs being appended at the end.
*/
dclist_head in_flight_ios;
+
+ /* Bounce Buffers that currently are not used */
+ slist_head idle_bbs;
+
+ /* see handed_out_io */
+ PgAioBounceBuffer *handed_out_bb;
} PgAioBackend;
@@ -190,6 +214,15 @@ typedef struct PgAioCtl
*/
uint64 *handle_data;
+ /*
+ * Bounce buffers, used to perform AIO on data that cannot be used
+ * in place, either because it does not reside in shared memory or because
+ * we need to operate on a copy (as is e.g. the case for writes when
+ * checksums are in use).
+ */
+ uint64 bounce_buffers_count;
+ PgAioBounceBuffer *bounce_buffers;
+ char *bounce_buffers_data;
+
uint64 io_handle_count;
PgAioHandle *io_handles;
} PgAioCtl;
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index aede4bfc820..7e2ec224169 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -168,5 +168,7 @@ extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *local
struct dlist_node;
extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+extern void ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
#endif /* RESOWNER_H */
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
index 1b6f9d2c40b..dacff46ad12 100644
--- a/src/backend/storage/aio/README.md
+++ b/src/backend/storage/aio/README.md
@@ -412,6 +412,33 @@ shared memory no less!), completion callbacks instead have to encode errors in
a more compact format that can be converted into an error message.
+### AIO Bounce Buffers
+
+For some uses of AIO there is no convenient memory location to use as the
+source / destination of an IO. E.g. when data checksums are enabled, writes
+currently cannot be done directly from shared buffers, as a shared buffer
+lock still allows some modification, e.g. of hint bits (see
+`FlushBuffer()`). If the write were done in place, such modifications could
+cause the checksum to fail.
+
+For synchronous IO this is solved by copying the buffer to separate memory
+before computing the checksum and using that copy as the source buffer for the
+AIO.
+
+However, for AIO that is not a workable solution:
+- Instead of a single buffer, many buffers are required, as many IOs might
+  be in flight.
+- When using the [worker method](#worker), the source/target of the IO needs
+  to be in shared memory; otherwise the IO workers cannot access the memory.
+
+The AIO subsystem addresses this by providing a limited number of bounce
+buffers that can be used as the source / target of IO. A bounce buffer can
+be acquired with `pgaio_bounce_buffer_get()` and multiple bounce buffers can
+be associated with an AIO handle with `pgaio_io_assoc_bounce_buffer()`.
+
+Bounce buffers are automatically released when the IO completes.
+
+
## Helpers
Using the low-level AIO API introduces too much complexity to do so all over
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index b3b4e74c3ce..431f2c2e5af 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -55,6 +55,8 @@ static PgAioHandle *pgaio_io_from_wref(PgAioWaitRef *iow, uint64 *ref_generation
static const char *pgaio_io_state_get_name(PgAioHandleState s);
static void pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation);
+static void pgaio_bounce_buffer_wait_for_free(void);
+
/* Options for io_method. */
const struct config_enum_entry io_method_options[] = {
@@ -69,6 +71,7 @@ const struct config_enum_entry io_method_options[] = {
/* GUCs */
int io_method = DEFAULT_IO_METHOD;
int io_max_concurrency = -1;
+int io_bounce_buffers = -1;
/* global control for AIO */
PgAioCtl *pgaio_ctl;
@@ -588,6 +591,21 @@ pgaio_io_reclaim(PgAioHandle *ioh)
}
}
+ /* reclaim all associated bounce buffers */
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ slist_mutable_iter it;
+
+ slist_foreach_modify(it, &ioh->bounce_buffers)
+ {
+ PgAioBounceBuffer *bb = slist_container(PgAioBounceBuffer, node, it.cur);
+
+ slist_delete_current(&it);
+
+ slist_push_head(&pgaio_my_backend->idle_bbs, &bb->node);
+ }
+ }
+
if (ioh->resowner)
{
ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
@@ -874,6 +892,166 @@ pgaio_have_staged(void)
+/* --------------------------------------------------------------------------------
+ * Functions primarily related to PgAioBounceBuffer
+ * --------------------------------------------------------------------------------
+ */
+
+PgAioBounceBuffer *
+pgaio_bounce_buffer_get(void)
+{
+ PgAioBounceBuffer *bb = NULL;
+ slist_node *node;
+
+ if (pgaio_my_backend->handed_out_bb != NULL)
+ elog(ERROR, "can only hand out one BB");
+
+ /*
+ * FIXME: It probably is not correct to have bounce buffers be
+ * per-backend; they use too much memory.
+ */
+ if (slist_is_empty(&pgaio_my_backend->idle_bbs))
+ {
+ pgaio_bounce_buffer_wait_for_free();
+ }
+
+ node = slist_pop_head_node(&pgaio_my_backend->idle_bbs);
+ bb = slist_container(PgAioBounceBuffer, node, node);
+
+ pgaio_my_backend->handed_out_bb = bb;
+
+ bb->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberAioBounceBuffer(bb->resowner, &bb->resowner_node);
+
+ return bb;
+}
+
+void
+pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb)
+{
+ if (pgaio_my_backend->handed_out_bb != bb)
+ elog(ERROR, "can only assign handed out BB");
+ pgaio_my_backend->handed_out_bb = NULL;
+
+ /*
+ * There can be many bounce buffers assigned in case of vectorized IOs.
+ */
+ slist_push_head(&ioh->bounce_buffers, &bb->node);
+
+ /* once associated with an IO, the IO has ownership */
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+uint32
+pgaio_bounce_buffer_id(PgAioBounceBuffer *bb)
+{
+ return bb - pgaio_ctl->bounce_buffers;
+}
+
+void
+pgaio_bounce_buffer_release(PgAioBounceBuffer *bb)
+{
+ if (pgaio_my_backend->handed_out_bb != bb)
+ elog(ERROR, "can only release handed out BB");
+
+ slist_push_head(&pgaio_my_backend->idle_bbs, &bb->node);
+ pgaio_my_backend->handed_out_bb = NULL;
+
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+void
+pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error)
+{
+ PgAioBounceBuffer *bb = dlist_container(PgAioBounceBuffer, resowner_node, bb_node);
+
+ Assert(bb->resowner);
+
+ if (!on_error)
+ elog(WARNING, "leaked AIO bounce buffer");
+
+ pgaio_bounce_buffer_release(bb);
+}
+
+char *
+pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb)
+{
+ return bb->buffer;
+}
+
+static void
+pgaio_bounce_buffer_wait_for_free(void)
+{
+ static uint32 lastpos = 0;
+
+ if (pgaio_my_backend->num_staged_ios > 0)
+ {
+ pgaio_debug(DEBUG2, "submitting %d, while acquiring free bb",
+ pgaio_my_backend->num_staged_ios);
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = pgaio_my_backend->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &pgaio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case PGAIO_HS_IDLE:
+ case PGAIO_HS_HANDED_OUT:
+ continue;
+ case PGAIO_HS_DEFINED: /* should have been submitted above */
+ case PGAIO_HS_STAGED:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case PGAIO_HS_COMPLETED_IO:
+ case PGAIO_HS_SUBMITTED:
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ pgaio_debug_io(DEBUG2, ioh,
+ "waiting for IO to reclaim BB with %d in flight",
+ dclist_count(&pgaio_my_backend->in_flight_ios));
+
+ /* see comment in pgaio_io_wait_for_free() about raciness */
+ pgaio_io_wait(ioh, ioh->generation);
+
+ if (slist_is_empty(&pgaio_my_backend->idle_bbs))
+ elog(WARNING, "empty after wait");
+
+ if (!slist_is_empty(&pgaio_my_backend->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ }
+ break;
+ case PGAIO_HS_COMPLETED_SHARED:
+ case PGAIO_HS_COMPLETED_LOCAL:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+
+ if (!slist_is_empty(&pgaio_my_backend->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ break;
+ }
+ }
+
+ /*
+ * The submission above could have caused the IO to complete at any time.
+ */
+ if (slist_is_empty(&pgaio_my_backend->idle_bbs))
+ elog(PANIC, "no more bbs");
+}
+
+
+
/* --------------------------------------------------------------------------------
* Other
* --------------------------------------------------------------------------------
@@ -904,6 +1082,7 @@ void
pgaio_at_xact_end(bool is_subxact, bool is_commit)
{
Assert(!pgaio_my_backend->handed_out_io);
+ Assert(!pgaio_my_backend->handed_out_bb);
}
/*
@@ -914,6 +1093,7 @@ void
pgaio_at_error(void)
{
Assert(!pgaio_my_backend->handed_out_io);
+ Assert(!pgaio_my_backend->handed_out_bb);
}
void
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 76fcdf64670..a4f4a0b698e 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -82,6 +82,32 @@ AioHandleDataShmemSize(void)
io_max_concurrency));
}
+static Size
+AioBounceBufferDescShmemSize(void)
+{
+ Size sz;
+
+ /* PgAioBounceBuffer itself */
+ sz = mul_size(sizeof(PgAioBounceBuffer),
+ mul_size(AioProcs(), io_bounce_buffers));
+
+ return sz;
+}
+
+static Size
+AioBounceBufferDataShmemSize(void)
+{
+ Size sz;
+
+ /* and the associated buffer */
+ sz = mul_size(BLCKSZ,
+ mul_size(io_bounce_buffers, AioProcs()));
+ /* memory for alignment */
+ sz += BLCKSZ;
+
+ return sz;
+}
+
/*
* Choose a suitable value for io_max_concurrency.
*
@@ -107,6 +133,33 @@ AioChooseMaxConccurrency(void)
return Min(max_proportional_pins, 64);
}
+/*
+ * Choose a suitable value for io_bounce_buffers.
+ *
+ * It's very unlikely to be useful to allocate more bounce buffers for each
+ * backend than the backend is allowed to pin. Additionally, bounce buffers
+ * are currently only used for writes, and it seems very uncommon for more
+ * than 10% of shared_buffers to be written out concurrently.
+ *
+ * XXX: This can quickly take up significant amounts of memory; the logic
+ * should probably be fine-tuned.
+ */
+static int
+AioChooseBounceBuffers(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = (NBuffers / 10) / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 256);
+}
+
Size
AioShmemSize(void)
{
@@ -130,11 +183,31 @@ AioShmemSize(void)
PGC_S_OVERRIDE);
}
+
+ /*
+ * If io_bounce_buffers is -1, we automatically choose a suitable value.
+ *
+ * See also comment above.
+ */
+ if (io_bounce_buffers == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseBounceBuffers());
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_bounce_buffers == -1) /* failed to apply it? */
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
sz = add_size(sz, AioCtlShmemSize());
sz = add_size(sz, AioBackendShmemSize());
sz = add_size(sz, AioHandleShmemSize());
sz = add_size(sz, AioHandleIOVShmemSize());
sz = add_size(sz, AioHandleDataShmemSize());
+ sz = add_size(sz, AioBounceBufferDescShmemSize());
+ sz = add_size(sz, AioBounceBufferDataShmemSize());
if (pgaio_method_ops->shmem_size)
sz = add_size(sz, pgaio_method_ops->shmem_size());
@@ -149,6 +222,9 @@ AioShmemInit(void)
uint32 io_handle_off = 0;
uint32 iovec_off = 0;
uint32 per_backend_iovecs = io_max_concurrency * PG_IOV_MAX;
+ uint32 bounce_buffers_off = 0;
+ uint32 per_backend_bb = io_bounce_buffers;
+ char *bounce_buffers_data;
pgaio_ctl = (PgAioCtl *)
ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
@@ -160,6 +236,7 @@ AioShmemInit(void)
pgaio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
pgaio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+ pgaio_ctl->bounce_buffers_count = AioProcs() * per_backend_bb;
pgaio_ctl->backend_state = (PgAioBackend *)
ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
@@ -172,6 +249,40 @@ AioShmemInit(void)
pgaio_ctl->handle_data = (uint64 *)
ShmemInitStruct("AioHandleData", AioHandleDataShmemSize(), &found);
+ pgaio_ctl->bounce_buffers = (PgAioBounceBuffer *)
+ ShmemInitStruct("AioBounceBufferDesc", AioBounceBufferDescShmemSize(),
+ &found);
+
+ bounce_buffers_data =
+ ShmemInitStruct("AioBounceBufferData", AioBounceBufferDataShmemSize(),
+ &found);
+ bounce_buffers_data =
+ (char *) TYPEALIGN(BLCKSZ, (uintptr_t) bounce_buffers_data);
+ pgaio_ctl->bounce_buffers_data = bounce_buffers_data;
+
+
+ /* Initialize IO handles. */
+ for (uint64 i = 0; i < pgaio_ctl->io_handle_count; i++)
+ {
+ PgAioHandle *ioh = &pgaio_ctl->io_handles[i];
+
+ ioh->op = PGAIO_OP_INVALID;
+ ioh->target = PGAIO_TID_INVALID;
+ ioh->state = PGAIO_HS_IDLE;
+
+ slist_init(&ioh->bounce_buffers);
+ }
+
+ /* Initialize Bounce Buffers. */
+ for (uint64 i = 0; i < pgaio_ctl->bounce_buffers_count; i++)
+ {
+ PgAioBounceBuffer *bb = &pgaio_ctl->bounce_buffers[i];
+
+ bb->buffer = bounce_buffers_data;
+ bounce_buffers_data += BLCKSZ;
+ }
+
+
for (int procno = 0; procno < AioProcs(); procno++)
{
PgAioBackend *bs = &pgaio_ctl->backend_state[procno];
@@ -179,9 +290,13 @@ AioShmemInit(void)
bs->io_handle_off = io_handle_off;
io_handle_off += io_max_concurrency;
+ bs->bounce_buffers_off = bounce_buffers_off;
+ bounce_buffers_off += per_backend_bb;
+
dclist_init(&bs->idle_ios);
memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
dclist_init(&bs->in_flight_ios);
+ slist_init(&bs->idle_bbs);
/* initialize per-backend IOs */
for (int i = 0; i < io_max_concurrency; i++)
@@ -203,6 +318,14 @@ AioShmemInit(void)
dclist_push_tail(&bs->idle_ios, &ioh->node);
iovec_off += PG_IOV_MAX;
}
+
+ /* initialize per-backend bounce buffers */
+ for (int i = 0; i < per_backend_bb; i++)
+ {
+ PgAioBounceBuffer *bb = &pgaio_ctl->bounce_buffers[bs->bounce_buffers_off + i];
+
+ slist_push_head(&bs->idle_bbs, &bb->node);
+ }
}
out:
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8a83dcc820d..57865d45124 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3234,6 +3234,19 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_bounce_buffers",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of I/O bounce buffers reserved for each backend."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &io_bounce_buffers,
+ -1, -1, 4096,
+ NULL, NULL, NULL
+ },
+
{
{"io_workers",
PGC_SIGHUP,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5005e65cee0..294d661ebf4 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -853,6 +853,8 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#io_max_concurrency = 32 # Max number of IOs that may be in
# flight at the same time in one backend
# (change requires restart)
+#io_bounce_buffers = -1 # -1 chooses a default based on shared_buffers
+ # (change requires restart)
#------------------------------------------------------------------------------
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index e5d852b5ee6..9db3c07326c 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -159,10 +159,11 @@ struct ResourceOwnerData
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
/*
- * AIO handles need be registered in critical sections and therefore
- * cannot use the normal ResoureElem mechanism.
+ * AIO handles & bounce buffers need to be registered in critical sections
+ * and therefore cannot use the normal ResourceElem mechanism.
*/
dlist_head aio_handles;
+ dlist_head aio_bounce_buffers;
};
@@ -434,6 +435,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
}
dlist_init(&owner->aio_handles);
+ dlist_init(&owner->aio_bounce_buffers);
return owner;
}
@@ -743,6 +745,13 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
pgaio_io_release_resowner(node, !isCommit);
}
+
+ while (!dlist_is_empty(&owner->aio_bounce_buffers))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_bounce_buffers);
+
+ pgaio_bounce_buffer_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1112,3 +1121,15 @@ ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
{
dlist_delete_from(&owner->aio_handles, ioh_node);
}
+
+void
+ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_bounce_buffers, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_bounce_buffers, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index be2dd22f1d7..b3f06711e6a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2110,6 +2110,7 @@ PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
PgAioBackend
+PgAioBounceBuffer
PgAioCtl
PgAioHandle
PgAioHandleCallbackID
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0001-checkpointer-Request-checkpoint-via-latch-inste.patch
From 369d7d8f81f26bdaf4097c6acd09b58cc8f8d151 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 10 Jan 2025 11:11:40 -0500
Subject: [PATCH v2.3 01/30] checkpointer: Request checkpoint via latch instead
of signal
The main reason for this is that a future commit would like to use SIGINT for
another purpose. But it's also a tad nicer and a tad more efficient to use
SetLatch(), as it avoids a signal when checkpointer is already busy.
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
src/backend/postmaster/checkpointer.c | 60 +++++++++------------------
1 file changed, 19 insertions(+), 41 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..dd2c8376c6e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -159,9 +159,6 @@ static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
-/* Signal handlers */
-static void ReqCheckpointHandler(SIGNAL_ARGS);
-
/*
* Main entry point for checkpointer process
@@ -191,7 +188,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* tell us it's okay to shut down (via SIGUSR2).
*/
pqsignal(SIGHUP, SignalHandlerForConfigReload);
- pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
+ pqsignal(SIGINT, SIG_IGN);
pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
/* SIGQUIT handler was already set up by InitPostmasterChild */
pqsignal(SIGALRM, SIG_IGN);
@@ -860,23 +857,6 @@ IsCheckpointOnSchedule(double progress)
}
-/* --------------------------------
- * signal handler routines
- * --------------------------------
- */
-
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
- /*
- * The signaling process should have set ckpt_flags nonzero, so all we
- * need do is ensure that our main loop gets kicked out of any wait.
- */
- SetLatch(MyLatch);
-}
-
-
/* --------------------------------
* communication with backends
* --------------------------------
@@ -990,38 +970,36 @@ RequestCheckpoint(int flags)
SpinLockRelease(&CheckpointerShmem->ckpt_lck);
/*
- * Send signal to request checkpoint. It's possible that the checkpointer
- * hasn't started yet, or is in process of restarting, so we will retry a
- * few times if needed. (Actually, more than a few times, since on slow
- * or overloaded buildfarm machines, it's been observed that the
- * checkpointer can take several seconds to start.) However, if not told
- * to wait for the checkpoint to occur, we consider failure to send the
- * signal to be nonfatal and merely LOG it. The checkpointer should see
- * the request when it does start, with or without getting a signal.
+ * Set checkpointer's latch to request checkpoint. It's possible that the
+ * checkpointer hasn't started yet, so we will retry a few times if
+ * needed. (Actually, more than a few times, since on slow or overloaded
+ * buildfarm machines, it's been observed that the checkpointer can take
+ * several seconds to start.) However, if not told to wait for the
+ * checkpoint to occur, we consider failure to set the latch to be
+ * nonfatal and merely LOG it. The checkpointer should see the request
+ * when it does start, with or without the SetLatch().
*/
#define MAX_SIGNAL_TRIES 600 /* max wait 60.0 sec */
for (ntries = 0;; ntries++)
{
- if (CheckpointerShmem->checkpointer_pid == 0)
+ volatile PROC_HDR *procglobal = ProcGlobal;
+ ProcNumber checkpointerProc = procglobal->checkpointerProc;
+
+ if (checkpointerProc == INVALID_PROC_NUMBER)
{
if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
{
elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
- "could not signal for checkpoint: checkpointer is not running");
- break;
- }
- }
- else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
- {
- if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
- {
- elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
- "could not signal for checkpoint: %m");
+ "could not notify checkpoint: checkpointer is not running");
break;
}
}
else
- break; /* signal sent successfully */
+ {
+ SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
+ /* notified successfully */
+ break;
+ }
CHECK_FOR_INTERRUPTS();
pg_usleep(100000L); /* wait 0.1 sec, then retry */
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0002-postmaster-Don-t-open-code-TerminateChildren-in.patch
From ea4f243a510b7151f0853b8b984fc81070c618c2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 13 Jan 2025 23:20:25 -0500
Subject: [PATCH v2.3 02/30] postmaster: Don't open-code TerminateChildren() in
HandleChildCrash()
After removing the duplication, no user of sigquit_child() remains, so
remove it.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/postmaster/postmaster.c | 42 +++--------------------------
1 file changed, 4 insertions(+), 38 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5f615d0f605..8153edc446c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -424,7 +424,6 @@ static int BackendStartup(ClientSocket *client_sock);
static void report_fork_failure_to_client(ClientSocket *client_sock, int errnum);
static CAC_state canAcceptConnections(BackendType backend_type);
static void signal_child(PMChild *pmchild, int signal);
-static void sigquit_child(PMChild *pmchild);
static bool SignalChildren(int signal, BackendTypeMask targetMask);
static void TerminateChildren(int signal);
static int CountChildren(BackendTypeMask targetMask);
@@ -2699,32 +2698,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
/*
* Signal all other child processes to exit. The crashed process has
* already been removed from ActiveChildList.
+ *
+ * We could exclude dead-end children here, but at least when sending
+ * SIGABRT it seems better to include them.
*/
if (take_action)
- {
- dlist_iter iter;
-
- dlist_foreach(iter, &ActiveChildList)
- {
- PMChild *bp = dlist_container(PMChild, elem, iter.cur);
-
- /* We do NOT restart the syslogger */
- if (bp == SysLoggerPMChild)
- continue;
-
- if (bp == StartupPMChild)
- StartupStatus = STARTUP_SIGNALED;
-
- /*
- * This backend is still alive. Unless we did so already, tell it
- * to commit hara-kiri.
- *
- * We could exclude dead-end children here, but at least when
- * sending SIGABRT it seems better to include them.
- */
- sigquit_child(bp);
- }
- }
+ TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
if (Shutdown != ImmediateShutdown)
FatalError = true;
@@ -3347,19 +3326,6 @@ signal_child(PMChild *pmchild, int signal)
#endif
}
-/*
- * Convenience function for killing a child process after a crash of some
- * other child process. We apply send_abort_for_crash to decide which signal
- * to send. Normally it's SIGQUIT -- and most other comments in this file are
- * written on the assumption that it is -- but developers might prefer to use
- * SIGABRT to collect per-child core dumps.
- */
-static void
-sigquit_child(PMChild *pmchild)
-{
- signal_child(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT));
-}
-
/*
* Send a signal to the targeted children.
*/
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0003-postmaster-Don-t-repeatedly-transition-to-crash.patch
From ae79a4158d88ab0fbe78df9ab6ec15be3152343a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 13 Jan 2025 23:30:46 -0500
Subject: [PATCH v2.3 03/30] postmaster: Don't repeatedly transition to
crashing state
Previously HandleChildCrash() skipped logging and signalling child exits if
already in an immediate shutdown or FatalError, but still transitioned server
state in response to a crash. That's redundant.
To make it easier to combine different paths for entering FatalError state,
only do so once.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/postmaster/postmaster.c | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8153edc446c..939b1b2ef82 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2676,8 +2676,6 @@ CleanupBackend(PMChild *bp,
static void
HandleChildCrash(int pid, int exitstatus, const char *procname)
{
- bool take_action;
-
/*
* We only log messages and send signals if this is the first process
* crash and we're not doing an immediate shutdown; otherwise, we're only
@@ -2685,15 +2683,13 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
* signaled children, nonzero exit status is to be expected, so don't
* clutter log.
*/
- take_action = !FatalError && Shutdown != ImmediateShutdown;
+ if (FatalError || Shutdown == ImmediateShutdown)
+ return;
- if (take_action)
- {
- LogChildExit(LOG, procname, pid, exitstatus);
- ereport(LOG,
- (errmsg("terminating any other active server processes")));
- SetQuitSignalReason(PMQUIT_FOR_CRASH);
- }
+ LogChildExit(LOG, procname, pid, exitstatus);
+ ereport(LOG,
+ (errmsg("terminating any other active server processes")));
+ SetQuitSignalReason(PMQUIT_FOR_CRASH);
/*
* Signal all other child processes to exit. The crashed process has
@@ -2702,8 +2698,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
* We could exclude dead-end children here, but at least when sending
* SIGABRT it seems better to include them.
*/
- if (take_action)
- TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
+ TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
if (Shutdown != ImmediateShutdown)
FatalError = true;
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0004-postmaster-Move-code-to-switch-into-FatalError-.patch
From c816b542699fdde710bbf5a909be45ffc9b8488e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:57:12 -0500
Subject: [PATCH v2.3 04/30] postmaster: Move code to switch into FatalError
state into function
There are two places switching to FatalError mode, behaving somewhat
differently. An upcoming commit will introduce a third. That doesn't seem
like a good idea.
This commit just moves the FatalError related code from HandleChildCrash()
into its own function, a subsequent commit will evolve the state machine
change to be suitable for other callers.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/postmaster/postmaster.c | 70 +++++++++++++++++++----------
1 file changed, 46 insertions(+), 24 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 939b1b2ef82..13d49eecd22 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2665,40 +2665,29 @@ CleanupBackend(PMChild *bp,
}
/*
- * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, archiver, slot sync worker, or background worker.
- *
- * The objectives here are to clean up our local state about the child
- * process, and to signal all other remaining children to quickdie.
- *
- * The caller has already released its PMChild slot.
+ * Transition into FatalError state, in response to something bad having
+ * happened. Commonly the caller will have logged the reason for entering
+ * FatalError state.
*/
static void
-HandleChildCrash(int pid, int exitstatus, const char *procname)
+HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
{
- /*
- * We only log messages and send signals if this is the first process
- * crash and we're not doing an immediate shutdown; otherwise, we're only
- * here to update postmaster's idea of live processes. If we have already
- * signaled children, nonzero exit status is to be expected, so don't
- * clutter log.
- */
- if (FatalError || Shutdown == ImmediateShutdown)
- return;
+ int sigtosend;
+
+ SetQuitSignalReason(reason);
- LogChildExit(LOG, procname, pid, exitstatus);
- ereport(LOG,
- (errmsg("terminating any other active server processes")));
- SetQuitSignalReason(PMQUIT_FOR_CRASH);
+ if (consider_sigabrt && send_abort_for_crash)
+ sigtosend = SIGABRT;
+ else
+ sigtosend = SIGQUIT;
/*
- * Signal all other child processes to exit. The crashed process has
- * already been removed from ActiveChildList.
+ * Signal all other child processes to exit.
*
* We could exclude dead-end children here, but at least when sending
* SIGABRT it seems better to include them.
*/
- TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
+ TerminateChildren(sigtosend);
if (Shutdown != ImmediateShutdown)
FatalError = true;
@@ -2719,6 +2708,39 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
AbortStartTime = time(NULL);
}
+/*
+ * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
+ * walwriter, autovacuum, archiver, slot sync worker, or background worker.
+ *
+ * The objectives here are to clean up our local state about the child
+ * process, and to signal all other remaining children to quickdie.
+ *
+ * The caller has already released its PMChild slot.
+ */
+static void
+HandleChildCrash(int pid, int exitstatus, const char *procname)
+{
+ /*
+ * We only log messages and send signals if this is the first process
+ * crash and we're not doing an immediate shutdown; otherwise, we're only
+ * here to update postmaster's idea of live processes. If we have already
+ * signaled children, nonzero exit status is to be expected, so don't
+ * clutter log.
+ */
+ if (FatalError || Shutdown == ImmediateShutdown)
+ return;
+
+ LogChildExit(LOG, procname, pid, exitstatus);
+ ereport(LOG,
+ (errmsg("terminating any other active server processes")));
+
+ /*
+ * Switch into error state. The crashed process has already been removed
+ * from ActiveChildList.
+ */
+ HandleFatalError(PMQUIT_FOR_CRASH, true);
+}
+
/*
* Log the death of a child process.
*/
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0005-WIP-postmaster-Commonalize-FatalError-paths.patch
From 97b4983b1443d03525b0565eb104b359a43044af Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:25:01 -0500
Subject: [PATCH v2.3 05/30] WIP: postmaster: Commonalize FatalError paths
This includes some behavioural changes:
- Previously PM_WAIT_XLOG_ARCHIVAL wasn't handled in HandleFatalError(), that
doesn't seem quite right.
- Failure to fork checkpointer now transitions through PM_WAIT_BACKENDS, like
child crashes. That's not necessarily great, but...
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/postmaster/postmaster.c | 61 +++++++++++++++++++++++------
1 file changed, 49 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 13d49eecd22..41f2bbc214c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2693,12 +2693,47 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
FatalError = true;
/* We now transit into a state of waiting for children to die */
- if (pmState == PM_RECOVERY ||
- pmState == PM_HOT_STANDBY ||
- pmState == PM_RUN ||
- pmState == PM_STOP_BACKENDS ||
- pmState == PM_WAIT_XLOG_SHUTDOWN)
- UpdatePMState(PM_WAIT_BACKENDS);
+ switch (pmState)
+ {
+ case PM_INIT:
+ /* shouldn't have any children */
+ Assert(false);
+ break;
+ case PM_STARTUP:
+ /* should have been handled in process_pm_child_exit */
+ Assert(false);
+ break;
+
+ /* wait for children to die */
+ case PM_RECOVERY:
+ case PM_HOT_STANDBY:
+ case PM_RUN:
+ case PM_STOP_BACKENDS:
+ UpdatePMState(PM_WAIT_BACKENDS);
+ break;
+
+ case PM_WAIT_BACKENDS:
+ /* there might be more backends to wait for */
+ break;
+
+ case PM_WAIT_XLOG_SHUTDOWN:
+ case PM_WAIT_XLOG_ARCHIVAL:
+
+ /*
+ * Note that we switch *back* to PM_WAIT_BACKENDS here. This way
+ * the PM_WAIT_BACKENDS && FatalError code in
+ * PostmasterStateMachine does not have to be duplicated.
+ *
+ * XXX: This seems rather ugly, but it's not obvious if the
+ * alternative is better.
+ */
+ UpdatePMState(PM_WAIT_BACKENDS);
+ break;
+
+ case PM_WAIT_DEAD_END:
+ case PM_NO_CHILDREN:
+ break;
+ }
/*
* .. and if this doesn't happen quickly enough, now the clock is ticking
@@ -2836,6 +2871,9 @@ PostmasterStateMachine(void)
* PM_WAIT_BACKENDS, but we signal the processes first, before waiting for
* them. Treating it as a distinct pmState allows us to share this code
* across multiple shutdown code paths.
+ *
+ * Note that HandleFatalError() switches to PM_WAIT_BACKENDS even if we
+ * were, before the fatal error, in a "more advanced" state.
*/
if (pmState == PM_STOP_BACKENDS || pmState == PM_WAIT_BACKENDS)
{
@@ -2967,13 +3005,12 @@ PostmasterStateMachine(void)
* We don't consult send_abort_for_crash here, as it's
* unlikely that dumping cores would illuminate the reason
* for checkpointer fork failure.
+ *
+ * XXX: Is it worth inventing a different PMQUIT value
+ * that signals that the cluster is in a bad state,
+ * without a process having crashed?
*/
- FatalError = true;
- UpdatePMState(PM_WAIT_DEAD_END);
- ConfigurePostmasterWaitSet(false);
-
- /* Kill the walsenders and archiver too */
- SignalChildren(SIGQUIT, btmask_all_except(B_LOGGER));
+ HandleFatalError(PMQUIT_FOR_CRASH, false);
}
}
}
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0006-postmaster-Adjust-which-processes-we-expect-to-.patch
From 8f44b56322e97dbf7f5e8e514c8e6d3e603b73bd Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:31:03 -0500
Subject: [PATCH v2.3 06/30] postmaster: Adjust which processes we expect to
have exited
Comments and code stated that we expect checkpointer to have been signalled in
case of immediate shutdown / fatal errors, but didn't treat archiver and
walsenders the same. That doesn't seem right.
I had started digging through the history to see where this oddity was
introduced, but it's not the fault of a single commit.
Instead treat archiver, checkpointer, and walsenders the same.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/postmaster/postmaster.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 41f2bbc214c..54801a32609 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2906,16 +2906,20 @@ PostmasterStateMachine(void)
/*
* If we are doing crash recovery or an immediate shutdown then we
- * expect the checkpointer to exit as well, otherwise not.
+ * expect archiver, checkpointer and walsender to exit as well,
+ * otherwise not.
*/
if (FatalError || Shutdown >= ImmediateShutdown)
- targetMask = btmask_add(targetMask, B_CHECKPOINTER);
+ targetMask = btmask_add(targetMask,
+ B_CHECKPOINTER,
+ B_ARCHIVER,
+ B_WAL_SENDER);
/*
- * Walsenders and archiver will continue running; they will be
- * terminated later after writing the checkpoint record. We also let
- * dead-end children to keep running for now. The syslogger process
- * exits last.
+ * Normally walsenders and archiver will continue running; they will
+ * be terminated later after writing the checkpoint record. We also
+ * let dead-end children keep running for now. The syslogger
+ * process exits last.
*
* This assertion checks that we have covered all backend types,
* either by including them in targetMask, or by noting here that they
@@ -2926,13 +2930,17 @@ PostmasterStateMachine(void)
BackendTypeMask remainMask = BTYPE_MASK_NONE;
remainMask = btmask_add(remainMask,
- B_WAL_SENDER,
- B_ARCHIVER,
B_DEAD_END_BACKEND,
B_LOGGER);
- /* checkpointer may or may not be in targetMask already */
- remainMask = btmask_add(remainMask, B_CHECKPOINTER);
+ /*
+ * Archiver, checkpointer and walsender may or may not be in
+ * targetMask already.
+ */
+ remainMask = btmask_add(remainMask,
+ B_ARCHIVER,
+ B_CHECKPOINTER,
+ B_WAL_SENDER);
/* these are not real postmaster children */
remainMask = btmask_add(remainMask,
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0007-Change-shutdown-sequence-to-terminate-checkpoin.patch
From ecb9f5995b5f0b38b01c8b86168aa848c9459c83 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 01:18:42 -0500
Subject: [PATCH v2.3 07/30] Change shutdown sequence to terminate checkpointer
last
The main motivation for this change is to have a process that can serialize
stats after all other processes have terminated. Serializing stats already
happens in checkpointer, even though walsenders can be active longer.
The only reason the current state does not actively cause problems is that
walsenders don't currently generate any stats. However, there is a patch to change
that.
Another need for this change originates in the AIO patchset, where IO
workers (which, in some edge cases, can emit stats of their own) need to run
while the shutdown checkpoint is being written.
This commit changes the shutdown sequence so checkpointer is signalled (via
SIGINT) to trigger writing the shutdown checkpoint without terminating
it. Once checkpointer has written the checkpoint, it will wait for a termination
signal (SIGUSR2, as before).
Postmaster now triggers the shutdown checkpoint via SIGINT, where we
previously did so by terminating checkpointer. Checkpointer is now terminated
after all children other than dead-end ones have been terminated, tracked
using the new PM_WAIT_CHECKPOINTER PMState.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
src/include/storage/pmsignal.h | 3 +-
src/backend/postmaster/checkpointer.c | 125 +++++++++++----
src/backend/postmaster/postmaster.c | 143 +++++++++++++-----
.../utils/activity/wait_event_names.txt | 1 +
4 files changed, 200 insertions(+), 72 deletions(-)
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
PMSIGNAL_BACKGROUND_WORKER_CHANGE, /* background worker state change */
PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+ PMSIGNAL_XLOG_IS_SHUTDOWN, /* ShutdownXLOG() completed */
} PMSignalReason;
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
/*
* Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index dd2c8376c6e..767bf9f5cf8 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -10,10 +10,13 @@
* fill WAL segments; the checkpointer itself doesn't watch for the
* condition.)
*
- * Normal termination is by SIGUSR2, which instructs the checkpointer to
- * execute a shutdown checkpoint and then exit(0). (All backends must be
- * stopped before SIGUSR2 is issued!) Emergency termination is by SIGQUIT;
- * like any backend, the checkpointer will simply abort and exit on SIGQUIT.
+ * The normal termination sequence is that checkpointer is instructed to
+ * execute the shutdown checkpoint by SIGINT. After that checkpointer waits
+ * to be terminated via SIGUSR2, which instructs the checkpointer to exit(0).
+ * All backends must be stopped before SIGINT or SIGUSR2 is issued!
+ *
+ * Emergency termination is by SIGQUIT; like any backend, the checkpointer
+ * will simply abort and exit on SIGQUIT.
*
* If the checkpointer exits unexpectedly, the postmaster treats that the same
* as a backend crash: shared memory may be corrupted, so remaining backends
@@ -51,6 +54,7 @@
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
#include "storage/shmem.h"
@@ -141,6 +145,7 @@ double CheckPointCompletionTarget = 0.9;
* Private state
*/
static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
/* these values are valid when ckpt_active is true: */
static pg_time_t ckpt_start_time;
@@ -159,6 +164,9 @@ static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
+/* Signal handlers */
+static void ReqShutdownXLOG(SIGNAL_ARGS);
+
/*
* Main entry point for checkpointer process
@@ -188,7 +196,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* tell us it's okay to shut down (via SIGUSR2).
*/
pqsignal(SIGHUP, SignalHandlerForConfigReload);
- pqsignal(SIGINT, SIG_IGN);
+ pqsignal(SIGINT, ReqShutdownXLOG);
pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
/* SIGQUIT handler was already set up by InitPostmasterChild */
pqsignal(SIGALRM, SIG_IGN);
@@ -211,8 +219,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* process during a normal shutdown, and since checkpointer is shut down
* very late...
*
- * Walsenders are shut down after the checkpointer, but currently don't
- * report stats. If that changes, we need a more complicated solution.
+ * While e.g. walsenders are active after the shutdown checkpoint has been
+ * written (and thus could produce more stats), checkpointer stays around
+ * after the shutdown checkpoint has been written. postmaster will only
+ * signal checkpointer to exit after all processes that could emit stats
+ * have been shut down.
*/
before_shmem_exit(pgstat_before_server_shutdown, 0);
@@ -327,7 +338,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
ProcGlobal->checkpointerProc = MyProcNumber;
/*
- * Loop forever
+ * Loop until we've been asked to write shutdown checkpoint or terminate.
*/
for (;;)
{
@@ -346,7 +357,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* Process any requests or signals received recently.
*/
AbsorbSyncRequests();
+
HandleCheckpointerInterrupts();
+ if (ShutdownXLOGPending || ShutdownRequestPending)
+ break;
/*
* Detect a pending checkpoint request by checking whether the flags
@@ -517,8 +531,13 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
ckpt_active = false;
- /* We may have received an interrupt during the checkpoint. */
+ /*
+ * We may have received an interrupt during the checkpoint and the
+ * latch might have been reset (e.g. in CheckpointWriteDelay).
+ */
HandleCheckpointerInterrupts();
+ if (ShutdownXLOGPending || ShutdownRequestPending)
+ break;
}
/* Check for archive_timeout and switch xlog files if necessary. */
@@ -557,6 +576,56 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
cur_timeout * 1000L /* convert to ms */ ,
WAIT_EVENT_CHECKPOINTER_MAIN);
}
+
+ /*
+ * From here on, elog(ERROR) should end with exit(1), not send control
+ * back to the sigsetjmp block above.
+ */
+ ExitOnAnyError = true;
+
+ if (ShutdownXLOGPending)
+ {
+ /*
+ * Close down the database.
+ *
+ * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+ * updates the statistics, increment the checkpoint request and flush
+ * out pending statistics.
+ */
+ PendingCheckpointerStats.num_requested++;
+ ShutdownXLOG(0, 0);
+ pgstat_report_checkpointer();
+ pgstat_report_wal(true);
+
+ /*
+ * Tell postmaster that we're done.
+ */
+ SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+ }
+
+ /*
+ * Wait until we're asked to shut down. By separating the writing of the
+ * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+ * some should-be-as-late-as-possible work like writing out stats.
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleCheckpointerInterrupts();
+
+ if (ShutdownRequestPending)
+ break;
+
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+ 0,
+ WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+ }
+
+ /* Normal exit from the checkpointer is here */
+ proc_exit(0); /* done */
}
/*
@@ -586,29 +655,6 @@ HandleCheckpointerInterrupts(void)
*/
UpdateSharedMemoryConfig();
}
- if (ShutdownRequestPending)
- {
- /*
- * From here on, elog(ERROR) should end with exit(1), not send control
- * back to the sigsetjmp block above
- */
- ExitOnAnyError = true;
-
- /*
- * Close down the database.
- *
- * Since ShutdownXLOG() creates restartpoint or checkpoint, and
- * updates the statistics, increment the checkpoint request and flush
- * out pending statistic.
- */
- PendingCheckpointerStats.num_requested++;
- ShutdownXLOG(0, 0);
- pgstat_report_checkpointer();
- pgstat_report_wal(true);
-
- /* Normal exit from the checkpointer is here */
- proc_exit(0); /* done */
- }
/* Perform logging of memory contexts of this process */
if (LogMemoryContextPending)
@@ -729,6 +775,7 @@ CheckpointWriteDelay(int flags, double progress)
* in which case we just try to catch up as quickly as possible.
*/
if (!(flags & CHECKPOINT_IMMEDIATE) &&
+ !ShutdownXLOGPending &&
!ShutdownRequestPending &&
!ImmediateCheckpointRequested() &&
IsCheckpointOnSchedule(progress))
@@ -857,6 +904,20 @@ IsCheckpointOnSchedule(double progress)
}
+/* --------------------------------
+ * signal handler routines
+ * --------------------------------
+ */
+
+/* SIGINT: set flag to trigger writing of shutdown checkpoint */
+static void
+ReqShutdownXLOG(SIGNAL_ARGS)
+{
+ ShutdownXLOGPending = true;
+ SetLatch(MyLatch);
+}
+
+
/* --------------------------------
* communication with backends
* --------------------------------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 54801a32609..115ad3d31d2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -334,6 +334,7 @@ typedef enum
* ckpt */
PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to
* finish */
+ PM_WAIT_CHECKPOINTER, /* waiting for checkpointer to shut down */
PM_WAIT_DEAD_END, /* waiting for dead-end children to exit */
PM_NO_CHILDREN, /* all important children have exited */
} PMState;
@@ -2354,35 +2355,19 @@ process_pm_child_exit(void)
{
ReleasePostmasterChildSlot(CheckpointerPMChild);
CheckpointerPMChild = NULL;
- if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
+ if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
{
/*
* OK, we saw normal exit of the checkpointer after it's been
- * told to shut down. We expect that it wrote a shutdown
- * checkpoint. (If for some reason it didn't, recovery will
- * occur on next postmaster start.)
+ * told to shut down. We know checkpointer wrote a shutdown
+ * checkpoint, otherwise we'd still be in
+ * PM_WAIT_XLOG_SHUTDOWN state.
*
- * At this point we should have no normal backend children
- * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
- * we might have dead-end children to wait for.
- *
- * If we have an archiver subprocess, tell it to do a last
- * archive cycle and quit. Likewise, if we have walsender
- * processes, tell them to send any remaining WAL and quit.
- */
- Assert(Shutdown > NoShutdown);
-
- /* Waken archiver for the last time */
- if (PgArchPMChild != NULL)
- signal_child(PgArchPMChild, SIGUSR2);
-
- /*
- * Waken walsenders for the last time. No regular backends
- * should be around anymore.
+ * At this point only dead-end children should be left.
*/
- SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
- UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+ UpdatePMState(PM_WAIT_DEAD_END);
+ ConfigurePostmasterWaitSet(false);
+ SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
}
else
{
@@ -2718,6 +2703,7 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
case PM_WAIT_XLOG_SHUTDOWN:
case PM_WAIT_XLOG_ARCHIVAL:
+ case PM_WAIT_CHECKPOINTER:
/*
* Note that we switch *back* to PM_WAIT_BACKENDS here. This way
@@ -2980,9 +2966,9 @@ PostmasterStateMachine(void)
SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
/*
- * We already SIGQUIT'd walsenders and the archiver, if any,
- * when we started immediate shutdown or entered FatalError
- * state.
+ * We already SIGQUIT'd archiver, checkpointer and walsenders,
+ * if any, when we started immediate shutdown or entered
+ * FatalError state.
*/
}
else
@@ -2996,10 +2982,10 @@ PostmasterStateMachine(void)
/* Start the checkpointer if not running */
if (CheckpointerPMChild == NULL)
CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
- /* And tell it to shut down */
+ /* And tell it to write the shutdown checkpoint */
if (CheckpointerPMChild != NULL)
{
- signal_child(CheckpointerPMChild, SIGUSR2);
+ signal_child(CheckpointerPMChild, SIGINT);
UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
}
else
@@ -3024,22 +3010,39 @@ PostmasterStateMachine(void)
}
}
+ /*
+ * The state transition from PM_WAIT_XLOG_SHUTDOWN to
+ * PM_WAIT_XLOG_ARCHIVAL is in process_pm_pmsignal(), in response to
+ * PMSIGNAL_XLOG_IS_SHUTDOWN.
+ */
+
if (pmState == PM_WAIT_XLOG_ARCHIVAL)
{
/*
- * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
- * than dead-end children left. There shouldn't be any regular
- * backends left by now anyway; what we're really waiting for is
- * walsenders and archiver.
+ * PM_WAIT_XLOG_ARCHIVAL state ends when there are no children other
+ * than checkpointer and dead-end children left. There shouldn't be
+ * any regular backends left by now anyway; what we're really waiting
+ * for is for walsenders and archiver to exit.
*/
- if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+ if (CountChildren(btmask_all_except(B_CHECKPOINTER, B_LOGGER, B_DEAD_END_BACKEND)) == 0)
{
- UpdatePMState(PM_WAIT_DEAD_END);
- ConfigurePostmasterWaitSet(false);
- SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+ UpdatePMState(PM_WAIT_CHECKPOINTER);
+
+ /*
+ * Now that everyone important is gone, tell checkpointer to shut
+ * down too. That allows checkpointer to perform some last bits of
+ * cleanup without other processes interfering.
+ */
+ if (CheckpointerPMChild != NULL)
+ signal_child(CheckpointerPMChild, SIGUSR2);
}
}
+ /*
+ * The state transition from PM_WAIT_CHECKPOINTER to PM_WAIT_DEAD_END is
+ * in process_pm_child_exit().
+ */
+
if (pmState == PM_WAIT_DEAD_END)
{
/*
@@ -3176,6 +3179,7 @@ pmstate_name(PMState state)
PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+ PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
PM_TOSTR_CASE(PM_NO_CHILDREN);
}
#undef PM_TOSTR_CASE
@@ -3593,6 +3597,8 @@ ExitPostmaster(int status)
static void
process_pm_pmsignal(void)
{
+ bool request_state_update = false;
+
pending_pm_pmsignal = false;
ereport(DEBUG2,
@@ -3704,9 +3710,67 @@ process_pm_pmsignal(void)
WalReceiverRequested = true;
}
+ if (CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+ {
+ /* Checkpointer completed the shutdown checkpoint */
+ if (pmState == PM_WAIT_XLOG_SHUTDOWN)
+ {
+ /*
+ * If we have an archiver subprocess, tell it to do a last archive
+ * cycle and quit. Likewise, if we have walsender processes, tell
+ * them to send any remaining WAL and quit.
+ */
+ Assert(Shutdown > NoShutdown);
+
+ /* Waken archiver for the last time */
+ if (PgArchPMChild != NULL)
+ signal_child(PgArchPMChild, SIGUSR2);
+
+ /*
+ * Waken walsenders for the last time. No regular backends should
+ * be around anymore.
+ */
+ SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+ UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+ }
+ else if (!FatalError && Shutdown != ImmediateShutdown)
+ {
+ /*
+ * Checkpointer only ought to perform the shutdown checkpoint
+ * during shutdown. If somehow checkpointer did so in another
+ * situation, we have no choice but to crash-restart.
+ *
+ * It's possible however that we get PMSIGNAL_XLOG_IS_SHUTDOWN
+ * outside of PM_WAIT_XLOG_SHUTDOWN if an orderly shutdown was
+ * "interrupted" by a crash or an immediate shutdown.
+ */
+ ereport(LOG,
+ (errmsg("WAL was shut down unexpectedly")));
+
+ /*
+ * Doesn't seem likely to help to take send_abort_for_crash into
+ * account here.
+ */
+ HandleFatalError(PMQUIT_FOR_CRASH, false);
+ }
+
+ /*
+ * Need to run PostmasterStateMachine() to check whether we can already
+ * advance to the next state.
+ */
+ request_state_update = true;
+ }
+
/*
* Try to advance postmaster's state machine, if a child requests it.
- *
+ */
+ if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+ {
+ request_state_update = true;
+ }
+
+ /*
* Be careful about the order of this action relative to this function's
* other actions. Generally, this should be after other actions, in case
* they have effects PostmasterStateMachine would need to know about.
@@ -3714,7 +3778,7 @@ process_pm_pmsignal(void)
* cannot have any (immediate) effect on the state machine, but does
* depend on what state we're in now.
*/
- if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+ if (request_state_update)
{
PostmasterStateMachine();
}
@@ -4025,6 +4089,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
switch (pmState)
{
case PM_NO_CHILDREN:
+ case PM_WAIT_CHECKPOINTER:
case PM_WAIT_DEAD_END:
case PM_WAIT_XLOG_ARCHIVAL:
case PM_WAIT_XLOG_SHUTDOWN:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN "Waiting in main loop of autovacuum launcher process."
BGWRITER_HIBERNATE "Waiting in background writer process, hibernating."
BGWRITER_MAIN "Waiting in main loop of background writer process."
CHECKPOINTER_MAIN "Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN "Waiting for checkpointer process to be terminated."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0008-Ensure-a-resowner-exists-for-all-paths-that-may.patch
From 1476ef34b2a2c36e8e1eccbf6d2ac12607b4dab7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 8 Oct 2024 14:34:38 -0400
Subject: [PATCH v2.3 08/30] Ensure a resowner exists for all paths that may
perform AIO
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/1f6b50a7-38ef-4d87-8246-786d39f46ab9@iki.fi
---
src/backend/bootstrap/bootstrap.c | 7 +++++++
src/backend/replication/logical/logical.c | 6 ++++++
src/backend/utils/init/postinit.c | 6 +++++-
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 359f58a8f95..5d41cfc6eb0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -361,8 +361,15 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
BaseInit();
bootstrap_signals();
+
+ /* need a resowner for IO during BootStrapXLOG() */
+ CreateAuxProcessResourceOwner();
+
BootStrapXLOG(bootstrap_data_checksum_version);
+ ReleaseAuxProcessResources(true);
+ CurrentResourceOwner = NULL;
+
/*
* To ensure that src/common/link-canary.c is linked into the backend, we
* must call it from somewhere. Here is as good as anywhere.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 0b25efafe2b..1f8ec3daa6a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -386,6 +386,12 @@ CreateInitDecodingContext(const char *plugin,
slot->data.plugin = plugin_name;
SpinLockRelease(&slot->mutex);
+ if (CurrentResourceOwner == NULL)
+ {
+ Assert(am_walsender);
+ CurrentResourceOwner = AuxProcessResourceOwner;
+ }
+
if (XLogRecPtrIsInvalid(restart_lsn))
ReplicationSlotReserveWal();
else
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 01bb6a410cb..b491d04de58 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -755,8 +755,12 @@ InitPostgres(const char *in_dbname, Oid dboid,
* We don't yet have an aux-process resource owner, but StartupXLOG
* and ShutdownXLOG will need one. Hence, create said resource owner
* (and register a callback to clean it up after ShutdownXLOG runs).
+ *
+ * In bootstrap mode CreateAuxProcessResourceOwner() was already
+ * called in BootstrapModeMain().
*/
- CreateAuxProcessResourceOwner();
+ if (!bootstrap)
+ CreateAuxProcessResourceOwner();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0009-Allow-lwlocks-to-be-unowned.patch
From 552b094c4f52b4092d7998cce01908bff5ddcf8b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 5 Jan 2021 10:10:36 -0800
Subject: [PATCH v2.3 09/30] Allow lwlocks to be unowned
This is required for AIO so that a lock held during a write can be released
in another backend, which in turn is required to avoid the potential for
deadlocks.
---
src/include/storage/lwlock.h | 2 +
src/backend/storage/lmgr/lwlock.c | 108 +++++++++++++++++++++++-------
2 files changed, 85 insertions(+), 25 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 2aa46fd50da..13a7dc89980 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
+extern void LWLockDisown(LWLock *l);
+extern void LWLockReleaseDisowned(LWLock *l, LWLockMode mode);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockAnyHeldByMe(LWLock *lock, int nlocks, size_t stride);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2f558ffea14..c3d6f886e3c 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1773,36 +1773,15 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
}
}
-
/*
- * LWLockRelease - release a previously acquired lock
+ * Helper function to release a lock, shared between LWLockRelease() and
+ * LWLockReleaseDisowned().
*/
-void
-LWLockRelease(LWLock *lock)
+static void
+LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
{
- LWLockMode mode;
uint32 oldstate;
bool check_waiters;
- int i;
-
- /*
- * Remove lock from list of locks held. Usually, but not always, it will
- * be the latest-acquired lock; so search array backwards.
- */
- for (i = num_held_lwlocks; --i >= 0;)
- if (lock == held_lwlocks[i].lock)
- break;
-
- if (i < 0)
- elog(ERROR, "lock %s is not held", T_NAME(lock));
-
- mode = held_lwlocks[i].mode;
-
- num_held_lwlocks--;
- for (; i < num_held_lwlocks; i++)
- held_lwlocks[i] = held_lwlocks[i + 1];
-
- PRINT_LWDEBUG("LWLockRelease", lock, mode);
/*
* Release my hold on lock, after that it can immediately be acquired by
@@ -1840,6 +1819,85 @@ LWLockRelease(LWLock *lock)
LOG_LWDEBUG("LWLockRelease", lock, "releasing waiters");
LWLockWakeup(lock);
}
+}
+
+void
+LWLockReleaseDisowned(LWLock *lock, LWLockMode mode)
+{
+ LWLockReleaseInternal(lock, mode);
+}
+
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * This is the code that can be shared between actually releasing a lock
+ * (LWLockRelease()) and just not tracking ownership of the lock anymore
+ * without releasing the lock (LWLockDisown()).
+ *
+ * Returns the mode in which the lock was held by the current backend.
+ *
+ * NB: This does not call RESUME_INTERRUPTS(), but leaves that responsibility
+ * to the caller.
+ *
+ * NB: This will leave lock->owner pointing to the current backend (if
+ * LOCK_DEBUG is set). This is somewhat intentional, as it makes it easier to
+ * debug cases of missing wakeups during lock release.
+ */
+static inline LWLockMode
+LWLockDisownInternal(LWLock *lock)
+{
+ LWLockMode mode;
+ int i;
+
+ /*
+ * Remove lock from list of locks held. Usually, but not always, it will
+ * be the latest-acquired lock; so search array backwards.
+ */
+ for (i = num_held_lwlocks; --i >= 0;)
+ if (lock == held_lwlocks[i].lock)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "lock %s is not held", T_NAME(lock));
+
+ mode = held_lwlocks[i].mode;
+
+ num_held_lwlocks--;
+ for (; i < num_held_lwlocks; i++)
+ held_lwlocks[i] = held_lwlocks[i + 1];
+
+ return mode;
+}
+
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * After calling this function it's the caller's responsibility to ensure that
+ * the lock gets released (via LWLockReleaseDisowned()), even in case of an
+ * error. This is only desirable if the lock is going to be released in a
+ * different process than the process that acquired it.
+ */
+void
+LWLockDisown(LWLock *lock)
+{
+ LWLockDisownInternal(lock);
+
+ RESUME_INTERRUPTS();
+}
+
+/*
+ * LWLockRelease - release a previously acquired lock
+ */
+void
+LWLockRelease(LWLock *lock)
+{
+ LWLockMode mode;
+
+ mode = LWLockDisownInternal(lock);
+
+ PRINT_LWDEBUG("LWLockRelease", lock, mode);
+
+ LWLockReleaseInternal(lock, mode);
/*
* Now okay to allow cancel/die interrupts.
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0010-aio-Basic-subsystem-initialization.patch
From a6f1745cefdfb932be393f0374765e60563ab23d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 10 Jun 2024 13:42:58 -0700
Subject: [PATCH v2.3 10/30] aio: Basic subsystem initialization
This is just separate to make it easier to review the tendrils into various
places.
---
src/include/storage/aio.h | 37 +++++++++++++++++++
src/include/storage/aio_init.h | 24 ++++++++++++
src/include/utils/guc.h | 1 +
src/backend/storage/aio/Makefile | 2 +
src/backend/storage/aio/aio.c | 36 ++++++++++++++++++
src/backend/storage/aio/aio_init.c | 37 +++++++++++++++++++
src/backend/storage/aio/meson.build | 2 +
src/backend/storage/ipc/ipci.c | 3 ++
src/backend/utils/init/postinit.c | 7 ++++
src/backend/utils/misc/guc_tables.c | 23 ++++++++++++
src/backend/utils/misc/postgresql.conf.sample | 11 ++++++
src/tools/pgindent/typedefs.list | 1 +
12 files changed, 184 insertions(+)
create mode 100644 src/include/storage/aio.h
create mode 100644 src/include/storage/aio_init.h
create mode 100644 src/backend/storage/aio/aio.c
create mode 100644 src/backend/storage/aio/aio_init.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
new file mode 100644
index 00000000000..0e3fadac543
--- /dev/null
+++ b/src/include/storage/aio.h
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.h
+ * Main AIO interface
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_H
+#define AIO_H
+
+
+
+/* Enum for io_method GUC. */
+typedef enum IoMethod
+{
+ IOMETHOD_SYNC = 0,
+} IoMethod;
+
+/* We'll default to synchronous execution. */
+#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+
+
+extern void assign_io_method(int newval, void *extra);
+
+
+/* GUCs */
+extern PGDLLIMPORT int io_method;
+extern PGDLLIMPORT int io_max_concurrency;
+
+
+#endif /* AIO_H */
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
new file mode 100644
index 00000000000..44151ef55bf
--- /dev/null
+++ b/src/include/storage/aio_init.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.h
+ * AIO initialization - kept separate as initialization sites don't need to
+ * know about AIO itself and AIO users don't need to know about initialization.
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_init.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INIT_H
+#define AIO_INIT_H
+
+
+extern Size AioShmemSize(void);
+extern void AioShmemInit(void);
+
+extern void pgaio_init_backend(void);
+
+#endif /* AIO_INIT_H */
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 532d6642bb4..aa859c92085 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -314,6 +314,7 @@ extern PGDLLIMPORT bool optimize_bounded_sort;
*/
extern PGDLLIMPORT const struct config_enum_entry archive_mode_options[];
extern PGDLLIMPORT const struct config_enum_entry dynamic_shared_memory_options[];
+extern PGDLLIMPORT const struct config_enum_entry io_method_options[];
extern PGDLLIMPORT const struct config_enum_entry recovery_target_action_options[];
extern PGDLLIMPORT const struct config_enum_entry wal_level_options[];
extern PGDLLIMPORT const struct config_enum_entry wal_sync_method_options[];
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 2f29a9ec4d1..eaeaeeee8e3 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -9,6 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = \
+ aio.o \
+ aio_init.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
new file mode 100644
index 00000000000..f68cbc2b3f4
--- /dev/null
+++ b/src/backend/storage/aio/aio.c
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.c
+ * AIO - Core Logic
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "utils/guc.h"
+
+
+/* Options for io_method. */
+const struct config_enum_entry io_method_options[] = {
+ {"sync", IOMETHOD_SYNC, false},
+ {NULL, 0, false}
+};
+
+/* GUCs */
+int io_method = DEFAULT_IO_METHOD;
+int io_max_concurrency = -1;
+
+
+
+void
+assign_io_method(int newval, void *extra)
+{
+}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
new file mode 100644
index 00000000000..f7ee8270756
--- /dev/null
+++ b/src/backend/storage/aio/aio_init.c
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.c
+ * AIO - Subsystem Initialization
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_init.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio_init.h"
+
+
+
+Size
+AioShmemSize(void)
+{
+ Size sz = 0;
+
+ return sz;
+}
+
+void
+AioShmemInit(void)
+{
+}
+
+void
+pgaio_init_backend(void)
+{
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8abe0eb4863..c822fd4ddf7 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -1,5 +1,7 @@
# Copyright (c) 2024-2025, PostgreSQL Global Development Group
backend_sources += files(
+ 'aio.c',
+ 'aio_init.c',
'read_stream.c',
)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..e11e82fc897 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -37,6 +37,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/dsm_registry.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, AioShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ AioShmemInit();
}
/*
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index b491d04de58..8ea50314a4e 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -43,6 +43,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -626,6 +627,12 @@ BaseInit(void)
*/
pgstat_initialize();
+ /*
+ * Initialize AIO before infrastructure that might need to actually
+ * execute AIO.
+ */
+ pgaio_init_backend();
+
/* Do local initialization of storage and buffer managers */
InitSync();
smgrinit();
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 38cb9e970d5..de524eccad5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -71,6 +71,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/large_object.h"
@@ -3220,6 +3221,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_max_concurrency",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IOs that may be in flight in one backend."),
+ NULL,
+ },
+ &io_max_concurrency,
+ -1, -1, 1024,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
@@ -5236,6 +5249,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"io_method", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Selects the method of asynchronous I/O to use."),
+ NULL
+ },
+ &io_method,
+ DEFAULT_IO_METHOD, io_method_options,
+ NULL, assign_io_method, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 079efa1baa7..fba0ad4b624 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -843,6 +843,17 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#include = '...' # include file
+#------------------------------------------------------------------------------
+# WIP AIO GUC docs
+#------------------------------------------------------------------------------
+
+#io_method = sync # (change requires restart)
+
+#io_max_concurrency = 32 # Max number of IOs that may be in
+ # flight at the same time in one backend
+ # (change requires restart)
+
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d5aa5c295ae..3bec090428d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1266,6 +1266,7 @@ IntoClause
InvalMessageArray
InvalidationInfo
InvalidationMsgsGroup
+IoMethod
IpcMemoryId
IpcMemoryKey
IpcMemoryState
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0011-aio-Core-AIO-implementation.patch
From ac42f990b85ae4034f16acf9929ce28e18ec2088 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 7 Jan 2025 14:42:12 -0500
Subject: [PATCH v2.3 11/30] aio: Core AIO implementation
At this point nothing can use AIO - this commit does not include any
implementation of aio subjects / callbacks. That will come in later commits.
Todo:
- lots of cleanup
---
src/include/storage/aio.h | 301 ++++++
src/include/storage/aio_internal.h | 295 ++++++
src/include/storage/aio_types.h | 115 +++
src/include/utils/resowner.h | 5 +
src/backend/access/transam/xact.c | 9 +
src/backend/storage/aio/Makefile | 4 +
src/backend/storage/aio/aio.c | 904 ++++++++++++++++++
src/backend/storage/aio/aio_callback.c | 280 ++++++
src/backend/storage/aio/aio_init.c | 186 ++++
src/backend/storage/aio/aio_io.c | 175 ++++
src/backend/storage/aio/aio_target.c | 108 +++
src/backend/storage/aio/meson.build | 4 +
src/backend/storage/aio/method_sync.c | 47 +
.../utils/activity/wait_event_names.txt | 3 +
src/backend/utils/resowner/resowner.c | 30 +
src/tools/pgindent/typedefs.list | 21 +
16 files changed, 2487 insertions(+)
create mode 100644 src/include/storage/aio_internal.h
create mode 100644 src/include/storage/aio_types.h
create mode 100644 src/backend/storage/aio/aio_callback.c
create mode 100644 src/backend/storage/aio/aio_io.c
create mode 100644 src/backend/storage/aio/aio_target.c
create mode 100644 src/backend/storage/aio/method_sync.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 0e3fadac543..ffd382593d0 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -14,6 +14,9 @@
#ifndef AIO_H
#define AIO_H
+#include "storage/aio_types.h"
+#include "storage/procnumber.h"
+
/* Enum for io_method GUC. */
@@ -26,9 +29,307 @@ typedef enum IoMethod
#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+/*
+ * Flags for an IO that can be set with pgaio_io_set_flag().
+ */
+typedef enum PgAioHandleFlags
+{
+ /*
+ * Hint that IO will be executed synchronously.
+ *
+ * This can make it a bit cheaper to execute synchronous IO via the AIO
+ * interface, to avoid needing an AIO and non-AIO version of code.
+ *
+ * Advantageous to set, if applicable, but not required for correctness.
+ */
+ PGAIO_HF_SYNCHRONOUS = 1 << 0,
+
+ /*
+ * The IO references backend local memory.
+ *
+ * This needs to be set on an IO whenever the IO references process-local
+ * memory. Some IO methods do not support executing IO that references
+ * process-local memory and thus need to fall back to executing IO
+ * synchronously for IOs with this flag set.
+ *
+ * Required for correctness.
+ */
+ PGAIO_HF_REFERENCES_LOCAL = 1 << 1,
+
+ /*
+ * IO is using buffered IO, used to control heuristics in some IO methods.
+ *
+ * Advantageous to set, if applicable, but not required for correctness.
+ */
+ PGAIO_HF_BUFFERED = 1 << 2,
+} PgAioHandleFlags;
+
+/*
+ * The IO operations supported by the AIO subsystem.
+ *
+ * This could be in aio_internal.h, as it is not publicly referenced, but
+ * PgAioOpData currently *does* need to be public, therefore keeping this
+ * public seems to make sense.
+ */
+typedef enum PgAioOp
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ PGAIO_OP_INVALID = 0,
+
+ PGAIO_OP_READV,
+ PGAIO_OP_WRITEV,
+
+ /*
+ * In the near term we'll need at least:
+ * - fsync / fdatasync
+ * - flush_range
+ *
+ * Eventually we'll additionally want at least:
+ * - send
+ * - recv
+ * - accept
+ */
+} PgAioOp;
+
+#define PGAIO_OP_COUNT (PGAIO_OP_WRITEV + 1)
+
+
+/*
+ * What the IO is being performed on.
+ *
+ * PgAioTargetID specific behaviour should be implemented in
+ * aio_target.c.
+ */
+typedef enum PgAioTargetID
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ PGAIO_TID_INVALID = 0,
+} PgAioTargetID;
+
+#define PGAIO_TID_COUNT (PGAIO_TID_INVALID + 1)
+
+
+/*
+ * Data necessary to support IO operations (see PgAioOp).
+ *
+ * NB: The FDs in here may *not* be relied upon for re-issuing
+ * requests (e.g. for partial reads/writes) - the FD might be from another
+ * process, or closed since. That's not a problem for IOs waiting to be issued
+ * only because the queue is flushed when closing an FD.
+ */
+typedef union
+{
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } read;
+
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } write;
+} PgAioOpData;
+
+
+/*
+ * Information about the object that IO is executed on. Mostly callbacks that
+ * operate on PgAioTargetData.
+ */
+typedef struct PgAioTargetInfo
+{
+ void (*reopen) (PgAioHandle *ioh);
+
+ char *(*describe_identity) (const PgAioTargetData *sd);
+
+ const char *name;
+} PgAioTargetInfo;
+
+
+/*
+ * IDs for callbacks that can be registered on an IO.
+ *
+ * Callbacks are identified by an ID rather than a function pointer. There are
+ * two main reasons:
+ *
+ * 1) Memory within PgAioHandle is precious, due to the number of PgAioHandle
+ * structs in pre-allocated shared memory.
+ *
+ * 2) Due to EXEC_BACKEND function pointers are not necessarily stable between
+ * different backends, therefore function pointers cannot directly be in
+ * shared memory.
+ *
+ * Without 2), we could fairly easily allow adding new callbacks, by filling
+ * an ID->pointer mapping table on demand. In the presence of 2) that's still
+ * doable, but harder, because every process has to re-register the pointers
+ * so that a local ID->"backend local pointer" mapping can be maintained.
+ */
+typedef enum PgAioHandleCallbackID
+{
+ PGAIO_HCB_INVALID,
+} PgAioHandleCallbackID;
+
+
+typedef void (*PgAioHandleCallbackStage) (PgAioHandle *ioh);
+typedef PgAioResult (*PgAioHandleCallbackComplete) (PgAioHandle *ioh, PgAioResult prior_result);
+typedef void (*PgAioHandleCallbackReport) (PgAioResult result, const PgAioTargetData *target_data, int elevel);
+
+typedef struct PgAioHandleCallbacks
+{
+ /*
+ * Prepare resources affected by the IO for execution. This could e.g.
+ * include moving ownership of buffer pins to the AIO subsystem.
+ */
+ PgAioHandleCallbackStage stage;
+
+ /*
+ * Update the state of resources affected by the IO to reflect completion
+ * of the IO. This could e.g. include updating shared buffer state to
+ * signal the IO has finished.
+ *
+ * The _shared suffix indicates that this is executed by the backend that
+ * completed the IO, which may or may not be the backend that issued the
+ * IO. Obviously the callback thus can only modify resources in shared
+ * memory.
+ *
+ * The latest registered callback is called first. This allows
+ * higher-level code to register callbacks that can rely on callbacks
+ * registered by lower-level code to already have been executed.
+ *
+ * NB: This is called in a critical section. Errors can be signalled by
+ * the callback's return value, it's the responsibility of the IO's issuer
+ * to react appropriately.
+ */
+ PgAioHandleCallbackComplete complete_shared;
+
+ /*
+ * Like complete_shared, except called in the issuing backend.
+ *
+ * This variant of the completion callback is useful when backend-local
+ * state has to be updated to reflect the IO's completion. E.g. a
+ * temporary buffer's BufferDesc isn't accessible in complete_shared.
+ *
+ * Local callbacks are only called after complete_shared for all
+ * registered callbacks has been called.
+ */
+ PgAioHandleCallbackComplete complete_local;
+
+ /*
+ * Report the result of an IO operation. This is e.g. used to raise an
+ * error after an IO failed at the appropriate time (i.e. not when the IO
+ * failed, but under control of the code that issued the IO).
+ */
+ PgAioHandleCallbackReport report;
+} PgAioHandleCallbacks;
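
The latest-registered-first ordering of complete_shared callbacks can be sketched in isolation. The following is a minimal, self-contained illustration of the technique, not code from the patch; all names here are hypothetical:

```c
#include <assert.h>

#define DEMO_MAX_CALLBACKS 4

typedef int (*demo_complete_cb) (int prior_result);

/* two example callbacks, standing in for lower/higher-level layers */
static int demo_add_one(int r) { return r + 1; }
static int demo_double(int r) { return r * 2; }

typedef struct DemoCallbacks
{
	int			ncallbacks;
	demo_complete_cb complete[DEMO_MAX_CALLBACKS];
} DemoCallbacks;

static void
demo_register(DemoCallbacks *cbs, demo_complete_cb cb)
{
	assert(cbs->ncallbacks < DEMO_MAX_CALLBACKS);
	cbs->complete[cbs->ncallbacks++] = cb;
}

/*
 * Invoke callbacks latest-registered-first, each one translating the
 * result produced by the callback registered before it.
 */
static int
demo_complete(DemoCallbacks *cbs, int raw_result)
{
	int			result = raw_result;

	for (int i = cbs->ncallbacks - 1; i >= 0; i--)
		result = cbs->complete[i] (result);
	return result;
}
```

The layer that registers last sees the raw result first and can translate it for the layers whose callbacks run afterwards.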
+
+
+
+/*
+ * How many callbacks can be registered for one IO handle. Currently we only
+ * need two, but it's not hard to imagine needing a few more.
+ */
+#define PGAIO_HANDLE_MAX_CALLBACKS 4
+
+
+
+/* AIO API */
+
+
+/* --------------------------------------------------------------------------------
+ * IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+/* functions in aio.c */
+struct ResourceOwnerData;
+extern PgAioHandle *pgaio_io_acquire(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+extern PgAioHandle *pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+
+extern void pgaio_io_release(PgAioHandle *ioh);
+struct dlist_node;
+extern void pgaio_io_release_resowner(struct dlist_node *ioh_node, bool on_error);
+
+extern void pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag);
+
+extern int pgaio_io_get_id(PgAioHandle *ioh);
+extern ProcNumber pgaio_io_get_owner(PgAioHandle *ioh);
+
+extern void pgaio_io_get_wref(PgAioHandle *ioh, PgAioWaitRef *iow);
+
+/* functions in aio_io.c */
+struct iovec;
+extern int pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov);
+
+extern PgAioOpData *pgaio_io_get_op_data(PgAioHandle *ioh);
+
+extern void pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+extern void pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+/* functions in aio_target.c */
+extern void pgaio_io_set_target(PgAioHandle *ioh, PgAioTargetID targetid);
+extern bool pgaio_io_has_target(PgAioHandle *ioh);
+extern PgAioTargetData *pgaio_io_get_target_data(PgAioHandle *ioh);
+extern char *pgaio_io_get_target_description(PgAioHandle *ioh);
+
+/* functions in aio_callback.c */
+extern void pgaio_io_register_callbacks(PgAioHandle *ioh, PgAioHandleCallbackID cbid);
+extern void pgaio_io_set_handle_data_64(PgAioHandle *ioh, uint64 *data, uint8 len);
+extern void pgaio_io_set_handle_data_32(PgAioHandle *ioh, uint32 *data, uint8 len);
+extern uint64 *pgaio_io_get_handle_data(PgAioHandle *ioh, uint8 *len);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Wait References
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_wref_clear(PgAioWaitRef *iow);
+extern bool pgaio_wref_valid(PgAioWaitRef *iow);
+extern int pgaio_wref_get_id(PgAioWaitRef *iow);
+
+extern void pgaio_wref_wait(PgAioWaitRef *iow);
+extern bool pgaio_wref_check_done(PgAioWaitRef *iow);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_result_report(PgAioResult result, const PgAioTargetData *target_data,
+ int elevel);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_submit_staged(void);
+extern bool pgaio_have_staged(void);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_closing_fd(int fd);
+extern void pgaio_at_xact_end(bool is_subxact, bool is_commit);
+extern void pgaio_at_error(void);
extern void assign_io_method(int newval, void *extra);
+
/* GUCs */
extern PGDLLIMPORT int io_method;
extern PGDLLIMPORT int io_max_concurrency;
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
new file mode 100644
index 00000000000..174d365f9c0
--- /dev/null
+++ b/src/include/storage/aio_internal.h
@@ -0,0 +1,295 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_internal.h
+ * AIO related declarations that should only be used by the AIO subsystem
+ * internally.
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INTERNAL_H
+#define AIO_INTERNAL_H
+
+
+#include "lib/ilist.h"
+#include "port/pg_iovec.h"
+#include "storage/aio.h"
+#include "storage/condition_variable.h"
+
+
+/* AFIXME */
+#define PGAIO_SUBMIT_BATCH_SIZE 32
+
+
+
+typedef enum PgAioHandleState
+{
+ /* not in use */
+ PGAIO_HS_IDLE = 0,
+
+ /* returned by pgaio_io_acquire() */
+ PGAIO_HS_HANDED_OUT,
+
+ /* pgaio_io_prep_*() has been called, but IO hasn't been submitted yet */
+ PGAIO_HS_DEFINED,
+
+ /* target's stage() callback has been called, ready to be submitted */
+ PGAIO_HS_STAGED,
+
+ /* IO has been submitted and is being executed */
+ PGAIO_HS_SUBMITTED,
+
+ /* IO finished, but result has not yet been processed */
+ PGAIO_HS_COMPLETED_IO,
+
+ /* IO completed, shared completion has been called */
+ PGAIO_HS_COMPLETED_SHARED,
+
+ /* IO completed, local completion has been called */
+ PGAIO_HS_COMPLETED_LOCAL,
+} PgAioHandleState;
+
+
+struct ResourceOwnerData;
+
+/* typedef is in public header */
+struct PgAioHandle
+{
+ /* all state updates should go through pgaio_io_update_state() */
+ PgAioHandleState state:8;
+
+ /* what are we operating on */
+ PgAioTargetID target:8;
+
+ /* which IO operation */
+ PgAioOp op:8;
+
+ /* bitfield of PgAioHandleFlags */
+ uint8 flags;
+
+ uint8 num_shared_callbacks;
+
+ /* using the proper type here would use more space */
+ uint8 shared_callbacks[PGAIO_HANDLE_MAX_CALLBACKS];
+
+ /*
+ * Length of data associated with handle using
+ * pgaio_io_set_handle_data_*().
+ */
+ uint8 handle_data_len;
+
+ /* XXX: could be optimized out with some pointer math */
+ int32 owner_procno;
+
+ /* raw result of the IO operation */
+ int32 result;
+
+ /*
+ * Index into PgAioCtl->iovecs and PgAioCtl->handle_data.
+ *
+ * At the moment there's no need to differentiate between the two, but
+ * that won't necessarily stay that way.
+ */
+ uint32 iovec_off;
+
+ /*
+ * Which list the handle is registered in depends on the state:
+ * - IDLE - in per-backend idle list
+ * - HANDED_OUT - not in a list
+ * - DEFINED - in per-backend staged list
+ * - STAGED - in per-backend staged list
+ * - SUBMITTED - in issuer's in_flight list
+ * - COMPLETED_IO - in issuer's in_flight list
+ * - COMPLETED_SHARED - in issuer's in_flight list
+ */
+ dlist_node node;
+
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+
+ /* incremented every time the IO handle is reused */
+ uint64 generation;
+
+ ConditionVariable cv;
+
+ /* result of shared callback, passed to issuer callback */
+ PgAioResult distilled_result;
+
+ PgAioReturn *report_return;
+
+ PgAioOpData op_data;
+
+ /*
+ * Data necessary to identify the object undergoing IO to higher-level
+ * code. Needs to be sufficient to allow another backend to reopen the
+ * file.
+ */
+ PgAioTargetData target_data;
+};
+
+
+typedef struct PgAioBackend
+{
+ /* index into PgAioCtl->io_handles */
+ uint32 io_handle_off;
+
+ /* IO Handles that currently are not used */
+ dclist_head idle_ios;
+
+ /*
+ * Only one IO may be returned by pgaio_io_acquire()/pgaio_io_acquire_nb()
+ * without having been either defined (by actually associating it with IO)
+ * or released (with pgaio_io_release()). This restriction is necessary
+ * to guarantee that we always can acquire an IO. ->handed_out_io is used
+ * to enforce that rule.
+ */
+ PgAioHandle *handed_out_io;
+
+ /*
+ * IOs that are defined, but not yet submitted.
+ */
+ uint16 num_staged_ios;
+ PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
+
+ /*
+ * List of in-flight IOs. Also contains IOs that aren't strictly speaking
+ * in-flight anymore, but have been waited-for and completed by another
+ * backend. Once this backend sees such an IO it'll be reclaimed.
+ *
+ * The list is ordered by submission time, with more recently submitted
+ * IOs being appended at the end.
+ */
+ dclist_head in_flight_ios;
+} PgAioBackend;
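
The handed_out_io rule above amounts to a tiny state machine. As a self-contained sketch of that invariant (hypothetical names, not part of the patch):

```c
#include <assert.h>

/* Hypothetical stand-in for a backend's handle bookkeeping. */
typedef struct DemoBackend
{
	int			idle_count;		/* handles on the idle list */
	int			handed_out;		/* 0 or 1, mirrors ->handed_out_io */
} DemoBackend;

/*
 * Returns 1 on success, 0 if no idle handle is available (the caller
 * would then wait for in-flight IO to complete and retry).
 */
static int
demo_acquire(DemoBackend *be)
{
	/* only one handle may be handed out at a time */
	assert(!be->handed_out);

	if (be->idle_count == 0)
		return 0;
	be->idle_count--;
	be->handed_out = 1;
	return 1;
}

/* releasing (or defining) the handle clears the handed-out slot */
static void
demo_release(DemoBackend *be)
{
	assert(be->handed_out);
	be->handed_out = 0;
	be->idle_count++;
}
```

Because at most one handle is ever handed out without being defined or released, waiting for all in-flight IO is always sufficient to free a handle.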
+
+
+typedef struct PgAioCtl
+{
+ int backend_state_count;
+ PgAioBackend *backend_state;
+
+ /*
+ * Array of iovec structs. Each iovec is owned by a specific backend. The
+ * allocation is in PgAioCtl to allow the maximum number of iovecs for
+ * individual IOs to be configurable with a PGC_POSTMASTER GUC.
+ */
+ uint64 iovec_count;
+ struct iovec *iovecs;
+
+ /*
+ * For, e.g., an IO covering multiple buffers in shared / temp buffers, we
+ * need to get Buffer IDs during completion to be able to change the
+ * BufferDesc state accordingly. This space can be used to store e.g.
+ * Buffer IDs. Note that the actual iovec might be shorter than this,
+ * because we combine neighboring pages into one larger iovec entry.
+ */
+ uint64 *handle_data;
+
+ uint64 io_handle_count;
+ PgAioHandle *io_handles;
+} PgAioCtl;
+
+
+
+/*
+ * The set of callbacks that each IO method must implement.
+ *
+ * AFIXME: Document these.
+ */
+typedef struct IoMethodOps
+{
+ /* global initialization */
+ size_t (*shmem_size) (void);
+ void (*shmem_init) (bool first_time);
+
+ /* per-backend initialization */
+ void (*init_backend) (void);
+
+ /* handling of IOs */
+ bool (*needs_synchronous_execution) (PgAioHandle *ioh);
+ int (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+ void (*wait_one) (PgAioHandle *ioh,
+ uint64 ref_generation);
+} IoMethodOps;
+
+
+/* aio.c */
+extern bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state);
+extern void pgaio_io_stage(PgAioHandle *ioh, PgAioOp op);
+extern void pgaio_io_process_completion(PgAioHandle *ioh, int result);
+extern void pgaio_io_prepare_submit(PgAioHandle *ioh);
+extern bool pgaio_io_needs_synchronous_execution(PgAioHandle *ioh);
+extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+extern void pgaio_shutdown(int code, Datum arg);
+
+/* aio_callback.c */
+extern void pgaio_io_call_stage(PgAioHandle *ioh);
+extern void pgaio_io_call_complete_shared(PgAioHandle *ioh);
+extern void pgaio_io_call_complete_local(PgAioHandle *ioh);
+
+/* aio_io.c */
+extern void pgaio_io_perform_synchronously(PgAioHandle *ioh);
+extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
+
+/* aio_target.c */
+extern bool pgaio_io_can_reopen(PgAioHandle *ioh);
+extern void pgaio_io_reopen(PgAioHandle *ioh);
+extern const char *pgaio_io_get_target_name(PgAioHandle *ioh);
+
+
+/*
+ * The AIO subsystem has fairly verbose debug logging support. This can be
+ * enabled/disabled at build time. The reason for this is that
+ * a) the verbosity can make debugging things on higher levels hard
+ * b) even if logging can be skipped due to elevel checks, it still causes a
+ * measurable slowdown
+ */
+#define PGAIO_VERBOSE 1
+
+/*
+ * Simple ereport() wrapper that only logs if PGAIO_VERBOSE is defined.
+ *
+ * This intentionally still compiles the code, guarded by a constant if (0),
+ * if verbose logging is disabled, to make it less likely that debug logging
+ * is silently broken.
+ *
+ * The current definition requires passing at least one argument.
+ */
+#define pgaio_debug(elevel, msg, ...) \
+ do { \
+ if (PGAIO_VERBOSE) \
+ ereport(elevel, \
+ errhidestmt(true), errhidecontext(true), \
+ errmsg_internal(msg, \
+ __VA_ARGS__)); \
+ } while(0)
+
+/*
+ * Simple ereport() wrapper. Note that the definition requires passing at
+ * least one argument.
+ */
+#define pgaio_debug_io(elevel, ioh, msg, ...) \
+ pgaio_debug(elevel, "io %-10d|op %-5s|target %-4s|state %-16s: " msg, \
+ pgaio_io_get_id(ioh), \
+ pgaio_io_get_op_name(ioh), \
+ pgaio_io_get_target_name(ioh), \
+ pgaio_io_get_state_name(ioh), \
+ __VA_ARGS__)
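
The "compile it, but let a constant condition elide it" trick used by pgaio_debug() can be demonstrated in isolation. A minimal sketch with hypothetical names:

```c
#include <assert.h>

#define DEMO_VERBOSE 0

static int	demo_log_calls = 0;

static void
demo_log(const char *msg)
{
	(void) msg;
	demo_log_calls++;
}

/*
 * As with pgaio_debug(): the call is always compiled, so a typoed symbol
 * or argument breaks the build even when logging is disabled, but the
 * constant condition lets the compiler drop the call entirely at -O1+.
 */
#define demo_debug(msg) \
	do { \
		if (DEMO_VERBOSE) \
			demo_log(msg); \
	} while (0)
```

With `DEMO_VERBOSE` set to 0 the macro body is dead code, yet it must still name real symbols, which is exactly what keeps disabled debug logging from silently rotting.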
+
+
+/* Declarations for the tables of function pointers exposed by each IO method. */
+extern PGDLLIMPORT const IoMethodOps pgaio_sync_ops;
+
+extern PGDLLIMPORT const IoMethodOps *pgaio_method_ops;
+extern PGDLLIMPORT PgAioCtl *pgaio_ctl;
+extern PGDLLIMPORT PgAioBackend *pgaio_my_backend;
+
+
+
+#endif /* AIO_INTERNAL_H */
diff --git a/src/include/storage/aio_types.h b/src/include/storage/aio_types.h
new file mode 100644
index 00000000000..d2617139a25
--- /dev/null
+++ b/src/include/storage/aio_types.h
@@ -0,0 +1,115 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_types.h
+ * AIO related types that are useful to include separately, to reduce the
+ * "include burden".
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_types.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_TYPES_H
+#define AIO_TYPES_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+
+typedef struct PgAioHandle PgAioHandle;
+
+/*
+ * A reference to an IO that can be used to wait for the IO (using
+ * pgaio_wref_wait()) to complete.
+ *
+ * These can be passed across process boundaries.
+ */
+typedef struct PgAioWaitRef
+{
+ /* internal ID identifying the specific PgAioHandle */
+ uint32 aio_index;
+
+ /*
+ * IO handles are reused. To detect if a handle was reused, and thereby
+ * avoid unnecessarily waiting for a newer IO, each time the handle is
+ * reused a generation number is increased.
+ *
+ * To avoid requiring alignment sufficient for an int64, split the
+ * generation into two.
+ */
+ uint32 generation_upper;
+ uint32 generation_lower;
+} PgAioWaitRef;
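
The split of the 64-bit generation into two uint32 halves, done to avoid imposing 8-byte alignment on the struct, can be sketched standalone (hypothetical names, not the patch's code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the wait-ref layout: the 64-bit generation is
 * stored as two uint32 halves so the struct only needs 4-byte alignment.
 */
typedef struct DemoWaitRef
{
	uint32_t	aio_index;
	uint32_t	generation_upper;
	uint32_t	generation_lower;
} DemoWaitRef;

static void
demo_wref_set(DemoWaitRef *iow, uint32_t index, uint64_t generation)
{
	iow->aio_index = index;
	iow->generation_upper = (uint32_t) (generation >> 32);
	iow->generation_lower = (uint32_t) generation;
}

static uint64_t
demo_wref_generation(const DemoWaitRef *iow)
{
	return ((uint64_t) iow->generation_upper << 32) | iow->generation_lower;
}

/* a handle was recycled iff its current generation moved past the ref's */
static int
demo_wref_is_stale(const DemoWaitRef *iow, uint64_t current_generation)
{
	return current_generation != demo_wref_generation(iow);
}
```

A waiter reassembles the generation and compares it against the handle's current one; any mismatch means the handle was reused and the wait can be skipped.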
+
+
+/*
+ * Information identifying what the IO is being performed on.
+ *
+ * This needs sufficient information to
+ *
+ * a) Reopen the file for the IO if the IO is executed in a context that
+ * cannot use the FD provided initially (e.g. because the IO is executed in
+ * a worker process).
+ *
+ * b) Describe the object the IO is performed on in log / error messages.
+ */
+typedef union PgAioTargetData
+{
+ /* just as an example placeholder for later */
+ struct
+ {
+ uint32 queue_id;
+ } wal;
+} PgAioTargetData;
+
+
+/*
+ * The status of an AIO operation.
+ */
+typedef enum PgAioResultStatus
+{
+ ARS_UNKNOWN, /* not yet completed / uninitialized */
+ ARS_OK,
+ ARS_PARTIAL, /* did not fully succeed, but no error */
+ ARS_ERROR,
+} PgAioResultStatus;
+
+
+/*
+ * Result of IO operation, visible only to the initiator of IO.
+ */
+typedef struct PgAioResult
+{
+ /*
+ * This is of type PgAioHandleCallbackID, but can't use a bitfield of an
+ * enum, because some compilers treat enums as signed.
+ */
+ uint32 id:8;
+
+ /* of type PgAioResultStatus, see above */
+ uint32 status:2;
+
+ /* meaning defined by callback->error */
+ uint32 error_data:22;
+
+ int32 result;
+} PgAioResult;
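
The 8 + 2 + 22 bit packing above fits exactly into one uint32, keeping the whole result at 8 bytes. A self-contained mirror of the layout (hypothetical names; field widths taken from the struct above):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the packed result. Plain uint32 bitfields are
 * used instead of enum bitfields, since some compilers treat enum
 * bitfields as signed and would corrupt the stored values.
 */
typedef struct DemoResult
{
	uint32_t	id:8;			/* callback ID, up to 255 */
	uint32_t	status:2;		/* one of four status values */
	uint32_t	error_data:22;	/* callback-defined, up to ~4M */
	int32_t		result;			/* raw (e.g. syscall) result */
} DemoResult;
```

The widths bound what each field can carry; values within those bounds round-trip unchanged, and on common ABIs the three bitfields share a single 32-bit unit.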
+
+
+/*
+ * Combination of PgAioResult with minimal metadata about the IO.
+ *
+ * Contains sufficient information to be able, in case the IO [partially]
+ * fails, to log/raise an error under control of the IO issuing code.
+ */
+typedef struct PgAioReturn
+{
+ PgAioResult result;
+ PgAioTargetData target_data;
+} PgAioReturn;
+
+
+#endif /* AIO_TYPES_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index e8d452ca7ee..aede4bfc820 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -164,4 +164,9 @@ struct LOCALLOCK;
extern void ResourceOwnerRememberLock(ResourceOwner owner, struct LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *locallock);
+/* special support for AIO */
+struct dlist_node;
+extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+
#endif /* RESOWNER_H */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d331ab90d78..a252c3a81b4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -51,6 +51,7 @@
#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -2475,6 +2476,8 @@ CommitTransaction(void)
AtEOXact_LogicalRepWorkers(true);
pgstat_report_xact_timestamp(0);
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ true);
+
ResourceOwnerDelete(TopTransactionResourceOwner);
s->curTransactionOwner = NULL;
CurTransactionResourceOwner = NULL;
@@ -2988,6 +2991,8 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ false);
+
/*
* State remains TRANS_ABORT until CleanupTransaction().
*/
@@ -5185,6 +5190,8 @@ CommitSubTransaction(void)
AtEOSubXact_PgStat(true, s->nestingLevel);
AtSubCommit_Snapshot(s->nestingLevel);
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ true);
+
/*
* We need to restore the upper transaction's read-only state, in case the
* upper is read-write while the child is read-only; GUC will incorrectly
@@ -5351,6 +5358,8 @@ AbortSubTransaction(void)
AtSubAbort_Snapshot(s->nestingLevel);
}
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ false);
+
/*
* Restore the upper transaction's read-only state, too. This should be
* redundant with GUC's cleanup but we may as well do it for consistency
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index eaeaeeee8e3..89f821ea7e1 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -10,7 +10,11 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
+ aio_callback.o \
aio_init.o \
+ aio_io.o \
+ aio_target.o \
+ method_sync.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index f68cbc2b3f4..cefa888884c 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -3,6 +3,28 @@
* aio.c
* AIO - Core Logic
*
+ * For documentation about how AIO works on a higher level, including a
+ * schematic example, see README.md.
+ *
+ *
+ * AIO is a complicated subsystem. To keep things navigable it is split across
+ * a number of files:
+ *
+ * - method_*.c - different ways of executing AIO (e.g. worker process)
+ *
+ * - aio_target.c - IO on different kinds of targets
+ *
+ * - aio_io.c - method-independent code for specific IO ops (e.g. readv)
+ *
+ * - aio_callback.c - callbacks at IO operation lifecycle events
+ *
+ * - aio_init.c - per-server and per-backend initialization
+ *
+ * - aio.c - all other topics
+ *
+ * - read_stream.c - helper for reading buffered relation data
+ *
+ *
* Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -14,8 +36,22 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
#include "storage/aio.h"
+#include "storage/aio_internal.h"
#include "utils/guc.h"
+#include "utils/resowner.h"
+#include "utils/wait_event_types.h"
+
+
+static inline void pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state);
+static void pgaio_io_reclaim(PgAioHandle *ioh);
+static void pgaio_io_resowner_register(PgAioHandle *ioh);
+static void pgaio_io_wait_for_free(void);
+static PgAioHandle *pgaio_io_from_wref(PgAioWaitRef *iow, uint64 *ref_generation);
+static const char *pgaio_io_state_get_name(PgAioHandleState s);
+static void pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation);
/* Options for io_method. */
@@ -28,9 +64,877 @@ const struct config_enum_entry io_method_options[] = {
int io_method = DEFAULT_IO_METHOD;
int io_max_concurrency = -1;
+/* global control for AIO */
+PgAioCtl *pgaio_ctl;
+/* current backend's per-backend state */
+PgAioBackend *pgaio_my_backend;
+
+
+static const IoMethodOps *const pgaio_method_ops_table[] = {
+ [IOMETHOD_SYNC] = &pgaio_sync_ops,
+};
+
+/* callbacks for the configured io_method, set by assign_io_method */
+const IoMethodOps *pgaio_method_ops;
+
+
+
+/* --------------------------------------------------------------------------------
+ * Public Functions related to PgAioHandle
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Acquire an AioHandle, waiting for IO completion if necessary.
+ *
+ * Each backend can only have one AIO handle that has been "handed out" to
+ * code, but not yet submitted or released. This restriction is necessary
+ * to ensure that it is possible for code to wait for an unused handle by
+ * waiting for in-flight IO to complete. There is a limited number of handles
+ * in each backend; if multiple handles could be handed out without being
+ * submitted, waiting for all in-flight IO to complete would not guarantee
+ * that handles free up.
+ *
+ * It is cheap to acquire an IO handle, unless all handles are in use. In that
+ * case this function waits for the oldest IO to complete. In case that is not
+ * desirable, see pgaio_io_acquire_nb().
+ *
+ * If a handle was acquired but then does not turn out to be needed,
+ * e.g. because pgaio_io_acquire() is called before starting an IO in a
+ * critical section, the handle needs to be released with pgaio_io_release().
+ *
+ *
+ * To react to the completion of the IO as soon as it is known to have
+ * completed, callbacks can be registered with pgaio_io_register_callbacks().
+ *
+ * To actually execute IO using the returned handle, the pgaio_io_prep_*()
+ * family of functions is used. In many cases the pgaio_io_prep_*() call will
+ * not be done directly by code that acquired the handle, but by lower level
+ * code that gets passed the handle. E.g. if code in bufmgr.c wants to perform
+ * AIO, it typically will pass the handle to smgr.c, which will pass it on to
+ * md.c, on to fd.c, which then finally calls pgaio_io_prep_*(). This
+ * forwarding allows the various layers to react to the IO's completion by
+ * registering callbacks. These callbacks in turn can translate a lower
+ * layer's result into a result understandable by a higher layer.
+ *
+ * Once pgaio_io_prep_*() is called, the IO may be in the process of being
+ * executed and might even complete before the functions return. That is,
+ * however, not guaranteed, to allow IO submission to be batched. To guarantee
+ * IO submission pgaio_submit_staged() needs to be called.
+ *
+ * After pgaio_io_prep_*() the AioHandle is "consumed" and may not be
+ * referenced by the IO issuing code. To e.g. wait for IO, references to the
+ * IO can be established with pgaio_io_get_wref() *before* pgaio_io_prep_*()
+ * is called. pgaio_wref_wait() can be used to wait for the IO to complete.
+ *
+ *
+ * To know if the IO [partially] succeeded or failed, a PgAioReturn * can be
+ * passed to pgaio_io_acquire(). Once the issuing backend has called
+ * pgaio_wref_wait(), the PgAioReturn contains information about whether the
+ * operation succeeded and details about the first failure, if any. The error
+ * can be raised / logged with pgaio_result_report().
+ *
+ * The lifetime of the memory pointed to by *ret needs to be at least as long
+ * as the passed in resowner. If the resowner releases resources before the IO
+ * completes (typically due to an error), the reference to *ret will be
+ * cleared. In case of resowner cleanup *ret will not be updated with the
+ * results of the IO operation.
+ */
+PgAioHandle *
+pgaio_io_acquire(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ PgAioHandle *h;
+
+ while (true)
+ {
+ h = pgaio_io_acquire_nb(resowner, ret);
+
+ if (h != NULL)
+ return h;
+
+ /*
+ * Evidently all handles by this backend are in use. Just wait for
+ * some to complete.
+ */
+ pgaio_io_wait_for_free();
+ }
+}
+
+/*
+ * Acquire an AioHandle, returning NULL if no handles are free.
+ *
+ * See pgaio_io_acquire(). The only difference is that this function will return
+ * NULL if there are no idle handles, instead of blocking.
+ */
+PgAioHandle *
+pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ if (pgaio_my_backend->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
+ {
+ Assert(pgaio_my_backend->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
+ pgaio_submit_staged();
+ }
+
+ if (pgaio_my_backend->handed_out_io)
+ {
+ ereport(ERROR,
+ errmsg("API violation: Only one IO can be handed out"));
+ }
+
+ if (!dclist_is_empty(&pgaio_my_backend->idle_ios))
+ {
+ dlist_node *ion = dclist_pop_head_node(&pgaio_my_backend->idle_ios);
+ PgAioHandle *ioh = dclist_container(PgAioHandle, node, ion);
+
+ Assert(ioh->state == PGAIO_HS_IDLE);
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ pgaio_io_update_state(ioh, PGAIO_HS_HANDED_OUT);
+ pgaio_my_backend->handed_out_io = ioh;
+
+ if (resowner)
+ pgaio_io_resowner_register(ioh);
+
+ if (ret)
+ {
+ ioh->report_return = ret;
+ ret->result.status = ARS_UNKNOWN;
+ }
+
+ return ioh;
+ }
+
+ return NULL;
+}
+
+/*
+ * Release IO handle that turned out to not be required.
+ *
+ * See pgaio_io_acquire() for more details.
+ */
+void
+pgaio_io_release(PgAioHandle *ioh)
+{
+ if (ioh == pgaio_my_backend->handed_out_io)
+ {
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(ioh->resowner);
+
+ pgaio_my_backend->handed_out_io = NULL;
+ pgaio_io_reclaim(ioh);
+ }
+ else
+ {
+ elog(ERROR, "release in unexpected state");
+ }
+}
+
+/*
+ * Release IO handle during resource owner cleanup.
+ */
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+ PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+ Assert(ioh->resowner);
+
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+
+ switch (ioh->state)
+ {
+ case PGAIO_HS_IDLE:
+ elog(ERROR, "unexpected");
+ break;
+ case PGAIO_HS_HANDED_OUT:
+ Assert(ioh == pgaio_my_backend->handed_out_io || pgaio_my_backend->handed_out_io == NULL);
+
+ if (ioh == pgaio_my_backend->handed_out_io)
+ {
+ pgaio_my_backend->handed_out_io = NULL;
+ if (!on_error)
+ elog(WARNING, "leaked AIO handle");
+ }
+
+ pgaio_io_reclaim(ioh);
+ break;
+ case PGAIO_HS_DEFINED:
+ case PGAIO_HS_STAGED:
+ /* XXX: Should we warn about this when is_commit? */
+ pgaio_submit_staged();
+ break;
+ case PGAIO_HS_SUBMITTED:
+ case PGAIO_HS_COMPLETED_IO:
+ case PGAIO_HS_COMPLETED_SHARED:
+ case PGAIO_HS_COMPLETED_LOCAL:
+ /* this is expected to happen */
+ break;
+ }
+
+ /*
+ * Need to unregister the reporting of the IO's result; the memory it's
+ * referencing has likely gone away.
+ */
+ if (ioh->report_return)
+ ioh->report_return = NULL;
+}
+
+/*
+ * Add a [set of] flags to the IO.
+ *
+ * Note that this combines the flags with any already-set flags, rather than
+ * overwriting them with exactly the passed-in value. This is to allow
+ * multiple callsites to set flags.
+ */
+void
+pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+
+ ioh->flags |= flag;
+}
+
+int
+pgaio_io_get_id(PgAioHandle *ioh)
+{
+ Assert(ioh >= pgaio_ctl->io_handles &&
+ ioh < (pgaio_ctl->io_handles + pgaio_ctl->io_handle_count));
+ return ioh - pgaio_ctl->io_handles;
+}
+
+ProcNumber
+pgaio_io_get_owner(PgAioHandle *ioh)
+{
+ return ioh->owner_procno;
+}
+
+void
+pgaio_io_get_wref(PgAioHandle *ioh, PgAioWaitRef *iow)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT ||
+ ioh->state == PGAIO_HS_DEFINED ||
+ ioh->state == PGAIO_HS_STAGED);
+ Assert(ioh->generation != 0);
+
+ iow->aio_index = ioh - pgaio_ctl->io_handles;
+ iow->generation_upper = (uint32) (ioh->generation >> 32);
+ iow->generation_lower = (uint32) ioh->generation;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Internal Functions related to PgAioHandle
+ * --------------------------------------------------------------------------------
+ */
+
+static inline void
+pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state)
+{
+ pgaio_debug_io(DEBUG4, ioh,
+ "updating state to %s",
+ pgaio_io_state_get_name(new_state));
+
+ /*
+ * Ensure the changes signified by the new state are visible before the
+ * new state becomes visible.
+ */
+ pg_write_barrier();
+
+ ioh->state = new_state;
+}
+
+static void
+pgaio_io_resowner_register(PgAioHandle *ioh)
+{
+ Assert(!ioh->resowner);
+ Assert(CurrentResourceOwner);
+
+ ResourceOwnerRememberAioHandle(CurrentResourceOwner, &ioh->resowner_node);
+ ioh->resowner = CurrentResourceOwner;
+}
+
+/*
+ * Should only be called from pgaio_io_prep_*().
+ */
+void
+pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
+{
+ bool needs_synchronous;
+
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(pgaio_io_has_target(ioh));
+
+ ioh->op = op;
+ ioh->result = 0;
+
+ pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
+
+ /* allow a new IO to be staged */
+ pgaio_my_backend->handed_out_io = NULL;
+
+ pgaio_io_call_stage(ioh);
+
+ pgaio_io_update_state(ioh, PGAIO_HS_STAGED);
+
+ needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "prepared, executing synchronously: %d",
+ needs_synchronous);
+
+ if (!needs_synchronous)
+ {
+ pgaio_my_backend->staged_ios[pgaio_my_backend->num_staged_ios++] = ioh;
+ Assert(pgaio_my_backend->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+ }
+ else
+ {
+ pgaio_io_prepare_submit(ioh);
+ pgaio_io_perform_synchronously(ioh);
+ }
+}
+
+bool
+pgaio_io_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ if (ioh->flags & PGAIO_HF_SYNCHRONOUS)
+ {
+ /* XXX: should we also check if there are other IOs staged? */
+ return true;
+ }
+
+ if (pgaio_method_ops->needs_synchronous_execution)
+ return pgaio_method_ops->needs_synchronous_execution(ioh);
+ return false;
+}
+
+/*
+ * Handle IO being processed by IO method.
+ *
+ * Should be called by IO methods / synchronous IO execution, just before the
+ * IO is performed.
+ */
+void
+pgaio_io_prepare_submit(PgAioHandle *ioh)
+{
+ pgaio_io_update_state(ioh, PGAIO_HS_SUBMITTED);
+
+ dclist_push_tail(&pgaio_my_backend->in_flight_ios, &ioh->node);
+}
+
+/*
+ * Handle IO getting completed by a method.
+ *
+ * Should be called by IO methods / synchronous IO execution
+ */
+void
+pgaio_io_process_completion(PgAioHandle *ioh, int result)
+{
+ Assert(ioh->state == PGAIO_HS_SUBMITTED);
+
+ ioh->result = result;
+
+ pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_IO);
+
+ pgaio_io_call_complete_shared(ioh);
+
+ pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED);
+
+ /* condition variable broadcast ensures state is visible before wakeup */
+ ConditionVariableBroadcast(&ioh->cv);
+
+ if (ioh->owner_procno == MyProcNumber)
+ pgaio_io_reclaim(ioh);
+}
+
+bool
+pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
+{
+ *state = ioh->state;
+ pg_read_barrier();
+
+ return ioh->generation != ref_generation;
+}
+
+/*
+ * Wait for IO to complete. External code should never use this; outside of
+ * the AIO subsystem, waits are only allowed via pgaio_wref_wait().
+ */
+static void
+pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
+{
+ PgAioHandleState state;
+ bool am_owner;
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ if (am_owner)
+ {
+ if (state == PGAIO_HS_STAGED)
+ {
+ /* XXX: Arguably this should be prevented by callers? */
+ pgaio_submit_staged();
+ }
+ else if (state != PGAIO_HS_SUBMITTED
+ && state != PGAIO_HS_COMPLETED_IO
+ && state != PGAIO_HS_COMPLETED_SHARED
+ && state != PGAIO_HS_COMPLETED_LOCAL)
+ {
+ elog(PANIC, "waiting for own IO in wrong state: %d",
+ state);
+ }
+ }
+
+ while (true)
+ {
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ switch (state)
+ {
+ case PGAIO_HS_IDLE:
+ case PGAIO_HS_HANDED_OUT:
+ elog(ERROR, "IO in wrong state: %d", state);
+ break;
+
+ case PGAIO_HS_SUBMITTED:
+
+ /*
+ * If we need to wait via the IO method, do so now. Don't
+ * check via the IO method if the issuing backend is executing
+ * the IO synchronously.
+ */
+ if (pgaio_method_ops->wait_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS))
+ {
+ pgaio_method_ops->wait_one(ioh, ref_generation);
+ continue;
+ }
+ /* fallthrough */
+
+ /* waiting for owner to submit */
+ case PGAIO_HS_DEFINED:
+ case PGAIO_HS_STAGED:
+ /* waiting for reaper to complete */
+ /* fallthrough */
+ case PGAIO_HS_COMPLETED_IO:
+ /* shouldn't be able to hit this otherwise */
+ Assert(IsUnderPostmaster);
+ /* ensure we're going to get woken up */
+ ConditionVariablePrepareToSleep(&ioh->cv);
+
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
+ {
+ if (state == PGAIO_HS_COMPLETED_SHARED ||
+ state == PGAIO_HS_COMPLETED_LOCAL)
+ break;
+ ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_COMPLETION);
+ }
+
+ ConditionVariableCancelSleep();
+ break;
+
+ case PGAIO_HS_COMPLETED_SHARED:
+ case PGAIO_HS_COMPLETED_LOCAL:
+ /* see above */
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return;
+ }
+ }
+}
+
+static void
+pgaio_io_reclaim(PgAioHandle *ioh)
+{
+ /* This is only ok if it's our IO */
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ pgaio_debug_io(DEBUG4, ioh,
+ "reclaiming, result: %d, distilled_result: AFIXME, report to: %p",
+ ioh->result,
+ ioh->report_return);
+
+ if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
+ {
+ pgaio_io_call_complete_local(ioh);
+ pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_LOCAL);
+ }
+
+ /* if the IO has been defined, we might need to do more work */
+ if (ioh->state != PGAIO_HS_HANDED_OUT)
+ {
+ dclist_delete_from(&pgaio_my_backend->in_flight_ios, &ioh->node);
+
+ if (ioh->report_return)
+ {
+ ioh->report_return->result = ioh->distilled_result;
+ ioh->report_return->target_data = ioh->target_data;
+ }
+ }
+
+ if (ioh->resowner)
+ {
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+ }
+
+ Assert(!ioh->resowner);
+
+ ioh->op = PGAIO_OP_INVALID;
+ ioh->target = PGAIO_TID_INVALID;
+ ioh->flags = 0;
+ ioh->num_shared_callbacks = 0;
+ ioh->handle_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->result = 0;
+ ioh->distilled_result.status = ARS_UNKNOWN;
+
+ /* XXX: the barrier is probably superfluous */
+ pg_write_barrier();
+ ioh->generation++;
+
+ pgaio_io_update_state(ioh, PGAIO_HS_IDLE);
+
+ /*
+ * We push the IO to the head of the idle IO list; that seems more
+ * cache-efficient in cases where only a few IOs are used.
+ */
+ dclist_push_head(&pgaio_my_backend->idle_ios, &ioh->node);
+}
+
+static void
+pgaio_io_wait_for_free(void)
+{
+ int reclaimed = 0;
+
+ pgaio_debug(DEBUG2, "waiting for self with %d pending",
+ pgaio_my_backend->num_staged_ios);
+
+ /*
+ * First check if any of our IOs have actually completed - when using the
+ * worker method, that'll often be the case. We could do so as part of the
+ * loop below, but then we might end up waiting for an IO submitted
+ * earlier even though another one has already completed.
+ */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &pgaio_ctl->io_handles[pgaio_my_backend->io_handle_off + i];
+
+ if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
+ {
+ pgaio_io_reclaim(ioh);
+ reclaimed++;
+ }
+ }
+
+ if (reclaimed > 0)
+ return;
+
+ /*
+ * If we have any unsubmitted IOs, submit them now. We'll start waiting in
+ * a moment, so it's better if they're in flight. This also addresses the
+ * edge case that all IOs are unsubmitted.
+ */
+ if (pgaio_my_backend->num_staged_ios > 0)
+ {
+ pgaio_submit_staged();
+ }
+
+ /*
+ * By now there must be at least one IO in flight; otherwise there'd be
+ * nothing to wait for and no way to ever free up an IO.
+ */
+ if (dclist_count(&pgaio_my_backend->in_flight_ios) == 0)
+ {
+ elog(ERROR, "no free IOs despite no in-flight IOs");
+ }
+
+ /*
+ * Wait for the oldest in-flight IO to complete.
+ *
+ * XXX: Reusing the general IO wait is suboptimal, we don't need to wait
+ * for that specific IO to complete, we just need *any* IO to complete.
+ */
+ {
+ PgAioHandle *ioh = dclist_head_element(PgAioHandle, node, &pgaio_my_backend->in_flight_ios);
+
+ switch (ioh->state)
+ {
+ /* should not be in in-flight list */
+ case PGAIO_HS_IDLE:
+ case PGAIO_HS_DEFINED:
+ case PGAIO_HS_HANDED_OUT:
+ case PGAIO_HS_STAGED:
+ case PGAIO_HS_COMPLETED_LOCAL:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+
+ case PGAIO_HS_COMPLETED_IO:
+ case PGAIO_HS_SUBMITTED:
+ pgaio_debug_io(DEBUG2, ioh,
+ "waiting for free io with %d in flight",
+ dclist_count(&pgaio_my_backend->in_flight_ios));
+
+ /*
+ * In a more general case this would be racy, because the
+ * generation could increase after we read ioh->state above.
+ * But we are only looking at IOs by the current backend and
+ * the IO can only be recycled by this backend.
+ */
+ pgaio_io_wait(ioh, ioh->generation);
+ break;
+
+ case PGAIO_HS_COMPLETED_SHARED:
+ /* it's possible that another backend just finished this IO */
+ pgaio_io_reclaim(ioh);
+ break;
+ }
+
+ if (dclist_count(&pgaio_my_backend->idle_ios) == 0)
+ elog(PANIC, "no idle IOs after waiting");
+ return;
+ }
+}
+
+/*
+ * Internal - code outside of AIO should never need this, and it'd be hard
+ * for such code to use it safely.
+ */
+static PgAioHandle *
+pgaio_io_from_wref(PgAioWaitRef *iow, uint64 *ref_generation)
+{
+ PgAioHandle *ioh;
+
+ Assert(iow->aio_index < pgaio_ctl->io_handle_count);
+
+ ioh = &pgaio_ctl->io_handles[iow->aio_index];
+
+ *ref_generation = ((uint64) iow->generation_upper) << 32 |
+ iow->generation_lower;
+
+ Assert(*ref_generation != 0);
+
+ return ioh;
+}
+
+static const char *
+pgaio_io_state_get_name(PgAioHandleState s)
+{
+#define PGAIO_HS_TOSTR_CASE(sym) case PGAIO_HS_##sym: return #sym
+ switch (s)
+ {
+ PGAIO_HS_TOSTR_CASE(IDLE);
+ PGAIO_HS_TOSTR_CASE(HANDED_OUT);
+ PGAIO_HS_TOSTR_CASE(DEFINED);
+ PGAIO_HS_TOSTR_CASE(STAGED);
+ PGAIO_HS_TOSTR_CASE(SUBMITTED);
+ PGAIO_HS_TOSTR_CASE(COMPLETED_IO);
+ PGAIO_HS_TOSTR_CASE(COMPLETED_SHARED);
+ PGAIO_HS_TOSTR_CASE(COMPLETED_LOCAL);
+ }
+#undef PGAIO_HS_TOSTR_CASE
+
+ return NULL; /* silence compiler */
+}
+
+const char *
+pgaio_io_get_state_name(PgAioHandle *ioh)
+{
+ return pgaio_io_state_get_name(ioh->state);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Functions primarily related to IO Wait References
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_wref_clear(PgAioWaitRef *iow)
+{
+ iow->aio_index = PG_UINT32_MAX;
+}
+
+bool
+pgaio_wref_valid(PgAioWaitRef *iow)
+{
+ return iow->aio_index != PG_UINT32_MAX;
+}
+
+int
+pgaio_wref_get_id(PgAioWaitRef *iow)
+{
+ Assert(pgaio_wref_valid(iow));
+ return iow->aio_index;
+}
+
+/*
+ * Wait for the IO to have completed.
+ */
+void
+pgaio_wref_wait(PgAioWaitRef *iow)
+{
+ uint64 ref_generation;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_wref(iow, &ref_generation);
+
+ pgaio_io_wait(ioh, ref_generation);
+}
+
+/*
+ * Check if the referenced IO completed, without blocking.
+ */
+bool
+pgaio_wref_check_done(PgAioWaitRef *iow)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_wref(iow, &ref_generation);
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return true;
+
+ if (state == PGAIO_HS_IDLE)
+ return true;
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (state == PGAIO_HS_COMPLETED_SHARED ||
+ state == PGAIO_HS_COMPLETED_LOCAL)
+ {
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return true;
+ }
+
+ return false;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_submit_staged(void)
+{
+ int total_submitted = 0;
+ int did_submit;
+
+ if (pgaio_my_backend->num_staged_ios == 0)
+ return;
+
+
+ START_CRIT_SECTION();
+
+ did_submit = pgaio_method_ops->submit(pgaio_my_backend->num_staged_ios,
+ pgaio_my_backend->staged_ios);
+
+ END_CRIT_SECTION();
+
+ total_submitted += did_submit;
+
+ Assert(total_submitted == did_submit);
+
+ pgaio_my_backend->num_staged_ios = 0;
+
+ pgaio_debug(DEBUG4,
+ "aio: submitted %d IOs",
+ total_submitted);
+}
+
+bool
+pgaio_have_staged(void)
+{
+ return pgaio_my_backend->num_staged_ios > 0;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Need to submit IOs that are staged but not yet submitted and that use the
+ * fd; otherwise the IO would end up targeting something bogus.
+ */
+void
+pgaio_closing_fd(int fd)
+{
+ /*
+ * Might be called before AIO is initialized or in a subprocess that
+ * doesn't use AIO.
+ */
+ if (!pgaio_my_backend)
+ return;
+
+ /*
+ * For now just submit all staged IOs - we could be more selective, but
+ * it's probably not worth it.
+ */
+ pgaio_submit_staged();
+}
+
+void
+pgaio_at_xact_end(bool is_subxact, bool is_commit)
+{
+ Assert(!pgaio_my_backend->handed_out_io);
+}
+
+/*
+ * Similar to pgaio_at_xact_end(..., is_commit = false), but for cases where
+ * errors happen outside of transactions.
+ */
+void
+pgaio_at_error(void)
+{
+ Assert(!pgaio_my_backend->handed_out_io);
+}
+
+void
+pgaio_shutdown(int code, Datum arg)
+{
+ Assert(pgaio_my_backend);
+ Assert(!pgaio_my_backend->handed_out_io);
+
+ /*
+ * Before exiting, make sure that all IOs are finished. That has two main
+ * purposes:
+ *
+ * - It's somewhat annoying to see partially finished IOs in stats views
+ * etc.
+ *
+ * - It's rumored that some kernel-level AIO mechanisms don't deal well
+ * with the issuer of an AIO exiting before the IO completes.
+ */
+
+ while (!dclist_is_empty(&pgaio_my_backend->in_flight_ios))
+ {
+ PgAioHandle *ioh = dclist_head_element(PgAioHandle, node, &pgaio_my_backend->in_flight_ios);
+
+ /* see comment in pgaio_io_wait_for_free() about raciness */
+ pgaio_io_wait(ioh, ioh->generation);
+ }
+
+ pgaio_my_backend = NULL;
+}
void
assign_io_method(int newval, void *extra)
{
+ Assert(newval < lengthof(io_method_options));
+ Assert(pgaio_method_ops_table[newval] != NULL);
+
+ pgaio_method_ops = pgaio_method_ops_table[newval];
}
diff --git a/src/backend/storage/aio/aio_callback.c b/src/backend/storage/aio/aio_callback.c
new file mode 100644
index 00000000000..93f71690169
--- /dev/null
+++ b/src/backend/storage/aio/aio_callback.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_callback.c
+ * AIO - Functionality related to callbacks that can be registered on IO
+ * Handles
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_callback.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "utils/memutils.h"
+
+
+/* just to have something to put into the aio_handle_cbs */
+static const struct PgAioHandleCallbacks aio_invalid_cb = {0};
+
+typedef struct PgAioHandleCallbacksEntry
+{
+ const PgAioHandleCallbacks *const cb;
+ const char *const name;
+} PgAioHandleCallbacksEntry;
+
+/*
+ * Callback definition for the callbacks that can be registered on an IO
+ * handle. See PgAioHandleCallbackID's definition for an explanation for why
+ * callbacks are not identified by a pointer.
+ */
+static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
+#define CALLBACK_ENTRY(id, callback) [id] = {.cb = &callback, .name = #callback}
+ CALLBACK_ENTRY(PGAIO_HCB_INVALID, aio_invalid_cb),
+#undef CALLBACK_ENTRY
+};
+
+
+
+/*
+ * Register callback for the IO handle.
+ *
+ * Only a limited number (PGAIO_HANDLE_MAX_CALLBACKS) of callbacks can be
+ * registered for each IO.
+ *
+ * Callbacks need to be registered before [indirectly] calling
+ * pgaio_io_prep_*(), as the IO may be executed immediately.
+ *
+ *
+ * Note that callbacks are executed in critical sections. This is necessary
+ * to be able to execute IO in critical sections (consider e.g. WAL
+ * logging). To perform AIO we first need to acquire a handle, which, if there
+ * are no free handles, requires waiting for IOs to complete and to execute
+ * their completion callbacks.
+ *
+ * Callbacks may be executed in the issuing backend but also in another
+ * backend (because that backend is waiting for the IO) or in IO workers (if
+ * io_method=worker is used).
+ *
+ *
+ * See PgAioHandleCallbackID's definition for an explanation for why
+ * callbacks are not identified by a pointer.
+ */
+void
+pgaio_io_register_callbacks(PgAioHandle *ioh, PgAioHandleCallbackID cbid)
+{
+ const PgAioHandleCallbacksEntry *ce;
+
+ if (cbid >= lengthof(aio_handle_cbs))
+ elog(ERROR, "callback %d is out of range", cbid);
+ ce = &aio_handle_cbs[cbid];
+ if (ce->cb->complete_shared == NULL &&
+ ce->cb->complete_local == NULL)
+ elog(ERROR, "callback %d does not have a completion callback", cbid);
+ if (ioh->num_shared_callbacks >= PGAIO_HANDLE_MAX_CALLBACKS)
+ elog(PANIC, "too many callbacks, the max is %d", PGAIO_HANDLE_MAX_CALLBACKS);
+ ioh->shared_callbacks[ioh->num_shared_callbacks] = cbid;
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "adding cb #%d, id %d/%s",
+ ioh->num_shared_callbacks + 1,
+ cbid, ce->name);
+
+ ioh->num_shared_callbacks++;
+}
+
+/*
+ * Associate an array of data with the Handle. This is e.g. useful to
+ * transport knowledge about which buffers a multi-block IO affects to
+ * completion callbacks.
+ *
+ * Right now this can be done only once for each IO, even though multiple
+ * callbacks can be registered. There aren't any known use cases requiring
+ * more, and the required amount of shared memory does add up, so it doesn't
+ * seem worth multiplying memory usage by PGAIO_HANDLE_MAX_CALLBACKS.
+ */
+void
+pgaio_io_set_handle_data_64(PgAioHandle *ioh, uint64 *data, uint8 len)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(ioh->handle_data_len == 0);
+ Assert(len <= PG_IOV_MAX);
+
+ for (int i = 0; i < len; i++)
+ pgaio_ctl->handle_data[ioh->iovec_off + i] = data[i];
+ ioh->handle_data_len = len;
+}
+
+/*
+ * Convenience version of pgaio_io_set_handle_data_64() that converts a
+ * 32-bit array to a 64-bit array. Without it, callers would need to
+ * open-code the equivalent conversion.
+ */
+void
+pgaio_io_set_handle_data_32(PgAioHandle *ioh, uint32 *data, uint8 len)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(ioh->handle_data_len == 0);
+ Assert(len <= PG_IOV_MAX);
+
+ for (int i = 0; i < len; i++)
+ pgaio_ctl->handle_data[ioh->iovec_off + i] = data[i];
+ ioh->handle_data_len = len;
+}
+
+/*
+ * Return data set with pgaio_io_set_handle_data_*().
+ */
+uint64 *
+pgaio_io_get_handle_data(PgAioHandle *ioh, uint8 *len)
+{
+ Assert(ioh->handle_data_len > 0);
+
+ *len = ioh->handle_data_len;
+
+ return &pgaio_ctl->handle_data[ioh->iovec_off];
+}
+
+/*
+ * Internal function which invokes ->stage for all the registered callbacks.
+ */
+void
+pgaio_io_call_stage(PgAioHandle *ioh)
+{
+ Assert(ioh->target > PGAIO_TID_INVALID && ioh->target < PGAIO_TID_COUNT);
+ Assert(ioh->op > PGAIO_OP_INVALID && ioh->op < PGAIO_OP_COUNT);
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleCallbacksEntry *ce = &aio_handle_cbs[cbid];
+
+ if (!ce->cb->stage)
+ continue;
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "calling cb #%d %d/%s->stage",
+ i, cbid, ce->name);
+ ce->cb->stage(ioh);
+ }
+}
+
+/*
+ * Internal function which invokes ->complete_shared for all the registered
+ * callbacks.
+ */
+void
+pgaio_io_call_complete_shared(PgAioHandle *ioh)
+{
+ PgAioResult result;
+
+ START_CRIT_SECTION();
+
+ Assert(ioh->target > PGAIO_TID_INVALID && ioh->target < PGAIO_TID_COUNT);
+ Assert(ioh->op > PGAIO_OP_INVALID && ioh->op < PGAIO_OP_COUNT);
+
+ result.status = ARS_OK; /* low level IO is always considered OK */
+ result.result = ioh->result;
+ result.id = PGAIO_HCB_INVALID;
+ result.error_data = 0;
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleCallbacksEntry *ce = &aio_handle_cbs[cbid];
+
+ if (!ce->cb->complete_shared)
+ continue;
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "calling cb #%d, id %d/%s->complete_shared with distilled result status %d, id %u, error_data: %d, result: %d",
+ i, cbid, ce->name,
+ result.status, result.id, result.error_data, result.result);
+ result = ce->cb->complete_shared(ioh, result);
+ }
+
+ ioh->distilled_result = result;
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "distilled result status %d, id %u, error_data: %d, result: %d, raw_result %d",
+ result.status, result.id, result.error_data, result.result,
+ ioh->result);
+
+ END_CRIT_SECTION();
+}
+
+
+/*
+ * Internal function which invokes ->complete_local for all the registered
+ * callbacks.
+ *
+ * XXX: It'd be nice to deduplicate with pgaio_io_call_complete_shared().
+ */
+void
+pgaio_io_call_complete_local(PgAioHandle *ioh)
+{
+ PgAioResult result;
+
+ START_CRIT_SECTION();
+
+ Assert(ioh->target > PGAIO_TID_INVALID && ioh->target < PGAIO_TID_COUNT);
+ Assert(ioh->op > PGAIO_OP_INVALID && ioh->op < PGAIO_OP_COUNT);
+
+ /* start with distilled result from shared callback */
+ result = ioh->distilled_result;
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleCallbacksEntry *ce = &aio_handle_cbs[cbid];
+
+ if (!ce->cb->complete_local)
+ continue;
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "calling cb #%d, id %d/%s->complete_local with distilled result status %d, id %u, error_data: %d, result: %d",
+ i, cbid, ce->name,
+ result.status, result.id, result.error_data, result.result);
+ result = ce->cb->complete_local(ioh, result);
+ }
+
+ /*
+ * Note that we don't save the result in ioh->distilled_result, the local
+ * callback's result should not ever matter to other waiters.
+ */
+ pgaio_debug_io(DEBUG3, ioh,
+ "distilled result status %d, id %u, error_data: %d, result: %d, raw_result %d",
+ result.status, result.id, result.error_data, result.result,
+ ioh->result);
+
+ END_CRIT_SECTION();
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_result_report(PgAioResult result, const PgAioTargetData *target_data, int elevel)
+{
+ PgAioHandleCallbackID cbid = result.id;
+ const PgAioHandleCallbacksEntry *ce = &aio_handle_cbs[cbid];
+
+ Assert(result.status != ARS_UNKNOWN);
+ Assert(result.status != ARS_OK);
+
+ if (ce->cb->report == NULL)
+ elog(ERROR, "callback %d/%s does not have report callback",
+ result.id, ce->name);
+
+ ce->cb->report(result, target_data, elevel);
+}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index f7ee8270756..0e98cc0c8fb 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -14,24 +14,210 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/aio_init.h"
+#include "storage/aio_internal.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/guc.h"
+static Size
+AioCtlShmemSize(void)
+{
+ Size sz;
+
+ /* pgaio_ctl itself */
+ sz = offsetof(PgAioCtl, io_handles);
+
+ return sz;
+}
+
+static uint32
+AioProcs(void)
+{
+ return MaxBackends + NUM_AUXILIARY_PROCS;
+}
+
+static Size
+AioBackendShmemSize(void)
+{
+ return mul_size(AioProcs(), sizeof(PgAioBackend));
+}
+
+static Size
+AioHandleShmemSize(void)
+{
+ Size sz;
+
+ /* ios */
+ sz = mul_size(AioProcs(),
+ mul_size(io_max_concurrency, sizeof(PgAioHandle)));
+
+ return sz;
+}
+
+static Size
+AioHandleIOVShmemSize(void)
+{
+ return mul_size(sizeof(struct iovec),
+ mul_size(mul_size(PG_IOV_MAX, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioHandleDataShmemSize(void)
+{
+ return mul_size(sizeof(uint64),
+ mul_size(mul_size(PG_IOV_MAX, AioProcs()),
+ io_max_concurrency));
+}
+
+/*
+ * Choose a suitable value for io_max_concurrency.
+ *
+ * It's unlikely that we could have more IOs in flight than buffers that we
+ * would be allowed to pin.
+ *
+ * On the upper end, apply a cap too - just because shared_buffers is large,
+ * it doesn't make sense to have millions of buffers undergo IO concurrently.
+ */
+static int
+AioChooseMaxConccurrency(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 64);
+}
+
Size
AioShmemSize(void)
{
Size sz = 0;
+ /*
+ * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
+ * However, if the DBA explicitly set io_max_concurrency = -1 in the
+ * config file, then PGC_S_DYNAMIC_DEFAULT will fail to override that and
+ * we must force the matter with PGC_S_OVERRIDE.
+ */
+ if (io_max_concurrency == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseMaxConccurrency());
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_max_concurrency == -1) /* failed to apply it? */
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+ sz = add_size(sz, AioCtlShmemSize());
+ sz = add_size(sz, AioBackendShmemSize());
+ sz = add_size(sz, AioHandleShmemSize());
+ sz = add_size(sz, AioHandleIOVShmemSize());
+ sz = add_size(sz, AioHandleDataShmemSize());
+
+ if (pgaio_method_ops->shmem_size)
+ sz = add_size(sz, pgaio_method_ops->shmem_size());
+
return sz;
}
void
AioShmemInit(void)
{
+ bool found;
+ uint32 io_handle_off = 0;
+ uint32 iovec_off = 0;
+ uint32 per_backend_iovecs = io_max_concurrency * PG_IOV_MAX;
+
+ pgaio_ctl = (PgAioCtl *)
+ ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
+
+ if (found)
+ goto out;
+
+ memset(pgaio_ctl, 0, AioCtlShmemSize());
+
+ pgaio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
+ pgaio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+
+ pgaio_ctl->backend_state = (PgAioBackend *)
+ ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
+
+ pgaio_ctl->io_handles = (PgAioHandle *)
+ ShmemInitStruct("AioHandle", AioHandleShmemSize(), &found);
+
+ pgaio_ctl->iovecs = (struct iovec *)
+ ShmemInitStruct("AioHandleIOV", AioHandleIOVShmemSize(), &found);
+ pgaio_ctl->handle_data = (uint64 *)
+ ShmemInitStruct("AioHandleData", AioHandleDataShmemSize(), &found);
+
+ for (int procno = 0; procno < AioProcs(); procno++)
+ {
+ PgAioBackend *bs = &pgaio_ctl->backend_state[procno];
+
+ bs->io_handle_off = io_handle_off;
+ io_handle_off += io_max_concurrency;
+
+ dclist_init(&bs->idle_ios);
+ memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+ dclist_init(&bs->in_flight_ios);
+
+ /* initialize per-backend IOs */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &pgaio_ctl->io_handles[bs->io_handle_off + i];
+
+ ioh->generation = 1;
+ ioh->owner_procno = procno;
+ ioh->iovec_off = iovec_off;
+ ioh->handle_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->resowner = NULL;
+ ioh->num_shared_callbacks = 0;
+ ioh->distilled_result.status = ARS_UNKNOWN;
+ ioh->flags = 0;
+
+ ConditionVariableInit(&ioh->cv);
+
+ dclist_push_tail(&bs->idle_ios, &ioh->node);
+ iovec_off += PG_IOV_MAX;
+ }
+ }
+
+out:
+ /* Initialize IO method specific resources. */
+ if (pgaio_method_ops->shmem_init)
+ pgaio_method_ops->shmem_init(!found);
}
void
pgaio_init_backend(void)
{
+ /* shouldn't be initialized twice */
+ Assert(!pgaio_my_backend);
+
+ if (MyProc == NULL || MyProcNumber >= AioProcs())
+ elog(ERROR, "aio requires a normal PGPROC");
+
+ pgaio_my_backend = &pgaio_ctl->backend_state[MyProcNumber];
+
+ if (pgaio_method_ops->init_backend)
+ pgaio_method_ops->init_backend();
+
+ before_shmem_exit(pgaio_shutdown, 0);
}
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
new file mode 100644
index 00000000000..bb010d6152c
--- /dev/null
+++ b/src/backend/storage/aio/aio_io.c
@@ -0,0 +1,175 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_io.c
+ * AIO - Low Level IO Handling
+ *
+ * Functions related to associating IO operations to IO Handles and IO-method
+ * independent support functions for actually performing IO.
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_io.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "utils/wait_event.h"
+
+
+static void pgaio_io_before_prep(PgAioHandle *ioh);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Public IO related functions operating on IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Scatter/gather IO needs to associate an iovec with the Handle. To support
+ * worker mode this data needs to be in shared memory.
+ *
+ * XXX: Right now the amount of space available for each IO is
+ * PG_IOV_MAX. While it's tempting to use the io_combine_limit GUC, that's
+ * PGC_USERSET, so we can't allocate shared memory based on that.
+ */
+int
+pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+
+ *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+
+ return PG_IOV_MAX;
+}
+
+PgAioOpData *
+pgaio_io_get_op_data(PgAioHandle *ioh)
+{
+ return &ioh->op_data;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * "Preparation" routines for individual IO operations
+ *
+ * These are called by the code actually initiating an IO, to associate the IO
+ * specific data with an AIO handle.
+ *
+ * Each of the preparation routines first needs to call
+ * pgaio_io_before_prep(), then fill the IO-specific fields in the handle,
+ * and finally call pgaio_io_stage().
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.read.fd = fd;
+ ioh->op_data.read.offset = offset;
+ ioh->op_data.read.iov_length = iovcnt;
+
+ pgaio_io_stage(ioh, PGAIO_OP_READV);
+}
+
+void
+pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.write.fd = fd;
+ ioh->op_data.write.offset = offset;
+ ioh->op_data.write.iov_length = iovcnt;
+
+ pgaio_io_stage(ioh, PGAIO_OP_WRITEV);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Internal IO related functions operating on IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Execute IO operation synchronously. This is implemented here, not in
+ * method_sync.c, because other IO methods might also use it / fall back to it.
+ */
+void
+pgaio_io_perform_synchronously(PgAioHandle *ioh)
+{
+ ssize_t result = 0;
+ struct iovec *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+
+ /* Perform IO. */
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READV:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_READ);
+ result = pg_preadv(ioh->op_data.read.fd, iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_WRITEV:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_WRITE);
+ result = pg_pwritev(ioh->op_data.write.fd, iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_INVALID:
+ elog(ERROR, "trying to execute invalid IO operation");
+ }
+
+ ioh->result = result < 0 ? -errno : result;
+
+ pgaio_io_process_completion(ioh, ioh->result);
+}
+
+/*
+ * Helper function to be called by IO operation preparation functions, before
+ * any data in the handle is set. Mostly to centralize assertions.
+ */
+static void
+pgaio_io_before_prep(PgAioHandle *ioh)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(pgaio_io_has_target(ioh));
+ Assert(ioh->op == PGAIO_OP_INVALID);
+}
+
+/*
+ * Could be made part of the public interface, but it's not clear there's
+ * really a use case for that.
+ */
+const char *
+pgaio_io_get_op_name(PgAioHandle *ioh)
+{
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_INVALID:
+ return "invalid";
+ case PGAIO_OP_READV:
+ return "read";
+ case PGAIO_OP_WRITEV:
+ return "write";
+ }
+
+ return NULL; /* silence compiler */
+}
diff --git a/src/backend/storage/aio/aio_target.c b/src/backend/storage/aio/aio_target.c
new file mode 100644
index 00000000000..15428968e58
--- /dev/null
+++ b/src/backend/storage/aio/aio_target.c
@@ -0,0 +1,108 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_target.c
+ * AIO - Functionality related to executing IO for different targets
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_target.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+
+
+/*
+ * Registry for entities that can be the target of AIO.
+ *
+ * To support execution by worker processes, the file descriptor for an IO
+ * may need to be reopened in a different process. This is done via the
+ * PgAioTargetInfo.reopen callback.
+ */
+static const PgAioTargetInfo *pgaio_target_info[] = {
+ [PGAIO_TID_INVALID] = &(PgAioTargetInfo) {
+ .name = "invalid",
+ },
+};
+
+
+
+bool
+pgaio_io_has_target(PgAioHandle *ioh)
+{
+ return ioh->target != PGAIO_TID_INVALID;
+}
+
+/*
+ * Return the name for the target associated with the IO. Mostly useful for
+ * debugging/logging.
+ */
+const char *
+pgaio_io_get_target_name(PgAioHandle *ioh)
+{
+ Assert(ioh->target >= 0 && ioh->target < PGAIO_TID_COUNT);
+
+ return pgaio_target_info[ioh->target]->name;
+}
+
+/*
+ * Assign a target to the IO.
+ *
+ * This has to be called exactly once before pgaio_io_prep_*() is called.
+ */
+void
+pgaio_io_set_target(PgAioHandle *ioh, PgAioTargetID targetid)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(ioh->target == PGAIO_TID_INVALID);
+
+ ioh->target = targetid;
+}
+
+PgAioTargetData *
+pgaio_io_get_target_data(PgAioHandle *ioh)
+{
+ return &ioh->target_data;
+}
+
+/*
+ * Return a stringified description of the IO's target.
+ *
+ * The string is localized and allocated in the current memory context.
+ */
+char *
+pgaio_io_get_target_description(PgAioHandle *ioh)
+{
+ return pgaio_target_info[ioh->target]->describe_identity(&ioh->target_data);
+}
+
+/*
+ * Internal: Check if pgaio_io_reopen() is available for the IO.
+ */
+bool
+pgaio_io_can_reopen(PgAioHandle *ioh)
+{
+ return pgaio_target_info[ioh->target]->reopen != NULL;
+}
+
+/*
+ * Internal: Before executing an IO outside of the context of the process the
+ * IO has been prepared in, the file descriptor has to be reopened - any FD
+ * referenced in the IO itself won't be valid in the separate process.
+ */
+void
+pgaio_io_reopen(PgAioHandle *ioh)
+{
+ Assert(ioh->target >= 0 && ioh->target < PGAIO_TID_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ pgaio_target_info[ioh->target]->reopen(ioh);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index c822fd4ddf7..2c26089d52e 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -2,6 +2,10 @@
backend_sources += files(
'aio.c',
+ 'aio_callback.c',
'aio_init.c',
+ 'aio_io.c',
+ 'aio_target.c',
+ 'method_sync.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_sync.c b/src/backend/storage/aio/method_sync.c
new file mode 100644
index 00000000000..43f9c8bd0b3
--- /dev/null
+++ b/src/backend/storage/aio/method_sync.c
@@ -0,0 +1,47 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_sync.c
+ * AIO - perform "AIO" by executing it synchronously
+ *
+ * This method mainly exists to check whether AIO use causes regressions. Other IO
+ * methods might also fall back to the synchronous method for functionality
+ * they cannot provide.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+
+static bool pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_sync_ops = {
+ .needs_synchronous_execution = pgaio_sync_needs_synchronous_execution,
+ .submit = pgaio_sync_submit,
+};
+
+
+
+static bool
+pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return true;
+}
+
+static int
+pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ elog(ERROR, "should be unreachable");
+
+ return 0;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..b5d3dcbf1e9 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -191,6 +191,9 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
+AIO_SUBMIT "Waiting for AIO submission."
+AIO_DRAIN "Waiting for IOs to finish."
+AIO_COMPLETION "Waiting for completion callback."
BASEBACKUP_READ "Waiting for base backup to read from a file."
BASEBACKUP_SYNC "Waiting for data written by a base backup to reach durable storage."
BASEBACKUP_WRITE "Waiting for base backup to write to a file."
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index ac5ca4a765e..e5d852b5ee6 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -47,6 +47,8 @@
#include "common/hashfn.h"
#include "common/int.h"
+#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -155,6 +157,12 @@ struct ResourceOwnerData
/* The local locks cache. */
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+
+ /*
+ * AIO handles need to be registered in critical sections and therefore
+ * cannot use the normal ResourceElem mechanism.
+ */
+ dlist_head aio_handles;
};
@@ -425,6 +433,8 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
parent->firstchild = owner;
}
+ dlist_init(&owner->aio_handles);
+
return owner;
}
@@ -725,6 +735,14 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* so issue warnings. In the abort case, just clean up quietly.
*/
ResourceOwnerReleaseAll(owner, phase, isCommit);
+
+ /* XXX: Could probably be a later phase? */
+ while (!dlist_is_empty(&owner->aio_handles))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_handles);
+
+ pgaio_io_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1082,3 +1100,15 @@ ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
elog(ERROR, "lock reference %p is not owned by resource owner %s",
locallock, owner->name);
}
+
+void
+ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_handles, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3bec090428d..c7f34559b1b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1267,6 +1267,7 @@ InvalMessageArray
InvalidationInfo
InvalidationMsgsGroup
IoMethod
+IoMethodOps
IpcMemoryId
IpcMemoryKey
IpcMemoryState
@@ -2105,6 +2106,26 @@ Permutation
PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
+PgAioBackend
+PgAioCtl
+PgAioHandle
+PgAioHandleCallbackID
+PgAioHandleCallbackStage
+PgAioHandleCallbackComplete
+PgAioHandleCallbackReport
+PgAioHandleCallbacks
+PgAioHandleCallbacksEntry
+PgAioHandleFlags
+PgAioHandleState
+PgAioOp
+PgAioOpData
+PgAioResult
+PgAioResultStatus
+PgAioReturn
+PgAioTargetData
+PgAioTargetID
+PgAioTargetInfo
+PgAioWaitRef
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.48.1.76.g4e746b1a31.dirty
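As an aside, the synchronous fallback in the patch above (pgaio_io_perform_synchronously) boils down to a single preadv()/pwritev() over the handle's iovec, with failures reported as a negative errno. A minimal stand-alone sketch of that primitive, using a hypothetical scatter_read() helper that is not part of the patch:

```c
/*
 * Sketch of the scatter/gather primitive used for PGAIO_OP_READV: one
 * preadv() call filling multiple buffers, returning -errno on failure
 * (the patch's result convention). scatter_read() is hypothetical,
 * local to this example.
 */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t
scatter_read(int fd, void *a, size_t alen, void *b, size_t blen, off_t off)
{
	struct iovec iov[2];

	iov[0].iov_base = a;
	iov[0].iov_len = alen;
	iov[1].iov_base = b;
	iov[1].iov_len = blen;

	/* a failed read is reported as a negative errno, like in the patch */
	ssize_t		n = preadv(fd, iov, 2, off);

	return n < 0 ? -errno : n;
}
```

This is also why the iovec has to live in shared memory for worker mode: whichever process ends up issuing the preadv() needs to see the same iovec array.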
v2.3-0012-aio-Skeleton-IO-worker-infrastructure.patch
From 5e84720afa46fdfd892a8bac36585f0f7a29d3f3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:43:40 -0500
Subject: [PATCH v2.3 12/30] aio: Skeleton IO worker infrastructure
This doesn't do anything useful on its own, but the code that needs to be
touched is independent of other changes.
Remarks:
- should completely get rid of ID assignment logic in postmaster.c
- postmaster.c badly needs a refactoring.
- dynamic increase / decrease of workers based on IO load
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/miscadmin.h | 2 +
src/include/postmaster/postmaster.h | 1 +
src/include/storage/aio_init.h | 2 +
src/include/storage/io_worker.h | 22 +++
src/include/storage/proc.h | 4 +-
src/backend/postmaster/launch_backend.c | 2 +
src/backend/postmaster/pmchild.c | 1 +
src/backend/postmaster/postmaster.c | 169 ++++++++++++++++--
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio_init.c | 7 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_worker.c | 86 +++++++++
src/backend/tcop/postgres.c | 2 +
src/backend/utils/activity/pgstat_backend.c | 1 +
src/backend/utils/activity/pgstat_io.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
19 files changed, 305 insertions(+), 15 deletions(-)
create mode 100644 src/include/storage/io_worker.h
create mode 100644 src/backend/storage/aio/method_worker.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index d016a9c9248..c2b3e27c613 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -360,6 +360,7 @@ typedef enum BackendType
B_ARCHIVER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_IO_WORKER,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SUMMARIZER,
@@ -389,6 +390,7 @@ extern PGDLLIMPORT BackendType MyBackendType;
#define AmWalReceiverProcess() (MyBackendType == B_WAL_RECEIVER)
#define AmWalSummarizerProcess() (MyBackendType == B_WAL_SUMMARIZER)
#define AmWalWriterProcess() (MyBackendType == B_WAL_WRITER)
+#define AmIoWorkerProcess() (MyBackendType == B_IO_WORKER)
#define AmSpecialWorkerProcess() \
(AmAutoVacuumLauncherProcess() || \
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 188a06e2379..253dc98c50e 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -98,6 +98,7 @@ extern void InitProcessGlobals(void);
extern int MaxLivePostmasterChildren(void);
extern bool PostmasterMarkPIDForWorkerNotify(int);
+extern void assign_io_workers(int newval, void *extra);
#ifdef WIN32
extern void pgwin32_register_deadchild_callback(HANDLE procHandle, DWORD procId);
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
index 44151ef55bf..bc15b720fca 100644
--- a/src/include/storage/aio_init.h
+++ b/src/include/storage/aio_init.h
@@ -21,4 +21,6 @@ extern void AioShmemInit(void);
extern void pgaio_init_backend(void);
+extern bool pgaio_workers_enabled(void);
+
#endif /* AIO_INIT_H */
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
new file mode 100644
index 00000000000..223d614dc4a
--- /dev/null
+++ b/src/include/storage/io_worker.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_worker.h
+ * IO worker for implementing AIO "ourselves"
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_worker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_WORKER_H
+#define IO_WORKER_H
+
+
+extern void IoWorkerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
+
+extern int io_workers;
+
+#endif /* IO_WORKER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 20777f7d5ae..64e9b8ff8c5 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -448,7 +448,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 6
+#define MAX_IO_WORKERS 32
+#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
+
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index a97a1eda6da..54b4c22bd63 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -48,6 +48,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "storage/dsm.h"
+#include "storage/io_worker.h"
#include "storage/pg_shmem.h"
#include "tcop/backend_startup.h"
#include "utils/memutils.h"
@@ -197,6 +198,7 @@ static child_process_kind child_process_kinds[] = {
[B_ARCHIVER] = {"archiver", PgArchiverMain, true},
[B_BG_WRITER] = {"bgwriter", BackgroundWriterMain, true},
[B_CHECKPOINTER] = {"checkpointer", CheckpointerMain, true},
+ [B_IO_WORKER] = {"io_worker", IoWorkerMain, true},
[B_STARTUP] = {"startup", StartupProcessMain, true},
[B_WAL_RECEIVER] = {"wal_receiver", WalReceiverMain, true},
[B_WAL_SUMMARIZER] = {"wal_summarizer", WalSummarizerMain, true},
diff --git a/src/backend/postmaster/pmchild.c b/src/backend/postmaster/pmchild.c
index 0d473226c3a..cde1d23a4ca 100644
--- a/src/backend/postmaster/pmchild.c
+++ b/src/backend/postmaster/pmchild.c
@@ -101,6 +101,7 @@ InitPostmasterChildSlots(void)
pmchild_pools[B_AUTOVAC_WORKER].size = autovacuum_worker_slots;
pmchild_pools[B_BG_WORKER].size = max_worker_processes;
+ pmchild_pools[B_IO_WORKER].size = MAX_IO_WORKERS;
/*
* There can be only one of each of these running at a time. They each
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 115ad3d31d2..ddd82b94720 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -108,9 +108,12 @@
#include "replication/logicallauncher.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
+#include "storage/proc.h"
#include "tcop/backend_startup.h"
#include "tcop/tcopprot.h"
#include "utils/datetime.h"
@@ -334,6 +337,7 @@ typedef enum
* ckpt */
PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to
* finish */
+ PM_WAIT_IO_WORKERS, /* waiting for io workers to exit */
PM_WAIT_CHECKPOINTER, /* waiting for checkpointer to shut down */
PM_WAIT_DEAD_END, /* waiting for dead-end children to exit */
PM_NO_CHILDREN, /* all important children have exited */
@@ -396,6 +400,10 @@ bool LoadedSSL = false;
static DNSServiceRef bonjour_sdref = NULL;
#endif
+/* State for IO worker management. */
+static int io_worker_count = 0;
+static PMChild *io_worker_children[MAX_IO_WORKERS];
+
/*
* postmaster.c - function prototypes
*/
@@ -430,6 +438,8 @@ static void TerminateChildren(int signal);
static int CountChildren(BackendTypeMask targetMask);
static void LaunchMissingBackgroundProcesses(void);
static void maybe_start_bgworkers(void);
+static bool maybe_reap_io_worker(int pid);
+static void maybe_adjust_io_workers(void);
static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static PMChild *StartChildProcess(BackendType type);
static void StartSysLogger(void);
@@ -1357,6 +1367,11 @@ PostmasterMain(int argc, char *argv[])
*/
AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
+ UpdatePMState(PM_STARTUP);
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
/* Start bgwriter and checkpointer so they can help with recovery */
if (CheckpointerPMChild == NULL)
CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
@@ -1369,7 +1384,6 @@ PostmasterMain(int argc, char *argv[])
StartupPMChild = StartChildProcess(B_STARTUP);
Assert(StartupPMChild != NULL);
StartupStatus = STARTUP_RUNNING;
- UpdatePMState(PM_STARTUP);
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -2493,6 +2507,16 @@ process_pm_child_exit(void)
continue;
}
+ /* Was it an IO worker? */
+ if (maybe_reap_io_worker(pid))
+ {
+ if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ HandleChildCrash(pid, exitstatus, _("io worker"));
+
+ maybe_adjust_io_workers();
+ continue;
+ }
+
/*
* Was it a backend or a background worker?
*/
@@ -2704,6 +2728,7 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
case PM_WAIT_XLOG_SHUTDOWN:
case PM_WAIT_XLOG_ARCHIVAL:
case PM_WAIT_CHECKPOINTER:
+ case PM_WAIT_IO_WORKERS:
/*
* Note that we switch *back* to PM_WAIT_BACKENDS here. This way
@@ -2892,20 +2917,21 @@ PostmasterStateMachine(void)
/*
* If we are doing crash recovery or an immediate shutdown then we
- * expect archiver, checkpointer and walsender to exit as well,
- * otherwise not.
+ * expect archiver, checkpointer, io workers and walsender to exit as
+ * well, otherwise not.
*/
if (FatalError || Shutdown >= ImmediateShutdown)
targetMask = btmask_add(targetMask,
B_CHECKPOINTER,
B_ARCHIVER,
+ B_IO_WORKER,
B_WAL_SENDER);
/*
- * Normally walsenders and archiver will continue running; they will
- * be terminated later after writing the checkpoint record. We also
- * let dead-end children to keep running for now. The syslogger
- * process exits last.
+ * Normally archiver, checkpointer, IO workers and walsenders will
+ * continue running; they will be terminated later after writing the
+ * checkpoint record. We also let dead-end children to keep running
+ * for now. The syslogger process exits last.
*
* This assertion checks that we have covered all backend types,
* either by including them in targetMask, or by noting here that they
@@ -2920,12 +2946,13 @@ PostmasterStateMachine(void)
B_LOGGER);
/*
- * Archiver, checkpointer and walsender may or may not be in
- * targetMask already.
+ * Archiver, checkpointer, IO workers, and walsender may or may
+ * not be in targetMask already.
*/
remainMask = btmask_add(remainMask,
B_ARCHIVER,
B_CHECKPOINTER,
+ B_IO_WORKER,
B_WAL_SENDER);
/* these are not real postmaster children */
@@ -3020,11 +3047,25 @@ PostmasterStateMachine(void)
{
/*
* PM_WAIT_XLOG_ARCHIVAL state ends when there's no children other
- * than checkpointer and dead-end children left. There shouldn't be
- * any regular backends left by now anyway; what we're really waiting
- * for is for walsenders and archiver to exit.
+ * than checkpointer, io workers and dead-end children left. There
+ * shouldn't be any regular backends left by now anyway; what we're
+ * really waiting for is for walsenders and archiver to exit.
*/
- if (CountChildren(btmask_all_except(B_CHECKPOINTER, B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+ if (CountChildren(btmask_all_except(B_CHECKPOINTER, B_IO_WORKER,
+ B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+ {
+ UpdatePMState(PM_WAIT_IO_WORKERS);
+ SignalChildren(SIGUSR2, btmask(B_IO_WORKER));
+ }
+ }
+
+ if (pmState == PM_WAIT_IO_WORKERS)
+ {
+ /*
+ * PM_WAIT_IO_WORKERS state ends when only the checkpointer and
+ * dead-end children are left.
+ */
+ if (io_worker_count == 0)
{
UpdatePMState(PM_WAIT_CHECKPOINTER);
@@ -3151,10 +3192,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ UpdatePMState(PM_STARTUP);
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
StartupPMChild = StartChildProcess(B_STARTUP);
Assert(StartupPMChild != NULL);
StartupStatus = STARTUP_RUNNING;
- UpdatePMState(PM_STARTUP);
/* crash recovery started, reset SIGKILL flag */
AbortStartTime = 0;
@@ -3178,6 +3223,7 @@ pmstate_name(PMState state)
PM_TOSTR_CASE(PM_WAIT_BACKENDS);
PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
+ PM_TOSTR_CASE(PM_WAIT_IO_WORKERS);
PM_TOSTR_CASE(PM_WAIT_DEAD_END);
PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
PM_TOSTR_CASE(PM_NO_CHILDREN);
@@ -4093,6 +4139,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
case PM_WAIT_DEAD_END:
case PM_WAIT_XLOG_ARCHIVAL:
case PM_WAIT_XLOG_SHUTDOWN:
+ case PM_WAIT_IO_WORKERS:
case PM_WAIT_BACKENDS:
case PM_STOP_BACKENDS:
break;
@@ -4243,6 +4290,100 @@ maybe_start_bgworkers(void)
}
}
+static bool
+maybe_reap_io_worker(int pid)
+{
+ for (int id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_children[id] &&
+ io_worker_children[id]->pid == pid)
+ {
+ ReleasePostmasterChildSlot(io_worker_children[id]);
+
+ --io_worker_count;
+ io_worker_children[id] = NULL;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void
+maybe_adjust_io_workers(void)
+{
+ if (!pgaio_workers_enabled())
+ return;
+
+ /*
+ * If we're in the final stage of shutdown, then we're just waiting for all
+ * processes to exit.
+ */
+ if (pmState >= PM_WAIT_IO_WORKERS)
+ return;
+
+ /* Don't start new workers during an immediate shutdown either. */
+ if (Shutdown >= ImmediateShutdown)
+ return;
+
+ /*
+ * Don't start new workers if we're in the shutdown phase of a crash
+ * restart. But we *do* need to start if we're already starting up again.
+ */
+ if (FatalError && pmState >= PM_STOP_BACKENDS)
+ return;
+
+ Assert(pmState < PM_WAIT_IO_WORKERS);
+
+ /* Not enough running? */
+ while (io_worker_count < io_workers)
+ {
+ PMChild *child;
+ int id;
+
+ /* find unused entry in io_worker_children array */
+ for (id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_children[id] == NULL)
+ break;
+ }
+ if (id == MAX_IO_WORKERS)
+ elog(ERROR, "could not find a free IO worker ID");
+
+ /* Try to launch one. */
+ child = StartChildProcess(B_IO_WORKER);
+ if (child != NULL)
+ {
+ io_worker_children[id] = child;
+ ++io_worker_count;
+ }
+ else
+ break; /* XXX try again soon? */
+ }
+
+ /* Too many running? */
+ if (io_worker_count > io_workers)
+ {
+ /* ask the IO worker in the highest slot to exit */
+ for (int id = MAX_IO_WORKERS - 1; id >= 0; --id)
+ {
+ if (io_worker_children[id] != NULL)
+ {
+ kill(io_worker_children[id]->pid, SIGUSR2);
+ break;
+ }
+ }
+ }
+}
+
+void
+assign_io_workers(int newval, void *extra)
+{
+ io_workers = newval;
+ if (!IsUnderPostmaster && pmState > PM_INIT)
+ maybe_adjust_io_workers();
+}
+
+
/*
* When a backend asks to be notified about worker state changes, we
* set a flag in its backend entry. The background worker machinery needs
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 89f821ea7e1..f51c34a37f8 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -15,6 +15,7 @@ OBJS = \
aio_io.o \
aio_target.o \
method_sync.o \
+ method_worker.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 0e98cc0c8fb..233c144965b 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -221,3 +221,10 @@ pgaio_init_backend(void)
before_shmem_exit(pgaio_shutdown, 0);
}
+
+bool
+pgaio_workers_enabled(void)
+{
+ /* placeholder for future commit */
+ return false;
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 2c26089d52e..74f94c6e40b 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -7,5 +7,6 @@ backend_sources += files(
'aio_io.c',
'aio_target.c',
'method_sync.c',
+ 'method_worker.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
new file mode 100644
index 00000000000..1d79e7e85ef
--- /dev/null
+++ b/src/backend/storage/aio/method_worker.c
@@ -0,0 +1,86 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_worker.c
+ * AIO implementation using workers
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_worker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/auxprocess.h"
+#include "postmaster/interrupt.h"
+#include "storage/io_worker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "tcop/tcopprot.h"
+#include "utils/wait_event.h"
+
+
+int io_workers = 3;
+
+
+void
+IoWorkerMain(char *startup_data, size_t startup_data_len)
+{
+ sigjmp_buf local_sigjmp_buf;
+
+ MyBackendType = B_IO_WORKER;
+ AuxiliaryProcessMainCommon();
+
+ /* TODO review all signals */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, die); /* to allow manually triggering worker restart */
+
+ /*
+ * Ignore SIGTERM, will get explicit shutdown via SIGUSR2 later in the
+ * shutdown sequence, similar to checkpointer.
+ */
+ pqsignal(SIGTERM, SIG_IGN);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* see PostgresMain() */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ error_context_stack = NULL;
+ HOLD_INTERRUPTS();
+
+ /*
+ * We normally shouldn't get errors here. Need to do just enough error
+ * recovery so that we can mark the IO as failed and then exit.
+ */
+ LWLockReleaseAll();
+
+ /* TODO: recover from IO errors */
+
+ EmitErrorReport();
+ proc_exit(1);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ while (!ShutdownRequestPending)
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ proc_exit(0);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5655348a2e2..605c8950043 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3313,6 +3313,8 @@ ProcessInterrupts(void)
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating background worker \"%s\" due to administrator command",
MyBgworkerEntry->bgw_type)));
+ else if (AmIoWorkerProcess())
+ proc_exit(0);
else
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index bcf9e4b1487..b2151ab4ca3 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -241,6 +241,7 @@ pgstat_tracks_backend_bktype(BackendType bktype)
case B_WAL_SUMMARIZER:
case B_BG_WRITER:
case B_CHECKPOINTER:
+ case B_IO_WORKER:
case B_STARTUP:
return false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 6ff5d9e96a1..70518749142 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -365,6 +365,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_BG_WORKER:
case B_BG_WRITER:
case B_CHECKPOINTER:
+ case B_IO_WORKER:
case B_SLOTSYNC_WORKER:
case B_STANDALONE_BACKEND:
case B_STARTUP:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index b5d3dcbf1e9..e702aa7152a 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -57,6 +57,7 @@ BGWRITER_HIBERNATE "Waiting in background writer process, hibernating."
BGWRITER_MAIN "Waiting in main loop of background writer process."
CHECKPOINTER_MAIN "Waiting in main loop of checkpointer process."
CHECKPOINTER_SHUTDOWN "Waiting for checkpointer process to be terminated."
+IO_WORKER_MAIN "Waiting in main loop of IO worker process."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 0347fc11092..cbca090d2b0 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = gettext_noop("checkpointer");
break;
+ case B_IO_WORKER:
+ backendDesc = "io worker";
+ break;
case B_LOGGER:
backendDesc = gettext_noop("logger");
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index de524eccad5..8a83dcc820d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -74,6 +74,7 @@
#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
+#include "storage/io_worker.h"
#include "storage/large_object.h"
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
@@ -3233,6 +3234,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_workers",
+ PGC_SIGHUP,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO worker processes, for io_method=worker."),
+ NULL,
+ },
+ &io_workers,
+ 3, 1, MAX_IO_WORKERS,
+ NULL, assign_io_workers, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index fba0ad4b624..e68e112c72f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -848,6 +848,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#------------------------------------------------------------------------------
#io_method = sync # (change requires restart)
+#io_workers = 3 # 1-32;
#io_max_concurrency = 32 # Max number of IOs that may be in
# flight at the same time in one backend
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0013-aio-Add-worker-method.patch (text/x-diff; charset=us-ascii)
From 5bdabe467f82dc7cc7348d8698b0c10f7bbeb7b8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:34 -0500
Subject: [PATCH v2.3 13/30] aio: Add worker method
---
src/include/storage/aio.h | 5 +-
src/include/storage/aio_internal.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/backend/storage/aio/aio.c | 2 +
src/backend/storage/aio/aio_init.c | 12 +-
src/backend/storage/aio/method_worker.c | 394 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/tools/pgindent/typedefs.list | 3 +
9 files changed, 410 insertions(+), 11 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index ffd382593d0..39d7e4cff55 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -23,10 +23,11 @@
typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
+ IOMETHOD_WORKER,
} IoMethod;
-/* We'll default to synchronous execution. */
-#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+/* We'll default to worker based execution. */
+#define DEFAULT_IO_METHOD IOMETHOD_WORKER
/*
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 174d365f9c0..86d8d099c91 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -285,6 +285,7 @@ extern const char *pgaio_io_get_target_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern PGDLLIMPORT const IoMethodOps pgaio_sync_ops;
+extern PGDLLIMPORT const IoMethodOps pgaio_worker_ops;
extern PGDLLIMPORT const IoMethodOps *pgaio_method_ops;
extern PGDLLIMPORT PgAioCtl *pgaio_ctl;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..932024b1b0b 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, AioWorkerSubmissionQueue)
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index cefa888884c..6c264b61ca5 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -57,6 +57,7 @@ static void pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation);
/* Options for io_method. */
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
+ {"worker", IOMETHOD_WORKER, false},
{NULL, 0, false}
};
@@ -73,6 +74,7 @@ PgAioBackend *pgaio_my_backend;
static const IoMethodOps *const pgaio_method_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
+ [IOMETHOD_WORKER] = &pgaio_worker_ops,
};
/* callbacks for the configured io_method, set by assign_io_method */
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 233c144965b..76fcdf64670 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -18,6 +18,7 @@
#include "storage/aio.h"
#include "storage/aio_init.h"
#include "storage/aio_internal.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -39,6 +40,11 @@ AioCtlShmemSize(void)
static uint32
AioProcs(void)
{
+ /*
+ * While AIO workers don't need their own AIO context, we can't currently
+ * guarantee nothing gets assigned to a ProcNumber for an IO worker if
+ * we just subtracted MAX_IO_WORKERS.
+ */
return MaxBackends + NUM_AUXILIARY_PROCS;
}
@@ -211,6 +217,9 @@ pgaio_init_backend(void)
/* shouldn't be initialized twice */
Assert(!pgaio_my_backend);
+ if (MyBackendType == B_IO_WORKER)
+ return;
+
if (MyProc == NULL || MyProcNumber >= AioProcs())
elog(ERROR, "aio requires a normal PGPROC");
@@ -225,6 +234,5 @@ pgaio_init_backend(void)
bool
pgaio_workers_enabled(void)
{
- /* placeholder for future commit */
- return false;
+ return io_method == IOMETHOD_WORKER;
}
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 1d79e7e85ef..92415467c71 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -1,7 +1,22 @@
/*-------------------------------------------------------------------------
*
* method_worker.c
- * AIO implementation using workers
+ * AIO - perform AIO using worker processes
+ *
+ * Worker processes consume IOs from a shared memory submission queue, run
+ * traditional synchronous system calls, and perform the shared completion
+ * handling immediately. Client code submits most requests by pushing IOs
+ * into the submission queue, and waits (if necessary) using condition
+ * variables. Some IOs cannot be performed in another process due to lack of
+ * infrastructure for reopening the file, and must be processed synchronously by
+ * the client code when submitted.
+ *
+ * So that the submitter can make just one system call when submitting a batch
+ * of IOs, wakeups "fan out"; each woken backend can wake two more. XXX This
+ * could be improved by using futexes instead of latches to wake N waiters.
+ *
+ * This method of AIO is available in all builds on all operating systems, and
+ * is the default.
*
* Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -16,23 +31,323 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
+#include "port/pg_bitutils.h"
#include "postmaster/auxprocess.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "tcop/tcopprot.h"
+#include "utils/ps_status.h"
#include "utils/wait_event.h"
+/* How many workers should each worker wake up if needed? */
+#define IO_WORKER_WAKEUP_FANOUT 2
+
+
+typedef struct AioWorkerSubmissionQueue
+{
+ uint32 size;
+ uint32 mask;
+ uint32 head;
+ uint32 tail;
+ uint32 ios[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerSubmissionQueue;
+
+typedef struct AioWorkerSlot
+{
+ Latch *latch;
+ bool in_use;
+} AioWorkerSlot;
+
+typedef struct AioWorkerControl
+{
+ uint64 idle_worker_mask;
+ AioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerControl;
+
+
+static size_t pgaio_worker_shmem_size(void);
+static void pgaio_worker_shmem_init(bool first_time);
+
+static bool pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_worker_ops = {
+ .shmem_size = pgaio_worker_shmem_size,
+ .shmem_init = pgaio_worker_shmem_init,
+
+ .needs_synchronous_execution = pgaio_worker_needs_synchronous_execution,
+ .submit = pgaio_worker_submit,
+};
+
+
int io_workers = 3;
+static int io_worker_queue_size = 64;
+static int MyIoWorkerId;
+
+
+static AioWorkerSubmissionQueue *io_worker_submission_queue;
+static AioWorkerControl *io_worker_control;
+
+
+static size_t
+pgaio_worker_shmem_size(void)
+{
+ return
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * io_worker_queue_size +
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers;
+}
+
+static void
+pgaio_worker_shmem_init(bool first_time)
+{
+ bool found;
+ int size;
+
+ /* Round size up to next power of two so we can make a mask. */
+ size = pg_nextpower2_32(io_worker_queue_size);
+
+ io_worker_submission_queue =
+ ShmemInitStruct("AioWorkerSubmissionQueue",
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * size,
+ &found);
+ if (!found)
+ {
+ io_worker_submission_queue->size = size;
+ io_worker_submission_queue->head = 0;
+ io_worker_submission_queue->tail = 0;
+ }
+
+ io_worker_control =
+ ShmemInitStruct("AioWorkerControl",
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers,
+ &found);
+ if (!found)
+ {
+ io_worker_control->idle_worker_mask = 0;
+ for (int i = 0; i < io_workers; ++i)
+ {
+ io_worker_control->workers[i].latch = NULL;
+ io_worker_control->workers[i].in_use = false;
+ }
+ }
+}
+
+
+static int
+pgaio_choose_idle_worker(void)
+{
+ int worker;
+
+ if (io_worker_control->idle_worker_mask == 0)
+ return -1;
+
+ /* Find the lowest bit position, and clear it. */
+ worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
+
+ return worker;
+}
+
+static bool
+pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 new_head;
+
+ queue = io_worker_submission_queue;
+ new_head = (queue->head + 1) & (queue->size - 1);
+ if (new_head == queue->tail)
+ {
+ pgaio_debug(DEBUG1, "io queue is full, at %u elements",
+ io_worker_submission_queue->size);
+ return false; /* full */
+ }
+
+ queue->ios[queue->head] = pgaio_io_get_id(ioh);
+ queue->head = new_head;
+
+ return true;
+}
+
+static uint32
+pgaio_worker_submission_queue_consume(void)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 result;
+
+ queue = io_worker_submission_queue;
+ if (queue->tail == queue->head)
+ return UINT32_MAX; /* empty */
+
+ result = queue->ios[queue->tail];
+ queue->tail = (queue->tail + 1) & (queue->size - 1);
+
+ return result;
+}
+
+static uint32
+pgaio_worker_submission_queue_depth(void)
+{
+ uint32 head;
+ uint32 tail;
+
+ head = io_worker_submission_queue->head;
+ tail = io_worker_submission_queue->tail;
+
+ if (tail > head)
+ head += io_worker_submission_queue->size;
+
+ Assert(head >= tail);
+
+ return head - tail;
+}
+
+static void
+pgaio_worker_submit_internal(int nios, PgAioHandle *ios[])
+{
+ PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ int nsync = 0;
+ Latch *wakeup = NULL;
+ int worker;
+
+ Assert(nios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ for (int i = 0; i < nios; ++i)
+ {
+ Assert(!pgaio_worker_needs_synchronous_execution(ios[i]));
+ if (!pgaio_worker_submission_queue_insert(ios[i]))
+ {
+ /*
+ * We'll do it synchronously, but only after we've sent as many as
+ * we can to workers, to maximize concurrency.
+ */
+ synchronous_ios[nsync++] = ios[i];
+ continue;
+ }
+
+ if (wakeup == NULL)
+ {
+ /* Choose an idle worker to wake up if we haven't already. */
+ worker = pgaio_choose_idle_worker();
+ if (worker >= 0)
+ wakeup = io_worker_control->workers[worker].latch;
+
+ pgaio_debug_io(DEBUG4, ios[i],
+ "choosing worker %d",
+ worker);
+ }
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ if (wakeup)
+ SetLatch(wakeup);
+
+ /* Run whatever is left synchronously. */
+ if (nsync > 0)
+ {
+ for (int i = 0; i < nsync; ++i)
+ {
+ pgaio_io_perform_synchronously(synchronous_ios[i]);
+ }
+ }
+}
+
+static bool
+pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return
+ !IsUnderPostmaster
+ || ioh->flags & PGAIO_HF_REFERENCES_LOCAL
+ || !pgaio_io_can_reopen(ioh);
+}
+
+static int
+pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+
+ pgaio_io_prepare_submit(ioh);
+ }
+
+ pgaio_worker_submit_internal(num_staged_ios, staged_ios);
+
+ return num_staged_ios;
+}
+
+/*
+ * shmem_exit() callback that releases the worker's slot in io_worker_control.
+ */
+static void
+pgaio_worker_die(int code, Datum arg)
+{
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+}
+
+/*
+ * Register the worker in shared memory, assign MyIoWorkerId and register a
+ * shutdown callback to release registration.
+ */
+static void
+pgaio_worker_register(void)
+{
+ MyIoWorkerId = -1;
+
+ /*
+ * XXX: This could do with more fine-grained locking. But it's also not
+ * very common for the number of workers to change at the moment...
+ */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+
+ for (int i = 0; i < io_workers; ++i)
+ {
+ if (!io_worker_control->workers[i].in_use)
+ {
+ Assert(io_worker_control->workers[i].latch == NULL);
+ io_worker_control->workers[i].in_use = true;
+ MyIoWorkerId = i;
+ break;
+ }
+ else
+ Assert(io_worker_control->workers[i].latch != NULL);
+ }
+
+ if (MyIoWorkerId == -1)
+ elog(ERROR, "couldn't find a free worker slot");
+
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ on_shmem_exit(pgaio_worker_die, 0);
+}
void
IoWorkerMain(char *startup_data, size_t startup_data_len)
{
sigjmp_buf local_sigjmp_buf;
+ volatile PgAioHandle *ioh = NULL;
+ char cmd[128];
MyBackendType = B_IO_WORKER;
AuxiliaryProcessMainCommon();
@@ -53,6 +368,11 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ pgaio_worker_register();
+
+ sprintf(cmd, "io worker: %d", MyIoWorkerId);
+ set_ps_display(cmd);
+
/* see PostgresMain() */
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
@@ -65,9 +385,18 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
*/
LWLockReleaseAll();
- /* TODO: recover from IO errors */
+ /* FIXME: recover from IO errors */
+ if (ioh != NULL)
+ {
+#if 0
+ /* EINTR is treated as a retryable error */
+ pgaio_process_io_completion(unvolatize(PgAioInProgress *, io),
+ EINTR);
+#endif
+ }
EmitErrorReport();
+
proc_exit(1);
}
@@ -76,10 +405,63 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
while (!ShutdownRequestPending)
{
- WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
- WAIT_EVENT_IO_WORKER_MAIN);
- ResetLatch(MyLatch);
- CHECK_FOR_INTERRUPTS();
+ uint32 io_index;
+ Latch *latches[IO_WORKER_WAKEUP_FANOUT];
+ int nlatches = 0;
+ int nwakeups = 0;
+ int worker;
+
+ /* Try to get a job to do. */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
+ {
+ /* Nothing to do. Mark self idle. */
+ /*
+ * XXX: Invent some kind of back pressure to reduce useless
+ * wakeups?
+ */
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ }
+ else
+ {
+ /* Got one. Clear idle flag. */
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+
+ /* See if we can wake up some peers. */
+ nwakeups = Min(pgaio_worker_submission_queue_depth(),
+ IO_WORKER_WAKEUP_FANOUT);
+ for (int i = 0; i < nwakeups; ++i)
+ {
+ if ((worker = pgaio_choose_idle_worker()) < 0)
+ break;
+ latches[nlatches++] = io_worker_control->workers[worker].latch;
+ }
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ for (int i = 0; i < nlatches; ++i)
+ SetLatch(latches[i]);
+
+ if (io_index != UINT32_MAX)
+ {
+ ioh = &pgaio_ctl->io_handles[io_index];
+
+ pgaio_debug_io(DEBUG4, unvolatize(PgAioHandle *, ioh),
+ "worker %d processing IO",
+ MyIoWorkerId);
+
+ pgaio_io_reopen(unvolatize(PgAioHandle *, ioh));
+ pgaio_io_perform_synchronously(unvolatize(PgAioHandle *, ioh));
+
+ ioh = NULL;
+ }
+ else
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
}
proc_exit(0);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e702aa7152a..05751417482 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -350,6 +350,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e68e112c72f..5005e65cee0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -847,7 +847,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
# WIP AIO GUC docs
#------------------------------------------------------------------------------
-#io_method = sync # (change requires restart)
+#io_method = worker # (change requires restart)
#io_workers = 3 # 1-32;
#io_max_concurrency = 32 # Max number of IOs that may be in
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7f34559b1b..1e7bbeff1b6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -55,6 +55,9 @@ AggStrategy
AggTransInfo
Aggref
AggregateInstrumentation
+AioWorkerControl
+AioWorkerSlot
+AioWorkerSubmissionQueue
AlenState
Alias
AllocBlock
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0014-aio-Add-liburing-dependency.patch (text/x-diff; charset=us-ascii)
From cbd5bc8e99f0d80fa37f5065c893751f238c26da Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 5 Jun 2024 19:37:25 -0700
Subject: [PATCH v2.3 14/30] aio: Add liburing dependency
Not yet used.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
meson.build | 14 ++++
meson_options.txt | 3 +
configure.ac | 11 +++
src/makefiles/meson.build | 3 +
src/include/pg_config.h.in | 3 +
src/backend/Makefile | 7 +-
configure | 138 +++++++++++++++++++++++++++++++++++++
.cirrus.tasks.yml | 1 +
src/Makefile.global.in | 4 ++
9 files changed, 181 insertions(+), 3 deletions(-)
diff --git a/meson.build b/meson.build
index 32fc89f3a4b..2bca586e5f3 100644
--- a/meson.build
+++ b/meson.build
@@ -854,6 +854,18 @@ endif
+###############################################################
+# Library: liburing
+###############################################################
+
+liburingopt = get_option('liburing')
+liburing = dependency('liburing', required: liburingopt)
+if liburing.found()
+ cdata.set('USE_LIBURING', 1)
+endif
+
+
+
###############################################################
# Library: libxml
###############################################################
@@ -3058,6 +3070,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ liburing,
libxml,
lz4,
pam,
@@ -3702,6 +3715,7 @@ if meson.version().version_compare('>=0.57')
'gss': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index d9c7ddccbc4..abe8600ec35 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -103,6 +103,9 @@ option('ldap', type: 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('liburing', type : 'feature', value: 'auto',
+ description: 'Use liburing for async io')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/configure.ac b/configure.ac
index d713360f340..00d6c366ecd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -975,6 +975,14 @@ AC_SUBST(with_readline)
PGAC_ARG_BOOL(with, libedit-preferred, no,
[prefer BSD Libedit over GNU Readline])
+#
+# liburing
+#
+AC_MSG_CHECKING([whether to build with liburing support])
+PGAC_ARG_BOOL(with, liburing, no, [use liburing for async io],
+ [AC_DEFINE([USE_LIBURING], 1, [Define to build with io-uring support. (--with-liburing)])])
+AC_MSG_RESULT([$with_liburing])
+AC_SUBST(with_liburing)
#
# UUID library
@@ -1427,6 +1435,9 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+if test "$with_liburing" = yes; then
+ PKG_CHECK_MODULES(LIBURING, liburing)
+fi
##
## Header files
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index d49b2079a44..714b7ccaa4e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -229,6 +231,7 @@ pgxs_deps = {
'gssapi': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798abd..6ab71a3dffe 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -663,6 +663,9 @@
/* Define to 1 to build with LDAP support. (--with-ldap) */
#undef USE_LDAP
+/* Define to build with io-uring support. (--with-liburing) */
+#undef USE_LIBURING
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5aa..7344c8c7f5c 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -43,9 +43,10 @@ OBJS = \
$(top_builddir)/src/common/libpgcommon_srv.a \
$(top_builddir)/src/port/libpgport_srv.a
-# We put libpgport and libpgcommon into OBJS, so remove it from LIBS; also add
-# libldap and ICU
-LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS)
+# We put libpgport and libpgcommon into OBJS, so remove it from LIBS.
+LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS))
+# The backend conditionally needs libraries that most executables don't need.
+LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS)
# The backend doesn't need everything that's in LIBS, however
LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
diff --git a/configure b/configure
index ceeef9b0915..e477baedfb6 100755
--- a/configure
+++ b/configure
@@ -651,6 +651,8 @@ LIBOBJS
OPENSSL
ZSTD
LZ4
+LIBURING_LIBS
+LIBURING_CFLAGS
UUID_LIBS
LDAP_LIBS_BE
LDAP_LIBS_FE
@@ -709,6 +711,7 @@ XML2_CFLAGS
XML2_CONFIG
with_libxml
with_uuid
+with_liburing
with_readline
with_systemd
with_selinux
@@ -862,6 +865,7 @@ with_selinux
with_systemd
with_readline
with_libedit_preferred
+with_liburing
with_uuid
with_ossp_uuid
with_libxml
@@ -905,6 +909,8 @@ LDFLAGS_EX
LDFLAGS_SL
PERL
PYTHON
+LIBURING_CFLAGS
+LIBURING_LIBS
MSGFMT
TCLSH'
@@ -1572,6 +1578,7 @@ Optional Packages:
--without-readline do not use GNU Readline nor BSD Libedit for editing
--with-libedit-preferred
prefer BSD Libedit over GNU Readline
+ --with-liburing use liburing for async io
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libxml build with XML support
@@ -1618,6 +1625,10 @@ Some influential environment variables:
LDFLAGS_SL extra linker flags for linking shared libraries only
PERL Perl program
PYTHON Python program
+ LIBURING_CFLAGS
+ C compiler flags for LIBURING, overriding pkg-config
+ LIBURING_LIBS
+ linker flags for LIBURING, overriding pkg-config
MSGFMT msgfmt program for NLS
TCLSH Tcl interpreter program (tclsh)
@@ -8681,6 +8692,40 @@ fi
+#
+# liburing
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with liburing support" >&5
+$as_echo_n "checking whether to build with liburing support... " >&6; }
+
+
+
+# Check whether --with-liburing was given.
+if test "${with_liburing+set}" = set; then :
+ withval=$with_liburing;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBURING 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-liburing option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_liburing=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_liburing" >&5
+$as_echo "$with_liburing" >&6; }
+
#
# UUID library
@@ -13231,6 +13276,99 @@ fi
fi
+if test "$with_liburing" = yes; then
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for liburing" >&5
+$as_echo_n "checking for liburing... " >&6; }
+
+if test -n "$LIBURING_CFLAGS"; then
+ pkg_cv_LIBURING_CFLAGS="$LIBURING_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_CFLAGS=`$PKG_CONFIG --cflags "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBURING_LIBS"; then
+ pkg_cv_LIBURING_LIBS="$LIBURING_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_LIBS=`$PKG_CONFIG --libs "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "liburing" 2>&1`
+ else
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "liburing" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBURING_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (liburing) were not met:
+
+$LIBURING_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBURING_CFLAGS=$pkg_cv_LIBURING_CFLAGS
+ LIBURING_LIBS=$pkg_cv_LIBURING_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
##
## Header files
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 18e944ca89d..67d3d77fb10 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -334,6 +334,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
\
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 1278b7744f4..8ad259a54cd 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -190,6 +190,7 @@ with_systemd = @with_systemd@
with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
+with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
@@ -216,6 +217,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBURING_CFLAGS = @LIBURING_CFLAGS@
+LIBURING_LIBS = @LIBURING_LIBS@
+
TCLSH = @TCLSH@
TCL_LIBS = @TCL_LIBS@
TCL_LIB_SPEC = @TCL_LIB_SPEC@
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0015-aio-Add-io_uring-method.patch (text/x-diff; charset=us-ascii)
From 8729492fc8eb698851442f0165cb12948c4db4f4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:36 -0500
Subject: [PATCH v2.3 15/30] aio: Add io_uring method
---
src/include/storage/aio.h | 3 +
src/include/storage/aio_internal.h | 3 +
src/include/storage/lwlock.h | 1 +
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio.c | 6 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_io_uring.c | 382 ++++++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 1 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 399 insertions(+)
create mode 100644 src/backend/storage/aio/method_io_uring.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 39d7e4cff55..8c1b9a1b496 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -24,6 +24,9 @@ typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
IOMETHOD_WORKER,
+#ifdef USE_LIBURING
+ IOMETHOD_IO_URING,
+#endif
} IoMethod;
/* We'll default to worker based execution. */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 86d8d099c91..eff544ce621 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -286,6 +286,9 @@ extern const char *pgaio_io_get_target_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern PGDLLIMPORT const IoMethodOps pgaio_sync_ops;
extern PGDLLIMPORT const IoMethodOps pgaio_worker_ops;
+#ifdef USE_LIBURING
+extern PGDLLIMPORT const IoMethodOps pgaio_uring_ops;
+#endif
extern PGDLLIMPORT const IoMethodOps *pgaio_method_ops;
extern PGDLLIMPORT PgAioCtl *pgaio_ctl;
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 13a7dc89980..043e8bae7a9 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -217,6 +217,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SUBTRANS_SLRU,
LWTRANCHE_XACT_SLRU,
LWTRANCHE_PARALLEL_VACUUM_DSA,
+ LWTRANCHE_AIO_URING_COMPLETION,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index f51c34a37f8..c06c50771e0 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -14,6 +14,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_target.o \
+ method_io_uring.o \
method_sync.o \
method_worker.o \
read_stream.o
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 6c264b61ca5..c1dd073e37f 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -58,6 +58,9 @@ static void pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation);
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
{"worker", IOMETHOD_WORKER, false},
+#ifdef USE_LIBURING
+ {"io_uring", IOMETHOD_IO_URING, false},
+#endif
{NULL, 0, false}
};
@@ -75,6 +78,9 @@ PgAioBackend *pgaio_my_backend;
static const IoMethodOps *const pgaio_method_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
[IOMETHOD_WORKER] = &pgaio_worker_ops,
+#ifdef USE_LIBURING
+ [IOMETHOD_IO_URING] = &pgaio_uring_ops,
+#endif
};
/* callbacks for the configured io_method, set by assign_io_method */
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 74f94c6e40b..2f0f03d8071 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -6,6 +6,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_target.c',
+ 'method_io_uring.c',
'method_sync.c',
'method_worker.c',
'read_stream.c',
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
new file mode 100644
index 00000000000..da92795fce7
--- /dev/null
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -0,0 +1,382 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_io_uring.c
+ * AIO - perform AIO using Linux' io_uring
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_io_uring.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#ifdef USE_LIBURING
+
+#include <liburing.h>
+
+#include "pgstat.h"
+#include "port/pg_iovec.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+
+/* Entry points for IoMethodOps. */
+static size_t pgaio_uring_shmem_size(void);
+static void pgaio_uring_shmem_init(bool first_time);
+static void pgaio_uring_init_backend(void);
+
+static int pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+static void pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation);
+
+static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe);
+
+
+const IoMethodOps pgaio_uring_ops = {
+ .shmem_size = pgaio_uring_shmem_size,
+ .shmem_init = pgaio_uring_shmem_init,
+ .init_backend = pgaio_uring_init_backend,
+
+ .submit = pgaio_uring_submit,
+ .wait_one = pgaio_uring_wait_one,
+};
+
+typedef struct PgAioUringContext
+{
+ LWLock completion_lock;
+
+ struct io_uring io_uring_ring;
+ /* XXX: probably worth padding to a cacheline boundary here */
+} PgAioUringContext;
+
+
+static PgAioUringContext *pgaio_uring_contexts;
+static PgAioUringContext *pgaio_my_uring_context;
+
+/* io_uring local state */
+static struct io_uring local_ring;
+
+
+
+static Size
+pgaio_uring_context_shmem_size(void)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ return mul_size(TotalProcs, sizeof(PgAioUringContext));
+}
+
+static size_t
+pgaio_uring_shmem_size(void)
+{
+ return pgaio_uring_context_shmem_size();
+}
+
+static void
+pgaio_uring_shmem_init(bool first_time)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+ bool found;
+
+ pgaio_uring_contexts = (PgAioUringContext *)
+ ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+ if (found)
+ return;
+
+ for (int contextno = 0; contextno < TotalProcs; contextno++)
+ {
+ PgAioUringContext *context = &pgaio_uring_contexts[contextno];
+ int ret;
+
+ /*
+ * XXX: Probably worth sharing the WQ between the different rings,
+ * when supported by the kernel. Could also cause additional
+ * contention, I guess?
+ */
+#if 0
+ if (!AcquireExternalFD())
+ elog(ERROR, "No external FD available");
+#endif
+ ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+
+ LWLockInitialize(&context->completion_lock, LWTRANCHE_AIO_URING_COMPLETION);
+ }
+}
+
+static void
+pgaio_uring_init_backend(void)
+{
+ int ret;
+
+ pgaio_my_uring_context = &pgaio_uring_contexts[MyProcNumber];
+
+ ret = io_uring_queue_init(32, &local_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+}
+
+static int
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ struct io_uring *uring_instance = &pgaio_my_uring_context->io_uring_ring;
+ int in_flight_before = dclist_count(&pgaio_my_backend->in_flight_ios);
+
+ Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(uring_instance);
+
+ if (!sqe)
+ elog(ERROR, "io_uring submission queue is unexpectedly full");
+
+ pgaio_io_prepare_submit(ioh);
+ pgaio_uring_sq_from_io(ioh, sqe);
+
+ /*
+ * io_uring executes IO in process context if possible. That's
+ * generally good, as it reduces context switching. When performing a
+ * lot of buffered IO that means that copying between page cache and
+ * userspace memory happens in the foreground, as it can't be
+ * offloaded to DMA hardware as is possible when using direct IO. When
+ * executing a lot of buffered IO this causes io_uring to be slower
+ * than worker mode, as worker mode parallelizes the copying. io_uring
+ * can be told to offload work to worker threads instead.
+ *
+ * If an IO is buffered IO and we already have IOs in flight or
+ * multiple IOs are being submitted, we thus tell io_uring to execute
+ * the IO in the background. We don't do so for the first few IOs
+ * being submitted as executing in this process' context has lower
+ * latency.
+ */
+ if (in_flight_before > 4 && (ioh->flags & PGAIO_HF_BUFFERED))
+ io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
+
+ in_flight_before++;
+ }
+
+ while (true)
+ {
+ int ret;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+ ret = io_uring_submit(uring_instance);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ pgaio_debug(DEBUG3,
+ "aio method uring: submit EINTR, nios: %d",
+ num_staged_ios);
+ continue;
+ }
+ if (ret < 0)
+			elog(PANIC, "io_uring submit failed: %d/%s",
+				 ret, strerror(-ret));
+ else if (ret != num_staged_ios)
+ {
+ /* likely unreachable, but if it is, we would need to re-submit */
+			elog(PANIC, "io_uring submitted only %d of %d IOs",
+				 ret, num_staged_ios);
+ }
+ else
+ {
+ pgaio_debug(DEBUG4,
+ "aio method uring: submitted %d IOs",
+ num_staged_ios);
+ }
+ break;
+ }
+
+ return num_staged_ios;
+}
+
+
+#define PGAIO_MAX_LOCAL_COMPLETED_IO 32
+
+static void
+pgaio_uring_drain_locked(PgAioUringContext *context)
+{
+ int ready;
+ int orig_ready;
+
+ /*
+ * Don't drain more events than available right now. Otherwise it's
+ * plausible that one backend could get stuck, for a while, receiving CQEs
+ * without actually processing them.
+ */
+ orig_ready = ready = io_uring_cq_ready(&context->io_uring_ring);
+
+ while (ready > 0)
+ {
+ struct io_uring_cqe *cqes[PGAIO_MAX_LOCAL_COMPLETED_IO];
+ uint32 ncqes;
+
+ START_CRIT_SECTION();
+ ncqes =
+ io_uring_peek_batch_cqe(&context->io_uring_ring,
+ cqes,
+ Min(PGAIO_MAX_LOCAL_COMPLETED_IO, ready));
+ Assert(ncqes <= ready);
+
+ ready -= ncqes;
+
+ for (int i = 0; i < ncqes; i++)
+ {
+ struct io_uring_cqe *cqe = cqes[i];
+ PgAioHandle *ioh;
+
+ ioh = io_uring_cqe_get_data(cqe);
+ io_uring_cqe_seen(&context->io_uring_ring, cqe);
+
+ pgaio_io_process_completion(ioh, cqe->res);
+ }
+
+ END_CRIT_SECTION();
+
+ pgaio_debug(DEBUG3,
+ "drained %d/%d, now expecting %d",
+ ncqes, orig_ready, io_uring_cq_ready(&context->io_uring_ring));
+ }
+}
+
+static void
+pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation)
+{
+ PgAioHandleState state;
+ ProcNumber owner_procno = ioh->owner_procno;
+ PgAioUringContext *owner_context = &pgaio_uring_contexts[owner_procno];
+ bool expect_cqe;
+ int waited = 0;
+
+ /*
+ * We ought to have a smarter locking scheme, nearly all the time the
+ * backend owning the ring will consume the completions, making the
+ * locking unnecessarily expensive.
+ */
+ LWLockAcquire(&owner_context->completion_lock, LW_EXCLUSIVE);
+
+ while (true)
+ {
+ pgaio_debug_io(DEBUG3, ioh,
+ "wait_one io_gen: %llu, ref_gen: %llu, cycle %d",
+					   (long long unsigned) ioh->generation,
+					   (long long unsigned) ref_generation,
+ waited);
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state) ||
+ state != PGAIO_HS_SUBMITTED)
+ {
+ break;
+ }
+ else if (io_uring_cq_ready(&owner_context->io_uring_ring))
+ {
+ expect_cqe = true;
+ }
+ else
+ {
+ int ret;
+ struct io_uring_cqe *cqes;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_DRAIN);
+ ret = io_uring_wait_cqes(&owner_context->io_uring_ring, &cqes, 1, NULL, NULL);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ continue;
+ }
+ else if (ret != 0)
+ {
+				elog(PANIC, "io_uring wait_cqes failed: %d/%s",
+					 ret, strerror(-ret));
+ }
+ else
+ {
+ Assert(cqes != NULL);
+ expect_cqe = true;
+ waited++;
+ }
+ }
+
+ if (expect_cqe)
+ {
+ pgaio_uring_drain_locked(owner_context);
+ }
+ }
+
+ LWLockRelease(&owner_context->completion_lock);
+
+ pgaio_debug(DEBUG3,
+ "wait_one with %d sleeps",
+ waited);
+}
+
+static void
+pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
+{
+ struct iovec *iov;
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READV:
+ iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.read.iov_length == 1)
+ {
+ io_uring_prep_read(sqe,
+ ioh->op_data.read.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.read.offset);
+ }
+ else
+ {
+ io_uring_prep_readv(sqe,
+ ioh->op_data.read.fd,
+ iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+
+ }
+ break;
+
+ case PGAIO_OP_WRITEV:
+ iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.write.iov_length == 1)
+ {
+ io_uring_prep_write(sqe,
+ ioh->op_data.write.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.write.offset);
+ }
+ else
+ {
+ io_uring_prep_writev(sqe,
+ ioh->op_data.write.fd,
+ iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ }
+ break;
+
+ case PGAIO_OP_INVALID:
+ elog(ERROR, "trying to prepare invalid IO operation for execution");
+ }
+
+ io_uring_sqe_set_data(sqe, ioh);
+}
+
+#endif /* USE_LIBURING */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index c3d6f886e3c..dbc169c8541 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -166,6 +166,7 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
[LWTRANCHE_XACT_SLRU] = "XactSLRU",
[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+ [LWTRANCHE_AIO_URING_COMPLETION] = "AioUringCompletion",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1e7bbeff1b6..be2dd22f1d7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2128,6 +2128,7 @@ PgAioReturn
PgAioTargetData
PgAioTargetID
PgAioTargetInfo
+PgAioUringContext
PgAioWaitRef
PgArchData
PgBackendGSSStatus
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0016-aio-Add-README.md-explaining-higher-level-desig.patch
From 0a201985c794113e4cf062e8f5037fb7ab03c1ea Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:37 -0500
Subject: [PATCH v2.3 16/30] aio: Add README.md explaining higher level design
---
src/backend/storage/aio/README.md | 430 ++++++++++++++++++++++++++++++
src/backend/storage/aio/aio.c | 2 +
2 files changed, 432 insertions(+)
create mode 100644 src/backend/storage/aio/README.md
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
new file mode 100644
index 00000000000..1b6f9d2c40b
--- /dev/null
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,430 @@
+# Asynchronous & Direct IO
+
+## Motivation
+
+### Why Asynchronous IO
+
+Until the introduction of asynchronous IO, Postgres relied on the operating
+system to hide the cost of synchronous IO. While this worked surprisingly
+well in a lot of workloads, it does not do as good a job of prefetching and
+controlled writeback as we would like.
+
+There are important expensive operations like `fdatasync()` where the operating
+system cannot hide the storage latency. This is particularly important for WAL
+writes, where the ability to asynchronously issue `fdatasync()` or O_DSYNC
+writes can yield significantly higher throughput.
+
+
+### Why Direct / unbuffered IO
+
+The main reasons to use Direct IO are:
+
+- Lower CPU usage / higher throughput. Particularly on modern storage,
+  buffered writes are bottlenecked by the operating system having to copy data
+  from the kernel's page cache to the postgres buffer pool using the CPU,
+  whereas direct IO can often move the data directly between the storage
+  device and postgres' buffer pool using DMA. While that transfer is ongoing,
+  the CPU is free to perform other work.
+- Reduced latency - Direct IO can have substantially lower latency than
+ buffered IO, which can be impactful for OLTP workloads bottlenecked by WAL
+ write latency.
+- Avoiding double buffering between operating system cache and postgres'
+ shared_buffers.
+- Better control over the timing and pace of dirty data writeback.
+
+
+The main reasons *not* to use Direct IO are:
+
+- Without AIO, Direct IO is unusably slow for most purposes.
+- Even with AIO, many parts of postgres need to be modified to perform
+ explicit prefetching.
+- In situations where shared_buffers cannot be set appropriately large,
+ e.g. because there are many different postgres instances hosted on shared
+ hardware, performance will often be worse than when using buffered IO.
+
+
+## AIO Usage Example
+
+In many cases code that can benefit from AIO does not directly have to
+interact with the AIO interface, but can use AIO via higher-level
+abstractions. See [Helpers](#helpers).
+
+In this example, a buffer will be read into shared buffers.
+
+```C
+/*
+ * Result of the operation, only to be accessed in this backend.
+ */
+PgAioReturn ioret;
+
+/*
+ * Acquire an AIO Handle, ioret will get result upon completion.
+ *
+ * Note that ioret needs to stay alive until the IO completes or
+ * CurrentResourceOwner is released (i.e. an error is thrown).
+ */
+PgAioHandle *ioh = pgaio_io_acquire(CurrentResourceOwner, &ioret);
+
+/*
+ * Reference that can be used to wait for the IO we initiate below. This
+ * reference can reside in local or shared memory and waited upon by any
+ * process. An arbitrary number of references can be made for each IO.
+ */
+PgAioWaitRef iow;
+
+pgaio_io_get_wref(ioh, &iow);
+
+/*
+ * Arrange for shared buffer completion callbacks to be called upon completion
+ * of the IO. This callback will update the buffer descriptors associated with
+ * the AioHandle, which e.g. allows other backends to access the buffer.
+ *
+ * Multiple completion callbacks can be registered for each handle.
+ */
+pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV);
+
+/*
+ * The completion callback needs to know which buffers to update when the IO
+ * completes. As the AIO subsystem does not know about buffers, we have to
+ * associate this information with the AioHandle, for use by the completion
+ * callback registered above.
+ *
+ * In this example we're reading only a single buffer, hence the 1.
+ */
+pgaio_io_set_handle_data_32(ioh, (uint32 *) buffer, 1);
+
+/*
+ * Pass the AIO handle to lower-level function. When operating on the level of
+ * buffers, we don't know how exactly the IO is performed, that is the
+ * responsibility of the storage manager implementation.
+ *
+ * E.g. md.c needs to translate block numbers into offsets in segments.
+ *
+ * Once the IO handle has been handed off to smgrstartreadv(), it may not
+ * be used any further, as the IO may immediately get executed below
+ * smgrstartreadv() and the handle reused for another IO.
+ */
+smgrstartreadv(ioh, operation->smgr, forknum, blkno,
+ BufferGetBlock(buffer), 1);
+
+/*
+ * As mentioned above, the IO might be initiated within smgrstartreadv(). That
+ * is however not guaranteed, to allow IO submission to be batched.
+ *
+ * Note that one needs to be careful while there may be unsubmitted IOs, as
+ * another backend may need to wait for one of the unsubmitted IOs. If this
+ * backend were to wait for the other backend, we'd have a deadlock. To avoid
+ * that, pending IOs need to be explicitly submitted before this backend
+ * might be blocked by a backend waiting for IO.
+ *
+ * Note that the IO might have immediately been submitted (e.g. due to reaching
+ * a limit on the number of unsubmitted IOs) and even completed during the
+ * smgrstartreadv() above.
+ *
+ * Once submitted, the IO is in-flight and can complete at any time.
+ *
+ * TODO: rename to kick as suggested by Heikki?
+ */
+pgaio_submit_staged();
+
+/*
+ * To benefit from AIO, it is beneficial to perform other work, including
+ * submitting other IOs, before waiting for the IO to complete. Otherwise
+ * we could just have used synchronous, blocking IO.
+ */
+perform_other_work();
+
+/*
+ * We did some other work and now need the IO operation to have completed to
+ * continue.
+ */
+pgaio_wref_wait(&iow);
+
+/*
+ * At this point the IO has completed. We do not yet know whether it succeeded
+ * or failed, however. The buffer's state has been updated, which allows other
+ * backends to use the buffer (if the IO succeeded), or retry the IO (if it
+ * failed).
+ *
+ * Note that in case the IO has failed, a LOG message may have been emitted,
+ * but no ERROR has been raised. This is crucial, as another backend waiting
+ * for this IO should not see an ERROR.
+ *
+ * To check whether the operation succeeded, and to raise an ERROR (or, where
+ * more appropriate, a LOG message), the PgAioReturn we passed to
+ * pgaio_io_acquire() is used.
+ */
+if (ioret.result.status == ARS_ERROR)
+    pgaio_result_report(ioret.result, &ioret.target_data, ERROR);
+
+/*
+ * Besides having succeeded completely, the IO could also have partially
+ * completed. If we e.g. tried to read many blocks at once, the read might have
+ * only succeeded for the first few blocks.
+ *
+ * If the IO partially succeeded and this backend needs all blocks to have
+ * completed, this backend needs to reissue the IO for the remaining buffers.
+ * The AIO subsystem cannot handle this retry transparently.
+ *
+ * As this example is already long, and we only read a single block, we'll just
+ * error out if there's a partial read.
+ */
+if (ioret.result.status == ARS_PARTIAL)
+    pgaio_result_report(ioret.result, &ioret.target_data, ERROR);
+
+/*
+ * The IO succeeded, so we can use the buffer now.
+ */
+```
+
+
+## Design Criteria & Motivation
+
+### Deadlock and Starvation Dangers due to AIO
+
+Using AIO in a naive way can easily lead to deadlocks in an environment where
+the source/target of AIO are shared resources, like pages in postgres'
+shared_buffers.
+
+Consider one backend performing readahead on a table, initiating IO for a
+number of buffers ahead of the current "scan position". If that backend then
+performs some operation that blocks, or even just is slow, the IO completion
+for the asynchronously initiated read may not be processed.
+
+This AIO implementation solves this problem by requiring that AIO methods
+either allow AIO completions to be processed by any backend in the system
+(e.g. io_uring), or guarantee that AIO processing will happen even when the
+issuing backend is blocked (e.g. worker mode, which offloads completion
+processing to the AIO workers).
+
+
+### IO can be started in critical sections
+
+Using AIO for WAL writes can reduce the overhead of WAL logging substantially:
+
+- AIO allows WAL writes to be started eagerly, so they complete before the
+  backend needs to wait
+- AIO allows multiple WAL flushes to be in progress at the same time
+- AIO makes it more realistic to use O\_DIRECT + O\_DSYNC, which can reduce
+  the number of roundtrips to storage on some OSs and storage HW (buffered IO
+  and direct IO without O_DSYNC need to issue a write and, after the write's
+  completion, a cache flush, whereas O\_DIRECT + O\_DSYNC can use a
+  single FUA write).
+
+The need to be able to execute IO in critical sections has substantial design
+implication on the AIO subsystem. Mainly because completing IOs (see prior
+section) needs to be possible within a critical section, even if the
+to-be-completed IO itself was not issued in a critical section. Consider
+e.g. the case of a backend first starting a number of writes from shared
+buffers and then starting to flush the WAL. Because only a limited amount of
+IO can be in-progress at the same time, initiating IO for flushing the WAL may
+require to first complete IO that was started earlier.
+
+
+### State for AIO needs to live in shared memory
+
+Because postgres uses a process model and because AIOs need to be
+complete-able by any backend, much of the state of the AIO subsystem needs to
+live in shared memory.
+
+In an `EXEC_BACKEND` build, a backend's executable code and other process
+local state are not necessarily mapped to the same addresses in each process
+due to ASLR. This means that shared memory cannot contain pointers to
+callbacks.
+
+
+## Design of the AIO Subsystem
+
+
+### AIO Methods
+
+To achieve portability and performance, multiple methods of performing AIO are
+implemented and others are likely worth adding in the future.
+
+
+#### Synchronous Mode
+
+`io_method=sync` does not actually perform AIO, but allows the AIO API to be
+used while performing synchronous IO. This can be useful for debugging. The
+code for the synchronous mode is also used as a fallback, e.g. by
+[worker mode](#worker) to execute IO that cannot be executed by workers.
+
+
+#### Worker
+
+`io_method=worker` is available on every platform postgres runs on, and
+implements asynchronous IO - from the view of the issuing process - by
+dispatching the IO to one of several worker processes performing the IO in a
+synchronous manner.
+
+
+#### io_uring
+
+`io_method=io_uring` is available on Linux 5.1+. In contrast to worker mode it
+dispatches all IO from within the process, lowering context switch rate /
+latency.
+
+
+### AIO Handles
+
+The central API piece of postgres' AIO abstraction is the AIO handle. To
+execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`)
+and then "define" it, i.e. associate an IO operation with the handle.
+
+Often AIO handles are acquired on a higher level and then passed to a lower
+level to be fully defined. E.g., for IO to/from shared buffers, bufmgr.c
+routines acquire the handle, which is then passed through smgr.c, md.c to be
+finally fully defined in fd.c.
+
+The functions used at the lowest level to define the operation are
+`pgaio_io_prep_*()`.
+
+Because acquisition of an IO handle
+[must always succeed](#io-can-be-started-in-critical-sections)
+and the number of AIO Handles
+[has to be limited](#state-for-aio-needs-to-live-in-shared-memory)
+AIO handles can be reused as soon as they have completed. Obviously code needs
+to be able to react to IO completion. Shared state can be updated using
+[AIO Completion callbacks](#aio-callbacks)
+and the issuing backend can provide a backend local variable to receive the
+result of the IO, as described in
+[AIO Results](#aio-results).
+An IO can be waited for, by both the issuing and any other backend, using
+[AIO Wait References](#aio-wait-references).
+
+
+Because an AIO Handle is not executable immediately after calling
+`pgaio_io_acquire()`, and because `pgaio_io_acquire()` needs to be able to
+succeed, each backend may acquire only a single AIO Handle (i.e. have it
+returned by `pgaio_io_acquire()`) without having caused the IO to be defined
+(by, potentially indirectly, causing `pgaio_io_prep_*()` to be called).
+Otherwise a backend could trivially self-deadlock by using up all AIO Handles
+without the ability to wait for some of the IOs to complete.
+
+If it turns out that an AIO Handle is not needed, e.g., because the handle was
+acquired before holding a contended lock, it can be released without being
+defined using `pgaio_io_release()`.
+
+
+### AIO Callbacks
+
+Commonly several layers need to react to completion of an IO. E.g. for a read,
+md.c needs to check if the IO outright failed or was shorter than needed, and
+bufmgr.c needs to verify that the page looks valid and update the BufferDesc
+to reflect the buffer's new state.
+
+The fact that several layers / subsystems need to react to IO completion poses
+a few challenges:
+
+- Upper layers should not need to know details of lower layers. E.g. bufmgr.c
+ should not assume the IO will pass through md.c. Therefore upper levels
+ cannot know what lower layers would consider an error.
+
+- Lower layers should not need to know about upper layers. E.g. smgr APIs are
+ used going through shared buffers but are also used bypassing shared
+ buffers. This means that e.g. md.c is not in a position to validate
+ checksums.
+
+- Having code in the AIO subsystem for every possible combination of layers
+ would lead to a lot of duplication.
+
+The "solution" to this is the ability to associate multiple completion
+callbacks with a handle. E.g. bufmgr.c can have a callback to update the
+BufferDesc state and to verify the page, and md.c another callback to check
+whether the IO operation was successful.
+
+As [mentioned](#state-for-aio-needs-to-live-in-shared-memory), shared memory
+currently cannot contain function pointers. Because of that, completion
+callbacks are not directly identified by function pointers but by IDs
+(`PgAioHandleCallbackID`). A substantial added benefit is that this allows a
+callback to be identified by a much smaller amount of memory (currently a
+single byte).
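A minimal sketch of this ID-based dispatch (all names are invented for illustration; the real callback registry lives in the AIO subsystem, not here):

```c
#include <assert.h>
#include <stdint.h>

/* IDs are stable across processes; function addresses are not (ASLR). */
typedef enum DemoCallbackID
{
	DEMO_CB_INVALID = 0,
	DEMO_CB_DOUBLE,
	DEMO_CB_NEGATE,
} DemoCallbackID;

typedef int (*DemoCallback) (int arg);

static int cb_double(int arg) { return 2 * arg; }
static int cb_negate(int arg) { return -arg; }

/* Process-local table; each process has its own copy at its own addresses. */
static const DemoCallback demo_callbacks[] = {
	[DEMO_CB_DOUBLE] = cb_double,
	[DEMO_CB_NEGATE] = cb_negate,
};

/* What a shared-memory handle would store: just the one-byte ID. */
typedef struct DemoHandle
{
	uint8_t		callback_id;
} DemoHandle;

/* Any process can dispatch by looking the ID up in its local table. */
static int
demo_dispatch(const DemoHandle *hnd, int arg)
{
	return demo_callbacks[hnd->callback_id] (arg);
}
```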
+
+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to increase buffer reference counts to account for the
+AIO subsystem referencing the buffer, which is required to handle the case
+where the issuing backend errors out and releases its own pins while the IO is
+still ongoing.
+
+As [explained earlier](#io-can-be-started-in-critical-sections) IO completions
+need to be safe to execute in critical sections. To allow the backend that
+issued the IO to error out in case of failure [AIO Result](#aio-results) can
+be used.
+
+
+### AIO Targets
+
+In addition to the completion callbacks described above, each AIO Handle has
+exactly one "target". Each target has some space inside an AIO Handle with
+information specific to the target and can provide callbacks to allow to
+reopen the underlying file (required for worker mode) and to describe the IO
+operation (used for debug logging and error messages).
+
+I.e., if two different uses of AIO can describe the identity of the file being
+operated on the same way, it likely makes sense to use the same
+target. E.g. different smgr implementations can describe IO with
+RelFileLocator, ForkNumber and BlockNumber and can thus share a target. In
+contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
+and it would not make sense to use the same target for smgr and WAL.
+
+
+### AIO Wait References
+
+As [described above](#aio-handles), AIO Handles can be reused immediately
+after completion and therefore cannot be used to wait for completion of the
+IO. Waiting is enabled using AIO wait references, which do not just identify
+an AIO Handle but also include the handle's "generation".
+
+A reference to an AIO Handle can be acquired using `pgaio_io_get_wref()` and
+then waited upon using `pgaio_wref_wait()`.
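The generation scheme can be illustrated with a toy model (invented names; the real handles and generations live in shared memory and involve more state):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct DemoHandle
{
	uint64_t	generation;		/* bumped every time the handle is recycled */
	bool		in_use;
} DemoHandle;

typedef struct DemoWaitRef
{
	DemoHandle *handle;
	uint64_t	generation;		/* generation at the time the ref was taken */
} DemoWaitRef;

static void
demo_get_wref(DemoHandle *hnd, DemoWaitRef *ref)
{
	ref->handle = hnd;
	ref->generation = hnd->generation;
}

/*
 * If the generations no longer match, the IO the reference was taken for has
 * completed and the handle has since been recycled for another IO - so there
 * is nothing left to wait for.
 */
static bool
demo_wref_still_pending(const DemoWaitRef *ref)
{
	return ref->handle->in_use &&
		ref->handle->generation == ref->generation;
}

/* Completion recycles the handle by bumping its generation. */
static void
demo_complete_and_recycle(DemoHandle *hnd)
{
	hnd->generation++;
	hnd->in_use = false;
}
```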
+
+
+### AIO Results
+
+As AIO completion callbacks
+[are executed in critical sections](#io-can-be-started-in-critical-sections)
+and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
+completion callbacks cannot be used to, e.g., make the query that triggered an
+IO ERROR out.
+
+To allow reacting to failing IOs, the issuing backend can pass a pointer to a
+`PgAioReturn` in backend local memory. Before an AIO Handle is reused, the
+`PgAioReturn` is filled with information about the IO. This includes
+information about whether the IO was successful (as a value of
+`PgAioResultStatus`) and enough information to raise an error in case of a
+failure (via `pgaio_result_report()`, with the error details encoded in
+`PgAioResult`).
+
+XXX: "return" vs "result" vs "result status" seems quite confusing. The naming
+should be improved.
+
+
+### AIO Errors
+
+It would be very convenient to have shared completion callbacks encode the
+details of errors as an `ErrorData` that could be raised at a later
+time. Unfortunately doing so would require allocating memory. While elog.c can
+guarantee (well, kinda) that logging a message will not run out of memory,
+that only works because a very limited number of messages are in the process
+of being logged. With AIO a large number of concurrently issued AIOs might
+fail.
+
+To avoid the need for preallocating a potentially large amount of memory (in
+shared memory no less!), completion callbacks instead have to encode errors in
+a more compact format that can be converted into an error message.
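As one possible illustration (a sketch, not the actual PgAioResult layout), an error can be packed into a single 32-bit value and only expanded into a message by the backend that eventually raises it:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum DemoStatus
{
	DEMO_OK = 0,
	DEMO_PARTIAL,
	DEMO_ERROR,
} DemoStatus;

/* status in the top byte, errno-style detail in the low 24 bits */
static uint32_t
demo_encode(DemoStatus status, int error_code)
{
	return ((uint32_t) status << 24) | ((uint32_t) error_code & 0xFFFFFF);
}

static DemoStatus
demo_status(uint32_t packed)
{
	return (DemoStatus) (packed >> 24);
}

static int
demo_errcode(uint32_t packed)
{
	return (int) (packed & 0xFFFFFF);
}

/* Message text is only produced when the error is actually raised. */
static void
demo_format(uint32_t packed, char *buf, size_t buflen)
{
	snprintf(buf, buflen, "IO failed: %s", strerror(demo_errcode(packed)));
}
```

The point of the deferral is that no per-IO message memory has to be preallocated, no matter how many concurrently issued IOs fail.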
+
+
+## Helpers
+
+Using the low-level AIO API directly introduces too much complexity to do so
+all over the tree. Most uses of AIO should instead go through reusable,
+higher-level helpers.
+
+
+### Read Stream
+
+A common and very beneficial use of AIO is reads where a substantial number
+of to-be-read locations are known ahead of time. E.g., for a sequential scan
+the set of blocks that need to be read can be determined solely by knowing the
+current position and checking the buffer mapping table.
+
+The [Read Stream](../../../include/storage/read_stream.h) interface makes it
+comparatively easy to use AIO for such use cases.
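The underlying pattern, keeping a bounded number of reads in flight ahead of the consumer, can be sketched as a toy iterator (invented names; the real interface in read_stream.h is considerably richer):

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_DISTANCE 4			/* how far ahead to start IO */

typedef struct DemoStream
{
	int			next_to_issue;	/* next block to start IO for */
	int			next_to_return; /* next block to hand to the caller */
	int			nblocks;		/* total blocks in the "relation" */
	int			issued;			/* number of IO starts, for illustration */
} DemoStream;

static void
demo_start_io(DemoStream *stream, int blockno)
{
	/* in real code: acquire an AIO handle and start a read */
	stream->issued++;
	(void) blockno;
}

/* Keep up to DEMO_DISTANCE reads in flight ahead of the consumer. */
static bool
demo_next_block(DemoStream *stream, int *blockno)
{
	while (stream->next_to_issue < stream->nblocks &&
		   stream->next_to_issue - stream->next_to_return < DEMO_DISTANCE)
		demo_start_io(stream, stream->next_to_issue++);

	if (stream->next_to_return >= stream->nblocks)
		return false;

	/* in real code: wait here for the block's IO to complete */
	*blockno = stream->next_to_return++;
	return true;
}
```

Because the issue pointer runs ahead of the return pointer, the IO for a block has typically completed by the time the consumer asks for it.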
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index c1dd073e37f..b3b4e74c3ce 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -24,6 +24,8 @@
*
* - read_stream.c - helper for reading buffered relation data
*
+ * - README.md - higher-level overview over AIO
+ *
*
* Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0017-aio-Implement-smgr-md-fd-aio-methods.patch
From 6c9493bdbc9164decc460c7ab74aaceea19d67a0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 16:06:51 -0500
Subject: [PATCH v2.3 17/30] aio: Implement smgr/md/fd aio methods
---
src/include/storage/aio.h | 6 +-
src/include/storage/aio_types.h | 12 +-
src/include/storage/fd.h | 6 +
src/include/storage/md.h | 12 +
src/include/storage/smgr.h | 22 ++
src/backend/storage/aio/aio_callback.c | 4 +
src/backend/storage/aio/aio_target.c | 2 +
src/backend/storage/file/fd.c | 68 +++++
src/backend/storage/smgr/md.c | 360 +++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 126 +++++++++
10 files changed, 614 insertions(+), 4 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 8c1b9a1b496..a948eaeefa7 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -108,9 +108,10 @@ typedef enum PgAioTargetID
{
/* intentionally the zero value, to help catch zeroed memory etc */
PGAIO_TID_INVALID = 0,
+ PGAIO_TID_SMGR,
} PgAioTargetID;
-#define PGAIO_TID_COUNT (PGAIO_TID_INVALID + 1)
+#define PGAIO_TID_COUNT (PGAIO_TID_SMGR + 1)
/*
@@ -174,6 +175,9 @@ typedef struct PgAioTargetInfo
typedef enum PgAioHandleCallbackID
{
PGAIO_HCB_INVALID,
+
+ PGAIO_HCB_MD_READV,
+ PGAIO_HCB_MD_WRITEV,
} PgAioHandleCallbackID;
diff --git a/src/include/storage/aio_types.h b/src/include/storage/aio_types.h
index d2617139a25..762fce3f075 100644
--- a/src/include/storage/aio_types.h
+++ b/src/include/storage/aio_types.h
@@ -58,11 +58,17 @@ typedef struct PgAioWaitRef
*/
typedef union PgAioTargetData
{
- /* just as an example placeholder for later */
struct
{
- uint32 queue_id;
- } wal;
+ RelFileLocator rlocator; /* physical relation identifier */
+ BlockNumber blockNum; /* blknum relative to begin of reln */
+ BlockNumber nblocks;
+ ForkNumber forkNum:8; /* don't waste 4 bytes for four values */
+ bool is_temp:1; /* proc can be inferred by owning AIO */
+ bool release_lock:1;
+ bool skip_fsync:1;
+ uint8 mode;
+ } smgr;
} PgAioTargetData;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e3067ab6597..e2fd896646e 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -101,6 +101,8 @@ extern PGDLLIMPORT int max_safe_fds;
* prototypes for functions in fd.c
*/
+struct PgAioHandle;
+
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -109,6 +111,8 @@ extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartWriteV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 05bf537066e..7b28c3d482c 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
#include "storage/smgr.h"
#include "storage/sync.h"
+struct PgAioHandleCallbacks;
+extern const struct PgAioHandleCallbacks aio_md_readv_cb;
+extern const struct PgAioHandleCallbacks aio_md_writev_cb;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
@@ -36,9 +40,16 @@ extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void mdstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks, bool skipFsync);
+extern void mdstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync);
extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -46,6 +57,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber old_blocks, BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 4016b206ad6..86fa07b110f 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -73,6 +73,11 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
+struct PgAioHandle;
+struct PgAioTargetInfo;
+
+extern const struct PgAioTargetInfo aio_smgr_target_info;
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -97,10 +102,19 @@ extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+extern void smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -110,6 +124,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
@@ -127,4 +142,11 @@ smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
}
+extern void pgaio_io_set_target_smgr(struct PgAioHandle *ioh,
+ SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool skip_fsync);
+
#endif /* SMGR_H */
diff --git a/src/backend/storage/aio/aio_callback.c b/src/backend/storage/aio/aio_callback.c
index 93f71690169..7fd42880535 100644
--- a/src/backend/storage/aio/aio_callback.c
+++ b/src/backend/storage/aio/aio_callback.c
@@ -18,6 +18,7 @@
#include "miscadmin.h"
#include "storage/aio.h"
#include "storage/aio_internal.h"
+#include "storage/md.h"
#include "utils/memutils.h"
@@ -38,6 +39,9 @@ typedef struct PgAioHandleCallbacksEntry
static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
#define CALLBACK_ENTRY(id, callback) [id] = {.cb = &callback, .name = #callback}
CALLBACK_ENTRY(PGAIO_HCB_INVALID, aio_invalid_cb),
+
+ CALLBACK_ENTRY(PGAIO_HCB_MD_READV, aio_md_readv_cb),
+ CALLBACK_ENTRY(PGAIO_HCB_MD_WRITEV, aio_md_writev_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/aio/aio_target.c b/src/backend/storage/aio/aio_target.c
index 15428968e58..a43edd89890 100644
--- a/src/backend/storage/aio/aio_target.c
+++ b/src/backend/storage/aio/aio_target.c
@@ -18,6 +18,7 @@
#include "storage/aio.h"
#include "storage/aio_internal.h"
+#include "storage/smgr.h"
/*
@@ -31,6 +32,7 @@ static const PgAioTargetInfo *pgaio_target_info[] = {
[PGAIO_TID_INVALID] = &(PgAioTargetInfo) {
.name = "invalid",
},
+ [PGAIO_TID_SMGR] = &aio_smgr_target_info,
};
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 843d1021cf9..89f2dc29555 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -94,6 +94,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/startup.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
@@ -1294,6 +1295,8 @@ LruDelete(File file)
vfdP = &VfdCache[file];
+ pgaio_closing_fd(vfdP->fd);
+
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1987,6 +1990,8 @@ FileClose(File file)
if (!FileIsNotOpen(file))
{
+ pgaio_closing_fd(vfdP->fd);
+
/* close the file */
if (close(vfdP->fd) != 0)
{
@@ -2210,6 +2215,32 @@ retry:
return returnCode;
}
+int
+FileStartReadV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
ssize_t
FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
uint32 wait_event_info)
@@ -2315,6 +2346,34 @@ retry:
return returnCode;
}
+int
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ /* FIXME: think about / reimplement temp_file_limit */
+
+ pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
int
FileSync(File file, uint32 wait_event_info)
{
@@ -2498,6 +2557,12 @@ FilePathName(File file)
int
FileGetRawDesc(File file)
{
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
Assert(FileIsValid(file));
return VfdCache[file].fd;
}
@@ -2778,6 +2843,7 @@ FreeDesc(AllocateDesc *desc)
result = closedir(desc->desc.dir);
break;
case AllocateDescRawFD:
+ pgaio_closing_fd(desc->desc.fd);
result = close(desc->desc.fd);
break;
default:
@@ -2846,6 +2912,8 @@ CloseTransientFile(int fd)
/* Only get here if someone passes us a file not in allocatedDescs */
elog(WARNING, "fd passed to CloseTransientFile was not obtained from OpenTransientFile");
+ pgaio_closing_fd(fd);
+
return close(fd);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 7bf0b45e2c3..e204b7abba6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,6 +31,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/md.h"
@@ -132,6 +133,22 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+static PgAioResult md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_readv_report(PgAioResult result, const PgAioTargetData *target_data, int elevel);
+static PgAioResult md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_writev_report(PgAioResult result, const PgAioTargetData *target_data, int elevel);
+
+const struct PgAioHandleCallbacks aio_md_readv_cb = {
+ .complete_shared = md_readv_complete,
+ .report = md_readv_report,
+};
+
+const struct PgAioHandleCallbacks aio_md_writev_cb = {
+ .complete_shared = md_writev_complete,
+ .report = md_writev_report,
+};
+
+
static inline int
_mdfd_open_flags(void)
{
@@ -927,6 +944,53 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartreadv(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ if (!(io_direct_flags & IO_DIRECT_DATA))
+ pgaio_io_set_flag(ioh, PGAIO_HF_BUFFERED);
+
+ pgaio_io_set_target_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks,
+ false);
+ pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_READV);
+
+ FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+}
+
/*
* mdwritev() -- Write the supplied blocks at the appropriate location.
*
@@ -1032,6 +1096,53 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartwritev(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, unconstify(void **, buffers), nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ if (!(io_direct_flags & IO_DIRECT_DATA))
+ pgaio_io_set_flag(ioh, PGAIO_HF_BUFFERED);
+
+ pgaio_io_set_target_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks,
+ skipFsync);
+ pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_WRITEV);
+
+ FileStartWriteV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+}
+
/*
* mdwriteback() -- Tell the kernel to write pages back to storage.
@@ -1355,6 +1466,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
+int
+mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ MdfdVec *v = mdopenfork(reln, forknum, EXTENSION_FAIL);
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL);
+
+ *off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ return FileGetRawDesc(v->mdfd_vfd);
+}
+
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1405,6 +1531,35 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
}
}
+/*
+ * Like register_dirty_segment(), except for use by AIO. In the completion
+ * callback we don't have access to the MdfdVec (the completion callback might
+ * be executed in a different backend than the issuing backend), so this
+ * has to be implemented slightly differently.
+ */
+static void
+register_dirty_segment_aio(RelFileLocator locator, ForkNumber forknum, uint64 segno)
+{
+ FileTag tag;
+
+ INIT_MD_FILETAG(tag, locator, forknum, segno);
+
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
+ {
+ char path[MAXPGPATH];
+
+ ereport(DEBUG1,
+ (errmsg_internal("could not forward fsync request because request queue is full")));
+
+ /* reuse mdsyncfiletag() to avoid duplicating code */
+ if (mdsyncfiletag(&tag, path))
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ path)));
+ }
+}
+
/*
* register_unlink_segment() -- Schedule a file to be deleted after next checkpoint
*/
@@ -1838,3 +1993,208 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
*/
return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
}
+
+/*
+ * AIO completion callback for mdstartreadv().
+ */
+static PgAioResult
+md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioTargetData *sd = pgaio_io_get_target_data(ioh);
+ PgAioResult result = prior_result;
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_MD_READV;
+ /* For "hard" errors, track the error number in error_data */
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_readv_report(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks read a failure */
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_MD_READV;
+ result.error_data = 0;
+
+ md_readv_report(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial reads should be retried at upper level */
+ result.status = ARS_PARTIAL;
+ result.id = PGAIO_HCB_MD_READV;
+ }
+
+ /* AFIXME: post-read portion of mdreadv() */
+
+ return result;
+}
+
+/*
+ * AIO error reporting callback for mdstartreadv().
+ */
+static void
+md_readv_report(PgAioResult result, const PgAioTargetData *sd, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+ char *path;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ path = relpathbackend(sd->smgr.rlocator,
+ sd->smgr.is_temp ? MyProcNumber : INVALID_PROC_NUMBER,
+ sd->smgr.forkNum);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not read blocks %u..%u in file \"%s\": %m",
+ sd->smgr.blockNum,
+ sd->smgr.blockNum + sd->smgr.nblocks - 1,
+ path
+ )
+ );
+ }
+ else
+ {
+ /*
+ * NB: This will typically only be output in debug messages, while
+ * retrying a partial IO.
+ */
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read blocks %u..%u in file \"%s\": read only %zu of %zu bytes",
+ sd->smgr.blockNum,
+ sd->smgr.blockNum + sd->smgr.nblocks - 1,
+ path,
+ result.result * (size_t) BLCKSZ,
+ sd->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ pfree(path);
+ MemoryContextSwitchTo(oldContext);
+}
+
+/*
+ * AIO completion callback for mdstartwritev().
+ */
+static PgAioResult
+md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioTargetData *sd = pgaio_io_get_target_data(ioh);
+ PgAioResult result = prior_result;
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_MD_WRITEV;
+ /* For "hard" errors, track the error number in error_data */
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_writev_report(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks written a failure */
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_MD_WRITEV;
+ result.error_data = 0;
+
+ md_writev_report(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial writes should be retried at upper level */
+ result.status = ARS_PARTIAL;
+ result.id = PGAIO_HCB_MD_WRITEV;
+ }
+
+ if (!sd->smgr.skip_fsync)
+ register_dirty_segment_aio(sd->smgr.rlocator, sd->smgr.forkNum,
+ sd->smgr.blockNum / ((BlockNumber) RELSEG_SIZE));
+
+ return result;
+}
+
+/*
+ * AIO error reporting callback for mdstartwritev().
+ */
+static void
+md_writev_report(PgAioResult result, const PgAioTargetData *sd, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+ char *path;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ path = relpathbackend(sd->smgr.rlocator,
+ sd->smgr.is_temp ? MyProcNumber : INVALID_PROC_NUMBER,
+ sd->smgr.forkNum);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not write blocks %u..%u in file \"%s\": %m",
+ sd->smgr.blockNum,
+ sd->smgr.blockNum + sd->smgr.nblocks - 1,
+ path)
+ );
+ }
+ else
+ {
+ /*
+ * NB: This will typically only be output in debug messages, while
+ * retrying a partial IO.
+ */
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not write blocks %u..%u in file \"%s\": wrote only %zu of %zu bytes",
+ sd->smgr.blockNum,
+ sd->smgr.blockNum + sd->smgr.nblocks - 1,
+ path,
+ result.result * (size_t) BLCKSZ,
+ sd->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ pfree(path);
+ MemoryContextSwitchTo(oldContext);
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ebe35c04de5..fb231e6ad48 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,6 +53,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
@@ -93,10 +94,19 @@ typedef struct f_smgr
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+ void (*smgr_startreadv) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
void (*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+ void (*smgr_startwritev) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -104,6 +114,7 @@ typedef struct f_smgr
BlockNumber old_blocks, BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+ int (*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -121,12 +132,15 @@ static const f_smgr smgrsw[] = {
.smgr_prefetch = mdprefetch,
.smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
+ .smgr_startreadv = mdstartreadv,
.smgr_writev = mdwritev,
+ .smgr_startwritev = mdstartwritev,
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
.smgr_registersync = mdregistersync,
+ .smgr_fd = mdfd,
}
};
@@ -145,6 +159,16 @@ static void smgrshutdown(int code, Datum arg);
static void smgrdestroy(SMgrRelation reln);
+static void smgr_aio_reopen(PgAioHandle *ioh);
+static char *smgr_aio_describe_identity(const PgAioTargetData *sd);
+
+const struct PgAioTargetInfo aio_smgr_target_info = {
+ .name = "smgr",
+ .reopen = smgr_aio_reopen,
+ .describe_identity = smgr_aio_describe_identity,
+};
+
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
* managers.
@@ -623,6 +647,22 @@ smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nblocks);
}
+/*
+ * smgrstartreadv() -- asynchronous version of smgrreadv()
+ *
+ * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
+ * `ioh` the parameters are the same as for smgrreadv().
+ */
+void
+smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ smgrsw[reln->smgr_which].smgr_startreadv(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks);
+}
+
/*
* smgrwritev() -- Write the supplied buffers out.
*
@@ -657,6 +694,22 @@ smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
buffers, nblocks, skipFsync);
}
+/*
+ * smgrstartwritev() -- asynchronous version of smgrwritev()
+ *
+ * This starts an asynchronous writev IO using the IO handle `ioh`. Other than
+ * `ioh` the parameters are the same as for smgrwritev().
+ */
+void
+smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_startwritev(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks, skipFsync);
+}
+
/*
* smgrwriteback() -- Trigger kernel writeback for the supplied range of
* blocks.
@@ -819,6 +869,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+int
+smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ return smgrsw[reln->smgr_which].smgr_fd(reln, forknum, blocknum, off);
+}
+
/*
* AtEOXact_SMgr
*
@@ -847,3 +903,73 @@ ProcessBarrierSmgrRelease(void)
smgrreleaseall();
return true;
}
+
+void
+pgaio_io_set_target_smgr(PgAioHandle *ioh,
+ struct SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool skip_fsync)
+{
+ PgAioTargetData *sd = pgaio_io_get_target_data(ioh);
+
+ pgaio_io_set_target(ioh, PGAIO_TID_SMGR);
+
+ /* backend is implied via IO owner */
+ sd->smgr.rlocator = smgr->smgr_rlocator.locator;
+ sd->smgr.forkNum = forknum;
+ sd->smgr.blockNum = blocknum;
+ sd->smgr.nblocks = nblocks;
+ sd->smgr.is_temp = SmgrIsTemp(smgr);
+ sd->smgr.release_lock = false;
+ /* Temp relations should never be fsync'd */
+ sd->smgr.skip_fsync = skip_fsync && !SmgrIsTemp(smgr);
+ sd->smgr.mode = RBM_NORMAL;
+}
+
+static void
+smgr_aio_reopen(PgAioHandle *ioh)
+{
+ PgAioTargetData *sd = pgaio_io_get_target_data(ioh);
+ PgAioOpData *od = pgaio_io_get_op_data(ioh);
+ SMgrRelation reln;
+ ProcNumber procno;
+ uint32 off;
+
+ if (sd->smgr.is_temp)
+ procno = pgaio_io_get_owner(ioh);
+ else
+ procno = INVALID_PROC_NUMBER;
+
+ reln = smgropen(sd->smgr.rlocator, procno);
+ od->read.fd = smgrfd(reln, sd->smgr.forkNum, sd->smgr.blockNum, &off);
+ Assert(off == od->read.offset);
+}
+
+static char *
+smgr_aio_describe_identity(const PgAioTargetData *sd)
+{
+ char *path;
+ char *desc;
+
+ path = relpathbackend(sd->smgr.rlocator,
+ sd->smgr.is_temp ? MyProcNumber : INVALID_PROC_NUMBER,
+ sd->smgr.forkNum);
+
+ if (sd->smgr.nblocks == 0)
+ desc = psprintf(_("file \"%s\""), path);
+ else if (sd->smgr.nblocks == 1)
+ desc = psprintf(_("block %u in file \"%s\""),
+ sd->smgr.blockNum,
+ path);
+ else
+ desc = psprintf(_("blocks %u..%u in file \"%s\""),
+ sd->smgr.blockNum,
+ sd->smgr.blockNum + sd->smgr.nblocks - 1,
+ path);
+
+ pfree(path);
+
+ return desc;
+}
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0018-aio-Add-pg_aios-view.patch (text/x-diff; charset=us-ascii)
From a92ecb8ff29feaa485c50c10914f30678d3694ad Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:40 -0500
Subject: [PATCH v2.3 18/30] aio: Add pg_aios view
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/catalog/pg_proc.dat | 10 ++
src/backend/catalog/system_views.sql | 3 +
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio_funcs.c | 240 +++++++++++++++++++++++++++
src/backend/storage/aio/meson.build | 1 +
src/test/regress/expected/rules.out | 17 ++
6 files changed, 272 insertions(+)
create mode 100644 src/backend/storage/aio/aio_funcs.c
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 18560755d26..df29275d7b1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12435,4 +12435,14 @@
proargtypes => 'int4',
prosrc => 'gist_stratnum_common' },
+# AIO related functions
+{ oid => '9200', descr => 'information about in-progress asynchronous IOs',
+ proname => 'pg_get_aios', prorows => '100', proretset => 't',
+ provolatile => 'v', proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int4,int4,int8,text,text,int8,int8,text,int2,int4,text,text,text,bool,bool,bool}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{pid,io_id,io_generation,state,operation,offset,length,target,handle_data_len,raw_result,result,error_desc,target_desc,f_sync,f_localmem,f_buffered}',
+ prosrc => 'pg_get_aios' },
+
+
]
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 46868bf7e89..884c73cd2bf 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1388,3 +1388,6 @@ CREATE VIEW pg_stat_subscription_stats AS
CREATE VIEW pg_wait_events AS
SELECT * FROM pg_get_wait_events();
+
+CREATE VIEW pg_aios AS
+ SELECT * FROM pg_get_aios();
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index c06c50771e0..3f2469cc399 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
aio_callback.o \
+ aio_funcs.o \
aio_init.o \
aio_io.o \
aio_target.o \
diff --git a/src/backend/storage/aio/aio_funcs.c b/src/backend/storage/aio/aio_funcs.c
new file mode 100644
index 00000000000..65ee3cb22a6
--- /dev/null
+++ b/src/backend/storage/aio/aio_funcs.c
@@ -0,0 +1,240 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_funcs.c
+ * AIO - SQL interface for AIO
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_funcs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "utils/builtins.h"
+#include "funcapi.h"
+#include "storage/proc.h"
+
+
+/*
+ * Byte length of an iovec.
+ */
+static size_t
+iov_byte_length(const struct iovec *iov, int cnt)
+{
+ size_t len = 0;
+
+ for (int i = 0; i < cnt; i++)
+ {
+ len += iov[i].iov_len;
+ }
+
+ return len;
+}
+
+static const char *
+pgaio_result_status_string(PgAioResultStatus rs)
+{
+ switch (rs)
+ {
+ case ARS_UNKNOWN:
+ return "UNKNOWN";
+ case ARS_OK:
+ return "OK";
+ case ARS_PARTIAL:
+ return "PARTIAL";
+ case ARS_ERROR:
+ return "ERROR";
+ }
+
+ return NULL; /* silence compiler */
+}
+
+Datum
+pg_get_aios(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+#define PG_GET_AIOS_COLS 16
+
+ for (uint64 i = 0; i < pgaio_ctl->io_handle_count; i++)
+ {
+ PgAioHandle *live_ioh = &pgaio_ctl->io_handles[i];
+ uint32 ioh_id = pgaio_io_get_id(live_ioh);
+ Datum values[PG_GET_AIOS_COLS] = {0};
+ bool nulls[PG_GET_AIOS_COLS] = {0};
+ ProcNumber owner;
+ PGPROC *owner_proc;
+ int32 owner_pid;
+ PgAioHandleState start_state;
+ uint64 start_generation;
+ PgAioHandle ioh_copy;
+ struct iovec iov_copy[PG_IOV_MAX];
+
+retry:
+
+ /*
+ * There is no lock that could prevent the state of the IO from
+ * advancing concurrently - and we don't want to introduce one, as that
+ * would add atomics to a very common path. Instead we
+ *
+ * 1) determine the state + generation of the IO
+ *
+ * 2) copy the IO to local memory
+ *
+ * 3) check if state and generation of the IO changed
+ */
+
+ /* 1) from above */
+ start_generation = live_ioh->generation;
+ pg_read_barrier();
+ start_state = live_ioh->state;
+
+ if (start_state == PGAIO_HS_IDLE)
+ continue;
+
+ /* 2) from above */
+ memcpy(&ioh_copy, live_ioh, sizeof(PgAioHandle));
+
+ /*
+ * Safe to copy even if no iovec is used - we always reserve the
+ * required space.
+ */
+ memcpy(&iov_copy, &pgaio_ctl->iovecs[ioh_copy.iovec_off],
+ PG_IOV_MAX * sizeof(struct iovec));
+
+ /*
+ * Copy information about the owner before 3) below; if the process had
+ * exited, it would have had to wait for the IO to finish first, which
+ * we would detect in 3).
+ */
+ owner = ioh_copy.owner_procno;
+ owner_proc = GetPGProcByNumber(owner);
+ owner_pid = owner_proc->pid;
+
+ /* 3) from above */
+ pg_read_barrier();
+
+ /*
+ * The IO completed and a new one was started with the same ID. Don't
+ * display it - it really started after this function was called. If we
+ * just retried endlessly, there would be a risk of a livelock when IOs
+ * complete very quickly.
+ */
+ if (live_ioh->generation != start_generation)
+ continue;
+
+ /*
+ * The IO's state changed while we were "rendering" it. Just start from
+ * scratch. There's no risk of a livelock here, as an IO has a limited
+ * set of states it can be in, and state changes go only in a single
+ * direction.
+ */
+ if (live_ioh->state != start_state)
+ goto retry;
+
+ /*
+ * Now that we have copied the IO into local memory and checked that
+ * it's still in the same state, we are not allowed to access "live"
+ * memory anymore. To make it slightly easier to catch such cases, set
+ * the "live" pointers to NULL.
+ */
+ live_ioh = NULL;
+ owner_proc = NULL;
+
+
+ /* column: owning pid */
+ if (owner_pid != 0)
+ values[0] = Int32GetDatum(owner_pid);
+ else
+ nulls[0] = true;
+
+ /* column: IO's id */
+ values[1] = UInt32GetDatum(ioh_id);
+
+ /* column: IO's generation */
+ values[2] = Int64GetDatum(start_generation);
+
+ /* column: IO's state */
+ values[3] = CStringGetTextDatum(pgaio_io_get_state_name(&ioh_copy));
+
+ /*
+	 * If the IO is in PGAIO_HS_HANDED_OUT state, none of its fields are
+ * valid yet (or are in the process of being set). Therefore we don't
+ * want to display any other columns.
+ */
+ if (start_state == PGAIO_HS_HANDED_OUT)
+ {
+ memset(nulls + 4, 1, (lengthof(nulls) - 4) * sizeof(bool));
+ goto display;
+ }
+
+ /* column: IO's operation */
+ values[4] = CStringGetTextDatum(pgaio_io_get_op_name(&ioh_copy));
+
+ /* columns: details about the IO's operation */
+ switch (ioh_copy.op)
+ {
+ case PGAIO_OP_INVALID:
+ nulls[5] = true;
+ nulls[6] = true;
+ break;
+ case PGAIO_OP_READV:
+ values[5] = Int64GetDatum(ioh_copy.op_data.read.offset);
+ values[6] =
+ Int64GetDatum(iov_byte_length(iov_copy, ioh_copy.op_data.read.iov_length));
+ break;
+ case PGAIO_OP_WRITEV:
+ values[5] = Int64GetDatum(ioh_copy.op_data.write.offset);
+ values[6] =
+ Int64GetDatum(iov_byte_length(iov_copy, ioh_copy.op_data.write.iov_length));
+ break;
+ }
+
+ /* column: IO's target */
+ values[7] = CStringGetTextDatum(pgaio_io_get_target_name(&ioh_copy));
+
+ /* column: length of IO's data array */
+ values[8] = Int16GetDatum(ioh_copy.handle_data_len);
+
+ /* column: raw result (i.e. some form of syscall return value) */
+ if (start_state == PGAIO_HS_COMPLETED_IO
+ || start_state == PGAIO_HS_COMPLETED_SHARED)
+ values[9] = Int32GetDatum(ioh_copy.result);
+ else
+ nulls[9] = true;
+
+ /*
+	 * column: result in the higher-level representation (unknown if not
+	 * yet finished)
+ */
+ values[10] =
+ CStringGetTextDatum(pgaio_result_status_string(ioh_copy.distilled_result.status));
+
+ /* column: error description */
+ /* AFIXME: implement */
+ nulls[11] = true;
+
+ /* column: target description */
+ values[12] = CStringGetTextDatum(pgaio_io_get_target_description(&ioh_copy));
+
+ /* columns: one for each flag */
+ values[13] = BoolGetDatum(ioh_copy.flags & PGAIO_HF_SYNCHRONOUS);
+ values[14] = BoolGetDatum(ioh_copy.flags & PGAIO_HF_REFERENCES_LOCAL);
+ values[15] = BoolGetDatum(ioh_copy.flags & PGAIO_HF_BUFFERED);
+
+display:
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+
+ return (Datum) 0;
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 2f0f03d8071..da6df2d3654 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'aio.c',
'aio_callback.c',
+ 'aio_funcs.c',
'aio_init.c',
'aio_io.c',
'aio_target.c',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 856a8349c50..c0e18a350f5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1286,6 +1286,23 @@ drop table cchild;
SELECT viewname, definition FROM pg_views
WHERE schemaname = 'pg_catalog'
ORDER BY viewname;
+pg_aios| SELECT pid,
+ io_id,
+ io_generation,
+ state,
+ operation,
+ "offset",
+ length,
+ target,
+ handle_data_len,
+ raw_result,
+ result,
+ error_desc,
+ target_desc,
+ f_sync,
+ f_localmem,
+ f_buffered
+ FROM pg_get_aios() pg_get_aios(pid, io_id, io_generation, state, operation, "offset", length, target, handle_data_len, raw_result, result, error_desc, target_desc, f_sync, f_localmem, f_buffered);
pg_available_extension_versions| SELECT e.name,
e.version,
(x.extname IS NOT NULL) AS installed,
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0019-bufmgr-Implement-AIO-read-support.patch (text/x-diff)
From 34bdf7e671846828be4d194cee881218b78a817b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 16:08:58 -0500
Subject: [PATCH v2.3 19/30] bufmgr: Implement AIO read support
As of this commit there are no users of these AIO facilities; that'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 4 +
src/include/storage/buf_internals.h | 6 +
src/include/storage/bufmgr.h | 8 +
src/backend/storage/aio/aio_callback.c | 5 +
src/backend/storage/buffer/buf_init.c | 3 +
src/backend/storage/buffer/bufmgr.c | 389 ++++++++++++++++++++++++-
src/backend/storage/buffer/localbuf.c | 65 +++++
7 files changed, 473 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index a948eaeefa7..6f36a0b9e4d 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -178,6 +178,10 @@ typedef enum PgAioHandleCallbackID
PGAIO_HCB_MD_READV,
PGAIO_HCB_MD_WRITEV,
+
+ PGAIO_HCB_SHARED_BUFFER_READV,
+
+ PGAIO_HCB_LOCAL_BUFFER_READV,
} PgAioHandleCallbackID;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 1a65342177d..9f936cd6b84 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -17,6 +17,7 @@
#include "pgstat.h"
#include "port/atomics.h"
+#include "storage/aio_types.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -251,6 +252,8 @@ typedef struct BufferDesc
int wait_backend_pgprocno; /* backend of pin-count waiter */
int freeNext; /* link in freelist chain */
+
+ PgAioWaitRef io_wref;
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
@@ -464,4 +467,7 @@ extern void DropRelationLocalBuffers(RelFileLocator rlocator,
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
extern void AtEOXact_LocalBuffers(bool isCommit);
+
+extern bool ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 46b4e0d90f3..5cff4e223f9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -177,6 +177,12 @@ extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
extern PGDLLIMPORT int32 *LocalRefCount;
+
+struct PgAioHandleCallbacks;
+extern const struct PgAioHandleCallbacks aio_shared_buffer_readv_cb;
+extern const struct PgAioHandleCallbacks aio_local_buffer_readv_cb;
+
+
/* upper limit for effective_io_concurrency */
#define MAX_IO_CONCURRENCY 1000
@@ -194,6 +200,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+struct PgAioHandle;
+
extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/aio/aio_callback.c b/src/backend/storage/aio/aio_callback.c
index 7fd42880535..6054f57eb23 100644
--- a/src/backend/storage/aio/aio_callback.c
+++ b/src/backend/storage/aio/aio_callback.c
@@ -18,6 +18,7 @@
#include "miscadmin.h"
#include "storage/aio.h"
#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
#include "storage/md.h"
#include "utils/memutils.h"
@@ -42,6 +43,10 @@ static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
CALLBACK_ENTRY(PGAIO_HCB_MD_READV, aio_md_readv_cb),
CALLBACK_ENTRY(PGAIO_HCB_MD_WRITEV, aio_md_writev_cb),
+
+ CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_READV, aio_shared_buffer_readv_cb),
+
+ CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_READV, aio_local_buffer_readv_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1f8e03190..ed1dc488a42 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -125,6 +126,8 @@ BufferManagerShmemInit(void)
buf->buf_id = i;
+ pgaio_wref_clear(&buf->io_wref);
+
/*
* Initially link all the buffers together as unused. Subsequent
* management of this list is done by freelist.c.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0d8849bf894..169829e8031 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -58,6 +59,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -514,7 +516,8 @@ static int SyncOneBuffer(int buf_id, bool skip_recently_used,
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
- uint32 set_flag_bits, bool forget_owner);
+ uint32 set_flag_bits, bool forget_owner,
+ bool syncio);
static void AbortBufferIO(Buffer buffer);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -1081,7 +1084,7 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
}
else if (!isLocalBuf)
@@ -1566,7 +1569,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
/* Report I/Os as completing individually. */
@@ -2450,7 +2453,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
if (lock)
LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
- TerminateBufferIO(buf_hdr, false, BM_VALID, true);
+ TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
}
pgBufferUsage.shared_blks_written += extend_by;
@@ -3899,7 +3902,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
* end the BM_IO_IN_PROGRESS state.
*/
- TerminateBufferIO(buf, true, 0, true);
+ TerminateBufferIO(buf, true, 0, true, true);
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
buf->tag.blockNum,
@@ -5456,6 +5459,7 @@ WaitIO(BufferDesc *buf)
for (;;)
{
uint32 buf_state;
+ PgAioWaitRef iow;
/*
* It may not be necessary to acquire the spinlock to check the flag
@@ -5463,10 +5467,19 @@ WaitIO(BufferDesc *buf)
* play it safe.
*/
buf_state = LockBufHdr(buf);
+ iow = buf->io_wref;
UnlockBufHdr(buf, buf_state);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
+
+ if (pgaio_wref_valid(&iow))
+ {
+ pgaio_wref_wait(&iow);
+ ConditionVariablePrepareToSleep(cv);
+ continue;
+ }
+
ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
}
ConditionVariableCancelSleep();
@@ -5555,7 +5568,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
*/
static void
TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
- bool forget_owner)
+ bool forget_owner, bool syncio)
{
uint32 buf_state;
@@ -5567,6 +5580,13 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
+ if (!syncio)
+ {
+ /* release ownership by the AIO subsystem */
+ buf_state -= BUF_REFCOUNT_ONE;
+ pgaio_wref_clear(&buf->io_wref);
+ }
+
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
@@ -5575,6 +5595,40 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
+
+ /*
+	 * If we just released a pin, we need to do BM_PIN_COUNT_WAITER handling.
+ * Most of the time the current backend will hold another pin preventing
+ * that from happening, but that's e.g. not the case when completing an IO
+ * another backend started.
+ *
+ * AFIXME: Deduplicate with UnpinBufferNoOwner() or just replace
+ * BM_PIN_COUNT_WAITER with something saner.
+ */
+ /* Support LockBufferForCleanup() */
+ if (buf_state & BM_PIN_COUNT_WAITER)
+ {
+ /*
+ * Acquire the buffer header lock, re-check that there's a waiter.
+ * Another backend could have unpinned this buffer, and already woken
+ * up the waiter. There's no danger of the buffer being replaced
+ * after we unpinned it above, as it's pinned by the waiter.
+ */
+ buf_state = LockBufHdr(buf);
+
+ if ((buf_state & BM_PIN_COUNT_WAITER) &&
+ BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* we just released the last pin other than the waiter's */
+ int wait_backend_pgprocno = buf->wait_backend_pgprocno;
+
+ buf_state &= ~BM_PIN_COUNT_WAITER;
+ UnlockBufHdr(buf, buf_state);
+ ProcSendSignal(wait_backend_pgprocno);
+ }
+ else
+ UnlockBufHdr(buf, buf_state);
+ }
}
/*
@@ -5626,7 +5680,7 @@ AbortBufferIO(Buffer buffer)
}
}
- TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false);
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
}
/*
@@ -6085,3 +6139,324 @@ EvictUnpinnedBuffer(Buffer buf)
return result;
}
+
+static bool
+ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *bufHdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ blockno = bufHdr->tag.blockNum;
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_VALID));
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(!(buf_state & BM_DIRTY));
+ }
+#endif
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);
+ BlockNumber forkNum = bufHdr->tag.forkNum;
+
+ /* AFIXME: relpathperm allocates memory */
+ MemoryContextSwitchTo(ErrorContext);
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ TerminateBufferIO(bufHdr, false,
+ failed ? BM_IO_ERROR : BM_VALID,
+ false, false);
+
+ /* Report I/Os as completing individually. */
+
+ /* FIXME: Should we do TRACE_POSTGRESQL_BUFFER_READ_DONE here? */
+ return buf_failed;
+}
+
+/*
+ * Helper to prepare IO on shared buffers for execution, shared between reads
+ * and writes.
+ */
+static void
+shared_buffer_stage_common(PgAioHandle *ioh, bool is_write)
+{
+ uint64 *io_data;
+ uint8 handle_data_len;
+ PgAioWaitRef io_ref;
+ BufferTag first PG_USED_FOR_ASSERTS_ONLY = {0};
+
+ io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
+
+ pgaio_io_get_wref(ioh, &io_ref);
+
+ for (int i = 0; i < handle_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(buf - 1);
+
+ if (i == 0)
+ first = bufHdr->tag;
+ else
+ {
+ Assert(bufHdr->tag.relNumber == first.relNumber);
+ Assert(bufHdr->tag.blockNum == first.blockNum + i);
+ }
+
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(buf_state & BM_TAG_VALID);
+ if (is_write)
+ {
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_DIRTY);
+ }
+ else
+ Assert(!(buf_state & BM_VALID));
+
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
+
+ buf_state += BUF_REFCOUNT_ONE;
+ bufHdr->io_wref = io_ref;
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ if (is_write)
+ {
+ LWLock *content_lock;
+
+ content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+ Assert(LWLockHeldByMe(content_lock));
+
+ /*
+ * Lock is now owned by AIO subsystem.
+ */
+ LWLockDisown(content_lock);
+ RESUME_INTERRUPTS();
+ }
+
+ /*
+ * Stop tracking this buffer via the resowner - the AIO system now
+ * keeps track.
+ */
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner, buf);
+ }
+}
+
+static void
+shared_buffer_readv_stage(PgAioHandle *ioh)
+{
+ shared_buffer_stage_common(ioh, false);
+}
+
+static PgAioResult
+shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_target_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 handle_data_len;
+
+ ereport(DEBUG5,
+ errmsg("%s: %d %d", __func__, prior_result.status, prior_result.result),
+ errhidestmt(true), errhidecontext(true));
+
+ io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
+
+ for (int io_data_off = 0; io_data_off < handle_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ ereport(DEBUG5,
+ errmsg("calling rbcrs for buf %d with failed %d, status: %d, result: %d, data_off: %d",
+ buf, failed, prior_result.status, prior_result.result, io_data_off),
+ errhidestmt(true), errhidecontext(true));
+
+ /*
+ * XXX: It might be better to not set BM_IO_ERROR (which is what
+ * failed = true leads to) when it's just a short read...
+ */
+ buf_failed = ReadBufferCompleteReadShared(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_SHARED_BUFFER_READV;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+buffer_readv_report(PgAioResult result, const PgAioTargetData *target_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+ ProcNumber errProc;
+
+ if (target_data->smgr.is_temp)
+ errProc = MyProcNumber;
+ else
+ errProc = INVALID_PROC_NUMBER;
+
+ /*
+ * AFIXME: need infrastructure to allow memory allocation for error
+ * reporting
+ */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ target_data->smgr.blockNum + result.error_data,
+ relpathbackend(target_data->smgr.rlocator, errProc, target_data->smgr.forkNum)
+ )
+ );
+ MemoryContextSwitchTo(oldContext);
+}
+
+/*
+ * Helper to stage a read on local buffers for execution.
+ */
+static void
+local_buffer_readv_stage(PgAioHandle *ioh)
+{
+ uint64 *io_data;
+ uint8 handle_data_len;
+ PgAioWaitRef io_wref;
+
+ io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
+
+ pgaio_io_get_wref(ioh, &io_wref);
+
+ for (int i = 0; i < handle_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetLocalBufferDescriptor(-buf - 1);
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ bufHdr->io_wref = io_wref;
+ LocalRefCount[-buf - 1] += 1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
+static PgAioResult
+local_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ PgAioTargetData *td = pgaio_io_get_target_data(ioh);
+ int mode = td->smgr.mode;
+ uint64 *io_data;
+ uint8 handle_data_len;
+
+ Assert(td->smgr.is_temp);
+ Assert(pgaio_io_get_owner(ioh) == MyProcNumber);
+
+ ereport(DEBUG5,
+ errmsg("%s: %d %d", __func__, prior_result.status, prior_result.result),
+ errhidestmt(true), errhidecontext(true));
+
+ io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
+
+ for (int io_data_off = 0; io_data_off < handle_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ ereport(DEBUG5,
+ errmsg("calling rbcrl for buf %d with failed %d, status: %d, result: %d, data_off: %d",
+ buf, failed, prior_result.status, prior_result.result, io_data_off),
+ errhidestmt(true), errhidecontext(true));
+
+ buf_failed = ReadBufferCompleteReadLocal(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_LOCAL_BUFFER_READV;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+
+const struct PgAioHandleCallbacks aio_shared_buffer_readv_cb = {
+ .stage = shared_buffer_readv_stage,
+ .complete_shared = shared_buffer_readv_complete,
+ .report = buffer_readv_report,
+};
+const struct PgAioHandleCallbacks aio_local_buffer_readv_cb = {
+ .stage = local_buffer_readv_stage,
+
+ /*
+ * Note that this, in contrast to the shared_buffers case, uses
+ * complete_local, as only the issuing backend has access to the required
+	 * data structures. This matters because the IO completion may be
+	 * consumed incidentally by another backend.
+ */
+ .complete_local = local_buffer_readv_complete,
+ .report = buffer_readv_report,
+};
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8f81428970b..b3805c1ff94 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "executor/instrument.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -621,6 +622,8 @@ InitLocalBuffers(void)
*/
buf->buf_id = -i - 2;
+ pgaio_wref_clear(&buf->io_wref);
+
/*
* Intentionally do not initialize the buffer's atomic variable
* (besides zeroing the underlying memory above). That way we get
@@ -837,3 +840,65 @@ AtProcExit_LocalBuffers(void)
*/
CheckForLocalBufferLeaks();
}
+
+bool
+ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *buf_hdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ buf_hdr = GetLocalBufferDescriptor(-buffer - 1);
+ blockno = buf_hdr->tag.blockNum;
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&buf_hdr->tag);
+ BlockNumber forkNum = buf_hdr->tag.forkNum;
+
+ MemoryContextSwitchTo(ErrorContext);
+
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathbackend(rlocator, MyProcNumber, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathbackend(rlocator, MyProcNumber, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ pgaio_wref_clear(&buf_hdr->io_wref);
+
+ {
+ uint32 buf_state;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ /* release pin held by IO subsystem */
+ LocalRefCount[-buffer - 1] -= 1;
+
+ return buf_failed;
+}
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0020-WIP-localbuf-Track-pincount-in-BufferDesc-as-we.patch (text/x-diff)
From 9c9745754dc88502e050b5822d90d20b517a052b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:44 -0500
Subject: [PATCH v2.3 20/30] WIP: localbuf: Track pincount in BufferDesc as
well
For AIO on temp tables the AIO subsystem needs to be able to ensure a pin on a
buffer while AIO is going on, even if the IO issuing query errors out. To do
so, track the refcount in BufferDesc.state, not just LocalRefCount.
Note that we still don't need locking: AIO completion callbacks for local
buffers are executed in the issuing session (nobody else has access to the
BufferDesc).
---
src/backend/storage/buffer/bufmgr.c | 40 ++++++++--
src/backend/storage/buffer/localbuf.c | 108 ++++++++++++++++----------
2 files changed, 101 insertions(+), 47 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 169829e8031..fe871691350 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5356,8 +5356,20 @@ ConditionalLockBufferForCleanup(Buffer buffer)
Assert(refcount > 0);
if (refcount != 1)
return false;
- /* Nobody else to wait for */
- return true;
+
+ bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ /*
+ * Check that the AIO subsystem doesn't have a pin. Likely not
+ * possible today, but better safe than sorry.
+ */
+ refcount = BUF_STATE_GET_REFCOUNT(buf_state);
+ Assert(refcount > 0);
+ if (refcount == 1)
+ return true;
+
+ return false;
}
/* There should be exactly one local pin */
@@ -5409,8 +5421,18 @@ IsBufferCleanupOK(Buffer buffer)
/* There should be exactly one pin */
if (LocalRefCount[-buffer - 1] != 1)
return false;
- /* Nobody else to wait for */
- return true;
+
+ bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ /*
+ * Check that the AIO subsystem doesn't have a pin. Likely not
+ * possible today, but better safe than sorry.
+ */
+ if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ return true;
+
+ return false;
}
/* There should be exactly one local pin */
@@ -6388,9 +6410,15 @@ local_buffer_readv_stage(PgAioHandle *ioh)
buf_state = pg_atomic_read_u32(&bufHdr->state);
bufHdr->io_wref = io_wref;
- LocalRefCount[-buf - 1] += 1;
- UnlockBufHdr(bufHdr, buf_state);
+ /*
+ * Track pin by AIO subsystem in BufferDesc, not in LocalRefCount as
+ * one might initially think. This is necessary to handle this backend
+ * erroring out while AIO is still in progress.
+ */
+ buf_state += BUF_REFCOUNT_ONE;
+
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
}
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b3805c1ff94..72c93ae15a2 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,10 +208,19 @@ GetLocalVictimBuffer(void)
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
trycounter = NLocBuffer;
}
+ else if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+ {
+ /*
+ * This can be reached if the backend initiated AIO for this
+ * buffer and then errored out.
+ */
+ }
else
{
/* Found a usable buffer */
PinLocalBuffer(bufHdr, false);
+ /* the buf_state may be modified inside PinLocalBuffer */
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
break;
}
}
@@ -476,6 +485,44 @@ MarkLocalBufferDirty(Buffer buffer)
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
}
+static void
+InvalidateLocalBuffer(BufferDesc *bufHdr)
+{
+ Buffer buffer = BufferDescriptorGetBuffer(bufHdr);
+ int bufid = -buffer - 1;
+ uint32 buf_state;
+ LocalBufferLookupEnt *hresult;
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ /*
+ * We need to test not just LocalRefCount[bufid] but also the BufferDesc
+ * itself, as the latter is used to represent a pin by the AIO subsystem.
+ * This can happen if AIO is initiated and then the query errors out.
+ */
+ if (LocalRefCount[bufid] != 0 ||
+ BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+ elog(ERROR, "block %u of %s is still referenced (local %u)",
+ bufHdr->tag.blockNum,
+ relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag),
+ MyProcNumber,
+ BufTagGetForkNum(&bufHdr->tag)),
+ LocalRefCount[bufid]);
+
+ /* Remove entry from hashtable */
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
+ if (!hresult) /* shouldn't happen */
+ elog(ERROR, "local buffer hash table corrupted");
+ /* Mark buffer invalid */
+ ClearBufferTag(&bufHdr->tag);
+
+ buf_state &= ~BUF_FLAG_MASK;
+ buf_state &= ~BUF_USAGECOUNT_MASK;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+
+}
+
/*
* DropRelationLocalBuffers
* This function removes from the buffer pool all the pages of the
@@ -496,7 +543,6 @@ DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum,
for (i = 0; i < NLocBuffer; i++)
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(i);
- LocalBufferLookupEnt *hresult;
uint32 buf_state;
buf_state = pg_atomic_read_u32(&bufHdr->state);
@@ -506,24 +552,7 @@ DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum,
BufTagGetForkNum(&bufHdr->tag) == forkNum &&
bufHdr->tag.blockNum >= firstDelBlock)
{
- if (LocalRefCount[i] != 0)
- elog(ERROR, "block %u of %s is still referenced (local %u)",
- bufHdr->tag.blockNum,
- relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag),
- MyProcNumber,
- BufTagGetForkNum(&bufHdr->tag)),
- LocalRefCount[i]);
-
- /* Remove entry from hashtable */
- hresult = (LocalBufferLookupEnt *)
- hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
- if (!hresult) /* shouldn't happen */
- elog(ERROR, "local buffer hash table corrupted");
- /* Mark buffer invalid */
- ClearBufferTag(&bufHdr->tag);
- buf_state &= ~BUF_FLAG_MASK;
- buf_state &= ~BUF_USAGECOUNT_MASK;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ InvalidateLocalBuffer(bufHdr);
}
}
}
@@ -543,7 +572,6 @@ DropRelationAllLocalBuffers(RelFileLocator rlocator)
for (i = 0; i < NLocBuffer; i++)
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(i);
- LocalBufferLookupEnt *hresult;
uint32 buf_state;
buf_state = pg_atomic_read_u32(&bufHdr->state);
@@ -551,23 +579,7 @@ DropRelationAllLocalBuffers(RelFileLocator rlocator)
if ((buf_state & BM_TAG_VALID) &&
BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator))
{
- if (LocalRefCount[i] != 0)
- elog(ERROR, "block %u of %s is still referenced (local %u)",
- bufHdr->tag.blockNum,
- relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag),
- MyProcNumber,
- BufTagGetForkNum(&bufHdr->tag)),
- LocalRefCount[i]);
- /* Remove entry from hashtable */
- hresult = (LocalBufferLookupEnt *)
- hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
- if (!hresult) /* shouldn't happen */
- elog(ERROR, "local buffer hash table corrupted");
- /* Mark buffer invalid */
- ClearBufferTag(&bufHdr->tag);
- buf_state &= ~BUF_FLAG_MASK;
- buf_state &= ~BUF_USAGECOUNT_MASK;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ InvalidateLocalBuffer(bufHdr);
}
}
}
@@ -667,12 +679,13 @@ PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
if (LocalRefCount[bufid] == 0)
{
NLocalPinnedBuffers++;
+ buf_state += BUF_REFCOUNT_ONE;
if (adjust_usagecount &&
BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
{
buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
}
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
}
LocalRefCount[bufid]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
@@ -698,7 +711,17 @@ UnpinLocalBufferNoOwner(Buffer buffer)
Assert(NLocalPinnedBuffers > 0);
if (--LocalRefCount[buffid] == 0)
+ {
+ BufferDesc *buf_hdr = GetLocalBufferDescriptor(buffid);
+ uint32 buf_state;
+
NLocalPinnedBuffers--;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ buf_state -= BUF_REFCOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
}
/*
@@ -894,11 +917,14 @@ ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed)
buf_state = pg_atomic_read_u32(&buf_hdr->state);
buf_state |= BM_VALID;
+
+ /*
+ * Release pin held by IO subsystem, see also
+	 * local_buffer_readv_stage().
+ */
+ buf_state -= BUF_REFCOUNT_ONE;
pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
}
- /* release pin held by IO subsystem */
- LocalRefCount[-buffer - 1] -= 1;
-
return buf_failed;
}
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0021-bufmgr-Use-aio-for-StartReadBuffers.patch (text/x-diff)
From b153f4c8c7cf10171dd7390920ef38e079be1c87 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:45 -0500
Subject: [PATCH v2.3 21/30] bufmgr: Use aio for StartReadBuffers()
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/bufmgr.h | 25 +-
src/backend/storage/buffer/bufmgr.c | 377 ++++++++++++++++++++--------
2 files changed, 298 insertions(+), 104 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 5cff4e223f9..46ee957e99c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -15,6 +15,7 @@
#define BUFMGR_H
#include "port/pg_iovec.h"
+#include "storage/aio_types.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -107,10 +108,18 @@ typedef struct BufferManagerRelation
#define BMR_REL(p_rel) ((BufferManagerRelation){.rel = p_rel})
#define BMR_SMGR(p_smgr, p_relpersistence) ((BufferManagerRelation){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+
+
/* Zero out page if reading fails. */
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/* IO will immediately be waited for */
+#define READ_BUFFERS_SYNCHRONOUSLY (1 << 2)
+
struct ReadBuffersOperation
{
@@ -131,6 +140,20 @@ struct ReadBuffersOperation
int flags;
int16 nblocks;
int16 io_buffers_len;
+
+ /*
+ * In some rare-ish cases one operation causes multiple reads (e.g. if a
+ * buffer was concurrently read by another backend). It'd be much better
+ * if we ensured that each ReadBuffersOperation covered only one IO - but
+ * that's not entirely trivial, due to having pinned victim buffers before
+ * starting IOs.
+ *
+ * TODO: Change the API of StartReadBuffers() to ensure we only ever need
+ * one IO.
+ */
+ int16 nios;
+ PgAioWaitRef wrefs[MAX_IO_COMBINE_LIMIT];
+ PgAioReturn returns[MAX_IO_COMBINE_LIMIT];
};
typedef struct ReadBuffersOperation ReadBuffersOperation;
@@ -161,8 +184,6 @@ extern PGDLLIMPORT bool track_io_timing;
extern PGDLLIMPORT int effective_io_concurrency;
extern PGDLLIMPORT int maintenance_io_concurrency;
-#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
-#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
extern PGDLLIMPORT int io_combine_limit;
extern PGDLLIMPORT int checkpoint_flush_after;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe871691350..70f1da84083 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1235,10 +1235,9 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+ flags = READ_BUFFERS_SYNCHRONOUSLY;
if (mode == RBM_ZERO_ON_ERROR)
- flags = READ_BUFFERS_ZERO_ON_ERROR;
- else
- flags = 0;
+ flags |= READ_BUFFERS_ZERO_ON_ERROR;
operation.smgr = smgr;
operation.rel = rel;
operation.persistence = persistence;
@@ -1253,6 +1252,9 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+static bool AsyncReadBuffers(ReadBuffersOperation *operation,
+ int nblocks);
+
static pg_attribute_always_inline bool
StartReadBuffersImpl(ReadBuffersOperation *operation,
Buffer *buffers,
@@ -1288,6 +1290,11 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
* so we stop here.
*/
actual_nblocks = i + 1;
+
+ ereport(DEBUG3,
+ errmsg("found buf at idx %i: %s",
+ i, DebugPrintBufferRefcount(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
break;
}
else
@@ -1324,28 +1331,51 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
operation->flags = flags;
operation->nblocks = actual_nblocks;
operation->io_buffers_len = io_buffers_len;
+ operation->nios = 0;
- if (flags & READ_BUFFERS_ISSUE_ADVICE)
+ /*
+ * When using AIO, start the IO in the background. If not, issue prefetch
+ * requests if desired by the caller.
+ *
+ * The reason we have a dedicated path for IOMETHOD_SYNC here is to derisk
+ * the introduction of AIO somewhat. It's a large architectural change,
+ * with lots of chances for unanticipated performance effects. Use of
+ * IOMETHOD_SYNC already leads to not actually performing IO
+ * asynchronously, but without the check here we'd execute IO earlier than
+ * we used to.
+ */
+ if (io_method != IOMETHOD_SYNC)
{
- /*
- * In theory we should only do this if PinBufferForBlock() had to
- * allocate new buffers above. That way, if two calls to
- * StartReadBuffers() were made for the same blocks before
- * WaitReadBuffers(), only the first would issue the advice. That'd be
- * a better simulation of true asynchronous I/O, which would only
- * start the I/O once, but isn't done here for simplicity. Note also
- * that the following call might actually issue two advice calls if we
- * cross a segment boundary; in a true asynchronous version we might
- * choose to process only one real I/O at a time in that case.
- */
- smgrprefetch(operation->smgr,
- operation->forknum,
- blockNum,
- operation->io_buffers_len);
+ /* initiate the IO asynchronously */
+ return AsyncReadBuffers(operation, io_buffers_len);
}
+ else
+ {
+ operation->flags |= READ_BUFFERS_SYNCHRONOUSLY;
+
+ if (flags & READ_BUFFERS_ISSUE_ADVICE)
+ {
+ /*
+ * In theory we should only do this if PinBufferForBlock() had to
+ * allocate new buffers above. That way, if two calls to
+ * StartReadBuffers() were made for the same blocks before
+ * WaitReadBuffers(), only the first would issue the advice.
+ * That'd be a better simulation of true asynchronous I/O, which
+ * would only start the I/O once, but isn't done here for
+ * simplicity. Note also that the following call might actually
+ * issue two advice calls if we cross a segment boundary; in a
+ * true asynchronous version we might choose to process only one
+ * real I/O at a time in that case.
+ */
+ smgrprefetch(operation->smgr,
+ operation->forknum,
+ blockNum,
+ operation->io_buffers_len);
+ }
- /* Indicate that WaitReadBuffers() should be called. */
- return true;
+ /* Indicate that WaitReadBuffers() should be called. */
+ return true;
+ }
}
/*
@@ -1397,12 +1427,31 @@ StartReadBuffer(ReadBuffersOperation *operation,
}
static inline bool
-WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+ReadBuffersCanStartIO(Buffer buffer, bool nowait)
{
if (BufferIsLocal(buffer))
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ /*
+ * The buffer could have IO in progress by another scan. Right now
+ * localbuf.c doesn't use IO_IN_PROGRESS, which is why we need this
+ * hack.
+ *
+ * TODO: localbuf.c should use IO_IN_PROGRESS / have an equivalent of
+ * StartBufferIO().
+ */
+ if (pgaio_wref_valid(&bufHdr->io_wref))
+ {
+ PgAioWaitRef iow = bufHdr->io_wref;
+
+ ereport(DEBUG3,
+ errmsg("waiting for temp buffer IO in CSIO"),
+ errhidestmt(true), errhidecontext(true));
+ pgaio_wref_wait(&iow);
+ return false;
+ }
+
return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
else
@@ -1412,13 +1461,38 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
void
WaitReadBuffers(ReadBuffersOperation *operation)
{
- Buffer *buffers;
+ IOContext io_context;
+ IOObject io_object;
int nblocks;
- BlockNumber blocknum;
- ForkNumber forknum;
- IOContext io_context;
- IOObject io_object;
- char persistence;
+ bool have_retryable_failure;
+
+ /*
+ * If we get here without any IO operations having been issued, the
+ * io_method == IOMETHOD_SYNC path must have been used. In that case, we
+ * start - as we used to before - the IO now, just before waiting.
+ */
+ if (operation->nios == 0)
+ {
+ Assert(io_method == IOMETHOD_SYNC);
+ if (!AsyncReadBuffers(operation, operation->io_buffers_len))
+ {
+ /* all blocks were already read in concurrently */
+ return;
+ }
+ }
+
+ if (operation->persistence == RELPERSISTENCE_TEMP)
+ {
+ io_context = IOCONTEXT_NORMAL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ io_context = IOContextForStrategy(operation->strategy);
+ io_object = IOOBJECT_RELATION;
+ }
+
+restart:
/*
* Currently operations are only allowed to include a read of some range,
@@ -1433,15 +1507,101 @@ WaitReadBuffers(ReadBuffersOperation *operation)
if (nblocks == 0)
return; /* nothing to do */
- buffers = &operation->buffers[0];
- blocknum = operation->blocknum;
- forknum = operation->forknum;
- persistence = operation->persistence;
+ Assert(operation->nios > 0);
+ /*
+ * For IO timing we just count the time spent waiting for the IO.
+ *
+ * XXX: We probably should track the IO operation, rather than its time,
+ * separately, when initiating the IO. But right now that's not quite
+ * allowed by the interface.
+ */
+ have_retryable_failure = false;
+ for (int i = 0; i < operation->nios; i++)
+ {
+ PgAioReturn *aio_ret = &operation->returns[i];
+
+ /*
+ * Tracking a wait even if we don't actually need to wait a) is not
+ * cheap b) reports some time as waiting, even if we never waited.
+ */
+ if (aio_ret->result.status == ARS_UNKNOWN &&
+ !pgaio_wref_check_done(&operation->wrefs[i]))
+ {
+ instr_time io_start = pgstat_prepare_io_time(track_io_timing);
+
+ pgaio_wref_wait(&operation->wrefs[i]);
+
+ /*
+ * The IO operation itself was already counted earlier, in
+ * AsyncReadBuffers().
+ */
+ pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
+ io_start, 0, 0);
+ }
+ else
+ {
+ Assert(pgaio_wref_check_done(&operation->wrefs[i]));
+ }
+
+ if (aio_ret->result.status == ARS_PARTIAL)
+ {
+ /*
+ * We'll retry below, so we just emit a debug message to the
+ * server log (or not even that in prod scenarios).
+ */
+ pgaio_result_report(aio_ret->result, &aio_ret->target_data, DEBUG1);
+ have_retryable_failure = true;
+ }
+ else if (aio_ret->result.status != ARS_OK)
+ pgaio_result_report(aio_ret->result, &aio_ret->target_data, ERROR);
+ }
+
+ /*
+ * If any of the associated IOs failed, try again to issue IOs. Buffers
+ * for which IO has completed successfully will be discovered as such and
+ * not retried.
+ */
+ if (have_retryable_failure)
+ {
+ nblocks = operation->io_buffers_len;
+
+ elog(DEBUG3, "retrying IO after partial failure");
+ CHECK_FOR_INTERRUPTS();
+ AsyncReadBuffers(operation, nblocks);
+ goto restart;
+ }
+
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * nblocks;
+
+ /* FIXME: READ_DONE tracepoint */
+}
+
+static bool
+AsyncReadBuffers(ReadBuffersOperation *operation,
+ int nblocks)
+{
+ int io_buffers_len = 0;
+ Buffer *buffers = &operation->buffers[0];
+ int flags = operation->flags;
+ BlockNumber blocknum = operation->blocknum;
+ ForkNumber forknum = operation->forknum;
+ IOContext io_context;
+ IOObject io_object;
+ char persistence;
+ bool did_start_io_overall = false;
+ PgAioHandle *ioh = NULL;
+ uint32 ioh_flags = 0;
+
+ persistence = operation->rel
+ ? operation->rel->rd_rel->relpersistence
+ : RELPERSISTENCE_PERMANENT;
if (persistence == RELPERSISTENCE_TEMP)
{
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
+ ioh_flags |= PGAIO_HF_REFERENCES_LOCAL;
}
else
{
@@ -1449,6 +1609,16 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_object = IOOBJECT_RELATION;
}
+ /*
+ * When this IO is executed synchronously, either because the caller will
+ * immediately block waiting for the IO or because IOMETHOD_SYNC is used,
+ * the AIO subsystem needs to know.
+ */
+ if (flags & READ_BUFFERS_SYNCHRONOUSLY)
+ ioh_flags |= PGAIO_HF_SYNCHRONOUS;
+
+ operation->nios = 0;
+
/*
* We count all these blocks as read by this backend. This is traditional
* behavior, but might turn out to be not true if we find that someone
@@ -1464,19 +1634,39 @@ WaitReadBuffers(ReadBuffersOperation *operation)
for (int i = 0; i < nblocks; ++i)
{
- int io_buffers_len;
- Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
void *io_pages[MAX_IO_COMBINE_LIMIT];
- instr_time io_start;
+ Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
BlockNumber io_first_block;
+ bool did_start_io_this = false;
/*
- * Skip this block if someone else has already completed it. If an
- * I/O is already in progress in another backend, this will wait for
- * the outcome: either done, or something went wrong and we will
- * retry.
+ * Get IO before ReadBuffersCanStartIO, as pgaio_io_acquire() might
+ * block, which we don't want after setting IO_IN_PROGRESS.
+ *
+ * XXX: Should we attribute the time spent in here to the IO? If there
+ * already are a lot of IO operations in progress, getting an IO
+ * handle will block waiting for some other IO operation to finish.
+ *
+ * In most cases it'll be free to get the IO, so a timer would be
+ * overhead. Perhaps we should use pgaio_io_acquire_nb() and only
+ * account IO time when pgaio_io_acquire_nb() returned false?
*/
- if (!WaitReadBuffersCanStartIO(buffers[i], false))
+ if (likely(!ioh))
+ ioh = pgaio_io_acquire(CurrentResourceOwner,
+ &operation->returns[operation->nios]);
+
+ /*
+ * Skip this block if someone else has already completed it.
+ *
+ * If an I/O is already in progress in another backend, this will wait
+ * for the outcome: either done, or something went wrong and we will
+ * retry. But don't wait if we have staged, but haven't issued,
+ * another IO.
+ *
+ * XXX: If we can't start IO due to unsubmitted IO, it might be worth
+ * submitting and then trying to start IO again.
+ */
+ if (!ReadBuffersCanStartIO(buffers[i], did_start_io_overall))
{
/*
* Report this as a 'hit' for this backend, even though it must
@@ -1488,6 +1678,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
operation->smgr->smgr_rlocator.locator.relNumber,
operation->smgr->smgr_rlocator.backend,
true);
+
+ ereport(DEBUG3,
+ errmsg("can't start io for first buffer %u: %s",
+ buffers[i], DebugPrintBufferRefcount(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
continue;
}
@@ -1497,6 +1692,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_first_block = blocknum + i;
io_buffers_len = 1;
+ ereport(DEBUG5,
+ errmsg("first prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(io_buffers[0]), i),
+ errhidestmt(true), errhidecontext(true));
+
/*
* How many neighboring-on-disk blocks can we scatter-read into other
* buffers at the same time? In this case we don't wait if we see an
@@ -1505,85 +1705,58 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* We'll come back to this block again, above.
*/
while ((i + 1) < nblocks &&
- WaitReadBuffersCanStartIO(buffers[i + 1], true))
+ ReadBuffersCanStartIO(buffers[i + 1], true))
{
/* Must be consecutive block numbers. */
Assert(BufferGetBlockNumber(buffers[i + 1]) ==
BufferGetBlockNumber(buffers[i]) + 1);
+ ereport(DEBUG5,
+ errmsg("seq prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(buffers[i + 1]),
+ i + 1),
+ errhidestmt(true), errhidecontext(true));
+
io_buffers[io_buffers_len] = buffers[++i];
io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
}
- io_start = pgstat_prepare_io_time(track_io_timing);
- smgrreadv(operation->smgr, forknum, io_first_block, io_pages, io_buffers_len);
- pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
- 1, io_buffers_len * BLCKSZ);
+ pgaio_io_get_wref(ioh, &operation->wrefs[operation->nios]);
- /* Verify each block we read, and terminate the I/O. */
- for (int j = 0; j < io_buffers_len; ++j)
- {
- BufferDesc *bufHdr;
- Block bufBlock;
+ pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
- if (persistence == RELPERSISTENCE_TEMP)
- {
- bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
- bufBlock = LocalBufHdrGetBlock(bufHdr);
- }
- else
- {
- bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
- bufBlock = BufHdrGetBlock(bufHdr);
- }
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- memset(bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- }
+ if (persistence == RELPERSISTENCE_TEMP)
+ pgaio_io_register_callbacks(ioh, PGAIO_HCB_LOCAL_BUFFER_READV);
+ else
+ pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV);
- /* Terminate I/O and set BM_VALID. */
- if (persistence == RELPERSISTENCE_TEMP)
- {
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+ pgaio_io_set_flag(ioh, ioh_flags);
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
- }
+ did_start_io_overall = did_start_io_this = true;
+ smgrstartreadv(ioh, operation->smgr, forknum, io_first_block,
+ io_pages, io_buffers_len);
+ ioh = NULL;
+ operation->nios++;
- /* Report I/Os as completing individually. */
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
- operation->smgr->smgr_rlocator.locator.spcOid,
- operation->smgr->smgr_rlocator.locator.dbOid,
- operation->smgr->smgr_rlocator.locator.relNumber,
- operation->smgr->smgr_rlocator.backend,
- false);
- }
+ /* not obvious what we'd use for time */
+ pgstat_count_io_op(io_object, io_context, IOOP_READ,
+ 1, io_buffers_len * BLCKSZ);
+ }
+
+ if (ioh)
+ {
+ pgaio_io_release(ioh);
+ ioh = NULL;
+ }
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ if (did_start_io_overall)
+ {
+ pgaio_submit_staged();
+ return true;
}
+ else
+ return false;
}
/*
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0022-aio-Very-WIP-read_stream.c-adjustments-for-real.patch
From b0bb4b478b27c2a38bf819ee927be9167e551d28 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:30 -0400
Subject: [PATCH v2.3 22/30] aio: Very-WIP: read_stream.c adjustments for real
AIO
Things that need to be fixed / are fixed in this:
- max pinned buffers should be limited by io_combine_limit, not * 4
- overflow distance
- pins need to be limited in more places
---
src/include/storage/bufmgr.h | 2 ++
src/backend/storage/aio/read_stream.c | 31 +++++++++++++++++++++------
src/backend/storage/buffer/bufmgr.c | 3 ++-
3 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 46ee957e99c..f205643c4ef 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -119,6 +119,8 @@ typedef struct BufferManagerRelation
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
/* IO will immediately be waited for */
#define READ_BUFFERS_SYNCHRONOUSLY (1 << 2)
+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 3)
struct ReadBuffersOperation
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index e4414b2e915..c2211cab02a 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -90,6 +90,7 @@
#include "postgres.h"
#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/smgr.h"
#include "storage/read_stream.h"
@@ -240,14 +241,18 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
/*
* If advice hasn't been suppressed, this system supports it, and this
* isn't a strictly sequential pattern, then we'll issue advice.
+ *
+ * XXX: Used to also check stream->pending_read_blocknum !=
+ * stream->seq_blocknum
*/
if (!suppress_advice &&
- stream->advice_enabled &&
- stream->pending_read_blocknum != stream->seq_blocknum)
+ stream->advice_enabled)
flags = READ_BUFFERS_ISSUE_ADVICE;
else
flags = 0;
+ flags |= READ_BUFFERS_MORE_MORE_MORE;
+
/* We say how many blocks we want to read, but may be smaller on return. */
buffer_index = stream->next_buffer_index;
io_index = stream->next_io_index;
@@ -306,6 +311,14 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
static void
read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
+ if (stream->distance > (io_combine_limit * 8))
+ {
+ if (stream->pinned_buffers + stream->pending_read_nblocks > ((stream->distance * 3) / 4))
+ {
+ return;
+ }
+ }
+
while (stream->ios_in_progress < stream->max_ios &&
stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
{
@@ -355,6 +368,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
/* And we've hit the limit. Rewind, and stop here. */
read_stream_unget_block(stream, blocknum);
+ pgaio_submit_staged();
return;
}
}
@@ -379,6 +393,8 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
stream->distance == 0) &&
stream->ios_in_progress < stream->max_ios)
read_stream_start_pending_read(stream, suppress_advice);
+
+ pgaio_submit_staged();
}
/*
@@ -442,7 +458,7 @@ read_stream_begin_impl(int flags,
* overflow (even though that's not possible with the current GUC range
* limits), allowing also for the spare entry and the overflow space.
*/
- max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+ max_pinned_buffers = Max(max_ios * io_combine_limit, io_combine_limit);
max_pinned_buffers = Min(max_pinned_buffers,
PG_INT16_MAX - io_combine_limit - 1);
@@ -493,10 +509,11 @@ read_stream_begin_impl(int flags,
* direct I/O isn't enabled, the caller hasn't promised sequential access
* (overriding our detection heuristics), and max_ios hasn't been set to
* zero.
+ *
+ * FIXME: Used to also check (io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ * (flags & READ_STREAM_SEQUENTIAL) == 0
*/
- if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
- (flags & READ_STREAM_SEQUENTIAL) == 0 &&
- max_ios > 0)
+ if (max_ios > 0)
stream->advice_enabled = true;
#endif
@@ -727,7 +744,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
if (++stream->oldest_io_index == stream->max_ios)
stream->oldest_io_index = 0;
- if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+ if (stream->ios[io_index].op.flags & (READ_BUFFERS_ISSUE_ADVICE | READ_BUFFERS_MORE_MORE_MORE))
{
/* Distance ramps up fast (behavior C). */
distance = stream->distance * 2;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 70f1da84083..118a6e1ca31 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1752,7 +1752,8 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
if (did_start_io_overall)
{
- pgaio_submit_staged();
+ if (!(flags & READ_BUFFERS_MORE_MORE_MORE))
+ pgaio_submit_staged();
return true;
}
else
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0028-Temporary-Increase-BAS_BULKREAD-size.patch
From 2dea8961fd6383afe1e457926131c2213db211f0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 1 Sep 2024 00:42:27 -0400
Subject: [PATCH v2.3 28/30] Temporary: Increase BAS_BULKREAD size
Without this we can only execute very little AIO for sequential scans, as
there are just not enough buffers in the ring. This isn't the right fix, as
just increasing the ring size can have negative performance implications in
workloads where the kernel has all the data cached.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/freelist.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 1f757d96f07..ac19fb87433 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -555,7 +555,12 @@ GetAccessStrategy(BufferAccessStrategyType btype)
return NULL;
case BAS_BULKREAD:
- ring_size_kb = 256;
+
+ /*
+ * FIXME: Temporary increase to allow large enough streaming reads
+ * to actually benefit from AIO. This needs a better solution.
+ */
+ ring_size_kb = 2 * 1024;
break;
case BAS_BULKWRITE:
ring_size_kb = 16 * 1024;
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0029-WIP-Use-MAP_POPULATE.patch
From ca1654b4d99e3565b2e14525b3409bc8c164849e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 31 Dec 2024 13:25:56 -0500
Subject: [PATCH v2.3 29/30] WIP: Use MAP_POPULATE
For benchmarking it's quite annoying that the first time memory is touched
it has completely different perf characteristics than subsequent accesses.
Using MAP_POPULATE reduces that substantially.
---
src/backend/port/sysv_shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..a700b02d5a1 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -620,7 +620,7 @@ CreateAnonymousSegment(Size *size)
allocsize += hugepagesize - (allocsize % hugepagesize);
ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
- PG_MMAP_FLAGS | mmap_flags, -1, 0);
+ PG_MMAP_FLAGS | MAP_POPULATE | mmap_flags, -1, 0);
mmap_errno = errno;
if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
--
2.48.1.76.g4e746b1a31.dirty
On Thu, Jan 23, 2025 at 5:29 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
Attached is v2.3.
There are a lot of changes - primarily renaming things based on on-list and
off-list feedback. But also some other things
[..snip]
Hi Andres, OK, so I've hastily launched an AIO v2.3 (full, 29 patches)
patchset probe run before going on a short vacation, and the results
are attached*. TL;DR: in terms of SELECTs, master vs. aioworkers looks
very solid! I was a little afraid that the additional IPC to separate
processes would put workers at a disadvantage, but that's amazingly
not true. The intention of this effort was just to see whether
committing AIO with its defaults as it stands is good enough not to
cause basic regressions for users, and to me it looks like it is
nearly finished :)). To save time I have *not* tested aio23 with
io_uring; this is just about aioworkers (the future default).
Random notes and thoughts:
1. not a single crash was observed, but those were pretty short runs
2. thoughts from my (admittedly time-limited) data analysis:
- most of the time, perf with aioworkers is identical (+/- 3%) to
master; in many cases it is much BETTER
- boosts of up to ~2.01x can be spotted even on low-end hardware like
this with fast I/O, even without io_uring (just workers)
- on seqscans on "sata" with datasets bigger than the VFS cache
("big") and without parallel workers, it looks like it's always better
- on parallel seqscans on "sata" with datasets bigger than the VFS
cache ("big") and high e_io_c with high client counts (sigh!), it
looks like there would be a big, user-noticeable regression, but to me
it's not a regression as such; we are probably issuing way too many
posix_fadvise() readaheads with diminishing returns. Just letting you
know. Not sure it is worth introducing some global limiter (a shared
aioworkers e_io_c limit); I think not. It could also have been some
maintenance noise on that I/O device, but I have no isolated SATA
RAID10 with like 8x HDDs at home to launch such a test and be
absolutely sure.
3. with aioworkers, it would be worth pointing out in the
documentation that `iotop` won't be good enough to show which PID is
doing I/O anymore. I often get questions like: who is taking most of
the I/O right now, because storage is fully saturated on a multi-use
system? Not sure whether that would require a new view or not (the
pg_aios output seems to be more of an in-memory debug view that would
have to be sampled aggressively, and pg_statio_all_tables shows the
table, but not the PID -- same for pg_stat_io). IMHO, if the docs
simply said "In order to understand which processes (PIDs) are issuing
lots of IOs, please check pg_stat_activity for *IO/AioCompletion* wait
events", that would be good enough for a start.
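A sampling approach along those lines might look like the sketch below
(my own illustration; the exact wait_event names under AIO are an
assumption and may differ in the final patchset):

```sql
-- Sample which backends are currently in IO-related waits; run this
-- repeatedly (e.g. once a second) to approximate which PIDs drive IO.
SELECT pid, backend_type, wait_event, state,
       left(query, 40) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'IO'
ORDER BY pid;
```

Aggregating such samples over a minute or so would give a rough per-PID
IO ranking without needing a new view.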
Bench machine: intentionally much smaller hardware. Azure Lsv2 L8s_v2
(1st-gen EPYC/1s4c8t, with kernel 6.10.11+bpo-cloud-amd64 and booted
with mem=12GB, which limited real usable RAM to just ~8GB to stress
I/O). liburing 2.9. Normal standard compile options were used, without
asserts (such as normal users would use). The bench had these two I/O
storage devices (with XFS) attached:
- "sata" stands for Azure's "Premium SSD LRS" mounted on /sata
(Size=255GB, Max IOPS=1100 (@ 4kB?), Max throughput=125MB/s)
- "nvme" stands for the built-in NVMe on that VM mounted on /nvme
(Size=1788GB, Max IOPS=8000 (@ 4kB?))
I'll try to see in the coming weeks whether dedicating more time is
possible (long-run tests, more write tests, maybe some basic I/O
failure-injection tests).
-J.
* = 8640 test runs, always with a restart and a VFS cache flush; took
probably 2-3 days? I had to reduce tries to 1 and limit myself to just
reads to get it running solid before I left, so as not to miss the
plane :^)
Attachments:
aio23_potential_parallel_seqscan_regression.png
��n�z��A�9�����4�����(%�5��Eq8(>}I���;p���Q�
?W�W�-�;���+�����������
��(/��v���������� �3�*���6fP�~����dx���i�Y����G$���Y�g�?�k���Cm�����\��s7���n�������������q���^Q������^V�<���w@Y���k�/����g7K��r^��?^�����������O�}�3<��##�qE�][��g��2�zg�x\���H_���s��\i�~�Sg��q��l��W�<����qp�y}����;��X�o.���;�����+C�_�_��~c=��x�Uka���C��C�{�_��;�j������d���u�eZ��]�vm]�q����/9��3VY��b��\p/7+�<|�\}�e;ppK?�~~v-W^Q������g��e��B������N�Y�
����F���w�\h��j���
��Y�@�����l�8�e�W8��g2�t��=��*.��_���>�W�7�n.+�oe��-�^��}����W���I1��Dv�O�xd�N_��H R���*�t3�����@���Q?�)#��� ���Pb� �'r��?�f �#��;�[�B�4>��������[^$�k0E`<Kv<@j�EPu��
�'���I���A�LC����~�J��E'�$7f��=s���z�&�=�������O�e|��W���7r���� �p���i-���2N[um4���q��W�7r�b|�y�P�'��]�!�x��*��K�����������Z�������������5��� �q���9m�
�U���8�>m����69�]<
��g���;.��.��6����^������K����+��wf*��+����/��U@apr6��x=��Wqe��^����>�\2��m�����2�^/k�����q�r�tu�k�����V�}��8�'�3P5�v�*:m�!��][�|dU�����G���+���vm=���`�Vxxc��Ba����1����
��k��d����������>W�������~?{����
^q��a��,�/���l|�#���M\g��g(�������������[���}������%fOw���kT�}x�N�}n<��d���|L� ��2���<�t)����y�Afa��� �c�>�.�1������=yq��������'�@ �����#`���T��>I�3�H E&�36���v��g�l����`*����~�s1w���6�ab�^zfu�n.�R�9r� W_Q����u�~��3wYgW��]Gp�V7��gC���E����������*�w�����O���u��z�5����y�*��U������V�]�/6�wg�N��������x�*�&_���=�lwy��/g��S3}�7Z�L���u��35��o����i���������%�u�n�Z~d�#U.7iD���3��bc�Z4�'�P���Q�8\�d:��
����!�\���n�����Z�`�������H���^�$�3��;�s{�'���<�5��@�d�M��1s��`�k��y=7.\���#�x����p��,�\�
�2��������qm���~�.�e��C�u?��k���>�+{�����8�S;������,r�Yo��?+�Sa#��X�G���|i���s�n���j�.���Itsj�;W�Y��Mj��j���vy4"f|�x����t�5��7���1������^�~S�?������0�qQa@W����}1j���w�����#���e�{#7�\���W����]���:�n��w���{�b)���s���5����7��V��]�������Z����YT����h���k����z�����j��������r���v�����B�����.�~.'���~����O�zJg���
MW��Y\���1#]�g��|����f+���s��xK9;��h]M�;���'E<'����q��������c<